Note

    3035_blog-editorial-ai-and-voice-program

    Document Metadata

    • title: 3035 - Blog Editorial AI and Voice Program
    • description: New umbrella program for editorial AI workflow, WhatsApp corpus ingestion, voice calibration, and governed blog production
    • status: active
    • lastUpdated: "2026-03-25 11:18 ET (America/New_York)"
    • owner: Product/Engineering

    3035 - Blog Editorial AI and Voice Program

    Goals

    • Build an AI-assisted editorial system that can source ideas, manage a content calendar, generate briefs/outlines/drafts, and support blog operations without flattening the brand voice.
    • Use WhatsApp exports as a high-signal research corpus for community language, concerns, provider/product recommendations, and visual culture while protecting member privacy.
    • Add review/authorship controls so sensitive or personal-story posts always require human signoff and can selectively use Maggie's named byline.

    Why This Program Exists

    • The current blog platform is already live and strong on publishing UX, feeds, SEO, and admin editing, but it is still a single-author Markdown workflow without:
      • source-ingestion pipelines
      • editorial-intelligence tooling
      • selective byline controls
      • signoff gates for sensitive content
      • structured voice/tone governance
    • The next body of work is large enough to justify its own umbrella program rather than squeezing it into the remaining 3001 stabilization backlog.

    Dependency-Ordered Project Sequence

    1. 3036_editorial-platform-architecture-and-repo-boundaries.md (completed)
    2. 3037_corpus-governance-redaction-and-privacy-policy.md (completed)
    3. 3038_whatsapp-intake-and-multimodal-normalization.md (active)
    4. 3039_editorial-intelligence-and-retrieval-layer.md (active)
    5. 3040_content-calendar-brief-and-draft-pipeline.md (completed)
    6. 3041_review-signoff-and-byline-controls.md (active)
    7. 3042_voice-calibration-and-model-routing.md (backlog)

    Why This Order

    • 3036 comes first because we need stable whole-repo system boundaries, usable repo organization, and editorial-platform architecture before choosing jobs, storage, schemas, or approval surfaces.
    • 3037 comes second because corpus-governance and privacy rules need to constrain ingestion design before large-scale data processing begins.
    • 3038 then builds the normalized corpus pipeline once the boundaries and redaction rules are explicit.
    • 3039 depends on the normalized corpus and turns it into retrieval-ready editorial intelligence.
    • 3040 depends on retrieval and annotations so the content calendar, briefs, and drafts are grounded rather than generic.
    • 3041 then integrates approval, byline, and publish-safety controls into the existing blog workflow.
    • 3042 comes last so model-tiering and any fine-tune decision are informed by the real tasks, corpus shape, and approval workflow instead of guesswork.
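
    The ordering rationale above can be checked mechanically. The sketch below is illustrative only: the `dependsOn` map is one reading of the bullets (the edges for 3041 and 3042 in particular are inferred from "then" and "comes last", not stated as hard dependencies), and `findOrderViolations` is a hypothetical helper, not an existing repo script.

```typescript
// Hypothetical dependency map, read off the "Why This Order" bullets.
// Keys are project ids; values are the projects each one is assumed to depend on.
const dependsOn: { [project: string]: string[] } = {
  "3036": [],
  "3037": ["3036"],
  "3038": ["3036", "3037"],
  "3039": ["3038"],
  "3040": ["3039"],
  "3041": ["3040"],
  "3042": ["3039", "3040", "3041"],
};

// The sequence exactly as listed in this document.
const listedSequence = ["3036", "3037", "3038", "3039", "3040", "3041", "3042"];

// Returns every "project -> dependency" pair where the dependency does not
// appear strictly earlier in the sequence than the project itself.
function findOrderViolations(
  sequence: string[],
  deps: { [project: string]: string[] },
): string[] {
  const position = new Map<string, number>();
  sequence.forEach((id, index) => position.set(id, index));

  const violations: string[] = [];
  for (const id of sequence) {
    const ownIndex = position.get(id)!;
    for (const dep of deps[id] ?? []) {
      const depIndex = position.get(dep);
      if (depIndex === undefined || depIndex >= ownIndex) {
        violations.push(`${id} -> ${dep}`);
      }
    }
  }
  return violations;
}

// An empty array means the listed sequence respects the assumed dependencies.
console.log(findOrderViolations(listedSequence, dependsOn));
```

    Note that this checks planning order only; it says nothing about completion status, which is tracked separately under Execution Status.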

    Execution Status

    • Current state: active
    • Phase: umbrella execution; 3036 is completed following the repo-wide architecture, organization, governance-definition, and safe-reorganization lane
    • Current downstream state:
      • 3037 is completed after collaborative walkthrough/sign-off locked the governing corpus/privacy boundary for downstream work
      • 3038 remains active, but only as a conditional source-layer follow-up lane now that the phase-1 corpus contract and persisted pilot import are stable
      • 3039 remains active and is back on the critical path: weak-signal WhatsApp topic retrieval is currently the main blocker on useful brief/draft quality, even though live model-backed brief synthesis is now verified
      • 3040 is now completed after collaborative walkthrough/sign-off approved the phase-1 candidate -> calendar -> brief -> draft delivery, live run-key closeout reporting, and verified ready_for_3041_handoff evidence on the persisted pilot run
      • 3041 remains active, but the current highest-value work is now upstream quality improvement rather than more admin workflow expansion
    • Blocked by: none

    Planning Inputs

    • Existing repo baseline:
      • DOCS/features/blog.md
      • completed blog streams 3031, 3032, and 3033
    • External planning input reviewed at kickoff:
      • March 21, 2026 exported ChatGPT planning transcript on WhatsApp chat AI training
    • Core planning direction carried forward from that intake:
      • retrieval first
      • redaction before downstream AI use
      • multimodal corpus support (text + media + screenshots + voice notes)
      • optional fine-tuning only after runtime tasks and output shape are clear

    Non-Negotiable Operating Decisions

    • Keep the raw WhatsApp exports in a restricted raw vault and do all downstream AI work from a redacted derivative corpus.
    • Preserve provider, clinic, hospital, product, medication, and service recommendations as knowledge-layer data unless the surrounding context would re-identify a member.
    • Treat this as AI-assisted redaction plus human review for high-risk items, not as fully autonomous anonymization.
    • Default blog authorship remains generic/organizational unless a post is explicitly approved for Maggie's named byline.
    • Any post with strong personal-story framing, intimate health detail, direct first-person lived-experience claims, or near-direct source quotation must require human signoff before publishing.
    • Do not start with raw-corpus fine-tuning; first reach a working retrieval-grounded editorial pipeline and only then decide whether a curated fine-tune is worth the extra cost/maintenance.
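
    As a rough illustration of "AI-assisted redaction plus human review", the sketch below replaces two kinds of direct identifiers and flags messages for a reviewer. Everything in it is an assumption for illustration: the two regular expressions are deliberately narrow, `HIGH_RISK_TERMS` is a placeholder list, and `redactMessage` is a hypothetical name. A real pass would need much broader identifier coverage and would not rely on keywords alone.

```typescript
// Illustrative patterns only; real redaction needs far broader coverage.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

// Hypothetical terms that route a message to human review.
const HIGH_RISK_TERMS = ["pregnancy loss", "diagnosis"];

interface RedactionResult {
  redactedText: string;      // safe to pass to the derivative corpus
  replacements: number;      // how many identifiers were replaced
  needsHumanReview: boolean; // true when a high-risk term appears
}

function redactMessage(rawText: string): RedactionResult {
  let replacements = 0;
  // Builds a replacer that counts each substitution it performs.
  const counting = (placeholder: string) => () => {
    replacements += 1;
    return placeholder;
  };

  // Emails go first so digits inside an address are not treated as a phone number.
  const redactedText = rawText
    .replace(EMAIL_PATTERN, counting("[EMAIL]"))
    .replace(PHONE_PATTERN, counting("[PHONE]"));

  const lowered = rawText.toLowerCase();
  const needsHumanReview = HIGH_RISK_TERMS.some((term) => lowered.includes(term));

  return { redactedText, replacements, needsHumanReview };
}

const sample = redactMessage(
  "Dr. Lee's clinic was great. Text me at +1 (555) 010-2345 or jo@example.com",
);
console.log(sample.redactedText);
```

    The provider mention ("Dr. Lee's clinic") is left intact on purpose, in line with the decision to preserve recommendations as knowledge-layer data when they are identity-safe.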

    Scope

    In Scope

    • WhatsApp export intake architecture for large text + media corpora.
    • AI-assisted redaction, pseudonymization, OCR, transcription, media annotation, and sensitivity scoring.
    • Structured editorial research corpus design:
      • thread/chunk records
      • topic and pain-point labels
      • phrase banks
      • quote candidates
      • provider/product recommendation extraction
      • visual motif and media tags
    • Retrieval-powered editorial workflows:
      • article-idea sourcing
      • content-calendar planning
      • brief generation
      • outline generation
      • draft generation with internal source traceability
    • Voice/tone governance using redacted source material and human-approved exemplars.
    • Review, approval, authorship, and signoff workflow for blog publishing.
    • Stack/model evaluation for cost-quality tradeoffs.

    Out of Scope

    • Direct training on raw, unredacted personal chat logs.
    • Fully autonomous publishing with no human review.
    • Impersonating named community members or preserving identifiable private speech as a style target.
    • Replacing the current blog platform before extending it.

    Success Criteria

    • A single approved architecture exists for raw-vault storage, redaction pipeline, structured corpus, retrieval layer, and editorial workflow orchestration.
    • One pilot WhatsApp corpus can be ingested end to end into a redacted, searchable editorial dataset.
    • The system can reliably generate:
      • grounded blog ideas
      • content calendar candidates
      • article briefs
      • first-draft posts with internal source traceability
    • Sensitive posts are always routed through explicit human review/signoff.
    • Byline policy is implemented at the data/workflow level:
      • generic byline default
      • Maggie byline only on approved posts
    • The program reaches a clear model-routing recommendation instead of defaulting to the most expensive model for every step.

    Recommended Architecture Direction

    Storage And Data Layers

    • Raw vault:
      • original WhatsApp exports and media stored outside normal app flows with restricted access
    • Redacted working corpus:
      • structured text/media records derived from the raw vault
    • Editorial intelligence layer:
      • thread summaries
      • topic clusters
      • recommendation entities
      • quote candidates
      • voice markers
      • content-angle suggestions
    • Retrieval layer:
      • searchable chunks plus editorial annotations for grounded drafting
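
    One way to picture how these layers relate is as a set of record types, with each layer derived from the one before it. The sketch below is a minimal illustration under assumed names: none of the interfaces, fields, or the `buildRetrievalChunks` helper come from the repo, and the actual schema is owned by the 3038/3039 projects.

```typescript
// All type and field names here are hypothetical, for illustration only.
interface RawVaultRef {
  vaultKey: string; // opaque pointer into the restricted raw vault
}

interface RedactedRecord {
  recordId: string;
  source: RawVaultRef; // lineage only; not copied into retrieval chunks
  kind: "text" | "image" | "screenshot" | "voice_note";
  redactedText: string; // redacted message text, OCR output, or transcript
}

interface EditorialAnnotation {
  recordId: string;
  topics: string[];
  recommendationEntities: string[];
  quoteCandidate: boolean;
}

interface RetrievalChunk {
  chunkId: string; // equals recordId, so drafts can trace back internally
  text: string;
  topics: string[];
  recommendationEntities: string[];
}

// Joins redacted records with their annotations. Records without an
// annotation still become chunks, just with empty label lists.
function buildRetrievalChunks(
  records: RedactedRecord[],
  annotations: EditorialAnnotation[],
): RetrievalChunk[] {
  const byRecord = new Map<string, EditorialAnnotation>();
  for (const annotation of annotations) {
    byRecord.set(annotation.recordId, annotation);
  }
  return records.map((record) => {
    const annotation = byRecord.get(record.recordId);
    return {
      chunkId: record.recordId,
      text: record.redactedText,
      topics: annotation?.topics ?? [],
      recommendationEntities: annotation?.recommendationEntities ?? [],
    };
  });
}
```

    In this sketch the retrieval chunk deliberately carries no raw-vault pointer; traceability runs through `chunkId` back to the redacted record.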

    Stack Recommendation

    • Start repo-native instead of jumping straight to n8n.
    • Recommended first stack:
      • TypeScript batch scripts and server-side app actions in this repo
      • Neon/Postgres for metadata, workflow state, approvals, and editorial objects
      • private object storage for raw and redacted media artifacts
      • retrieval index added only after the normalized corpus schema is stable
    • n8n can still be useful later for notifications, inbox routing, calendar syncing, or cross-tool approvals, but it should not be the first core processing engine for parsing/redaction/governance logic.
    • If orchestration complexity grows beyond simple repo-native jobs, evaluate a job/workflow layer after schema + review rules are stable rather than before.

    Model Strategy Recommendation

    • Use a tiered model mix, not one model everywhere.
    • Lower-cost models are appropriate for:
      • first-pass parsing
      • candidate extraction
      • metadata normalization
      • coarse classification
    • Higher-end models are appropriate for:
      • multimodal understanding on difficult media
      • subtle voice/tone analysis
      • sensitive-subject classification
      • high-quality brief and draft generation
      • approval-ready synthesis
    • Do not assume expensive models are required for every ingestion step; reserve them for stages where quality materially changes editorial output.
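
    The tiering above maps naturally onto a lookup table. The sketch below is a minimal illustration: the task identifiers are shorthand for the bullets above, and `TASK_TIER`, `routeTask`, and `higherEndShare` are hypothetical names. It deliberately names tiers rather than specific models, since the model-routing recommendation itself is deferred to 3042.

```typescript
type ModelTier = "lower_cost" | "higher_end";

// Task names are shorthand for the bullets in this section.
type PipelineTask =
  | "first_pass_parsing"
  | "candidate_extraction"
  | "metadata_normalization"
  | "coarse_classification"
  | "difficult_multimodal_understanding"
  | "voice_tone_analysis"
  | "sensitive_subject_classification"
  | "brief_and_draft_generation"
  | "approval_ready_synthesis";

// Record<...> forces every task to have an explicit tier at compile time.
const TASK_TIER: Record<PipelineTask, ModelTier> = {
  first_pass_parsing: "lower_cost",
  candidate_extraction: "lower_cost",
  metadata_normalization: "lower_cost",
  coarse_classification: "lower_cost",
  difficult_multimodal_understanding: "higher_end",
  voice_tone_analysis: "higher_end",
  sensitive_subject_classification: "higher_end",
  brief_and_draft_generation: "higher_end",
  approval_ready_synthesis: "higher_end",
};

function routeTask(task: PipelineTask): ModelTier {
  return TASK_TIER[task];
}

// Fraction of a planned run that would hit the higher-end tier;
// a rough proxy for where cost concentrates.
function higherEndShare(tasks: PipelineTask[]): number {
  if (tasks.length === 0) return 0;
  const higherEnd = tasks.filter((task) => routeTask(task) === "higher_end");
  return higherEnd.length / tasks.length;
}
```

    Keeping the table exhaustive means adding a new pipeline task without choosing a tier becomes a type error rather than a silent default to the most expensive option.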

    Workflow Guardrails

    Default Authorship Policy

    • Most posts stay on the generic site/organization byline.
    • Named Maggie byline is reserved for posts explicitly approved as personal/editorial voice pieces.
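
    Expressed at the data level, the default could look like the sketch below. The field names and the `resolveByline` helper are assumptions for illustration; the real shape belongs to the 3041 byline-controls work.

```typescript
type Byline = "organization" | "maggie";

// Hypothetical per-post authorship fields.
interface PostAuthorship {
  requestedByline: Byline;
  maggieBylineApproved: boolean; // set only by an explicit human approval
}

// Falls back to the generic byline unless the named byline was both
// requested and explicitly approved.
function resolveByline(post: PostAuthorship): Byline {
  if (post.requestedByline === "maggie" && post.maggieBylineApproved) {
    return "maggie";
  }
  return "organization";
}

console.log(resolveByline({ requestedByline: "maggie", maggieBylineApproved: false }));
```

    The point of the shape is that a missing or stale approval degrades to the generic byline rather than to the named one.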

    Mandatory Signoff Categories

    • Personal-story or memoir-style posts
    • Posts that read as Maggie's own lived experience
    • Posts covering fertility, pregnancy loss, mental health, relationship trauma, or other intimate/identity-heavy topics in a first-person or advisory voice
    • Posts using direct or near-direct quotes from source conversations
    • Posts whose recommendation strength or emotional framing could be interpreted as personal endorsement
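
    These categories can be treated as a publish gate. The sketch below assumes each category arrives as a boolean signal on the draft (set by a classifier, an editor, or both; that part is not specified here). `DraftSignals`, `signoffReasons`, and `canPublish` are hypothetical names.

```typescript
// One flag per mandatory signoff category listed above.
interface DraftSignals {
  personalStoryOrMemoir: boolean;
  readsAsMaggieLivedExperience: boolean;
  intimateTopicInFirstPersonOrAdvisoryVoice: boolean;
  usesDirectOrNearDirectQuotes: boolean;
  mayReadAsPersonalEndorsement: boolean;
}

// Returns the categories that apply, so the review UI can show why a
// draft was routed to signoff.
function signoffReasons(signals: DraftSignals): string[] {
  const reasons: string[] = [];
  if (signals.personalStoryOrMemoir) reasons.push("personal_story");
  if (signals.readsAsMaggieLivedExperience) reasons.push("lived_experience");
  if (signals.intimateTopicInFirstPersonOrAdvisoryVoice) reasons.push("intimate_topic");
  if (signals.usesDirectOrNearDirectQuotes) reasons.push("source_quotes");
  if (signals.mayReadAsPersonalEndorsement) reasons.push("personal_endorsement");
  return reasons;
}

// A draft with any signoff reason is blocked until a human signoff is recorded.
function canPublish(signals: DraftSignals, humanSignoffRecorded: boolean): boolean {
  return signoffReasons(signals).length === 0 || humanSignoffRecorded;
}
```

    Any single category is enough to block publishing in this sketch; there is no scoring or threshold, which matches the "always require human signoff" wording.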

    Source-Material Policy

    • Provider/product/clinic recommendations may be preserved when identity-safe.
    • Direct identifiers and quasi-identifiers for community members must be redacted, generalized, or withheld.
    • High-risk media must receive human review before it can inform editorial outputs.

    Proposed Workstreams

    1. Corpus Governance And Privacy
      • data handling rules
      • redaction standards
      • approval boundaries
      • source-permission posture
    2. WhatsApp Intake And Normalization
      • parsing exports
      • media manifests
      • OCR/transcription pipeline
      • structured record schema
    3. Editorial Intelligence Layer
      • topic extraction
      • quote bank
      • provider/product recommendation maps
      • voice and phrase bank
    4. Content Operations Layer
      • idea backlog
      • article brief generation
      • calendar planning
      • freshness/reuse controls
    5. Drafting, Review, And Authorship Controls
      • sensitive-content routing
      • signoff states
      • named-byline policy
      • publish-safe approval flow
    6. Model And Automation Optimization
      • cost/quality benchmarks
      • model routing
      • optional fine-tune decision
      • optional automation tooling expansion

    Milestones

    • Milestone 1: Program architecture, governance rules, and stack decision locked.
    • Milestone 2: Pilot WhatsApp corpus ingested into a redacted multimodal research dataset.
    • Milestone 3: Retrieval-grounded editorial tools produce blog ideas, briefs, and calendar candidates.
    • Milestone 4: Draft-generation and signoff workflow integrated with the blog publishing system.
    • Milestone 5: Byline controls, sensitive-content gates, and model-routing recommendation finalized.

    Dependencies

    • Builds on the existing blog platform documented in DOCS/features/blog.md.
    • Needs a small carry-forward maintenance lane from 3008/3013 so governance and walkthrough debt do not accumulate while this program becomes the lead focus.

    Risks

    • Privacy and trust risk if redaction is treated as fully solved by AI without human review.
    • Voice quality risk if the system overfits to private conversati

    ...[truncated for intake]

    Provenance