A fully autonomous, 5-phase LangGraph pipeline that transforms any topic into a
production-ready podcast episode — complete with AI-generated dialogue, multi-speaker
TTS voice synthesis, conversational overlap mixing, and broadcast-grade audio mastering.
LangGraph · FastAPI · GPT-4 · Gemini 2.5 Pro TTS · FAISS · Sentence-Transformers · pydub + ffmpeg · AWS Lambda
The AI Podcast Generator is an autonomous system that converts any topic into a fully
produced podcast episode through a 5-phase, 30+ node LangGraph pipeline.
From web research to final MP3 export, every step is orchestrated by specialized AI agents
with no human intervention required.
End-to-end flow: Topic → Research → Plan → Script → Voice → Master → MP3
Researches the web: generates 10 diverse search queries via a ReAct agent, scrapes ~100 sources, deduplicates with FAISS
Plans the episode: creates chapter outlines, designs 2–3 speaker personas with distinct voices and speaking styles
Writes dialogue: beat-by-beat script generation with naturalness markers, fact-checking, QA review, and SSML annotation
Synthesizes voices: Gemini 2.5 Pro TTS with parallel synthesis, audio quality gates, and auto-repair
Masters audio: conversational overlap mixing, broadcast EQ/compression, loudness normalization to -16 LUFS, cold open generation, and final MP3 assembly with ID3 metadata
Zero databases, zero temp files. The entire pipeline runs in-memory within a single
AWS Lambda invocation (~210 MB RAM). This architecture decision yields 40% faster execution and
$0.0003 per podcast vs $0.0009 with vector DB alternatives.
02
5-Phase Pipeline Architecture
The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a subgraph with its own nodes, conditional routing, retry logic, and quality gates.
INPUT { topic, preferences }
        ↓
🔍 PHASE 1 — Research & Ingestion (ReAct)
        ↓
📋 PHASE 2 — Content Planning (Sequential)
        ↓
🎭 PHASE 3 — Dialogue Generation (6 Nodes)
        ↓
🎙️ PHASE 4 — Voice Synthesis (Gemini TTS)
        ↓
🎚️ PHASE 5 — Audio Post-Processing (DSP)
        ↓
OUTPUT { podcast_episode_final.mp3 }
Complete LangGraph node topology across all 5 phases
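The sequential phase handoff above can be sketched in plain Python. This is a simplified stand-in for the LangGraph StateGraph, not the production graph: each phase is a function over a shared state dict, and the phase bodies are stubs.

```python
# Simplified stand-in for the LangGraph StateGraph: five phase functions
# run in sequence over one shared state dict. Bodies are stubs; in the
# real pipeline each phase is a subgraph with its own retries and gates.

def research(state):          # Phase 1 — Research & Ingestion
    state["chunks"] = ["chunk-1", "chunk-2"]
    return state

def plan(state):              # Phase 2 — Content Planning
    state["chapters"] = ["intro", "deep-dive", "outro"]
    return state

def write_script(state):      # Phase 3 — Dialogue Generation
    state["script"] = [f"[{c}] dialogue" for c in state["chapters"]]
    return state

def synthesize(state):        # Phase 4 — Voice Synthesis
    state["wavs"] = [f"{i}.wav" for i, _ in enumerate(state["script"])]
    return state

def master(state):            # Phase 5 — Audio Post-Processing
    state["output"] = "podcast_episode_final.mp3"
    return state

PHASES = [research, plan, write_script, synthesize, master]

def run_pipeline(topic):
    state = {"topic": topic}
    for phase in PHASES:
        state = phase(state)  # each subgraph owns its own retry/QA logic
    return state

result = run_pipeline("quantum computing")
```

Because every phase reads and writes the same typed state, any phase can be rerun in isolation when its quality gate rejects the output.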
03
Phase 1 — Research & Ingestion
Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for script generation.
Query Producer Agent
A ReAct-style agent with 4 tools generates 10 diverse search queries using a freshness-aware routing strategy:
Freshness Classifier: determines if the topic is "recent" (needs live data) or "evergreen" (LLM-sufficient)
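The deduplication step this corpus passes through can be sketched as follows. Production uses Sentence-Transformers 384-dim embeddings with an in-memory FAISS index; here a toy `embed()` stand-in and a direct cosine computation keep the sketch dependency-free. The threshold value is illustrative.

```python
import math

# Toy sketch of semantic deduplication: keep a chunk only if it is not a
# near-duplicate of one already kept. Production swaps embed() for
# Sentence-Transformers and the pairwise loop for a FAISS index.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedupe(chunks, embed, threshold=0.9):
    kept, kept_vecs = [], []
    for chunk in chunks:
        vec = embed(chunk)
        if all(cosine(vec, v) < threshold for v in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(vec)
    return kept

# Stand-in embedder: "a'" is nearly identical to "a", "b" is distinct.
fake_vectors = {"a": [1.0, 0.0], "a'": [0.99, 0.05], "b": [0.0, 1.0]}
unique = dedupe(["a", "a'", "b"], embed=lambda c: fake_vectors[c])
```

With cosine similarity above 0.9 treated as a duplicate, the near-copy is dropped and only the two distinct chunks survive.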
04
Phase 2 — Content Planning
Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.
Chapter Planner
Creates a multi-chapter episode structure with acts (setup → conflict → resolution), energy arcs, key points per chapter, source chunk assignments, and transition hooks between chapters.
Character Designer Agent
Designs 2–3 speaker personas per episode with detailed attributes:
Persona schema: name, role (host/expert/skeptic), speaking style, vocabulary level, catchphrases
Voice bank: 10 curated Gemini 2.5 Pro TTS voices mapped to persona archetypes
Single LLM call: all personas generated in one structured output call — no iterative refinement needed
Chapter context: personas are designed with awareness of the episode's topic and chapter structure
Why single-call persona design? Multi-turn persona refinement risks
personality drift between iterations. A single structured output call produces internally
consistent characters with complementary dynamics (e.g., curious host vs. authoritative expert).
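The persona schema described above can be sketched as a typed record. Field names follow the schema bullet list; the dataclass, example values, and `voice_id` default are illustrative assumptions, standing in for the structured output the single LLM call returns.

```python
from dataclasses import dataclass, field

# Hedged sketch of the persona schema (name, role, speaking style,
# vocabulary level, catchphrases). In production all 2-3 personas are
# returned together from one structured LLM output call.

@dataclass
class SpeakerPersona:
    name: str
    role: str                      # "host" | "expert" | "skeptic"
    speaking_style: str
    vocabulary_level: str
    catchphrases: list = field(default_factory=list)
    voice_id: str = ""             # later mapped from the 10-voice TTS bank

host = SpeakerPersona(
    name="Ava",
    role="host",
    speaking_style="curious, upbeat",
    vocabulary_level="general-audience",
    catchphrases=["So here's the wild part..."],
)
```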
05
Phase 3 — Dialogue Generation
The creative heart of the pipeline. Phase 3 is a 6-node LangGraph subgraph that transforms chapter outlines into SSML-annotated dialogue scripts.
Node             | Role                | Key Behavior
Dialogue Engine  | Script writer       | Beat-by-beat generation (5 beats/chapter) with context continuity
Expert Expander  | Content enrichment  | Expands expert utterances with detailed explanations while maintaining conversational flow
                 |                     | Validates claims against source chunks using grounding + semantic similarity
QA Reviewer      | Quality gate        | Scores engagement, repetition, clarity, transitions, and energy arc compliance
SSML Annotator   | TTS preparation     | Converts naturalness markers into Google Cloud TTS-compatible SSML with prosody control
Beat-by-beat, not monolithic. Each chapter is divided into 5 narrative beats
generated sequentially. This prevents context window overflow, maintains energy arc control,
and allows the QA reviewer to catch issues at a granular level before they propagate.
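The beat loop can be sketched like this. `generate_beat()` stands in for the Dialogue Engine LLM call, and the 2-beat context window is an assumed value; the source only specifies 5 beats per chapter and sequential generation with continuity.

```python
# Sketch of beat-by-beat generation: each chapter is produced as 5
# sequential beats, each seeded with the tail of the running script so
# continuity survives without resending the whole transcript.

BEATS_PER_CHAPTER = 5
CONTEXT_WINDOW = 2          # assumed: trailing beats passed as context

def generate_beat(chapter, beat_idx, context):
    # Stand-in for the Dialogue Engine LLM call.
    return f"{chapter}:beat{beat_idx} (cont. {len(context)} prior beats)"

def write_chapter(chapter):
    beats = []
    for i in range(BEATS_PER_CHAPTER):
        context = beats[-CONTEXT_WINDOW:]       # rolling context, not all
        beat = generate_beat(chapter, i, context)
        # A QA gate here can reject and regenerate a single beat before
        # the issue propagates into later beats.
        beats.append(beat)
    return beats

script = write_chapter("intro")
```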
06
Phase 4 — Voice Synthesis
Phase 4 converts SSML-annotated scripts into per-utterance WAV files using Gemini 2.5 Pro TTS as the primary engine, implemented as an 8-step LangGraph subgraph.
Voice consistency is non-negotiable. Voice identity is fixed per speaker for the
entire episode. Any fallback voice must be pre-mapped by similarity profile and logged.
Listeners detect voice changes instantly — even subtle ones break immersion.
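A minimal sketch of that policy, assuming a static fallback table keyed by voice ID. The voice names are hypothetical placeholders, not real Gemini voice IDs, and `synthesis_ok` stands in for the audio quality gate.

```python
# Sketch of fixed voice assignment with pre-mapped fallbacks: one voice
# per speaker for the whole episode; on synthesis failure the pipeline
# routes to a similarity-matched fallback and logs the switch, rather
# than picking a replacement ad hoc. Voice names are placeholders.

VOICE_FALLBACKS = {
    "voice_warm_low": "voice_warm_low_alt",
    "voice_bright_mid": "voice_bright_mid_alt",
}

def assign_voice(speaker_voices, speaker, synthesis_ok):
    primary = speaker_voices[speaker]
    if synthesis_ok(primary):
        return primary
    fallback = VOICE_FALLBACKS[primary]   # pre-mapped by similarity profile
    print(f"voice fallback: {speaker}: {primary} -> {fallback}")
    return fallback

voices = {"host": "voice_warm_low", "expert": "voice_bright_mid"}
chosen = assign_voice(voices, "host", synthesis_ok=lambda v: False)
```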
07
Phase 5 — Audio Post-Processing
Phase 5 is almost entirely pure audio DSP — no LLMs except one lightweight script-scan call. It transforms raw utterance WAVs into a broadcast-ready podcast episode.
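One of those DSP steps, loudness normalization toward the -16 target mentioned earlier, reduces to a gain computation. True LUFS measurement (ITU-R BS.1770) requires K-weighted gating; this simplified sketch uses plain RMS dBFS to show the shape of the math, not the production measurement.

```python
import math

# Simplified loudness-normalization sketch: measure the clip's level,
# compute the dB gap to the target, convert to a linear gain, apply it.
# Plain RMS dBFS stands in for true K-weighted LUFS here.

TARGET_DB = -16.0

def rms_dbfs(samples):
    # Samples are floats normalized to [-1, 1].
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

def normalize(samples, target_db=TARGET_DB):
    gain_db = target_db - rms_dbfs(samples)   # how far off target we are
    gain = 10 ** (gain_db / 20)               # dB -> linear amplitude
    return [s * gain for s in samples]

quiet = [0.01, -0.01, 0.01, -0.01]            # about -40 dBFS
louder = normalize(quiet)
```

After the gain is applied, re-measuring the clip lands it exactly on the target level.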
🔊
Audio Overlap Engine
Mixes clips with conversational overlaps using INTERRUPT/BACKCHANNEL/LAUGH timing directives. Adds 50–100ms cross-fades between speaker turns. Backchannels mixed at -8dB.
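The overlap engine's placement math can be sketched as follows. The per-directive overlap offsets are illustrative assumptions; the -8 dB backchannel attenuation comes from the description above.

```python
# Sketch of overlap placement: a timing directive decides how far a clip
# slides back over the previous turn, and backchannels are attenuated by
# -8 dB before mixing. Offset values in ms are illustrative assumptions.

OVERLAP_MS = {"INTERRUPT": 300, "BACKCHANNEL": 150, "LAUGH": 200, None: 0}
BACKCHANNEL_GAIN = 10 ** (-8 / 20)   # -8 dB as a linear amplitude factor

def place_clip(cursor_ms, duration_ms, directive=None):
    # cursor_ms is where the previous turn ends on the timeline.
    start = max(0, cursor_ms - OVERLAP_MS[directive])
    gain = BACKCHANNEL_GAIN if directive == "BACKCHANNEL" else 1.0
    return start, start + duration_ms, gain

# An interrupting turn starts 300 ms before the previous turn finishes.
start, end, gain = place_clip(cursor_ms=5000, duration_ms=2000,
                              directive="INTERRUPT")
```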
Cold Open Generation
An LLM (Claude Haiku) scans the full script to identify the most compelling 15–30s moment, then extracts the corresponding audio slice as the episode hook.
📦
Chapter Stitcher
Assembles cold open + intro music + all chapters + host outro + outro music into final MP3 with ID3 metadata (title, artist, artwork, chapter markers).
08
In-Memory Architecture
A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.
⚡
Performance
40% faster than Bedrock vector DB alternative. Sub-second processing latency with zero I/O overhead.
💰
Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒
Multi-Tenant Safety
Each Lambda invocation is isolated. No shared state, no cross-tenant data leakage possible by design.
📐
Memory Budget
~210 MB peak usage: models (~150 MB) + embeddings (~1.2 MB) + FAISS index (~0.6 MB) + chunks (~50 MB). Well under the function's 2 GB allocation.
Why not a vector database? Each podcast generation is a one-shot operation —
embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval,
no persistence requirement, and no cross-session reuse. A vector DB would add latency, cost,
and operational complexity with zero benefit.
09
Tech Stack
Agent Framework — LangGraph: Stateful graph with conditional edges, subgraphs per phase, checkpoint persistence
LLM — OpenAI GPT-4: Dialogue generation, query production, character design, content planning
TTS Engine — Gemini 2.5 Pro: Multi-voice synthesis with SSML prosody, 10-voice bank, parallel execution
Cold Open LLM — Claude Haiku: Lightweight script scan for compelling excerpt selection (single call)
API — FastAPI: Async REST endpoints for job submission and status tracking
Embeddings — Sentence-Transformers: 384-dim embeddings for semantic deduplication and relevance scoring
Vector Index — FAISS: In-memory cosine similarity search for chunk deduplication
Clean text extraction from web pages, boilerplate removal
Search — Google Custom Search: SERP results for research phase query execution
Runtime — AWS Lambda: Serverless, 2 GB RAM, 5-min timeout, isolated per invocation
Validation — Pydantic v2: Typed state models, inter-phase contracts, auto-truncation validators
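The auto-truncation validators mentioned for the state models can be sketched without the Pydantic dependency. A dataclass `__post_init__` stands in for a Pydantic v2 field validator here, and `MAX_CHUNKS` is an assumed limit, not the production value.

```python
from dataclasses import dataclass

# Sketch of an auto-truncation validator: oversized lists are clipped to
# a hard cap instead of failing the run. Production uses Pydantic v2
# field validators; a dataclass __post_init__ stands in here.

MAX_CHUNKS = 3   # assumed cap for illustration

@dataclass
class ResearchState:
    topic: str
    chunks: list

    def __post_init__(self):
        # Truncate rather than raise: downstream phases have a hard cap
        # on how many chunks they can consume.
        if len(self.chunks) > MAX_CHUNKS:
            self.chunks = self.chunks[:MAX_CHUNKS]

state = ResearchState(topic="ai", chunks=["c1", "c2", "c3", "c4"])
```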
10
Design Decisions
Eight core principles that shaped the architecture of the podcast generation pipeline.
Principle 01
Phase-as-Subgraph Isolation
Each phase is a self-contained LangGraph subgraph with its own state, retry logic, and quality gates. Phases communicate through strongly-typed contracts — never shared globals.
Principle 02
In-Memory Everything
No databases, no temp files, no external vector stores. All intermediate state lives in Python objects within a single Lambda invocation. Simpler, faster, cheaper.
Principle 03
Beat-by-Beat Generation
Dialogue is generated in 5 narrative beats per chapter, not monolithically. This prevents context overflow, maintains energy arc control, and enables granular QA.
Principle 04
Naturalness Markers as First-Class Data
7 marker types ([FILLER], [PAUSE], [LAUGH], etc.) flow through the entire pipeline — from script to SSML to timing directives to audio overlap. They are data, not decoration.
Principle 05
Voice Consistency > Voice Quality
A speaker's voice is fixed for the entire episode. Fallback voices are pre-mapped by similarity. Listeners detect inconsistency faster than low quality.
Principle 06
DSP over LLMs for Audio
Phase 5 uses zero LLMs for audio processing (except one Haiku call for cold open selection). Mastering, mixing, and normalization are deterministic DSP operations.
Principle 07
Freshness-Aware Research
The query producer classifies topics as "recent" or "evergreen" and routes through different research paths. Recent topics get date-tagged queries and live web data.
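The routing decision can be sketched like this. `classify()` is a keyword stand-in for the LLM classifier, and the marker list and query templates are illustrative assumptions.

```python
from datetime import date

# Sketch of freshness-aware routing: classify the topic, then branch
# into different query-building paths. classify() is a keyword stand-in
# for the LLM classifier; markers and templates are illustrative.

RECENT_MARKERS = ("latest", "2025", "this week", "breaking")

def classify(topic):
    t = topic.lower()
    return "recent" if any(m in t for m in RECENT_MARKERS) else "evergreen"

def build_queries(topic):
    if classify(topic) == "recent":
        # Recent topics get date-tagged queries against live web data.
        return [f"{topic} {date.today().year}", f"{topic} news"]
    # Evergreen topics can lean on explanatory, timeless sources.
    return [f"{topic} explained", f"{topic} overview"]

queries = build_queries("latest GPU architectures")
```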
Principle 08
Contract-Driven Phase Handoffs
Each phase validates its input contract before execution begins. If Phase 3's output doesn't match Phase 4's expected schema, the pipeline fails fast with a clear error — not silently.
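A minimal fail-fast handoff check might look like this. The required keys are illustrative, not the production schema; in the real pipeline this role is played by Pydantic model validation.

```python
# Sketch of a contract-driven handoff: before a phase runs, validate the
# previous phase's payload against the expected schema and fail fast
# with a clear error. Required keys here are illustrative only.

PHASE4_CONTRACT = {"script": list, "personas": dict}

def validate_handoff(payload, contract, phase="Phase 4"):
    for key, expected in contract.items():
        if key not in payload:
            raise ValueError(f"{phase} contract violation: missing '{key}'")
        if not isinstance(payload[key], expected):
            raise ValueError(
                f"{phase} contract violation: '{key}' must be "
                f"{expected.__name__}, got {type(payload[key]).__name__}"
            )
    return payload

ok = validate_handoff(
    {"script": ["line"], "personas": {"host": "Ava"}}, PHASE4_CONTRACT
)
```

A malformed payload surfaces immediately at the phase boundary with the offending key named, instead of failing deep inside voice synthesis.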
Hear It In Action
Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.