An autonomous, multi-speaker podcast creation system powered by LangGraph agent pipelines.
Given any topic, the pipeline researches the web, plans chapters, designs characters, generates
naturalistic dialogue, synthesises multi-voice audio, and delivers a finished podcast — with no
human in the loop.
Podcast Creator is a fully autonomous, agentic pipeline that transforms any text topic into a
broadcast-quality, multi-speaker podcast. It is architected as a sequence of five independent
LangGraph subgraphs, each responsible for a distinct phase of production —
from internet research all the way through to final audio post-processing.
The system was designed with two core engineering goals: composability
(each phase is a self-contained, testable unit) and serverless-first performance
(all intermediate state is held in memory, eliminating I/O overhead and enabling zero-config
horizontal scaling on AWS Lambda).
End-to-end flow
Topic → Research → Plan → Dialogue → TTS → Master → MP3
Researches the web: generates 10 diverse search queries, scrapes ~100 sources, deduplicates with FAISS + Sentence-Transformers
Plans the episode: creates chapter outlines with acts and energy arcs, designs 2–3 distinct speaker personas
Writes dialogue: generates naturalistic multi-speaker scripts with fact-checking, QA review, and SSML annotation
Synthesises voices: TTS router maps each speaker line to the correct voice provider and generates audio segments
Masters audio: overlap mixing, EQ, compression, loudness normalisation, and final MP3 assembly
Pure in-memory processing. The entire pipeline runs in RAM within a single
AWS Lambda invocation (~210 MB peak). No databases, no temp files — zero cleanup required.
The result: ~40% faster execution than a managed vector-store approach, at $0.0003 per podcast.
02. 5-Phase Pipeline Architecture
The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a compiled subgraph with its own state, nodes, and quality gates. Phases communicate through strongly-typed Pydantic contracts.
Modular subgraph design. Each phase follows the same pattern: a
TypedDict state, a set of nodes, and a compiled StateGraph.
This makes every phase independently importable, testable, and extensible. Adding Phase 6
requires zero changes to existing phases.
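The per-phase pattern can be sketched in plain Python. This is a minimal illustration only: the hand-rolled `compile_phase` runner stands in for LangGraph's compiled StateGraph, and all state keys and node names are hypothetical, not the project's actual contract.

```python
from typing import Callable, List, TypedDict

# Hypothetical per-phase state contract, mirroring the TypedDict-state
# pattern described above. Field names are illustrative.
class ResearchState(TypedDict):
    topic: str
    queries: List[str]
    ranked_chunks: List[dict]

def produce_queries(state: ResearchState) -> ResearchState:
    # Stand-in for the Query Producer Agent: derive search queries from the topic.
    state["queries"] = [f"{state['topic']} overview", f"{state['topic']} 2025 news"]
    return state

def rank_chunks(state: ResearchState) -> ResearchState:
    # Stand-in for dedup + relevance scoring: emit ranked chunks for Phase 2.
    state["ranked_chunks"] = [{"text": q, "score": 1.0} for q in state["queries"]]
    return state

def compile_phase(nodes: List[Callable[[ResearchState], ResearchState]]):
    """Sequential runner standing in for StateGraph.compile()."""
    def run(state: ResearchState) -> ResearchState:
        for node in nodes:
            state = node(state)
        return state
    return run

phase1 = compile_phase([produce_queries, rank_chunks])
result = phase1({"topic": "humanoid robots", "queries": [], "ranked_chunks": []})
```

Because each phase exposes only a typed state in and a typed state out, a new phase slots in by consuming the previous phase's output keys.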
03. Phase 1 — Research & Ingestion ✅ Complete
Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for planning. It consists of three sequential agents.
Query Producer Agent
Classifies the topic's freshness and generates 10 diverse search queries:
Freshness classification: determines if the topic is "recent" (needs live web data) or "evergreen" (LLM-sufficient)
A search-and-scrape agent then executes Google Custom Search per query and scrapes full text from the result URLs, using Requests + BeautifulSoup4 for HTML extraction with boilerplate removal.
Deduplication & Relevance Scorer
~100 scraped documents are processed entirely in memory through a 5-step pipeline:
# In-memory processing — no databases, no files
merged_text (str)
→ chunks (List[Dict]) # ~200 items, 500-word segments
→ embeddings (np.ndarray) # Shape: (200, 384)
→ faiss_index (IndexFlatIP) # Cosine similarity dedup
→ unique_chunks (List[Dict]) # ~100-150 after dedup
→ ranked_chunks (List[Dict]) # Top 60 by cross-encoder
→ state["ranked_chunks"] # Passed to Phase 2
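The dedup step above can be sketched with NumPy in place of FAISS: with L2-normalised rows, the inner product that IndexFlatIP computes is exactly cosine similarity. The 0.95 threshold and toy vectors below are illustrative, not the project's actual settings.

```python
import numpy as np

def dedup_by_cosine(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    """Keep the first occurrence of each near-duplicate cluster.

    NumPy inner products on L2-normalised rows stand in for FAISS's
    IndexFlatIP; on unit vectors, inner product == cosine similarity.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    keep = []
    for i, vec in enumerate(normed):
        # Compare only against chunks already kept.
        if not keep or (normed[keep] @ vec).max() < threshold:
            keep.append(i)
    return keep

# Toy example: rows 0 and 1 are near-identical, row 2 is distinct.
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
unique_idx = dedup_by_cosine(emb)
```

The real pipeline does the same comparison against a FAISS index so it stays fast at ~200 chunks × 384 dimensions.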
04. Phase 2 — Content Planning 🔄 In Progress
Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.
📋 Chapter Planner
Outlines N chapters with titles, talking points, source chunk assignments, and transition hooks. Creates narrative acts (setup → conflict → resolution) with energy arcs.
🎭 Character Designer
Designs 2–3 distinct speaker personas per episode — name, role (host/expert/skeptic), speaking style, vocabulary level, and catchphrases — in a single structured LLM call.
Why single-call persona design? Multi-turn refinement risks personality drift.
A single structured output call produces internally consistent characters with complementary
dynamics (e.g., curious host vs. authoritative expert) that remain coherent across all chapters.
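The single-call approach can be sketched as follows, with the LLM response stubbed as a JSON literal. The persona names and field values are invented for illustration; the schema fields mirror the list above.

```python
import json
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    role: str            # host / expert / skeptic
    speaking_style: str
    catchphrases: list

# Stubbed structured-output response. In the real pipeline this JSON would
# come back from one schema-constrained LLM call, so both personas are
# generated together and stay internally consistent.
llm_response = '''
[
  {"name": "Maya", "role": "host", "speaking_style": "curious, fast-paced",
   "catchphrases": ["Okay, walk me through this."]},
  {"name": "Dr. Cole", "role": "expert", "speaking_style": "measured, precise",
   "catchphrases": ["The data tells a different story."]}
]
'''

personas = [Persona(**p) for p in json.loads(llm_response)]
```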
05. Phase 3 — Dialogue Generation 🔄 In Progress
The creative heart of the pipeline. Phase 3 transforms chapter outlines into SSML-annotated dialogue scripts through a 5-agent subgraph.
Agent | Role | Key Behavior
Dialogue Engine | Script writer | Generates raw multi-speaker dialogue per chapter, maintaining character voice consistency
Naturalness Injector | Human-like speech | Injects filler words, pauses, interruptions, reactions, and laughter to simulate authentic conversation
SSML Annotator | TTS preparation | Converts naturalness markers into TTS-compatible SSML with prosody control (rate, pitch, emphasis)
Fact Checker | Verification | Validates factual claims in the script against the Phase 1 source chunks
QA Reviewer | Quality gate | Scores engagement, repetition, clarity, transitions, and energy arc compliance before handoff
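The SSML annotation step can be sketched as a marker-to-tag rewrite. The `[pause:…]` and `[emphasis]` marker syntax here is hypothetical — the real annotator's marker vocabulary isn't documented above — but the transformation shape is the same.

```python
import re

def annotate_ssml(line: str) -> str:
    """Rewrite hypothetical naturalness markers into SSML tags."""
    # [pause:300ms] -> <break time="300ms"/>
    line = re.sub(r"\[pause:(\d+ms)\]", r'<break time="\1"/>', line)
    # [emphasis]word[/emphasis] -> <emphasis level="strong">word</emphasis>
    line = re.sub(r"\[emphasis\](.+?)\[/emphasis\]",
                  r'<emphasis level="strong">\1</emphasis>', line)
    return f"<speak>{line}</speak>"

raw = "Well... [pause:300ms] that's [emphasis]exactly[/emphasis] the point."
ssml = annotate_ssml(raw)
```

Keeping naturalness markers separate from SSML lets the QA Reviewer score the human-readable script before TTS-specific markup is introduced.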
06. Phase 4 — Voice Synthesis 🔄 In Progress
Phase 4 converts SSML-annotated scripts into per-speaker audio segments using a configurable TTS router that supports multiple voice providers.
🎙️ TTS Router Agent
Routes each script line to the correct TTS voice based on the speaker-to-voice map established in Phase 2. Supports multiple providers via a provider-agnostic interface.
🔊 Voice Consistency
Each speaker's voice is fixed for the entire episode. Voice identity is established once and never changes — listeners detect inconsistency instantly.
⚡ Parallel Synthesis
Audio segments are synthesised with bounded concurrency and exponential backoff on rate-limit errors, maximising throughput while respecting API quotas.
📦 Chapter Manifest
Builds an ordered clip registry with durations, speaker metadata, and timing directives that Phase 5 uses for stitching and overlap mixing.
07. Phase 5 — Audio Post-Processing 🔄 In Progress
Phase 5 transforms raw utterance WAVs into a broadcast-ready podcast episode using pydub + ffmpeg for all DSP operations.
🔊 Overlap Engine
Blends speech transitions from INTERRUPT/BACKCHANNEL markers and adds cross-fades between speaker turns. Backchannels are mixed at lower volume to simulate natural conversation dynamics.
🎛️ Audio Post-Processor
Professional mastering chain: noise gate, EQ (presence boost, sub-bass cut), 2:1 compression, loudness normalisation to -16 LUFS (Spotify/Apple Podcasts standard).
🎬 Cold Open Generator
Identifies the most compelling 15–30s moment in the episode and extracts the corresponding audio slice as the episode hook that plays before the main content.
📦 Chapter Stitcher
Assembles cold open + all chapters + outro into final MP3 with ID3 metadata — title, artist, artwork, and chapter markers compatible with major podcast apps.
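The overlap engine's cross-fade can be sketched on raw samples. Plain Python floats stand in for pydub AudioSegments here, and the linear fade curve is illustrative (production mixes often use equal-power curves).

```python
def crossfade(a, b, overlap):
    """Linear cross-fade between two sample lists.

    The last `overlap` samples of `a` fade out while the first `overlap`
    samples of `b` fade in; the two ramps are summed in the overlap region.
    """
    out = a[:-overlap] if overlap else list(a)
    for i in range(overlap):
        fade_out = a[len(a) - overlap + i] * (1 - (i + 1) / overlap)
        fade_in = b[i] * ((i + 1) / overlap)
        out.append(fade_out + fade_in)
    out.extend(b[overlap:])
    return out

# Constant-amplitude stand-ins for two speaker turns.
turn_a = [1.0] * 6
turn_b = [0.5] * 6
mixed = crossfade(turn_a, turn_b, overlap=2)
```

The result is len(a) + len(b) − overlap samples long, which is why the Phase 4 manifest must carry per-clip durations: the stitcher needs them to compute final chapter timings after overlaps are applied.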
A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.
⚡ Performance
~4 sec execution time. 40% faster than the Bedrock vector-DB alternative. Zero I/O overhead.
💰 Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒 Multi-Tenant Safety
Each Lambda invocation runs in an isolated container. No shared state, no cross-tenant data leakage possible by design.
📐 Memory Budget
~210 MB peak: Sentence-Transformer model (~120 MB) + cross-encoder (~80 MB) + FAISS index + chunks. Well under the function's 2 GB memory allocation.
Approach | Execution Time | Lambda Cost / podcast | Complexity
In-Memory (chosen) | ~4 sec | $0.0003 | Minimal
Temp File Storage | ~6 sec | $0.0005 | Medium
AWS Bedrock Vector DB | ~10 sec | $0.0009 | High
Why not a vector database? Each podcast generation is a one-shot, ephemeral operation.
Embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval,
no persistence requirement, and no cross-session reuse. A managed vector store adds latency, cost,
and operational complexity with zero benefit.
09. Tech Stack
Agent Framework | LangGraph | Compiled StateGraph with conditional routing, 5 independent subgraphs
LLM Orchestration | LangChain | LLM chain abstractions; LLMFactory for provider-agnostic swapping
LLM Provider | OpenAI GPT-4 | Dialogue generation, query production, character design, content planning
API Framework | FastAPI | Async REST API — job submission, status polling, audio download endpoints
Embeddings | Sentence-Transformers | 384-dim embeddings for semantic deduplication and relevance scoring
Vector Index | FAISS (CPU) | In-memory cosine similarity search for chunk deduplication
# Validate your environment
python preflight_check.py
# Start the API server
uvicorn main:app --host 0.0.0.0 --port 8080 --reload
4. Submit a Job via REST API
# Submit a podcast generation job
curl -X POST http://localhost:8080/api/v1/generate \
-H "Content-Type: application/json" \
-d '{
"topic": "The rise of humanoid robots in 2025",
"description": "Focus on Boston Dynamics, Figure, and Tesla Optimus",
"num_speakers": 2
}'

# Poll status
curl http://localhost:8080/api/v1/status/<job_id>
# Download completed podcast
curl -O http://localhost:8080/api/v1/download/<job_id>
API Endpoints
Method | Endpoint | Description
GET | /health | Health check with API version
POST | /api/v1/generate | Submit a new podcast generation job
GET | /api/v1/status/{job_id} | Get current job status and progress
GET | /api/v1/download/{job_id} | Download the completed podcast audio
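Behind these endpoints, the job lifecycle can be sketched as an in-memory registry, consistent with the pipeline's no-database design. This is a stand-in for illustration: field names and status values are assumptions, not the service's actual response schema.

```python
import uuid
from enum import Enum

class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETE = "complete"

# In-memory job registry; each entry tracks one generation request.
JOBS = {}

def submit(topic: str, num_speakers: int = 2) -> str:
    """POST /api/v1/generate: register a job and return its id."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"topic": topic, "num_speakers": num_speakers,
                    "status": JobStatus.QUEUED, "audio": None}
    return job_id

def status(job_id: str) -> JobStatus:
    """GET /api/v1/status/{job_id}: report current progress."""
    return JOBS[job_id]["status"]

def complete(job_id: str, audio: bytes) -> None:
    """Pipeline callback: attach the finished MP3 bytes to the job."""
    JOBS[job_id].update(status=JobStatus.COMPLETE, audio=audio)

job = submit("The rise of humanoid robots in 2025")
complete(job, b"ID3 placeholder mp3 bytes")
```

In the real service these functions would back the FastAPI routes above, with the download endpoint streaming the stored bytes.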
Hear It In Action
Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.