Technical Documentation

AI Podcast Generator
System Architecture

A fully autonomous, 5-phase LangGraph pipeline that transforms any topic into a production-ready podcast episode — complete with AI-generated dialogue, multi-speaker TTS voice synthesis, conversational overlap mixing, and broadcast-grade audio mastering.

LangGraph · FastAPI · GPT-4 · Gemini 2.5 Pro TTS · FAISS · Sentence-Transformers · pydub + ffmpeg · AWS Lambda
Contents
  1. Project Overview
  2. 5-Phase Pipeline Architecture
  3. Phase 1 — Research & Ingestion
  4. Phase 2 — Content Planning
  5. Phase 3 — Dialogue Generation
  6. Phase 4 — Voice Synthesis
  7. Phase 5 — Audio Post-Processing
  8. In-Memory Processing Architecture
  9. Tech Stack
  10. Design Decisions

Project Overview

The AI Podcast Generator is an autonomous system that converts any topic into a fully produced podcast episode through a 5-phase, 30+ node LangGraph pipeline. From web research to final MP3 export, every step is orchestrated by specialized AI agents with no human intervention required.

End-to-end flow

Topic → Research → Plan → Script → Voice → Master → MP3
Zero databases, zero temp files. The entire pipeline runs in-memory within a single AWS Lambda invocation (~210 MB RAM). This architecture decision yields 40% faster execution and $0.0003 per podcast vs $0.0009 with vector DB alternatives.

5-Phase Pipeline Architecture

The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a subgraph with its own nodes, conditional routing, retry logic, and quality gates.
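The phase sequencing can be sketched in plain Python. This is a minimal stdlib illustration of "sequential phases sharing typed state," not the actual LangGraph code; the phase bodies and state keys here are hypothetical placeholders, and the real system adds conditional routing, retries, and quality gates inside each phase subgraph.

```python
# Stdlib sketch of the 5-phase sequencing over a shared state dict.
# Phase names mirror the pipeline; bodies are hypothetical placeholders.

def research(state):   # Phase 1 — produces ranked chunks
    state["ranked_chunks"] = ["chunk-1", "chunk-2"]
    return state

def plan(state):       # Phase 2 — chapter structure + personas
    state["chapters"] = [{"title": "Intro", "chunks": state["ranked_chunks"]}]
    return state

def script(state):     # Phase 3 — SSML-annotated dialogue
    state["ssml_script"] = f"{len(state['chapters'])} chapter(s) scripted"
    return state

def voice(state):      # Phase 4 — per-utterance WAV clips
    state["clips"] = ["ch1.wav"]
    return state

def master(state):     # Phase 5 — DSP mastering and export
    state["output"] = "podcast_episode_final.mp3"
    return state

PHASES = [research, plan, script, voice, master]

def run_pipeline(topic, preferences=None):
    state = {"topic": topic, "preferences": preferences or {}}
    for phase in PHASES:
        state = phase(state)  # each phase reads and extends the shared state
    return state

episode = run_pipeline("quantum computing")
print(episode["output"])  # podcast_episode_final.mp3
```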

INPUT { topic, preferences }
🔍 PHASE 1 — Research & Ingestion ReAct
📋 PHASE 2 — Content Planning Sequential
🎭 PHASE 3 — Dialogue Generation 6 Nodes
🎙️ PHASE 4 — Voice Synthesis Gemini TTS
🎚️ PHASE 5 — Audio Post-Processing DSP
OUTPUT { podcast_episode_final.mp3 }
Full LangGraph Pipeline

[Figure: Complete LangGraph node topology across all 5 phases]

Phase 1 — Research & Ingestion

Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for script generation.

Query Producer Agent

A ReAct-style agent with 4 tools generates 10 diverse search queries using a freshness-aware routing strategy:

Tool                Purpose
web_search          Google Custom Search API — fetches SERP results
web_fetch           Scrapes full page content with Trafilatura
get_current_date    Provides current date for temporal query tagging
classify_freshness  Routes topic through recent vs evergreen path
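The freshness routing can be pictured with a tiny heuristic. This is a hypothetical stand-in for the `classify_freshness` tool (the real node is LLM-backed); the hint list below is purely illustrative.

```python
import re

# Illustrative keyword hints; the actual classifier is an LLM tool call.
RECENT_HINTS = {"latest", "breaking", "news", "announcement", "2025"}

def classify_freshness(topic: str) -> str:
    """Route to the 'recent' research path when a topic looks time-sensitive,
    otherwise treat it as 'evergreen'."""
    words = set(re.findall(r"[a-z0-9]+", topic.lower()))
    return "recent" if words & RECENT_HINTS else "evergreen"

print(classify_freshness("latest GPT-4 announcement"))  # recent
print(classify_freshness("history of jazz"))            # evergreen
```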

Deduplication & Relevance Scoring

~100 scraped documents are processed entirely in memory:

# In-memory processing pipeline
merged_text (str)
  → chunks (List[Dict])         # ~200 items, 500-word segments
  → embeddings (np.ndarray)     # shape: (200, 384)
  → faiss_index (IndexFlatIP)   # cosine similarity dedup
  → unique_chunks (List[Dict])  # ~100–150 after dedup
  → ranked_chunks (List[Dict])  # top 60 by cross-encoder
  → state["ranked_chunks"]      # passed to Phase 2
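The dedup step reduces to cosine similarity over unit-normalized embeddings, which is exactly what FAISS `IndexFlatIP` computes via inner products. A pure-Python sketch of the idea, with toy 2-d embeddings and a hypothetical similarity cutoff (the real threshold isn't stated in this doc):

```python
import math

def normalize(v):
    """Scale a vector to unit length so inner product == cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dedup(chunks, embeddings, threshold=0.92):
    """Keep a chunk only if it is not too similar to any already-kept chunk.
    threshold is illustrative, not the pipeline's actual cutoff."""
    kept, kept_vecs = [], []
    for chunk, emb in zip(chunks, embeddings):
        emb = normalize(emb)
        if all(sum(a * b for a, b in zip(emb, v)) < threshold for v in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(emb)
    return kept

chunks = ["AI podcasts", "AI podcast shows", "gardening tips"]
embs = [[1.0, 0.1], [0.98, 0.12], [0.05, 1.0]]  # toy 2-d embeddings
print(dedup(chunks, embs))  # ['AI podcasts', 'gardening tips']
```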

Phase 2 — Content Planning

Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.

Chapter Planner

Creates a multi-chapter episode structure with acts (setup → conflict → resolution), energy arcs, key points per chapter, source chunk assignments, and transition hooks between chapters.

Character Designer Agent

Designs 2–3 speaker personas per episode, each with detailed attributes, in a single structured-output call.

Why single-call persona design? Multi-turn persona refinement risks personality drift between iterations. A single structured output call produces internally consistent characters with complementary dynamics (e.g., curious host vs. authoritative expert).
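One way to picture the single-call contract is a typed cast schema validated as a whole. The field names and checks below are hypothetical (the real system returns structured output validated with Pydantic v2); the point is that all speakers arrive together, so complementary dynamics can be enforced in one place.

```python
from dataclasses import dataclass, field

# Hypothetical persona schema; the actual attributes are richer.
@dataclass
class Persona:
    name: str
    role: str                         # e.g. "curious host", "authoritative expert"
    speech_quirks: list = field(default_factory=list)

def validate_cast(personas):
    """Check the whole cast at once — possible only with single-call design."""
    if not 2 <= len(personas) <= 3:
        raise ValueError("episode needs 2-3 speakers")
    if len({p.role for p in personas}) != len(personas):
        raise ValueError("speaker roles must be distinct for complementary dynamics")
    return personas

cast = validate_cast([
    Persona("Maya", "curious host", ["thinks aloud"]),
    Persona("Dr. Reed", "authoritative expert", ["precise analogies"]),
])
print([p.name for p in cast])
```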

Phase 3 — Dialogue Generation

The creative heart of the pipeline. Phase 3 is a 6-node LangGraph subgraph that transforms chapter outlines into SSML-annotated dialogue scripts.

Node                  Role                Key Behavior
Dialogue Engine       Script writer       Beat-by-beat generation (5 beats/chapter) with context continuity
Expert Expander       Content enrichment  Expands expert utterances with detailed explanations while maintaining conversational flow
Naturalness Injector  Human-like speech   Injects 7 marker types: [FILLER:*], [PAUSE:*], [EMPHASIS:*], [PACE:*], [LAUGH:*], [INTERRUPT:*], [BACKCHANNEL:*]
Fact Checker          Verification        Validates claims against source chunks using grounding + semantic similarity
QA Reviewer           Quality gate        Scores engagement, repetition, clarity, transitions, and energy arc compliance
SSML Annotator        TTS preparation     Converts naturalness markers into Google Cloud TTS-compatible SSML with prosody control
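The marker-to-SSML conversion can be sketched with a few regex substitutions. The exact mappings below are assumptions for illustration; the real annotator targets Google Cloud TTS SSML with fuller prosody control than shown here.

```python
import re

def to_ssml(text: str) -> str:
    """Hypothetical marker-to-SSML mapping sketch.
    [PAUSE:*] -> <break/>, [EMPHASIS:*] -> <emphasis>, [FILLER:*] -> inline word."""
    text = re.sub(r"\[PAUSE:(\w+)\]", r'<break time="\1"/>', text)
    text = re.sub(r"\[EMPHASIS:([^\]]+)\]",
                  r'<emphasis level="moderate">\1</emphasis>', text)
    text = re.sub(r"\[FILLER:([^\]]+)\]", r"\1,", text)
    return f"<speak>{text}</speak>"

line = "[FILLER:um] that result [PAUSE:300ms] was [EMPHASIS:wild]"
print(to_ssml(line))
```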
Beat-by-beat, not monolithic. Each chapter is divided into 5 narrative beats generated sequentially. This prevents context window overflow, maintains energy arc control, and allows the QA reviewer to catch issues at a granular level before they propagate.
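The beat loop above can be sketched as a sequential generation with a rolling continuity context. `generate_beat` is a hypothetical stand-in for the GPT-4 call, and the "compact tail" summary is an illustrative continuity mechanism, not the pipeline's exact one.

```python
def generate_beat(chapter_outline, beat_index, context_summary):
    """Hypothetical stand-in for the LLM call that writes one beat."""
    return f"[beat {beat_index} of '{chapter_outline}' | context: {context_summary}]"

def write_chapter(chapter_outline, beats_per_chapter=5):
    beats, summary = [], "cold start"
    for i in range(1, beats_per_chapter + 1):
        beat = generate_beat(chapter_outline, i, summary)
        beats.append(beat)
        summary = beat[-60:]  # carry a compact tail forward as continuity context
    return beats

beats = write_chapter("Why attention won")
print(len(beats))  # 5
```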

Phase 4 — Voice Synthesis

Phase 4 converts SSML-annotated scripts into per-utterance WAV files using Gemini 2.5 Pro TTS as the primary engine, implemented as an 8-step LangGraph subgraph.

1️⃣
Contract Validation
Validates Phase 3 output: SSML structure, speaker metadata, chapter completeness
2️⃣
Voice Assignment
Fixed speaker-to-voice mapping for entire episode with similarity-based fallback
3️⃣
Utterance Normalization
Punctuation-aware splitting for oversized utterances; sub-ID lineage tracking
4️⃣
Gemini Request Routing
Constructs provider-optimized payloads with character/token guards
5️⃣
Parallel Synthesis
Bounded concurrency with exponential backoff on 429/503; QPS-safe execution
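The retry policy in step 5 can be sketched as bounded attempts with exponentially growing, jittered sleeps on retryable status codes. `call` stands in for the TTS request; attempt counts and base delay are illustrative.

```python
import random
import time

RETRYABLE = {429, 503}  # rate limit / service unavailable

def with_backoff(call, max_attempts=5, base_delay=0.5):
    """Retry `call` on 429/503 with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        status, payload = call()
        if status not in RETRYABLE:
            return status, payload
        # full jitter: sleep somewhere in [0, base * 2^attempt]
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return status, payload  # give up, surface the last result

calls = iter([(429, None), (503, None), (200, b"wav-bytes")])
status, audio = with_backoff(lambda: next(calls), base_delay=0.0)
print(status)  # 200
```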
6️⃣
Audio Quality Gate
Technical QC on generated clips with selective auto-repair for failed segments
7️⃣
Chapter Manifest
Builds ordered clip registry with durations, speaker metadata, timing directives
8️⃣
Phase 5 Handoff
Packages WAV clips + timing metadata + voice map into Phase 5 input contract
Voice consistency is non-negotiable. Voice identity is fixed per speaker for the entire episode. Any fallback voice must be pre-mapped by similarity profile and logged. Listeners detect voice changes instantly — even subtle ones break immersion.
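Fixed assignment with a pre-mapped fallback is simple to express as a lookup. The voice names and the similarity-based fallback table below are hypothetical, used only to show the shape of the mechanism.

```python
# Hypothetical voice bank and similarity-profile fallback map.
FALLBACK = {"Aoede": "Kore", "Charon": "Fenrir"}

def assign_voices(speakers, available):
    """Map each speaker to one voice, fixed for the entire episode.
    If the preferred voice is unavailable, use its pre-mapped similar voice."""
    assignment = {}
    for speaker, preferred in speakers.items():
        voice = preferred if preferred in available else FALLBACK[preferred]
        assignment[speaker] = voice  # never changes mid-episode
    return assignment

speakers = {"host": "Aoede", "expert": "Charon"}
print(assign_voices(speakers, available={"Aoede", "Fenrir"}))
# {'host': 'Aoede', 'expert': 'Fenrir'}
```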

Phase 5 — Audio Post-Processing

Phase 5 is almost entirely pure audio DSP — no LLMs except one lightweight script-scan call. It transforms raw utterance WAVs into a broadcast-ready podcast episode.

🔊
Audio Overlap Engine
Mixes clips with conversational overlaps using INTERRUPT/BACKCHANNEL/LAUGH timing directives. Adds 50–100ms cross-fades between speaker turns. Backchannels mixed at -8dB.
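The -8dB backchannel level translates to a linear amplitude factor of 10^(-8/20) ≈ 0.398, i.e. backchannel samples are scaled to roughly 40% before summing into the main track. A minimal sketch of that conversion and mix (the real mixing runs on pydub audio segments, not raw sample lists):

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier."""
    return 10 ** (db / 20)

def mix(main, backchannel, bc_db=-8.0):
    """Sum a backchannel into the main track at a reduced level."""
    g = db_to_gain(bc_db)
    return [m + g * b for m, b in zip(main, backchannel)]

print(round(db_to_gain(-8.0), 3))  # 0.398
```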
🎛️
Audio Post-Processor
Professional mastering chain: noise gate (-40dB), EQ (2–5kHz presence boost, sub-100Hz cut), 2:1 compression, loudness normalization to -16 LUFS (Spotify/Apple standard).
🎬
Cold Open Generator
LLM (Claude Haiku) scans the full script to identify the most compelling 15–30s moment, then extracts the corresponding audio slice as the episode hook.
📦
Chapter Stitcher
Assembles cold open + intro music + all chapters + host outro + outro music into final MP3 with ID3 metadata (title, artist, artwork, chapter markers).
# Audio mastering chain (per chapter)
noise_gate(threshold=-40dB)
  → eq(boost="2-5kHz", cut="<100Hz")
  → compress(ratio=2:1, threshold=-20dB)
  → normalize(target=-16 LUFS)
  → room_tone(level=-32dB)
  → chapter_N_mastered.wav
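The normalization stage reduces to measuring integrated loudness and applying the difference to the target as flat gain. A sketch of that arithmetic, with an illustrative measured value (loudness measurement itself is done by the DSP chain, not shown here):

```python
def normalization_gain(measured_lufs: float, target_lufs: float = -16.0) -> float:
    """Return the flat gain (in dB) needed to hit the target loudness."""
    return target_lufs - measured_lufs

# A chapter measuring -21.5 LUFS needs +5.5 dB of gain to reach the
# -16 LUFS Spotify/Apple standard.
print(normalization_gain(-21.5))  # 5.5
```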

In-Memory Processing Architecture

A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.

⚡
Performance
40% faster than Bedrock vector DB alternative. Sub-second processing latency with zero I/O overhead.
💰
Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒
Multi-Tenant Safety
Each Lambda invocation is isolated. No shared state, no cross-tenant data leakage possible by design.
📐
Memory Budget
~210 MB peak usage: models (~150 MB) + embeddings (~1.2 MB) + FAISS index (~0.6 MB) + chunks (~50 MB). Well under Lambda's 2 GB limit.
Why not a vector database? Each podcast generation is a one-shot operation — embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval, no persistence requirement, and no cross-session reuse. A vector DB would add latency, cost, and operational complexity with zero benefit.

Tech Stack

Agent Framework
LangGraph
Stateful graph with conditional edges, subgraphs per phase, checkpoint persistence
LLM
OpenAI GPT-4
Dialogue generation, query production, character design, content planning
TTS Engine
Gemini 2.5 Pro
Multi-voice synthesis with SSML prosody, 10-voice bank, parallel execution
Cold Open LLM
Claude Haiku
Lightweight script scan for compelling excerpt selection (single call)
API
FastAPI
Async REST endpoints for job submission and status tracking
Embeddings
Sentence-Transformers
384-dim embeddings for semantic deduplication and relevance scoring
Vector Index
FAISS
In-memory cosine similarity search for chunk deduplication
Audio DSP
pydub + ffmpeg
Overlap mixing, EQ, compression, loudness normalization, MP3 export
Web Scraping
Trafilatura
Clean text extraction from web pages, boilerplate removal
Search
Google Custom Search
SERP results for research phase query execution
Runtime
AWS Lambda
Serverless, 2 GB RAM, 5-min timeout, isolated per invocation
Validation
Pydantic v2
Typed state models, inter-phase contracts, auto-truncation validators

Design Decisions

Eight core principles that shaped the architecture of the podcast generation pipeline.

Principle 01
Phase-as-Subgraph Isolation
Each phase is a self-contained LangGraph subgraph with its own state, retry logic, and quality gates. Phases communicate through strongly-typed contracts — never shared globals.
Principle 02
In-Memory Everything
No databases, no temp files, no external vector stores. All intermediate state lives in Python objects within a single Lambda invocation. Simpler, faster, cheaper.
Principle 03
Beat-by-Beat Generation
Dialogue is generated in 5 narrative beats per chapter, not monolithically. This prevents context overflow, maintains energy arc control, and enables granular QA.
Principle 04
Naturalness Markers as First-Class Data
7 marker types ([FILLER], [PAUSE], [LAUGH], etc.) flow through the entire pipeline — from script to SSML to timing directives to audio overlap. They are data, not decoration.
Principle 05
Voice Consistency > Voice Quality
A speaker's voice is fixed for the entire episode. Fallback voices are pre-mapped by similarity. Listeners detect inconsistency faster than low quality.
Principle 06
DSP over LLMs for Audio
Phase 5 uses zero LLMs for audio processing (except one Haiku call for cold open selection). Mastering, mixing, and normalization are deterministic DSP operations.
Principle 07
Freshness-Aware Research
The query producer classifies topics as "recent" or "evergreen" and routes through different research paths. Recent topics get date-tagged queries and live web data.
Principle 08
Contract-Driven Phase Handoffs
Each phase validates its input contract before execution begins. If Phase 3's output doesn't match Phase 4's expected schema, the pipeline fails fast with a clear error — not silently.
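The fail-fast handoff can be pictured as a schema check run before a phase starts. The real pipeline validates with Pydantic v2 models; this stdlib sketch uses an illustrative key set to show the behavior, not the actual contract.

```python
# Hypothetical Phase 3 -> Phase 4 contract keys, for illustration only.
REQUIRED_KEYS = {"ssml_script", "speaker_map", "chapters"}

def validate_phase4_input(contract: dict) -> dict:
    """Refuse to start Phase 4 on a malformed Phase 3 output."""
    missing = REQUIRED_KEYS - contract.keys()
    if missing:
        raise ValueError(f"Phase 3 -> Phase 4 contract missing: {sorted(missing)}")
    return contract

try:
    validate_phase4_input({"ssml_script": "<speak/>"})
except ValueError as e:
    print(e)  # fails fast with a clear error instead of proceeding silently
```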

Hear It In Action

Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.