Technical Documentation

AI Podcast Generator
System Architecture

A fully autonomous, 5-phase LangGraph pipeline that transforms any topic into a production-ready podcast episode — complete with AI-generated dialogue, multi-speaker TTS voice synthesis, conversational overlap mixing, and broadcast-grade audio mastering.

LangGraph · FastAPI · GPT-4 · Gemini 2.5 Pro TTS · FAISS · Sentence-Transformers · pydub + ffmpeg · AWS Lambda
Contents
  1. Project Overview
  2. 5-Phase Pipeline Architecture
  3. Phase 1 — Research & Ingestion
  4. Phase 2 — Content Planning
  5. Phase 3 — Dialogue Generation
  6. Phase 4 — Voice Synthesis
  7. Phase 5 — Audio Post-Processing
  8. In-Memory Processing Architecture
  9. Tech Stack
  10. Design Decisions

Project Overview

The AI Podcast Generator is an autonomous system that converts any topic into a fully produced podcast episode through a 5-phase, 30+ node LangGraph pipeline. From web research to final MP3 export, every step is orchestrated by specialized AI agents with no human intervention required.

End-to-end flow

Topic → Research → Plan → Script → Voice → Master → MP3
Zero databases, zero temp files. The entire pipeline runs in-memory within a single AWS Lambda invocation (~210 MB RAM). This architecture decision yields 40% faster execution and $0.0003 per podcast vs $0.0009 with vector DB alternatives.

5-Phase Pipeline Architecture

The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a subgraph with its own nodes, conditional routing, retry logic, and quality gates.
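The phase sequencing can be sketched in plain Python. This is a minimal stdlib illustration of "sequential phases sharing typed state," not the actual LangGraph code; the phase bodies and state keys here are hypothetical placeholders, and the real system adds conditional routing, retries, and quality gates inside each phase subgraph.

```python
# Stdlib sketch of the 5-phase sequencing over a shared state dict.
# Phase names mirror the pipeline; bodies are hypothetical placeholders.

def research(state):   # Phase 1 — produces ranked chunks
    state["ranked_chunks"] = ["chunk-1", "chunk-2"]
    return state

def plan(state):       # Phase 2 — chapter structure + personas
    state["chapters"] = [{"title": "Intro", "chunks": state["ranked_chunks"]}]
    return state

def script(state):     # Phase 3 — SSML-annotated dialogue
    state["ssml_script"] = f"{len(state['chapters'])} chapter(s) scripted"
    return state

def voice(state):      # Phase 4 — per-utterance WAV clips
    state["clips"] = ["ch1.wav"]
    return state

def master(state):     # Phase 5 — DSP mastering and export
    state["output"] = "podcast_episode_final.mp3"
    return state

PHASES = [research, plan, script, voice, master]

def run_pipeline(topic, preferences=None):
    state = {"topic": topic, "preferences": preferences or {}}
    for phase in PHASES:
        state = phase(state)  # each phase reads and extends the shared state
    return state

episode = run_pipeline("quantum computing")
print(episode["output"])  # podcast_episode_final.mp3
```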

INPUT { topic, preferences }
🔍 PHASE 1 — Research & Ingestion ReAct
📋 PHASE 2 — Content Planning Sequential
🎭 PHASE 3 — Dialogue Generation 6 Nodes
🎙️ PHASE 4 — Voice Synthesis Gemini TTS
🎚️ PHASE 5 — Audio Post-Processing DSP
OUTPUT { podcast_episode_final.mp3 }
Full LangGraph Pipeline

[Figure: Complete LangGraph node topology across all 5 phases]

Phase 1 — Research & Ingestion

Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for script generation.

Query Producer Agent

A ReAct-style agent with 4 tools generates 10 diverse search queries using a freshness-aware routing strategy:

Tool                Purpose
web_search          Google Custom Search API — fetches SERP results
web_fetch           Scrapes full page content with Trafilatura
get_current_date    Provides current date for temporal query tagging
classify_freshness  Routes topic through recent vs evergreen path
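The freshness routing can be pictured with a tiny heuristic. This is a hypothetical stand-in for the `classify_freshness` tool (the real node is LLM-backed); the hint list below is purely illustrative.

```python
import re

# Illustrative keyword hints; the actual classifier is an LLM tool call.
RECENT_HINTS = {"latest", "breaking", "news", "announcement", "2025"}

def classify_freshness(topic: str) -> str:
    """Route to the 'recent' research path when a topic looks time-sensitive,
    otherwise treat it as 'evergreen'."""
    words = set(re.findall(r"[a-z0-9]+", topic.lower()))
    return "recent" if words & RECENT_HINTS else "evergreen"

print(classify_freshness("latest GPT-4 announcement"))  # recent
print(classify_freshness("history of jazz"))            # evergreen
```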

Deduplication & Relevance Scoring

~100 scraped documents are processed entirely in memory:

# In-memory processing pipeline
merged_text (str)
  → chunks (List[Dict])         # ~200 items, 500-word segments
  → embeddings (np.ndarray)     # shape: (200, 384)
  → faiss_index (IndexFlatIP)   # cosine similarity dedup
  → unique_chunks (List[Dict])  # ~100–150 after dedup
  → ranked_chunks (List[Dict])  # top 60 by cross-encoder
  → state["ranked_chunks"]      # passed to Phase 2
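The dedup step reduces to cosine similarity over unit-normalized embeddings, which is exactly what FAISS `IndexFlatIP` computes via inner products. A pure-Python sketch of the idea, with toy 2-d embeddings and a hypothetical similarity cutoff (the real threshold isn't stated in this doc):

```python
import math

def normalize(v):
    """Scale a vector to unit length so inner product == cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dedup(chunks, embeddings, threshold=0.92):
    """Keep a chunk only if it is not too similar to any already-kept chunk.
    threshold is illustrative, not the pipeline's actual cutoff."""
    kept, kept_vecs = [], []
    for chunk, emb in zip(chunks, embeddings):
        emb = normalize(emb)
        if all(sum(a * b for a, b in zip(emb, v)) < threshold for v in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(emb)
    return kept

chunks = ["AI podcasts", "AI podcast shows", "gardening tips"]
embs = [[1.0, 0.1], [0.98, 0.12], [0.05, 1.0]]  # toy 2-d embeddings
print(dedup(chunks, embs))  # ['AI podcasts', 'gardening tips']
```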

Phase 2 — Content Planning

Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.

Chapter Planner

Creates a multi-chapter episode structure with acts (setup → conflict → resolution), energy arcs, key points per chapter, source chunk assignments, and transition hooks between chapters.

Character Designer Agent

Designs 2–3 speaker personas per episode, each with detailed attributes, in a single structured-output call.

Why single-call persona design? Multi-turn persona refinement risks personality drift between iterations. A single structured output call produces internally consistent characters with complementary dynamics (e.g., curious host vs. authoritative expert).
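One way to picture the single-call contract is a typed cast schema validated as a whole. The field names and checks below are hypothetical (the real system returns structured output validated with Pydantic v2); the point is that all speakers arrive together, so complementary dynamics can be enforced in one place.

```python
from dataclasses import dataclass, field

# Hypothetical persona schema; the actual attributes are richer.
@dataclass
class Persona:
    name: str
    role: str                         # e.g. "curious host", "authoritative expert"
    speech_quirks: list = field(default_factory=list)

def validate_cast(personas):
    """Check the whole cast at once — possible only with single-call design."""
    if not 2 <= len(personas) <= 3:
        raise ValueError("episode needs 2-3 speakers")
    if len({p.role for p in personas}) != len(personas):
        raise ValueError("speaker roles must be distinct for complementary dynamics")
    return personas

cast = validate_cast([
    Persona("Maya", "curious host", ["thinks aloud"]),
    Persona("Dr. Reed", "authoritative expert", ["precise analogies"]),
])
print([p.name for p in cast])
```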

Phase 3 — Dialogue Generation

The creative heart of the pipeline. Phase 3 is a 6-node LangGraph subgraph that transforms chapter outlines into SSML-annotated dialogue scripts.

Node                  Role                Key Behavior
Dialogue Engine       Script writer       Beat-by-beat generation (5 beats/chapter) with context continuity
Expert Expander       Content enrichment  Expands expert utterances with detailed explanations while maintaining conversational flow
Naturalness Injector  Human-like speech   Injects 7 marker types: [FILLER:*], [PAUSE:*], [EMPHASIS:*], [PACE:*], [LAUGH:*], [INTERRUPT:*], [BACKCHANNEL:*]
Fact Checker          Verification        Validates claims against source chunks using grounding + semantic similarity
QA Reviewer           Quality gate        Scores engagement, repetition, clarity, transitions, and energy arc compliance
SSML Annotator        TTS preparation     Converts naturalness markers into Google Cloud TTS-compatible SSML with prosody control
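The marker-to-SSML conversion can be sketched with a few regex substitutions. The exact mappings below are assumptions for illustration; the real annotator targets Google Cloud TTS SSML with fuller prosody control than shown here.

```python
import re

def to_ssml(text: str) -> str:
    """Hypothetical marker-to-SSML mapping sketch.
    [PAUSE:*] -> <break/>, [EMPHASIS:*] -> <emphasis>, [FILLER:*] -> inline word."""
    text = re.sub(r"\[PAUSE:(\w+)\]", r'<break time="\1"/>', text)
    text = re.sub(r"\[EMPHASIS:([^\]]+)\]",
                  r'<emphasis level="moderate">\1</emphasis>', text)
    text = re.sub(r"\[FILLER:([^\]]+)\]", r"\1,", text)
    return f"<speak>{text}</speak>"

line = "[FILLER:um] that result [PAUSE:300ms] was [EMPHASIS:wild]"
print(to_ssml(line))
```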
Beat-by-beat, not monolithic. Each chapter is divided into 5 narrative beats generated sequentially. This prevents context window overflow, maintains energy arc control, and allows the QA reviewer to catch issues at a granular level before they propagate.
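The beat loop above can be sketched as a sequential generation with a rolling continuity context. `generate_beat` is a hypothetical stand-in for the GPT-4 call, and the "compact tail" summary is an illustrative continuity mechanism, not the pipeline's exact one.

```python
def generate_beat(chapter_outline, beat_index, context_summary):
    """Hypothetical stand-in for the LLM call that writes one beat."""
    return f"[beat {beat_index} of '{chapter_outline}' | context: {context_summary}]"

def write_chapter(chapter_outline, beats_per_chapter=5):
    beats, summary = [], "cold start"
    for i in range(1, beats_per_chapter + 1):
        beat = generate_beat(chapter_outline, i, summary)
        beats.append(beat)
        summary = beat[-60:]  # carry a compact tail forward as continuity context
    return beats

beats = write_chapter("Why attention won")
print(len(beats))  # 5
```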

Phase 4 — Voice Synthesis

Phase 4 converts SSML-annotated scripts into per-utterance WAV files using Gemini 2.5 Pro TTS as the primary engine, implemented as an 8-step LangGraph subgraph.

1️⃣
Contract Validation
Validates Phase 3 output: SSML structure, speaker metadata, chapter completeness
2️⃣
Voice Assignment
Fixed speaker-to-voice mapping for entire episode with similarity-based fallback
3️⃣
Utterance Normalization
Punctuation-aware splitting for oversized utterances; sub-ID lineage tracking
4️⃣
Gemini Request Routing
Constructs provider-optimized payloads with character/token guards
5️⃣
Parallel Synthesis
Bounded concurrency with exponential backoff on 429/503; QPS-safe execution
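The retry policy in step 5 can be sketched as bounded attempts with exponentially growing, jittered sleeps on retryable status codes. `call` stands in for the TTS request; attempt counts and base delay are illustrative.

```python
import random
import time

RETRYABLE = {429, 503}  # rate limit / service unavailable

def with_backoff(call, max_attempts=5, base_delay=0.5):
    """Retry `call` on 429/503 with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        status, payload = call()
        if status not in RETRYABLE:
            return status, payload
        # full jitter: sleep somewhere in [0, base * 2^attempt]
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return status, payload  # give up, surface the last result

calls = iter([(429, None), (503, None), (200, b"wav-bytes")])
status, audio = with_backoff(lambda: next(calls), base_delay=0.0)
print(status)  # 200
```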
6️⃣
Audio Quality Gate
Technical QC on generated clips with selective auto-repair for failed segments
7️⃣
Chapter Manifest
Builds ordered clip registry with durations, speaker metadata, timing directives
8️⃣
Phase 5 Handoff
Packages WAV clips + timing metadata + voice map into Phase 5 input contract
Voice consistency is non-negotiable. Voice identity is fixed per speaker for the entire episode. Any fallback voice must be pre-mapped by similarity profile and logged. Listeners detect voice changes instantly — even subtle ones break immersion.
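Fixed assignment with a pre-mapped fallback is simple to express as a lookup. The voice names and the similarity-based fallback table below are hypothetical, used only to show the shape of the mechanism.

```python
# Hypothetical voice bank and similarity-profile fallback map.
FALLBACK = {"Aoede": "Kore", "Charon": "Fenrir"}

def assign_voices(speakers, available):
    """Map each speaker to one voice, fixed for the entire episode.
    If the preferred voice is unavailable, use its pre-mapped similar voice."""
    assignment = {}
    for speaker, preferred in speakers.items():
        voice = preferred if preferred in available else FALLBACK[preferred]
        assignment[speaker] = voice  # never changes mid-episode
    return assignment

speakers = {"host": "Aoede", "expert": "Charon"}
print(assign_voices(speakers, available={"Aoede", "Fenrir"}))
# {'host': 'Aoede', 'expert': 'Fenrir'}
```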

Phase 5 — Audio Post-Processing

Phase 5 is almost entirely pure audio DSP — no LLMs except one lightweight script-scan call. It transforms raw utterance WAVs into a broadcast-ready podcast episode.

🔊
Audio Overlap Engine
Mixes clips with conversational overlaps using INTERRUPT/BACKCHANNEL/LAUGH timing directives. Adds 50–100ms cross-fades between speaker turns. Backchannels mixed at -8dB.
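The -8dB backchannel level translates to a linear amplitude factor of 10^(-8/20) ≈ 0.398, i.e. backchannel samples are scaled to roughly 40% before summing into the main track. A minimal sketch of that conversion and mix (the real mixing runs on pydub audio segments, not raw sample lists):

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel offset to a linear amplitude multiplier."""
    return 10 ** (db / 20)

def mix(main, backchannel, bc_db=-8.0):
    """Sum a backchannel into the main track at a reduced level."""
    g = db_to_gain(bc_db)
    return [m + g * b for m, b in zip(main, backchannel)]

print(round(db_to_gain(-8.0), 3))  # 0.398
```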
🎛️
Audio Post-Processor
Professional mastering chain: noise gate (-40dB), EQ (2–5kHz presence boost, sub-100Hz cut), 2:1 compression, loudness normalization to -16 LUFS (Spotify/Apple standard).
🎬
Cold Open Generator
LLM (Claude Haiku) scans the full script to identify the most compelling 15–30s moment, then extracts the corresponding audio slice as the episode hook.
📦
Chapter Stitcher
Assembles cold open + intro music + all chapters + host outro + outro music into final MP3 with ID3 metadata (title, artist, artwork, chapter markers).
# Audio mastering chain (per chapter)
noise_gate(threshold=-40dB)
  → eq(boost="2-5kHz", cut="<100Hz")
  → compress(ratio=2:1, threshold=-20dB)
  → normalize(target=-16 LUFS)
  → room_tone(level=-32dB)
  → chapter_N_mastered.wav
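The normalization stage reduces to measuring integrated loudness and applying the difference to the target as flat gain. A sketch of that arithmetic, with an illustrative measured value (loudness measurement itself is done by the DSP chain, not shown here):

```python
def normalization_gain(measured_lufs: float, target_lufs: float = -16.0) -> float:
    """Return the flat gain (in dB) needed to hit the target loudness."""
    return target_lufs - measured_lufs

# A chapter measuring -21.5 LUFS needs +5.5 dB of gain to reach the
# -16 LUFS Spotify/Apple standard.
print(normalization_gain(-21.5))  # 5.5
```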

In-Memory Processing Architecture

A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.

⚡
Performance
40% faster than Bedrock vector DB alternative. Sub-second processing latency with zero I/O overhead.
💰
Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒
Multi-Tenant Safety
Each Lambda invocation is isolated. No shared state, no cross-tenant data leakage possible by design.
📐
Memory Budget
~210 MB peak usage: models (~150 MB) + embeddings (~1.2 MB) + FAISS index (~0.6 MB) + chunks (~50 MB). Well under Lambda's 2 GB limit.
Why not a vector database? Each podcast generation is a one-shot operation — embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval, no persistence requirement, and no cross-session reuse. A vector DB would add latency, cost, and operational complexity with zero benefit.

Tech Stack

Agent Framework
LangGraph
Stateful graph with conditional edges, subgraphs per phase, checkpoint persistence
LLM
OpenAI GPT-4
Dialogue generation, query production, character design, content planning
TTS Engine
Gemini 2.5 Pro
Multi-voice synthesis with SSML prosody, 10-voice bank, parallel execution
Cold Open LLM
Claude Haiku
Lightweight script scan for compelling excerpt selection (single call)
API
FastAPI
Async REST endpoints for job submission and status tracking
Embeddings
Sentence-Transformers
384-dim embeddings for semantic deduplication and relevance scoring
Vector Index
FAISS
In-memory cosine similarity search for chunk deduplication
Audio DSP
pydub + ffmpeg
Overlap mixing, EQ, compression, loudness normalization, MP3 export
Web Scraping
Trafilatura
Clean text extraction from web pages, boilerplate removal
Search
Google Custom Search
SERP results for research phase query execution
Runtime
AWS Lambda
Serverless, 2 GB RAM, 5-min timeout, isolated per invocation
Validation
Pydantic v2
Typed state models, inter-phase contracts, auto-truncation validators

Design Decisions

Eight core principles that shaped the architecture of the podcast generation pipeline.

Principle 01
Phase-as-Subgraph Isolation
Each phase is a self-contained LangGraph subgraph with its own state, retry logic, and quality gates. Phases communicate through strongly-typed contracts — never shared globals.
Principle 02
In-Memory Everything
No databases, no temp files, no external vector stores. All intermediate state lives in Python objects within a single Lambda invocation. Simpler, faster, cheaper.
Principle 03
Beat-by-Beat Generation
Dialogue is generated in 5 narrative beats per chapter, not monolithically. This prevents context overflow, maintains energy arc control, and enables granular QA.
Principle 04
Naturalness Markers as First-Class Data
7 marker types ([FILLER], [PAUSE], [LAUGH], etc.) flow through the entire pipeline — from script to SSML to timing directives to audio overlap. They are data, not decoration.
Principle 05
Voice Consistency > Voice Quality
A speaker's voice is fixed for the entire episode. Fallback voices are pre-mapped by similarity. Listeners detect inconsistency faster than low quality.
Principle 06
DSP over LLMs for Audio
Phase 5 uses zero LLMs for audio processing (except one Haiku call for cold open selection). Mastering, mixing, and normalization are deterministic DSP operations.
Principle 07
Freshness-Aware Research
The query producer classifies topics as "recent" or "evergreen" and routes through different research paths. Recent topics get date-tagged queries and live web data.
Principle 08
Contract-Driven Phase Handoffs
Each phase validates its input contract before execution begins. If Phase 3's output doesn't match Phase 4's expected schema, the pipeline fails fast with a clear error — not silently.
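The fail-fast handoff can be pictured as a schema check run before a phase starts. The real pipeline validates with Pydantic v2 models; this stdlib sketch uses an illustrative key set to show the behavior, not the actual contract.

```python
# Hypothetical Phase 3 -> Phase 4 contract keys, for illustration only.
REQUIRED_KEYS = {"ssml_script", "speaker_map", "chapters"}

def validate_phase4_input(contract: dict) -> dict:
    """Refuse to start Phase 4 on a malformed Phase 3 output."""
    missing = REQUIRED_KEYS - contract.keys()
    if missing:
        raise ValueError(f"Phase 3 -> Phase 4 contract missing: {sorted(missing)}")
    return contract

try:
    validate_phase4_input({"ssml_script": "<speak/>"})
except ValueError as e:
    print(e)  # fails fast with a clear error instead of proceeding silently
```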

Hear It In Action

Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.