Technical Documentation

Podcast Creator
System Architecture

An autonomous, multi-speaker podcast creation system powered by LangGraph agent pipelines. Given any topic, the pipeline researches the web, plans chapters, designs characters, generates naturalistic dialogue, synthesises multi-voice audio, and delivers a finished podcast — with no human in the loop.

LangGraph · FastAPI · GPT-4 · FAISS · Sentence-Transformers · Google Custom Search · pydub · AWS Lambda
Contents
  1. Project Overview
  2. 5-Phase Pipeline Architecture
  3. Phase 1 — Research & Ingestion
  4. Phase 2 — Content Planning
  5. Phase 3 — Dialogue Generation
  6. Phase 4 — Voice Synthesis
  7. Phase 5 — Audio Post-Processing
  8. In-Memory Processing Architecture
  9. Tech Stack
  10. Getting Started

Project Overview

Podcast Creator is a fully autonomous, agentic pipeline that transforms any text topic into a broadcast-quality, multi-speaker podcast. It is architected as a sequence of five independent LangGraph subgraphs, each responsible for a distinct phase of production — from internet research all the way through to final audio post-processing.

The system was designed with two core engineering goals: composability (each phase is a self-contained, testable unit) and serverless-first performance (all intermediate state is held in memory, eliminating I/O overhead and enabling zero-config horizontal scaling on AWS Lambda).

End-to-end flow

Topic → Research → Plan → Dialogue → TTS → Master → MP3
Pure in-memory processing. The entire pipeline runs in RAM within a single AWS Lambda invocation (~210 MB peak). No databases, no temp files — zero cleanup required. This yields 40% faster execution than the Bedrock vector-DB alternative, at $0.0003 per podcast.

5-Phase Pipeline Architecture

The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a compiled subgraph with its own state, nodes, and quality gates. Phases communicate through strongly-typed Pydantic contracts.

INPUT { "topic": "AI in healthcare" }
🔍 PHASE 1 — Research & Ingestion ✅ Live
📋 PHASE 2 — Content Planning 🔄 WIP
🎭 PHASE 3 — Dialogue Generation 🔄 WIP
🎙️ PHASE 4 — Voice Synthesis 🔄 WIP
🎚️ PHASE 5 — Audio Post-Processing 🔄 WIP
OUTPUT { podcast.mp3 — 15–30 min, multi-speaker }
| Phase | Name | Status | Key Agents / Nodes | Output |
|---|---|---|---|---|
| 1 | Research & Ingestion | ✅ Complete | QueryProducer, WebScraper, DedupRelevanceScorer | Top 60 ranked content chunks |
| 2 | Content Planning | 🔄 In Progress | ChapterPlanner, CharacterDesigner | Chapter outlines, speaker personas |
| 3 | Dialogue Generation | 🔄 In Progress | DialogueEngine, NaturalnessInjector, SSMLAnnotator, FactChecker, QAReviewer | SSML-annotated multi-speaker scripts |
| 4 | Voice Synthesis | 🔄 In Progress | TTSRouter | Per-speaker audio segments |
| 5 | Audio Post-Processing | 🔄 In Progress | OverlapEngine, ChapterStitcher, ColdOpenGenerator, PostProcessor | Final podcast WAV/MP3 |
Modular subgraph design. Each phase follows the same pattern: a TypedDict state, a set of nodes, and a compiled StateGraph. This makes every phase independently importable, testable, and extensible. Adding Phase 6 requires zero changes to existing phases.
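As a minimal sketch of that pattern — plain Python standing in for LangGraph's compiled StateGraph, with state fields matching the Phase 1 invoke() example in this document and two illustrative stand-in nodes:

```python
from typing import Callable, List, TypedDict

# Phase 1 state shape; field names follow the invoke() example
# shown later in this document.
class Phase1State(TypedDict):
    topic: str
    freshness: str
    queries: List[str]
    search_results: List[dict]
    ranked_chunks: List[dict]

# Stand-in nodes: each takes the full state and returns a partial
# update, mirroring how LangGraph merges node output into shared state.
def query_producer(state: Phase1State) -> dict:
    return {"queries": [f"{state['topic']} overview",
                        f"{state['topic']} 2025 news"]}

def web_scraper(state: Phase1State) -> dict:
    return {"search_results": [{"query": q, "text": "..."}
                               for q in state["queries"]]}

def run_phase(state: Phase1State, nodes: List[Callable]) -> Phase1State:
    # Minimal sequential executor standing in for a compiled StateGraph.
    for node in nodes:
        state = {**state, **node(state)}
    return state

state = run_phase(
    {"topic": "AI in healthcare", "freshness": "", "queries": [],
     "search_results": [], "ranked_chunks": []},
    [query_producer, web_scraper],
)
print(len(state["search_results"]))  # 2
```

Because every node is just a function over a typed state, a new phase plugs in by appending its subgraph to the sequence — existing phases stay untouched.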

Phase 1 — Research & Ingestion ✅ Complete

Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for planning. It consists of three sequential agents.

Query Producer Agent

Classifies the topic's freshness and generates 10 diverse search queries:

```python
# Phase 1 — standalone usage
from src.pipeline.phases.phase1_graph import create_phase1_graph

graph = create_phase1_graph()
state = graph.invoke({
    "topic": "CRISPR gene editing breakthroughs 2025",
    "freshness": "",
    "queries": [],
    "search_results": [],
    "ranked_chunks": []
})
print(f"Ranked chunks returned: {len(state['ranked_chunks'])}")
```

Web Scraper Agent

Executes Google Custom Search per query and scrapes full text from result URLs. Uses Requests + BeautifulSoup4 for HTML extraction with boilerplate removal.
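The boilerplate-removal idea can be sketched with the standard library alone — the real pipeline uses Requests + BeautifulSoup4 + lxml, and the tag list here is illustrative:

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (illustrative set).
SKIP = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside boilerplate tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every boilerplate tag.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<nav>Menu</nav><p>CRISPR trials expanded in 2025.</p><script>x()</script>"
print(extract_text(html))  # CRISPR trials expanded in 2025.
```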

Deduplication & Relevance Scorer

~100 scraped documents are processed entirely in memory through a 5-step pipeline:

```text
# In-memory processing — no databases, no files
merged_text (str)
  → chunks (List[Dict])          # ~200 items, 500-word segments
  → embeddings (np.ndarray)      # Shape: (200, 384)
  → faiss_index (IndexFlatIP)    # Cosine similarity dedup
  → unique_chunks (List[Dict])   # ~100-150 after dedup
  → ranked_chunks (List[Dict])   # Top 60 by cross-encoder
  → state["ranked_chunks"]       # Passed to Phase 2
```
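The dedup step can be illustrated without FAISS: normalise each embedding so that the inner product equals cosine similarity (which is what IndexFlatIP computes over normalised vectors), then keep a chunk only if it is not near-identical to one already kept. Toy 2-dimensional vectors stand in for the 384-dim sentence embeddings, and the 0.95 threshold is illustrative:

```python
import math

def normalize(v):
    # Unit-length vector: inner product then equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dedup(chunks, embeddings, threshold=0.95):
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        v = normalize(vec)
        # Drop the chunk if it is near-identical to one already kept.
        if all(sum(a * b for a, b in zip(v, kv)) < threshold
               for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(v)
    return kept

chunks = ["chunk A", "chunk A copy", "chunk B"]
embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(dedup(chunks, embeddings))  # ['chunk A', 'chunk B']
```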

Phase 2 — Content Planning 🔄 In Progress

Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.

📋
Chapter Planner
Outlines N chapters with titles, talking points, source chunk assignments, and transition hooks. Creates narrative acts (setup → conflict → resolution) with energy arcs.
🎭
Character Designer
Designs 2 distinct speaker personas per episode — name, role (host/expert/skeptic), speaking style, vocabulary level, and catchphrases — in a single structured LLM call.
Why single-call persona design? Multi-turn refinement risks personality drift. A single structured output call produces internally consistent characters with complementary dynamics (e.g., curious host vs. authoritative expert) that remain coherent across all chapters.
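A sketch of consuming that single structured call — the persona schema and field names below are hypothetical, not the project's actual contract:

```python
import json

# Hypothetical required fields for each persona object.
REQUIRED = {"name", "role", "speaking_style", "vocabulary_level", "catchphrases"}

def parse_personas(llm_json: str, num_speakers: int = 2):
    personas = json.loads(llm_json)["personas"]
    assert len(personas) == num_speakers, "expected one persona per speaker"
    for p in personas:
        missing = REQUIRED - p.keys()
        assert not missing, f"persona missing fields: {missing}"
    return personas

# Example payload of the kind a structured-output call might return.
raw = json.dumps({"personas": [
    {"name": "Maya", "role": "host", "speaking_style": "curious, upbeat",
     "vocabulary_level": "general", "catchphrases": ["Okay, walk me through it"]},
    {"name": "Dr. Chen", "role": "expert", "speaking_style": "measured, precise",
     "vocabulary_level": "technical", "catchphrases": ["The data suggests"]},
]})
personas = parse_personas(raw)
print([p["role"] for p in personas])  # ['host', 'expert']
```

Validating the whole cast in one pass is what lets the system reject an inconsistent response atomically, rather than patching personas turn by turn.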

Phase 3 — Dialogue Generation 🔄 In Progress

The creative heart of the pipeline. Phase 3 transforms chapter outlines into SSML-annotated dialogue scripts through a 5-agent subgraph.

| Agent | Role | Key Behavior |
|---|---|---|
| Dialogue Engine | Script writer | Generates raw multi-speaker dialogue per chapter, maintaining character voice consistency |
| Naturalness Injector | Human-like speech | Injects filler words, pauses, interruptions, reactions, and laughter to simulate authentic conversation |
| SSML Annotator | TTS preparation | Converts naturalness markers into TTS-compatible SSML with prosody control (rate, pitch, emphasis) |
| Fact Checker | Verification | Validates factual claims in the script against the Phase 1 source chunks |
| QA Reviewer | Quality gate | Scores engagement, repetition, clarity, transitions, and energy arc compliance before handoff |
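A minimal sketch of the marker-to-SSML step, assuming hypothetical inline markers such as `[PAUSE]` and `*emphasis*` — the project's actual marker set and prosody values may differ:

```python
import re

def annotate(line: str) -> str:
    # [PAUSE] marker → standard SSML break element.
    line = line.replace("[PAUSE]", '<break time="500ms"/>')
    # *word* → standard SSML emphasis element.
    line = re.sub(r"\*(.+?)\*", r'<emphasis level="moderate">\1</emphasis>', line)
    return f"<speak>{line}</speak>"

print(annotate("Well... [PAUSE] that's *exactly* the point."))
```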

Phase 4 — Voice Synthesis 🔄 In Progress

Phase 4 converts SSML-annotated scripts into per-speaker audio segments using a configurable TTS router that supports multiple voice providers.

🎙️
TTS Router Agent
Routes each script line to the correct TTS voice based on the speaker-to-voice map established in Phase 2. Supports multiple providers via a provider-agnostic interface.
🔊
Voice Consistency
Each speaker's voice is fixed for the entire episode. Voice identity is established once and never changes — listeners detect inconsistency instantly.
Parallel Synthesis
Audio segments are synthesised with bounded concurrency and exponential backoff on rate-limit errors, maximising throughput while respecting API quotas.
📦
Chapter Manifest
Builds an ordered clip registry with durations, speaker metadata, and timing directives that Phase 5 uses for stitching and overlap mixing.
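The bounded-concurrency-plus-backoff pattern can be sketched with asyncio; `synth()` is a stand-in for a provider call that simulates a rate-limit error on the first attempt for some lines:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for a provider 429 response."""

async def synth(line: str, attempt_log: list) -> bytes:
    attempt_log.append(line)
    # Simulate a 429 on the first attempt for lines ending in "!".
    if attempt_log.count(line) == 1 and line.endswith("!"):
        raise RateLimitError()
    return f"audio:{line}".encode()

async def synth_with_backoff(sem, line, log, retries=3):
    for attempt in range(retries):
        async with sem:  # at most N synth calls in flight
            try:
                return await synth(line, log)
            except RateLimitError:
                # Exponential backoff: 10ms, 20ms, 40ms.
                await asyncio.sleep(0.01 * (2 ** attempt))
    raise RuntimeError(f"gave up on: {line}")

async def main(lines, concurrency=4):
    sem = asyncio.Semaphore(concurrency)
    log = []
    # gather() preserves input order, so clips line up with the script.
    return await asyncio.gather(
        *(synth_with_backoff(sem, l, log) for l in lines))

clips = asyncio.run(main(["Hello.", "Really?!", "Let's dig in."]))
print(len(clips))  # 3
```

The semaphore caps in-flight requests to respect API quotas, while per-task backoff means one throttled line never stalls the rest of the batch.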

Phase 5 — Audio Post-Processing 🔄 In Progress

Phase 5 transforms raw utterance WAVs into a broadcast-ready podcast episode using pydub + ffmpeg for all DSP operations.

🔊
Overlap Engine
Blends speech transitions from INTERRUPT/BACKCHANNEL markers. Adds cross-fades between speaker turns. Backchannels mixed at lower volume to simulate natural conversation dynamics.
🎛️
Audio Post-Processor
Professional mastering chain: noise gate, EQ (presence boost, sub-bass cut), 2:1 compression, loudness normalisation to -16 LUFS (Spotify/Apple Podcasts standard).
🎬
Cold Open Generator
Identifies the most compelling 15–30s moment in the episode and extracts the corresponding audio slice as the episode hook that plays before the main content.
📦
Chapter Stitcher
Assembles cold open + all chapters + outro into final MP3 with ID3 metadata — title, artist, artwork, and chapter markers compatible with major podcast apps.
```text
# Audio mastering chain (per chapter)
noise_gate(threshold=-40dB)
  → eq(boost="2-5kHz", cut="<100Hz")
  → compress(ratio=2:1, threshold=-20dB)
  → normalize(target=-16 LUFS)
  → chapter_N_mastered.wav
```
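The gain math behind the `normalize(target=-16 LUFS)` step is plain dB arithmetic; the measured loudness below is a made-up example, and a faithful LUFS measurement (ITU-R BS.1770 gating) requires more than this sketch shows:

```python
def normalization_gain_db(measured_lufs: float, target_lufs: float = -16.0) -> float:
    # Gain (in dB) needed to move the measured loudness to the target.
    return target_lufs - measured_lufs

def db_to_linear(db: float) -> float:
    # Convert a dB gain to the linear amplitude multiplier applied to samples.
    return 10 ** (db / 20)

gain_db = normalization_gain_db(-22.0)  # a quiet master → +6 dB of gain
print(round(gain_db, 1), round(db_to_linear(gain_db), 2))  # 6.0 2.0
```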

In-Memory Processing Architecture

A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.

Performance
~4 sec execution time. 40% faster than Bedrock vector DB alternative. Zero I/O overhead.
💰
Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒
Multi-Tenant Safety
Each Lambda invocation runs in an isolated container. No shared state, no cross-tenant data leakage possible by design.
📐
Memory Budget
~210 MB peak: Sentence-Transformer model (~120 MB) + Cross-encoder (~80 MB) + FAISS index + chunks. Well under Lambda's 2 GB limit.
| Approach | Execution Time | Lambda Cost / Podcast | Complexity |
|---|---|---|---|
| In-Memory (chosen) | ~4 sec | $0.0003 | Minimal |
| Temp File Storage | ~6 sec | $0.0005 | Medium |
| AWS Bedrock Vector DB | ~10 sec | $0.0009 | High |
Why not a vector database? Each podcast generation is a one-shot, ephemeral operation. Embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval, no persistence requirement, and no cross-session reuse. A managed vector store adds latency, cost, and operational complexity with zero benefit.

Tech Stack

Agent Framework
LangGraph
Compiled StateGraph with conditional routing, 5 independent subgraphs
LLM Orchestration
LangChain
LLM chain abstractions; LLMFactory for provider-agnostic swapping
LLM Provider
OpenAI GPT-4
Dialogue generation, query production, character design, content planning
API Framework
FastAPI
Async REST API — job submission, status polling, audio download endpoints
Embeddings
Sentence-Transformers
384-dim embeddings for semantic deduplication and relevance scoring
Vector Index
FAISS (CPU)
In-memory cosine similarity search for chunk deduplication
Search
Google Custom Search
SERP results for Phase 1 research. Free tier covers 100 queries/day.
Web Scraping
Requests + BS4 + lxml
HTML extraction and boilerplate removal from search result URLs
Audio DSP
pydub + mutagen
Audio stitching, EQ, compression, normalisation, ID3 metadata
Data Validation
Pydantic v2
Typed state models, inter-phase contracts, settings management
Testing
Pytest
Unit tests per agent + integration tests for full subgraph invocations
Runtime
AWS Lambda
Serverless, 2 GB RAM, 5-min timeout, fully isolated per invocation

Getting Started

Requires Python 3.11+ and API keys for OpenAI and Google Custom Search.

1. Install Dependencies

```shell
# Clone and install in editable mode
git clone https://github.com/<your-username>/podcast-creator.git
cd podcast-creator
python -m venv venv && source venv/bin/activate
pip install -e .
```

2. Configure Environment Variables

```shell
# Copy and populate the env template
cp config/.env.example .env
```

```shell
# Required keys in .env
OPENAI_API_KEY=sk-proj-...
GOOGLE_SEARCH_API_KEY=AIzaSy...
GOOGLE_SEARCH_ENGINE_ID=63e2eae...

# Optional (for future phases)
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIzaSy...

# Server & pipeline config
HOST=0.0.0.0
PORT=8080
MIN_PODCAST_DURATION_SEC=900
MAX_PODCAST_DURATION_SEC=1800
NUM_SPEAKERS=2
```

3. Run Preflight Check & Start Server

```shell
# Validate your environment
python preflight_check.py

# Start the API server
uvicorn main:app --host 0.0.0.0 --port 8080 --reload
```

4. Submit a Job via REST API

```shell
# Submit a podcast generation job
curl -X POST http://localhost:8080/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "The rise of humanoid robots in 2025",
    "description": "Focus on Boston Dynamics, Figure, and Tesla Optimus",
    "num_speakers": 2
  }'

# Poll status
curl http://localhost:8080/api/v1/status/<job_id>

# Download completed podcast
curl -O http://localhost:8080/api/v1/download/<job_id>
```

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check with API version |
| POST | /api/v1/generate | Submit a new podcast generation job |
| GET | /api/v1/status/{job_id} | Get current job status and progress |
| GET | /api/v1/download/{job_id} | Download the completed podcast audio |

Hear It In Action

Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.