Technical Documentation

Podcast Creator
System Architecture

An autonomous, multi-speaker podcast creation system powered by LangGraph agent pipelines. Given any topic, the pipeline researches the web, plans chapters, designs characters, generates naturalistic dialogue, synthesises multi-voice audio, and delivers a finished podcast — with no human in the loop.

LangGraph · FastAPI · GPT-4 · FAISS · Sentence-Transformers · Google Custom Search · pydub · AWS Lambda
Contents
  1. Project Overview
  2. 5-Phase Pipeline Architecture
  3. Phase 1 — Research & Ingestion
  4. Phase 2 — Content Planning
  5. Phase 3 — Dialogue Generation
  6. Phase 4 — Voice Synthesis
  7. Phase 5 — Audio Post-Processing
  8. In-Memory Processing Architecture
  9. Tech Stack
  10. Getting Started

Project Overview

Podcast Creator is a fully autonomous, agentic pipeline that transforms any text topic into a broadcast-quality, multi-speaker podcast. It is architected as a sequence of five independent LangGraph subgraphs, each responsible for a distinct phase of production — from internet research all the way through to final audio post-processing.

The system was designed with two core engineering goals: composability (each phase is a self-contained, testable unit) and serverless-first performance (all intermediate state is held in memory, eliminating I/O overhead and enabling zero-config horizontal scaling on AWS Lambda).

End-to-end flow

Topic → Research → Plan → Dialogue → TTS → Master → MP3
Pure in-memory processing. The entire pipeline runs in RAM within a single AWS Lambda invocation (~210 MB peak). No databases, no temp files — zero cleanup required. This yields 40% faster execution than the Bedrock vector-DB alternative, at $0.0003 per podcast.

5-Phase Pipeline Architecture

The system is built as a LangGraph StateGraph with 5 sequential phases, each implemented as a compiled subgraph with its own state, nodes, and quality gates. Phases communicate through strongly-typed Pydantic contracts.

INPUT { "topic": "AI in healthcare" }
🔍 PHASE 1 — Research & Ingestion ✅ Live
📋 PHASE 2 — Content Planning 🔄 WIP
🎭 PHASE 3 — Dialogue Generation 🔄 WIP
🎙️ PHASE 4 — Voice Synthesis 🔄 WIP
🎚️ PHASE 5 — Audio Post-Processing 🔄 WIP
OUTPUT { podcast.mp3 — 15–30 min, multi-speaker }
| Phase | Name | Status | Key Agents / Nodes | Output |
|---|---|---|---|---|
| 1 | Research & Ingestion | ✅ Complete | QueryProducer, WebScraper, DedupRelevanceScorer | Top 60 ranked content chunks |
| 2 | Content Planning | 🔄 In Progress | ChapterPlanner, CharacterDesigner | Chapter outlines, speaker personas |
| 3 | Dialogue Generation | 🔄 In Progress | DialogueEngine, NaturalnessInjector, SSMLAnnotator, FactChecker, QAReviewer | SSML-annotated multi-speaker scripts |
| 4 | Voice Synthesis | 🔄 In Progress | TTSRouter | Per-speaker audio segments |
| 5 | Audio Post-Processing | 🔄 In Progress | OverlapEngine, ChapterStitcher, ColdOpenGenerator, PostProcessor | Final podcast WAV/MP3 |
Modular subgraph design. Each phase follows the same pattern: a TypedDict state, a set of nodes, and a compiled StateGraph. This makes every phase independently importable, testable, and extensible. Adding Phase 6 requires zero changes to existing phases.
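As a minimal sketch of that pattern — plain Python standing in for LangGraph's compiled StateGraph, with state fields matching the Phase 1 invoke() example in this document and two illustrative stand-in nodes:

```python
from typing import Callable, List, TypedDict

# Phase 1 state shape; field names follow the invoke() example
# shown later in this document.
class Phase1State(TypedDict):
    topic: str
    freshness: str
    queries: List[str]
    search_results: List[dict]
    ranked_chunks: List[dict]

# Stand-in nodes: each takes the full state and returns a partial
# update, mirroring how LangGraph merges node output into shared state.
def query_producer(state: Phase1State) -> dict:
    return {"queries": [f"{state['topic']} overview",
                        f"{state['topic']} 2025 news"]}

def web_scraper(state: Phase1State) -> dict:
    return {"search_results": [{"query": q, "text": "..."}
                               for q in state["queries"]]}

def run_phase(state: Phase1State, nodes: List[Callable]) -> Phase1State:
    # Minimal sequential executor standing in for a compiled StateGraph.
    for node in nodes:
        state = {**state, **node(state)}
    return state

state = run_phase(
    {"topic": "AI in healthcare", "freshness": "", "queries": [],
     "search_results": [], "ranked_chunks": []},
    [query_producer, web_scraper],
)
print(len(state["search_results"]))  # 2
```

Because every node is just a function over a typed state, a new phase plugs in by appending its subgraph to the sequence — existing phases stay untouched.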

Phase 1 — Research & Ingestion ✅ Complete

Phase 1 transforms a raw topic into a ranked corpus of deduplicated, relevance-scored content chunks ready for planning. It consists of three sequential agents.

Query Producer Agent

Classifies the topic's freshness and generates 10 diverse search queries:

```python
# Phase 1 — standalone usage
from src.pipeline.phases.phase1_graph import create_phase1_graph

graph = create_phase1_graph()
state = graph.invoke({
    "topic": "CRISPR gene editing breakthroughs 2025",
    "freshness": "",
    "queries": [],
    "search_results": [],
    "ranked_chunks": []
})
print(f"Ranked chunks returned: {len(state['ranked_chunks'])}")
```

Web Scraper Agent

Executes Google Custom Search per query and scrapes full text from result URLs. Uses Requests + BeautifulSoup4 for HTML extraction with boilerplate removal.
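The boilerplate-removal idea can be sketched with the standard library alone — the real pipeline uses Requests + BeautifulSoup4 + lxml, and the tag list here is illustrative:

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (illustrative set).
SKIP = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside boilerplate tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every boilerplate tag.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = "<nav>Menu</nav><p>CRISPR trials expanded in 2025.</p><script>x()</script>"
print(extract_text(html))  # CRISPR trials expanded in 2025.
```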

Deduplication & Relevance Scorer

~100 scraped documents are processed entirely in memory through a 5-step pipeline:

```text
# In-memory processing — no databases, no files
merged_text (str)
  → chunks (List[Dict])          # ~200 items, 500-word segments
  → embeddings (np.ndarray)      # Shape: (200, 384)
  → faiss_index (IndexFlatIP)    # Cosine similarity dedup
  → unique_chunks (List[Dict])   # ~100-150 after dedup
  → ranked_chunks (List[Dict])   # Top 60 by cross-encoder
  → state["ranked_chunks"]       # Passed to Phase 2
```
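The dedup step can be illustrated without FAISS: normalise each embedding so that the inner product equals cosine similarity (which is what IndexFlatIP computes over normalised vectors), then keep a chunk only if it is not near-identical to one already kept. Toy 2-dimensional vectors stand in for the 384-dim sentence embeddings, and the 0.95 threshold is illustrative:

```python
import math

def normalize(v):
    # Unit-length vector: inner product then equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dedup(chunks, embeddings, threshold=0.95):
    kept, kept_vecs = [], []
    for chunk, vec in zip(chunks, embeddings):
        v = normalize(vec)
        # Drop the chunk if it is near-identical to one already kept.
        if all(sum(a * b for a, b in zip(v, kv)) < threshold
               for kv in kept_vecs):
            kept.append(chunk)
            kept_vecs.append(v)
    return kept

chunks = ["chunk A", "chunk A copy", "chunk B"]
embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(dedup(chunks, embeddings))  # ['chunk A', 'chunk B']
```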

Phase 2 — Content Planning 🔄 In Progress

Phase 2 transforms ranked research chunks into a structured episode plan with chapter outlines and fully realized speaker personas.

📋
Chapter Planner
Outlines N chapters with titles, talking points, source chunk assignments, and transition hooks. Creates narrative acts (setup → conflict → resolution) with energy arcs.
🎭
Character Designer
Designs 2 distinct speaker personas per episode — name, role (host/expert/skeptic), speaking style, vocabulary level, and catchphrases — in a single structured LLM call.
Why single-call persona design? Multi-turn refinement risks personality drift. A single structured output call produces internally consistent characters with complementary dynamics (e.g., curious host vs. authoritative expert) that remain coherent across all chapters.
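A sketch of consuming that single structured call — the persona schema and field names below are hypothetical, not the project's actual contract:

```python
import json

# Hypothetical required fields for each persona object.
REQUIRED = {"name", "role", "speaking_style", "vocabulary_level", "catchphrases"}

def parse_personas(llm_json: str, num_speakers: int = 2):
    personas = json.loads(llm_json)["personas"]
    assert len(personas) == num_speakers, "expected one persona per speaker"
    for p in personas:
        missing = REQUIRED - p.keys()
        assert not missing, f"persona missing fields: {missing}"
    return personas

# Example payload of the kind a structured-output call might return.
raw = json.dumps({"personas": [
    {"name": "Maya", "role": "host", "speaking_style": "curious, upbeat",
     "vocabulary_level": "general", "catchphrases": ["Okay, walk me through it"]},
    {"name": "Dr. Chen", "role": "expert", "speaking_style": "measured, precise",
     "vocabulary_level": "technical", "catchphrases": ["The data suggests"]},
]})
personas = parse_personas(raw)
print([p["role"] for p in personas])  # ['host', 'expert']
```

Validating the whole cast in one pass is what lets the system reject an inconsistent response atomically, rather than patching personas turn by turn.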

Phase 3 — Dialogue Generation 🔄 In Progress

The creative heart of the pipeline. Phase 3 transforms chapter outlines into SSML-annotated dialogue scripts through a 5-agent subgraph.

| Agent | Role | Key Behavior |
|---|---|---|
| Dialogue Engine | Script writer | Generates raw multi-speaker dialogue per chapter, maintaining character voice consistency |
| Naturalness Injector | Human-like speech | Injects filler words, pauses, interruptions, reactions, and laughter to simulate authentic conversation |
| SSML Annotator | TTS preparation | Converts naturalness markers into TTS-compatible SSML with prosody control (rate, pitch, emphasis) |
| Fact Checker | Verification | Validates factual claims in the script against the Phase 1 source chunks |
| QA Reviewer | Quality gate | Scores engagement, repetition, clarity, transitions, and energy arc compliance before handoff |
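A minimal sketch of the marker-to-SSML step, assuming hypothetical inline markers such as `[PAUSE]` and `*emphasis*` — the project's actual marker set and prosody values may differ:

```python
import re

def annotate(line: str) -> str:
    # [PAUSE] marker → standard SSML break element.
    line = line.replace("[PAUSE]", '<break time="500ms"/>')
    # *word* → standard SSML emphasis element.
    line = re.sub(r"\*(.+?)\*", r'<emphasis level="moderate">\1</emphasis>', line)
    return f"<speak>{line}</speak>"

print(annotate("Well... [PAUSE] that's *exactly* the point."))
```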

Phase 4 — Voice Synthesis 🔄 In Progress

Phase 4 converts SSML-annotated scripts into per-speaker audio segments using a configurable TTS router that supports multiple voice providers.

🎙️
TTS Router Agent
Routes each script line to the correct TTS voice based on the speaker-to-voice map established in Phase 2. Supports multiple providers via a provider-agnostic interface.
🔊
Voice Consistency
Each speaker's voice is fixed for the entire episode. Voice identity is established once and never changes — listeners detect inconsistency instantly.
Parallel Synthesis
Audio segments are synthesised with bounded concurrency and exponential backoff on rate-limit errors, maximising throughput while respecting API quotas.
📦
Chapter Manifest
Builds an ordered clip registry with durations, speaker metadata, and timing directives that Phase 5 uses for stitching and overlap mixing.
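The bounded-concurrency-plus-backoff pattern can be sketched with asyncio; `synth()` is a stand-in for a provider call that simulates a rate-limit error on the first attempt for some lines:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for a provider 429 response."""

async def synth(line: str, attempt_log: list) -> bytes:
    attempt_log.append(line)
    # Simulate a 429 on the first attempt for lines ending in "!".
    if attempt_log.count(line) == 1 and line.endswith("!"):
        raise RateLimitError()
    return f"audio:{line}".encode()

async def synth_with_backoff(sem, line, log, retries=3):
    for attempt in range(retries):
        async with sem:  # at most N synth calls in flight
            try:
                return await synth(line, log)
            except RateLimitError:
                # Exponential backoff: 10ms, 20ms, 40ms.
                await asyncio.sleep(0.01 * (2 ** attempt))
    raise RuntimeError(f"gave up on: {line}")

async def main(lines, concurrency=4):
    sem = asyncio.Semaphore(concurrency)
    log = []
    # gather() preserves input order, so clips line up with the script.
    return await asyncio.gather(
        *(synth_with_backoff(sem, l, log) for l in lines))

clips = asyncio.run(main(["Hello.", "Really?!", "Let's dig in."]))
print(len(clips))  # 3
```

The semaphore caps in-flight requests to respect API quotas, while per-task backoff means one throttled line never stalls the rest of the batch.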

Phase 5 — Audio Post-Processing 🔄 In Progress

Phase 5 transforms raw utterance WAVs into a broadcast-ready podcast episode using pydub + ffmpeg for all DSP operations.

🔊
Overlap Engine
Blends speech transitions from INTERRUPT/BACKCHANNEL markers. Adds cross-fades between speaker turns. Backchannels mixed at lower volume to simulate natural conversation dynamics.
🎛️
Audio Post-Processor
Professional mastering chain: noise gate, EQ (presence boost, sub-bass cut), 2:1 compression, loudness normalisation to -16 LUFS (Spotify/Apple Podcasts standard).
🎬
Cold Open Generator
Identifies the most compelling 15–30s moment in the episode and extracts the corresponding audio slice as the episode hook that plays before the main content.
📦
Chapter Stitcher
Assembles cold open + all chapters + outro into final MP3 with ID3 metadata — title, artist, artwork, and chapter markers compatible with major podcast apps.
```text
# Audio mastering chain (per chapter)
noise_gate(threshold=-40dB)
  → eq(boost="2-5kHz", cut="<100Hz")
  → compress(ratio=2:1, threshold=-20dB)
  → normalize(target=-16 LUFS)
  → chapter_N_mastered.wav
```
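The gain math behind the `normalize(target=-16 LUFS)` step is plain dB arithmetic; the measured loudness below is a made-up example, and a faithful LUFS measurement (ITU-R BS.1770 gating) requires more than this sketch shows:

```python
def normalization_gain_db(measured_lufs: float, target_lufs: float = -16.0) -> float:
    # Gain (in dB) needed to move the measured loudness to the target.
    return target_lufs - measured_lufs

def db_to_linear(db: float) -> float:
    # Convert a dB gain to the linear amplitude multiplier applied to samples.
    return 10 ** (db / 20)

gain_db = normalization_gain_db(-22.0)  # a quiet master → +6 dB of gain
print(round(gain_db, 1), round(db_to_linear(gain_db), 2))  # 6.0 2.0
```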

In-Memory Processing Architecture

A deliberate architecture decision: all deduplication, embedding, and relevance scoring runs entirely in RAM — no databases, no temp files, no external vector stores.

Performance
~4 sec execution time. 40% faster than Bedrock vector DB alternative. Zero I/O overhead.
💰
Cost
$0.0003 per podcast (compute only) vs $0.0009 with Bedrock. 3× cheaper at scale.
🔒
Multi-Tenant Safety
Each Lambda invocation runs in an isolated container. No shared state, no cross-tenant data leakage possible by design.
📐
Memory Budget
~210 MB peak: Sentence-Transformer model (~120 MB) + Cross-encoder (~80 MB) + FAISS index + chunks. Well under Lambda's 2 GB limit.
| Approach | Execution Time | Lambda Cost / Podcast | Complexity |
|---|---|---|---|
| In-Memory (chosen) | ~4 sec | $0.0003 | Minimal |
| Temp File Storage | ~6 sec | $0.0005 | Medium |
| AWS Bedrock Vector DB | ~10 sec | $0.0009 | High |
Why not a vector database? Each podcast generation is a one-shot, ephemeral operation. Embeddings are computed, used for deduplication, and discarded. There is no query-time retrieval, no persistence requirement, and no cross-session reuse. A managed vector store adds latency, cost, and operational complexity with zero benefit.

Tech Stack

Agent Framework
LangGraph
Compiled StateGraph with conditional routing, 5 independent subgraphs
LLM Orchestration
LangChain
LLM chain abstractions; LLMFactory for provider-agnostic swapping
LLM Provider
OpenAI GPT-4
Dialogue generation, query production, character design, content planning
API Framework
FastAPI
Async REST API — job submission, status polling, audio download endpoints
Embeddings
Sentence-Transformers
384-dim embeddings for semantic deduplication and relevance scoring
Vector Index
FAISS (CPU)
In-memory cosine similarity search for chunk deduplication
Search
Google Custom Search
SERP results for Phase 1 research. Free tier covers 100 queries/day.
Web Scraping
Requests + BS4 + lxml
HTML extraction and boilerplate removal from search result URLs
Audio DSP
pydub + mutagen
Audio stitching, EQ, compression, normalisation, ID3 metadata
Data Validation
Pydantic v2
Typed state models, inter-phase contracts, settings management
Testing
Pytest
Unit tests per agent + integration tests for full subgraph invocations
Runtime
AWS Lambda
Serverless, 2 GB RAM, 5-min timeout, fully isolated per invocation

Getting Started

Requires Python 3.11+ and API keys for OpenAI and Google Custom Search.

1. Install Dependencies

```shell
# Clone and install in editable mode
git clone https://github.com/<your-username>/podcast-creator.git
cd podcast-creator
python -m venv venv && source venv/bin/activate
pip install -e .
```

2. Configure Environment Variables

```shell
# Copy and populate the env template
cp config/.env.example .env
```

```shell
# Required keys in .env
OPENAI_API_KEY=sk-proj-...
GOOGLE_SEARCH_API_KEY=AIzaSy...
GOOGLE_SEARCH_ENGINE_ID=63e2eae...

# Optional (for future phases)
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIzaSy...

# Server & pipeline config
HOST=0.0.0.0
PORT=8080
MIN_PODCAST_DURATION_SEC=900
MAX_PODCAST_DURATION_SEC=1800
NUM_SPEAKERS=2
```

3. Run Preflight Check & Start Server

```shell
# Validate your environment
python preflight_check.py

# Start the API server
uvicorn main:app --host 0.0.0.0 --port 8080 --reload
```

4. Submit a Job via REST API

```shell
# Submit a podcast generation job
curl -X POST http://localhost:8080/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "The rise of humanoid robots in 2025",
    "description": "Focus on Boston Dynamics, Figure, and Tesla Optimus",
    "num_speakers": 2
  }'

# Poll status
curl http://localhost:8080/api/v1/status/<job_id>

# Download completed podcast
curl -O http://localhost:8080/api/v1/download/<job_id>
```

API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check with API version |
| POST | /api/v1/generate | Submit a new podcast generation job |
| GET | /api/v1/status/{job_id} | Get current job status and progress |
| GET | /api/v1/download/{job_id} | Download the completed podcast audio |

Hear It In Action

Listen to a fully AI-generated podcast episode — from research to final master, produced entirely by this pipeline.