
5-Stage Video Production Pipeline
Stage 1
Number Normalize
Digits → words for TTS ("60" → "sixty"). Prevents robotic pronunciation in narration.
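A minimal sketch of the digit-to-word step, assuming plain non-negative integers only. The function names are hypothetical, and the real pipeline's handling of decimals, thousands separators, currency, and units is not described here, so none of that is covered.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999,999 (hypothetical helper)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def normalize_numbers(text: str) -> str:
    """Replace standalone runs of 1 to 6 digits with their spoken form."""
    # \b keeps tokens like "60fps" or 7+ digit numbers untouched.
    return re.sub(r"\b\d{1,6}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(normalize_numbers("The trail is 60 miles long."))
```

Running the normalizer before TTS means the voice model never has to guess how to read a digit string.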
Stage 2
ElevenLabs TTS
NARRATION → MP3 + per-character timestamps (eleven_multilingual_v2). Word-level timing map generated.
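One way the word-level timing map could be derived from per-character timestamps is sketched below. It assumes three parallel lists (characters, start times, end times), which is the shape of the alignment data ElevenLabs returns with timestamped audio. WordTiming and words_from_characters are hypothetical names, and no API call is made here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class WordTiming:
    word: str
    start: float  # seconds
    end: float    # seconds


def words_from_characters(
    chars: Sequence[str],
    starts: Sequence[float],
    ends: Sequence[float],
) -> List[WordTiming]:
    """Group per-character timings into whitespace-delimited words."""
    words: List[WordTiming] = []
    current = ""
    word_start = 0.0
    word_end = 0.0
    for ch, s, e in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word being built, if any.
            if current:
                words.append(WordTiming(current, word_start, word_end))
                current = ""
            continue
        if not current:
            word_start = s  # first character sets the word's start
        current += ch
        word_end = e        # last character seen sets the word's end
    if current:
        words.append(WordTiming(current, word_start, word_end))
    return words


chars = list("hi you")
starts = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
ends = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
for w in words_from_characters(chars, starts, ends):
    print(w)
```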
Stage 3
Caption Align
Word-level timings from ElevenLabs are mapped through an ASR map from spoken words back to their written forms (e.g. "sixty" → "60"), so captions show the original text. Semantic color-coding applied.
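The spoken-to-written mapping can be sketched as follows, assuming each written token expands to a known number of spoken words. align_captions and the spoken_form callback are hypothetical, and the color-coding rules are not shown because they are not specified here.

```python
from typing import Callable, List, Sequence, Tuple

Timing = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def align_captions(
    written_tokens: Sequence[str],
    spoken_timings: Sequence[Timing],
    spoken_form: Callable[[str], str],
) -> List[Timing]:
    """Give each written token the time span of the words spoken for it."""
    captions: List[Timing] = []
    i = 0
    for token in written_tokens:
        n = len(spoken_form(token).split())
        if n == 0:
            continue  # token produces no speech, so it gets no caption
        span = spoken_timings[i:i + n]
        if len(span) < n:
            raise ValueError(f"ran out of spoken words at {token!r}")
        # Caption starts with the first spoken word and ends with the last.
        captions.append((token, span[0][1], span[-1][2]))
        i += n
    if i != len(spoken_timings):
        raise ValueError("spoken words left over after alignment")
    return captions


# Hypothetical lookup standing in for the Stage 1 normalizer.
SPOKEN = {"250": "two hundred fifty"}
timings = [
    ("Only", 0.0, 0.3), ("two", 0.3, 0.5), ("hundred", 0.5, 0.8),
    ("fifty", 0.8, 1.1), ("miles", 1.1, 1.5),
]
print(align_captions(["Only", "250", "miles"], timings,
                     lambda t: SPOKEN.get(t, t)))
```

Raising on a count mismatch is a deliberate choice in this sketch: a silent misalignment would shift every later caption.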
Stage 4
Frame Compose
Park images + word-pop captions + brand overlays. Per-content-type render profiles (motion, transitions, effects).
Stage 5
FFmpeg Encode
1080×1920 @ 30fps. CRF 18. Min 3500k bitrate. Duration-matched to TTS audio. Thumbnail extracted.
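The encode settings above could translate into FFmpeg arguments roughly as follows. This is a sketch under assumptions: the frames are taken to be a numbered PNG sequence, libx264 and AAC are assumed codecs, and the function names and paths are hypothetical. How the 3500k minimum bitrate is enforced is not stated, so it is left out rather than guessed.

```python
from typing import List


def build_encode_command(
    frames_pattern: str,
    audio_path: str,
    output_path: str,
    audio_duration_s: float,
) -> List[str]:
    """Build an ffmpeg argument list for a 1080x1920, 30 fps, CRF 18 encode."""
    return [
        "ffmpeg", "-y",
        "-framerate", "30", "-i", frames_pattern,  # image sequence input
        "-i", audio_path,                          # TTS narration
        "-c:v", "libx264", "-crf", "18", "-pix_fmt", "yuv420p",
        "-vf", "scale=1080:1920",
        "-c:a", "aac",
        "-t", f"{audio_duration_s:.3f}",           # cut output to audio length
        output_path,
    ]


def build_thumbnail_command(video_path: str, thumb_path: str) -> List[str]:
    """Build an ffmpeg argument list that grabs the first frame as an image."""
    return ["ffmpeg", "-y", "-i", video_path, "-frames:v", "1", thumb_path]


cmd = build_encode_command("frames/%05d.png", "narration.mp3", "out.mp4", 14.2)
print(" ".join(cmd))
# To run it: subprocess.run(cmd, check=True) with ffmpeg on PATH.
```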
Quality Gates + Safety Checks
Word count (36–42) enforced pre-render. _must_skip_social() check before any platform API call. PREVIEW_MODE and PIPELINE_DRY_RUN gates for staging. Video quality validated (frame count, audio sync).
Word count gate (36–42), _must_skip_social() check, PREVIEW_MODE / DRY_RUN, Caption color validation, Audio sync verification, 0 manual steps
5 AI systems (Claude + ElevenLabs + Pillow + FFmpeg + Blob)
3 Caption colors (semantic)
10+ Render profiles
0 Manual steps
Stack: Python, ElevenLabs eleven_multilingual_v2, Pillow (PIL), FFmpeg, Azure Blob Storage, Claude Sonnet (script), Instagram API, TikTok API, YouTube API