AI Video Production Team — ElevenLabs · Claude · FFmpeg

The Video Team

Five Systems, One Video — Each Owns Its Step

Scriptwriter

Claude Sonnet 4.5

Writes the REELS block during content generation Pass 3. Produces a 10-field structured script including HOOK, SEGMENT_1, CTA, and the NARRATION field (36-42 words, 3 sentences) that drives the voice track. Operates under strict word-count constraints at temperature=0 — first response is always the final script.

36-42 word NARRATION · HOOK 5-8 words · temperature=0

Voice Synthesizer

ElevenLabs TTS

Converts the NARRATION text to speech using eleven_multilingual_v2 via convert_with_timestamps. Returns both the MP3 audio and per-character timestamp alignment data — the timestamps are what make word-synchronized captions possible. Numbers are pre-normalized to spoken form before the API call.

eleven_multilingual_v2 · per-char timestamps · stability=0.45

Frame Compositor

Pillow (PIL) + Effects Pipeline

Assembles every video frame from background park images, the word-pop caption layer, brand v2 overlays (corner lockup + glass-morphism hook strap), and particle/kinetic effects. Each content type has a named render profile that controls which effects fire and at what intensity.

brand v2 overlays · word-pop captions · per-profile effects

Video Encoder

FFmpeg

Final assembly: frame sequence + narration audio → encoded MP4. Encodes at CRF 18 with a 3500k bitrate floor to protect quality on complex motion frames. Output is 30fps at 1080×1920. Background music is mixed at a lower level under the narration track.

CRF 18 · min 3500k · 30fps · 1080×1920

Distributor

Azure Blob + Social Publishers

Writes the final MP4 to Azure Blob Storage (reels/preview_v2/), then calls platform publishers in sequence: Instagram Reels (via Meta API), TikTok, YouTube Shorts. Each platform receives a tailored caption from the REELS block (IG_CAPTION, TIKTOK_CAPTION, YT_TITLE + YT_DESCRIPTION + YT_TAGS).

Instagram Reels · TikTok · YouTube Shorts

Production Pipeline

Script → Voice → Captions → Encode → Publish

Script Contract

Claude Sonnet outputs a 10-field REELS block during Pass 3
NARRATION: 3 sentences, 36-42 words total
Sentences 1-2: two specific moves with real numbers (ride names + wait times)
Sentence 3 (hardcoded CTA): "Full breakdown at Park Whisperer, link in bio."
Forbidden: "based on the information", "here's what you need to know", generic travel phrasing
Weak hook detection: reject if starts with "today", "so", "hey", "here's"
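The contract above can be checked mechanically before anything reaches TTS. The following is a minimal sketch, not the pipeline's actual validator: the function names are hypothetical, the sentence splitter is a naive regex, and applying the weak-hook check to the HOOK field (rather than to the narration) is an assumption.

```python
import re

CTA = "Full breakdown at Park Whisperer, link in bio."
FORBIDDEN_PHRASES = ("based on the information", "here's what you need to know")
WEAK_HOOK_STARTS = ("today", "so", "hey", "here's")


def first_word(text: str) -> str:
    """Lowercased first word with surrounding punctuation stripped."""
    words = text.strip().split()
    if not words:
        return ""
    return words[0].replace("\u2019", "'").strip(".,!?:;\"").lower()


def validate_reels(hook: str, narration: str) -> list[str]:
    """Return contract violations; an empty list means the script passes."""
    problems = []

    word_count = len(narration.split())
    if not 36 <= word_count <= 42:
        problems.append(f"narration has {word_count} words, expected 36-42")

    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", narration.strip()) if s]
    if len(sentences) != 3:
        problems.append(f"narration has {len(sentences)} sentences, expected 3")
    elif sentences[-1] != CTA:
        problems.append("sentence 3 is not the fixed CTA")

    lowered = narration.lower().replace("\u2019", "'")
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            problems.append(f"forbidden phrase: {phrase!r}")

    # Assumption: the weak-hook rule is applied to the HOOK field.
    if first_word(hook) in WEAK_HOOK_STARTS:
        problems.append(f"weak hook opener: {first_word(hook)!r}")

    return problems
```

Returning a list of problems rather than a single boolean makes it easier to log why a script was rejected.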

Number Normalization

Before TTS: all bare integers converted to spoken words
"60 min" → "sixty minutes" · "25" → "twenty-five"
Time expressions: "9:30 AM" → "nine thirty AM"
"3/10" → "three out of ten" · "5/5" → "five out of five"
After TTS: word-to-digit ASR map for caption color-matching ("twenty" → "20")

ElevenLabs TTS

convert_with_timestamps() → audio + alignment object
Model: eleven_multilingual_v2 · Voice: 1SM7GgM6IMuvQlz2BwM3
stability=0.45 · similarity_boost=0.75 · style=0.25 · speaker_boost=true
Alignment: characters[], character_start_times_seconds[], character_end_times_seconds[]
Character timings reconstructed into word-level tuples: (word, start_s, end_s)
Fallback: synthetic word times at audio_duration / word_count if alignment absent

Caption Alignment

retime_caption_words() reconciles TTS text vs. display caption_text
TTS text: number words ("sixty") · Caption text: digits ("60"); these must match for color coding
ASR map applied: spoken number words → digit strings for color-code lookup
Per-word tuples drive frame-accurate caption rendering: word appears at start_s, fades at end_s

Frame Composition

Background: park images from Azure Blob (9:16 stock photos, attraction-keyword matched)
Motion: scene_motion / beat_zoom / ken_burns / whip (profile-dependent)
Word-pop captions: each word renders at its timestamp with semantic color coding
Brand v2 corner lockup: Park Whisperer logo + pulsing coral live dot (always on in v2)
Glass-morphism hook strap: content-type accent + LIVE pill (hook_strap=true per profile)
Branded endcard: gradient wordmark + LINK IN BIO CTA, last 4 seconds
Optional: particles (rain/snow/sparkles/dust), kinetic_typo POP, flash_cuts, number_counter

FFmpeg Encode

Frame sequence + narration MP3 + background music (lower volume mix)
CRF 18 · min bitrate 3500k · 30fps · 1080×1920 MP4
Output: Azure Blob reels/preview_v2/{stem}_v2_{timestamp}.mp4
Thumbnail URL: first frame extracted, stored alongside video blob
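One way the encode settings above could translate into an FFmpeg invocation is sketched here as a Python argument list. This is an illustration under assumptions: the libx264 and AAC codecs, the 0.2 music gain, the file names, and the function name are not stated in the source. How the real pipeline enforces the 3500k floor is also not stated; -minrate is shown as one possible spelling, and an encoder may not strictly honor it in CRF mode.

```python
def build_ffmpeg_cmd(frames_pattern: str, narration_mp3: str, music_mp3: str,
                     out_mp4: str, music_gain: float = 0.2) -> list[str]:
    """Build an ffmpeg argv for frames + narration + quieter music bed."""
    # Lower the music (input 2), then mix it under the narration (input 1).
    audio_graph = (
        f"[2:a]volume={music_gain}[bed];"
        "[1:a][bed]amix=inputs=2:duration=first[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-framerate", "30", "-i", frames_pattern,   # input 0: frame sequence
        "-i", narration_mp3,                        # input 1: narration
        "-i", music_mp3,                            # input 2: background music
        "-filter_complex", audio_graph,
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "libx264", "-crf", "18", "-minrate", "3500k",
        "-pix_fmt", "yuv420p", "-s", "1080x1920", "-r", "30",
        "-c:a", "aac",
        out_mp4,
    ]


cmd = build_ffmpeg_cmd("frames/%06d.png", "narration.mp3", "music.mp3", "out.mp4")
# To run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps each flag next to its value and avoids shell quoting problems with the filter graph.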

Publish

_must_skip_social() gate: PREVIEW_MODE / PIPELINE_DRY_RUN / storage_only content types
Instagram Reels: IG_CAPTION from REELS block + video blob URL
TikTok: TIKTOK_CAPTION from REELS block
YouTube Shorts: YT_TITLE + YT_DESCRIPTION + YT_TAGS from REELS block
Social audit blob written: publisher_variant="v2_unified", per-platform dispatch log

Synchronized Captions

Per-Word Color Coding Driven by ElevenLabs Timestamps

Every word in the narration has an exact start and end timestamp from ElevenLabs' character alignment data. The renderer uses those timestamps to show each word the moment it's spoken. Words are color-coded by semantic category so key data pops visually without manual editing.

TRON just hit 45 minutes —
Rise of the Resistance SOLD OUT at 7AM.
Full breakdown at Park Whisperer, link in bio.

Numbers (45, 7)

Ride names (TRON, Rise of the Resistance)

Status / urgency (SOLD OUT, link in bio)

Normal narration text

The NARRATION field Claude generates is constrained to produce captions that land well visually. Word-count limits ensure the video stays under 15 seconds. The 3-sentence structure maps cleanly to natural caption groupings.

Total Length

36-42 words maximum (ElevenLabs: 34-word cap for ≤15s; buffer for natural speech variation). Enforced by _truncate_narration() before TTS call.

Sentence 1

Most urgent park fact with real numbers. Must mention a ride name and a wait time or operational status. "TRON just hit 45 minutes" not "wait times are high today".

Sentence 2

Second specific action or counterpoint. Lightning Lane status, weather impact, or hidden gem walk-on. Must have its own concrete number or ride name.

Sentence 3

Fixed CTA: "Full breakdown at Park Whisperer, link in bio." — never varied, appended after any truncation so it always survives.
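The "append after truncation" rule can be sketched as follows. This is a hypothetical stand-in for _truncate_narration(), assuming the body passed in does not yet contain the CTA and that the 42-word upper bound is the budget; the real function's signature and trimming rules are not shown in the source.

```python
CTA = "Full breakdown at Park Whisperer, link in bio."


def truncate_narration(body: str, max_words: int = 42, cta: str = CTA) -> str:
    """Trim the body so that body + CTA fits max_words, then append the CTA."""
    budget = max(max_words - len(cta.split()), 0)
    words = body.split()
    if len(words) > budget:
        words = words[:budget]
        if words:
            # Close the cut cleanly so the CTA starts a new sentence.
            last = words[-1].rstrip(",;:-")
            words[-1] = last if last.endswith((".", "!", "?")) else last + "."
    return " ".join(words + [cta])
```

Because the CTA is added last, no amount of trimming can remove it, which is the property the Sentence 3 rule depends on.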

Pre-TTS

Numbers → words: "60" → "sixty", "9:30 AM" → "nine thirty AM". Markdown stripped. Em-dashes → spaces. Ellipsis → comma. ElevenLabs never sees digits.
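A minimal sketch of this pre-TTS pass, covering only the cases listed above, might look like the following. The function names, the regexes, the 0-999 range, and the "oh five" reading for minutes under ten are assumptions for illustration, not the pipeline's actual rules.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Spell out 0-999 (this sketch does not cover larger values)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{ONES[hundreds]} hundred" + (f" {int_to_words(rest)}" if rest else "")
    raise ValueError("sketch only covers 0-999")


def _spoken_time(m: re.Match) -> str:
    hour = int_to_words(int(m.group(1)))
    minute = int(m.group(2) or 0)
    meridiem = m.group(3).upper()
    if minute == 0:
        return f"{hour} {meridiem}"
    if minute < 10:  # assumed reading: "nine oh five AM"
        return f"{hour} oh {int_to_words(minute)} {meridiem}"
    return f"{hour} {int_to_words(minute)} {meridiem}"


def _spoken_minutes(m: re.Match) -> str:
    n = int(m.group(1))
    return f"{int_to_words(n)} minute{'' if n == 1 else 's'}"


def normalize_for_tts(text: str) -> str:
    text = re.sub(r"[*_`#]", "", text)             # strip markdown markers
    text = text.replace("\u2014", " ")             # em-dash -> space
    text = re.sub(r"\.{3}|\u2026", ",", text)      # ellipsis -> comma
    text = re.sub(r"\b(\d{1,2})(?::(\d{2}))?\s*([AaPp][Mm])\b", _spoken_time, text)
    text = re.sub(r"\b(\d{1,3})/(\d{1,3})\b",
                  lambda m: f"{int_to_words(int(m.group(1)))} out of "
                            f"{int_to_words(int(m.group(2)))}", text)
    text = re.sub(r"\b(\d{1,3})\s*min\b", _spoken_minutes, text)
    text = re.sub(r"\b\d{1,3}\b", lambda m: int_to_words(int(m.group())), text)
    return re.sub(r"\s+", " ", text).strip()
```

The order matters: times and "N/M" forms are rewritten before the bare-integer rule, otherwise "9:30 AM" would be split into two unrelated numbers.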

Post-TTS

ASR map re-converts spoken numbers back to digits for caption color-coding: "twenty" → "20", "forty-five" → "45". This round-trip is what makes number coloring work reliably.
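The reverse mapping can be sketched as a small lookup over single tokens. The names below are hypothetical and the map only covers 0-99; note that a plain lookup also converts words such as "one" when they are not used as numbers, which a real implementation would need to guard against.

```python
import re

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def _parse_number(word: str):
    """Return the integer for a spoken number word (0-99), or None."""
    if word in UNITS:
        return UNITS[word]
    if word in TENS:
        return TENS[word]
    tens, sep, ones = word.partition("-")
    if sep and tens in TENS and 1 <= UNITS.get(ones, 0) <= 9:
        return TENS[tens] + UNITS[ones]
    return None


def spoken_to_digits(token: str) -> str:
    """Map 'forty-five,' to '45,'; leave non-number tokens unchanged."""
    m = re.fullmatch(r"([A-Za-z]+(?:-[A-Za-z]+)?)([.,!?;:]*)", token)
    if not m:
        return token
    word, tail = m.groups()
    value = _parse_number(word.lower())
    return token if value is None else f"{value}{tail}"


def remap_word_times(word_times):
    """Swap display text only; the (start_s, end_s) timing is untouched."""
    return [(spoken_to_digits(w), s, e) for w, s, e in word_times]
```

Keeping the timing untouched while swapping the text is what lets the caption show "45" at the moment "forty-five" is spoken.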

Each Pipeline Type Has Its Own Visual Identity

Content Type	Motion	Transition	Kinetic Typo	Flash Cuts	Counter	Particles	Glitch
rope_drop_strategy	scene_motion	whip 0.14s	ON	ON	ON	dust 0.45	off
ride_down_alert	beat_zoom 1.6bps	whip 0.12s	ON	ON	ON	dust 0.55	ON
weather_storm_alert	beat_zoom 1.8bps	whip 0.14s	ON	ON	off	rain 0.65	off
morning_briefing	scene_motion	smooth 0.22s	ON	off	ON	sparkles 0.35	off
evening_wrap	scene_motion	wipe 0.20s	ON	off	ON	sparkles 0.35	off
operations_bulletin	scene_motion	smooth 0.22s	ON	off	off	none	off
weather_morning/midday	scene_motion	fade 0.30-0.35s	off	off	ON	rain 0.35-0.40	off
yesterday_recap	scene_motion	fade 0.30s	ON	off	ON	sparkles 0.40	off

Design Decisions

Engineering the AI Video Team

Why does ElevenLabs return timestamps, and how are they used?

convert_with_timestamps returns a response with an alignment object containing three parallel arrays: characters[], character_start_times_seconds[], and character_end_times_seconds[]. The video publisher reconstructs these into word-level timing tuples (word, start_s, end_s) by grouping non-whitespace characters into words and taking the first character's start time and the last character's end time.
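The grouping step described above can be sketched in a few lines. The function name is hypothetical; the three inputs mirror the parallel arrays of the alignment object.

```python
def words_from_alignment(characters, starts, ends):
    """Group per-character timings into (word, start_s, end_s) tuples."""
    words = []
    current, word_start, word_end = "", None, None
    for ch, s, e in zip(characters, starts, ends):
        if ch.isspace():
            # Whitespace closes the word in progress, if any.
            if current:
                words.append((current, word_start, word_end))
                current = ""
            continue
        if not current:
            word_start = s      # first character's start time
        current += ch
        word_end = e            # keeps advancing to the last character's end
    if current:
        words.append((current, word_start, word_end))
    return words
```

Punctuation stays attached to its word in this sketch, so "minutes," is one token with one timing span.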

These tuples are passed to the frame renderer. At each video frame (1/30th of a second), the renderer checks the current timestamp and renders words that are active at that frame. Words that have passed fade out; words that haven't started yet are hidden. The result is frame-perfect word-sync with no manual SRT editing.
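A per-frame visibility check along these lines could be sketched as below. The function names and the 0.15-second linear fade are assumptions; the source only says that past words fade out and future words are hidden.

```python
FPS = 30
FADE_S = 0.15  # assumed fade-out duration, not stated in the source


def word_alpha(t: float, start_s: float, end_s: float, fade_s: float = FADE_S) -> float:
    """Opacity of one word at time t: hidden, fully visible, or fading out."""
    if t < start_s:
        return 0.0
    if t <= end_s:
        return 1.0
    return max(0.0, 1.0 - (t - end_s) / fade_s)


def visible_words(word_times, frame_index: int, fps: int = FPS):
    """Return (word, alpha) pairs that should be drawn on this frame."""
    t = frame_index / fps
    return [(w, a) for w, s, e in word_times if (a := word_alpha(t, s, e)) > 0.0]
```

Deriving the timestamp from the frame index, rather than from a running clock, keeps caption timing deterministic across re-renders.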

The fallback path (_synthetic_word_times()) distributes words evenly across audio duration if ElevenLabs doesn't return alignment data. It's less accurate but means captions are never completely absent.
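The even distribution can be sketched as follows, as a hypothetical stand-in for _synthetic_word_times() whose real signature is not shown in the source.

```python
def synthetic_word_times(text: str, audio_duration_s: float):
    """Give every word an equal slot of audio_duration / word_count seconds."""
    words = text.split()
    if not words:
        return []
    slot = audio_duration_s / len(words)
    return [(w, i * slot, (i + 1) * slot) for i, w in enumerate(words)]
```

Equal slots ignore word length and pauses, which is why this path is less accurate than real alignment data.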

Why do numbers go words → TTS → digits for caption coloring?

ElevenLabs pronounces digits inconsistently — "45" might be read as "forty-five", "four five", or "forty five" depending on context. By pre-normalizing to words ("forty-five"), the speech is always correct and the timestamp data matches the spoken tokens.

But caption display text should show digits ("45 min") — that's what users read and it's more scannable. The color-coding rules match against digit strings (numbers → purple). So after TTS, an ASR map converts word tokens back to digits: "twenty" → "20", "forty" → "40", compound forms like "forty-five" → "45". The timestamp data from ElevenLabs anchors to the word token "forty-five", and that token is remapped to "45" for the caption layer. Same timing, different visual text. retime_caption_words() reconciles the TTS token list with the display caption list to preserve the correct timestamps across the conversion.

How does image selection know which park/ride to show?

The narration text is used as an image selection hint. A keyword map (_RIDE_KEYWORD_FALLBACKS) covers ~30 specific attractions: "rise of the resistance" → (hollywood-studios, 28.355, -81.560), "seven dwarfs" → (magic-kingdom, 28.420, -81.583). The publisher scans the narration for these keywords (longest match first, so "rise of the resistance" beats "rise") and queries Azure Blob Storage for classified images with a subject_label matching that ride.
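Longest-match-first scanning can be sketched like this. The two long entries come from the text above; the short "rise" entry and its value are assumed purely to show the precedence rule, and matching on word boundaries is an assumption about how the scan avoids false hits.

```python
import re

# Illustrative subset; the "rise" shorthand and its value are assumptions.
RIDE_KEYWORDS = {
    "rise of the resistance": ("hollywood-studios", 28.355, -81.560),
    "seven dwarfs": ("magic-kingdom", 28.420, -81.583),
    "rise": ("hollywood-studios", 28.355, -81.560),
}


def match_ride(narration: str, keywords=RIDE_KEYWORDS):
    """Return (keyword, (park, lat, lon)) for the longest keyword found, else None."""
    text = narration.lower()
    for key in sorted(keywords, key=len, reverse=True):
        # Word boundaries stop "rise" from matching inside "sunrise".
        if re.search(rf"\b{re.escape(key)}\b", text):
            return key, keywords[key]
    return None
```

Sorting by length on every call is fine for a map of roughly 30 entries; a larger map could be pre-sorted once.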

If no ride-specific images are found, it falls back to the park folder (magic-kingdom, epcot, etc.) inferred from park name keywords in the narration. Stock images are stored in 9_16 aspect ratio (1080×1920) organized by park and aspect. The subject classification comes from a separate image_classify pipeline that runs against the photo library.

What prevents a video from publishing during a dry run or staging test?

_must_skip_social() is checked before any social API call. It returns True (skip social) when any of these conditions are met: PREVIEW_MODE=true (manual preview render), PIPELINE_DRY_RUN=true (function app env var), payload.storage_only=true (per-job flag from video_job_trigger), or content_type is in _STORAGE_ONLY_CONTENT_TYPES (alert types still being iterated visually — ride_down_alert, wait_time_alert, weather_watch, etc.).
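The four conditions combine into a single boolean gate, sketched below. This is a hypothetical version of _must_skip_social(): the argument shapes, the string comparison against "true", and the content-type set (only the three types named above, where the source says "etc.") are assumptions.

```python
import os

# Partial set: only the types named in the text; the real list is longer.
STORAGE_ONLY_CONTENT_TYPES = {"ride_down_alert", "wait_time_alert", "weather_watch"}


def _env_true(name: str, env) -> bool:
    return str(env.get(name, "")).strip().lower() == "true"


def must_skip_social(content_type: str, payload: dict, env=None) -> bool:
    """True when the video should be rendered and stored but not published."""
    env = os.environ if env is None else env
    return (
        _env_true("PREVIEW_MODE", env)
        or _env_true("PIPELINE_DRY_RUN", env)
        or bool(payload.get("storage_only"))
        or content_type in STORAGE_ONLY_CONTENT_TYPES
    )
```

Passing the environment in as an argument makes the gate easy to test without touching real process variables.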

Even when social is skipped, the video is still rendered and written to blob storage so it can be reviewed via the portal. The worker writes a status blob (_v2_status_skipped_social.txt or _v2_status_published.txt) that the pipeline monitor reads to display publish status without querying social APIs.

How do render profiles separate visual identity without pipeline branching?

render_profiles.py is a pure declarative mapping: content_type → profile dict. The dict has boolean flags for each effect (kinetic_typo, hook_strap, flash_cuts, number_counter, glitch) and float/string params for intensities and motion style. The video publisher loads the profile at render start and uses it as a feature flag surface — effects check their flag and skip if False.

Adding a new effect means adding one boolean to the profile schema (defaulting to False in _DEFAULT), implementing the effect as a composable frame-level operation, and opting in per content type by setting the flag to True. No per-pipeline code paths. This is why a new content type (e.g. ll_intelligence_report) can get a full visual identity by adding one dict entry to the profiles file.
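The profile-as-feature-flag pattern can be sketched as below. The five boolean flag names come from the text; the other keys, the helper names, and the way effects are registered are assumptions, and the two profile entries only mirror their rows in the table above.

```python
_DEFAULT = {
    "motion": "scene_motion",
    "kinetic_typo": False,
    "hook_strap": False,
    "flash_cuts": False,
    "number_counter": False,
    "glitch": False,
    "particles": None,           # assumed key name
    "particle_intensity": 0.0,   # assumed key name
}

PROFILES = {
    "ride_down_alert": {
        "motion": "beat_zoom", "kinetic_typo": True, "flash_cuts": True,
        "number_counter": True, "glitch": True,
        "particles": "dust", "particle_intensity": 0.55,
    },
    "operations_bulletin": {"kinetic_typo": True},
}


def load_profile(content_type: str) -> dict:
    """Defaults overlaid with the content type's overrides (if any)."""
    return {**_DEFAULT, **PROFILES.get(content_type, {})}


def apply_effects(frame, profile: dict, effects: dict):
    """Run each registered effect only if its flag is set in the profile."""
    for flag, effect in effects.items():
        if profile.get(flag):
            frame = effect(frame)
    return frame
```

Because unknown content types fall back to _DEFAULT, a new type renders with every optional effect off until someone opts it in.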
