Production · Azure Container Apps · PostgreSQL · ETL

Park Data Ingest
Real-Time Theme Park ETL

A containerized ETL pipeline running every 10 minutes on Azure Container Apps, consuming the ThemeParks.wiki public API and formatting live attraction status, Lightning Lane availability, sellout events, and park schedules into a PostgreSQL data store that feeds all downstream AI consumers — the ride wait forecasting models, the SRE agent, and the Park Agent chat interface.

10 min
Run interval
286
Park entities
4
PostgreSQL tables
v6
Current version
2–3 min
Typical runtime
35 days
PG backup retention

Three Phases, One Container, Every 10 Minutes

Each run executes three sequential processing phases: entities (metadata upserts), schedules (operating hours + Lightning Lane purchase options), and live data (real-time attraction status with event detection). All output lands in PostgreSQL.

Source
ThemeParks.wiki API /entity/{id}/children — entity roster /entity/{id}/schedule — operating hours /entity/{id}/live — real-time status Walt Disney World entity ID: e957da41-3552-4cf6-b636-5babc5cbc4e5
Phase 1 — Entities
process_entities() Fetch flat entity roster (attractions, restaurants, shows) UPSERT into park_entities_v2 — 286 entities Fields: entity_id, name, entity_type, parent_id, slug, lat/lon, timezone RETIRED_ENTITY_IDS blocklist — suppress writes for closed attractions
Phase 2 — Schedules
process_schedules() Park operating hours per date UPSERT into park_schedules_v2 Lightning Lane purchase options — UPSERT into ll_purchase_options_v2 Fields: entity_id, date, open_time, close_time, is_park_hours
Phase 3 — Live Data
process_live_data() Poll real-time status: wait_time, queue_type, status, is_down UPSERT into park_live_status_logs_v2 — high-volume table Downtime event detection — compare to PREVIOUS_STATES_CACHE Sellout event detection — Lightning Lane FINISHED/SOLD_OUT transitions Release event detection — availability restored after sellout
Event Publishing
publish_sellout_events() — called inline from live data phase Writes selloutEvent (legacy) + selloutEventv2 (enhanced) documents v2 includes: release_duration_minutes, release_duration_display, operating_minutes Duration calculated from last releaseEvent → sellout time, excluding CLOSED periods clip_to_park_hours() strips closed intervals from operating_minutes calculation
Sink
PostgreSQL Flexible Server — az-park-data-v2.postgres.database.azure.com All writes via psycopg2 with ON CONFLICT DO UPDATE upsert semantics Blob Storage — raw API response archival (Azure Storage) 35-day backup retention, geo-redundant

Four Core PostgreSQL Tables

All four tables use ON CONFLICT DO UPDATE upsert semantics — every run is idempotent. The high-volume table is park_live_status_logs_v2, which accumulates a snapshot every 10 minutes for each of the 286 park entities.

park_entities_v2
entity_idUUID PK
nameTEXT
entity_typeTEXT
parent_idUUID
slugTEXT
longitude / latitudeFLOAT
timezoneTEXT
286 rows · authoritative entity roster · name column is downstream source-of-truth
park_live_status_logs_v2
entity_idUUID
status_timestampTIMESTAMPTZ
wait_timeINT (minutes)
queue_typeTEXT
statusTEXT
is_downBOOL
queue_detailsJSONB
~144 writes/run · high-volume · powers ride forecasting models
park_schedules_v2
entity_idUUID
dateDATE
open_timeTIME
close_timeTIME
is_park_hoursBOOL
Used by clip_to_park_hours() for operating_minutes calculations and forecast features
ll_purchase_options_v2
entity_idUUID
dateDATE
availability_typeTEXT
priceNUMERIC
statusTEXT
Lightning Lane Individual and Multi pricing · drives sellout event pricing field

Sellout and release events are written as structured documents in addition to the tabular snapshots. The v2 sellout event enriches the legacy format with duration analytics derived from operating hours.

FieldTypeSourceDescription
doctypeTEXThardcodedselloutEventv2
entity_id / entity_nameUUID / TEXTAPIAttraction identity
availability_typeTEXTAPILIGHTNING_LANE_SINGLE or LIGHTNING_LANE_MULTI
sold_out_atTIMESTAMPTZstate transitionUTC timestamp when status flipped to FINISHED/SOLD_OUT
sold_out_timeTEXTderivedHuman-readable Eastern time: "2:47 PM"
release_duration_minutesFLOATcomputedMinutes from last releaseEvent → this sellout. Null if no prior release found.
release_duration_displayTEXTderivedFormatted duration: "1h 23m" or "47m"
operating_minutesFLOATcomputedDuration minus CLOSED park intervals via clip_to_park_hours()
priceNUMERICll_purchase_options_v2Purchase price at time of sellout
operational_dateTEXTderivedYYYY-MM-DD in Eastern time (rolls at 3 AM)

From Cosmos Round-Trip to Postgres-Native

The data store migration was the most significant architectural decision in this pipeline's history. The original v6 design used a two-job split — ingest to PostgreSQL, then a second job (park-cosmos-publisher) to read from PostgreSQL and re-publish to Cosmos DB. This round-trip was eliminated entirely.

v1 – v5 · 2024
Single monolithic job — direct API to Cosmos
All data written directly to Cosmos park_ops container with a single /doctype partition key. Hot partition problem emerged as waitSnapshot grew to 62.5% of all documents.
v6 Split · Early 2026
Two-job split — ingest → PostgreSQL → Cosmos publish
Fault isolation: Job 1 writes to PostgreSQL, Job 2 reads from PostgreSQL and re-publishes to three Cosmos containers (park_timeseries, park_aggregates, park_config) with hierarchical partition keys. Solved hot partition issue but introduced operational complexity and ~$394/month Cosmos floor cost.
v6 Current · June 2026
Single job — PostgreSQL as the only data store
park-cosmos-publisher decommissioned. Cosmos DB abandoned. All downstream consumers (Azure Functions / park-knowledge-updater, ride forecasting models) read directly from PostgreSQL. Zero Cosmos floor cost. One job, one container, one data store.

Engineering Tradeoffs

Why upsert semantics instead of append-only inserts+

Every write uses ON CONFLICT (entity_id) DO UPDATE or composite key variants. This makes every run idempotent — a job that runs twice at the same timestamp produces identical state rather than duplicate rows. For a 10-minute cron that has no deduplication budget, idempotency is more valuable than the slight write amplification from checking for existing rows first.

The exception is park_live_status_logs_v2, which is designed as an append log — each 10-minute snapshot is a distinct row even for the same entity. The upsert key here is (entity_id, status_timestamp), so re-runs at the same timestamp still converge to one row per entity per timestamp.

PREVIOUS_STATES_CACHE — in-memory deduplication for event detection+

Sellout and downtime event detection compares the current API response against the previous known state for each entity. Rather than querying PostgreSQL on every iteration to find the last known status, the cache is loaded once at the start of each run and updated in-place as the live data phase processes each entity. This reduces per-entity queries from O(n) database calls to one bulk load.

The cache is module-level (PREVIOUS_STATES_CACHE = {}), so it resets on each container invocation. Container App Jobs start fresh for each scheduled execution — there's no warm instance carrying stale cache between runs. This is intentional: it avoids a ghost state problem where a cache entry from a prior run suppresses a real state transition that happens to look like the previous value.

operating_minutes — why raw duration isn't enough+

A Lightning Lane sellout that lasts 90 minutes means something very different if the park was closed for 45 of those minutes. clip_to_park_hours() subtracts CLOSED intervals from the release-to-sellout window to compute the number of minutes the attraction was actually open and accessible during that period. This makes operating_minutes the correct feature for demand analysis — a 45-minute operating window that sold out is more significant than a 90-minute raw window that included park closure.

This field directly feeds the downstream content pipeline's analysis of which attractions have the most aggressive sellout patterns relative to their actual operational window.

The Cosmos-to-PostgreSQL migration — why it took 6 versions+

The original decision to use Cosmos DB was reasonable at the time: the hot partition problem wasn't visible until waitSnapshot accumulated millions of documents. Once it was visible, the planned solution was a hierarchical partition key migration — creating park_ops_v2 with [/doctype, /date] keys and backfilling 5.9M documents. That plan was designed, documented, and ready to execute.

The decision to abandon Cosmos entirely instead came from a simpler observation: every aggregate document in Cosmos was computed from data that already lived in PostgreSQL. The publisher read from PostgreSQL, computed GROUP BY rollups, and wrote them to Cosmos — which was then re-read by downstream consumers. This is a pure round-trip with no benefit. PostgreSQL can compute the same aggregates at query time or via materialized views at a fraction of the cost, with one fewer hop in the data path and no per-RU billing.

The v6 "split architecture" was always meant to be temporary — a transition state between the old monolith and a cleaner design. The split was introduced to provide fault isolation while the Cosmos migration was planned. Once the decision to drop Cosmos was made, the split job was no longer necessary and was eliminated.

RETIRED_ENTITY_IDS — why a blocklist, not a soft-delete+

When an attraction closes permanently (Rock 'n' Roller Coaster, DINOSAUR), the ThemeParks.wiki API may still return the entity in the roster with stale data for some time. Writing zero-wait or null status records for a closed attraction adds noise to the historical dataset that downstream models have to filter. The blocklist prevents writes for specific entity IDs entirely, while preserving all historical rows already in the database.

A soft-delete approach (marking the entity as inactive in park_entities_v2) would still require every downstream consumer to filter on the inactive flag. The blocklist is enforced at the ingest layer, keeping downstream consumers clean without extra filter logic.

Downstream: Ride Forecasting → LightGBM v6 ↗ Park Agent Chat ↗ All Projects ↗