Park Data Ingest — Real-Time ETL Pipeline

ETL Pipeline

Three Phases, One Container, Every 10 Minutes

Each run executes three sequential processing phases: entities (metadata upserts), schedules (operating hours + Lightning Lane purchase options), and live data (real-time attraction status with event detection). All output lands in PostgreSQL.

Source

ThemeParks.wiki API /entity/{id}/children — entity roster /entity/{id}/schedule — operating hours /entity/{id}/live — real-time status Walt Disney World entity ID: e957da41-3552-4cf6-b636-5babc5cbc4e5

Phase 1 — Entities

process_entities() Fetch flat entity roster (attractions, restaurants, shows) UPSERT into park_entities_v2 — 286 entities Fields: entity_id, name, entity_type, parent_id, slug, lat/lon, timezone RETIRED_ENTITY_IDS blocklist — suppress writes for closed attractions

Phase 2 — Schedules

process_schedules() Park operating hours per date UPSERT into park_schedules_v2 Lightning Lane purchase options — UPSERT into ll_purchase_options_v2 Fields: entity_id, date, open_time, close_time, is_park_hours

Phase 3 — Live Data

process_live_data() Poll real-time status: wait_time, queue_type, status, is_down UPSERT into park_live_status_logs_v2 — high-volume table Downtime event detection — compare to PREVIOUS_STATES_CACHE Sellout event detection — Lightning Lane FINISHED/SOLD_OUT transitions Release event detection — availability restored after sellout

Event Publishing

publish_sellout_events() — called inline from live data phase Writes selloutEvent (legacy) + selloutEventv2 (enhanced) documents v2 includes: release_duration_minutes, release_duration_display, operating_minutes Duration calculated from last releaseEvent → sellout time, excluding CLOSED periods clip_to_park_hours() strips closed intervals from operating_minutes calculation

Sink

PostgreSQL Flexible Server — az-park-data-v2.postgres.database.azure.com All writes via psycopg2 with ON CONFLICT DO UPDATE upsert semantics Blob Storage — raw API response archival (Azure Storage) 35-day backup retention, geo-redundant

Data Model

Four Core PostgreSQL Tables

All four tables use ON CONFLICT DO UPDATE upsert semantics — every run is idempotent. The high-volume table is park_live_status_logs_v2, which accumulates a snapshot every 10 minutes for each of the 286 park entities.

park_entities_v2

entity_idUUID PK

nameTEXT

entity_typeTEXT

parent_idUUID

slugTEXT

longitude / latitudeFLOAT

timezoneTEXT

286 rows · authoritative entity roster · name column is downstream source-of-truth

park_live_status_logs_v2

entity_idUUID

status_timestampTIMESTAMPTZ

wait_timeINT (minutes)

queue_typeTEXT

statusTEXT

is_downBOOL

queue_detailsJSONB

~144 writes/run · high-volume · powers ride forecasting models

park_schedules_v2

entity_idUUID

dateDATE

open_timeTIME

close_timeTIME

is_park_hoursBOOL

Used by clip_to_park_hours() for operating_minutes calculations and forecast features

ll_purchase_options_v2

entity_idUUID

dateDATE

availability_typeTEXT

priceNUMERIC

statusTEXT

Lightning Lane Individual and Multi pricing · drives sellout event pricing field

Event Document Schema

Sellout and release events are written as structured documents in addition to the tabular snapshots. The v2 sellout event enriches the legacy format with duration analytics derived from operating hours.

Field	Type	Source	Description
doctype	TEXT	hardcoded	selloutEventv2
entity_id / entity_name	UUID / TEXT	API	Attraction identity
availability_type	TEXT	API	LIGHTNING_LANE_SINGLE or LIGHTNING_LANE_MULTI
sold_out_at	TIMESTAMPTZ	state transition	UTC timestamp when status flipped to FINISHED/SOLD_OUT
sold_out_time	TEXT	derived	Human-readable Eastern time: "2:47 PM"
release_duration_minutes	FLOAT	computed	Minutes from last releaseEvent → this sellout. Null if no prior release found.
release_duration_display	TEXT	derived	Formatted duration: "1h 23m" or "47m"
operating_minutes	FLOAT	computed	Duration minus CLOSED park intervals via `clip_to_park_hours()`
price	NUMERIC	ll_purchase_options_v2	Purchase price at time of sellout
operational_date	TEXT	derived	YYYY-MM-DD in Eastern time (rolls at 3 AM)

Architecture Evolution

From Cosmos Round-Trip to Postgres-Native

The data store migration was the most significant architectural decision in this pipeline's history. The original v6 design used a two-job split — ingest to PostgreSQL, then a second job (park-cosmos-publisher) to read from PostgreSQL and re-publish to Cosmos DB. This round-trip was eliminated entirely.

v1 – v5 · 2024

Single monolithic job — direct API to Cosmos

All data written directly to Cosmos park_ops container with a single /doctype partition key. Hot partition problem emerged as waitSnapshot grew to 62.5% of all documents.

v6 Split · Early 2026

Two-job split — ingest → PostgreSQL → Cosmos publish

Fault isolation: Job 1 writes to PostgreSQL, Job 2 reads from PostgreSQL and re-publishes to three Cosmos containers (park_timeseries, park_aggregates, park_config) with hierarchical partition keys. Solved hot partition issue but introduced operational complexity and ~$394/month Cosmos floor cost.

v6 Current · June 2026

Single job — PostgreSQL as the only data store

park-cosmos-publisher decommissioned. Cosmos DB abandoned. All downstream consumers (Azure Functions / park-knowledge-updater, ride forecasting models) read directly from PostgreSQL. Zero Cosmos floor cost. One job, one container, one data store.

Design Decisions

Engineering Tradeoffs

Why upsert semantics instead of append-only inserts+

Every write uses ON CONFLICT (entity_id) DO UPDATE or composite key variants. This makes every run idempotent — a job that runs twice at the same timestamp produces identical state rather than duplicate rows. For a 10-minute cron that has no deduplication budget, idempotency is more valuable than the slight write amplification from checking for existing rows first.

The exception is park_live_status_logs_v2, which is designed as an append log — each 10-minute snapshot is a distinct row even for the same entity. The upsert key here is (entity_id, status_timestamp), so re-runs at the same timestamp still converge to one row per entity per timestamp.

PREVIOUS_STATES_CACHE — in-memory deduplication for event detection+

Sellout and downtime event detection compares the current API response against the previous known state for each entity. Rather than querying PostgreSQL on every iteration to find the last known status, the cache is loaded once at the start of each run and updated in-place as the live data phase processes each entity. This reduces per-entity queries from O(n) database calls to one bulk load.

The cache is module-level (PREVIOUS_STATES_CACHE = {}), so it resets on each container invocation. Container App Jobs start fresh for each scheduled execution — there's no warm instance carrying stale cache between runs. This is intentional: it avoids a ghost state problem where a cache entry from a prior run suppresses a real state transition that happens to look like the previous value.

operating_minutes — why raw duration isn't enough+

A Lightning Lane sellout that lasts 90 minutes means something very different if the park was closed for 45 of those minutes. clip_to_park_hours() subtracts CLOSED intervals from the release-to-sellout window to compute the number of minutes the attraction was actually open and accessible during that period. This makes operating_minutes the correct feature for demand analysis — a 45-minute operating window that sold out is more significant than a 90-minute raw window that included park closure.

This field directly feeds the downstream content pipeline's analysis of which attractions have the most aggressive sellout patterns relative to their actual operational window.

The Cosmos-to-PostgreSQL migration — why it took 6 versions+

The original decision to use Cosmos DB was reasonable at the time: the hot partition problem wasn't visible until waitSnapshot accumulated millions of documents. Once it was visible, the planned solution was a hierarchical partition key migration — creating park_ops_v2 with [/doctype, /date] keys and backfilling 5.9M documents. That plan was designed, documented, and ready to execute.

The decision to abandon Cosmos entirely instead came from a simpler observation: every aggregate document in Cosmos was computed from data that already lived in PostgreSQL. The publisher read from PostgreSQL, computed GROUP BY rollups, and wrote them to Cosmos — which was then re-read by downstream consumers. This is a pure round-trip with no benefit. PostgreSQL can compute the same aggregates at query time or via materialized views at a fraction of the cost, with one fewer hop in the data path and no per-RU billing.

The v6 "split architecture" was always meant to be temporary — a transition state between the old monolith and a cleaner design. The split was introduced to provide fault isolation while the Cosmos migration was planned. Once the decision to drop Cosmos was made, the split job was no longer necessary and was eliminated.

RETIRED_ENTITY_IDS — why a blocklist, not a soft-delete+

When an attraction closes permanently (Rock 'n' Roller Coaster, DINOSAUR), the ThemeParks.wiki API may still return the entity in the roster with stale data for some time. Writing zero-wait or null status records for a closed attraction adds noise to the historical dataset that downstream models have to filter. The blocklist prevents writes for specific entity IDs entirely, while preserving all historical rows already in the database.

A soft-delete approach (marking the entity as inactive in park_entities_v2) would still require every downstream consumer to filter on the inactive flag. The blocklist is enforced at the ingest layer, keeping downstream consumers clean without extra filter logic.

Park Data IngestReal-Time Theme Park ETL

Three Phases, One Container, Every 10 Minutes

Four Core PostgreSQL Tables

From Cosmos Round-Trip to Postgres-Native

Engineering Tradeoffs

Park Data Ingest
Real-Time Theme Park ETL