# Airflow

Orchestration layer for the ML pipelines. Contains the Dockerfile and DAG definitions. The Airflow image installs no ML dependencies; pipelines run via `BashOperator` + `uv run`.

## DAG Flow

```mermaid
graph LR
    subgraph "Materialize (schedule=None, manual)"
        S[seed] --> M[materialize]
    end

    subgraph "Training (schedule=None, manual)"
        T[train]
    end

    subgraph "Promotion (schedule=None, manual)"
        P[promote]
    end

    subgraph "Pipeline (schedule=None, orchestrator)"
        TT[trigger_training] --> TP[trigger_promotion]
    end

    P -.->|Redis pub/sub| SRV[Serving reloads model]
```
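
The dotted edge is the only cross-service hand-off: promotion publishes on a Redis channel, and serving, subscribed to that channel, reloads the model. A minimal sketch of the pattern with redis-py; the channel name `model-events` and the payload are illustrative assumptions, not the project's actual contract:

```python
import redis

r = redis.Redis(host="redis", port=6379)

# Promotion side: announce that a new champion was promoted.
# Channel name and payload are assumptions for illustration.
r.publish("model-events", "promoted")

# Serving side: loop forever and reload on each notification.
sub = r.pubsub()
sub.subscribe("model-events")
for message in sub.listen():
    if message["type"] == "message":
        print("Promotion event received, reloading model...")
        # serving would re-fetch the champion model here
```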

## DAGs

| DAG | Schedule | Tasks | Description |
|-----|----------|-------|-------------|
| `vroom_forecast_materialize` | Manual | `seed`, `materialize` | Seed DB + compute features, write Parquet + Redis |
| `vroom_forecast_training` | Manual | `train` | Train from offline store, register candidate |
| `vroom_forecast_promotion` | Manual | `promote` | Compare candidate vs champion, promote if better, notify via Redis |
| `vroom_forecast_pipeline` | Manual | `trigger_training`, `trigger_promotion` | Orchestrator: chains training → promotion (used by UI "Train" button) |
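
The orchestrator's two tasks are plain trigger tasks, so the whole DAG can be expressed with Airflow's stock `TriggerDagRunOperator`. A hedged sketch of what `dags/vroom_forecast_pipeline.py` plausibly looks like; the wiring follows the table above, while details like `poke_interval` are assumptions:

```python
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="vroom_forecast_pipeline",
    schedule=None,  # manual: triggered by the UI "Train" button
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    trigger_training = TriggerDagRunOperator(
        task_id="trigger_training",
        trigger_dag_id="vroom_forecast_training",
        wait_for_completion=True,  # block until training finishes
        poke_interval=30,
    )
    trigger_promotion = TriggerDagRunOperator(
        task_id="trigger_promotion",
        trigger_dag_id="vroom_forecast_promotion",
        wait_for_completion=True,
    )
    trigger_training >> trigger_promotion
```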

## Materialization Pipeline

The materialize DAG runs two steps that populate the feature stores:

```mermaid
graph LR
    CSV[vehicles.csv<br/>reservations.csv] -->|seed.py| DB[(SQLite)]
    UI[Serving API] -->|save vehicle| DB
    DB -->|pipeline.py| COMPUTE[Compute Features<br/><i>price_diff<br/>num_reservations</i>]
    COMPUTE --> PQ[Parquet<br/>offline store<br/><i>fleet vehicles only</i>]
    COMPUTE --> RD[(Redis<br/>online store<br/><i>new arrivals only</i>)]
```

### Step 1: Seed (`seed.py`)

Loads `vehicles.csv` and `reservations.csv` into SQLite. Idempotent: safe to run multiple times. Vehicles are inserted with `source='csv'`.

```bash
cd features && uv run python seed.py --data-dir ../data --db /feast-data/vehicles.db
```
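
The idempotency typically comes from inserting with conflict handling on the primary key instead of blind inserts. A minimal sketch of how the vehicle load in seed.py could work; the table schema and CSV column names here are assumptions:

```python
import csv
import sqlite3

conn = sqlite3.connect("/feast-data/vehicles.db")

# Schema is an assumption; the primary key is what makes re-runs safe.
conn.execute(
    "CREATE TABLE IF NOT EXISTS vehicles ("
    "vehicle_id TEXT PRIMARY KEY, actual_price REAL, "
    "recommended_price REAL, source TEXT)"
)

with open("../data/vehicles.csv", newline="") as f:
    for row in csv.DictReader(f):
        # INSERT OR IGNORE makes already seeded rows no-ops,
        # so running the step twice changes nothing.
        conn.execute(
            "INSERT OR IGNORE INTO vehicles "
            "(vehicle_id, actual_price, recommended_price, source) "
            "VALUES (?, ?, ?, 'csv')",
            (row["vehicle_id"], row["actual_price"], row["recommended_price"]),
        )

conn.commit()
conn.close()
```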

### Step 2: Materialize (`pipeline.py`)

Reads all vehicles + reservations from SQLite, computes derived features, and writes to both stores:

1. Fleet vehicles (observed `num_reservations`) → Parquet (offline store), used for training and fleet display
2. New arrivals only (`num_reservations IS NULL`) → Redis via `store.write_to_online_store()`, used for real-time inference

```bash
cd features && uv run python pipeline.py \
    --db /feast-data/vehicles.db \
    --feast-repo feature_repo \
    --parquet-path /feast-data/vehicle_features.parquet
```
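
The fan-out to the two stores might look roughly like the sketch below. The toy DataFrame, its column names, and the feature view name `vehicle_features` are assumptions; `FeatureStore.write_to_online_store` is the Feast call named in the list above:

```python
import pandas as pd
from feast import FeatureStore

# Toy frame standing in for the features computed from SQLite
# (column names are assumptions).
features = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "price_diff": [-500.0, 120.0, 0.0],
    "num_reservations": [4.0, 0.0, None],  # None marks a new arrival
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
})

store = FeatureStore(repo_path="feature_repo")

fleet = features[features["num_reservations"].notna()]
new_arrivals = features[features["num_reservations"].isna()]

# Offline store: the Parquet file used for training and fleet display.
fleet.to_parquet("/feast-data/vehicle_features.parquet")

# Online store: only new arrivals go to Redis for real-time inference.
store.write_to_online_store(
    feature_view_name="vehicle_features",  # assumed feature view name
    df=new_arrivals,
)
```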

### Derived features

| Feature | Formula |
|---------|---------|
| `price_diff` | `actual_price - recommended_price` |
| `num_reservations` | `COUNT(reservations)` per vehicle; `NULL` for UI-sourced vehicles, `0` for CSV vehicles with no bookings |
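
In pandas terms the two features reduce to a subtraction and a left-joined count. A sketch under assumed column names:

```python
import pandas as pd

# Toy inputs standing in for the SQLite tables (columns are assumptions).
vehicles = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "actual_price": [21000, 15500, 18000],
    "recommended_price": [20000, 16000, 18000],
    "source": ["csv", "csv", "ui"],
})
reservations = pd.DataFrame({"vehicle_id": [1, 1]})

# price_diff: plain subtraction.
vehicles["price_diff"] = vehicles["actual_price"] - vehicles["recommended_price"]

# num_reservations: left-joined count per vehicle.
counts = (
    reservations.groupby("vehicle_id").size()
    .rename("num_reservations").reset_index()
)
vehicles = vehicles.merge(counts, on="vehicle_id", how="left")

# CSV vehicles with no bookings get 0; UI vehicles stay NULL so the
# model can predict demand for these new arrivals.
is_csv = vehicles["source"] == "csv"
vehicles.loc[is_csv, "num_reservations"] = (
    vehicles.loc[is_csv, "num_reservations"].fillna(0)
)
```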

### Running

```bash
# Via Airflow (recommended — handles seed + materialize in order):
docker compose exec airflow airflow dags trigger vroom_forecast_materialize

# Or manually:
cd features && uv run python seed.py --data-dir ../data --db /feast-data/vehicles.db
cd features && uv run python pipeline.py --db /feast-data/vehicles.db \
    --feast-repo feature_repo --parquet-path /feast-data/vehicle_features.parquet
```
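
Triggering can also be scripted against Airflow's stable REST API. A sketch assuming the webserver is published on localhost:8080 and the basic-auth backend is enabled for the `admin` / `admin` user from the Credentials section:

```python
import requests

# POST /api/v1/dags/{dag_id}/dagRuns queues a new run of the DAG.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/vroom_forecast_materialize/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {}},  # empty run config; run_id is generated server-side
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```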

## How it works

Airflow doesn't install any ML dependencies. Each task runs:

```bash
uv run --project <project> python -m <module> [args]
# or
cd features && uv run python pipeline.py [args]
```

`uv` creates an isolated venv inside the container on first run.
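
Concretely, a training task might be declared as below. The project path and module name are assumptions; the pattern itself, a plain `BashOperator` invoking `uv run`, is the one described above:

```python
from airflow.operators.bash import BashOperator

train = BashOperator(
    task_id="train",
    # uv resolves the sub-project's own lockfile on first run, so the
    # Airflow image itself never needs the ML dependencies installed.
    bash_command=(
        "uv run --project /opt/airflow/projects/training "
        "python -m training.train"
    ),
)
```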

## Key files

- `Dockerfile`: extends `apache/airflow:2.10.5-python3.12`, adds `uv`, copies sub-projects
- `dags/vroom_forecast_materialize.py`: feature materialization DAG
- `dags/vroom_forecast_training.py`: training DAG
- `dags/vroom_forecast_promotion.py`: promotion DAG
- `dags/vroom_forecast_pipeline.py`: pipeline orchestrator DAG

## Credentials

The admin user is created explicitly in `docker-compose.yml` with credentials `admin` / `admin`. The same password is also written to a file inside the container:

```bash
docker compose exec airflow cat /opt/airflow/standalone_admin_password.txt
```