# Airflow

Orchestration layer for the ML pipelines. Contains the Dockerfile and DAG definitions. The Airflow image installs no ML dependencies; pipelines run via `BashOperator` + `uv run`.

## DAG Flow

```mermaid
graph LR
    subgraph "Materialize (schedule=None, manual)"
        S[seed] --> M[materialize]
    end

    subgraph "Training (schedule=None, manual)"
        T[train]
    end

    subgraph "Promotion (schedule=None, manual)"
        P[promote]
    end

    subgraph "Pipeline (schedule=None, orchestrator)"
        TT[trigger_training] --> TP[trigger_promotion]
    end

    P -.->|Redis pub/sub| SRV[Serving reloads model]
```
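
The dotted edge is the only cross-service hand-off: promotion publishes on a Redis channel, and serving, subscribed to that channel, reloads the model. A minimal sketch of the pattern with redis-py; the channel name `model-events` and the payload are illustrative assumptions, not the project's actual contract:

```python
import redis

r = redis.Redis(host="redis", port=6379)

# Promotion side: announce that a new champion was promoted.
# Channel name and payload are assumptions for illustration.
r.publish("model-events", "promoted")

# Serving side: loop forever and reload on each notification.
sub = r.pubsub()
sub.subscribe("model-events")
for message in sub.listen():
    if message["type"] == "message":
        print("Promotion event received, reloading model...")
        # serving would re-fetch the champion model here
```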

## DAGs

| DAG | Schedule | Tasks | Description |
|-----|----------|-------|-------------|
| `vroom_forecast_materialize` | Manual | `seed`, `materialize` | Seed DB + compute features, write Parquet + Redis |
| `vroom_forecast_training` | Manual | `train` | Train from offline store, register candidate |
| `vroom_forecast_promotion` | Manual | `promote` | Compare candidate vs champion, promote if better, notify via Redis |
| `vroom_forecast_pipeline` | Manual | `trigger_training`, `trigger_promotion` | Orchestrator: chains training → promotion (used by UI "Train" button) |
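
The orchestrator's two tasks are plain trigger tasks, so the whole DAG can be expressed with Airflow's stock `TriggerDagRunOperator`. A hedged sketch of what `dags/vroom_forecast_pipeline.py` plausibly looks like; the wiring follows the table above, while details like `poke_interval` are assumptions:

```python
import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="vroom_forecast_pipeline",
    schedule=None,  # manual: triggered by the UI "Train" button
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    trigger_training = TriggerDagRunOperator(
        task_id="trigger_training",
        trigger_dag_id="vroom_forecast_training",
        wait_for_completion=True,  # block until training finishes
        poke_interval=30,
    )
    trigger_promotion = TriggerDagRunOperator(
        task_id="trigger_promotion",
        trigger_dag_id="vroom_forecast_promotion",
        wait_for_completion=True,
    )
    trigger_training >> trigger_promotion
```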

## Materialization Pipeline

The materialize DAG runs two steps that populate the feature stores:

```mermaid
graph LR
    CSV[vehicles.csv<br/>reservations.csv] -->|seed.py| DB[(SQLite)]
    UI[Serving API] -->|save vehicle| DB
    DB -->|pipeline.py| COMPUTE[Compute Features<br/><i>price_diff<br/>num_reservations</i>]
    COMPUTE --> PQ[Parquet<br/>offline store<br/><i>fleet vehicles only</i>]
    COMPUTE --> RD[(Redis<br/>online store<br/><i>new arrivals only</i>)]
```

### Step 1: Seed (`seed.py`)

Loads `vehicles.csv` and `reservations.csv` into SQLite. Idempotent: safe to run multiple times. Vehicles are inserted with `source='csv'`.

```bash
cd features && uv run python seed.py --data-dir ../data --db /feast-data/vehicles.db
```
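
The idempotency typically comes from inserting with conflict handling on the primary key instead of blind inserts. A minimal sketch of how the vehicle load in seed.py could work; the table schema and CSV column names here are assumptions:

```python
import csv
import sqlite3

conn = sqlite3.connect("/feast-data/vehicles.db")

# Schema is an assumption; the primary key is what makes re-runs safe.
conn.execute(
    "CREATE TABLE IF NOT EXISTS vehicles ("
    "vehicle_id TEXT PRIMARY KEY, actual_price REAL, "
    "recommended_price REAL, source TEXT)"
)

with open("../data/vehicles.csv", newline="") as f:
    for row in csv.DictReader(f):
        # INSERT OR IGNORE makes already seeded rows no-ops,
        # so running the step twice changes nothing.
        conn.execute(
            "INSERT OR IGNORE INTO vehicles "
            "(vehicle_id, actual_price, recommended_price, source) "
            "VALUES (?, ?, ?, 'csv')",
            (row["vehicle_id"], row["actual_price"], row["recommended_price"]),
        )

conn.commit()
conn.close()
```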

### Step 2: Materialize (`pipeline.py`)

Reads all vehicles + reservations from SQLite, computes derived features, and writes to both stores:

1. Fleet vehicles (observed `num_reservations`) → Parquet (offline store), used for training and fleet display
2. New arrivals only (`num_reservations IS NULL`) → Redis via `store.write_to_online_store()`, used for real-time inference

```bash
cd features && uv run python pipeline.py \
    --db /feast-data/vehicles.db \
    --feast-repo feature_repo \
    --parquet-path /feast-data/vehicle_features.parquet
```
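
The fan-out to the two stores might look roughly like the sketch below. The toy DataFrame, its column names, and the feature view name `vehicle_features` are assumptions; `FeatureStore.write_to_online_store` is the Feast call named in the list above:

```python
import pandas as pd
from feast import FeatureStore

# Toy frame standing in for the features computed from SQLite
# (column names are assumptions).
features = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "price_diff": [-500.0, 120.0, 0.0],
    "num_reservations": [4.0, 0.0, None],  # None marks a new arrival
    "event_timestamp": pd.Timestamp.now(tz="UTC"),
})

store = FeatureStore(repo_path="feature_repo")

fleet = features[features["num_reservations"].notna()]
new_arrivals = features[features["num_reservations"].isna()]

# Offline store: the Parquet file used for training and fleet display.
fleet.to_parquet("/feast-data/vehicle_features.parquet")

# Online store: only new arrivals go to Redis for real-time inference.
store.write_to_online_store(
    feature_view_name="vehicle_features",  # assumed feature view name
    df=new_arrivals,
)
```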

### Derived features

| Feature | Formula |
|---------|---------|
| `price_diff` | `actual_price - recommended_price` |
| `num_reservations` | `COUNT(reservations)` per vehicle; `NULL` for UI-sourced vehicles, `0` for CSV vehicles with no bookings |
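
In pandas terms the two features reduce to a subtraction and a left-joined count. A sketch under assumed column names:

```python
import pandas as pd

# Toy inputs standing in for the SQLite tables (columns are assumptions).
vehicles = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "actual_price": [21000, 15500, 18000],
    "recommended_price": [20000, 16000, 18000],
    "source": ["csv", "csv", "ui"],
})
reservations = pd.DataFrame({"vehicle_id": [1, 1]})

# price_diff: plain subtraction.
vehicles["price_diff"] = vehicles["actual_price"] - vehicles["recommended_price"]

# num_reservations: left-joined count per vehicle.
counts = (
    reservations.groupby("vehicle_id").size()
    .rename("num_reservations").reset_index()
)
vehicles = vehicles.merge(counts, on="vehicle_id", how="left")

# CSV vehicles with no bookings get 0; UI vehicles stay NULL so the
# model can predict demand for these new arrivals.
is_csv = vehicles["source"] == "csv"
vehicles.loc[is_csv, "num_reservations"] = (
    vehicles.loc[is_csv, "num_reservations"].fillna(0)
)
```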

### Running

```bash
# Via Airflow (recommended — handles seed + materialize in order):
docker compose exec airflow airflow dags trigger vroom_forecast_materialize

# Or manually:
cd features && uv run python seed.py --data-dir ../data --db /feast-data/vehicles.db
cd features && uv run python pipeline.py --db /feast-data/vehicles.db \
    --feast-repo feature_repo --parquet-path /feast-data/vehicle_features.parquet
```
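
Triggering can also be scripted against Airflow's stable REST API. A sketch assuming the webserver is published on localhost:8080 and the basic-auth backend is enabled for the `admin` / `admin` user from the Credentials section:

```python
import requests

# POST /api/v1/dags/{dag_id}/dagRuns queues a new run of the DAG.
resp = requests.post(
    "http://localhost:8080/api/v1/dags/vroom_forecast_materialize/dagRuns",
    auth=("admin", "admin"),
    json={"conf": {}},  # empty run config; run_id is generated server-side
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```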

## How it works

Airflow doesn't install any ML dependencies. Each task runs:

```bash
uv run --project <project> python -m <module> [args]
# or
cd features && uv run python pipeline.py [args]
```

`uv` creates an isolated venv inside the container on first run.
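
Concretely, a training task might be declared as below. The project path and module name are assumptions; the pattern itself, a plain `BashOperator` invoking `uv run`, is the one described above:

```python
from airflow.operators.bash import BashOperator

train = BashOperator(
    task_id="train",
    # uv resolves the sub-project's own lockfile on first run, so the
    # Airflow image itself never needs the ML dependencies installed.
    bash_command=(
        "uv run --project /opt/airflow/projects/training "
        "python -m training.train"
    ),
)
```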

## Key files

- `Dockerfile`: extends `apache/airflow:2.10.5-python3.12`, adds `uv`, copies sub-projects
- `dags/vroom_forecast_materialize.py`: feature materialization DAG
- `dags/vroom_forecast_training.py`: training DAG
- `dags/vroom_forecast_promotion.py`: promotion DAG
- `dags/vroom_forecast_pipeline.py`: pipeline orchestrator DAG

## Credentials

The admin user is created explicitly in `docker-compose.yml` with credentials `admin` / `admin`. The same password is also written to a file inside the container:

```bash
docker compose exec airflow cat /opt/airflow/standalone_admin_password.txt
```