Serving API

The Ray Serve prediction service for vroom-forecast: a FastAPI ingress in front of Ray Serve deployments for model inference, feature computation, and online and offline feature store lookup.

Why Ray Serve?

The serving layer does more than load a model and predict. It computes features on the fly, looks up pre-materialized features in Redis, reads from the Parquet offline store, runs inference, and listens for model promotion events, all behind a single HTTP API. Ray Serve was chosen because it maps naturally onto this multi-concern architecture:

| Reason | Detail |
|---|---|
| Deployment composition | Each concern (feature computation, online lookup, inference) is a separate Ray Serve deployment that scales independently. With plain FastAPI, every concern would share one process. |
| Hot model reload | Deployments are long-lived actors. `Predictor.reload()` swaps the model in place without restarting the server or dropping requests; a Redis pub/sub message on promotion triggers it. |
| FastAPI integration | `@serve.ingress(app)` keeps full FastAPI capabilities (OpenAPI docs, Pydantic validation) while Ray handles replicas and routing. There is no tradeoff between DX and scalability. |
| Stack alignment | Ray is listed as high-priority in Turo's tech stack. Using it here demonstrates familiarity with the tool in a realistic serving context. |

Tradeoff acknowledged: Ray Serve adds operational complexity (a Ray cluster) and a heavier dependency footprint than plain FastAPI + Gunicorn. At this project's scale, plain FastAPI would suffice. The choice is pragmatic: it demonstrates the production pattern while remaining fully functional locally via `ray start --head`.
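The "Hot model reload" row boils down to an atomic reference swap. The sketch below is a stdlib-only approximation, not the service's code: it assumes a hypothetical `load_model(version)` callable (standing in for loading the champion from MLflow) and a hypothetical JSON promotion message with a `version` field. Each request reads the model reference once, so a reload that lands mid-request never changes the model that request is using.

```python
import json
import threading
from typing import Callable, Sequence

# Hypothetical model type: any callable mapping a feature row to a prediction.
Model = Callable[[Sequence[float]], float]


class Predictor:
    """Holds one model reference and swaps it atomically on reload."""

    def __init__(self, load_model: Callable[[str], Model], version: str) -> None:
        self._load_model = load_model  # hypothetical loader (stands in for MLflow)
        self._lock = threading.Lock()
        self.version = version
        self._model = load_model(version)

    def predict(self, row: Sequence[float]) -> float:
        # Read the reference once; a concurrent reload cannot change it mid-call.
        model = self._model
        return model(row)

    def reload(self, version: str) -> None:
        # Load outside the lock so requests keep using the old model meanwhile.
        new_model = self._load_model(version)
        # The lock keeps the model and its version label updated together.
        with self._lock:
            self._model = new_model
            self.version = version


def on_promotion_message(predictor: Predictor, raw: str) -> None:
    """Handle a pub/sub promotion event; the payload shape is an assumption."""
    event = json.loads(raw)
    if event.get("version") != predictor.version:
        predictor.reload(event["version"])


if __name__ == "__main__":
    # Toy registry: version "1" doubles the first feature, "2" triples it.
    registry = {"1": lambda r: 2 * r[0], "2": lambda r: 3 * r[0]}
    p = Predictor(registry.__getitem__, "1")
    print(p.predict([10.0]))  # 20.0
    on_promotion_message(p, '{"version": "2"}')
    print(p.predict([10.0]))  # 30.0
```

In a Ray Serve replica the same swap happens inside a long-lived actor, which is why no server restart is needed.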

Architecture

```mermaid
graph TB
    subgraph Ray Serve
        ING[VroomForecastApp<br/>FastAPI ingress]
        FC[FeatureComputer]
        FL[FeatureLookup]
        OFR[OfflineFeatureReader]
        PRED[Predictor]
        MAT[FeatureMaterializer<br/>Ray actor]
    end

    Client -->|HTTP| ING
    ING -->|/predict| FC -->|DataFrame| PRED
    ING -->|/predict/id| FL -->|DataFrame| PRED
    ING -->|/vehicles/features| OFR -->|offline first| ING
    ING -->|/vehicles/features fallback| FL
    ING -->|/vehicles POST| DB[(SQLite)]
    DB -->|Redis pub/sub| MAT -->|write_to_online_store| Redis[(Redis)]
    PQ[Parquet] --> OFR
    Redis --> FL
    MLflow[(MLflow)] -->|load champion| PRED
```
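The `/vehicles/features` path in the diagram reads the offline store first and falls back to the online store. A minimal sketch of that merge, assuming the Parquet snapshot has already been loaded into a dict keyed by vehicle ID and the online lookup is a batched callable; both are hypothetical stand-ins for `OfflineFeatureReader` and `FeatureLookup`, and the feature name `vehicle_age` is illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Optional

Features = Dict[str, float]


def get_vehicle_features(
    vehicle_ids: Iterable[int],
    offline: Dict[int, Features],
    online_lookup: Callable[[List[int]], Dict[int, Optional[Features]]],
) -> Dict[int, dict]:
    """Offline store first; query the online store only for IDs it lacks."""
    result: Dict[int, dict] = {}
    missing: List[int] = []
    for vid in vehicle_ids:
        if vid in offline:
            result[vid] = {"source": "offline", "features": offline[vid]}
        else:
            missing.append(vid)

    if missing:
        # One batched online call for every ID absent from the offline snapshot,
        # e.g. vehicles saved after the last Parquet export.
        online = online_lookup(missing)
        for vid in missing:
            feats = online.get(vid)
            if feats is not None:
                result[vid] = {"source": "online", "features": feats}
    # IDs found in neither store are simply omitted.
    return result
```

Batching the fallback keeps the online store to a single round trip per request, however many vehicles are missing offline.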

Running

```bash
# Local:
uv run --project serving python -m serving

# Docker:
docker compose up ray-serve    # port 8000 + Ray dashboard on 8265
```

Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness check + model info + Feast online status |
| POST | `/reload` | Hot-reload the champion model from MLflow |
| POST | `/predict` | Single prediction from raw attributes (on-the-fly features) |
| POST | `/predict/id` | Prediction by vehicle ID (features from the online store) |
| POST | `/predict/batch` | Batch prediction (up to 1000, on-the-fly features) |
| POST | `/benchmark` | Latency benchmark: on-the-fly features + inference |
| POST | `/benchmark/id` | Latency benchmark: online store lookup + inference |
| POST | `/vehicles` | Save a vehicle to SQLite (emits a Redis event for materialization) |
| DELETE | `/vehicles/{id}` | Delete a new-arrival vehicle |
| GET | `/vehicles` | List all vehicles |
| GET | `/vehicles/features` | Get features for all vehicles in one call (offline store first, online store fallback) |
| GET | `/vehicles/{id}/features` | Get computed features for one vehicle |
| GET | `/stores` | Operational info about the offline and online stores |
| GET | `/model` | Champion model metadata (version, metrics, feature importances) |
| POST | `/materialize` | Trigger the Airflow materialization pipeline |
| POST | `/train` | Trigger the end-to-end ML pipeline (training + promotion) |
| GET | `/events` | SSE stream of model promotion events |
| GET | `/vehicles/events` | SSE stream of vehicle materialization events |
| GET | `/pipelines/events` | SSE stream of Airflow DAG completion events |
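The three `/…/events` endpoints are Server-Sent Events streams, so any HTTP client can follow them. Below is a small stdlib-only parser for the SSE wire format (`event:` and `data:` fields, `:` comment lines, a blank line ending each event). It is a sketch: it ignores `id:` and `retry:`, and it makes no assumption about the event names or `data` payloads the service actually emits.

```python
import urllib.request
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"event", "data"} dict per Server-Sent Event."""
    event_type = "message"  # SSE default when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carried data.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the spec strips exactly one leading space
        if field == "data":
            data.append(value)
        elif field == "event":
            event_type = value
        # "id" and "retry" are ignored in this sketch.


def follow(url: str) -> Iterator[Dict[str, str]]:
    """Stream events from e.g. http://localhost:8000/events until closed."""
    with urllib.request.urlopen(url) as resp:
        # Iterating the response yields raw byte lines.
        yield from parse_sse(line.decode("utf-8") for line in resp)
```

For example, `for event in follow("http://localhost:8000/events"): print(event)` prints each promotion event as it arrives.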

Interactive OpenAPI docs are served at http://localhost:8000/docs.

A Bruno API collection with pre-filled requests for every endpoint is included at the repo root in `bruno/`.

Configuration

| Variable | Default | Description |
|---|---|---|
| `SERVING_MLFLOW_URI` | `http://localhost:5001` | MLflow tracking server |
| `SERVING_MODEL_NAME` | `vroom-forecast` | Registered model name |
| `SERVING_HOST` | `0.0.0.0` | Bind address |
| `SERVING_PORT` | `8000` | Bind port |
| `SERVING_FEAST_REPO` | None | Path to the Feast feature repo |
| `SERVING_REDIS_URL` | None | Redis URL for pub/sub and model reload |
| `SERVING_DB_PATH` | `/feast-data/vehicles.db` | SQLite path for vehicle persistence |
| `SERVING_OFFLINE_STORE_PATH` | None | Parquet path for the offline feature store |
| `SERVING_AIRFLOW_URL` | None | Airflow REST API base URL |
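The table maps one-to-one onto a settings object. A stdlib-only sketch, assuming each variable is the `SERVING_` prefix plus an upper-cased field name; the service itself may use a library such as pydantic-settings for this, which is an assumption. Unset optional variables stay `None`.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServingSettings:
    # Defaults mirror the configuration table above.
    mlflow_uri: str = "http://localhost:5001"
    model_name: str = "vroom-forecast"
    host: str = "0.0.0.0"
    port: int = 8000
    feast_repo: Optional[str] = None
    redis_url: Optional[str] = None
    db_path: str = "/feast-data/vehicles.db"
    offline_store_path: Optional[str] = None
    airflow_url: Optional[str] = None

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ServingSettings":
        """Build settings from SERVING_* variables, keeping defaults otherwise."""
        values = {}
        for f in fields(cls):
            key = "SERVING_" + f.name.upper()  # e.g. port -> SERVING_PORT
            if key in env:
                raw = env[key]
                values[f.name] = int(raw) if f.name == "port" else raw
        return cls(**values)
```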