Serving API¶
Ray Serve prediction service for vroom-forecast. FastAPI ingress with Ray Serve deployments for model inference, feature computation, and online store lookup.
Why Ray Serve?¶
The serving layer does more than load-a-model-and-predict. It computes features on the fly, looks up pre-materialized features from Redis, reads from the Parquet offline store, runs inference, and listens for model promotion events — all behind a single HTTP API. Ray Serve was chosen because it maps naturally to this multi-concern architecture:
| Reason | Detail |
|---|---|
| Deployment composition | Each concern (feature compute, online lookup, inference) is a separate Ray Serve deployment, independently scalable. Plain FastAPI would put everything in one process. |
| Hot model reload | Deployments are long-lived actors. Predictor.reload() swaps the model in-place without restarting the server or dropping requests — triggered by Redis pub/sub on promotion. |
| FastAPI integration | @serve.ingress(app) gives full FastAPI capabilities (OpenAPI docs, Pydantic validation) while Ray handles replicas and routing. No tradeoff between DX and scalability. |
| Stack alignment | Ray is listed as high-priority in Turo's tech stack. Using it here demonstrates familiarity with the tool in a realistic serving context. |
Tradeoff acknowledged: Ray Serve adds operational complexity (a Ray
cluster) and a heavier dependency footprint vs. plain FastAPI + Gunicorn.
At this project's scale, plain FastAPI would suffice. The choice is
pragmatic — it demonstrates the production pattern while remaining fully
functional locally via ray start --head.
Architecture¶
graph TB
subgraph Ray Serve
ING[VroomForecastApp<br/>FastAPI ingress]
FC[FeatureComputer]
FL[FeatureLookup]
OFR[OfflineFeatureReader]
PRED[Predictor]
MAT[FeatureMaterializer<br/>Ray actor]
end
Client -->|HTTP| ING
ING -->|/predict| FC -->|DataFrame| PRED
ING -->|/predict/id| FL -->|DataFrame| PRED
ING -->|/vehicles/features| OFR -->|offline first| ING
ING -->|/vehicles/features fallback| FL
ING -->|/vehicles POST| DB[(SQLite)]
DB -->|Redis pub/sub| MAT -->|write_to_online_store| Redis[(Redis)]
PQ[Parquet] --> OFR
Redis --> FL
MLflow[(MLflow)] -->|load champion| PRED
Running¶
# Local:
uv run --project serving python -m serving
# Docker:
docker compose up ray-serve # port 8000 + Ray dashboard on 8265
Endpoints¶
| Method | Path | Description |
|---|---|---|
| GET | /health |
Liveness check + model info + Feast online status |
| POST | /reload |
Hot-reload champion model from MLflow |
| POST | /predict |
Single prediction from raw attributes (on-the-fly features) |
| POST | /predict/id |
Prediction by vehicle ID (features from online store) |
| POST | /predict/batch |
Batch prediction (up to 1000, on-the-fly features) |
| POST | /benchmark |
Latency benchmark: on-the-fly features + inference |
| POST | /benchmark/id |
Latency benchmark: online store lookup + inference |
| POST | /vehicles |
Save a vehicle to SQLite (emits Redis event for materialization) |
| DELETE | /vehicles/{id} |
Delete a new arrival vehicle |
| GET | /vehicles |
List all vehicles |
| GET | /vehicles/features |
Batch: get features for all vehicles (offline + online) |
| GET | /vehicles/{id}/features |
Get computed features for one vehicle |
| GET | /stores |
Operational info about offline and online stores |
| GET | /model |
Champion model metadata (version, metrics, feature importances) |
| POST | /materialize |
Trigger the Airflow materialization pipeline |
| POST | /train |
Trigger the end-to-end ML pipeline (training + promotion) |
| GET | /events |
SSE stream for model promotion events |
| GET | /vehicles/events |
SSE stream for vehicle materialization events |
| GET | /pipelines/events |
SSE stream for Airflow DAG completion events |
Interactive docs at http://localhost:8000/docs.
A Bruno API collection is included at the repo root in bruno/ with
pre-filled requests for every endpoint.
Configuration¶
| Variable | Default | Description |
|---|---|---|
SERVING_MLFLOW_URI |
http://localhost:5001 |
MLflow tracking server |
SERVING_MODEL_NAME |
vroom-forecast |
Registered model name |
SERVING_HOST |
0.0.0.0 |
Bind address |
SERVING_PORT |
8000 |
Bind port |
SERVING_FEAST_REPO |
None | Path to Feast feature repo |
SERVING_REDIS_URL |
None | Redis URL for pub/sub + model reload |
SERVING_DB_PATH |
/feast-data/vehicles.db |
SQLite path for vehicle persistence |
SERVING_OFFLINE_STORE_PATH |
None | Parquet path for offline feature store |
SERVING_AIRFLOW_URL |
None | Airflow REST API base URL |