Most ML engineers stop at the notebook. They train a model, check the accuracy, and call it done. But shipping a model to production — in a way that's reliable, reproducible, and observable — is a completely different engineering challenge. Here's exactly how I approached it.
Why Most ML Portfolios Miss the Point
The notebook is where the idea lives. Production is where it has to survive. The gap between the two involves: versioning experiments so they can be reproduced, serving predictions reliably at scale, packaging the service so it runs anywhere, orchestrating deployment without manual steps, and knowing when something is going wrong in production. My goal with this project was to build every layer.
Section 1: The Architecture
The full pipeline is five layers, each with a clear responsibility: experiment tracking and training with MLflow, an inference service built on FastAPI, containerisation with Docker, orchestration with Kubernetes, and observability with Prometheus and Grafana. A GitHub Actions CI/CD pipeline ties the layers together.
Section 2: Data & Training with MLflow
MLflow handles experiment tracking and artifact management. Every training run logs its parameters, metrics, and model artifact. This means any experiment can be reproduced exactly — something that matters enormously when you need to audit or re-deploy a specific model version.
Section 3: Building the FastAPI Inference Service
FastAPI was the right choice here for three reasons: native async support, automatic OpenAPI documentation generation, and Pydantic validation built in. The service exposes three key endpoints: /predict for inference, /health for Kubernetes liveness/readiness probes, and /metrics for Prometheus scraping.
Section 4: Containerising with Docker
The Dockerfile follows production best practices: multi-stage build to keep the image lean, non-root user for security, health check instruction for orchestrator integration, and pinned dependency versions for reproducibility.
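A sketch of a Dockerfile following those practices. The module path `app.main:app` and port are assumptions, not the project's actual layout:

```dockerfile
# Build stage: install pinned dependencies into a virtualenv.
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Runtime stage: copy only the venv and application code, keeping the image lean.
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY app/ ./app

# Non-root user for security.
RUN useradd --create-home appuser
USER appuser

# Health check for orchestrator integration (urllib avoids needing curl in slim images).
HEALTHCHECK --interval=30s --timeout=3s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```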
Section 5: Kubernetes Deployment
Even on Minikube, using Kubernetes manifests forces you to think in production terms: resource limits, readiness probes, rolling updates, and service exposure. The manifests work on any cluster — cloud or local.
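A condensed sketch of such a manifest — the names, replica count, and resource figures are illustrative, not the project's actual values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never drop below capacity during a rollout
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      containers:
        - name: api
          image: inference-service:latest
          ports:
            - containerPort: 8000
          resources:
            requests: {cpu: 250m, memory: 256Mi}
            limits: {cpu: "1", memory: 512Mi}
          readinessProbe:          # no traffic until /health responds
            httpGet: {path: /health, port: 8000}
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference-service
  ports:
    - port: 80
      targetPort: 8000
```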
Section 6: Observability with Prometheus & Grafana
Observability isn't optional in production. If you don't know your model's inference latency, error rate, and request volume, you're flying blind. Prometheus scrapes metrics from the /metrics endpoint every 15 seconds. Grafana visualises them. I track p50/p95/p99 inference latency, prediction distribution drift, error rate, and requests per second.
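The instrumentation side can be sketched with `prometheus_client`. Metric names here are placeholders; Grafana derives p50/p95/p99 from the histogram with `histogram_quantile()`:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

REGISTRY = CollectorRegistry()

# Latency histogram: Prometheus stores bucket counts; Grafana computes quantiles.
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Time spent on one prediction", registry=REGISTRY
)
REQUESTS = Counter("inference_requests", "Total prediction requests", registry=REGISTRY)
ERRORS = Counter("inference_errors", "Failed prediction requests", registry=REGISTRY)

def timed_inference(model_fn, features):
    """Wrap a model call so every request feeds the metrics endpoint."""
    REQUESTS.inc()
    try:
        with INFERENCE_LATENCY.time():   # records elapsed time into the histogram
            return model_fn(features)
    except Exception:
        ERRORS.inc()
        raise

def metrics_payload() -> bytes:
    # What the /metrics endpoint returns for Prometheus to scrape.
    return generate_latest(REGISTRY)
```

Prediction distribution drift can be tracked the same way, e.g. with a second histogram over the model's output scores.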
Section 7: CI/CD with GitHub Actions
Every push to main triggers the pipeline: lint → test → build Docker image → push to registry → deploy to Kubernetes. No manual steps. If a test fails, the deployment doesn't happen. This is table stakes for production engineering.
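A trimmed sketch of such a workflow. The registry path, lint tool, and deployment name are assumptions:

```yaml
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check .          # lint
      - run: pytest                # any failure stops the pipeline here

  deploy:
    needs: test                    # build/deploy only runs after tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ghcr.io/<owner>/inference-service:${{ github.sha }} .
      - run: docker push ghcr.io/<owner>/inference-service:${{ github.sha }}
      - run: kubectl set image deployment/inference-service api=ghcr.io/<owner>/inference-service:${{ github.sha }}
```

Tagging images with the commit SHA rather than `latest` keeps deployments traceable back to the exact code that produced them.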
What I Learned
The hardest part wasn't any individual component — it was wiring them together reliably. The most important lesson: observability is a first-class engineering concern, not an afterthought. And the second: reproducibility in experiment tracking saves enormous debugging time when a deployed model starts behaving unexpectedly.