I kept running into the same frustration — every ML tutorial ends at model.fit(). You get a nice accuracy score, maybe a confusion matrix, and that's supposed to be the finish line. But when I tried to actually get a model running somewhere a teammate could hit it with a request, everything fell apart. So I decided to build the whole thing, soup to nuts, and figure out where the real headaches are.
The Gap Nobody Talks About
Training a model is maybe 20% of the work. The rest is all the unglamorous stuff: making sure you can reproduce an experiment six months later, keeping the service up when traffic spikes, knowing when predictions start drifting before someone files a bug report. I wanted to build every one of those layers myself, not just read about them.
How It All Fits Together
I broke the system into six layers. Each one handles a specific job:
Experiment Tracking with MLflow
I went with MLflow because I got tired of losing track of which hyperparameters produced which results. After the third time I couldn't figure out how to recreate a model that worked well two weeks ago, I set up proper tracking. Now every run logs its params, metrics, and the model artifact itself. It sounds basic, but it's saved me more debugging time than any other single decision.
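Here's roughly what the tracking code looks like in a training script — a minimal sketch, not my exact pipeline. The experiment name, hyperparameters, and toy dataset are all placeholders:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data so the example runs end to end; swap in your real dataset.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-classifier")  # placeholder experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}  # placeholder hyperparameters
    mlflow.log_params(params)

    model = RandomForestClassifier(**params).fit(X_train, y_train)

    # Log the validation metric and the model artifact so the run can be recreated later.
    mlflow.log_metric("val_accuracy", accuracy_score(y_val, model.predict(X_val)))
    mlflow.sklearn.log_model(model, "model")
```

Once every run goes through a block like this, "which config produced that model from two weeks ago" becomes a lookup in the MLflow UI instead of an archaeology project.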
The FastAPI Service
I picked FastAPI over Flask pretty early on. The auto-generated docs alone save a ton of time when someone else needs to figure out your API. Plus Pydantic catches bad input before it ever reaches the model, which means fewer cryptic numpy errors in production. The service has three endpoints: /predict does the actual inference, /health tells Kubernetes the pod is alive, and /metrics feeds Prometheus.
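A stripped-down sketch of the service. I'm assuming here that the model is loaded from the MLflow registry at startup; the model URI and the one-field request schema are illustrative, not my real ones:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from prometheus_client import make_asgi_app
import mlflow.pyfunc

app = FastAPI(title="model-service")

# Hypothetical model URI; in practice this would come from configuration.
model = mlflow.pyfunc.load_model("models:/demo-classifier/Production")

class PredictRequest(BaseModel):
    features: list[float]  # placeholder schema; Pydantic rejects malformed input here

class PredictResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    try:
        result = model.predict([req.features])
        return PredictResponse(prediction=float(result[0]))
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc))

@app.get("/health")
def health() -> dict:
    # Kubernetes liveness/readiness probes hit this endpoint.
    return {"status": "ok"}

# /metrics is served by the Prometheus client mounted as a sub-application.
app.mount("/metrics", make_asgi_app())
```

The auto-generated docs at /docs come for free from the Pydantic models — that's the part that saves teammates the "so what does this endpoint expect?" conversation.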
Packaging It Up with Docker
I spent an embarrassing amount of time debugging "works on my machine" issues before I committed to doing Docker properly. Multi-stage builds, non-root user, pinned versions — the usual stuff. But honestly, getting the health check right was the part that tripped me up. The orchestrator needs to know when your container is actually ready, not just running.
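For reference, this is the shape of the Dockerfile — a hedged sketch rather than my exact file. The base image tags, paths, and health-check command are illustrative:

```dockerfile
# Build stage: install dependencies into a virtualenv we can copy out.
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN python -m venv /opt/venv && /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Runtime stage: copy only what's needed, run as a non-root user.
FROM python:3.11-slim
RUN useradd --create-home appuser
COPY --from=builder /opt/venv /opt/venv
COPY app/ /app/
WORKDIR /app
USER appuser
ENV PATH="/opt/venv/bin:$PATH"

# Health check so the orchestrator knows when the service is ready, not just running.
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```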
Deploying on Kubernetes
Yes, Kubernetes for a personal project is overkill. I know. But the point was to prove that the manifests work in a real orchestration environment. I ran everything on Minikube locally, and the same YAML files would work on any cloud cluster without changes. Resource limits, readiness probes, rolling updates — all of it. It forced me to think about failure modes I would have ignored otherwise.
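A trimmed version of the Deployment manifest, with the Service and Ingress omitted. The names, image reference, and resource numbers are placeholders standing in for my actual values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-service            # placeholder name
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # keep full capacity during rollouts
  selector:
    matchLabels:
      app: model-service
  template:
    metadata:
      labels:
        app: model-service
    spec:
      containers:
        - name: model-service
          image: registry.example.com/model-service:latest   # illustrative image
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
```

The readiness probe is what connects back to the Docker health-check lesson: traffic only reaches a pod after /health starts answering.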
Monitoring with Prometheus & Grafana
This was the layer I almost skipped, and I'm glad I didn't. The first time I saw my p95 latency spike on the Grafana dashboard during a load test, I caught a memory leak in the inference code that I never would have found through regular testing. Prometheus scrapes the /metrics endpoint every 15 seconds. I track latency percentiles, prediction distribution (for drift), error rates, and throughput.
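On the service side the instrumentation is just a handful of prometheus_client metrics. This is a sketch of roughly what I track — the metric names and the timing wrapper are my own conventions, not anything standard:

```python
import time
from prometheus_client import Counter, Histogram

# A histogram gives Prometheus the buckets it needs to compute p50/p95/p99 latency.
PREDICT_LATENCY = Histogram(
    "predict_latency_seconds", "Time spent serving /predict",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
PREDICTIONS = Counter(
    "predictions_total", "Predictions served, labeled by class for drift checks",
    ["predicted_class"],
)
PREDICT_ERRORS = Counter("predict_errors_total", "Failed prediction requests")

def timed_predict(model, features):
    """Wrap inference so every call feeds the latency, drift, and error metrics."""
    start = time.perf_counter()
    try:
        result = model.predict([features])[0]
        PREDICTIONS.labels(predicted_class=str(result)).inc()
        return result
    except Exception:
        PREDICT_ERRORS.inc()
        raise
    finally:
        PREDICT_LATENCY.observe(time.perf_counter() - start)
```

Throughput falls out of the same counters: rate(predictions_total[5m]) in PromQL is the requests-per-second panel on the dashboard.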
CI/CD with GitHub Actions
Push to main, and the pipeline takes over: lint, test, build the Docker image, push to registry, deploy to Kubernetes. No SSH-ing into servers, no manual deploys. If something breaks in the test suite, the deployment just doesn't happen. It took a full afternoon to get the workflow right, but now I don't think about it at all.
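The workflow itself is short. Here's a skeleton with the deploy step reduced to a placeholder, since the registry, linter, and cluster credentials are specific to my setup:

```yaml
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Lint and test
        run: |
          pip install -r requirements.txt
          ruff check .        # illustrative linter choice
          pytest

      - name: Build and push image
        run: |
          docker build -t ${{ secrets.REGISTRY }}/model-service:${{ github.sha }} .
          docker push ${{ secrets.REGISTRY }}/model-service:${{ github.sha }}

      - name: Deploy to Kubernetes
        run: |
          # Placeholder: in practice a prior step configures kubeconfig from a secret.
          kubectl set image deployment/model-service \
            model-service=${{ secrets.REGISTRY }}/model-service:${{ github.sha }}
```

Because the steps run in sequence within one job, a failing lint or test step stops everything downstream — that's the "broken tests mean no deploy" guarantee.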
What Actually Surprised Me
The individual pieces weren't that hard. What caught me off guard was getting them to play nicely together. A misconfigured health check that made Kubernetes restart my pod in a loop. A Prometheus scrape interval that was too aggressive for a cold-start service. The lessons that stuck: treat observability as a first-class concern from day one, and invest in experiment tracking early — it pays off the moment a deployed model starts acting weird.