I kept running into the same frustration — every ML tutorial ends at model.fit(). You get a nice accuracy score, maybe a confusion matrix, and that's supposed to be the finish line. But when I tried to actually get a model running somewhere a teammate could hit it with a request, everything fell apart. So I decided to build the whole thing, soup to nuts, and figure out where the real headaches are.
- Training is ~20% of the work; reproducibility, serving, and monitoring are the real engineering
- MLflow for experiment tracking from day one: losing reproducibility is painful to fix retroactively
- FastAPI over Flask: Pydantic validation and auto-docs save hours when teammates hit your endpoint
- Kubernetes manifests built on Minikube should carry over to any cloud cluster unchanged; worth the initial YAML overhead
- Prometheus/Grafana caught a memory leak in inference that no test suite would have found
The Gap Nobody Talks About
Training a model is maybe 20% of the work. The rest is all the unglamorous stuff: making sure you can reproduce an experiment six months later, keeping the service up when traffic spikes, knowing when predictions start drifting before someone files a bug report. I wanted to build every one of those layers myself, not just read about them.
How It All Fits Together
I broke the system into five layers, each handling a specific job.
Experiment Tracking with MLflow
I went with MLflow because I got tired of losing track of which hyperparameters produced which results. After the third time I couldn't figure out how to recreate a model that worked well two weeks ago, I set up proper tracking. Now every run logs its params, metrics, and the model artifact itself. It sounds basic, but it's saved me more debugging time than any other single decision.
The FastAPI Service
I picked FastAPI over Flask pretty early on. The auto-generated docs alone save a ton of time when someone else needs to figure out your API. Plus Pydantic catches bad input before it ever reaches the model, which means fewer cryptic numpy errors in production. The service has three endpoints: /predict does the actual inference, /health tells Kubernetes the pod is alive, and /metrics feeds Prometheus.
Packaging It Up with Docker
I spent an embarrassing amount of time debugging "works on my machine" issues before I committed to doing Docker properly. The specific failure that finally convinced me: a numpy version mismatch caused my model's predict_proba to return slightly different float precision on the deployment server, which downstream code wasn't handling. Zero error messages, just wrong predictions. After that I pinned every dependency to the exact version in requirements.txt.
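One cheap guard against this is a check that fails when any requirement is not pinned exactly. The sketch below is a simplified illustration: the helper name is made up, the package versions in the usage example are only examples, and it ignores environment markers and URL requirements.

```python
import re

# Matches lines of the form "name==version" (optionally with extras like name[extra]).
PIN_PATTERN = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.!+\-]+$")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):      # skip blanks and pip options like -r
            continue
        if not PIN_PATTERN.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = "numpy==1.26.4\nscikit-learn>=1.3\nfastapi\n"
    print(find_unpinned(sample))
```

Run against the sample text, the helper reports `scikit-learn>=1.3` and `fastapi` as unpinned and accepts the `numpy` line.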
Multi-stage builds made a real difference on image size: the final inference image went from 2.1GB to 680MB by separating the build environment from the runtime. Running as a non-root user takes about thirty seconds to set up in the Dockerfile and is worth it. The health check was the part that actually bit me. I had it hitting /predict during startup, which failed because MLflow model loading takes ~8 seconds. I switched it to /health, which returns immediately once FastAPI is up, and Kubernetes stopped restarting my pod.
Deploying on Kubernetes
Yes, Kubernetes for a personal project is overkill. I know. But the point was to prove that the manifests work in a real orchestration environment, not just in a docker run command on my laptop. I ran everything on Minikube locally, and the same YAML files would work on any cloud cluster without changes.
Two things I got wrong initially: resource limits and readiness probes. I didn't set CPU/memory requests or limits on my first deployment, which meant Kubernetes had no idea how to bin-pack the pods and the scheduler made bad decisions. Setting requests and limits made the scheduling predictable. Readiness probes, unlike liveness probes, tell Kubernetes when the pod is actually ready to serve traffic, not just alive. Without one, the load balancer was routing requests to pods that were still loading the model into memory, producing 500s. It took a whole evening to figure out that distinction.
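The liveness/readiness distinction is easier to see in code than in YAML. This is a minimal sketch on the application side, using only the standard library. The class and method names are invented for the example, and in a real service the two methods would back two separate HTTP routes that the probes point at.

```python
import threading
import time


class ModelService:
    """Tracks whether the model has finished loading."""

    def __init__(self, load_seconds: float = 0.0) -> None:
        self._load_seconds = load_seconds
        self._ready = threading.Event()
        self.model = None

    def load_model(self) -> None:
        time.sleep(self._load_seconds)  # stand-in for the slow model load
        self.model = object()           # placeholder for the loaded model
        self._ready.set()

    def liveness(self) -> tuple[int, str]:
        # The process is up, so answer immediately even while the model loads.
        return 200, "alive"

    def readiness(self) -> tuple[int, str]:
        # Report 503 until the model is in memory so no traffic is routed here.
        if self._ready.is_set():
            return 200, "ready"
        return 503, "loading model"
```

Before `load_model()` finishes, `liveness()` returns 200 and `readiness()` returns 503. That combination tells Kubernetes to keep the pod running but hold traffic back.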
Monitoring with Prometheus & Grafana
This was the layer I almost skipped, and I'm glad I didn't. The FastAPI service exposes a /metrics endpoint using the prometheus-fastapi-instrumentator library — it adds automatic request count, latency histograms, and in-progress request tracking with about five lines of code. Prometheus scrapes it every 15 seconds. I added custom metrics on top: prediction class distribution (to catch drift if the model starts returning the same class constantly) and model load time on startup.
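To make the class-distribution idea concrete, here is a small in-process sketch of the check itself: track recent predictions in a sliding window and flag any class that dominates it. The class name, window size, and threshold are assumptions for illustration. With Prometheus you would more likely export a labelled counter and compute the same ratio in a query or alert rule.

```python
from collections import Counter, deque
from typing import Optional


class PredictionDistribution:
    """Sliding-window view of which classes the model has been predicting."""

    def __init__(self, window: int = 1000, max_share: float = 0.9) -> None:
        self._recent = deque(maxlen=window)  # oldest predictions fall off
        self._max_share = max_share

    def observe(self, label: str) -> None:
        self._recent.append(label)

    def shares(self) -> dict[str, float]:
        """Fraction of the window taken up by each predicted class."""
        total = len(self._recent)
        if total == 0:
            return {}
        return {label: n / total for label, n in Counter(self._recent).items()}

    def dominant_class(self) -> Optional[str]:
        """Return the class whose share exceeds max_share, if there is one."""
        for label, share in self.shares().items():
            if share > self._max_share:
                return label
        return None
```

A balanced stream of predictions leaves `dominant_class()` at `None`. Once the window fills with a single label, the method returns that label, and that is the signal worth alerting on.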
During a load test at 50 concurrent requests, the Grafana dashboard showed p95 latency climbing linearly when it should have been flat. I pulled up the pod metrics and saw memory growing with every request. The inference code was caching all input features in a list that never got cleared. That's the kind of bug that never shows up in the application logs and doesn't cause errors; it just makes things slower and slower until the pod gets OOM-killed. Prometheus found it in about ten minutes. A test suite would never have caught it.
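The shape of that bug, and one possible fix, fits in a few lines. Both classes below are reconstructions for illustration, with a trivial placeholder in place of the real model. Bounding the cache with `deque(maxlen=...)` is one way to cap memory, not necessarily the fix used here.

```python
from collections import deque


class LeakyPredictor:
    """Keeps every input it has ever seen, so memory grows with each request."""

    def __init__(self) -> None:
        self.seen_features = []

    def predict(self, features: list[float]) -> bool:
        self.seen_features.append(features)  # never cleared
        return sum(features) > 0             # placeholder for real inference


class BoundedPredictor:
    """Keeps only the most recent inputs, so memory use stays flat."""

    def __init__(self, keep_last: int = 100) -> None:
        self.seen_features = deque(maxlen=keep_last)

    def predict(self, features: list[float]) -> bool:
        self.seen_features.append(features)  # oldest entry is dropped when full
        return sum(features) > 0             # placeholder for real inference
```

After 1,000 requests the leaky version holds 1,000 feature vectors and keeps growing, while the bounded version never holds more than `keep_last`.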
CI/CD with GitHub Actions
Push to main, and the pipeline takes over: lint with flake8, run pytest, build the Docker image, push to a registry, and apply the Kubernetes manifests. The deployment only happens if every prior step passes. No SSH-ing into servers, no "I'll just push this one thing manually".
The tricky part was injecting secrets — the registry credentials and the MLflow tracking server URL need to live in GitHub Actions secrets, not hardcoded anywhere. I set up the workflow to pull those at runtime and inject them as environment variables into the Docker build. It took a full afternoon to get right, mostly because the GitHub Actions YAML for multi-step Docker builds has some non-obvious ordering constraints. Once it was working I made zero changes to it for the next three months.
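On the application side, the matching habit is to read those values from the environment and fail fast when one is missing. This is a minimal sketch: `MLFLOW_TRACKING_URI` is the variable MLflow itself reads, while `REGISTRY_USERNAME` and `REGISTRY_PASSWORD` are hypothetical names chosen for the example.

```python
import os
from collections.abc import Mapping
from typing import Optional

# REGISTRY_* names are hypothetical; use whatever your workflow defines.
REQUIRED_SETTINGS = ("MLFLOW_TRACKING_URI", "REGISTRY_USERNAME", "REGISTRY_PASSWORD")


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Read required settings from the environment, failing loudly if any are absent."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Passing the environment in as an argument keeps the function easy to test without touching real secrets. In production it is called with no arguments and reads `os.environ`.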
What Actually Surprised Me
The individual pieces weren't that hard to get working in isolation. What caught me off guard was getting them to work together correctly under realistic conditions. A misconfigured liveness probe made Kubernetes crash-loop my pod 47 times in twenty minutes because it was timing out during model loading. A Prometheus scrape interval set too aggressively was adding measurable latency to the inference endpoint because my metric collection code wasn't async-safe.
The bigger surprise was how much the project changed how I think about the training side of things. When you have dashboards showing prediction distribution in real time, you start noticing things you'd never catch by looking at offline metrics. The training set had a class imbalance I thought was fine based on the F1 score, but watching the live prediction distribution made it obvious the model was systematically under-predicting one class. That's not a monitoring problem — that's a training problem I would have shipped and never discovered without the observability layer.
If I were starting this over: set up MLflow and the monitoring layer before writing a single line of model training code. Both are the kind of thing that's painful to retrofit and trivial to build from the start.
AI Systems Engineer at HCLTech · M.Tech AI/ML, BITS Pilani
Building agentic AI systems, LLM pipelines, and production ML infrastructure. 3+ years shipping AI at scale.