How I Built an End-to-End MLOps Pipeline: MLflow + FastAPI + Kubernetes

I kept running into the same frustration — every ML tutorial ends at model.fit(). You get a nice accuracy score, maybe a confusion matrix, and that's supposed to be the finish line. But when I tried to actually get a model running somewhere a teammate could hit it with a request, everything fell apart. So I decided to build the whole thing, soup to nuts, and figure out where the real headaches are.

The Gap Nobody Talks About

Training a model is maybe 20% of the work. The rest is all the unglamorous stuff: making sure you can reproduce an experiment six months later, keeping the service up when traffic spikes, knowing when predictions start drifting before someone files a bug report. I wanted to build every one of those layers myself, not just read about them.

How It All Fits Together

I broke the system into five layers. Each one handles a specific job:

Data & Training: Preprocessing → MLflow experiment tracking → artifact storage

Inference: FastAPI service → Pydantic validation → /predict, /health, /metrics

Infrastructure: Docker image → Kubernetes manifests (Minikube for local, cloud-ready)

Observability: Prometheus scrape → Grafana dashboards (latency, error rates, throughput)

CI/CD: GitHub Actions → automated build/test/deploy on every push

Experiment Tracking with MLflow

I went with MLflow because I got tired of losing track of which hyperparameters produced which results. After the third time I couldn't figure out how to recreate a model that worked well two weeks ago, I set up proper tracking. Now every run logs its params, metrics, and the model artifact itself. It sounds basic, but it's saved me more debugging time than any other single decision.

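import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train, X_test, y_train, y_test come from the preprocessing step.
# RandomForestClassifier is an assumption based on the logged hyperparameters.
params = {"n_estimators": 100, "max_depth": 5, "random_state": 42}

with mlflow.start_run():
    # Log the exact hyperparameters the model is built with, so they can't drift apart
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    mlflow.log_metric("accuracy", accuracy)
    # Save the fitted model as an artifact of this run under "model"
    mlflow.sklearn.log_model(model, "model")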

The FastAPI Service

I picked FastAPI over Flask pretty early on. The auto-generated docs alone save a ton of time when someone else needs to figure out your API. Plus, Pydantic catches bad input before it ever reaches the model, which means fewer cryptic NumPy errors in production. The service has three endpoints: /predict does the actual inference, /health tells Kubernetes the pod is alive, and /metrics feeds Prometheus. The snippet below covers the first two.

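import os

import mlflow.sklearn
from fastapi import FastAPI
from pydantic import BaseModel

# MODEL_URI is an assumed environment variable, e.g. "runs:/<run_id>/model"
model = mlflow.sklearn.load_model(os.environ["MODEL_URI"])

app = FastAPI()

class PredictRequest(BaseModel):
    age: int
    cholesterol: float
    blood_pressure: float

# A plain def (not async) lets FastAPI run the blocking sklearn calls in a threadpool
@app.post("/predict")
def predict(request: PredictRequest):
    features = [[request.age, request.cholesterol, request.blood_pressure]]
    proba = model.predict_proba(features)[0]
    # Take the label and the confidence from the same probability vector
    label = model.classes_[proba.argmax()]
    return {"prediction": int(label), "confidence": float(proba.max())}

@app.get("/health")
async def health():
    return {"status": "healthy"}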
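The listing above stops short of /metrics. In practice that endpoint is usually provided by the `prometheus_client` library (a `Histogram` for latency, with `make_asgi_app()` mounted at /metrics). To show what Prometheus actually scrapes, here is a stdlib-only sketch of a latency histogram rendered in the Prometheus text exposition format. The metric name `predict_latency_seconds` and the bucket boundaries are hypothetical, not taken from the project.

```python
import bisect
import threading


class LatencyHistogram:
    """Minimal Prometheus-style histogram (stdlib-only sketch, not prometheus_client)."""

    def __init__(self, name, help_text, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.name = name
        self.help_text = help_text
        self.buckets = sorted(buckets)
        # One slot per finite bucket plus one for +Inf; stored non-cumulatively
        self._counts = [0] * (len(self.buckets) + 1)
        self._sum = 0.0
        self._lock = threading.Lock()

    def observe(self, seconds):
        # bisect_left picks the first bucket whose upper bound is >= the value ("le" semantics)
        idx = bisect.bisect_left(self.buckets, seconds)
        with self._lock:
            self._counts[idx] += 1
            self._sum += seconds

    def exposition(self):
        """Render the histogram in the Prometheus text exposition format."""
        with self._lock:
            counts = list(self._counts)
            total_sum = self._sum
        lines = [f"# HELP {self.name} {self.help_text}", f"# TYPE {self.name} histogram"]
        cumulative = 0
        for bound, count in zip(self.buckets, counts):
            cumulative += count  # Prometheus buckets are cumulative
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{self.name}_sum {total_sum}")
        lines.append(f"{self.name}_count {cumulative}")
        return "\n".join(lines) + "\n"
```

In a FastAPI app, an object like this could back a route such as `@app.get("/metrics", response_class=PlainTextResponse)` (from `fastapi.responses`), with `/predict` timing itself via `time.perf_counter()` and calling `observe()`. For anything beyond a demo, `prometheus_client` handles this, including thread safety across workers, so treat the class above as an illustration of the format only.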
      

Packaging It Up with Docker

I spent an embarrassing amount of time debugging "works on my machine" issues before I committed to doing Docker properly. Multi-stage builds, non-root user, pinned versions — the usual stuff. But honestly, getting the health check right was the part that tripped me up. The orchestrator needs to know when your container is actually ready, not just running.
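Slim Python base images often ship without `curl`, so one way to give Docker's `HEALTHCHECK` something to run is a tiny stdlib script. This is a sketch under assumptions: the file name `healthcheck.py`, port 8000, and the `/health` response shape `{"status": "healthy"}` (which matches the FastAPI snippet above) are all illustrative choices, not the project's actual setup.

```python
import json
import urllib.error
import urllib.request

# Hypothetical default: the FastAPI service listening on port 8000 inside the container
DEFAULT_URL = "http://127.0.0.1:8000/health"


def check_health(url=DEFAULT_URL, timeout=2.0):
    """Return True only if the endpoint answers 200 with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            body = json.loads(resp.read().decode("utf-8"))
            return isinstance(body, dict) and body.get("status") == "healthy"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx (HTTPError), or a non-JSON body: unhealthy
        return False


def main(url=DEFAULT_URL):
    """Exit code for Docker: 0 = healthy, 1 = unhealthy."""
    return 0 if check_health(url) else 1
```

A Dockerfile could then use something like `HEALTHCHECK --interval=30s --timeout=3s --start-period=30s CMD python -c "import sys, healthcheck; sys.exit(healthcheck.main())"`, where `--start-period` gives a slow model load time before failures count. Note that Kubernetes ignores the Docker `HEALTHCHECK` instruction and relies on its own probes, so this mainly matters for plain `docker run` or Compose.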

Deploying on Kubernetes

Yes, Kubernetes for a personal project is overkill. I know. But the point was to prove that the manifests work in a real orchestration environment. I ran everything on Minikube locally, and the same YAML files should carry over to a cloud cluster with few, if any, changes. Resource limits, readiness probes, rolling updates: all of it. It forced me to think about failure modes I would have ignored otherwise.
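The project's manifests aren't shown here, so the following is a hypothetical sketch of a Deployment that combines the three pieces named above: resource limits, readiness (and liveness) probes, and a rolling-update strategy. It builds the manifest as a Python dict because `kubectl apply` accepts JSON as well as YAML and the structure maps one-to-one. The app name, image, port, resource sizes, and probe timings are all illustrative assumptions.

```python
import json


def build_deployment(name, image, port=8000, replicas=2):
    """Build a Kubernetes Deployment manifest as a plain dict (all values hypothetical)."""
    probe = {"httpGet": {"path": "/health", "port": port}, "periodSeconds": 10}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            # Rolling update: add one new pod at a time, never drop below the desired count
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            # Requests drive scheduling; limits cap what the container may use
                            "resources": {
                                "requests": {"cpu": "250m", "memory": "512Mi"},
                                "limits": {"cpu": "1", "memory": "1Gi"},
                            },
                            # Readiness gates traffic; liveness restarts a wedged container
                            "readinessProbe": {**probe, "initialDelaySeconds": 5},
                            "livenessProbe": {**probe, "initialDelaySeconds": 30, "failureThreshold": 3},
                        }
                    ]
                },
            },
        },
    }
```

Usage might look like `print(json.dumps(build_deployment("mlops-api", "registry.example.com/mlops-api:latest"), indent=2))`, piped into `kubectl apply -f -`. Setting `maxUnavailable: 0` trades rollout speed for never serving below full capacity, which only works if the readiness probe reflects when the model is actually loaded.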

Monitoring with Prometheus & Grafana

This was the layer I almost skipped, and I'm glad I didn't. The first time I saw my p95 latency spike on the Grafana dashboard during a load test, I caught a memory leak in the inference code that I never would have found through regular testing. Prometheus scrapes the /metrics endpoint every 15 seconds. I track latency percentiles, prediction distribution (for drift), error rates, and throughput.
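The post doesn't say how the prediction distribution is turned into a drift signal. One common choice, swapped in here purely as an illustration, is the population stability index (PSI): compare each class's share of predictions in a baseline window (say, validation-time predictions) against a recent window. The function names and the `eps` smoothing are assumptions of this sketch.

```python
import math
from collections import Counter


def prediction_shares(predictions, classes):
    """Fraction of predictions that fall in each class, in a fixed class order."""
    counts = Counter(predictions)
    total = len(predictions)
    return [counts[c] / total for c in classes]


def population_stability_index(baseline, current, classes, eps=1e-6):
    """PSI = sum over classes of (cur - base) * ln(cur / base).

    eps keeps a class that is empty in one window from producing log(0)
    or a division by zero.
    """
    base = prediction_shares(baseline, classes)
    cur = prediction_shares(current, classes)
    psi = 0.0
    for b, c in zip(base, cur):
        b, c = max(b, eps), max(c, eps)
        psi += (c - b) * math.log(c / b)
    return psi
```

For example, if the baseline predicts class 1 for 30% of requests and the recent window predicts it for 60%, the PSI works out to roughly 0.38. A commonly quoted rule of thumb treats values above about 0.25 as a significant shift, but the threshold is a heuristic and worth tuning against your own traffic.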

CI/CD with GitHub Actions

Push to main, and the pipeline takes over: lint, test, build the Docker image, push to registry, deploy to Kubernetes. No SSH-ing into servers, no manual deploys. If something breaks in the test suite, the deployment just doesn't happen. It took a full afternoon to get the workflow right, but now I don't think about it at all.
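Unit tests gate the deploy, but they can't confirm that the rolled-out pods actually answer. A post-deploy smoke test is one way to close that gap; the sketch below posts a single sample to /predict and checks the response shape from the FastAPI snippet above. The sample values, the `smoke_test` name, and the idea of running it as a final workflow step are assumptions, not part of the described pipeline.

```python
import json
import urllib.error
import urllib.request

# Hypothetical sample payload matching the PredictRequest fields
SAMPLE = {"age": 54, "cholesterol": 230.0, "blood_pressure": 130.0}


def smoke_test(base_url, timeout=5.0):
    """POST one sample to /predict and return a list of problems (empty means pass)."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(SAMPLE).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return [f"request failed: {exc}"]
    if not isinstance(body, dict):
        return ["response body is not a JSON object"]
    problems = []
    if not isinstance(body.get("prediction"), int):
        problems.append("prediction is missing or not an int")
    conf = body.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence is missing or outside [0, 1]")
    return problems
```

Run as the last step of the job, a non-empty result can be turned into a non-zero exit code, which fails the workflow; pairing that with `kubectl rollout undo deployment/<name>` is one way to roll back automatically.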

What Actually Surprised Me

The individual pieces weren't that hard. What caught me off guard was getting them to play nicely together. A misconfigured health check that made Kubernetes restart my pod in a loop. A Prometheus scrape interval that was too aggressive for a cold-start service. The lessons that stuck: treat observability as a first-class concern from day one, and invest in experiment tracking early — it pays off the moment a deployed model starts acting weird.
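One common way a health check produces a restart loop (not necessarily the exact misconfiguration hit here) is a liveness probe that gives up before a slow cold start, such as loading a large model, has finished. The arithmetic is easy to sanity-check. The sketch below uses the same approximation the Kubernetes docs use for startup probes (failureThreshold × periodSeconds, plus any initial delay) and ignores per-probe timeouts; the function names and the 45-second cold start are hypothetical.

```python
def liveness_window_seconds(initial_delay, period, failure_threshold):
    """Approximate time before the kubelet restarts a container that never passes
    its liveness probe: initial delay plus failure_threshold probe periods."""
    return initial_delay + failure_threshold * period


def will_restart_loop(cold_start_seconds, initial_delay, period, failure_threshold):
    """True if the container is likely killed before it finishes starting up."""
    return cold_start_seconds >= liveness_window_seconds(initial_delay, period, failure_threshold)
```

With the Kubernetes defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3), the window is about 30 seconds, so a service that needs roughly 45 seconds to load its model gets restarted before it ever becomes healthy, over and over. A startupProbe with, say, failureThreshold 30 and periodSeconds 10 allows up to about 300 seconds of startup before liveness checks take over.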
