I counted once — there are something like 47 MLOps tools out there, and every single one claims to be the only thing you'll ever need. I spent way too long evaluating options before I realised the best way to pick a stack is to just start building and swap things out when they hurt. So here's what I actually use, after going through that process.
- Core rule: don't adopt a tool until you're actually feeling the pain it solves
- MLflow wins over W&B and Neptune for self-hosted projects: free, solid UI, covers tracking + registry
- Prefect over Airflow for small ML pipelines (5–10 steps): decorator-based, no infrastructure overhead
- Start with MLflow + Prefect + FastAPI + Docker + GitHub Actions; add Prometheus/Grafana at deploy time
- Bring in Kubernetes when you need it, not before
My One Rule: Don't Add Things Until They Hurt
I've watched teams adopt Airflow for a three-step pipeline, or spend two weeks setting up elaborate monitoring that nobody ever checks. Every tool on this list earned its spot by solving a specific problem I was already feeling. If I hadn't felt the pain yet, it didn't make the cut.
The failure mode I see constantly: people building the stack they think they'll need at scale before they have anything running. You end up maintaining infrastructure complexity instead of shipping. My starting point for any new project is MLflow + Prefect + FastAPI + Docker + GitHub Actions. Everything else gets added when the absence of it starts costing me time.
Experiment Tracking: MLflow
I tried Weights & Biases first — it has a better UI, better visualisations, and a nicer API. But it's per-seat pricing for anything serious, and for personal and small-team projects that math doesn't work. Neptune has the same problem. MLflow is free, self-hosted, and the UI is good enough. It covers tracking, the model registry, artifact storage, and run comparison. Nothing flashy, but it doesn't get in your way either.
The model registry is the part most people ignore and then regret. The moment you have more than one deployed version of a model — a champion and a challenger, or a rollback candidate — you need something tracking which artifact corresponds to which production deployment. MLflow's registry handles that cleanly.
Where it falls short: LLM work. MLflow doesn't handle prompt versioning or chain tracing well. I've started using LangSmith for anything involving LangChain or LangGraph agents — it's built specifically for that problem and it shows.
Orchestration: Prefect
I used Airflow at Classplus and it's fine for large, genuinely complex DAGs with dozens of tasks and external system dependencies. But for ML pipelines with five to ten Python steps, spinning up an Airflow installation is way more operational burden than the pipeline justifies. You need a metadata database, a scheduler process, workers — and then you spend time debugging Airflow instead of debugging your pipeline.
Prefect works by decorating your existing Python functions. Add @flow and @task, and you get scheduling, retries, failure notifications, and a UI that shows run history. The local dev experience is the main win: I can test a flow end to end on my laptop without spinning up any infrastructure. That alone saves hours per week.
Serving: FastAPI
I started with Flask in 2021 and switched to FastAPI in 2023. Flask is fine for quick internal tools, but for any service that someone else is hitting — or that needs to validate incoming data — you end up bolting on marshmallow or cerberus and manually writing error response formatting anyway. FastAPI gives you Pydantic validation, auto-generated OpenAPI docs, proper async support, and clean error handling out of the box. The auto-docs matter more than you'd expect: every time a teammate or interviewer asks "what does your API accept?", I just send them the /docs URL.
Containers: Docker
The numpy version mismatch incident is what converted me. Shipped a model that worked perfectly on my laptop, failed silently on the deployment server — not with an error, just wrong predictions, because float precision differed between numpy 1.23 and 1.24 in one edge case in predict_proba. Zero logs. After that, everything goes in a container, no exceptions, every dependency pinned to the exact version.
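Pinning only helps if the running environment actually matches the pins. One cheap guard is a startup check like the sketch below, which uses only the standard library. The function name `find_version_drift` and the version strings in the usage note are hypothetical, chosen for illustration.

```python
from importlib import metadata


def find_version_drift(pins, get_version=metadata.version):
    """Return {package: (pinned, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    `get_version` is injectable so the check can be tested without
    touching the real environment.
    """
    drift = {}
    for package, pinned in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            drift[package] = (pinned, installed)
    return drift
```

At service startup you could call `find_version_drift({"numpy": "1.23.5"})` and refuse to load the model if the result is non-empty, which turns a silent numerical difference into a loud failure.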
Multi-stage builds are worth learning. The pattern: use a full Python image with build tools to install dependencies, then copy only the installed packages and app code into a slim runtime image. Takes a 2.1GB image to 680MB in a typical ML project. Smaller images pull faster, which matters in CI and in cold-start scenarios on Kubernetes.
Container Orchestration: Kubernetes
The learning curve is steep and I won't pretend otherwise. The YAML is verbose. The mental model for networking takes time to build. The first time you misconfigure a readiness probe and watch your pod restart 40 times in ten minutes, it's genuinely demoralising.
But once you're past that, you get rolling deploys with zero downtime, pod auto-scaling based on CPU or custom metrics, health checks built into the platform, and real infrastructure portability: the same manifests that run on Minikube locally work on EKS or GKE with few or no changes (storage classes and ingress setup are the usual exceptions). For anything that needs to stay up and handle traffic spikes, that's worth the upfront investment. For a prototype or a pipeline that runs on a schedule, it probably isn't; Prefect + a single server is simpler.
Monitoring: Prometheus + Grafana
These two have been around for years and they work. The FastAPI service exposes /metrics using prometheus-fastapi-instrumentator — five lines of code and you get request counts, latency histograms, and in-progress request tracking automatically. Prometheus scrapes it every 15 seconds. Grafana reads Prometheus as a data source and you build dashboards from there.
Beyond the automatic metrics, I add two custom ones on every ML service: prediction class distribution and model load time. Class distribution lets you catch silent drift: if the model starts returning one class 95% of the time, you want to know before the downstream system does. Load time tracks how long the model takes to initialise. It creeps up as you add preprocessing steps, so watching it catches problems early. The dashboards live in version control as JSON, so setting up monitoring on a new project takes about ten minutes.
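The class-distribution idea is simple enough to sketch without any monitoring library. The version below is standard-library only and is not the Prometheus wiring itself; the class name `ClassShareMonitor`, the window size, and the 0.95 threshold are illustrative assumptions. In a real service the share would more likely be exported as a metric and alerted on in Grafana.

```python
from collections import Counter, deque


class ClassShareMonitor:
    """Track the share of each predicted class over the last `window` predictions."""

    def __init__(self, window=1000, max_share=0.95):
        self.recent = deque(maxlen=window)  # old predictions fall off the left
        self.max_share = max_share

    def record(self, predicted_class):
        self.recent.append(predicted_class)

    def dominant(self):
        """Return (class, share) for the most frequent class, or None if empty."""
        if not self.recent:
            return None
        label, count = Counter(self.recent).most_common(1)[0]
        return label, count / len(self.recent)

    def is_skewed(self):
        # Only judge a full window, to avoid alarms on the first few requests.
        if len(self.recent) < self.recent.maxlen:
            return False
        return self.dominant()[1] >= self.max_share
```

Calling `record()` once per prediction and checking `is_skewed()` on a timer is enough to notice a model that has collapsed onto a single class.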
CI/CD: GitHub Actions
Free, lives next to the code, no infrastructure to maintain. The YAML syntax has some rough edges: the difference between `run:` and `uses:` trips everyone up at first, and secret injection into Docker builds requires a specific pattern that isn't obvious. But the ecosystem of pre-built actions covers almost everything: flake8 linting, black format checks, pytest, Docker build and push, kubectl apply. Once the workflow file is working, I rarely touch it again.
My standard ML pipeline workflow has three jobs: test (lint + unit tests), build (Docker build + push to registry), deploy (kubectl apply to the cluster). Deploy only runs if test and build pass, and only on pushes to main. That's it. The simplicity is the point.
What I Deliberately Left Out
DVC for data versioning — I evaluated it and it solves a real problem, but my datasets so far have been small enough to keep in S3 with MLflow artifact tracking. When a dataset is large enough that I need reproducible data versions separately from model versions, I'll add it.
Seldon Core and KServe for model serving — both are powerful, both add significant complexity, and FastAPI with a custom /predict endpoint does 95% of what I need without the Kubernetes operator overhead. If I'm serving dozens of models on shared infrastructure, that calculation changes.
Feast for feature stores — same logic. The moment I have features being computed in real time and reused across multiple models, a feature store makes sense. Until then, it's premature.
What I'm Poking At Next
Ray Serve keeps coming up in conversations about high-throughput inference — specifically for batching requests automatically to improve GPU utilisation. I haven't hit that problem yet but it's clearly the right tool for it. LangSmith is already in regular use for my LLM work. OpenTelemetry is interesting for standardised observability across services — right now my Prometheus setup is slightly different across projects and that inconsistency is annoying.
If I Had to Summarise
Start with MLflow, Prefect, FastAPI, Docker, and GitHub Actions. Add Prometheus and Grafana when you actually deploy something. Bring in Kubernetes when you need rolling deploys or auto-scaling. Add everything else only when the absence of it starts costing you real time. The goal is a stack you can reason about, not one that covers every possible future problem.
AI Systems Engineer at HCLTech · M.Tech AI/ML, BITS Pilani
Building agentic AI systems, LLM pipelines, and production ML infrastructure. 3+ years shipping AI at scale.