12 minute read

Deploy Production-Like AI Inference Infrastructure on Your Laptop – In Eight Minutes

Prefer video? Watch the full CNCF Tampa Bay presentation below, or continue reading for the deep-dive.

95% of AI Pilots Fail to Reach Production

According to a 2025 study from MIT’s NANDA Initiative, ninety-five percent of generative AI pilots fail to reach production and deliver measurable business impact. The researchers interviewed 150 business leaders, surveyed 350 employees, and analyzed 300 public AI deployments to reach that conclusion.

Nineteen out of twenty. These aren’t weekend hobby projects. These are funded initiatives with real budgets, real teams, and real executive sponsorship.

The MIT study found that the problem isn’t the AI models themselves. The models work fine. The problem is what they call the “learning gap”–AI tools that don’t adapt to enterprise workflows, that don’t integrate into existing systems, and that become static science projects rather than evolving production systems.

Infrastructure complexity is a major contributor to this gap. When it takes a week just to stand up the development environment, teams lose momentum before they can even start building. That’s exactly the problem I experienced firsthand, and that’s exactly the problem this post addresses.

My Week-Long “Quick Start”

I’m an experienced Kubernetes engineer. It still took me a full week to get my first successful KServe inference running.

Day one and two: cert-manager race conditions. Day three: Istio version mismatches. Day four: KServe webhook timeouts. Day five: GPU not detected. Day six: model finally loads. Day seven: first successful inference.

Each day brought a new category of problem. Each fix required hours of debugging GitHub issues and Stack Overflow. The quick start guides assume everything works the first time. It never does.

That week of pain is why we built kFabric.

The Deployment Gap: POC vs Production-Like Development

The gap between proof-of-concept and production-like development is significant. A POC requires three things: a Jupyter notebook, a single GPU, and code that works on your local machine. Most data scientists can get a model working in a notebook in hours.

Production-like development requires containerization, GPU scheduling, load balancing, monitoring and alerts, rollback capability, security patterns, and scaling policies. Each of those items represents days or weeks of work if you’re building from scratch. The cumulative complexity is what causes teams to stall. They get the POC working, present it to stakeholders, everyone gets excited, and then the project dies when it comes time to operationalize it.

Why KServe?

The Configuration Burden

Deploying a single model manually on Kubernetes without a platform like KServe requires a Deployment YAML, Service YAML, Ingress YAML, ConfigMaps, Secrets, ServiceAccount, RBAC rules, GPU resource limits, health checks, and Prometheus annotations. Add it up and you’re looking at 500+ lines of YAML spread across a dozen files, all of which need to be correct and consistent with each other.

KServe reduces this to approximately twenty lines of YAML in a single InferenceService manifest:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-small
  namespace: model-serving
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: hf://Qwen/Qwen2.5-0.5B-Instruct
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          nvidia.com/gpu: "1"

You declare what model you want to run. KServe handles the operational details.

KServe Architecture

The InferenceService is the core abstraction. When you create one, you’re declaring your intent: I want this model to be served with these characteristics. KServe’s controller watches for InferenceService resources and reconciles the cluster state to match your declaration.

The Predictor component handles model loading, inference serving, and health reporting. KServe integrates with Istio for traffic routing–giving you traffic splitting for canary deployments, circuit breaking, and request metrics–and with Prometheus for observability out of the box.

RawDeployment vs Serverless

KServe supports two deployment modes. Serverless mode uses Knative to provide scale-to-zero capability. RawDeployment mode uses standard Kubernetes Deployments–your model runs continuously, always ready without cold start delay.

We chose RawDeployment for kFabric v1 because it’s simpler. Knative adds five or more additional components, each with its own configuration and failure modes. When you’re trying to get started quickly, that complexity is a barrier. Knative support is coming in kFabric v2.

Both modes use the same InferenceService abstraction. When we add Knative support, your existing manifests will continue to work.

When to Use KServe (And When Not To)

I want to be direct about when KServe is the right choice.

Use KServe when:

  • You’re already running on Kubernetes
  • You need to serve multiple model formats on a unified platform
  • You need autoscaling that responds to inference load
  • You want vendor independence
  • Data residency matters and you can’t send data to a cloud provider

Don’t use KServe when:

  • You have a single model and no existing Kubernetes infrastructure
  • Cloud-managed services like SageMaker or Vertex AI meet your compliance requirements
  • Your team lacks Kubernetes skills and doesn’t have time to develop them

If SageMaker works for you, use SageMaker. There’s no prize for doing things the hard way. KServe wins when you need control, but control comes with responsibility.

The Manual Way: What Took Me a Week

The “Quick Start” Reality

Thirteen steps minimum, each with its own failure modes:

  1. Start minikube with GPU flags
  2. Configure NVIDIA container toolkit inside the minikube VM
  3. Install Cert-Manager, wait, verify CRDs
  4. Install Istio base, wait, verify
  5. Install Istiod, wait, verify
  6. Install KServe CRDs, wait, verify
  7. Install KServe controller, wait, verify
  8. Install NVIDIA device plugin, wait
  9. Label GPU nodes
  10. Create namespaces
  11. Deploy model, wait 5+ minutes
  12. Debug why it’s not working
  13. Repeat steps 3-12 several times

Each wait-and-verify step takes two to five minutes. Each failure means starting over because the state is corrupted.

The Commands Wall

This is the happy path–no errors, no retries, just the infrastructure setup:

# Just the infrastructure (not including model deployment)
minikube start --driver=docker --memory=32768 --cpus=8 --gpus=all
minikube ssh "sudo nvidia-ctk runtime configure --runtime=docker"
minikube ssh "sudo systemctl restart docker"
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager --namespace cert-manager \
  --create-namespace --set installCRDs=true
kubectl wait --for=condition=Available deployment/cert-manager -n cert-manager
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm install istio-base istio/base -n istio-system --create-namespace
helm install istiod istio/istiod -n istio-system
kubectl wait --for=condition=Available deployment/istiod -n istio-system
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd
helm install kserve oci://ghcr.io/kserve/charts/kserve \
  --set kserve.controller.deploymentMode=RawDeployment
kubectl apply -f nvidia-device-plugin.yaml
# ... and we haven't even deployed a model yet

In practice, each of these commands can fail. Cert-manager webhook not ready when Istio tries to install. Istio CRDs not registered when KServe tries to start. KServe controller crashlooping because of a version mismatch. I’ve seen all of these failures and more.

Debugging the Hard Way

When something goes wrong, there’s no unified view:

kubectl get pods -A | grep -v Running
kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n istio-system deploy/istiod
kubectl logs -n kserve deploy/kserve-controller-manager
kubectl get mutatingwebhookconfigurations
kubectl describe node | grep -A5 "Allocatable"
kubectl get events -A --sort-by='.lastTimestamp'

Each namespace has its own logs, its own events, its own problems. No correlation between a failure in one component and its root cause in another. You’re jumping between terminal windows, copying error messages into search engines, hoping someone else posted a solution.

This is why ninety-five percent of AI pilots fail to reach production. Infrastructure complexity consumes time and energy that should be spent on the actual AI application.

The kFabric Way: 8 Minutes, One Command

DevStack for Cloud-Native AI

If you’ve ever worked with OpenStack, you might remember what it was like before DevStack. Setting up OpenStack manually was a multi-day ordeal. Then DevStack collapsed all of that into a single command.

kFabric is DevStack for cloud-native AI. One command to stand up the entire stack: KServe, Istio, cert-manager, NVIDIA device plugin, Prometheus, Grafana, DCGM exporter. All configured correctly, in the right order, with health checks between each step.

$ kfabric cluster start

That’s it. Everything I showed in the manual section–all thirteen steps, all the waiting, all the verification–handled automatically.

The command starts minikube with correct GPU configuration, installs cert-manager and waits for readiness, installs Istio and waits for readiness, installs KServe and waits for readiness, deploys the monitoring stack, labels nodes, and creates namespaces. When it finishes, approximately eight minutes later, you have a fully functional stack ready for model deployment.

Deploy and Query

With the cluster running, deploying a model is one additional command:

# Deploy a model
$ kfabric deploy --models qwen-small --wait
✓ Model qwen-small deployed and ready

# Query it (OpenAI-compatible API)
$ kfabric query --model qwen-small --prompt "What is Kubernetes?"
Kubernetes is an open-source container orchestration platform...

# List available models
$ kfabric list
Available: qwen-small, qwen-medium, tinyllama, smollm2, phi2
Deployed:  qwen-small (Ready)

The API is OpenAI-compatible. If you have existing code that works with GPT-4 or other OpenAI models, you can point it at your local kFabric endpoint with minimal changes.

Three commands total. Start the cluster, deploy a model, query it. That’s the entire workflow.

Pre-Configured Models

kFabric comes with five models tested on consumer GPUs with 6GB VRAM:

Model Parameters VRAM Download Size Use Case
qwen-small 0.5B ~1GB ~1GB Quick testing, low latency
qwen-medium 1.5B ~3GB ~3GB Balanced performance
tinyllama 1.1B ~2.5GB ~2.2GB Lightweight inference
smollm2 1.7B ~3.5GB ~3.4GB Efficient small model
phi2 2.7B ~5.5GB ~5.5GB Higher quality, more resources

Start with qwen-small for quick testing, then move to larger models as needed.

The Key Contrast

Manual kFabric
13+ steps 1 command
500+ lines of YAML 0 lines of YAML
Week of debugging 8 minutes
Race conditions Automatic ordering
No monitoring Full GPU observability
“Works on my machine” Reproducible every time

GPU Monitoring: What Separates a Toy Demo from Real Infrastructure

Why GPU Monitoring Matters

Standard Kubernetes monitoring tools don’t see your GPU at all. You could be at ninety-five percent VRAM utilization, about to trigger an out-of-memory error, and your standard dashboards would show green. You could be thermal throttling and your standard alerts wouldn’t fire.

NVIDIA’s DCGM (Data Center GPU Manager) Exporter bridges this gap. It runs as a DaemonSet on GPU nodes and exposes GPU metrics in Prometheus format. kFabric installs it automatically.

The Monitoring Stack

kFabric deploys three monitoring components that work together:

  • Prometheus: Metrics collection and alerting engine. Scrapes metrics from all components on a regular interval, evaluates alerting rules, sends notifications.
  • Grafana: Visualization layer with pre-built dashboards for GPU and inference metrics.
  • DCGM Exporter: GPU metrics source collecting utilization, memory usage, temperature, and power consumption.

All three are pre-configured to work together. No manual setup required.

GPU Metrics That Matter

Five DCGM metrics matter most for inference workloads:

Metric What It Tells You
DCGM_FI_DEV_GPU_UTIL Percentage of GPU compute being used. Low during inference? Bottleneck is elsewhere.
DCGM_FI_DEV_FB_USED GPU memory allocated. Critical for OOM risk.
DCGM_FI_DEV_FB_FREE Available GPU memory. Headroom for larger batches.
DCGM_FI_DEV_GPU_TEMP Temperature in Celsius. Above 85C, thermal throttling begins.
DCGM_FI_DEV_POWER_USAGE Power consumption. Indicates actual GPU activity vs idle waiting.

Pre-Configured Alerting

kFabric includes Prometheus alerting rules based on real operational experience:

Condition Severity Action
GPU Memory > 90% Warning Scale or optimize
GPU Memory > 95% Critical Immediate attention
GPU Temperature > 80C Warning Check cooling
GPU Temperature > 85C Critical Throttling imminent
Inference Error Rate > 5% Warning Check model health
Inference Latency p99 > 5s Warning Performance issue

These integrate with Slack, PagerDuty, email, or any webhook endpoint. The thresholds are based on operational experience–ninety percent memory is when you should start worrying, eighty-five degrees is when GPUs actually start throttling.

Access the Grafana dashboard with kfabric dashboard–it opens your browser directly, no port-forwarding configuration needed.

kFabric Architecture Under the Hood

kFabric consists of three minikube addons and a CLI:

┌─────────────────────────────────────────────┐
│              User Workstation                │
│   ┌─────────────┐                           │
│   │  kfabric CLI │  Model deploy/query/mgmt │
│   └──────┬──────┘                           │
│          │ kubectl / kubernetes API          │
│          ▼                                   │
├─────────────────────────────────────────────┤
│          Minikube Cluster (Docker Driver)    │
│                                              │
│   kfabric-bootstrap addon                    │
│   ├── Cert-Manager v1.16.1                  │
│   ├── Istio v1.22.0                         │
│   ├── KServe v0.15.0 (RawDeployment)       │
│   └── NVIDIA Device Plugin v0.14.1         │
│                                              │
│   kfabric-model addon                        │
│   └── model-serving namespace               │
│       ├── ConfigMap: 5 pre-configured models│
│       └── InferenceServices: user-deployed  │
│                                              │
│   kfabric-monitoring addon                   │
│   ├── Prometheus v2.51.0                    │
│   ├── Grafana v10.4.0                       │
│   └── DCGM Exporter v4.5.1                 │
└─────────────────────────────────────────────┘

The bootstrap addon uses a Kubernetes Job to orchestrate Helm-based installations in dependency order. Sequential installation adds approximately two minutes compared to parallel, but eliminates entire categories of race condition failures. A cleanup phase before KServe installation ensures idempotent behavior–running the installer multiple times produces the same result.

Production-Like, Not Production

I want to be explicit about this distinction. kFabric is not production infrastructure. It’s production-like infrastructure for development. You use the same tools, patterns, and workflows you’ll use in production–on your laptop.

Path to Production

Development (kFabric) Production (You Add)
Single node minikube Multi-node cluster
Self-signed certs Real TLS certificates
No authentication OAuth2/OIDC
Default network Network policies
Local storage Persistent volumes
Consumer GPU Data center GPUs (A100, H100)

kFabric teaches you the patterns. Production applies them at scale. The InferenceService manifests you create in development work in production. The Grafana dashboards you use in development work in production. The mental model you build in development transfers directly.

Multi-Model Serving

You can run multiple models on shared infrastructure. Each model is deployed as a separate InferenceService. The primary constraint is total GPU memory–on a 6GB consumer GPU, you could run qwen-small and tinyllama together, but phi2 needs almost all available memory.

For enterprise GPUs like the A100, NVIDIA provides Multi-Instance GPU (MIG), which partitions a single physical GPU into multiple isolated instances. Each can run a separate model. Models are isolated at the namespace level, and you route requests by model name using /v1/models/{model-name}.

Requirements

  • Linux with NVIDIA GPU (6GB+ VRAM recommended)
  • Docker
  • minikube
  • NVIDIA drivers installed
  • 8 CPUs, 32GB RAM for comfortable operation

Try It

kFabric is open source. Check out the GitHub repo and let me know what you think. Feedback on what to prioritize next is genuinely useful.

What features would you like to see? Autoscaling with Knative? Multi-GPU support? Model versioning? CI/CD integration? What inference services should be supported beyond vLLM? What platforms should we target beyond local minikube?

Watch on YouTube View kFabric on GitHub

Resources


This post is based on the February 2026 CNCF Tampa Bay Community Group presentation “Fast Track AI Deployment: Master KServe in an Hour!” If you’re dealing with KServe deployment pain, I’d love to hear your experience. Reach out on LinkedIn or drop a comment below.