K8sGPT + kubectl-ai: Let AI Diagnose and Fix Your Kubernetes Cluster Issues (2025 Guide)
The Problem With Kubernetes Troubleshooting Today
You run kubectl get pods and see this:
NAME READY STATUS RESTARTS AGE
payment-api-7d9f8c-xk2p9 0/1 CrashLoopBackOff 8 12m
checkout-svc-6b4d9-mn3q7 0/1 OOMKilled 3 5m
frontend-deploy-5f7b-p9w2l 0/1 ImagePullBackOff 0 2m
Three broken pods. Three completely different root causes. Now begins the ritual:
kubectl describe pod payment-api-7d9f8c-xk2p9
kubectl logs payment-api-7d9f8c-xk2p9 --previous
kubectl get events --sort-by=.metadata.creationTimestamp
# ... scroll through hundreds of lines ...
# ... open browser, search Stack Overflow ...
# ... 45 minutes later, you find the culprit
Sound familiar?
Here's the reality: Kubernetes is powerful, but its error messages are written for machines. As a human engineer, you spend more time translating cluster state than actually fixing it. And when you're on-call at 2 AM, that translation cost is brutal.
AI changes this equation entirely.
Two tools, K8sGPT and kubectl-ai, bring LLM intelligence directly into your Kubernetes workflow. They read your cluster state, understand what's wrong, and explain it in plain English. One is your AI diagnostician. The other is your AI co-pilot. Together, they make you dramatically faster at Kubernetes operations.
Let's build both into your workflow today.
What You'll Learn
What K8sGPT and kubectl-ai are (and the difference between them)
Install both tools in under 5 minutes
Real-world diagnosis of CrashLoopBackOff, OOMKilled, ImagePullBackOff, and Pending pods
Run K8sGPT as a Kubernetes operator for continuous scanning
Use kubectl-ai for natural-language cluster operations
Pro tips, filters, and AI model selection
When to trust AI suggestions, and when not to
Prerequisites: A running Kubernetes cluster (minikube, kind, EKS, AKS, or GKE), kubectl installed and configured, basic Kubernetes knowledge.
Part 1 - K8sGPT: Your AI Cluster Diagnostician
What Is K8sGPT?
K8sGPT is an open-source CNCF sandbox project that scans your Kubernetes cluster, identifies issues across all resource types, and uses an LLM backend to explain those issues in plain English, along with concrete steps to fix them.
Think of it as hiring an experienced SRE to audit your cluster every time you run a scan. Except it's free, takes 3 seconds, and never gets tired.
"K8sGPT is a tool for scanning your Kubernetes clusters, diagnosing and triaging issues in simple English. It has SRE experience codified into its analyzers." (K8sGPT project docs)
What K8sGPT analyzes:
Pods (Pending, CrashLoop, OOMKilled, ImagePullBackOff)
Deployments and ReplicaSets
Services (missing endpoints, wrong selectors)
PersistentVolumeClaims (unbound, wrong storage class)
Ingress (misconfigured backends)
Nodes (not ready, disk pressure, memory pressure)
ConfigMaps and Secrets (missing references)
RBAC (missing permissions)
NetworkPolicies
HorizontalPodAutoscalers
How it works (in 3 steps):
Your Cluster
     │
     ▼
K8sGPT CLI
     ├── Built-in SRE Analyzers (20+ types)
     │     → finds structured findings
     │
     └──► LLM Backend (OpenAI / Gemini / Ollama / Bedrock)
               ▼
     Plain-English Explanation + Exact Fix Steps
Without --explain: K8sGPT runs its built-in analyzers and surfaces problems. No API key needed.
With --explain: It sends the findings to your chosen LLM and returns AI-enriched root cause analysis. Add --anonymize to mask resource names before they are sent.
Install K8sGPT
Linux / WSL:
# Method 1: .deb package (recommended for Ubuntu/Debian)
curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/latest/download/k8sgpt_amd64.deb
sudo dpkg -i k8sgpt_amd64.deb
# Method 2: Direct binary
curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/latest/download/k8sgpt_Linux_x86_64.tar.gz
tar -xzf k8sgpt_Linux_x86_64.tar.gz
sudo mv k8sgpt /usr/local/bin/
macOS:
brew tap k8sgpt-ai/k8sgpt
brew install k8sgpt
Windows (via Chocolatey):
choco install k8sgpt
Verify installation:
k8sgpt version
# k8sgpt: v0.3.xx (linux/amd64), built at ...
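Before moving on, it can help to confirm that the prerequisites are actually on your PATH. The sketch below is a small preflight helper; the function name `missing_tools` is hypothetical and not part of K8sGPT or kubectl.

```bash
# missing_tools: print the name of every command that is not on the PATH.
# Hypothetical helper for illustration; it relies only on POSIX `command -v`.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which of the tools used in this guide still need installing.
missing=$(missing_tools kubectl k8sgpt)
if [ -n "$missing" ]; then
  echo "Not installed yet:" $missing
else
  echo "kubectl and k8sgpt are both available."
fi
```

The same helper can be reused later in the guide by passing other tool names, for example `missing_tools helm kubectl-ai`.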
Connect K8sGPT to an AI Backend
K8sGPT supports multiple AI providers. Here's how to set up each:
Option 1: OpenAI (most capable)
# Generate API key: https://platform.openai.com/api-keys
k8sgpt auth add --backend openai --model gpt-4o-mini
# Paste your OpenAI API key when prompted
Option 2: Google Gemini (free tier available)
# Get API key: https://aistudio.google.com/apikey
k8sgpt auth add --backend google --model gemini-1.5-flash
# Paste your Gemini API key when prompted (the googlevertexai backend is for Vertex AI projects instead)
Option 3: Local model with Ollama (100% free, no API key, runs offline)
# First install Ollama: https://ollama.ai
ollama pull llama3
# Then connect K8sGPT to local Ollama
k8sgpt auth add --backend localai --baseurl http://localhost:11434/v1 --model llama3
Pro tip: For production or sensitive clusters, use Ollama (local model) so that no cluster data leaves your infrastructure. For development clusters where you want the best analysis, use gpt-4o-mini: it's cheap and very capable.
Verify your auth setup:
k8sgpt auth list
# Active: openai
Your First Cluster Scan
Basic scan (no AI, just pattern matching)
k8sgpt analyze
Example output:
0/3 found
No issues found
Or if there are problems:
0: Pod default/nginx-7d9f8c-xk2p9
Error: Back-off restarting failed container
1: Service production/payment-svc
Error: Service has no endpoints, expected label selector app=payment
AI-enriched scan (with explanations and fixes)
k8sgpt analyze --explain
Example output with AI enrichment:
100% |████████████████| (3/3, 8 it/min)
AI Provider: openai
0: Pod default/payment-api-7d9f8c-xk2p9 (Deployment/payment-api)
Error: Back-off 1m20s restarting failed container=api pod=payment-api-7d9f8c-xk2p9_default
AI Analysis:
The "payment-api" container is repeatedly failing and Kubernetes is using
exponential back-off before restarting it (currently waiting 1m20s).
Root cause: The container is crashing at startup. This is most commonly
caused by:
1. A missing environment variable or secret the application depends on
2. A database connection string that's unreachable
3. An application startup error (check logs with --previous flag)
Recommended fix:
kubectl logs payment-api-7d9f8c-xk2p9 -n default --previous
kubectl describe pod payment-api-7d9f8c-xk2p9 -n default | grep -A10 Events
# Check if required secrets exist:
kubectl get secrets -n default
That's the power of K8sGPT. Instead of staring at raw Kubernetes events, you get root cause analysis and exact commands in one shot.
Real-World Diagnosis: 4 Common Kubernetes Errors
Let's walk through how K8sGPT handles the four most common Kubernetes failures.
Scenario 1: CrashLoopBackOff
What you see:
payment-api-7d9f8c 0/1 CrashLoopBackOff 8 12m
K8sGPT output:
k8sgpt analyze --explain --namespace production
Error: Back-off restarting failed container=payment-api
AI Analysis:
CrashLoopBackOff means the container starts, immediately crashes, and
Kubernetes keeps retrying with increasing delays (10s → 20s → 40s → 5min max).
Most likely causes:
1. Application error on startup โ missing config, bad DB URL, code bug
2. Missing ConfigMap or Secret referenced in the pod spec
3. Liveness probe configured too aggressively (killing healthy pods)
Immediate debugging steps:
# Get crash logs from the previous (dead) container
kubectl logs payment-api-7d9f8c -n production --previous
# Check Kubernetes events for the pod
kubectl describe pod payment-api-7d9f8c -n production | tail -30
# Check if referenced secrets/configmaps exist
kubectl get configmap payment-config -n production
kubectl get secret payment-secrets -n production
What makes this powerful: K8sGPT doesn't just tell you the pod is crashing. It lists the most common root causes and gives you the exact commands to run next.
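The retry delays mentioned in the analysis follow a simple doubling rule. As a rough sketch, assuming a 10-second initial delay that doubles after each crash and is capped at 5 minutes (the function name `backoff_delay` is made up for illustration):

```bash
# backoff_delay N: approximate wait in seconds before restart number N,
# assuming a 10s base delay that doubles each time, capped at 300s.
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

for restart in 1 2 3 4 5 6; do
  echo "restart $restart: wait $(backoff_delay "$restart")s"
done
```

Under these assumptions the cap is reached by the sixth restart, which is why a pod with 8 restarts appears to sit idle for minutes at a time between attempts.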
Scenario 2: OOMKilled (Out of Memory)
What you see:
checkout-svc-6b4d9 0/1 OOMKilled 3 5m
K8sGPT output:
Error: OOMKilled - container exceeded memory limit
AI Analysis:
The "checkout-svc" container was killed by the Linux OOM killer because it
exceeded its configured memory limit.
Exit code 137 (128 + signal 9) confirms a SIGKILL from memory pressure.
Diagnostic steps:
# See the current memory limit
kubectl describe pod checkout-svc-6b4d9 -n production | grep -A3 Limits
# Check current memory usage of the restarted container (requires metrics-server)
kubectl top pod checkout-svc-6b4d9 -n production
Recommended fix:
# Increase memory limit in your deployment spec:
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi" # Increase this value
If memory usage keeps growing, the application may have a memory leak.
Consider adding Prometheus memory metrics and alerting on sustained growth.
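The exit-code arithmetic in that analysis generalises: by shell convention, a process killed by signal N reports exit code 128 + N. A minimal sketch (the helper name `signal_from_exit_code` is hypothetical):

```bash
# signal_from_exit_code CODE: print the signal number encoded in an exit code,
# or 0 when the code does not indicate death by signal.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}

signal_from_exit_code 137   # prints 9  (SIGKILL, the signal the OOM killer sends)
signal_from_exit_code 143   # prints 15 (SIGTERM, a normal graceful stop)
signal_from_exit_code 1     # prints 0  (ordinary application error, no signal)

# To read the last exit code from a live pod (needs cluster access):
#   kubectl get pod <pod> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Note that 137 only tells you the process received SIGKILL. It is the OOMKilled reason reported alongside it that ties the kill to the memory limit.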
Scenario 3: ImagePullBackOff
What you see:
frontend-deploy-5f7b 0/1 ImagePullBackOff 0 2m
K8sGPT output:
Error: Back-off pulling image "myrepo.io/frontend:v2.4.1"
AI Analysis:
Kubernetes cannot pull the container image from the registry.
Most common causes:
1. Image tag doesn't exist (typo, wrong version, deleted tag)
2. Private registry requires authentication credentials (imagePullSecret missing)
3. Registry rate limiting (especially Docker Hub anonymous pulls, historically 100 per 6 hours)
4. Registry is temporarily unreachable
Diagnostic commands:
# Check the exact image name and tag
kubectl describe pod frontend-deploy-5f7b | grep Image:
# Check if imagePullSecrets are configured on the pod
kubectl get pod frontend-deploy-5f7b -o jsonpath='{.spec.imagePullSecrets}'
# Test image pull manually on a node (if you have access; use crictl pull on containerd nodes)
docker pull myrepo.io/frontend:v2.4.1
Fix for missing pull credentials:
kubectl create secret docker-registry regcred \
--docker-server=myrepo.io \
--docker-username=<username> \
--docker-password=<password> \
--namespace=production
# Then reference it in your deployment's pod template:
spec:
  template:
    spec:
      imagePullSecrets:
        - name: regcred
Scenario 4: Service with No Endpoints (silently broken)
This one is sneaky: your pods are Running but the service isn't routing traffic. K8sGPT catches it.
K8sGPT output:
Error: Service production/api-gateway has no endpoints
AI Analysis:
The Service "api-gateway" exists but has no pods matching its selector,
so no traffic can be routed to your application.
This is a label mismatch: the Service's selector doesn't match
the labels on your pods.
Diagnostic steps:
# Check what the service is selecting
kubectl describe service api-gateway -n production | grep Selector
# Check what labels your pods actually have
kubectl get pods -n production --show-labels
Fix: Make sure your deployment's pod template labels match the service selector.
Service selector: app=api-gateway, version=v2
Pod labels: app=api-gateway, version=v1 ← MISMATCH!
Either update the deployment labels or the service selector to match.
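The matching rule itself is simple: a Service selects a pod only when every key=value pair in its selector also appears in the pod's labels. A sketch of that check in plain shell, assuming both sides are given as comma-separated key=value lists (the function name `selector_matches` is hypothetical):

```bash
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Succeeds (exit 0) only if every selector pair appears among the labels.
selector_matches() {
  sel=$(printf '%s' "$1" | tr -d ' ')
  labels=",$(printf '%s' "$2" | tr -d ' '),"
  old_ifs=$IFS
  IFS=,
  for pair in $sel; do
    case "$labels" in
      *",$pair,"*) ;;                  # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;     # first missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

if selector_matches "app=api-gateway,version=v2" "app=api-gateway,version=v1"; then
  echo "match: the Service would route to this pod"
else
  echo "mismatch: the Service would ignore this pod"
fi
```

Against a live cluster, `kubectl get pods -n production -l app=api-gateway,version=v2` applies the same rule server-side: an empty result means the selector matches nothing.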
Scan Specific Namespaces and Resource Types
# Scan only a specific namespace
k8sgpt analyze --explain --namespace production
# Scan only Pods and Services (skip other resource types)
k8sgpt analyze --explain --filter Pod,Service
# List all available filters (resource types K8sGPT can analyze)
k8sgpt filters list
# Output as JSON (for CI/CD pipelines or dashboards)
k8sgpt analyze --explain --output json
# Save analysis to a file
k8sgpt analyze --explain --output json > cluster-health-$(date +%Y%m%d).json
Run K8sGPT as a Kubernetes Operator (Continuous Monitoring)
The CLI is great for ad-hoc scans. But for production, you want K8sGPT running continuously, scanning your cluster every few minutes and storing results as Custom Resources.
Install the K8sGPT operator:
helm repo add k8sgpt-operator https://charts.k8sgpt.ai/
helm repo update
helm install k8sgpt-operator k8sgpt-operator/k8sgpt-operator \
--namespace k8sgpt-operator-system \
--create-namespace
Create your OpenAI secret:
kubectl create secret generic k8sgpt-secret \
--from-literal=openai-api-key=sk-your-key-here \
--namespace k8sgpt-operator-system
Configure K8sGPT:
# k8sgpt-cr.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-prod
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: gpt-4o-mini
    backend: openai
    secret:
      name: k8sgpt-secret
      key: openai-api-key
  noCache: false
  version: v0.3.41
  filters:
    - Pod
    - Service
    - Deployment
    - PersistentVolumeClaim
    - Node
  # Scan interval: every 10 minutes (confirm this field for your operator version with: kubectl explain k8sgpt.spec)
  interval: 10m
kubectl apply -f k8sgpt-cr.yaml
View AI-analyzed results:
# List all findings
kubectl get results -n k8sgpt-operator-system
# Get detailed AI analysis of a specific finding
kubectl describe result <result-name> -n k8sgpt-operator-system
Now your cluster is continuously monitored. Every issue gets an AI-generated explanation stored as a Kubernetes Custom Resource: searchable, auditable, and automatable.
Part 2 - kubectl-ai: Natural Language for Your Cluster
What Is kubectl-ai?
K8sGPT diagnoses your cluster. kubectl-ai is different: it's an AI co-pilot built by Google engineers that lets you operate your cluster using natural language instead of complex kubectl commands.
Instead of:
kubectl get pods --all-namespaces --field-selector=status.phase!=Running -o json | jq '.items[] | {name: .metadata.name, namespace: .metadata.namespace, status: .status.phase}'
You just type:
show me all pods that are not running across all namespaces
kubectl-ai figures out the exact commands, runs them against your live cluster, and explains the output in plain English.
Built by Google engineers and open-sourced at GoogleCloudPlatform/kubectl-ai, this tool is redefining how DevOps teams interact with Kubernetes day-to-day.
Install kubectl-ai
Linux & macOS (one-line install):
curl -sSL https://raw.githubusercontent.com/GoogleCloudPlatform/kubectl-ai/main/install-kubectl-ai.sh | bash
Via Krew (kubectl plugin manager):
kubectl krew install ai
Verify:
kubectl-ai --version
Connect kubectl-ai to an AI Backend
Using Gemini (recommended; free API tier available):
export GEMINI_API_KEY=your-api-key-here
kubectl-ai --model gemini-2.0-flash
Using OpenAI:
export OPENAI_API_KEY=sk-your-key-here
kubectl-ai --llm-provider=openai --model gpt-4o-mini
Using Ollama (local, free):
# Assumes Ollama is running locally and the model has been pulled (ollama pull llama3)
kubectl-ai --llm-provider ollama --model llama3 --enable-tool-use-shim
kubectl-ai in Action: Real Examples
Once you launch kubectl-ai, you enter an interactive session with your cluster:
kubectl-ai --model gemini-2.0-flash
Hey there, what can I help you with today?
Example 1: Check cluster health at a glance
>>> show me the health status of all nodes and flag any issues
Running: kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP
master-01 Ready control 12d v1.29.2 10.0.0.1
worker-01 Ready <none> 12d v1.29.2 10.0.0.2
worker-02 Ready <none> 12d v1.29.2 10.0.0.3
worker-03 Ready <none> 8d v1.29.2 10.0.0.4
All 4 nodes are Ready. worker-03 is newer (8 days old vs 12 days) but
healthy. No scheduling issues detected. Cluster looks clean.
Example 2: Find all broken pods
>>> find all pods that are not running and tell me why they might be failing
Running: kubectl get pods --all-namespaces --field-selector=status.phase!=Running
NAMESPACE NAME STATUS RESTARTS
production payment-api-7d9f8c-xk2p9 CrashLoopBackOff 8
staging test-deploy-6b4d9-mn3q7 OOMKilled 3
I found 2 unhealthy pods:
1. payment-api (production namespace): Crash loop with 8 restarts;
application is failing at startup. Check logs with:
kubectl logs payment-api-7d9f8c-xk2p9 -n production --previous
2. test-deploy (staging namespace): OOMKilled; container is exceeding
its memory limit. Current limit may be too low for the workload.
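One caveat when reproducing this by hand: a pod whose container is in CrashLoopBackOff usually still reports status.phase=Running, so a phase-based field selector can miss it. A sketch that filters on the STATUS column instead, assuming the default column layout of `kubectl get pods --all-namespaces` (the function name `unhealthy_pods` is hypothetical):

```bash
# unhealthy_pods: read `kubectl get pods --all-namespaces` output on stdin and
# keep the header plus every row whose STATUS (4th column) is neither
# Running nor Completed.
unhealthy_pods() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Demonstration on captured output; against a live cluster use:
#   kubectl get pods --all-namespaces | unhealthy_pods
printf '%s\n' \
  'NAMESPACE   NAME                      READY  STATUS            RESTARTS  AGE' \
  'production  payment-api-7d9f8c-xk2p9  0/1    CrashLoopBackOff  8         12m' \
  'production  api-gateway-78b4-k9x2p    1/1    Running           0         3d' \
  'staging     test-deploy-6b4d9-mn3q7   0/1    OOMKilled         3         5m' \
  | unhealthy_pods
```

This is deliberately coarse: a pod that shows Running but 0/1 READY would slip through, so treat it as a first pass rather than a full health check.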
Example 3: Scale a deployment
>>> scale the api-gateway deployment in production to 5 replicas
I'll scale the api-gateway deployment to 5 replicas.
Running: kubectl scale deployment api-gateway -n production --replicas=5
deployment.apps/api-gateway scaled
Done. The api-gateway deployment now has 5 replicas. You can verify with:
kubectl get deployment api-gateway -n production
Example 4: Debug a specific pod
>>> the payment-api pod keeps crashing. what's in the logs?
Running: kubectl logs payment-api-7d9f8c-xk2p9 -n production --previous --tail=50
Error: DATABASE_URL environment variable is not set
panic: runtime error: invalid memory address or nil pointer dereference
The application crashed because the DATABASE_URL environment variable is
missing. The app is trying to connect to a database on startup but can't
find the connection string.
Fix: Check if the ConfigMap or Secret containing DATABASE_URL exists:
kubectl get configmap payment-config -n production
kubectl get secret payment-db-secret -n production
Then verify the pod spec references it correctly under env or envFrom.
Example 5: Check resource usage
>>> which pods are using the most memory in the production namespace?
Running: kubectl top pods -n production --sort-by=memory
NAME CPU(cores) MEMORY(bytes)
payment-api-7d9f8c-xk2p9 450m 892Mi
checkout-svc-6b4d9-mn3q7 120m 678Mi
frontend-deploy-5f7b-p9w2l 45m 234Mi
api-gateway-78b4-k9x2p 89m 189Mi
payment-api is using the most memory at 892Mi. If its limit is 1Gi,
it's at 87%, close to triggering an OOMKill. Consider increasing
the memory limit or investigating for memory leaks.
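A headroom figure like this is worth checking by hand, since an LLM can get arithmetic wrong. A sketch using shell integer arithmetic, assuming both values are expressed in Mi and that 1Gi is 1024Mi (the helper name `mem_pct` is hypothetical):

```bash
# mem_pct USED_MI LIMIT_MI: usage as a whole-number percentage of the limit
# (integer division, so the result is rounded down).
mem_pct() {
  echo $(( $1 * 100 / $2 ))
}

mem_pct 892 1024   # payment-api against a 1Gi limit, prints 87
mem_pct 678 1024   # checkout-svc against the same limit, prints 66
```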
Example 6: Complex multi-step operation
>>> rollback the payment-api deployment to the previous version
I'll check the rollout history first, then roll back.
Running: kubectl rollout history deployment/payment-api -n production
REVISION CHANGE-CAUSE
1 Initial deployment
2 Update to v2.4.1 - added payment gateway integration
3 Update to v2.4.2 - hotfix for null pointer exception
Running: kubectl rollout undo deployment/payment-api -n production
deployment.apps/payment-api rolled back
The payment-api deployment has been rolled back from v2.4.2 to v2.4.1.
Verify the rollout is healthy:
kubectl rollout status deployment/payment-api -n production
K8sGPT vs kubectl-ai: Which One Should You Use?
| Feature | K8sGPT | kubectl-ai |
|---|---|---|
| Primary purpose | Diagnose cluster issues | Operate cluster with natural language |
| Works without AI key | Yes (pattern-based scan) | No |
| Continuous monitoring | Yes (Kubernetes operator) | No (interactive sessions) |
| Best for | Finding what's broken | Doing things faster |
| Output style | Structured report | Conversational |
| CNCF project | Yes (sandbox) | No (Google open-source) |
| Auto-remediation | Partial (operator mode) | Yes (executes commands) |
| Use in CI/CD | Yes (JSON output) | Not natively |
The answer: use both. They complement each other perfectly.
Use K8sGPT for cluster health scanning, monitoring, and CI/CD gate checks
Use kubectl-ai for daily operations, debugging sessions, and team onboarding
Pro Tips for AI-Assisted Kubernetes Operations
1. Use K8sGPT without AI first
k8sgpt analyze # No API key needed, pure pattern matching
This alone catches many common issues. Only add --explain when you need deeper analysis.
2. Filter to specific namespaces in production
# Skip kube-system (too noisy); --namespace takes one value, so scan each namespace in turn
for ns in production staging; do k8sgpt analyze --explain --namespace "$ns"; done
3. Run K8sGPT in your CI/CD pipeline
Add a health check gate to your deployment pipeline:
# .github/workflows/deploy.yml
- name: K8sGPT cluster health check
  run: |
    k8sgpt analyze --output json --namespace production > health.json
    # Fail the pipeline if K8sGPT reports any problems.
    # The JSON report carries a problem count and a results list, not a severity field.
    PROBLEMS=$(jq '.problems // 0' health.json)
    if [ "$PROBLEMS" -gt 0 ]; then
      echo "Cluster issues found. Blocking deployment."
      jq '.results' health.json
      exit 1
    fi
4. Use kubectl-ai for team onboarding
Junior engineers who don't yet know kubectl syntax can use kubectl-ai to learn by doing:
>>> how do I check why a pod is not starting?
kubectl-ai will show them the exact commands and explain what each one does. It's like having a senior engineer pairing with every new team member.
5. Never blindly execute AI suggestions on production
Both tools generate commands. Always review before executing, especially:
Anything with delete, drain, or cordon
Changes to resource limits in production
RBAC modifications
Use --dry-run=client on destructive operations:
kubectl delete pod <name> --dry-run=client
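One way to make that review step harder to skip is a tiny guard that flags risky verbs before anything runs. The sketch below is a hypothetical helper, not a feature of kubectl, K8sGPT, or kubectl-ai, and it only covers the three verbs named in this tip:

```bash
# needs_review CMD...: succeed if the command line contains delete, drain,
# or cordon as a standalone word. Hypothetical helper for illustration.
needs_review() {
  case " $* " in
    *" delete "*|*" drain "*|*" cordon "*) return 0 ;;
    *) return 1 ;;
  esac
}

for cmd in "kubectl get pods -n production" "kubectl delete pod payment-api-7d9f8c-xk2p9"; do
  if needs_review "$cmd"; then
    echo "REVIEW FIRST: $cmd"
  else
    echo "ok to run:    $cmd"
  fi
done
```

Matching on whole words means `uncordon` is not flagged while `cordon` is. Extending the list to cover RBAC or resource-limit changes would need a look at the manifest itself rather than the verb.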
6. Combine K8sGPT with Prometheus for full context
# K8sGPT can integrate with Prometheus for richer analysis
k8sgpt integration activate prometheus
# --with-doc adds the official Kubernetes documentation for each finding
k8sgpt analyze --explain --with-doc
The Complete AI-Assisted K8s Troubleshooting Workflow
Here's how to combine both tools for an efficient incident response workflow:
Incident Alert Fires
        │
        ▼
k8sgpt analyze --explain --namespace <affected>
        │
        ├── Issue found? → Read AI explanation + recommended fix
        │
        └── Need more context?
                │
                ▼
        kubectl-ai interactive session
                │
                ├── "show me logs for <pod>"
                ├── "check events in namespace <x>"
                ├── "what resources is <pod> using?"
                └── "rollback <deployment> to previous version"
                        │
                        ▼
        Apply fix + verify with k8sgpt rescan
Summary
Kubernetes troubleshooting used to mean hours of manual investigation. With AI tools, it's a different story:
K8sGPT:
✓ Scans your entire cluster in seconds
✓ Detects 20+ issue types: pods, services, PVCs, nodes, RBAC
✓ AI-enriched root cause analysis in plain English
✓ Runs as a continuous Kubernetes operator
✓ Integrates with CI/CD pipelines via JSON output
✓ Works with OpenAI, Gemini, or local models (Ollama)
kubectl-ai:
✓ Natural language → kubectl commands
✓ Built by Google engineers, backed by Gemini/OpenAI
✓ Interactive sessions, great for debugging and daily ops
✓ Explains command output in plain English
✓ Dramatically lowers the Kubernetes learning curve
Together, these tools don't replace your Kubernetes skills; they amplify them. You bring the engineering judgment. The AI handles the translation layer between you and the cluster.
Written by the CloudDevOpsHub team: practical DevOps and Cloud AI training for engineers who want to work on real production systems. Follow CloudDevOpsHub on Hashnode for weekly guides on Kubernetes, Multi-Cloud, and AI-powered DevOps.