AI Agents for Kubernetes: Automating Cluster Management
Mei-Lin Zhang
ML researcher focused on autonomous agents and multi-agent systems.
Kubernetes clusters generate an extraordinary volume of operational data — pod events, resource metrics, logs, audit trails, network policies, and cost telemetry. The gap between the data available and a human operator's ability to act on it in real time is exactly where AI agents are finding traction. Not as replacements for SRE judgment, but as force multipliers that compress the time between "something changed" and "I know what to do about it."
This guide covers the practical reality of using AI agents across the Kubernetes lifecycle: deployment, scaling, troubleshooting, and cost optimization. Every tool mentioned here exists today and has real adoption. Every limitation is real too.
The Current Landscape: What "AI Agent" Actually Means Here
Before diving in, let's be precise about terminology. In the Kubernetes management context, "AI agents" fall into three categories:
| Category | What It Does | Examples |
|---|---|---|
| LLM-powered CLI assistants | Natural language → kubectl/YAML generation | kubectl-ai, k8sgpt, Kubiya |
| ML-driven optimization engines | Continuous tuning of resource requests, scaling parameters | StormForge, Cast AI, Karpenter (with AI layers) |
| Autonomous remediation agents | Detect issues → diagnose → take action (with guardrails) | Robusta, Shoreline.io, PagerDuty AIOps + K8s integrations |
These are fundamentally different tools solving different problems. A YAML generation assistant won't optimize your node costs. A cost optimization engine won't triage a CrashLoopBackOff at 3 AM. SRE teams need to understand where each category fits in their workflow.
Deployment: AI-Assisted Manifest Generation and Validation
The Problem
Writing correct Kubernetes manifests is deceptively complex. A deployment YAML that works in dev can be a security liability in production — missing resource limits, no pod disruption budgets, absent network policies, containers running as root. The combinatorial explosion of best practices across security, reliability, and operability makes "correct by default" manifests genuinely hard.
Tools and Workflows
kubectl-ai, an open-source kubectl plugin, lets you describe workloads in natural language and generates manifests:
kubectl ai "Create a deployment for a stateless Go API with 3 replicas,
resource limits, health checks, and a pod disruption budget.
It listens on port 8080 and needs access to a Postgres database via
a secret called db-credentials."
This produces a multi-document YAML that includes the Deployment, Service, PDB, and references to the Secret. It's a reasonable starting point — emphasis on starting point.
What it gets right:
- Correct structure and API versions
- Generally follows resource limit best practices
- Includes liveness/readiness probes with sensible defaults
Where it falls short:
- Security context is inconsistent — sometimes it runs containers as root
- Network policies are rarely generated unless explicitly requested
- Resource requests/limits are generic (not tuned to your actual workload)
- It has no knowledge of your cluster's admission controllers or OPA/Gatekeeper policies
A more reliable workflow for production deployments:
# Step 1: Generate the base manifest
kubectl ai "Create a production-ready deployment for image: myapp:v2.1.0
with 5 replicas, rolling update strategy, resource limits based on
typical Go web services, and anti-affinity rules."
# Step 2: Validate against your policies
kustomize build ./overlays/production | conftest test -p policy/ -
# Step 3: Dry-run against the cluster
kubectl apply --dry-run=server -f generated-manifest.yaml
# Step 4: AI-assisted review
k8sgpt analyze -f generated-manifest.yaml
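If you want to enforce that sequence rather than rely on memory, the four steps are easy to wire into one gate script. A minimal sketch, assuming kubectl-ai, conftest, and k8sgpt are installed and that policy/ holds your Rego policies; the file name, image tag, and prompt are illustrative:

# validate_generated_manifest.py - chains the four steps above into one gate.
import subprocess
import sys

MANIFEST = "generated-manifest.yaml"   # illustrative output path

def run(cmd, **kwargs):
    """Run a command and fail loudly if it exits non-zero."""
    print(f"$ {' '.join(cmd)}")
    return subprocess.run(cmd, check=True, **kwargs)

# Step 1: generate the base manifest (assumes kubectl-ai prints YAML to stdout)
prompt = ("Create a production-ready deployment for image: myapp:v2.1.0 "
          "with 5 replicas, rolling update strategy, resource limits, "
          "and anti-affinity rules.")
with open(MANIFEST, "w") as f:
    run(["kubectl", "ai", prompt], stdout=f)

# Step 2: validate against OPA/conftest policies
run(["conftest", "test", "-p", "policy/", MANIFEST])

# Step 3: server-side dry-run against the live cluster
run(["kubectl", "apply", "--dry-run=server", "-f", MANIFEST])

# Step 4: AI-assisted review of the manifest itself
run(["k8sgpt", "analyze", "-f", MANIFEST])

print("All gates passed - manifest is ready for human review.")
sys.exit(0)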
Kubiya takes a different approach — it's a conversational AI platform that integrates with your existing toolchain (ArgoCD, Helm, Terraform) and can execute multi-step deployment workflows through natural language commands in Slack or Teams. The key differentiator: it respects your existing RBAC and approval gates rather than generating standalone YAML.
@kubiya Deploy service checkout-api v2.3.1 to staging,
run the smoke test suite, and if it passes, create a PR
to promote to production.
Kubiya translates this into a sequence of API calls to your CI/CD pipeline. It's less "generate YAML" and more "orchestrate your existing automation." For teams with mature GitOps workflows, this is significantly more useful than raw manifest generation.
Honest Assessment
AI-generated manifests are useful for bootstrapping and learning. They are not a substitute for understanding the Kubernetes API. The most dangerous pattern I've seen is teams treating generated YAML as production-ready without review. Use AI to accelerate, not to skip the review step.
Scaling: From Reactive to Predictive
The Problem
The Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) are reactive — they respond to metrics after demand changes. For workloads with predictable patterns (daily traffic cycles, batch job schedules), reactive scaling means you're always either over-provisioned or catching up.
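One way to close that gap without any new tooling is a scheduled pre-scaler that raises the HPA floor shortly before a known peak — crude, but it captures the predictive idea. A minimal sketch; the HPA name, namespace, replica counts, and peak window are all hypothetical:

# prescale.py - schedule-based pre-scaling for a workload with a predictable
# daily peak. Run it from a CronJob shortly before (and after) the busy window.
import subprocess
from datetime import datetime, timezone

HPA_NAME = "checkout-api-hpa"      # hypothetical
NAMESPACE = "production"
PEAK_HOURS = range(8, 20)          # expected busy window, UTC

def set_min_replicas(minimum: int) -> None:
    """Patch the HPA floor so reactive scaling starts from a warm baseline."""
    patch = f'{{"spec":{{"minReplicas":{minimum}}}}}'
    subprocess.run(
        ["kubectl", "patch", "hpa", HPA_NAME, "-n", NAMESPACE,
         "--type=merge", "-p", patch],
        check=True,
    )

hour = datetime.now(timezone.utc).hour
set_min_replicas(10 if hour in PEAK_HOURS else 3)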
Karpenter with Intelligent Provisioning
Karpenter (AWS's open-source node autoscaler, now also available for Azure) isn't purely "AI" in the LLM sense, but it uses sophisticated decision-making logic that goes far beyond the Cluster Autoscaler. It evaluates pending pods, selects optimal node types from the full cloud provider catalog, and can consolidate underutilized nodes.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["5"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 2000Gi
Karpenter's consolidation logic is genuinely intelligent — it will proactively migrate workloads to fewer, right-sized nodes when utilization drops, and it does this with disruption budgets to avoid availability impacts.
StormForge: ML-Driven Resource Optimization
StormForge uses machine learning to analyze historical resource consumption and recommend optimal CPU/memory requests and limits. This is the VPA concept done properly.
The workflow:
# Install the StormForge agent
helm install stormforge oci://registry.stormforge.io/library/stormforge-agent \
--set clientID=<your-id> --set clientSecret=<your-secret>
# Create an optimization experiment
kubectl apply -f - <<EOF
apiVersion: stormforge.io/v1beta1
kind: Recommendation
metadata:
  name: checkout-api-optimization
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
EOF
StormForge observes the workload over a configurable window (typically 14 days), then produces recommendations:
Deployment: checkout-api
Namespace: production
Current → Recommended:
CPU request: 500m → 180m
CPU limit: 1000m → 450m
Memory request: 256Mi → 142Mi
Memory limit: 512Mi → 310Mi
Estimated monthly savings: $2,847
Confidence: 94th percentile (P99 consumption observed at 290m CPU)
The critical nuance: these recommendations are based on your actual workload data, not generic benchmarks. The ML model accounts for temporal patterns, burst behavior, and tail latencies.
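A recommendation still has to land in the Deployment spec, whether that happens automatically or with a human approving the change. A minimal sketch of the manual path, using the numbers from the example output above; the container name is hypothetical:

# apply_recommendation.py - apply rightsized requests/limits to a Deployment.
# Values are copied from the example output above; in practice you would fetch
# them from the recommendation API instead of hard-coding them.
import json
import subprocess

recommendation = {
    "deployment": "checkout-api",
    "namespace": "production",
    "container": "checkout-api",          # hypothetical container name
    "cpu_request": "180m", "cpu_limit": "450m",
    "mem_request": "142Mi", "mem_limit": "310Mi",
}

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": recommendation["container"],
        "resources": {
            "requests": {"cpu": recommendation["cpu_request"],
                         "memory": recommendation["mem_request"]},
            "limits": {"cpu": recommendation["cpu_limit"],
                       "memory": recommendation["mem_limit"]},
        },
    }]}}}
}

subprocess.run(
    ["kubectl", "patch", "deployment", recommendation["deployment"],
     "-n", recommendation["namespace"],
     "--type=strategic", "-p", json.dumps(patch)],
    check=True,
)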
Cast AI: Full-Stack Cost and Scaling Automation
Cast AI goes further than StormForge by optimizing both the pod layer (resource requests) and the infrastructure layer (node selection, spot instance management, cross-cloud rebalancing). It's closer to a fully autonomous platform.
Key capabilities:
- Real-time pod rightsizing — adjusts VPA recommendations continuously
- Spot instance orchestration — automatically diversifies across instance types and availability zones, with fallback to on-demand when spot capacity is unavailable
- Node pool optimization — selects the cheapest node type that satisfies pod scheduling constraints
- Cluster rebalancing — migrates workloads when cheaper infrastructure becomes available
# Cast AI connects to your cluster and begins analysis
castai-connect --cluster-name production-cluster --api-key <key>
# View optimization recommendations
castai recommendations --cluster production-cluster
# Enable autonomous mode (with guardrails)
castai policy set --cluster production-cluster \
--enable-spot \
--max-spot-percentage 70 \
--min-on-demand-nodes 3 \
--cpu-utilization-threshold 70
The Scaling Agent Workflow for SRE Teams
Here's a practical workflow that combines these tools:
- Deploy Karpenter for node-level autoscaling with intelligent instance selection
- Deploy StormForge or Cast AI for continuous resource rightsizing
- Configure HPA with custom metrics (from Prometheus, Datadog, etc.) for pod-level scaling
- Set up cost anomaly alerts to catch runaway scaling (a minimal alerting sketch follows the HPA example below)
# HPA with custom Prometheus metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
The behavior field is critical — without it, HPA will aggressively scale down during brief traffic dips, then struggle to scale back up. AI-driven scaling tools can inform these parameters, but the human SRE still needs to set the guardrails.
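For the last step of the workflow above — cost anomaly alerts — something very simple goes a long way. A minimal sketch that compares per-namespace spend for the last seven days against the seven days before it via the Kubecost allocation API; the Slack webhook and the 30% threshold are placeholders, and the "start,end" RFC3339 window format is an assumption:

# cost_anomaly_alert.py - compare this week's per-namespace spend against the
# previous week and alert on large jumps.
from datetime import datetime, timedelta, timezone
import requests

KUBECOST = "http://kubecost.kubecost.svc:9090/model/allocation"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
THRESHOLD = 1.30   # alert when spend grows more than 30% week over week

def window(days_ago_start: int, days_ago_end: int) -> str:
    """Build a 'start,end' RFC3339 window string for the allocation API."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    start = (now - timedelta(days=days_ago_start)).isoformat()
    end = (now - timedelta(days=days_ago_end)).isoformat()
    return f"{start},{end}"

def spend_by_namespace(win: str) -> dict:
    resp = requests.get(KUBECOST, params={
        "window": win, "aggregate": "namespace", "accumulate": "true"})
    resp.raise_for_status()
    return {ns: d["totalCost"] for ns, d in resp.json()["data"][0].items()}

current = spend_by_namespace(window(7, 0))     # last 7 days
previous = spend_by_namespace(window(14, 7))   # the 7 days before that

alerts = [
    f"{ns}: ${previous[ns]:,.0f} -> ${cost:,.0f}"
    for ns, cost in current.items()
    if previous.get(ns, 0) > 0 and cost / previous[ns] > THRESHOLD
]
if alerts:
    requests.post(SLACK_WEBHOOK, json={
        "text": "Kubecost: weekly spend jumped\n" + "\n".join(alerts)})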
Troubleshooting: Where AI Agents Provide the Most Immediate Value
This is the area with the most mature tooling and the clearest ROI.
k8sgpt: Cluster Diagnostics with LLM Analysis
k8sgpt scans your cluster for common failure patterns and uses an LLM (OpenAI, Azure OpenAI, local models via Ollama, or others) to explain the root cause in plain language.
# Install
brew install k8sgpt
# Analyze the entire cluster
k8sgpt analyze --explain
# Focus on a specific namespace
k8sgpt analyze -n production --explain
# Filter to specific issue types
k8sgpt analyze --filter=Pod,Service,Deployment --explain
Example output:
AI Analysis:
0: Pod checkout-api-7b8f9d6c4-xk2lp in namespace production:
Error: CrashLoopBackOff - Last exit code 1
Root Cause: The container is failing to start because it cannot connect
to the PostgreSQL database. The error in the logs shows:
"dial tcp 10.0.15.87:5432: connect: connection refused"
The database pod (postgres-0) is in a Pending state because it requires
a PersistentVolumeClaim that is bound to a specific availability zone,
but no nodes in that zone have available capacity.
Recommended Actions:
1. Check node capacity in AZ us-east-1a: kubectl get nodes -l
topology.kubernetes.io/zone=us-east-1a
2. Verify PVC status: kubectl get pvc -n production
3. Consider using a StorageClass with volumeBindingMode: WaitForFirstConsumer
This is genuinely useful. The LLM connects the CrashLoopBackOff to the database pod's Pending state — a causal chain that takes experienced SREs a few minutes to trace but can take junior engineers much longer.
Local model option for sensitive clusters:
# Run with Ollama for air-gapped / security-sensitive environments
ollama pull codellama:13b
k8sgpt analyze --explain --backend ollama --model codellama:13b
The analysis quality drops with smaller models, but for common patterns (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending pods), even a 7B model provides useful explanations.
Robusta: Automated Root Cause Analysis and Remediation
Robusta is the most complete "AI agent" platform for Kubernetes troubleshooting. It combines event enrichment, LLM-powered analysis, and optional automated remediation.
# Install Robusta
helm install robusta robusta/robusta --values values.yaml \
--set clusterName=production \
--set robustaApiKey=<key>
# values.yaml - key configuration
sinksConfig:
  - slack_sink:
      name: main_slack
      slack_channel: "#k8s-alerts"
      api_key: ${SLACK_API_KEY}

customPlaybooks:
  # Auto-investigate CrashLoopBackOff pods
  - triggers:
      - on_pod_crash_loop: {}
    actions:
      - pod_graph_enricher:
          resource_type: Memory
          display_limits: true
      - pod_graph_enricher:
          resource_type: CPU
          display_limits: true
      - logs_enricher:
          filter_container: ".*"
          warn_on_long_logs: false
      - ai_diagnosis: {}
    sinks:
      - main_slack
When a pod enters CrashLoopBackOff, Robusta automatically:
- Pulls recent memory and CPU graphs
- Collects container logs
- Sends the enriched context to an LLM for analysis
- Posts the full diagnostic report to Slack
The ai_diagnosis action is the key differentiator. It doesn't just show you metrics and logs — it correlates them and provides a probable root cause with recommended actions.
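The underlying pattern — gather context first, then ask the model to correlate — is worth internalizing even if you never install Robusta. A rough sketch of the same idea with plain kubectl and an LLM (this is not Robusta's implementation, and it assumes an OpenAI API key in the environment):

# diagnose_pod.py - enrich-then-explain: collect a pod's describe output and
# recent logs, then ask an LLM for a probable root cause.
import subprocess
import openai

def kubectl(*args: str) -> str:
    return subprocess.run(["kubectl", *args], capture_output=True,
                          text=True, check=True).stdout

def diagnose(pod: str, namespace: str) -> str:
    describe = kubectl("describe", "pod", pod, "-n", namespace)
    logs = kubectl("logs", pod, "-n", namespace, "--tail=100")
    prompt = (
        "A Kubernetes pod is failing. Based on the describe output and the "
        "recent container logs below, state the most likely root cause and "
        "one or two concrete next steps.\n\n"
        f"--- kubectl describe ---\n{describe}\n--- logs ---\n{logs}"
    )
    resp = openai.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(diagnose("checkout-api-7b8f9d6c4-xk2lp", "production"))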
Automated remediation (use with caution):
# Auto-restart pods stuck in CrashLoopBackOff after AI confirms
# it's likely a transient issue
customPlaybooks:
  - triggers:
      - on_pod_crash_loop:
          restart_count: 5
          rate_limit: 3600
    actions:
      - ai_diagnosis: {}
      - pod_restart:
          # Only restart if AI diagnosis indicates transient issue
          filter_regex: "transient|timeout|connection refused"
    sinks:
      - main_slack
This pattern — AI diagnosis as a gate for automated action — is the responsible way to use AI agents for remediation. The agent doesn't blindly restart; it first diagnoses, and only acts when the diagnosis matches a known-safe pattern.
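Stripped of the Robusta configuration, the gate itself is a few lines of code. A minimal sketch, with an illustrative safe-pattern list:

# gated_restart.py - "AI diagnosis as a gate": only delete (and thereby
# restart) the pod when the diagnosis text matches patterns we have decided
# are safe to auto-remediate. The pattern list is illustrative.
import re
import subprocess

SAFE_PATTERNS = re.compile(r"transient|timeout|connection refused", re.IGNORECASE)

def maybe_restart(pod: str, namespace: str, diagnosis: str) -> bool:
    """Restart the pod only if the diagnosis looks like a known-safe failure."""
    if not SAFE_PATTERNS.search(diagnosis):
        print("Diagnosis not in the safe list - escalating to a human.")
        return False
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)
    return True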
Shoreline.io: Full Incident Automation
Shoreline.io provides a more comprehensive automation framework with an Op Packs concept — pre-built remediation workflows that can be triggered by alerts or natural language commands.
# Natural language incident response
shoreline> What pods are failing in the production namespace?
shoreline> Show me the logs for checkout-api-pod-xk2lp
shoreline> Scale checkout-api deployment to 10 replicas
shoreline> Drain node ip-10-0-1-42 and cordon it
Shoreline's advantage is its "blast radius" controls — every automated action has configurable scope limits and approval requirements. For SRE teams that want to move toward automated incident response without losing control, this is the right architecture.
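The blast-radius idea is also easy to encode yourself, independent of Shoreline. A small sketch of one such guardrail — never drain more than a fixed fraction of Ready nodes in a single automated action; the 10% limit is illustrative:

# blast_radius.py - refuse to drain more than a fixed fraction of Ready nodes
# in one automated action. Not Shoreline's API, just the same principle.
import json
import subprocess

MAX_FRACTION = 0.10   # illustrative: never drain more than 10% of nodes at once

def ready_node_count() -> int:
    out = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
                         capture_output=True, text=True, check=True).stdout
    nodes = json.loads(out)["items"]
    return sum(
        1 for n in nodes
        for c in n["status"]["conditions"]
        if c["type"] == "Ready" and c["status"] == "True"
    )

def drain(nodes_to_drain: list) -> None:
    allowed = max(1, int(ready_node_count() * MAX_FRACTION))
    if len(nodes_to_drain) > allowed:
        raise RuntimeError(
            f"Refusing to drain {len(nodes_to_drain)} nodes; limit is {allowed}.")
    for node in nodes_to_drain:
        subprocess.run(["kubectl", "drain", node, "--ignore-daemonsets",
                        "--delete-emptydir-data"], check=True)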
Cost Optimization: AI-Driven FinOps
The Problem
Most Kubernetes clusters are over-provisioned by 40-60%. Resource requests are set once during initial deployment and never adjusted. Teams provision for peak load and pay for it 24/7.
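To make that concrete, here is a back-of-the-envelope calculation with illustrative numbers — substitute your own request totals and per-core pricing:

# Back-of-the-envelope: what chronic over-provisioning costs. All numbers here
# are illustrative, not measurements from any particular cluster.
requested_cpu_cores = 400          # sum of CPU requests across the cluster
actual_p95_usage_cores = 160       # what the workloads actually use at P95
cost_per_core_month = 30.0         # rough blended $/vCPU-month, varies by cloud

waste_fraction = 1 - actual_p95_usage_cores / requested_cpu_cores
monthly_waste = (requested_cpu_cores - actual_p95_usage_cores) * cost_per_core_month
print(f"Over-provisioned by {waste_fraction:.0%}, "
      f"roughly ${monthly_waste:,.0f}/month in idle CPU requests")
# -> Over-provisioned by 60%, roughly $7,200/month in idle CPU requests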
Kubecost + AI Analysis
Kubecost provides granular cost allocation per namespace, deployment, service, and even per team. It doesn't use AI internally, but it generates the data that AI tools can analyze.
# Install Kubecost
helm install kubecost cost-analyzer \
--repo https://kubecost.github.io/cost-analyzer-helm-chart/ \
--namespace kubecost --create-namespace
# Query cost data via API
curl "http://kubecost.kubecost.svc:9090/model/allocation?\
window=7d&aggregate=namespace&accumulate=true"
Combining Kubecost data with LLM analysis:
import requests
import openai

# Fetch cost data from Kubecost
cost_data = requests.get(
    "http://kubecost.kubecost.svc:9090/model/allocation",
    params={"window": "7d", "aggregate": "deployment"}
).json()

# Format for LLM analysis
summary = []
for deploy, data in cost_data["data"][0].items():
    summary.append(f"- {deploy}: ${data['totalCost']:.2f}/week, "
                   f"CPU efficiency: {data['cpuEfficiency']:.1%}, "
                   f"RAM efficiency: {data['ramEfficiency']:.1%}")

prompt = f"""Analyze this Kubernetes cost data and identify optimization
opportunities. Prioritize by potential savings.

{chr(10).join(summary)}

For each recommendation, specify:
1. The deployment name
2. Current vs recommended resource requests
3. Estimated monthly savings
4. Risk level (low/medium/high)"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
This is a simple pattern, but it works. The LLM can identify patterns across your cost data that are tedious to spot manually — like a deployment with 5% CPU efficiency that's requesting 4 cores, or a namespace that's running 20 replicas of a staging service in production.
The Cost Optimization Agent Loop
Here's a complete workflow that SRE teams can implement:
# CronJob that runs weekly cost analysis
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-optimization-agent
spec:
  schedule: "0 9 * * MON"  # Every Monday at 9 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-agent
          containers:
            - name: agent
              image: your-registry/cost-agent:latest
              command: ["python", "/app/analyze_and_report.py"]
              env:
                - name: KUBECOST_HOST
                  value: "kubecost.kubecost.svc:9090"
                - name: OPENAI_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ai-keys
                      key: openai-key
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: ai-keys
                      key: slack-webhook
          restartPolicy: OnFailure
The agent script:
# analyze_and_report.py
import os
import requests
import json
from datetime import datetime

# Configuration comes from the CronJob's environment (see the manifest above).
# STORMFORGE_TOKEN and CLUSTER_NAME are assumed to be supplied the same way.
KUBECOST_HOST = os.environ["KUBECOST_HOST"]
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK"]
STORMFORGE_TOKEN = os.environ.get("STORMFORGE_TOKEN", "")
CLUSTER_NAME = os.environ.get("CLUSTER_NAME", "production-cluster")

def fetch_cost_data():
    """Fetch allocation data from Kubecost."""
    resp = requests.get(
        f"http://{KUBECOST_HOST}/model/allocation",
        params={
            "window": "7d",
            "aggregate": "controller",
            "accumulate": "true",
            "idle": "false"
        }
    )
    return resp.json()

def generate_rightsizing_recommendations(cost_data):
    """Use StormForge or Cast AI API for ML-based recommendations."""
    # StormForge provides an API for fetching current recommendations
    resp = requests.get(
        "https://api.stormforge.io/v1/recommendations",
        headers={"Authorization": f"Bearer {STORMFORGE_TOKEN}"},
        params={"cluster": CLUSTER_NAME}
    )
    return resp.json()

def correlate_and_report(cost_data, rightsizing_recs):
    """Combine cost data with ML recommendations and post to Slack."""
    # Build comprehensive report
    report = {
        "total_weekly_cost": sum(
            d["totalCost"] for d in cost_data["data"][0].values()
        ),
        "top_spenders": sorted(
            cost_data["data"][0].items(),
            key=lambda x: x[1]["totalCost"],
            reverse=True
        )[:10],
        "rightsizing_opportunities": rightsizing_recs,
        "generated_at": datetime.utcnow().isoformat()
    }
    # Post to Slack
    requests.post(SLACK_WEBHOOK, json={
        "blocks": format_slack_blocks(report)
    })
    return report
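The script references a format_slack_blocks helper and has no entry point. A minimal, hypothetical version of both might look like this:

def format_slack_blocks(report):
    """Very small placeholder formatter - a real one would build richer blocks."""
    lines = [f"Weekly cluster spend: ${report['total_weekly_cost']:,.0f}"]
    lines += [f"- {name}: ${data['totalCost']:,.0f}"
              for name, data in report["top_spenders"]]
    return [{"type": "section",
             "text": {"type": "mrkdwn", "text": "\n".join(lines)}}]

if __name__ == "__main__":
    cost = fetch_cost_data()
    recs = generate_rightsizing_recommendations(cost)
    correlate_and_report(cost, recs)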
Limitations and Honest Concerns
AI agents for Kubernetes management are useful, but they have real limitations that SRE teams need to understand:
Hallucination in Diagnostic Contexts
LLMs will confidently generate plausible-sounding but incorrect root cause analyses. I've seen k8sgpt blame a CrashLoopBackOff on resource limits when the actual cause was a missing ConfigMap. Always verify AI-generated diagnoses against actual logs and events.
Context Window Limitations
Most LLM-based tools can't analyze your entire cluster state. They work with a subset of data — recent events, specific pod logs, current resource metrics. Complex, multi-component failures that span many services often exceed what these tools can reason about.
Security Implications
Sending cluster data to external LLM APIs (OpenAI, Anthropic) means your pod names, log content, and infrastructure details leave your network. For regulated environments, this is often unacceptable. The local model option (Ollama + k8sgpt) mitigates this but at the cost of analysis quality.
The "Automation Bias" Problem
When an AI agent is right 90% of the time, operators stop questioning it. That remaining 10% includes the hard, novel failures that actually cause outages. The most effective SRE teams I've worked with treat AI agents as hypothesis generators, not decision makers.
Cost of the AI Layer Itself
StormForge, Cast AI, and Robusta all have non-trivial licensing costs. Kubecost's free tier is useful but limited. Run the ROI calculation: if your cluster spend is under $10K/month, the optimization tools may cost more than they save.
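A quick, illustrative version of that ROI check (all numbers made up):

# Illustrative ROI check before buying an optimization platform. Substitute
# your own cluster spend and the vendor's actual quote.
monthly_cluster_spend = 8_000        # $/month on nodes
expected_savings_rate = 0.30         # optimistic 30% reduction from rightsizing
tool_cost_per_month = 2_500          # licensing / SaaS fee

net_benefit = monthly_cluster_spend * expected_savings_rate - tool_cost_per_month
print(f"Net monthly benefit: ${net_benefit:,.0f}")
# -> Net monthly benefit: $-100  (at this spend level the tool costs more than it saves)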
Recommended Stack by Team Maturity
| Maturity Level | Recommended Tools | Focus Area |
|---|---|---|
| Early (small team, <50 pods) | k8sgpt + kubectl-ai | Learning, manifest quality, basic diagnostics |
| Growing (dedicated SRE, 50-500 pods) | Robusta + Kubecost + HPA | Automated alert enrichment, cost visibility |
| Scaling (SRE team, 500+ pods) | Cast AI/StormForge + Karpenter + Robusta | Full optimization loop, predictive scaling |
| Enterprise (multi-cluster, regulated) | Shoreline.io + Cast AI + internal LLM | Controlled automation, air-gapped AI, compliance |
The Bottom Line
AI agents for Kubernetes management are not hype — they solve real problems, particularly in troubleshooting speed and cost optimization. But they're tools, not magic. The best outcomes I've seen come from teams that:
- Start with diagnostics (k8sgpt, Robusta) — lowest risk, highest immediate value
- Add cost optimization (Kubecost + ML rightsizing) — measurable ROI
- Implement scaling intelligence (Karpenter + Cast AI) — requires more operational maturity
- Approach autonomous remediation last — with strict guardrails and human approval gates
The SRE role isn't being replaced by AI agents. It's being elevated. The agents handle the pattern-matching and data correlation; the human handles the judgment calls, the novel failures, and the architectural decisions that prevent those failures from recurring.