Kubernetes RCA agent that sandboxes its own AI with OPA Gatekeeper. MCP-ready for any client.
Config is the same across clients — only the file and path differ.
{
  "mcpServers": {
    "io-github-jdoornink-k8gents": {
      "command": "<see-readme>",
      "args": []
    }
  }
}
An Autonomous AI-Driven Root Cause Analysis Agent for Kubernetes
A Human-in-the-Loop RCA Agent for Kubernetes — LLM Diagnosis, OPA-Bounded Remediation
K8gentS is a Kubernetes diagnostic agent built around a single premise: an LLM is well-suited to reason about cluster failures, but unfit to act on them unsupervised. It continuously watches the cluster event stream, runs Warning-class events through a Gemini-powered root cause analysis (RCA) pipeline, and surfaces the top hypotheses with resolution steps to Slack. Every remediation is human-in-the-loop (HITL) — a reviewer approves or rejects via interactive buttons before anything mutates. Approved fixes execute inside an ephemeral Job whose blast radius is bounded by an OPA Gatekeeper admission policy at the API server, independent of RBAC. The reasoning is non-deterministic by design; the controls around it are not.
The goal: reduce MTTR on diagnosable failures without handing an LLM a kubectl apply.
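The flow described above (filter Warning events, diagnose, wait for a human decision, then execute) can be sketched in a few lines. This is a minimal in-process illustration, not the K8gentS implementation: `Event`, `Proposal`, `Decision`, `stub_diagnose`, `triage`, and `execute` are all hypothetical names, the LLM call is replaced by a stub, and the real system enforces its boundary through Slack approval and an OPA Gatekeeper admission policy rather than a Python exception.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Event:
    type: str    # "Normal" or "Warning", mirroring the Kubernetes event type
    reason: str  # e.g. "BackOff"
    obj: str     # involved object, e.g. "pod/api"


@dataclass
class Proposal:
    event: Event
    hypotheses: List[str]  # ranked root-cause hypotheses
    fix: str               # human-readable remediation step
    decision: Decision = Decision.PENDING


def stub_diagnose(event: Event) -> Proposal:
    """Stand-in for the LLM-backed RCA step (assumption: the real one calls a model)."""
    return Proposal(event, [f"{event.reason} on {event.obj}"], f"restart {event.obj}")


def triage(events: Iterable[Event], diagnose: Callable[[Event], Proposal]) -> List[Proposal]:
    """Only Warning-class events are passed to the diagnose callable."""
    return [diagnose(e) for e in events if e.type == "Warning"]


def execute(proposal: Proposal, run_fix: Callable[[str], str]) -> str:
    """Refuse to run any fix that a reviewer has not explicitly approved."""
    if proposal.decision is not Decision.APPROVED:
        raise PermissionError(
            f"fix for {proposal.event.obj} is {proposal.decision.value}, not approved"
        )
    return run_fix(proposal.fix)
```

The point of the sketch is the shape of the gate: a proposal starts as `PENDING`, and nothing downstream of `execute` is reachable until a reviewer flips it to `APPROVED`.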
Building the Kubernetes side of this is straightforward. The real challenge is making a diagnostic layer trustworthy when the engine behind it is fundamentally non-deterministic.
Traditional observability is built on guarantees. Alerts fire on known thresholds. Dashboards show reproducible numbers. Logs return consistent answers to the same query. SRE success depends on that predictability — it's what makes incident response repeatable and on-call sustainable.
An LLM-driven diagnostic layer breaks that contract. The same pod failure can produce three different plausible explanations across three different runs. Each may be coherent. Each may even be correct under different assumptions. But "plausible" is not the same as "right," and for infrastructure, the gap between them is where outages live.
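One way to make that run-to-run variance visible is to diagnose the same failure several times and measure how often the runs agree. The sketch below is an illustration under loud assumptions: `normalize` and `agreement` are hypothetical helpers, and exact string matching after normalization is a crude proxy, since free-form hypotheses would need some form of semantic grouping. An agreement rate is also not a calibrated confidence; it only shows how stable the model's answer is.

```python
from collections import Counter
from typing import List, Tuple


def normalize(hypothesis: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(hypothesis.lower().split())


def agreement(runs: List[str]) -> Tuple[str, float]:
    """Return the most common top hypothesis and the fraction of runs that produced it."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(normalize(h) for h in runs)
    top, n = counts.most_common(1)[0]
    return top, n / len(runs)
```

A low agreement rate on a given failure is a signal to show the reviewer all candidate explanations rather than a single ranked answer.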
Some of the specific problems I've been working through while building K8gentS:
1. Confidence scoring with an unbounded output space. The agent returns top-3 root causes with confidence metrics, but confidence in what, exactly? The model is not selecting from a fixed set of known failure modes — it's generating free-form hypotheses. A calibrated confidence score needs a reference distribution, and the distribution here is whatever the model happened to produce this run.
2. When to trust reasoning vs. fall back to deterministic checks. Some failures (CrashLoopBackOff, OOMKilled) have well-traveled diagnostic paths, and a deterministic check will be right every time. Others benefit from the model's ability to interpolate across signals. Drawing that line, and doing it at runtime, is non-trivial, and it is where the art really lies.
3. Evaluating an agent that's supposed to find failures you didn't anticipate. The standard ML evaluation approach assumes you know what "correct" looks like. For a diagnostic agent, part of the value is catching novel failure modes — by definition, failures you couldn't pre-enumerate. So how do you decide what is wrong and what is right?
4. The "Tool in the Cluster" problem. How do you monitor the monitor? The agent currently runs as a service inside the cluster it observes, but what if the agent itself causes resource exhaustion or is failing? How can you tell whether the agent is the cause of the issue you are investigating?
5. Determining the right model for this problem. With so many other types of machine learning models available, is a non-deterministic large language model really the right choice, or is another class of model better suited to infrastructure problems?
These are the questions I'm actively working on. If you've solved any of them — or have a sharper framing than I've got — I'd like to hear it.
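For the second question, one possible starting point is a deterministic-first router: try cheap rule-based checks for the well-known failure modes, and only hand the signal to the model when none of them match. The sketch below is an assumption-laden illustration, not the K8gentS design. The `signal` dict uses simplified, hypothetical keys (`last_terminated_reason`, `waiting_reason`, `exit_code`) rather than the real pod status shape, and `llm` is any callable standing in for the model.

```python
from typing import Callable, Optional, Tuple


def check_oom(signal: dict) -> Optional[str]:
    """Match a container whose last termination was an OOM kill."""
    if signal.get("last_terminated_reason") == "OOMKilled":
        return "container exceeded its memory limit (OOMKilled)"
    return None


def check_crashloop(signal: dict) -> Optional[str]:
    """Match a container that is waiting in CrashLoopBackOff."""
    if signal.get("waiting_reason") == "CrashLoopBackOff":
        return f"container is crash-looping (last exit code {signal.get('exit_code')})"
    return None


# Ordered from most to least specific: an OOM-killed container is often
# also in CrashLoopBackOff, and the OOM finding is the more useful one.
CHECKS = [check_oom, check_crashloop]


def diagnose(signal: dict, llm: Callable[[dict], str]) -> Tuple[str, str]:
    """Return (source, finding); the model is consulted only if no rule matches."""
    for check in CHECKS:
        finding = check(signal)
        if finding is not None:
            return ("deterministic", finding)
    return ("llm", llm(signal))
```

Tagging each finding with its source lets a reviewer see at a glance whether an explanation came from a rule or from the model. The hard part named above remains open: a static rule list decides the boundary ahead of time, not at runtime.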
CrashLoopBackOff states, `OOMK