NVIDIA GPU hardware introspection for Kubernetes clusters via MCP
Config is the same across clients — only the file and path differ.
{
"mcpServers": {
"k8s-gpu-mcp": {
"args": [
"-y",
"k8s-gpu-mcp-server@latest"
],
"command": "npx"
}
}
}Are you the author?
Add this badge to your README to show your security score and help users find safe servers.
Just-in-Time SRE Diagnostic Agent for NVIDIA GPU Clusters on Kubernetes
Run this in your terminal to verify the server starts. Then let us know if it worked — your result helps other developers.
npx -y 'k8s-gpu-mcp-server' 2>&1 | head -1 && echo "✓ Server started successfully"
After testing, let us know if it worked:
Five weighted categories — click any category to see the underlying evidence.
No known CVEs.
Checked k8s-gpu-mcp-server against OSV.dev.
Click any tool to inspect its schema.
gpu-health-checkComprehensive GPU health assessment with recommendations
diagnose-xid-errorsAnalyze NVIDIA XID errors with remediation guidance
gpu-triageStandard SRE triage workflow: inventory → health → XID analysis
Be the first to review
Have you used this server?
Share your experience — it helps other developers decide.
Sign in to write a review.
Others in devops / cloud
MCP server for using the GitLab API
MCP Server for GCP environment for interacting with various Observability APIs.
⚡ A Simple / Speedy / Secure Link Shortener with Analytics, 100% run on Cloudflare.
MCP server for Datto SaaS Protection — M365/GWS backups, restores, seats.
MCP Security Weekly
Get CVE alerts and security updates for K8s Gpu Mcp Server and similar servers.
Start a conversation
Ask a question, share a tip, or report an issue.
Sign in to join the discussion.
Just-in-Time SRE Diagnostic Agent for NVIDIA GPU Clusters on Kubernetes
k8s-gpu-mcp-server is an ephemeral diagnostic agent that provides surgical,
real-time NVIDIA GPU hardware introspection for Kubernetes clusters via the
Model Context Protocol (MCP).
Unlike traditional monitoring systems, this agent is designed for AI-assisted troubleshooting by SREs debugging complex hardware failures that standard Kubernetes APIs cannot detect.
Click the button above to install automatically in Cursor.
# Using npx (recommended)
npx k8s-gpu-mcp-server@latest
# Or install globally
npm install -g k8s-gpu-mcp-server
Add to ~/.cursor/mcp.json (Cursor) or VS Code MCP config:
{
"mcpServers": {
"k8s-gpu-mcp": {
"command": "npx",
"args": ["-y", "k8s-gpu-mcp-server@latest"]
}
}
}
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"k8s-gpu-mcp": {
"command": "npx",
"args": ["-y", "k8s-gpu-mcp-server@latest"]
}
}
}
# Clone and build
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent
# Test with mock GPUs (no hardware required)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=mock
# Test with real GPU (requires NVIDIA driver)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=real
# Deploy with Helm OCI (recommended)
helm install k8s-gpu-mcp-server \
oci://ghcr.io/arangogutierrez/charts/k8s-gpu-mcp-server \
--namespace gpu-diagnostics --create-namespace
# Or from local chart
helm install k8s-gpu-mcp-server ./deployment/helm/k8s-gpu-mcp-server \
--namespace gpu-diagnostics --create-namespace
# Find agent pod on target node
NODE_NAME=<node-name>
POD=$(kubectl get pods -n gpu-diagnostics \
-l app.kubernetes.io/name=k8s-gpu-mcp-server \
--field-selector spec.nodeNam
... [View full README on GitHub](https://github.com/ArangoGutierrez/k8s-gpu-mcp-server#readme)