kubernetes-debugging

Name: kubernetes-debugging
Author: notque/vexjoy-agent

$ npx mdskill add notque/vexjoy-agent/kubernetes-debugging

Diagnoses pod failures and networking issues via structured triage.

Resolves crash loops, OOM kills, pending states, and resource pressure.
Integrates with kubectl, CoreDNS, and Kubernetes event APIs.
Loads specific reference guides based on detected failure signals.
Executes commands and surfaces logs directly within the workflow.

SKILL.md

.github/skills/kubernetes-debugging

---
name: kubernetes-debugging
description: "Kubernetes debugging for pod failures and networking."
user-invocable: false
context: fork
agent: kubernetes-helm-engineer
routing:
  triggers:
    - "kubernetes debug"
    - "pod failure"
    - "pod crashloop"
    - "kubectl logs"
    - "OOMKilled"
    - "pod pending"
  category: kubernetes
  pairs_with:
    - kubernetes-security
    - service-health-check
---

# Kubernetes Debugging Skill

Systematic diagnosis of pod failures, networking issues, and resource problems using a structured triage flow: describe, logs, events, exec.

## Reference Loading Table

| Signal | Reference | Size |
|--------|-----------|------|
| CrashLoopBackOff, OOMKilled, config error, health check, liveness probe, ImagePullBackOff, image pull, registry auth, Pending, FailedScheduling, node affinity, taint, PVC | `references/crash-diagnosis.md` | ~140 lines |
| service resolution, DNS, nslookup, CoreDNS, port-forward, NetworkPolicy, ingress, egress | `references/network-debugging.md` | ~50 lines |
| CPU throttling, memory limit, OOMKill, ephemeral storage, DiskPressure, debug container, distroless, kubectl reference, rollout, exec | `references/resource-debugging.md` | ~100 lines |

**Load greedily.** If the user's question touches any signal keyword, load the matching reference before responding. If signals from more than one row match, load every matching reference.
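The greedy rule can be sketched as a small shell function. This is only an illustration: `match_references` is a hypothetical name, the keyword lists are abbreviated from the table above, and matching is plain case-insensitive substring search, so `OOMKilled` also hits the `OOMKill` keyword of the resource row.

```bash
# Hypothetical helper: print every reference whose signal keywords appear
# in the question. Keyword lists are abbreviated from the table above.
match_references() {
  local question
  # Lowercase the question so matching is case-insensitive.
  question=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')

  # Each case block is independent, so several references can be printed.
  case "$question" in
    *crashloopbackoff*|*oomkilled*|*imagepullbackoff*|*pending*|*failedscheduling*|*taint*|*pvc*)
      echo "references/crash-diagnosis.md" ;;
  esac
  case "$question" in
    *dns*|*nslookup*|*coredns*|*port-forward*|*networkpolicy*|*ingress*|*egress*)
      echo "references/network-debugging.md" ;;
  esac
  case "$question" in
    *throttling*|*"memory limit"*|*oomkill*|*diskpressure*|*distroless*|*rollout*)
      echo "references/resource-debugging.md" ;;
  esac
}
```

For example, `match_references "Pod is OOMKilled and DNS lookups fail"` prints all three reference paths, one per line, while a question with no signal keyword prints nothing.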

## Instructions

### Triage Flow

Follow this sequence for every pod or workload issue. Do not skip steps -- many failures (scheduling, image pull, volume mount) are only visible in events and describe output, not in logs, so jumping straight to logs misses them.

Always specify `-n <namespace>` explicitly in every command; never rely on the default context namespace, because the wrong namespace silently returns empty or misleading results.

```bash
# 1. Get an overview of the resource state
kubectl get pods -n <namespace> -o wide

# 2. Describe the resource for events, conditions, and status
kubectl describe pod <pod-name> -n <namespace>

# 3. Check current container logs
kubectl logs <pod-name> -n <namespace> -c <container-name>

# 4. Check previous container logs (critical for CrashLoopBackOff)
# For crashed containers, capture --previous output before deleting or
# recreating the pod: these logs are lost permanently once the pod is gone.
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

# 5. Check namespace events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# 6. If the container is running, exec in for live inspection
kubectl exec -it <pod-name> -n <namespace> -c <container-name> -- /bin/sh
```

Use read-only commands (describe, logs, get) to gather evidence before proposing any modifications. Never suggest changes based on assumptions -- gather diagnostic output first.
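As a minimal sketch, the read-only steps (1 to 5) can be bundled into one function that refuses to run without an explicit namespace. The name `triage_pod` and its argument order are assumptions made for illustration; step 6 (`exec`) is left out because it is interactive.

```bash
# Hypothetical wrapper around the read-only triage steps (1-5).
# Usage: triage_pod <namespace> <pod-name> <container-name>
triage_pod() {
  local ns="$1" pod="$2" container="$3"

  # Refuse to fall back to the default context namespace.
  if [ -z "$ns" ] || [ -z "$pod" ] || [ -z "$container" ]; then
    echo "usage: triage_pod <namespace> <pod-name> <container-name>" >&2
    return 2
  fi

  kubectl get pods -n "$ns" -o wide
  kubectl describe pod "$pod" -n "$ns"
  kubectl logs "$pod" -n "$ns" -c "$container"
  # --previous errors out when the container has never restarted;
  # ignore that so the events step still runs.
  kubectl logs "$pod" -n "$ns" -c "$container" --previous || true
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
}
```

Redirecting the output to a file, for example `triage_pod <namespace> <pod-name> <container-name> > triage.txt 2>&1`, keeps the evidence available after the pod is replaced.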

### Diagnosis Routing

Based on triage output, load the appropriate reference and follow its diagnosis flow:

| Symptom | Reference |
|---------|-----------|
| Pod status CrashLoopBackOff, ImagePullBackOff, or Pending | `references/crash-diagnosis.md` |
| Service unreachable, DNS failure, connection refused | `references/network-debugging.md` |
| CPU throttling, OOMKill, disk pressure, need debug container | `references/resource-debugging.md` |
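For the status-based rows, routing can be sketched as a filter over `kubectl get pods -n <namespace> --no-headers` output. The function name `flag_failing_pods` is hypothetical, and the sketch assumes the default column layout in which STATUS is the third column. Symptoms that do not show up as a pod status (DNS failures, CPU throttling) still need the table above.

```bash
# Hypothetical filter: read `kubectl get pods --no-headers` lines on stdin
# and print "<pod> <status> <reference>" for statuses named in the table.
flag_failing_pods() {
  awk '
    # Column 1 is NAME, column 3 is STATUS in the default layout.
    $3 == "CrashLoopBackOff" || $3 == "ImagePullBackOff" || $3 == "Pending" {
      print $1, $3, "references/crash-diagnosis.md"
    }
    $3 == "OOMKilled" {
      print $1, $3, "references/resource-debugging.md"
    }
  '
}
```

Typical use would be `kubectl get pods -n <namespace> --no-headers | flag_failing_pods`; healthy pods produce no output.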

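### Error: "no endpoints available for service"
Cause: The Service selector does not match the labels of any running pod in the namespace. Pods that match but are not Ready are also left out of the endpoints.
Solution: Compare the selector in `kubectl get svc <name> -n <namespace> -o yaml` with the labels in `kubectl get pods -n <namespace> --show-labels`. Fix the label mismatch.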
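The comparison itself can be sketched in plain shell. `selector_matches` is a hypothetical helper that takes two comma-separated `key=value` strings, which is the shape of the SELECTOR column in `kubectl get svc <name> -n <namespace> -o wide` and of the LABELS column in `kubectl get pods -n <namespace> --show-labels`.

```bash
# Hypothetical helper: succeed only if every key=value pair in the Service
# selector also appears in the pod's label list.
# Usage: selector_matches "<selector>" "<pod-labels>"
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','                 # split the selector on commas
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;           # this selector pair is present on the pod
      *) echo "missing label: $pair"; return 1 ;;
    esac
  done
  return 0
}
```

For instance, `selector_matches "app=web,tier=frontend" "app=web,tier=backend"` prints `missing label: tier=frontend` and returns a non-zero status. Once the labels are fixed, `kubectl get endpoints <name> -n <namespace>` should list pod addresses.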

---

## References

- [kubernetes-security skill](../kubernetes-security/SKILL.md) -- NetworkPolicy patterns and RBAC debugging
