system-tuning
$
npx mdskill add boshu2/agentops/system-tuningRestores system responsiveness by cleaning up processes and agent swarms
- Solves sluggish performance from agent sprawl and stuck processes
- Uses eBPF observability and SRE practices for safe triage
- Prioritizes parent processes and avoids killing protected tasks
- Reports triage results to stdout with warnings on stderr
SKILL.md
.github/skills/system-tuningView on GitHub ↗
---
name: system-tuning
description: Restore system responsiveness via safe, ordered process cleanup and agent-swarm
hygiene.
practices:
- sre
- resilience-patterns
- ebpf-observability
hexagonal_role: supporting
consumes: []
produces: []
context_rel: []
skill_api_version: 1
user-invocable: true
context:
window: fork
intent:
mode: task
sections:
exclude:
- HISTORY
intel_scope: topic
metadata:
tier: execution
dependencies: []
output_contract: 'stdout: triage report; stderr: warnings on protected processes'
---
# System Tuning Skill
> **Quick Ref:** Triage a sluggish dev box without nuking work in flight. Order: diagnose → cleanup zombies and exited sessions → kill stuck children → fix confused parents → renice survivors → verify.
**YOU MUST EXECUTE THIS WORKFLOW. Do not just describe it.**
When the box feels slow — high load, swap thrash, agent sprawl, or compile storms — work the kill hierarchy from the safest move outward and treat parent agents as the real cleanup target, not their children.
## When To Use
| Symptom | Use This Skill |
|---|---|
| Load average drifts above core count and stays there | yes |
| Multiple agents respawn the same `cargo` / `cc1plus` after kills | yes — see [whack-a-mole-anti-pattern.md](references/whack-a-mole-anti-pattern.md) |
| Tmux / zellij session list grew past sanity | yes — see [agent-swarm-cleanup.md](references/agent-swarm-cleanup.md) |
| One process is genuinely the hot path | use `/perf` first; come back if cleanup is still needed |
| Production incident on a server you do not own | escalate; do not run kills blind |
## Method
The skill drives three loops, in order. Each loop has a check, an action set, and a verification.
### 1. Diagnose Without Touching
Read pressure metrics before signalling anything.
```bash
uptime && nproc
cat /proc/pressure/cpu /proc/pressure/io /proc/pressure/memory 2>/dev/null
ps -eo stat | grep -c '^Z' # zombie count
ps aux --sort=-%cpu | head -10
```
Capture this as the baseline. Every kill below should move at least one of these numbers; if nothing moves, you killed the wrong thing.
### 2. Walk the Kill Hierarchy
Work from least-invasive to most-invasive. Full ordering and signal escalation rules live in [references/kill-hierarchy.md](references/kill-hierarchy.md).
```
zombies & exited sessions → reap, no risk
stuck child processes → SIGTERM, wait 3s, escalate to SIGKILL
confused parent agents → kill the parent so it stops respawning children
renice the survivors → give interactive work room without termination
```
Stop at the first loop that restores responsiveness. Do not run later steps "just in case."
### 3. Clean Up The Swarm
If the box hosts multiple coding agents, the second-order problem is duplicated work, not slow code. The patterns to look for are catalogued in [references/agent-swarm-cleanup.md](references/agent-swarm-cleanup.md): competing build target dirs, orphaned MCP children, abandoned tmux/zellij sessions, agents older than the work they were spawned for.
### 4. Verify
After every loop, re-read the same commands from step 1. Write a one-line delta:
```
load: 38.4 -> 12.1 | cpu_pressure_avg10: 61% -> 8% | zombies: 14 -> 0
```
No delta written → the cleanup is not done.
## Quick-Start Checklist
```bash
# Baseline
uptime; cat /proc/pressure/cpu
# Reap free wins
ps -eo stat | grep -c '^Z' # zombies present?
zellij list-sessions 2>&1 | grep -c EXITED # dead sessions present?
zellij delete-all-sessions --yes 2>/dev/null
# Find stuck children (12h+ tests, idle dev servers)
ps -eo pid,etimes,args --sort=-etimes | awk '$2 > 43200' | head
# Find confused parents (16h+ agents)
ps -eo pid,etimes,args | grep -E 'claude|codex' | awk '$2 > 57600'
# Renice live compilation rather than killing it
for pid in $(pgrep -f /bin/cargo) $(pgrep cc1plus); do
renice 19 -p "$pid" 2>/dev/null
ionice -c 3 -p "$pid" 2>/dev/null
done
# Verify
sleep 10 && uptime && cat /proc/pressure/cpu
```
## Protected Processes
Never signal these without explicit operator approval:
```
systemd, sshd, dbus, cron
postgres, mysql, redis, nginx, caddy
docker, containerd, k3d, kubelet
the multiplexer your sessions live inside (tmux server, wezterm-mux-server, zellij)
```
If unsure, leave it running and document the candidate instead. The cost of a wrong kill is much higher than the cost of waiting.
## Output
Run reports go to `.agents/system-tuning/YYYY-MM-DD-triage.md`. Each report includes:
1. Baseline metrics (uptime, pressure, zombies, top CPU)
2. Kill log with reason per signal
3. Renice / ionice changes
4. Post-cleanup metrics
5. Anything escalated for operator decision
## See Also
- [perf](../perf/SKILL.md) — Optimize a single hot path once the system is responsive
- [bug-hunt](../bug-hunt/SKILL.md) — Investigate why a process loops or hangs
- [scope](../scope/SKILL.md) — Lock edit scope before running cleanup in a shared workspace
## Reference Documents
- [references/kill-hierarchy.md](references/kill-hierarchy.md)
- [references/whack-a-mole-anti-pattern.md](references/whack-a-mole-anti-pattern.md)
- [references/agent-swarm-cleanup.md](references/agent-swarm-cleanup.md)
- [references/system-tuning.feature](references/system-tuning.feature) — Executable spec: diagnose-first, ordered safe cleanup hierarchy, protect in-flight processes, verify + report (soc-qk4b)
## Attribution
Methodology pattern-adopted from an external `system-performance-remediation` skill corpus. See `LICENSE.md` in this skill directory for full attribution. No source text reused.
More from boshu2/agentops
- autodevManage bounded autonomous dev loops.
- beadsTrack issues with bd/br, triage with bv, and convert plans to beads.
- bootstrapInitialize AgentOps project files.
- bug-huntInvestigate bugs and root causes.
- codex-teamCoordinate multiple Codex agents.
- compileCompile .agents knowledge wiki.
- complexityFind focused refactor hotspots.
- converterConvert AgentOps skill formats.
- crankExecute epics through waves.
- curateMine transcripts, .agents, bd, and git for skill diffs, bd updates, or