talon
$
npx mdskill add joelhooks/joelclaw/talonSupervises the system-bus worker and monitors Kubernetes infrastructure using a Rust daemon.
- Helps manage and validate system configurations and service monitors for reliable infrastructure operation.
- Integrates with Kubernetes, system-bus, and uses TOML config files for setup and monitoring.
- Decides actions based on config validation, probe cycles, and state machine positions for supervision.
- Presents results through JSON summaries, logs, and command-line outputs for status and diagnostics.
SKILL.md
.github/skills/talonView on GitHub ↗
---
name: talon
description: Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.
---
# Talon — Infrastructure Watchdog Daemon
Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.
## Quick Reference
```bash
talon validate # Parse/validate config + services files, print summary JSON
talon --check # Single probe cycle, print results, exit
talon --status # Current state machine position
talon --dry-run # Print loaded config, exit
talon --worker-only # Supervisor only, no infra probes
talon # Full daemon mode (worker + probes + escalation)
```
## Paths
| What | Where |
|------|-------|
| Binary | `~/.local/bin/talon` |
| Source | `~/Code/joelhooks/joelclaw/infra/talon/src/` |
| Config | `~/.config/talon/config.toml` |
| Service monitors | `~/.joelclaw/talon/services.toml` |
| Default config | `~/Code/joelhooks/joelclaw/infra/talon/config.default.toml` |
| Default services template | `~/Code/joelhooks/joelclaw/infra/talon/services.default.toml` |
| Voice stale cleanup | `~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh` |
| State | `~/.local/state/talon/state.json` |
| Probe results | `~/.local/state/talon/last-probe.json` |
| Log | `~/.local/state/talon/talon.log` (JSON lines, 10MB rotation) |
| Launchd plist | `~/Code/joelhooks/joelclaw/infra/launchd/com.joel.talon.plist` |
| RBAC guard manifest | `~/Code/joelhooks/joelclaw/k8s/apiserver-kubelet-client-rbac.yaml` |
| Worker stdout | `~/.local/log/system-bus-worker.log` |
| Worker stderr | `~/.local/log/system-bus-worker.err` |
| Talon launchd log | `~/.local/log/talon.err` |
## Build
```bash
export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon
```
## Architecture
```
talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│ ├── Kill orphan on port 3111
│ ├── Spawn bun (child process)
│ ├── Signal forwarding (SIGTERM → bun)
│ ├── Health poll every 30s
│ ├── PUT sync after healthy startup
│ └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│ ├── Colima VM alive?
│ ├── Docker socket responding?
│ ├── Talos container running?
│ ├── k8s API reachable?
│ ├── Node Ready + schedulable?
│ ├── Flannel daemonset ready?
│ ├── Redis PONG?
│ ├── Inngest /health 200?
│ ├── Typesense /health ok?
│ └── Worker /api/inngest 200?
│
└── Escalation (on failure)
├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
├── Tier 2: pi agent (cloud model, 10min cooldown)
├── Tier 3: pi agent (Ollama local, network-down fallback)
└── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold)
```
## State Machine
```
healthy → degraded (1 critical probe failure)
degraded → failed (3 consecutive failures)
failed → investigating (agent spawned)
investigating → healthy (probes pass again)
investigating → critical (agent failed to fix)
critical → sos (SOS sent via Telegram + iMessage)
any → healthy (all probes pass)
```
## Probes
| Probe | Command | Critical? |
|-------|---------|-----------|
| colima | `colima status` | Yes |
| docker | `docker ps` (Colima socket) | Yes |
| talos_container | `docker inspect joelclaw-controlplane-1` | Yes |
| k8s_api | `kubectl get nodes` | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (taints/cordon) | Yes |
| flannel | `kubectl -n kube-system get daemonset kube-flannel -o jsonpath=...` | No |
| redis | `kubectl exec redis-0 -- redis-cli ping` | Yes |
| kubelet_proxy_rbac | `kubectl auth can-i --as=<apiserver-kubelet-client*> {get,create} nodes --subresource=proxy` | Yes |
| vm:docker | `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima docker ps` | No |
| vm:k8s_api | `ssh ... python socket probe :64784` | No |
| vm:redis | `ssh ... python socket probe :6379` | No |
| vm:inngest | `ssh ... python socket probe :8288` | No |
| vm:typesense | `ssh ... python socket probe :8108` | No |
| inngest | `curl localhost:8288/health` | No |
| typesense | `curl localhost:8108/health` | No |
| worker | `curl localhost:3111/api/inngest` | No |
Critical probes trigger escalation immediately. Non-critical need 3 consecutive failures.
VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.
### Dynamic service probes
Add probes in `~/.joelclaw/talon/services.toml` without rebuilding talon:
```toml
[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5
[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5
[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5
[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5
[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
```
- `launchd.<name>` passes when `launchctl list <label>` reports a non-zero PID
- `http.<name>` passes on HTTP `200`
- `script.<name>` passes on exit code 0, fails on non-zero (runs via `sh -c`)
- `critical = true` escalates when the probe is marked critical (or after debounce if configured)
- `critical_after_consecutive_failures = N` debounces critical alerts for dynamic probes (default `1` = immediate)
- `http.gateway_slack` uses gateway endpoint `GET /health/slack`, fails (503) when Slack channel is not started, and should be debounced (recommended `3` cycles)
- **Do not probe `http://127.0.0.1:8081/` for `voice_agent` by default** — root returns `503` when idle and causes false SOS noise
- Service-heal pre-cleanup for `voice_agent` now clears stale `uv/main.py` listeners on `:8081` before `launchctl kickstart` to avoid bind conflicts after force-cycles
- Talon hot-reloads service probes when `services.toml` mtime changes (no restart required)
- `kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}')` forces immediate reload
Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:
- `script.redis_aof_health` — critical after 3 failures; checks `aof_last_bgrewrite_status:ok` to catch Redis AOF rewrite/persistence corruption.
- `script.colima_vm_uptime` — critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
- `script.restate_worker_ready` — critical after 3 failures; verifies the `restate-worker` pod reports `Ready=true` before workloads are trusted.
- `script.kvm_device_present` — non-critical witness probe; records whether `/dev/kvm` is present inside Colima for nested-virt / Firecracker diagnosis.
### Health endpoint
- `GET http://127.0.0.1:9999/health` returns Talon state JSON
- Gateway heartbeat consumes this as an additional watchdog signal
- Configure via `[health]` in `~/.config/talon/config.toml`
### SOS channel config
- Tier 4 sends to both Telegram and iMessage
- Telegram fields in `[escalation]`:
- `sos_telegram_chat_id`
- `sos_telegram_secret_name` (defaults to `telegram_bot_token`)
- Talon now leases Telegram tokens via `secrets lease <name> --ttl ...` (no `--raw`). If you still see `curl: (3) URL rejected: Malformed input to a URL function`, redeploy the latest Talon binary.
- iMessage recipient remains `sos_recipient`
## Launchd Management
**Talon is active as `com.joel.talon`:**
```bash
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="
```
**Reload binary/config after deploy:**
```bash
launchctl kickstart -k gui/$(id -u)/com.joel.talon
```
**Single owner for worker supervision is mandatory:**
- If `com.joel.system-bus-worker` is loaded, Talon now auto-disables its internal worker supervisor to prevent port-3111 thrash.
- Preferred end-state is Talon-only supervision, but coexistence no longer causes kill/restart loops.
```bash
launchctl list com.joel.system-bus-worker
```
**Legacy services should stay disabled when fully cut over:**
```bash
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist
```
## Troubleshooting
```bash
# Validate config + service monitor files
talon validate | python3 -m json.tool
# Check what talon sees right now
talon --check | python3 -m json.tool
# Check state machine
talon --status | python3 -m json.tool
# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null
# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool
# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool
# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err
# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'
# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start
# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh
```
## Colima Stability Monitoring (2026-03-17)
Talon now monitors failure modes discovered during the Firecracker development incident:
| Probe | What it detects | Critical? |
|-------|----------------|-----------|
| `script:redis_aof_health` | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| `script:colima_vm_uptime` | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| `script:restate_worker_ready` | Restate worker pod not 1/1 Ready | Yes (after 3) |
| `script:kvm_available` | Whether /dev/kvm exists (nested virt status) | No (informational) |
### Known failure chain: nestedVirtualization → cascade
```
nestedVirtualization ON + heavy Docker build
→ Colima VZ VM crash (silent, no crash report)
→ Docker daemon restart → Talos container killed
→ Redis mid-write → AOF corruption → crash-loop
→ Restate mid-journal → stale invocations → infinite retries
→ Lima socket forwarding broken → docker CLI dead on macOS
```
Talon detects each stage:
1. `colima_vm_uptime` < 120s → VM just crashed
2. `redis` probe fails → Redis down
3. `redis_aof_health` fails → AOF corrupted (needs manual fix)
4. `restate_worker_ready` fails → worker can't start (may be /dev/kvm mount or image pull)
**Talon cannot auto-fix Redis AOF corruption** (requires `redis-check-aof --fix`). It WILL escalate to the pi agent (Tier 2) which should load the `k8s` skill's Redis AOF Recovery procedure.
## Key Design Decisions
- **Zero external deps** — no tokio, no serde, no reqwest. Pure std. Keeps binary at ~444KB.
- **Compiles its own PATH** — immune to launchd environment brittleness (the class of bug that caused the 6-day outage).
- **Worker is a child process** — not a separate launchd service. Signal forwarding prevents orphans.
- **TOML config parsed by hand** — same pattern as worker-supervisor. No dependency just for config.
- **Probes use Colima docker socket** for critical host checks and add VM witness probes over Colima SSH for split-brain detection.
## Related
- ADR-0159: Talon proposal
- ADR-0158: Worker supervisor (superseded by talon)
- `infra/k8s-reboot-heal.sh`: Tier 1 heal script
- `infra/worker-supervisor/`: Original standalone worker supervisor (superseded)
- Ollama + qwen3:8b: Tier 3 local fallback model
More from joelhooks/joelclaw
- add-skillCreate new joelclaw skills with the idiomatic process — repo-canonical, symlinked, git-tracked, slogged. Triggers on 'add a skill', 'create skill', 'new skill', 'canonical skill', 'make a skill for', or any request to formalize a process or domain into a reusable skill.
- adr-skillCreate and maintain Architecture Decision Records (ADRs) optimized for agentic coding workflows. Use when you need to propose, write, update, accept/reject, deprecate, or supersede an ADR; bootstrap an adr folder and index; consult existing ADRs before implementing changes; or enforce ADR conventions. This skill uses Socratic questioning to capture intent before drafting, and validates output against an agent-readiness checklist.
- agent-discovery"Optimize websites, docs, and product surfaces for agent discoverability and operator UX. Use when working on agent SEO/AEO/GEO, crawl policy, markdown or JSON projections, llms.txt, sitemap.md, AGENTS.md guidance, content negotiation, accessibility for browser agents, or any request to make a site easier for pi, OpenCode, Claude Code, ChatGPT, Perplexity, or other agent harnesses to find and use."
- agent-loopStart, monitor, and cancel durable multi-agent coding loops via Inngest. Use when the user wants to run autonomous coding workloads, execute a PRD with multiple stories, kick off an AFK coding session, have agents implement features from a plan, or manage running loops. Triggers on "start a coding loop", "run this PRD", "implement these stories", "go AFK and code this", "check loop status", "cancel the loop", "joelclaw loop", or any request for autonomous multi-story code execution.
- agent-mail>-
- agent-workloads"Compatibility alias for the canonical `workflow-rig` front door. Use when older prompts mention `agent-workloads` or when you need the legacy workload-planning guidance; for new work, load `workflow-rig` first."
- clawmail>-
- cli-design"Design and build agent-first CLIs with HATEOAS JSON responses, context-protecting output, and self-documenting command trees. Use when creating new CLI tools, adding commands to existing CLIs (joelclaw, slog), or reviewing CLI design for agent-friendliness. Triggers on 'build a CLI', 'add a command', 'CLI design', 'agent-friendly output', or any task involving command-line tool creation."
- codex-prompting"Use this skill for any request to trigger, coordinate, or craft prompts for Codex. Use when user says 'send to codex', 'use codex', 'prompt codex', 'ask codex', 'delegate to codex', 'run in codex', or asks for a Codex-first execution handoff."
- content-publish"Publish content to joelclaw.com via the Convex-first pipeline. Covers the full lifecycle: draft → review → publish → revalidate → verify. Handles secret leasing, tag conventions, content types (article, tutorial, note, essay), and verification gates. Use when: 'write article about X', 'publish article <slug>', 'draft a tutorial', 'publish this', 'push to convex', or any content publishing task."