talon

Name: talon
Author: joelhooks/joelclaw
$npx mdskill add joelhooks/joelclaw/talon
Supervises the system-bus worker and monitors Kubernetes infrastructure using a Rust daemon.
Helps manage and validate system configurations and service monitors for reliable infrastructure operation.
Integrates with Kubernetes, system-bus, and uses TOML config files for setup and monitoring.
Decides actions based on config validation, probe cycles, and state machine positions for supervision.
Presents results through JSON summaries, logs, and command-line outputs for status and diagnostics.
SKILL.md
.github/skills/talonView on GitHub ↗
---
name: talon
description: Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.
---

# Talon — Infrastructure Watchdog Daemon

Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.

## Quick Reference

```bash
talon validate         # Parse/validate config + services files, print summary JSON
talon --check          # Single probe cycle, print results, exit
talon --status         # Current state machine position
talon --dry-run        # Print loaded config, exit
talon --worker-only    # Supervisor only, no infra probes
talon                  # Full daemon mode (worker + probes + escalation)
```

## Paths

| What | Where |
|------|-------|
| Binary | `~/.local/bin/talon` |
| Source | `~/Code/joelhooks/joelclaw/infra/talon/src/` |
| Config | `~/.config/talon/config.toml` |
| Service monitors | `~/.joelclaw/talon/services.toml` |
| Default config | `~/Code/joelhooks/joelclaw/infra/talon/config.default.toml` |
| Default services template | `~/Code/joelhooks/joelclaw/infra/talon/services.default.toml` |
| Voice stale cleanup | `~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh` |
| State | `~/.local/state/talon/state.json` |
| Probe results | `~/.local/state/talon/last-probe.json` |
| Log | `~/.local/state/talon/talon.log` (JSON lines, 10MB rotation) |
| Launchd plist | `~/Code/joelhooks/joelclaw/infra/launchd/com.joel.talon.plist` |
| RBAC guard manifest | `~/Code/joelhooks/joelclaw/k8s/apiserver-kubelet-client-rbac.yaml` |
| Worker stdout | `~/.local/log/system-bus-worker.log` |
| Worker stderr | `~/.local/log/system-bus-worker.err` |
| Talon launchd log | `~/.local/log/talon.err` |

## Build

```bash
export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon
```

## Architecture

```
talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│   ├── Kill orphan on port 3111
│   ├── Spawn bun (child process)
│   ├── Signal forwarding (SIGTERM → bun)
│   ├── Health poll every 30s
│   ├── PUT sync after healthy startup
│   └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│   ├── Colima VM alive?
│   ├── Docker socket responding?
│   ├── Talos container running?
│   ├── k8s API reachable?
│   ├── Node Ready + schedulable?
│   ├── Flannel daemonset ready?
│   ├── Redis PONG?
│   ├── Inngest /health 200?
│   ├── Typesense /health ok?
│   └── Worker /api/inngest 200?
│
└── Escalation (on failure)
    ├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
    ├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
    ├── Tier 2: pi agent (cloud model, 10min cooldown)
    ├── Tier 3: pi agent (Ollama local, network-down fallback)
    └── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold)
```

## State Machine

```
healthy → degraded (1 critical probe failure)
degraded → failed (3 consecutive failures)
failed → investigating (agent spawned)
investigating → healthy (probes pass again)
investigating → critical (agent failed to fix)
critical → sos (SOS sent via Telegram + iMessage)
any → healthy (all probes pass)
```

## Probes

| Probe | Command | Critical? |
|-------|---------|-----------|
| colima | `colima status` | Yes |
| docker | `docker ps` (Colima socket) | Yes |
| talos_container | `docker inspect joelclaw-controlplane-1` | Yes |
| k8s_api | `kubectl get nodes` | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (taints/cordon) | Yes |
| flannel | `kubectl -n kube-system get daemonset kube-flannel -o jsonpath=...` | No |
| redis | `kubectl exec redis-0 -- redis-cli ping` | Yes |
| kubelet_proxy_rbac | `kubectl auth can-i --as=<apiserver-kubelet-client*> {get,create} nodes --subresource=proxy` | Yes |
| vm:docker | `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima docker ps` | No |
| vm:k8s_api | `ssh ... python socket probe :64784` | No |
| vm:redis | `ssh ... python socket probe :6379` | No |
| vm:inngest | `ssh ... python socket probe :8288` | No |
| vm:typesense | `ssh ... python socket probe :8108` | No |
| inngest | `curl localhost:8288/health` | No |
| typesense | `curl localhost:8108/health` | No |
| worker | `curl localhost:3111/api/inngest` | No |

Critical probes trigger escalation immediately. Non-critical need 3 consecutive failures.

VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.

### Dynamic service probes

Add probes in `~/.joelclaw/talon/services.toml` without rebuilding talon:

```toml
[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5

[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5

[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5

[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
```

- `launchd.<name>` passes when `launchctl list <label>` reports a non-zero PID
- `http.<name>` passes on HTTP `200`
- `script.<name>` passes on exit code 0, fails on non-zero (runs via `sh -c`)
- `critical = true` escalates when the probe is marked critical (or after debounce if configured)
- `critical_after_consecutive_failures = N` debounces critical alerts for dynamic probes (default `1` = immediate)
- `http.gateway_slack` uses gateway endpoint `GET /health/slack`, fails (503) when Slack channel is not started, and should be debounced (recommended `3` cycles)
- **Do not probe `http://127.0.0.1:8081/` for `voice_agent` by default** — root returns `503` when idle and causes false SOS noise
- Service-heal pre-cleanup for `voice_agent` now clears stale `uv/main.py` listeners on `:8081` before `launchctl kickstart` to avoid bind conflicts after force-cycles
- Talon hot-reloads service probes when `services.toml` mtime changes (no restart required)
- `kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}')` forces immediate reload

Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:
- `script.redis_aof_health` — critical after 3 failures; checks `aof_last_bgrewrite_status:ok` to catch Redis AOF rewrite/persistence corruption.
- `script.colima_vm_uptime` — critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
- `script.restate_worker_ready` — critical after 3 failures; verifies the `restate-worker` pod reports `Ready=true` before workloads are trusted.
- `script.kvm_device_present` — non-critical witness probe; records whether `/dev/kvm` is present inside Colima for nested-virt / Firecracker diagnosis.

### Health endpoint

- `GET http://127.0.0.1:9999/health` returns Talon state JSON
- Gateway heartbeat consumes this as an additional watchdog signal
- Configure via `[health]` in `~/.config/talon/config.toml`

### SOS channel config

- Tier 4 sends to both Telegram and iMessage
- Telegram fields in `[escalation]`:
  - `sos_telegram_chat_id`
  - `sos_telegram_secret_name` (defaults to `telegram_bot_token`)
- Talon now leases Telegram tokens via `secrets lease <name> --ttl ...` (no `--raw`). If you still see `curl: (3) URL rejected: Malformed input to a URL function`, redeploy the latest Talon binary.
- iMessage recipient remains `sos_recipient`

## Launchd Management

**Talon is active as `com.joel.talon`:**
```bash
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="
```

**Reload binary/config after deploy:**
```bash
launchctl kickstart -k gui/$(id -u)/com.joel.talon
```

**Single owner for worker supervision is mandatory:**
- If `com.joel.system-bus-worker` is loaded, Talon now auto-disables its internal worker supervisor to prevent port-3111 thrash.
- Preferred end-state is Talon-only supervision, but coexistence no longer causes kill/restart loops.

```bash
launchctl list com.joel.system-bus-worker
```

**Legacy services should stay disabled when fully cut over:**
```bash
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist
```

## Troubleshooting

```bash
# Validate config + service monitor files
talon validate | python3 -m json.tool

# Check what talon sees right now
talon --check | python3 -m json.tool

# Check state machine
talon --status | python3 -m json.tool

# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null

# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool

# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool

# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err

# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'

# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start

# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh
```

## Colima Stability Monitoring (2026-03-17)

Talon now monitors failure modes discovered during the Firecracker development incident:

| Probe | What it detects | Critical? |
|-------|----------------|-----------|
| `script:redis_aof_health` | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| `script:colima_vm_uptime` | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| `script:restate_worker_ready` | Restate worker pod not 1/1 Ready | Yes (after 3) |
| `script:kvm_available` | Whether /dev/kvm exists (nested virt status) | No (informational) |

### Known failure chain: nestedVirtualization → cascade

```
nestedVirtualization ON + heavy Docker build
  → Colima VZ VM crash (silent, no crash report)
    → Docker daemon restart → Talos container killed
      → Redis mid-write → AOF corruption → crash-loop
      → Restate mid-journal → stale invocations → infinite retries
      → Lima socket forwarding broken → docker CLI dead on macOS
```

Talon detects each stage:
1. `colima_vm_uptime` < 120s → VM just crashed
2. `redis` probe fails → Redis down
3. `redis_aof_health` fails → AOF corrupted (needs manual fix)
4. `restate_worker_ready` fails → worker can't start (may be /dev/kvm mount or image pull)

**Talon cannot auto-fix Redis AOF corruption** (requires `redis-check-aof --fix`). It WILL escalate to the pi agent (Tier 2) which should load the `k8s` skill's Redis AOF Recovery procedure.

## Key Design Decisions

- **Zero external deps** — no tokio, no serde, no reqwest. Pure std. Keeps binary at ~444KB.
- **Compiles its own PATH** — immune to launchd environment brittleness (the class of bug that caused the 6-day outage).
- **Worker is a child process** — not a separate launchd service. Signal forwarding prevents orphans.
- **TOML config parsed by hand** — same pattern as worker-supervisor. No dependency just for config.
- **Probes use Colima docker socket** for critical host checks and add VM witness probes over Colima SSH for split-brain detection.

## Related

- ADR-0159: Talon proposal
- ADR-0158: Worker supervisor (superseded by talon)
- `infra/k8s-reboot-heal.sh`: Tier 1 heal script
- `infra/worker-supervisor/`: Original standalone worker supervisor (superseded)
- Ollama + qwen3:8b: Tier 3 local fallback model