loop-diagnosis

Name: loop-diagnosis
Author: joelhooks/joelclaw

$ npx mdskill add joelhooks/joelclaw/loop-diagnosis

Diagnose and fix stalled agent loops using the joelclaw CLI when loops appear stuck or stories aren't progressing.

Helps resolve issues like stalled loops, broken event chains, or stories stuck at pending.
Integrates with the joelclaw CLI and checks Redis, worktree, Inngest runs, and agent processes.
Runs six checks to pattern-match root causes and can auto-fix based on diagnosis results.
Presents results via CLI commands with options for concise output or detailed JSON inspection.

SKILL.md

.github/skills/loop-diagnosis

---
name: loop-diagnosis
description: Diagnose and fix stalled agent loops using the joelclaw CLI. Use when loops appear stuck, stories aren't progressing, or the event chain broke. Triggers on "loop stalled", "why isn't the loop progressing", "diagnose loops", "fix stuck loop", "loop not moving", "what happened to the loop", "stories stuck at pending", or any request to debug loop infrastructure.
---

# Loop Diagnosis

Diagnose and fix stalled agent coding loops. This skill covers the diagnostic CLI, common failure modes, and the observability patterns that prevent silent stalls.

## Quick Commands

```bash
# Diagnose all active loops at once
joelclaw loop diagnose all -c

# Diagnose a specific loop
joelclaw loop diagnose <loop-id> -c

# Diagnose AND auto-fix
joelclaw loop diagnose all -c --fix

# Full JSON output (for detailed inspection)
joelclaw loop diagnose <loop-id>
```

## What Diagnosis Checks

The `diagnose` command runs six steps in order: five checks, then a diagnosis that pattern-matches their results.

1. **Redis state** — PRD stories (pass/skip/pending), progress entries, active claims
2. **Worktree** — exists? commits? uncommitted changes? .out files?
3. **Inngest runs** — running/failed agent-loop-* functions, recent plan runs
4. **Agent processes** — any claude/codex processes still alive?
5. **Worker health** — function_count from localhost:3111/api/inngest
6. **Diagnosis** — pattern-matches the above into a root cause

## Failure Modes & Fixes

| Diagnosis | Root Cause | Auto-Fix |
|-----------|-----------|----------|
| `CHAIN_BROKEN` | Judge sent `story.passed` but plan never received it. Event lost in transit. | Re-fires `agent/loop.story.passed` → plan picks next story |
| `ORPHANED_CLAIM` | Story claimed by an event, but agent died and no Inngest run is active. | Clears claim + re-fires plan event |
| `STUCK_RUN` | Inngest run marked RUNNING but agent process is dead. Run won't complete. | Clears claims + re-fires (manual run cancellation may be needed in Inngest dashboard) |
| `WORKER_UNHEALTHY` | Worker registering fewer functions than expected. Missing imports or crash loop. | Restarts `system-bus-worker` deployment in k8s |
| `NO_PRD` | Loop has no PRD in Redis — was nuked or never created. | None — start a new loop |
| `COMPLETE` | All stories passed or skipped. Nothing to do. | None — run `joelclaw loop nuke dead` to clean up |
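The pattern-matching step can be pictured as a small decision function. The following is a minimal sketch, not the joelclaw implementation: the `Snapshot` fields, the order of the checks, the `expected_functions` threshold, and the `HEALTHY` fallback label are all assumptions chosen to mirror the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical summary of the five checks; field names are illustrative."""
    stories: Optional[dict] = None  # story id -> "pass" | "skip" | "pending"; None = no PRD in Redis
    claimed: bool = False           # a story currently holds an active claim
    inngest_running: bool = False   # an agent-loop-* run is marked RUNNING
    agent_alive: bool = False       # a claude/codex process is still alive
    function_count: int = 0         # count reported by the worker
    expected_functions: int = 0     # assumed threshold, not a documented value


def classify(s: Snapshot) -> str:
    """Map a snapshot to one of the diagnosis codes in the table."""
    if s.stories is None:
        return "NO_PRD"
    if all(status in ("pass", "skip") for status in s.stories.values()):
        return "COMPLETE"
    if s.function_count < s.expected_functions:
        return "WORKER_UNHEALTHY"
    if s.inngest_running and not s.agent_alive:
        return "STUCK_RUN"
    if s.claimed and not s.inngest_running and not s.agent_alive:
        return "ORPHANED_CLAIM"
    if not s.claimed and not s.inngest_running:
        return "CHAIN_BROKEN"
    return "HEALTHY"  # hypothetical label for "nothing matched"; not in the table


# Pending work, no claim, no run: nobody picked up the next story.
print(classify(Snapshot(stories={"s1": "pass", "s2": "pending"})))  # CHAIN_BROKEN
```

Checking `NO_PRD` and `COMPLETE` first keeps the later branches from misreading an empty or finished loop as a stall.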

## When to Use (vs Other Skills)

- **loop-diagnosis** → Loop is stuck/stalled, need to figure out why and fix it
- **loop-nanny** → Loop is running, need to monitor progress and clean up after
- **agent-loop** → Need to START a new loop

## The Event Chain

Understanding the chain helps diagnose WHERE it broke:

```
agent/loop.started
  → plan (picks story, dispatches test-writer)
    → agent/loop.story.dispatched
      → test-writer (writes acceptance tests)
        → agent/loop.tests.written
          → implement (codex/claude writes code)
            → agent/loop.story.implemented
              → review (runs tests, typecheck, claude review)
                → agent/loop.story.reviewed
                  → judge (pass/fail/retry decision)
                    → agent/loop.story.passed  ←── feeds back to plan
                    → agent/loop.story.failed  ←── feeds back to plan
                    → agent/loop.story.retry   ←── feeds back to implement
```

**Most common break point**: `judge → plan`. The `agent/loop.story.passed` event fires but plan never picks it up. This happens when:
- Inngest restarts while the event is in flight
- The worker restarts between judge and plan
- A k8s pod restart drops the event
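One way to reason about where the chain broke is to treat it as a lookup table: given the last event observed for a loop, which function should have consumed it? The sketch below transcribes the event and function names from the diagram above; the `NEXT_CONSUMER` table and the `expected_consumer` helper are hypothetical and are not part of the joelclaw CLI.

```python
# Event -> function that should consume it, transcribed from the chain diagram.
NEXT_CONSUMER = {
    "agent/loop.started": "plan",
    "agent/loop.story.dispatched": "test-writer",
    "agent/loop.tests.written": "implement",
    "agent/loop.story.implemented": "review",
    "agent/loop.story.reviewed": "judge",
    "agent/loop.story.passed": "plan",
    "agent/loop.story.failed": "plan",
    "agent/loop.story.retry": "implement",
}


def expected_consumer(last_event: str) -> str:
    """Return the function that should have picked up `last_event`."""
    try:
        return NEXT_CONSUMER[last_event]
    except KeyError:
        raise ValueError(f"not a loop chain event: {last_event}") from None


# If story.passed was the last event seen and nothing followed,
# the break is on the judge -> plan edge.
print(expected_consumer("agent/loop.story.passed"))  # plan
```

If the function named here has no corresponding Inngest run after the event's timestamp, that edge is the likely break point.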

## Observability Patterns

### Passive: Failure Events
Every loop function should emit a failure event from an `onFailure` handler (these handlers are being added by the harden loop). Each handler fires `agent/loop.function.failed`, which is logged to slog.

### Active: Watchdog (Future)
A periodic Inngest function (`system/loop-watchdog`) that:
1. Scans all loops in Redis with pending stories
2. Checks if any events were emitted in the last 10 minutes
3. If not → auto-runs diagnose + fix
4. Logs to slog + daily log
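Since the watchdog does not exist yet, the following is only a sketch of the stall test in steps 1 and 2. The shape of the `loops` mapping is hypothetical and does not reflect the real Redis layout; only the 10-minute window comes from the list above.

```python
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(minutes=10)  # the window from step 2


def find_stalled(loops: dict, now: datetime) -> list:
    """Return ids of loops that have pending stories but no recent events.

    `loops` maps loop id -> {"pending": int, "last_event_at": datetime}.
    This shape is an assumption made for illustration.
    """
    stalled = []
    for loop_id, state in loops.items():
        if state["pending"] == 0:
            continue  # nothing left to do, so silence is expected
        if now - state["last_event_at"] > STALL_THRESHOLD:
            stalled.append(loop_id)
    return sorted(stalled)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
loops = {
    "loop-a": {"pending": 2, "last_event_at": now - timedelta(minutes=25)},
    "loop-b": {"pending": 1, "last_event_at": now - timedelta(minutes=3)},
    "loop-c": {"pending": 0, "last_event_at": now - timedelta(hours=2)},
}
print(find_stalled(loops, now))  # ['loop-a']
```

Each id returned would then be handed to the diagnose + fix step and logged, as in steps 3 and 4.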

### Manual: The Diagnostic Session
When an agent needs to debug loops manually, follow this sequence:

```bash
# 1. Quick overview
joelclaw loop diagnose all -c

# 2. If fix needed
joelclaw loop diagnose all -c --fix

# 3. Verify fix worked (wait ~30s for plan to fire)
joelclaw loop status <loop-id> -c

# 4. If still stuck, check worker
curl -s localhost:3111/api/inngest | python3 -c "import json,sys; print(json.load(sys.stdin)['function_count'])"

# 5. Nuclear option: full restart
joelclaw loop restart <loop-id>
```
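Step 4 prints the raw `function_count`; deciding whether that number is healthy still requires comparing it with what the worker normally registers. The sketch below makes that comparison explicit. The `expected_functions` value is an assumption you supply yourself, the helper names are hypothetical, and the counts in the example are made up for illustration.

```python
import json
import urllib.request


def worker_is_healthy(response_body: str, expected_functions: int) -> bool:
    """True when the worker reports at least `expected_functions` functions."""
    reported = json.loads(response_body)["function_count"]
    return reported >= expected_functions


def fetch_worker_status(url: str = "http://localhost:3111/api/inngest") -> str:
    """Fetch the same endpoint the curl command in step 4 hits."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Example with a canned response body instead of a live request.
print(worker_is_healthy('{"function_count": 12}', expected_functions=20))  # False
```

Against a live worker, `worker_is_healthy(fetch_worker_status(), expected_functions=...)` performs the same check; a `False` result corresponds to the `WORKER_UNHEALTHY` row in the failure table.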

## Making Loops More Resilient

The root cause of most stalls is **lost events in the judge→plan chain**. Solutions being implemented:

1. **onFailure handlers** — every function gets one, logs failure + emits diagnostic event
2. **Loop watchdog** — periodic check for silent stalls
3. **Debounce on content-sync** — prevents event storms that can crowd out loop events
4. **Singleton on backfill** — prevents resource contention during loops

## Cross-References

- [agent-loop skill](/Users/joel/.pi/agent/skills/agent-loop/SKILL.md) — starting loops
- [loop-nanny skill](/Users/joel/.pi/agent/skills/loop-nanny/SKILL.md) — monitoring + cleanup
- [joelclaw skill](../joelclaw/SKILL.md) — full CLI reference
- [ADR-0028](/Users/joel/Vault/docs/decisions/0028-inngest-reliability-patterns.md) — reliability patterns
