system-architecture
$
npx mdskill add joelhooks/joelclaw/system-architectureProvides canonical system topology and wiring maps for architecture reasoning, event tracing, and debugging.
- Helps debug why processes ran or didn't run and trace event flows end-to-end.
- Integrates with Inngest, Kubernetes, gateway services, and local system files.
- Uses direct reads of source code and configuration files as evidence.
- Delivers results as a single source of truth document for routing and debugging.
SKILL.md
.github/skills/system-architectureView on GitHub ↗
---
name: system-architecture
displayName: System Architecture
description: Canonical joelclaw topology and wiring map. Use when reasoning about architecture, tracing event flow, debugging why something ran/didn't run, identifying which worker executes a function, checking what listens on a port, or following an event end-to-end.
version: 0.1.0
author: joel
tags:
- architecture
- topology
- inngest
- gateway
- kubernetes
- observability
---
# System Architecture (Canonical Topology)
This skill is the **single source of truth** for joelclaw system wiring.
Use it for:
- "why did this run / not run"
- "which worker handles this function"
- "what is listening on port X"
- "how does event Y flow"
- full-stack routing/debug across CLI → Inngest → workers → gateway → telemetry
## Ground-Truth Scope + Evidence Snapshot
This document is grounded in direct reads of:
- `apps/docs-api/src/index.ts`
- `packages/restate/Dockerfile`
- `packages/restate/src/index.ts`
- `packages/restate/src/workflows/dag-orchestrator.ts`
- `packages/agent-execution/src/microvm.ts`
- `packages/system-bus/src/serve.ts`
- `packages/system-bus/src/inngest/functions/index.host.ts`
- `packages/system-bus/src/inngest/functions/index.cluster.ts`
- `packages/system-bus/src/inngest/client.ts`
- `infra/worker-supervisor/src/main.rs`
- `~/Library/LaunchAgents/com.joel*.plist`
- `k8s/*` (all files)
- `infra/pds/values.yaml`
- `packages/gateway/src/daemon.ts`
- `packages/gateway/src/channels/*.ts`
- `~/.joelclaw/gateway/AGENTS.md`
- `~/.joelclaw/gateway/.pi/settings.json`
- `~/.local/caddy/Caddyfile`
- `~/.colima/default/colima.yaml` + `colima status --json`
- `packages/cli/src/cli.ts`, `packages/cli/src/config.ts`, `packages/cli/src/inngest.ts`
- `packages/system-bus/src/observability/*` (key files: `emit.ts`, `otel-event.ts`, `store.ts`)
- `packages/telemetry/src/emitter.ts`
- `packages/system-bus/src/lib/langfuse.ts`
- `packages/inference-router/src/tracing.ts`
- ADRs in `~/Vault/docs/decisions/` (required + topology-adjacent)
- last 50 lines of `~/Vault/system/system-log.jsonl`
### Related docs verified
- `docs/architecture.md` — Restate/Firecracker runtime + workload execution flow
- `docs/deploy.md` — Restate worker deploy + auth/identity/PVC procedures
- `docs/cli.md` — workload command tree + runtime bridge
- `docs/observability.md` — **not inspected in this update**
---
## 1) Physical Topology
```text
Mac Mini "Panda" (host macOS)
├─ launchd services (gateway, worker supervisor, caddy, talon, agent-mail, etc.)
├─ Colima VM (driver: VZ, arch: aarch64, runtime: docker, VM IP: 192.168.64.2)
│ └─ Talos node: joelclaw-controlplane-1 (k8s v1.35.0, internal IP 10.5.0.2)
│ ├─ namespace: joelclaw
│ │ ├─ inngest (StatefulSet + NodePort 8288/8289)
│ │ ├─ redis (StatefulSet + NodePort 6379)
│ │ ├─ typesense (StatefulSet + ClusterIP 8108)
│ │ ├─ restate (StatefulSet + NodePort 8080/9070/9071)
│ │ ├─ system-bus-worker (Deployment + ClusterIP 3111)
│ │ ├─ restate-worker (Deployment + ClusterIP 9080; full agent image + Firecracker)
│ │ ├─ dkron (StatefulSet + ClusterIP 8080)
│ │ ├─ docs-api (Deployment + NodePort 3838)
│ │ ├─ livekit-server (Deployment + NodePort 7880/7881)
│ │ ├─ bluesky-pds (Deployment + NodePort 3000)
│ │ └─ minio (StatefulSet + NodePort 30900/30901)
│ └─ namespace: aistor
│ ├─ aistor operator (Deployments: adminjob-operator, object-store-operator)
│ └─ aistor-s3 object store (StatefulSet + NodePort 31000/31001)
├─ Caddy reverse proxy (tailnet HTTPS fan-in)
├─ Gateway daemon (embedded pi session)
├─ Firecracker substrate (requires Colima nestedVirtualization=true for /dev/kvm; OFF by default — unstable under load)
└─ NAS "three-body" (NFS tiers per ADR-0088)
```
### Known runtime endpoints
- Colima VM IP: `192.168.64.2` (`colima status --json`)
- Kubernetes API (local forward): `https://127.0.0.1:64784` (`kubectl cluster-info`)
- Tailnet hostnames seen in config:
- `panda.tail7af24.ts.net` (Caddy routes)
- `pds.panda.tail7af24.ts.net` (PDS values)
### Tailscale mesh state
- `tailscale status --json` failed in this environment: **UNKNOWN — needs manual verification**
---
## 2) Process Inventory (Long-Running)
## Host launchd inventory (snapshot)
> Snapshot source: `launchctl print gui/$(id -u)/<label>` and plist inspection.
| Launchd label | State | PID (snapshot) | Role | Ports / endpoints |
|---|---:|---:|---|---|
| `com.joel.system-bus-worker` | running | 75292 | Host worker supervisor (`worker-supervisor`) | supervises child bun on 3111 |
| `com.joel.restate-worker` | retired / rollback-only | — | Historical host Restate wrapper (`scripts/restate/start.sh`) | superseded by `deployment/restate-worker` on 9080 |
| `com.joel.gateway` | running | 81275 | Gateway daemon (`packages/gateway/src/daemon.ts`) | WS `:3018`, Redis bridge |
| `com.joel.caddy` | running | 9347 | Reverse proxy | 3443, 5443, 6443, 7443, 8290, 8443, 9443 |
| `com.joel.talon` | running | 96359 | Infra watchdog | health `127.0.0.1:9999` |
| `com.joel.agent-secrets` | running | 98048 | Secret lease daemon | no public port |
| `com.joel.imsg-rpc` | running | 61110 | iMessage JSON-RPC socket daemon | Unix socket `/tmp/imsg.sock` |
| `com.joel.typesense-portforward` | running | 32095 | `kubectl port-forward svc/typesense 8108:8108` | local 8108 |
| `com.joel.voice-agent` | running | 71887 | voice agent runtime | local 8081 |
| `com.joel.local-sandbox-janitor` | scheduled | (launchd timer) | ADR-0221 local sandbox janitor (`scripts/local-sandbox-janitor.sh` → `joelclaw workload sandboxes janitor`) | logs in `/tmp/joelclaw/local-sandbox-janitor.{log,err}` |
| `com.joelclaw.agent-mail` | spawn scheduled | (none in launchctl snapshot) | agent-mail MCP HTTP service | observed listener `127.0.0.1:8765` (python process) |
| `com.joel.colima` | not running | — | startup helper for Colima | n/a |
| `com.joel.k8s-reboot-heal` | not running | — | periodic k8s heal script | n/a |
| `com.joel.system-bus-sync` | not running | — | sync guard watcher | n/a |
| `com.joel.gateway-tripwire` | not running | — | gateway tripwire script | n/a |
| `com.joel.content-sync-watcher` | not running | — | fs watch -> content/updated event | n/a |
| `com.joel.vault-log-sync` | not running | — | Vault log sync watcher | n/a |
### Process supervision behavior: `worker-supervisor`
Source: `infra/worker-supervisor/src/main.rs`
- Default config:
- worker dir: `~/Code/joelhooks/joelclaw/packages/system-bus`
- command: `bun run src/serve.ts`
- port: `3111`
- health endpoint: `/api/inngest`
- sync endpoint: `/api/inngest` (PUT)
- health interval: 30s
- restart after 3 consecutive health failures
- restart backoff: 1s → 30s max
- Pre-start kills stale process on port 3111.
- Runs host import preflight before spawn:
- `bun --eval "await import('./src/inngest/functions/index.host.ts');"`
- on failure, skips spawn and retries with exponential backoff
- Loads env from `~/.config/system-bus.env` plus leased secrets.
- Forces `WORKER_ROLE=host` for the supervised host worker.
- Emits OTEL events via CLI on supervisor failures/restarts:
- `worker.supervisor.preflight.failed`
- `worker.supervisor.worker_exit`
- `worker.supervisor.health_check.restart`
### Worker supervision split note
- Talon is running (`com.joel.talon`), but host worker is still launched via `com.joel.system-bus-worker` -> `worker-supervisor`.
- ADR + system-log indicate Talon can defer worker supervision during coexistence.
---
## Kubernetes process inventory
## Node
- `joelclaw-controlplane-1` (Talos v1.12.4, k8s v1.35.0, internal IP `10.5.0.2`)
## Core services
| Service | Workload kind | Service type | Service port(s) | NodePort(s) / exposure | Role |
|---|---|---|---|---|---|
| Inngest | StatefulSet `inngest` | NodePort (`inngest-svc`) | 8288, 8289 | 8288, 8289 | Event API + connect ws |
| Redis | StatefulSet `redis` | NodePort | 6379 | 6379 | Queue/state/pubsub |
| Typesense | StatefulSet `typesense` | ClusterIP | 8108 | host via launchd port-forward 8108 | Search + telemetry store |
| Restate | StatefulSet `restate` | NodePort | 8080, 9070, 9071 | 8080, 9070, 9071 | Durable workflow ingress + admin + metrics |
| system-bus-worker | Deployment | ClusterIP | 3111 | in-cluster only | Cluster-role worker (12 functions) |
| restate-worker | Deployment | ClusterIP | 9080 | in-cluster only | `dagOrchestrator` + `dagWorker` + queue drainer in full agent image |
| docs-api | Deployment | NodePort | 3838 | 3838 | PDF/docs API + agentic search + taxonomy graph |
| dkron | StatefulSet | ClusterIP (`dkron-svc`) + headless peer svc (`dkron-peer`) | 8080, 8946, 6868 | in-cluster only; operator access via short-lived CLI-managed tunnel | Distributed cron scheduler for Restate pipelines |
| livekit-server | Deployment (Helm) | NodePort | 80, 7881 | 7880 (for svc port 80), 7881 | LiveKit signaling + rtc tcp |
| bluesky-pds | Deployment (Helm-managed) | NodePort | 3000 | 3000 | AT Proto PDS |
| minio | StatefulSet | ClusterIP + NodePort | 9000, 9001 | 30900, 30901 | Legacy local S3-compatible runtime |
| aistor-s3-api (`aistor` ns) | NodePort service (operator-managed) | NodePort | 443, 9000 | 31000 (+ dynamic management NodePort) | AIStor S3 API (TLS + management) |
| aistor-s3-console (`aistor` ns) | NodePort service (operator-managed) | NodePort | 9443 | 31001 | AIStor web console |
### Restate / Firecracker runtime note
- `deployment/restate-worker` is the current durable execution worker. The image bundles Bun + Node + `pi` + `codex`, the full repo checkout, and 76 symlinked skills.
- Runtime auth/identity come from `secret/pi-auth` and `configmap/agent-identity`, which recreate `/root/.pi/agent/auth.json` plus the joelclaw identity chain inside the pod.
- Firecracker is enabled in-pod via privileged access to `/dev/kvm` on Colima VZ. The `/dev/kvm` hostPath mount uses type `""` (optional) so the pod starts without it when nestedVirtualization is off.
- Persistent microVM assets live on PVC `firecracker-images`, mounted at `/tmp/firecracker-test` for kernel, rootfs, and snapshot files.
- **Retry caps (2026-03-17)**: dagWorker maxAttempts=5, dagOrchestrator maxAttempts=3. Prevents Restate journal poisoning from infinite retries after code changes or infrastructure failures.
- **Colima stability**: nestedVirtualization is OFF by default (crashes VM under Docker build load). Toggle ON only for Firecracker testing sessions, then toggle OFF. See k8s skill for recovery procedures.
### Control-plane access
- kube API exposed locally at `127.0.0.1:64784` (forwarded)
- additional forwarded control ports observed: `64785`, `9627` (**exact ownership mapping UNKNOWN — needs manual verification**)
---
## 3) Worker Architecture (Role Split + Registration)
Source files:
- `packages/system-bus/src/serve.ts`
- `packages/system-bus/src/inngest/functions/index.host.ts`
- `packages/system-bus/src/inngest/functions/index.cluster.ts`
- `packages/system-bus/src/inngest/client.ts`
## Role model
- `WORKER_ROLE` parsed as `host` (default) or `cluster`.
- Registered function set is role-dependent:
- host uses `hostFunctionDefinitions`
- cluster uses `clusterFunctionDefinitions`
## Ground-truth counts
- Host function set: **101**
- Cluster function set: **12**
- Cluster subset functions:
- `approvalRequest`, `approvalResolve`
- `todoistCommentAdded`, `todoistTaskCompleted`, `todoistTaskCreated`
- `frontMessageReceived`, `frontMessageSent`, `frontAssigneeChanged`
- `todoistMemoryReviewBridge`
- `githubWorkflowRunCompleted`, `githubPackagePublished`
- `webhookSubscriptionDispatchGithubWorkflowRunCompleted`
## App registration isolation
From `inngest/client.ts`:
- app id resolves to:
- `system-bus-host` when role is host
- `system-bus-cluster` when role is cluster
- explicit `INNGEST_APP_ID` overrides role-derived id.
This prevents host and cluster workers from overwriting each other’s function graphs.
## serveHost behavior
From `serve.ts`:
- host role default `serveHost`: `http://host.docker.internal:3111`
- cluster role default `serveHost`: unset (connect-mode default)
- `INNGEST_SERVE_HOST` overrides either role.
Kubernetes cluster worker manifest sets:
- `INNGEST_BASE_URL=http://inngest-svc:8288`
- `INNGEST_SERVE_HOST=http://system-bus-worker:3111`
## Registration mechanics
- Worker exposes `GET|POST|PUT /api/inngest`.
- Worker sends a delayed self-sync `PUT /api/inngest` ~5s after startup.
- `worker-supervisor` also performs startup PUT sync.
## Host is primary today
From index comments + function lists:
- ADR-0089 transition: host remains authoritative for broad function ownership.
- Cluster is intentionally limited to cluster-safe subset (12 functions).
---
## 4) Event Flow (CLI → Inngest → Worker → Completion)
## Canonical flow: `joelclaw send`
1. CLI `joelclaw send <event>` calls `Inngest.send()`.
2. `Inngest.send()` POSTs event JSON to:
- `${INNGEST_URL}/e/${INNGEST_EVENT_KEY}`
- default: `http://localhost:8288/e/<key>`
3. Inngest server persists the event and resolves matching function triggers.
4. Inngest dispatches function steps to the worker app graph that owns that function ID:
- host app (`system-bus-host`) for 101-host set
- cluster app (`system-bus-cluster`) for 12-cluster subset
5. Worker handles callbacks via `/api/inngest` (Hono + `inngest/hono` handler).
6. Each `step.run` result is memoized by Inngest; next step executes when prior completes.
7. Completion/failure is queryable via GraphQL (`/v0/gql`) and CLI commands (`runs`, `run`, `event`, `events`).
## Queue flow: `joelclaw queue emit` → Restate drainer → durable dispatch
1. CLI `joelclaw queue emit <event>` persists a `QueueEventEnvelope` into Redis stream `joelclaw:queue:events` and indexes it in sorted set `joelclaw:queue:priority`.
2. The `restate-worker` k8s deployment (`packages/restate/src/index.ts`) starts a deterministic queue drainer beside the channel callback listener.
3. On startup, the drainer claims pending + never-delivered entries via `@joelclaw/queue#getUnacked()`, reindexes replayable entries, and emits OTEL replay evidence.
4. Each drain tick selects the next priority candidate from the sorted set, resolves its static registry target from `packages/queue/src/registry.ts`, and POSTs a one-node DAG request to Restate `/dagOrchestrator/{workflowId}/run/send`.
5. When backlog remains and a dispatch slot frees, the drainer self-pulses immediately instead of waiting for the next `QUEUE_DRAIN_INTERVAL_MS` heartbeat. That interval is now the idle poll / retry cadence, not a mandatory 2-second tax between successful sends.
6. The current Story-3 bridge re-emits the queue item to its registered Inngest event target inside that one-node DAG request. This is deliberate: the deterministic queue/drainer is proven first; per-family Restate cutovers remain Story 4 work.
7. On accepted Restate dispatch, the drainer acks the queue message; on failure it leaves the message in Redis, applies retry cooldown, and emits `queue.dispatch.failed` OTEL evidence.
8. If backlog remains in Redis but the drainer stops making progress past `QUEUE_DRAIN_STALL_AFTER_MS`, it emits `queue.drainer.stalled` and exits non-zero so k8s restarts `deployment/restate-worker`. That is the self-heal path for a wedged drainer inside an otherwise-running Bun process.
9. Crash recovery comes from the Redis stream + consumer-group replay path, not from vibes: restart the `restate-worker` pod, let `getUnacked()` reclaim the inflight entries, then drain resumes.
## Workload flow: `joelclaw workload run` → Redis → Restate DAG → execution
1. `joelclaw workload plan ... --stages-from <file>` can load an explicit stage DAG, validate unknown deps/self-deps/duplicates/cycles, and preserve per-stage acceptance gates.
2. `joelclaw workload run <plan-artifact>` normalizes the selected stage into the canonical `workload/requested` runtime request.
3. Queue admission writes the request into Redis, where the deterministic drainer forwards it into Restate as a `dagOrchestrator/{workflowId}/run/send` request.
4. `dagOrchestrator` executes dependency waves: ready nodes in parallel, chained nodes only after every `dependsOn` node has terminal output.
5. `dagWorker` executes the node handler:
- `shell` → subprocess work inside the `restate-worker` pod
- `infer` → `pi -p --no-session --no-extensions` inside the pod, using the mounted auth + identity + skill set
- `microvm` → Firecracker boot/restore through `/dev/kvm` with kernel/rootfs/snapshot files on PVC `firecracker-images`
6. Each node emits OTEL (`dag.node.*`), and the workflow emits `dag.workflow.*` so queue → Restate → execution remains observable.
7. Current truthful limit: the microVM runtime boots and restores snapshots in-cluster, but the broader exec-in-VM workspace drive protocol is still incomplete for general coding slices.
## Webhook flow
1. External service posts to `/webhooks/:provider`.
2. Caddy routes `/webhooks/*` on `localhost:8443` to worker `localhost:3111`.
3. `webhookApp` verifies signature, normalizes payload, emits Inngest events (`provider/event`).
4. Inngest executes subscribed functions.
## "Why did this run / not run" trace recipe
1. `joelclaw send <event> -d '<payload>'`
2. `joelclaw events --prefix <event-prefix> --hours 1`
3. `joelclaw event <event-id>` (fan-out to function runs)
4. `joelclaw run <run-id>` (step trace + errors)
5. `joelclaw runs --count 20 --hours 1`
6. `joelclaw otel search "<component/action>" --hours 1`
7. Validate function ownership in `index.host.ts` / `index.cluster.ts`.
---
## 5) Port Map (Canonical)
> Exposure sources: k8s service manifests, Caddyfile, `kubectl get svc`, `lsof` listeners.
| Port | Listener / owner | What it is | Exposure path |
|---:|---|---|---|
| 3111 | host bun worker | host system-bus worker HTTP (`/`, `/api/inngest`, `/webhooks`, `/observability/emit`) | local host; proxied via Caddy 3443 + webhook path via 8443 |
| 8080 | ssh forward (Colima) -> restate | Restate ingress / workflow API | NodePort + host forward |
| 8288 | ssh forward (Colima) -> Inngest svc | Inngest API + dashboard backend | NodePort + host forward; proxied via Caddy 9443 |
| 8289 | ssh forward (Colima) -> Inngest ws | Inngest connect websocket | NodePort + host forward; proxied via Caddy 8290 |
| 6379 | ssh forward (Colima) -> Redis | Redis | NodePort + host forward |
| 8108 | ssh forward / kubectl port-forward | Typesense API | ClusterIP; exposed locally by port-forward |
| 9070 | ssh forward (Colima) -> restate | Restate admin API | NodePort + host forward |
| 9071 | ssh forward (Colima) -> restate | Restate metrics | NodePort + host forward |
| 9080 | k8s `restate-worker` service | Restate worker HTTP (`dagOrchestrator`, `dagWorker`, queue drainer) | ClusterIP only |
| random high local port | transient `kubectl port-forward` (CLI-managed) -> `svc/dkron-svc:8080` | Dkron HTTP API | ClusterIP only; short-lived operator tunnel |
| 3838 | ssh forward (Colima) -> docs-api | docs-api HTTP | NodePort + host forward; proxied via Caddy 5443 |
| 7880 | ssh forward (Colima) -> livekit-server | LiveKit signaling | NodePort 7880; proxied via Caddy 7443 |
| 7881 | ssh forward (Colima) -> livekit-server | LiveKit RTC TCP | NodePort 7881 |
| 3000 | k8s bluesky-pds NodePort | Bluesky PDS HTTP | NodePort 3000 |
| 30900 | k8s minio-nodeport | Legacy MinIO S3 API (HTTP) | NodePort 30900 |
| 30901 | k8s minio-nodeport | Legacy MinIO console (HTTP) | NodePort 30901 |
| 31000 | k8s aistor-s3-api (`aistor` ns) | AIStor S3 API (TLS) | NodePort 31000 |
| 31001 | k8s aistor-s3-console (`aistor` ns) | AIStor console (TLS) | NodePort 31001 |
| 3443 | Caddy | HTTPS reverse proxy to `localhost:3111` | tailnet HTTPS |
| 5443 | Caddy | HTTPS reverse proxy to `localhost:3838` | tailnet HTTPS |
| 7443 | Caddy | HTTPS reverse proxy to `localhost:7880` | tailnet HTTPS |
| 9443 | Caddy | HTTPS reverse proxy to `localhost:8288` | tailnet HTTPS |
| 8290 | Caddy | HTTPS reverse proxy to `localhost:8289` | tailnet HTTPS |
| 8443 | Caddy (HTTP) | webhook/public ingress router | expected Funnel target |
| 6443 | Caddy | reverse proxy to local 6333 (Qdrant) | tailnet HTTPS |
| 3018 | gateway daemon | gateway websocket stream port | local |
| 9999 | talon | Talon health endpoint | local `127.0.0.1` |
| 8765 | agent-mail HTTP service | MCP agent-mail API | local `127.0.0.1` |
| 64784 | ssh forward | Kubernetes API | local kubectl endpoint |
### Notes
- Host NodePort exposure appears through an `ssh` listener process (Colima portForwarder=ssh).
- Exact per-port ssh forward command line is **UNKNOWN — needs manual verification** (process introspection restricted in this environment).
---
## 6) Storage Topology
## Redis
- Runtime: k8s StatefulSet (`redis:7-alpine`, appendonly enabled).
- Primary uses:
- gateway queue/session keys (`joelclaw:events:*`, `joelclaw:notify:*`, `joelclaw:gateway:sessions`)
- webhook subscriptions (`joelclaw:webhook:*`)
- gateway health mute/streak keys (`gateway:health:*`)
## Typesense
From observability code:
- `otel_events` collection (canonical telemetry event store)
- `memory_observations` collection (vector-aware memory index; schema validated at startup)
- docs-api also points at `http://typesense:8108` for docs search/index surfaces.
## Firecracker runtime storage
- PVC: `firecracker-images`
- Mounted in `deployment/restate-worker` at `/tmp/firecracker-test`
- Stores:
- kernel (`vmlinux`)
- rootfs (`agent-rootfs.ext4`)
- snapshots (`snapshots/vm.snap`, `snapshots/vm.mem`)
- Firecracker snapshot restore is currently operator-proven at ~9ms on the Colima VZ nested-virt path.
## Inngest state
- StatefulSet PVC mounted at `/data`
- `INNGEST_SQLITE_DIR=/data`
## docs-api surface
- Deployment: `docs-api` on NodePort `3838`
- Route count: 11 endpoints including `/health`
- Key routes:
- `GET /search` — hybrid chunk search with `concept`, `concepts`, `doc_id`, `expand`, and `assemble`
- `GET /docs/search`
- `GET /docs`
- `GET /docs/:id`
- `GET /docs/:id/toc`
- `GET /docs/:id/chunks`
- `GET /chunks/:id`
- `GET /concepts`
- `GET /concepts/:id`
- `GET /concepts/:id/docs`
- Taxonomy surface: 21-concept SKOS graph (10 parents + 11 sub-concepts) with `broader`, `narrower`, and `related` edges.
## NAS (ADR-0088 + ADR-0187)
Tiering policy:
- Tier 1 local SSD (hot runtime state)
- Tier 2 NAS NVMe (`/Volumes/nas-nvme` ↔ `/volume2/data`)
- Tier 3 NAS HDD (`/Volumes/three-body`)
### Access paths
| From | NVMe tier (1.5TB) | HDD tier (56TB) | Method |
|------|-----------|----------|--------|
| macOS host | `/Volumes/nas-nvme` | `/Volumes/three-body` | NFS mount via LaunchDaemon |
| k8s pods | PVC `nas-nvme` | PVC `nas-hdd` | NFS PV (192.168.1.163) |
| host-worker funcs | `/Volumes/nas-nvme` | `/Volumes/three-body` | Direct path (runs on macOS) |
### k8s ↔ NAS networking
k8s pods reach the NAS via a LAN route through the Colima col0 bridge:
`Talos → Docker NAT → VM col0 → macOS (ip.forwarding=1) → LAN → NAS`
The VZ NAT on eth0 does NOT forward LAN traffic. Route persisted in Colima provision + colima-tunnel script:
`ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0`
**Always use IP 192.168.1.163, never hostname three-body** — DNS doesn't resolve from k8s.
Degradation contract (ADR-0187):
- writes must fallback `local -> remote -> queued`
- queue spool default: `/tmp/joelclaw/nas-queue`
## Vault
- Obsidian vault at `/Users/joel/Vault`
- system log file: `/Users/joel/Vault/system/system-log.jsonl`
---
## 7) Networking Topology
## Caddy reverse proxy routes (from `~/.local/caddy/Caddyfile`)
- `https://panda.tail7af24.ts.net:9443` -> `localhost:8288` (Inngest)
- `https://panda.tail7af24.ts.net:8290` -> `localhost:8289` (Inngest connect)
- `https://panda.tail7af24.ts.net:3443` -> `localhost:3111` (worker)
- `https://panda.tail7af24.ts.net:5443` -> `localhost:3838` (docs-api)
- `https://panda.tail7af24.ts.net:7443` -> `localhost:7880` (LiveKit)
- `https://panda.tail7af24.ts.net:6443` -> `localhost:6333` (Qdrant)
- `http://localhost:8443` path router:
- `/webhooks/*` -> `localhost:3111`
- fallback -> `localhost:8288`
## Tailscale + Funnel
- Config comments and ADR-0051 describe Funnel path `:443 -> localhost:8443`.
- Runtime `tailscale status` unavailable here: **UNKNOWN — needs manual verification**.
## External webhook ingress
Expected path:
1. Internet provider -> Tailscale Funnel :443
2. Funnel -> local `:8443`
3. Caddy path route `/webhooks/*` -> worker `:3111`
4. worker `/webhooks/:provider` verifies + emits Inngest event
---
## 8) CLI Wiring (Command Tree → Endpoint Surface)
Primary command tree root: `packages/cli/src/cli.ts`.
## Endpoint map by command family
| Command family | Primary backend |
|---|---|
| `send` | Inngest Event API `POST /e/<event-key>` |
| `runs`, `run`, `functions`, `event`, `events` | Inngest GraphQL `POST /v0/gql` |
| `status` | Inngest/worker health probes + k8s checks + agent-mail liveness |
| `gateway *` | Redis keys/channels + launchd/system ops |
| `workload *` | workload planner + Redis queue admission + Restate `dagOrchestrator` / `dagWorker` runtime |
| `docs *` | docs-api REST API (`/search`, `/docs/*`, `/chunks/*`, `/concepts*`) |
| `restate cron *` | Dkron REST API via direct `--base-url` or short-lived `kubectl port-forward` to `svc/dkron-svc` |
| `otel *` | Typesense `otel_events` via capability adapter |
| `recall *` | Typesense recall adapter |
| `mail *` | Agent-mail MCP HTTP (`127.0.0.1:8765`) via CLI adapter wrappers |
| `inngest *` | worker launchd + Talon + k8s + Typesense diagnostics |
Config source:
- `~/.config/system-bus.env` (plus env overrides)
- defaults:
- `INNGEST_URL=http://localhost:8288`
- `INNGEST_WORKER_URL=http://localhost:3111`
---
## 9) Observability + Tracing Topology
## OTEL event pipeline
- Worker emits via `emitOtelEvent()` / `emitMeasuredOtelEvent()`.
- Gateway emits via `@joelclaw/telemetry` (`emitGatewayOtel`) to:
- default `OTEL_EMIT_URL=http://localhost:3111/observability/emit`
- Worker endpoint `/observability/emit` validates token (`x-otel-emit-token`) if configured.
- Store path (`storeOtelEvent`):
1. Typesense `otel_events` (primary)
2. optional Convex mirror for high-severity recent window
3. optional Sentry forward for `warn/error/fatal`
## Langfuse integration points
- Gateway boot: `packages/gateway/src/daemon.ts` calls `initTracing({})` from inference-router.
- Inference router traces model-route decisions:
- `packages/inference-router/src/tracing.ts`
- used from `packages/inference-router/src/router.ts`
- System-bus LLM traces:
- `packages/system-bus/src/lib/langfuse.ts` (`traceLlmGeneration`)
- called by `packages/system-bus/src/lib/inference.ts` and `channel-message-classify.ts`
---
## 10) Key ADR Topology Decisions
| ADR | Title | Status | Topology impact |
|---|---|---|---|
| ADR-0048 | Webhook gateway | shipped | `/webhooks/:provider` normalization + signature verification + Inngest emission |
| ADR-0088 | NAS-backed storage tiering | shipped | Defines SSD/NAS NVMe/NAS HDD storage contract |
| ADR-0089 | Single-source worker deployment | shipped | Host/cluster role split + single canonical source |
| ADR-0144 | Gateway hexagonal architecture | shipped | Gateway as composition root; heavy logic in `@joelclaw/*` |
| ADR-0155 | Three-stage story pipeline | shipped | Simplified story function flow through Inngest durable steps |
| ADR-0156 | Graceful worker restart | superseded | Historical restart strategy; superseded by Talon ADR |
| ADR-0159 | Talon watchdog daemon | shipped | Compiled watchdog + infra supervision model |
| ADR-0038 | Embedded pi gateway daemon | shipped | Always-on gateway session architecture |
| ADR-0051 | Tailscale Funnel ingress | shipped | Public webhook ingress via Funnel/Caddy pattern |
| ADR-0148 | k8s resilience policy | accepted | NodePort-first exposure, probe requirements, restart recovery checklist |
| ADR-0158 | worker-supervisor binary | superseded | Legacy supervisor ADR now superseded, but binary remains in active launchd path |
| ADR-0182 | node-0 localhost resilience | shipped | endpoint class fallback (`localhost -> vm -> svc_dns`) |
| ADR-0187 | NAS degradation fallback contract | accepted | mandatory local/remote/queued write fallback |
| ADR-0212 | AIStor as local S3 runtime | accepted | maintained local S3 runtime in `aistor` namespace; legacy MinIO retained for rollback |
---
## 10.1) Sandbox Execution Contract (@joelclaw/agent-execution)
**Package**: `packages/agent-execution/`
**Purpose**: Canonical contract for sandboxed story execution shared between Restate workflows, system-bus Inngest functions, and k8s Job launcher.
### Contract Types
**Request**: `SandboxExecutionRequest`
- `workflowId`, `requestId`, `storyId`: identifiers
- `task`: story prompt/task to execute
- `agent`: `{ name, variant?, model?, program? }`
- `sandbox`: `"workspace-write" | "danger-full-access"`
- `baseSha`: git SHA before execution
- `cwd?`: working directory
- `timeoutSeconds?`: timeout
- `verificationCommands?`: post-execution verification
- `sessionId?`: tracking identifier
**Result**: `SandboxExecutionResult`
- `requestId`: correlation ID
- `state`: `"pending" | "running" | "completed" | "failed" | "cancelled"`
- `startedAt`, `completedAt?`, `durationMs?`: timing
- `artifacts?`: execution artifacts (see below)
- `error?`: error message (failed state)
- `output?`: stdout/stderr output
**Artifacts**: `ExecutionArtifacts`
- `headSha`: git SHA after execution
- `touchedFiles`: list of modified/untracked files from `git status --porcelain`
- `patch?`: git patch content (format-patch or diff)
- `verification?`: `{ commands, success, output }`
- `logs?`: `{ executionLog?, verificationLog? }`
### Repo Materialization (Story 3)
**Function**: `materializeRepo(targetPath, baseSha, options)`
**Behavior**:
- Clone repo if target path doesn't exist (requires `remoteUrl`)
- Fetch + checkout if target path exists
- SHA verification after checkout
- Automatic unshallow if SHA not in shallow clone
- Isolated sandbox-local workspace (host worktree untouched)
**Returns**: `{ path, sha, freshClone, durationMs }`
**Key options**:
- `remoteUrl?`: remote URL for fresh clone
- `branch?`: branch/ref to fetch (default: `"main"`)
- `depth?`: shallow clone depth (default: `1`)
- `includeSubmodules?`: include submodules
- `timeoutSeconds?`: timeout (default: `300`)
### Artifact Export (Story 3)
**Function**: `generatePatchArtifact(options)`
**Behavior**:
- Captures touched-file inventory via `getTouchedFiles()`
- Generates git patch from `baseSha..headSha`:
- Uses `git format-patch` if commits exist in range
- Uses `git diff` if only uncommitted changes
- Optionally includes untracked files as patch content
- Embeds verification summary and log references
- Serializable to JSON via `writeArtifactBundle()`
**Key options**:
- `repoPath`: path to git repo
- `baseSha`: base SHA (start of diff range)
- `headSha?`: head SHA (default: HEAD)
- `includeUntracked?`: include untracked files (default: `true`)
- `verificationCommands?`, `verificationSuccess?`, `verificationOutput?`: verification data
- `executionLogPath?`, `verificationLogPath?`: log references
- `timeoutSeconds?`: timeout (default: `60`)
**Returns**: `ExecutionArtifacts`
### Promotion Boundary (Phase 1)
**Authoritative output is patch bundle + metadata.**
Sandbox runs **do not** merge to main or push to remote. The runtime:
1. Materializes repo at `baseSha` in sandbox-local workspace
2. Executes agent task
3. Runs verification commands
4. Exports patch artifact with touched files and verification results
5. Emits `SandboxExecutionResult` event with `ExecutionArtifacts`
Promotion is a separate operator decision:
- Restate workflow receives `ExecutionArtifacts`
- Operator reviews patch + verification summary
- Operator applies patch to host repo (or discards)
- Operator commits and pushes (if approved)
This keeps sandbox runs isolated and reversible.
### k8s Job Integration
**Job spec generation**: `generateJobSpec(request, options)`
Cold k8s Jobs for isolated story execution:
- Deterministic Job naming keyed by `requestId`
- Runtime image contract: Git, Bun, agent tooling, `/workspace` directory
- Environment-driven config: `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `TASK_PROMPT_B64`, `BASE_SHA`, etc.
- Resource limits: `500m-2` CPU, `1-4Gi` memory (configurable)
- TTL cleanup: auto-delete after 5 minutes (default)
- Active deadline: 1 hour max runtime (default)
- No automatic retries (`backoffLimit: 0`)
- Security: non-root (UID 1000), no privilege escalation, capabilities dropped
**Runtime contract**:
1. Decode `TASK_PROMPT_B64` from env
2. Call `materializeRepo()` at `BASE_SHA`
3. Execute agent with task
4. Run verification commands (if `VERIFICATION_COMMANDS_B64` set)
5. Call `generatePatchArtifact()` with results
6. Emit `SandboxExecutionResult` event with `ExecutionArtifacts`
7. Exit 0 (success) or non-zero (failure)
**Cancellation**: Delete Job resource (SIGTERM to container)
**Job deletion**: `generateJobDeletion(requestId)` -> `{ name, namespace, propagationPolicy }`
See `k8s/agent-runner.yaml` for full runtime contract specification.
### Topology Impact
- **Story 2**: Added contract types and Job spec generation
- **Story 3**: Added repo materialization and artifact export helpers
- **ADR-0221 phase 1**: added explicit local sandbox isolation primitives — deterministic sandbox identity, deterministic local sandbox paths, per-sandbox env materialization, minimal/full mode vocabulary, and a JSON registry helper for host-worker sandboxes
- **ADR-0221 phase 2**: wired those local helpers into the real host-worker `system/agent-dispatch` local backend so sandbox runs now allocate deterministic paths under `~/.joelclaw/sandboxes/`, materialize `.sandbox.env`, persist registry state, and carry `localSandbox` metadata in inbox snapshots
- **ADR-0221 phase 3/4/5/6**: phase 3 added terminal retention/cleanup policy (`cleanupAfter` + registry metadata), opportunistic pruning of expired local sandboxes on new-run startup, copy-first `.devcontainer` materialization helpers with exclusion rules for env/secret junk, live sandbox env injection so the agent process actually sees the reserved runtime identity, a hash-preserving sandbox identity fix after live dogfood exposed path collisions from long shared requestId prefixes, abbreviated-`baseSha` acceptance during repo materialization, truthful failed inbox snapshots when dispatch crashes before normal terminal writeback, and a repeatable operator probe at `bun scripts/verify-local-sandbox-dispatch.ts`; phase 4 adds `sandboxMode=minimal|full` through the workload front door, requested-cwd mapping inside the cloned checkout, compose-backed full local mode startup, the reality that stale Restate workers can reject `workload/requested` until restarted and reloaded, a recursion guard because sandboxed stage runs were able to call `scripts/verify-workload-full-mode.ts` / `joelclaw workload run` from inside the sandbox and spawn nested canaries instead of terminating honestly, and a guarded workflow-rig proof run (`WR_20260310_013158`) that completes terminally with healthy compose startup plus clean teardown; phase 5 adds the operator-facing CLI surface `joelclaw workload sandboxes list|cleanup|janitor` so retained sandboxes can be inspected and janitored on demand instead of only during startup opportunistic pruning, and the operator surfaces now reconcile registry entries against per-sandbox metadata before reporting or deleting so old partial writeback residue stops lying about terminal state; phase 6 makes janitoring scheduled instead of purely manual via repo-managed launchd service `com.joel.local-sandbox-janitor`, which runs `scripts/local-sandbox-janitor.sh` → `joelclaw workload sandboxes janitor` at load and every 30 minutes
- **Future**: Runtime image build, hot-image CronJob, warm-pool scheduler, Restate integration
**Current state**: the host-worker local sandbox path is now using the local-isolation helpers in production code, the package has a concurrent proof that two local sandboxes keep distinct compose identity plus copied devcontainer state, guarded full-mode workflow-rig dogfood closes terminally, and cleanup now has both on-demand CLI surfaces and scheduled launchd janitoring. Follow-on work is now about deeper runtime ergonomics and debugging any remaining non-terminal stale residues, not missing basic cleanup automation.
---
## 11) Verification Commands (Health + Wiring)
## Core topology
```bash
# Colima + VM IP
colima status --json
# Kubernetes control plane + node
kubectl cluster-info
kubectl get nodes -o wide
# Core workloads
kubectl get pods -n joelclaw -o wide
kubectl get svc -n joelclaw -o wide
```
## Host supervision
```bash
# Worker supervisor launchd state
launchctl print gui/$(id -u)/com.joel.system-bus-worker | rg "state =|pid =|last exit code"
# Gateway / Caddy / Talon
launchctl print gui/$(id -u)/com.joel.gateway | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.caddy | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid ="
# Talon health
curl -s http://127.0.0.1:9999/health
```
## Worker role split
```bash
# Parse role counts directly from source lists
python - <<'PY'
import re
from pathlib import Path
for f,name in [('packages/system-bus/src/inngest/functions/index.host.ts','host'),('packages/system-bus/src/inngest/functions/index.cluster.ts','cluster')]:
txt=Path(f).read_text()
body=re.search(rf'export const {name}FunctionDefinitions = \[(.*?)\];', txt, re.S).group(1)
count=sum(1 for line in body.splitlines() if line.strip() and not line.strip().startswith('//'))
print(name, count)
PY
# Inngest app ID derivation logic
rg -n "INNGEST_APP_ID|system-bus-host|system-bus-cluster|WORKER_ROLE" packages/system-bus/src/inngest/client.ts
```
## Event flow trace
```bash
# Send event
joelclaw send <event> -d '<json>'
# Trace event and resulting runs
joelclaw events --prefix <event-prefix> --hours 1 --count 20
joelclaw event <event-id>
joelclaw runs --hours 1 --count 20
joelclaw run <run-id>
# Telemetry correlation
joelclaw otel search "<component_or_action>" --hours 1
```
## Networking
```bash
# Caddy route config
caddy validate --config ~/.local/caddy/Caddyfile
# Listening ports snapshot
/usr/sbin/lsof -iTCP -sTCP:LISTEN -n -P
# Tailscale runtime (if daemon available)
tailscale status --json
```
---
## 12) Known Unknowns (Do Not Guess)
- Tailscale daemon state is not readable in this environment.
- `tailscale status --json` -> failed to connect.
- **UNKNOWN — needs manual verification**
- `docs/architecture.md`, `docs/deploy.md`, `docs/observability.md` are absent in-repo.
- **UNKNOWN — needs manual verification**
- Exact command-line ownership of all Colima ssh forwarding ports (`64784`, `64785`, `9627`, etc.)
- **UNKNOWN — needs manual verification**
- Ingress controller runtime status for `k8s/docs-api-ingress.yaml`
- **UNKNOWN — needs manual verification**
---
## 13) Mandatory Update Policy (Non-Optional)
Update this skill **in the same change** whenever any of these change:
1. Worker runtime wiring
- `serve.ts`, `client.ts`, `index.host.ts`, `index.cluster.ts`
- `WORKER_ROLE`, app IDs, serveHost behavior, registration path
2. Supervision/process topology
- any `~/Library/LaunchAgents/com.joel*.plist`
- `infra/worker-supervisor/*`, Talon behavior, gateway launch script/label
3. Kubernetes topology
- any file under `k8s/`
- Helm values affecting core services (`livekit`, `pds`, etc.)
- Service type/port changes (NodePort/ClusterIP)
4. Networking/ingress
- Caddyfile route/port changes
- Tailscale/Funnel hostnames or ingress path changes
- Colima/VM networking model changes
5. Storage topology
- Redis keyspace contracts for gateway/webhook routing
- Typesense telemetry collection/schema changes
- NAS mount/fallback/queue contract changes
6. Observability/tracing
- OTEL emit endpoint/token behavior
- telemetry storage path changes (Typesense/Convex/Sentry)
- Langfuse integration points
7. CLI control-plane routing
- command families moved to different endpoints/services
8. ADR status changes affecting topology
- especially ADR-0048, 0088, 0089, 0144, 0155, 0156, 0159, 0182, 0187
If any item above changed and this skill was not updated, this skill is stale and non-canonical.
More from joelhooks/joelclaw
- add-skillCreate new joelclaw skills with the idiomatic process — repo-canonical, symlinked, git-tracked, slogged. Triggers on 'add a skill', 'create skill', 'new skill', 'canonical skill', 'make a skill for', or any request to formalize a process or domain into a reusable skill.
- adr-skillCreate and maintain Architecture Decision Records (ADRs) optimized for agentic coding workflows. Use when you need to propose, write, update, accept/reject, deprecate, or supersede an ADR; bootstrap an adr folder and index; consult existing ADRs before implementing changes; or enforce ADR conventions. This skill uses Socratic questioning to capture intent before drafting, and validates output against an agent-readiness checklist.
- agent-discovery"Optimize websites, docs, and product surfaces for agent discoverability and operator UX. Use when working on agent SEO/AEO/GEO, crawl policy, markdown or JSON projections, llms.txt, sitemap.md, AGENTS.md guidance, content negotiation, accessibility for browser agents, or any request to make a site easier for pi, OpenCode, Claude Code, ChatGPT, Perplexity, or other agent harnesses to find and use."
- agent-loopStart, monitor, and cancel durable multi-agent coding loops via Inngest. Use when the user wants to run autonomous coding workloads, execute a PRD with multiple stories, kick off an AFK coding session, have agents implement features from a plan, or manage running loops. Triggers on "start a coding loop", "run this PRD", "implement these stories", "go AFK and code this", "check loop status", "cancel the loop", "joelclaw loop", or any request for autonomous multi-story code execution.
- agent-mail>-
- agent-workloads"Compatibility alias for the canonical `workflow-rig` front door. Use when older prompts mention `agent-workloads` or when you need the legacy workload-planning guidance; for new work, load `workflow-rig` first."
- clawmail>-
- cli-design"Design and build agent-first CLIs with HATEOAS JSON responses, context-protecting output, and self-documenting command trees. Use when creating new CLI tools, adding commands to existing CLIs (joelclaw, slog), or reviewing CLI design for agent-friendliness. Triggers on 'build a CLI', 'add a command', 'CLI design', 'agent-friendly output', or any task involving command-line tool creation."
- codex-prompting"Use this skill for any request to trigger, coordinate, or craft prompts for Codex. Use when user says 'send to codex', 'use codex', 'prompt codex', 'ask codex', 'delegate to codex', 'run in codex', or asks for a Codex-first execution handoff."
- content-publish"Publish content to joelclaw.com via the Convex-first pipeline. Covers the full lifecycle: draft → review → publish → revalidate → verify. Handles secret leasing, tag conventions, content types (article, tutorial, note, essay), and verification gates. Use when: 'write article about X', 'publish article <slug>', 'draft a tutorial', 'publish this', 'push to convex', or any content publishing task."