cron-watchdog-debug
$
npx mdskill add vercel-labs/vercel-openclaw/cron-watchdog-debugDiagnose cron wake failures and watchdog state mismatches.
- Resolve scheduled job wake failures and incorrect watchdog status.
- Integrates with Vercel Cron, OpenClaw, and admin API endpoints.
- Generates proof diagrams and hypothesis tables for root cause.
- Outputs structured artifacts and state maps to the agent.
SKILL.md
.github/skills/cron-watchdog-debugView on GitHub ↗
---
name: cron-watchdog-debug
description: "Cron and watchdog debugging for vercel-openclaw: Vercel Cron auth, persisted OpenClaw jobs, cron wake keys, token refresh, restore oracle, hot spare, and watchdog reports. Use when scheduled OpenClaw jobs fail to wake or run, watchdog status is wrong, cron persistence is suspect, or /api/cron/watchdog behavior changes."
---
# Cron Watchdog Debug
Use this skill for cron wake, watchdog, scheduled OpenClaw job, and restore-oracle incidents.
## Non-Negotiables
Before proposing a fix, produce:
1. Deployment-state proof.
2. Watchdog path diagram with every edge marked `unknown`, `verified-good`, or `verified-bad`.
3. Hypothesis table with fastest falsifier and status.
4. Cron Watchdog Handoff.
Do not treat "cron route ran" as proof that an OpenClaw scheduled job ran. Keep these states separate: Vercel Cron invoked, watchdog authorized, persisted wake due, sandbox woke, AI Gateway token refreshed, OpenClaw cron scheduler loaded jobs, user-visible delivery happened.
## Evidence First
For live deployments, create one artifact root:
```bash
RUN_TS="$(date -u +%Y%m%dT%H%M%SZ)"
ART=".agent-runs/cron-debug/$RUN_TS"
mkdir -p "$ART"/{admin,vercel,sandbox,workflow}
```
Collect before editing code:
- `.vercel/project.json` versus the intended project/team/deployment.
- `git rev-parse HEAD` and `git ls-remote origin main`.
- Live `GET /api/status`.
- Live `GET /api/admin/sandbox-diag`.
- Live `GET /api/admin/logs` filtered for `watchdog.`, `sandbox.`, `gateway.`, `cron`, and `restore`.
- Live `GET /api/admin/launch-verify` if the incident involves launch readiness.
- Live `GET` or `POST /api/cron/watchdog` only with valid cron authorization.
If a release env file is supplied, source it only in the shell running live commands. Never print or save raw secrets.
## Watchdog Path Diagram
Use this shape and mark each edge:
```text
Vercel Cron tick -> /api/cron/watchdog auth -> runSandboxWatchdog
-> deployment contract -> metadata read -> busy/stale/running/stopped branch
-> cron wake key read -> ensureSandboxReady? -> token refresh
-> OpenClaw cron jobs present in sandbox -> scheduled job executes -> delivery visible
```
For a running sandbox due soon, include the parallel branch:
```text
running sandbox -> gateway probe -> cronNextWakeMs <= now + 60s
-> force AI Gateway token refresh -> restore oracle cycle -> no duplicate wake
```
## Runtime Probes
Read admin surfaces first. Then, only if needed, inspect the sandbox with `npx sandbox` or the app admin SSH/exec fallback.
Read-only sandbox checks:
```bash
set -eu
echo "== process =="
ps -eo pid,ppid,comm,args | grep -E "[o]penclaw|[n]ode" || true
echo "== ports =="
(ss -ltnp || netstat -ltnp) 2>/dev/null | grep -E ":(3000|8787)\\b" || true
echo "== cron files =="
ls -la /home/vercel-sandbox/.openclaw/cron 2>/dev/null || true
echo "== cron jobs shape =="
node - <<'NODE'
const fs = require('fs');
for (const p of [
'/home/vercel-sandbox/.openclaw/cron/jobs.json',
'/home/vercel-sandbox/.openclaw/cron/jobs-state.json',
]) {
try {
const j = JSON.parse(fs.readFileSync(p, 'utf8'));
console.log(p, JSON.stringify({
version: j.version ?? null,
jobCount: Array.isArray(j.jobs) ? j.jobs.length : Object.keys(j.jobs || {}).length,
jobIds: Array.isArray(j.jobs) ? j.jobs.map((x) => x.id) : Object.keys(j.jobs || {}),
}));
} catch (err) {
console.log(p, 'unreadable', err.message);
}
}
NODE
```
Do not print job payload text if it may contain user data. Prefer shape, counts, ids, and next-run timestamps.
## Store Evidence
Cron wake depends on store keys written through the lifecycle layer. Do not hardcode Redis prefixes in app code. For debugging, use app/admin surfaces when available; if direct store inspection is unavoidable, record key names and redacted shapes only.
High-signal facts:
- `cronNextWakeMs` exists and is due, future, missing, or malformed.
- `cronJobsJson` structured record exists, has a SHA-256, job count, and source.
- `lastRestoreMetrics.cronRestoreOutcome` is present or missing after wake.
- `watchdog` report contains `cron.wake`, `token.refresh`, `probe`, and `restore.prepare` checks.
## Hypotheses To Compare
Keep all rows visible as they are ruled out:
| Hypothesis | Evidence for | Evidence against | Fastest falsifier | Status |
|---|---|---|---|---|
| Vercel Cron did not invoke the route | | | Vercel logs for `/api/cron/watchdog` | |
| Cron auth rejected the request | | | route response/log status 401 vs report | |
| Wake key was never persisted | | | store/admin evidence for `cronNextWakeMs` | |
| Wake key is future or stale | | | compare `cronNextWakeMs` to current UTC | |
| Watchdog intentionally skipped idle sandbox | | | `cron.wake` check message and metadata status | |
| Sandbox woke but token refresh failed | | | `token.refresh` check and AI Gateway token metadata | |
| Jobs file missing or malformed in sandbox | | | sanitized cron file shape from sandbox | |
| OpenClaw scheduler loaded jobs but delivery failed | | | OpenClaw logs plus channel lastForward/user-visible evidence | |
| Restore oracle or hot spare side effect changed cron state | | | `restore.prepare` check and restore metrics | |
## Safe Fix Boundaries
- `src/app/api/cron/watchdog/route.ts` owns cron-route auth and response shape.
- `src/server/watchdog/run.ts` owns watchdog decision order and report checks.
- `src/shared/watchdog.ts` owns report types.
- `src/server/sandbox/lifecycle.ts` owns cron persistence keys, stop/touch persistence, and wake behavior.
- `src/server/sandbox/cron-persistence.test.ts` and `src/server/watchdog/run.test.ts` own regression coverage.
- `lat.md/sandbox-lifecycle.md` owns the architecture narrative for cron wake.
Do not edit channel routes while debugging cron unless evidence proves the scheduled job reached channel delivery and failed there. Switch to the relevant channel skill for that phase.
## Verification
Pick the smallest verification that covers the changed edge:
```bash
node scripts/verify.mjs --steps=test,typecheck
node --test src/server/watchdog/run.test.ts
node --test src/server/sandbox/cron-persistence.test.ts
lat check
```
For live fixes, include before/after evidence from `/api/cron/watchdog`, `/api/admin/logs`, and sandbox cron file shape.
## Handoff
Use `references/handoff-template.md` for incident reports.
More from vercel-labs/vercel-openclaw
- admin-ui-debugAdmin UI and operator surface debugging for vercel-openclaw: command shell design, admin actions, request core, status panels, launch verification UI, channel readiness UI, and local read-only production-data workflows. Use when the root admin UI, controls, visual state, or operator copy is wrong.
- auth-store-debugAuth and store debugging for vercel-openclaw: admin-secret mode, Sign in with Vercel, session cookies, CSRF, LOCAL_READ_ONLY, Redis vs memory store, keyspace namespacing, and metadata shape migrations. Use when login, route authorization, Redis persistence, or metadata state is suspect.
- channel-debug-coreChannel webhook triage for vercel-openclaw Slack/Telegram/Discord/WhatsApp issues: prove deployment state, collect admin readiness endpoints, build evidence-first handoff before fixes.
- channel-forward-parityWebhook route parity audit for channel delivery changes: ensure terminal paths log, record lastForward, classify failures, and refresh stale sandbox port URLs.
- discord-deliveryDiscord channel specialist workflow: debug interaction webhooks, Ed25519 signatures, deferred replies, workflow forwarding to /discord-webhook, integration reconcile, and token expiry.
- firewall-ai-gateway-debugFirewall and Vercel AI Gateway debugging for vercel-openclaw: network policy allowlists, OIDC token refresh, AI Gateway transform rules, firewall learning/enforcement, and sandbox.update networkPolicy calls. Use when model calls, egress, token refresh, or firewall policy application fails.
- gateway-proxy-debugGateway and proxy debugging for vercel-openclaw: /gateway routing, HTML injection, WebSocket rewrite, gateway-token handoff, waiting page, status heartbeat, sandbox port URL cache, and proxy auth. Use when the OpenClaw UI, WebSockets, gateway proxying, or waiting-page flow breaks.
- lat-md>-
- launch-verify-debugLaunch verification and remote smoke debugging for vercel-openclaw: preflight, queue ping, ensureRunning, chatCompletions, wakeFromSleep, restorePrepared, channelReadiness, NDJSON progress, and vclaw create readiness. Use when launch verification, vclaw create validation, or remote smoke checks fail.
- openclaw-bootstrap-debugOpenClaw bootstrap, bundle, config, and restore-asset debugging for vercel-openclaw: openclaw.bundle sidecars, plugin discovery, channel catalog, restart scripts, config hashes, dynamic resume files, and fast restore. Use when setup, gateway boot, plugin loading, or bundle-sidecar compatibility fails.