o11y-logging

Name: o11y-logging
Author: joelhooks/joelclaw

$ npx mdskill add joelhooks/joelclaw/o11y-logging

Enforces canonical observability logging to prevent silent failures when updating functions, APIs, or pipeline steps.

Helps ensure failures are visible by implementing structured telemetry on changes.
Integrates with OTEL, Inngest, gateway channels, and internal system-bus packages.
Triggers on keywords like 'observability', 'otel', or 'silent failure' to enforce rules.
Verifies event contracts and storage paths before a change is considered done.

SKILL.md

.github/skills/o11y-logging

---
name: o11y-logging
displayName: O11y Logging
description: "Implement and verify joelclaw observability on every change so failures cannot stay silent. Use when adding/updating Inngest functions, gateway channels, webhook providers, APIs, workers, or any pipeline step. Enforces canonical OTEL contract, storage path, and verification gates. Triggers on: 'o11y', 'observability', 'logging', 'otel', 'instrument this', 'silent failure', 'add telemetry', 'log this function'."
version: 1.1.0
author: Joel Hooks
tags: [joelclaw, observability, logging, o11y, otel, typesense, convex, inngest, gateway]
---

# JoelClaw Observability + Logging

Prevent silent failure by default. Observability is not optional polish: it is part of done.

## Non-Negotiable Rules

1. Use the canonical event contract only.
   - `packages/system-bus/src/observability/otel-event.ts`
   - `packages/system-bus/src/observability/emit.ts`
   - `packages/system-bus/src/observability/store.ts`
2. Worker/Inngest code emits through `emitOtelEvent` or `emitMeasuredOtelEvent`.
3. Gateway code emits through `emitGatewayOtel`.
4. Internal ingestion goes through `POST /observability/emit` (`packages/system-bus/src/serve.ts`), not ad-hoc writes.
5. Never treat `console.log` as primary observability. Keep structured events as source of truth.
6. High-cardinality values go in `metadata`, not in facet fields (`source`, `component`, `level`, `success`).
7. Failures must set `success: false` with a meaningful `error`.
8. For `warn`/`error`/`fatal` events, verify Convex mirror behavior (rolling window) in addition to the Typesense write.
9. In Inngest durable functions, any "emit once" telemetry must live inside `step.run(...)` to avoid replay duplication after resume.
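
Rule 9 follows from how durable execution works: the handler body is re-executed from the top when a run resumes, while completed steps return their stored result instead of running again. The sketch below simulates that with a hypothetical in-memory `makeStep` memoizer and `emit` recorder (neither is the Inngest SDK nor the joelclaw emitter) to show why an emit outside `step.run` repeats on every replay.

```typescript
// Hypothetical stand-ins for illustration only.
// This is NOT the Inngest SDK and NOT the joelclaw emitter.
type StepFn = <T>(id: string, fn: () => T) => T;
type Memo = Record<string, { value: unknown }>;

// Returns a step runner that stores each step's result by id and
// returns the stored result on later executions of the handler.
function makeStep(memo: Memo): StepFn {
  function run<T>(id: string, fn: () => T): T {
    if (Object.prototype.hasOwnProperty.call(memo, id)) {
      return memo[id].value as T;
    }
    const value = fn();
    memo[id] = { value };
    return value;
  }
  return run;
}

const emitted: string[] = [];
const emit = (action: string): void => {
  emitted.push(action);
};

// The whole handler body runs again on every replay.
function handler(step: StepFn): void {
  emit("outside.step"); // repeats on every execution of the handler
  step("emit-prereqs-passed", () => emit("inside.step")); // runs only once
  step("do-work", () => "done");
}

// One initial execution plus two replays, all sharing the same step memo.
const memo: Memo = {};
for (let i = 0; i < 3; i++) {
  handler(makeStep(memo));
}

const countOutside = emitted.filter((a) => a === "outside.step").length;
const countInside = emitted.filter((a) => a === "inside.step").length;
console.log({ countOutside, countInside });
```

In this simulation the unwrapped emit fires once per execution of the handler, while the emit wrapped in a step fires a single time.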

## Event Conventions

- `source`: subsystem (`worker`, `gateway`, `webhook`, `memory`, `verification`, etc.)
- `component`: stable module/service name (`check-system-health`, `redis-channel`, `observe`)
- `action`: stable dotted action (`system.health.checked`, `events.immediate_telegram`)
- `metadata`: request IDs, deployment IDs, function IDs, session IDs, payload identifiers
- `duration_ms`: include for timed operations

Use event-per-hop (wide event style): one context-rich event for each major boundary/operation, not scattered string logs.
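
To illustrate these conventions, the following sketch models one wide event as a TypeScript type. `WideEventSketch`, `isDottedAction`, and the sample IDs are hypothetical; the canonical contract lives in `packages/system-bus/src/observability/otel-event.ts` and may differ in field names and validation.

```typescript
// Hypothetical shape for illustration; the canonical contract is defined in
// packages/system-bus/src/observability/otel-event.ts and may differ.
type Level = "debug" | "info" | "warn" | "error" | "fatal";

interface WideEventSketch {
  level: Level;
  source: string; // facet: subsystem, low cardinality
  component: string; // facet: stable module/service name
  action: string; // stable dotted action
  success: boolean;
  error?: string; // expected when success is false
  duration_ms?: number; // include for timed operations
  metadata: Record<string, unknown>; // high-cardinality context goes here
}

// Assumed convention check: lowercase segments joined by dots,
// e.g. "system.health.checked". Free-form sentences are rejected.
function isDottedAction(action: string): boolean {
  return /^[a-z][a-z0-9_-]*(\.[a-z][a-z0-9_-]*)+$/.test(action);
}

const healthChecked: WideEventSketch = {
  level: "info",
  source: "worker",
  component: "check-system-health",
  action: "system.health.checked",
  success: true,
  duration_ms: 42,
  // Made-up identifiers: request and function IDs stay out of facet fields.
  metadata: { functionId: "fn_123", requestId: "req_abc" },
};

console.log(isDottedAction(healthChecked.action));
```

Keeping IDs in `metadata` keeps the facet fields cheap to group and filter on, which is the point of rule 6.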

## Implementation Workflow

1. Identify the boundary being changed.
   - Inngest function, gateway channel, webhook route, API route, background job, sync step.
2. Add success and failure envelopes.
   - Start + completion for long tasks, or a single completion event for short tasks.
3. Include operational and business context in `metadata`.
   - Example: function id, event id, provider, queue depth, affected resource id.
4. Keep severity useful.
   - `debug/info` for normal activity, `warn` for degraded but recoverable, `error/fatal` for failures.
5. Run verification gates before finishing.

For full checklists and command recipes, read `references/implementation-checklist.md`.
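
Step 2's success and failure envelopes can be sketched as a wrapper that emits exactly one completion event whether the operation succeeds or throws. `withCompletionEvent`, `CompletionEvent`, and the injected `emit` callback are hypothetical, synchronous stand-ins that only show the shape; real code should emit through `emitOtelEvent` or `emitMeasuredOtelEvent`.

```typescript
// Hypothetical sketch; not the joelclaw emitter.
interface CompletionEvent {
  level: "info" | "error";
  source: string;
  component: string;
  action: string;
  success: boolean;
  error?: string;
  duration_ms: number;
  metadata: Record<string, unknown>;
}

type BaseFields = Pick<CompletionEvent, "source" | "component" | "action" | "metadata">;

function withCompletionEvent<T>(
  base: BaseFields,
  emit: (event: CompletionEvent) => void,
  fn: () => T
): T {
  const startedAt = Date.now();
  let result: T;
  try {
    result = fn();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Failure envelope: success is false and error carries the reason.
    emit({
      ...base,
      level: "error",
      success: false,
      error: message,
      duration_ms: Date.now() - startedAt,
    });
    throw err; // rethrow so the caller still sees the failure
  }
  // Success envelope.
  emit({ ...base, level: "info", success: true, duration_ms: Date.now() - startedAt });
  return result;
}

// Usage with made-up values.
const seen: CompletionEvent[] = [];
const record = (e: CompletionEvent): void => {
  seen.push(e);
};
const base: BaseFields = {
  source: "webhook",
  component: "example-provider",
  action: "webhook.received",
  metadata: { provider: "example" },
};

withCompletionEvent(base, record, () => "ok");
try {
  withCompletionEvent(base, record, () => {
    throw new Error("signature_invalid");
  });
} catch (e) {
  // Expected: the failure was recorded and then rethrown.
}
console.log(seen.map((e) => `${e.action} success=${e.success}`));
```

Rethrowing after the emit keeps the event a record of the failure rather than a replacement for it.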

## Quick Patterns

### Worker / Inngest timed operation

```typescript
import { emitMeasuredOtelEvent } from "../../observability/emit";

// Assumes an enclosing Inngest handler that provides `event`.
// `runSync` stands in for the operation being timed.
await emitMeasuredOtelEvent(
  {
    level: "info",
    source: "worker",
    component: "content-sync",
    action: "content_sync.run",
    metadata: { trigger: event.name },
  },
  async () => {
    await runSync();
  }
);
```

### Gateway emission

```typescript
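import { emitGatewayOtel } from "../observability";

// Assumes `sessionId` and `queueDepth` are in scope in the channel code.
// High-cardinality values like these belong in `metadata`, not facet fields.
await emitGatewayOtel({
  level: "error",
  component: "redis-channel",
  action: "events.immediate_telegram",
  success: false,
  error: "telegram_send_failed",
  metadata: { sessionId, queueDepth },
});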
```

## Definition of Done

- Structured OTEL events added for the changed path.
- No direct feature-level writes to Typesense/Convex for observability data.
- Smoke probe passes (`scripts/otel-smoke.sh`).
- `joelclaw otel list` and `joelclaw otel stats` show expected behavior.
- New failure modes are queryable by `source`, `component`, and `action`.

## Inngest Replay + Hang Triage

Use this when step code appears to execute but the run stays `RUNNING` or ends `CANCELLED` with `Finalization` errors.

1. Inspect run trace first.

```bash
joelclaw run <run-id>
```

Look for `errors.Finalization.stack` containing `Unable to reach SDK URL`.

2. Confirm whether this is true network reachability or worker-side blocking.

```bash
joelclaw inngest status
joelclaw logs worker --lines 200
joelclaw logs errors --lines 200
```

3. Check for replay-noise in OTEL.

If an action that should emit once (for example `manifest.archive.prereqs-passed`) appears hundreds of times in one run window, move that emit into its own `step.run`.

```bash
joelclaw otel search "manifest.archive.prereqs-passed" --hours 1
```

4. Treat `Unable to reach SDK URL` as an ambiguous symptom.

It can indicate ingress problems, but in practice it can also happen when a function handler blocks on local IO/dependencies long enough that finalization cannot complete.
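
The replay-noise check in step 3 amounts to counting how often a supposedly once-only action appears within a single run. A minimal sketch, assuming the events have already been exported into an array with hypothetical `action` and `runId` fields (the actual output shape of `joelclaw otel search` is not specified here):

```typescript
// Hypothetical record shape; the real search output may differ.
interface EmittedRecord {
  action: string;
  runId: string;
}

interface NoisyAction {
  key: string; // "<action>@<runId>"
  count: number;
}

// Returns every action/run pair that was emitted more than `maxPerRun` times.
function findReplayNoise(records: EmittedRecord[], maxPerRun = 1): NoisyAction[] {
  const counts: Record<string, number> = {};
  for (const r of records) {
    const key = `${r.action}@${r.runId}`;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return Object.keys(counts)
    .filter((key) => counts[key] > maxPerRun)
    .map((key) => ({ key, count: counts[key] }));
}

// Made-up sample: one action emitted three times in the same run.
const sample: EmittedRecord[] = [
  { action: "manifest.archive.prereqs-passed", runId: "run-1" },
  { action: "manifest.archive.prereqs-passed", runId: "run-1" },
  { action: "manifest.archive.prereqs-passed", runId: "run-1" },
  { action: "probe.emit", runId: "run-1" },
];

const noisy = findReplayNoise(sample);
console.log(noisy);
```

Any pair this reports is a candidate for moving the emit into its own `step.run`.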

## Helper Script

Use `scripts/otel-smoke.sh` for a fast end-to-end probe:

```bash
./skills/o11y-logging/scripts/otel-smoke.sh verification o11y-skill probe.emit
```

## Key Files

- `packages/system-bus/src/observability/otel-event.ts`
- `packages/system-bus/src/observability/emit.ts`
- `packages/system-bus/src/observability/store.ts`
- `packages/system-bus/src/serve.ts`
- `packages/gateway/src/observability.ts`
- `packages/system-bus/src/inngest/functions/check-system-health.ts`
- `packages/cli/src/commands/otel.ts`
- `apps/web/app/api/otel/route.ts`
