observability

Name: observability
Author: BuilderIO/agent-native

$ npx mdskill add BuilderIO/agent-native/observability

Automatically instrument agents for tracing, evals, and feedback.

Captures run duration, token usage, and tool execution details.
Stores all data internally in SQL with optional external exports.
Configures prompt capture and evaluation sampling via settings.
Displays inline thumbs up/down ratings for user feedback.

SKILL.md

.github/skills/observability

---
name: observability
description: >-
  Agent observability, evals, feedback, and experiments. Use when adding
  observability dashboards, configuring trace capture, setting up evals,
  creating A/B experiments, or collecting user feedback on agent responses.
---

# Agent Observability

## Rule

The observability system auto-instruments every agent run with zero configuration. Traces, automated evals, and feedback collection work out of the box. All data lives in the app's own SQL database — no external services required. Templates can optionally export to Langfuse, Datadog, or any OTel-compatible platform.

## Five Pillars

### 1. Traces

Every `runAgentLoop()` call is automatically instrumented via `instrumentAgentLoop()` in `packages/core/src/observability/traces.ts`. It captures:

- **agent_run** span — top-level parent with total duration and cost
- **llm_call** span — model name, token counts (input, output, cache read/write), cost
- **tool_call** spans — one per action invocation, with duration and success/error

Content (prompts, tool args, tool results) is **redacted by default**. Opt in via the `observability-config` settings key:

```ts
await putSetting("observability-config", {
  enabled: true,
  capturePrompts: false,      // keep prompt text redacted
  captureToolArgs: true,      // capture action input args
  captureToolResults: false,  // keep tool results redacted
  evalSampleRate: 0.05,       // 5% of runs get LLM-as-judge eval
});
```
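
To make the opt-in behavior concrete, here is a minimal sketch of how per-field redaction could follow the capture flags. The `ToolCallContent` shape, the `redactToolCall` helper, and the `"[redacted]"` placeholder are assumptions for illustration, not the framework's actual implementation.

```ts
// Hypothetical subset of the config shape; the real types live in types.ts and may differ.
interface CaptureFlags {
  captureToolArgs: boolean;
  captureToolResults: boolean;
}

// Hypothetical content carried by a tool_call span.
interface ToolCallContent {
  args: unknown;
  result: unknown;
}

const REDACTED = "[redacted]";

// Replace every content field whose capture flag is off.
function redactToolCall(content: ToolCallContent, flags: CaptureFlags): ToolCallContent {
  return {
    args: flags.captureToolArgs ? content.args : REDACTED,
    result: flags.captureToolResults ? content.result : REDACTED,
  };
}

// With the config above, args are kept and results are redacted.
const stored = redactToolCall(
  { args: { query: "weather in Paris" }, result: { tempC: 18 } },
  { captureToolArgs: true, captureToolResults: false },
);
```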

### 2. Feedback

**Explicit** — `ThumbsFeedback` component renders inline thumbs up/down on every agent message in the chat UI. Thumbs down opens a category popover (Inaccurate, Not helpful, Wrong tool, Too slow). Already wired into `AssistantChat.tsx` via `React.lazy`.

**Implicit** — `computeSatisfactionScore(threadId)` computes a Frustration Index (0-100, higher means more frustration) from conversation signals. The weights sum to 100:
- Rephrasing detection (weight 30): consecutive similar user messages
- Abandonment (weight 20): session ends shortly after agent response
- Sentiment (weight 15): negative language patterns
- Length trend (weight 15): declining message lengths
- Retry patterns (weight 20): "try again", "no that's wrong"

Score interpretation: 0-20 healthy, 20-40 friction, 40-60 dissatisfied, 60+ broken.

Satisfaction scoring runs automatically after each feedback POST that includes a `threadId`.
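
The weighted signals above can be combined roughly as follows. This is a sketch under stated assumptions: the signal names, the 0 to 1 scale for each signal, and the choice to treat each band's lower bound as inclusive are not given in the source, and the real `computeSatisfactionScore` may differ.

```ts
// Weights come from the list above; the key names are assumptions.
const WEIGHTS = {
  rephrasing: 30,
  abandonment: 20,
  sentiment: 15,
  lengthTrend: 15,
  retry: 20,
} as const;

type SignalName = keyof typeof WEIGHTS;
type Signals = Record<SignalName, number>; // assumed: each signal is in [0, 1]

// Weighted sum of clamped signals, giving a value in 0..100.
function frustrationIndex(signals: Signals): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as SignalName[]) {
    const clamped = Math.min(1, Math.max(0, signals[key]));
    total += WEIGHTS[key] * clamped;
  }
  return Math.round(total);
}

// Map a score to the bands listed above (lower bound inclusive, by assumption).
function interpret(score: number): "healthy" | "friction" | "dissatisfied" | "broken" {
  if (score < 20) return "healthy";
  if (score < 40) return "friction";
  if (score < 60) return "dissatisfied";
  return "broken";
}

// Strong rephrasing plus some retries: 30 * 1 + 20 * 0.5 = 40.
const score = frustrationIndex({
  rephrasing: 1,
  abandonment: 0,
  sentiment: 0,
  lengthTrend: 0,
  retry: 0.5,
});
```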

### 3. Evals

Three layers. `evalSampleRate` in the observability config controls only the sampled LLM-as-judge layer:

**Automated (every run):** Deterministic scorers run after every traced run:
- `tool_success_rate` — % of tool calls without errors
- `step_efficiency` — 1.0 for no-tool runs; penalizes excessive LLM iterations for tool-using runs
- `latency_score` — normalized against 10s/tool baseline
- `cost_efficiency` — normalized against 50 centicents/tool baseline
- `error_recovery` — 1.0 if the run recovered from tool errors or had none
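
As an illustration of what a deterministic scorer can look like, here is a sketch of two of them. The `ToolSpan` shape and both formulas are assumptions: the source only names the 10s/tool baseline, so the decay used in `latencyScore` is one plausible normalization rather than the one in `evals.ts`.

```ts
// Hypothetical minimal view of a tool_call span.
interface ToolSpan {
  durationMs: number;
  ok: boolean;
}

// Fraction of tool calls that finished without an error.
function toolSuccessRate(spans: ToolSpan[]): number {
  if (spans.length === 0) return 1; // assumption: no tool calls counts as full success
  return spans.filter((s) => s.ok).length / spans.length;
}

// Assumed formula: 1.0 at or under a 10s-per-tool budget, then budget / actual.
function latencyScore(totalMs: number, toolCount: number): number {
  const budgetMs = 10000 * Math.max(1, toolCount);
  return totalMs <= budgetMs ? 1 : budgetMs / totalMs;
}

const spans: ToolSpan[] = [
  { durationMs: 1200, ok: true },
  { durationMs: 800, ok: true },
  { durationMs: 3000, ok: false },
  { durationMs: 500, ok: true },
];
const successRate = toolSuccessRate(spans); // 3 of 4 calls succeeded
```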

**LLM-as-judge (sampled):** Runs on `evalSampleRate` fraction of runs. Calls the configured engine with a judge prompt that scores against custom criteria.
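
Sampling of this kind is usually a single random draw per run. The helper below is a hypothetical sketch, not the framework's function; it accepts an injectable random source so the decision can be tested.

```ts
// Decide whether a run should be sent to the LLM judge.
// `random` defaults to Math.random and can be replaced in tests.
function shouldRunJudge(evalSampleRate: number, random: () => number = Math.random): boolean {
  const rate = Math.min(1, Math.max(0, evalSampleRate)); // clamp to [0, 1]
  return random() < rate;
}

// With evalSampleRate = 0.05, roughly 1 run in 20 is judged.
const judged = shouldRunJudge(0.05);
```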

**Dataset evaluation:** `runDatasetEval(datasetId)` runs a golden dataset through the agent and scores each case.

Custom criteria use natural language rubrics:
```ts
const criteria: EvalCriteria = {
  name: "helpfulness",
  description: "Was the response helpful and complete?",
  rubric: "0.0 = completely unhelpful, 0.5 = partially helpful, 1.0 = fully resolved the user's need",
};
```
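
One way a criterion like this could be turned into a judge call is sketched below. `buildJudgePrompt` and `parseJudgeScore` are hypothetical helpers, and the prompt wording is invented for illustration; the source does not show the actual judge prompt.

```ts
// Local copy of the shape used above; the real EvalCriteria type lives in types.ts.
interface EvalCriteria {
  name: string;
  description: string;
  rubric: string;
}

// Hypothetical helper: render one criterion into a judge prompt.
function buildJudgePrompt(criteria: EvalCriteria, userMessage: string, agentResponse: string): string {
  return [
    `You are grading an agent response on "${criteria.name}".`,
    `Question: ${criteria.description}`,
    `Rubric: ${criteria.rubric}`,
    "",
    `User message:\n${userMessage}`,
    `Agent response:\n${agentResponse}`,
    "",
    "Reply with a single number between 0.0 and 1.0.",
  ].join("\n");
}

// Hypothetical helper: parse the judge's reply and clamp it into [0, 1].
function parseJudgeScore(reply: string): number | null {
  const value = parseFloat(reply.trim());
  if (Number.isNaN(value)) return null; // judge did not return a number
  return Math.min(1, Math.max(0, value));
}
```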

### 4. Experiments

A/B testing with sticky user-level assignment:

```ts
import { createExperiment, startExperiment } from "@agent-native/core/observability";

const exp = await createExperiment({
  name: "sonnet-vs-haiku",
  variants: [
    { id: "control", weight: 50, config: { model: "claude-sonnet-4-6" } },
    { id: "treatment", weight: 50, config: { model: "claude-haiku-4-5-20251001" } },
  ],
  metrics: ["cost", "latency", "satisfaction"],
});
await startExperiment(exp.id);
```

The agent loop reads active experiments via `resolveActiveExperimentConfig()` and applies the variant's `model` override automatically. Assignment uses consistent hashing, so the same user always gets the same variant.
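
Sticky, weight-aware assignment can be sketched as a hash of the experiment and user IDs mapped onto cumulative weight buckets. The FNV-1a hash and the `assignVariant` helper are stand-ins chosen for illustration; the source does not say which hash or bucketing scheme `experiments.ts` uses.

```ts
interface Variant {
  id: string;
  weight: number; // assumed to be a non-negative integer
}

// FNV-1a 32-bit hash; a stand-in for whatever hash the framework uses.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Same (experimentId, userId) pair always lands in the same bucket.
function assignVariant(experimentId: string, userId: string, variants: Variant[]): string {
  const totalWeight = variants.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight <= 0) throw new Error("variants need a positive total weight");
  const bucket = fnv1a(`${experimentId}:${userId}`) % totalWeight;
  let cumulative = 0;
  let chosen = "";
  for (const v of variants) {
    cumulative += v.weight;
    chosen = v.id;
    if (bucket < cumulative) break;
  }
  return chosen;
}

const variants: Variant[] = [
  { id: "control", weight: 50 },
  { id: "treatment", weight: 50 },
];
const assigned = assignVariant("sonnet-vs-haiku", "user-123", variants);
```

Including the experiment ID in the hashed key keeps assignments independent across experiments, so a user in `control` for one experiment is not systematically in `control` for the next.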

Compute results with `POST /_agent-native/observability/experiments/:id/results`.

### 5. Dashboard

`ObservabilityDashboard` is a React component with 5 tabs:
- **Overview** — metric cards (runs, cost, latency, tool success, thumbs up rate, eval score)
- **Conversations** — trace list with drill-down to span detail
- **Evals** — eval stats and criteria breakdown bars
- **Experiments** — experiment list with status badges, drill-down to results
- **Feedback** — feedback stream, thumbs ratio, category badges

Add a dashboard route to any template:
```tsx
// app/routes/observability.tsx
import { ObservabilityDashboard } from "@agent-native/core/client";

export default function ObservabilityPage() {
  return (
    <div className="min-h-screen bg-background p-6">
      <ObservabilityDashboard />
    </div>
  );
}
```

## API Endpoints

All auto-mounted at `/_agent-native/observability/*`:

| Method | Path | Purpose |
|--------|------|---------|
| GET | `/` | Overview stats |
| GET | `/traces` | List trace summaries |
| GET | `/traces/:runId` | Trace detail (summary + spans) |
| GET | `/traces/:runId/evals` | Evals for a run |
| POST | `/feedback` | Submit feedback |
| GET | `/feedback` | List feedback entries |
| GET | `/feedback/stats` | Feedback aggregation |
| GET | `/satisfaction` | Satisfaction scores |
| GET | `/evals/stats` | Eval statistics |
| POST | `/experiments` | Create experiment |
| GET | `/experiments` | List experiments |
| GET | `/experiments/:id` | Experiment detail |
| PUT | `/experiments/:id` | Update experiment status |
| POST | `/experiments/:id/results` | Compute experiment results |
| GET | `/experiments/:id/results` | Get experiment results |

All endpoints support `?since=N` (ms timestamp) and `?limit=N` query params.
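
A small client-side helper for building these URLs might look like the sketch below. `observabilityUrl` is a hypothetical name; only the base path and the `since` and `limit` parameters come from the source, and the response shapes are not documented here.

```ts
const BASE = "/_agent-native/observability";

interface ListQuery {
  since?: number; // ms timestamp
  limit?: number;
}

// Build a path under the auto-mounted prefix with optional query params.
function observabilityUrl(path: string, query: ListQuery = {}): string {
  const params: string[] = [];
  if (query.since !== undefined) params.push(`since=${query.since}`);
  if (query.limit !== undefined) params.push(`limit=${query.limit}`);
  const qs = params.length > 0 ? `?${params.join("&")}` : "";
  return `${BASE}${path}${qs}`;
}

// Traces from the last 24 hours, at most 50 rows.
const dayAgo = Date.now() - 24 * 60 * 60 * 1000;
const tracesUrl = observabilityUrl("/traces", { since: dayAgo, limit: 50 });
// In an app: const traces = await fetch(tracesUrl).then((r) => r.json());
```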

## SQL Tables

Nine tables are created automatically via `ensureObservabilityTables()`:
- `agent_trace_spans` — individual trace spans
- `agent_trace_summaries` — aggregated run summaries
- `agent_feedback` — explicit user feedback
- `agent_satisfaction_scores` — computed frustration index
- `agent_evals` — evaluation results
- `agent_eval_datasets` — golden test datasets
- `agent_experiments` — experiment definitions
- `agent_experiment_assignments` — user → variant assignments
- `agent_experiment_results` — computed metric results

All tables are dialect-agnostic (SQLite + Postgres) and strictly additive.

## Key Files

| File | Purpose |
|------|---------|
| `packages/core/src/observability/types.ts` | Shared type definitions |
| `packages/core/src/observability/store.ts` | SQL tables + CRUD |
| `packages/core/src/observability/traces.ts` | Auto-instrumentation |
| `packages/core/src/observability/feedback.ts` | Feedback + Frustration Index |
| `packages/core/src/observability/evals.ts` | Eval engine (3 layers) |
| `packages/core/src/observability/experiments.ts` | A/B testing system |
| `packages/core/src/observability/routes.ts` | HTTP API handlers |
| `packages/core/src/client/observability/ObservabilityDashboard.tsx` | Admin dashboard |
| `packages/core/src/client/observability/ThumbsFeedback.tsx` | Inline feedback buttons |
| `packages/core/src/client/observability/useObservability.ts` | React Query hooks |

## Export to External Platforms

Configure OTLP export in the observability settings:

```ts
await putSetting("observability-config", {
  enabled: true,
  exporters: [
    {
      type: "otlp",
      endpoint: "https://cloud.langfuse.com/api/public/otel",
      headers: { Authorization: "Bearer ..." },
    },
  ],
});
```

The framework emits `gen_ai.*` semantic convention spans compatible with Langfuse, Datadog, Grafana, New Relic, and any OTel-compatible backend.
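
For orientation, an `llm_call` span mapped onto `gen_ai.*` attributes might look like the sketch below. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions; the `LlmCallSpan` shape and the `toGenAiAttributes` helper are hypothetical, and the source does not list which attributes the framework actually emits.

```ts
// Hypothetical minimal view of an llm_call span.
interface LlmCallSpan {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Map span fields onto OTel GenAI semantic-convention attribute keys.
function toGenAiAttributes(span: LlmCallSpan): Record<string, string | number> {
  return {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": span.model,
    "gen_ai.usage.input_tokens": span.inputTokens,
    "gen_ai.usage.output_tokens": span.outputTokens,
  };
}

const attrs = toGenAiAttributes({
  model: "claude-sonnet-4-6",
  inputTokens: 1200,
  outputTokens: 340,
});
```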
