harness-engineering

$ npx mdskill add guanyang/antigravity-skills/harness-engineering

---
name: harness-engineering
description: "This skill should be used when designing autonomous agent harnesses: research loops, evaluation scaffolds, locked and editable surfaces, durable logs, novelty gates, pruning, rollback, PR preparation, and human approval boundaries."
---

# Harness Engineering

Harness engineering designs the control system around an agent: what it may edit, how it receives feedback, where it writes state, how failures recover, and who can approve irreversible actions. The harness is the difference between a helpful agent session and an autonomous loop that can run for days without corrupting its objective.

## When to Activate

Activate this skill when:

- Building autonomous research or experimentation loops
- Designing an agent environment with locked metrics and editable code or content
- Creating PR-producing or background agents
- Evaluating whether an agent can safely run without frequent human prompts
- Adding novelty, ablation, pruning, rollback, or durable logging to an agent workflow
- Preventing agents from gaming benchmarks, weakening rubrics, or losing state across compaction

Do not activate this skill for adjacent work owned by other skills:
- General quality gates, regression suites, or outcome metrics without autonomous control surfaces: `evaluation`.
- Tool schemas, response formats, and recovery errors for harness tools: `tool-design`.
- Project-level task-model fit, pipeline shape, and cost planning: `project-development`.
- Remote sandbox, warm-pool, and hosted session infrastructure: `hosted-agents`.

## Core Concepts

### Harness Boundary

Separate the agent from the environment it operates inside. The agent proposes actions; the harness defines allowed surfaces, feedback, persistence, and promotion rules.

Use four surface classes:

| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Eval metric, rubric, validation script, merge policy | Agent may read and propose changes, but cannot score itself with modified rules |
| Editable | Skill draft, experiment file, prompt, config under test | Agent may mutate during the loop |
| Append-only | Results log, research thread, rejected ideas | Agent may append, not rewrite |
| Human-controlled | Merge, production deploy, credentials, destructive operations | Requires explicit human approval |
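
The table above can be enforced with a small path-based permission check. The following is a minimal sketch, not a prescribed implementation: the path patterns, the operation names (`read`, `write`, `append`), and the helpers `classify` and `is_allowed` are all hypothetical. Unknown paths default to the most restrictive class.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append_only"
    HUMAN_CONTROLLED = "human_controlled"


# Hypothetical path patterns for illustration; the first match wins.
SURFACE_RULES = [
    ("eval/*", Surface.LOCKED),
    ("rubrics/*", Surface.LOCKED),
    ("logs/*", Surface.APPEND_ONLY),
    ("drafts/*", Surface.EDITABLE),
]

# Operations the agent may perform on each surface without a human.
ALLOWED_OPS = {
    Surface.LOCKED: {"read"},
    Surface.EDITABLE: {"read", "write", "append"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: set(),
}


def classify(path: str) -> Surface:
    """Map a relative path to its surface class."""
    for pattern, surface in SURFACE_RULES:
        if fnmatchcase(path, pattern):
            return surface
    # Anything unclassified is treated as human-controlled.
    return Surface.HUMAN_CONTROLLED


def is_allowed(path: str, op: str) -> bool:
    """Return True if the agent may perform `op` on `path` unattended."""
    return op in ALLOWED_OPS[classify(path)]
```

Defaulting to human-controlled means a forgotten rule blocks the agent instead of silently widening what it can edit.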

### Tight Feedback Loops

Autonomy works when feedback is fast, unambiguous, and hard to game. Karpathy's `autoresearch` is the minimal pattern: one editable file, one locked evaluation file, fixed wall-clock budget, one scalar metric, git rollback, and a durable results log. The lesson is not that every harness needs one metric; it is that ambiguous feedback creates ambiguous autonomy.

For open-ended research-to-skill work, replace the scalar metric with locked rubrics, deterministic structure checks, source traceability, and human review thresholds.

### Durable State

Long-running agents must externalize state. Store plans, source queues, results, failures, and handoffs in files so future agents can resume without relying on chat history. Prime Intellect's autonomous nanoGPT work showed the value of durable scratchpads and `THREAD.md`-style logs for recovery, monitoring, and audit.

Use append-only logs for:

- What was tried
- What improved or failed
- Why a candidate was kept, discarded, or routed to review
- Which upstream sources were checked
- What the next agent should do
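
One way to capture the items above is a JSONL file that is only ever opened in append mode. This is a sketch under assumptions: the field names and the helpers `append_entry` and `read_entries` are hypothetical, and nothing here stops another process from rewriting the file, so real enforcement would still need file permissions or a harness-side write API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_entry(log_path: Path, *, tried: str, outcome: str, decision: str,
                 reason: str, sources_checked: list[str], next_step: str) -> dict:
    """Append one record to the log; existing lines are never rewritten."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tried": tried,                      # what was tried
        "outcome": outcome,                  # what improved or failed
        "decision": decision,                # keep, discard, or review
        "reason": reason,                    # why that decision was made
        "sources_checked": sources_checked,  # upstream sources consulted
        "next_step": next_step,              # handoff for the next agent
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:  # "a" = append only
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_entries(log_path: Path) -> list[dict]:
    """Load every record so a fresh agent can resume without chat history."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```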

### Search Discipline

Agents tend to exploit the nearest surface, pile on complexity, and prune too little. Add explicit search rules:

1. Refresh upstream sources on a schedule.
2. Require novelty checks before spending large budgets.
3. Preserve rejected attempts to avoid rediscovery.
4. Run leave-one-out pruning when a stack has multiple additions.
5. Reward simplification when quality is equal.
6. Use separate verification before promotion.
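
Rules 4 and 5 can be combined into a single pruning pass. The sketch below assumes a higher-is-better score and a caller-supplied `evaluate` function over a set of component names. Both are illustrative assumptions, as is the `tolerance` parameter.

```python
from typing import Callable, Sequence


def leave_one_out_prune(
    components: Sequence[str],
    evaluate: Callable[[frozenset], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Drop every component whose removal does not hurt the score.

    A component is removed when the stack without it scores at least
    `baseline - tolerance`, which rewards simplification at equal quality.
    """
    kept = list(components)
    baseline = evaluate(frozenset(kept))
    for name in list(kept):
        trial = [c for c in kept if c != name]
        score = evaluate(frozenset(trial))
        if score >= baseline - tolerance:
            kept = trial        # the simpler stack is at least as good
            baseline = score
    return kept
```

Each pruning pass costs one evaluation per component, so it fits best when the evaluator is cheap or the stack is small.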

### Mechanism Registry

For research-to-skill systems, track accepted mechanisms separately from prose. A mechanism record should include a stable `mechanism_id`, `owning_skill`, `status`, activation scenario, behavior change, evidence, and failure modes. Novelty gates should compare against this registry before using broader corpus overlap, because keyword overlap catches stale phrasing while mechanism comparison catches real duplication.
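
A mechanism record and a registry-first novelty check might look like the sketch below. The field names follow the list above; everything else is an assumption. In particular, the token-overlap similarity is a crude stand-in applied to structured fields, and a production gate would likely need a stronger comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mechanism:
    mechanism_id: str
    owning_skill: str
    status: str                      # hypothetical values: "accepted", "rejected"
    activation_scenario: str
    behavior_change: str
    evidence: tuple[str, ...] = ()
    failure_modes: tuple[str, ...] = ()


def _tokens(mechanism: Mechanism) -> set[str]:
    """Tokens from the fields that describe what the mechanism does."""
    text = f"{mechanism.activation_scenario} {mechanism.behavior_change}"
    return set(text.lower().split())


def similarity(a: Mechanism, b: Mechanism) -> float:
    """Jaccard overlap of descriptive tokens; 0.0 when both are empty."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_duplicates(candidate: Mechanism, registry: list[Mechanism],
                    threshold: float = 0.6) -> list[str]:
    """Return IDs of registered mechanisms that the candidate overlaps."""
    return [m.mechanism_id for m in registry
            if similarity(candidate, m) >= threshold]
```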

### Governance

Autonomous agents may prepare PRs, but governance must be explicit. They can draft changes, run checks, and write PR summaries. They should not merge, deploy, or push without human approval unless the user has explicitly granted that permission for the specific action.
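
The approval boundary can be made explicit in code. The following sketch assumes a harness that routes every side-effecting action through one `authorize` call. The action names, the `Governance` class, and the `ApprovalRequired` exception are hypothetical. Grants are scoped to a specific action and target, mirroring the "for the specific action" rule above.

```python
from dataclasses import dataclass, field

# Actions the agent may take unattended (hypothetical names).
AUTONOMOUS_ACTIONS = frozenset({"draft_change", "run_checks", "write_pr_summary"})


class ApprovalRequired(PermissionError):
    """Raised when an action needs a human grant that is missing."""


@dataclass
class Governance:
    # (action, target) pairs a human has explicitly approved.
    grants: set = field(default_factory=set)

    def grant(self, action: str, target: str) -> None:
        """Record a human approval for one action on one target."""
        self.grants.add((action, target))

    def authorize(self, action: str, target: str) -> None:
        """Return silently if allowed; raise ApprovalRequired otherwise."""
        if action in AUTONOMOUS_ACTIONS:
            return
        if (action, target) in self.grants:
            return
        # Merge, deploy, push, and any unknown action end up here.
        raise ApprovalRequired(f"{action} on {target} needs human approval")
```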

## Detailed Topics

### Autoresearch-Style Loop

Use this pattern when optimizing an artifact against a stable evaluator:

```text
read locked context -> choose hypothesis -> edit allowed surface -> commit/checkpoint
-> run evaluator -> log result -> keep if better -> discard or rollback if worse
-> repeat
```

Required properties:

- The evaluator is outside the editable surface.
- The feedback cadence is fixed enough to compare attempts.
- Failed attempts leave an audit trail.
- Rollback is cheap.
- The agent has a policy for crashes and timeouts.
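
The loop and its required properties can be sketched as a pure function, which keeps rollback trivially cheap: a rejected candidate is simply never adopted. This is an illustration under assumptions, not the `autoresearch` implementation. The names `run_loop`, `mutate`, `evaluate`, and `Attempt` are hypothetical, scores are treated as higher-is-better, and the timeout is detected after the fact without interrupting a slow evaluator.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


@dataclass
class Attempt:
    hypothesis: Any
    score: Optional[float]
    status: str  # "keep", "discard", "crash", or "timeout"


def run_loop(
    artifact: Any,
    hypotheses: Iterable[Any],
    mutate: Callable[[Any, Any], Any],   # edits the allowed surface
    evaluate: Callable[[Any], float],    # locked: the loop never edits it
    budget_s: float = 300.0,             # hypothetical wall-clock budget
) -> tuple[Any, float, list[Attempt]]:
    best_artifact, best_score = artifact, evaluate(artifact)
    log: list[Attempt] = []
    for hypothesis in hypotheses:
        candidate = mutate(best_artifact, hypothesis)
        started = time.monotonic()
        try:
            score = evaluate(candidate)
        except Exception:
            # Crash policy: record the attempt and move on.
            log.append(Attempt(hypothesis, None, "crash"))
            continue
        if time.monotonic() - started > budget_s:
            status = "timeout"           # over budget: not comparable
        elif score > best_score:
            status = "keep"
            best_artifact, best_score = candidate, score
        else:
            status = "discard"           # rollback = candidate is dropped
        log.append(Attempt(hypothesis, score, status))
    return best_artifact, best_score, log
```

Every attempt, including crashes and discards, lands in the returned log, so failures leave an audit trail.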

### Research-To-Skill Loop

Use this pattern when sources become skill changes:

```text
discover -> retrieve -> gate -> score -> extract mechanism
-> map to existing or new skill -> draft proposal -> validate structure
-> prepare PR -> human review
```

The locked evaluator is a combination of source rubrics, skill-change rubrics, structure checks, and reviewer approval. The editable artifact is the proposed skill delta.
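
As a rough sketch, the stages above can run as an ordered pipeline that stops at the first rejection and never proceeds past PR preparation on its own. The stage names mirror the diagram. The `run_pipeline` function, the convention that a stage returns `None` to reject, and the status strings are assumptions for illustration.

```python
from typing import Callable, Optional

# A stage takes the working state and returns the updated state,
# or None to reject the item at that stage.
Stage = Callable[[dict], Optional[dict]]

STAGE_ORDER = [
    "discover", "retrieve", "gate", "score", "extract_mechanism",
    "map_to_skill", "draft_proposal", "validate_structure", "prepare_pr",
]


def run_pipeline(item: dict, stages: dict[str, Stage]) -> dict:
    """Run stages in order; stop at the first rejection."""
    state = dict(item)
    for name in STAGE_ORDER:
        result = stages[name](state)
        if result is None:
            return {"status": "rejected", "stopped_at": name, "state": state}
        state = result
    # The pipeline ends at a prepared PR; merging is a human decision.
    return {"status": "awaiting_human_review", "stopped_at": None, "state": state}
```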

### Metric Gaming Resistance

Assume an optimizing agent will learn the harness. Guard against:

- Editing evaluation code or rubrics and then using the new version for self-approval
- Adding verbose content that pleases a judge but harms skill activation
- Citing unretrieved sources
- Optimizing aggregate scores while failing a critical dimension
- Omitting failed results from the log

Mitigation: lock rubrics per run, report per-dimension scores, require source retrieval evidence, preserve rejected attempts, and route governance changes to human review.
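
Two of these mitigations are easy to mechanize: fingerprinting the rubric at the start of a run, and checking per-dimension floors in addition to the aggregate. The sketch below assumes scores in the range 0 to 1 and a simple mean as the aggregate. The function names and thresholds are hypothetical.

```python
import hashlib


def fingerprint(rubric_text: str) -> str:
    """Hash the rubric at run start so later edits are detectable."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def rubric_unchanged(locked_fingerprint: str, current_text: str) -> bool:
    """True if the rubric still matches the version locked for this run."""
    return fingerprint(current_text) == locked_fingerprint


def passes(scores: dict[str, float], floors: dict[str, float],
           aggregate_min: float) -> tuple[bool, list[str]]:
    """Require the aggregate AND every critical dimension to clear its bar.

    Returns (ok, failing_dimensions) so reports show which dimension failed.
    """
    if not scores:
        return False, sorted(floors)
    failing = sorted(dim for dim, floor in floors.items()
                     if scores.get(dim, 0.0) < floor)
    aggregate = sum(scores.values()) / len(scores)
    return aggregate >= aggregate_min and not failing, failing
```

A candidate with a strong average but a weak critical dimension is rejected, and the failing dimension is named instead of hidden inside the aggregate.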

### Monitoring Agents

Use monitoring agents for long runs, but restrict them to read-only reporting unless explicitly tasked otherwise. Monitoring output should report:

- Best current candidate
- Active jobs or drafts
- Last upstream refresh
- Failed or stale loops
- Disagreements between logs and claimed state
- Next action and blocker
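
A read-only monitor can derive part of this report directly from the results log and flag disagreements with what an agent claims. The sketch below covers only the best candidate, failed loops, and disagreements. The row shape (`candidate`, `score`, `status`) and the `build_report` function are hypothetical.

```python
from typing import Optional


def build_report(rows: list[dict], claimed_best: Optional[str],
                 log_path: str) -> dict:
    """Summarize logged state and compare it with the claimed state.

    The monitor only reads `rows`; it never edits the log or the artifact.
    """
    kept = [r for r in rows if r["status"] == "keep"]
    best = max(kept, key=lambda r: r["score"], default=None)
    logged_best = best["candidate"] if best else None

    disagreements = []
    if claimed_best != logged_best:
        disagreements.append(
            f"claimed best {claimed_best!r} but {log_path} shows {logged_best!r}"
        )
    return {
        "best_candidate": logged_best,
        "failed": [r["candidate"] for r in rows
                   if r["status"] in {"crash", "timeout"}],
        "disagreements": disagreements,
        "evidence": log_path,  # the file behind every claim in this report
    }
```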

## Practical Guidance

### Harness Design Checklist

1. Define the objective in one sentence.
2. Identify locked, editable, append-only, and human-controlled surfaces.
3. Choose the feedback mechanism: scalar metric, rubric, deterministic tests, human review, or combination.
4. Define keep, discard, crash, timeout, and review states.
5. Create a durable thread log before the loop starts.
6. Add source refresh, mechanism-registry novelty, and pruning rules for long-running loops.
7. Define what the agent may do without asking and what requires approval.
8. Validate the harness on one known good and one known bad artifact.
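
Step 8 amounts to a smoke test of the gate itself before any loop runs. A minimal sketch, assuming the gate is a function from artifact to pass/fail. The `validate_harness` name and message strings are hypothetical.

```python
from typing import Any, Callable


def validate_harness(gate: Callable[[Any], bool],
                     known_good: Any, known_bad: Any) -> list[str]:
    """Return a list of problems; an empty list means the gate discriminates."""
    problems = []
    if not gate(known_good):
        problems.append("gate rejected the known-good artifact")
    if gate(known_bad):
        problems.append("gate accepted the known-bad artifact")
    return problems
```

For example, a structure check that requires a `## When to Activate` heading should pass a complete skill file and fail an empty one. If it does not, the loop should not start.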

### File Layout

```text
research-run/
  THREAD.md
  sources/
    queue.md
    evaluations/
  proposals/
  logs/
    results.tsv
    rejected.md
  drafts/
```

Use TSV or JSONL for append-only machine-readable logs. Use Markdown for handoffs and reviewer-facing summaries.
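
The layout above can be created idempotently before the loop starts. In this sketch the directory and file names come from the tree, while the TSV columns and the `scaffold` and `append_result` helpers are assumptions.

```python
import csv
from pathlib import Path

LAYOUT_DIRS = ["sources/evaluations", "proposals", "logs", "drafts"]
LAYOUT_FILES = ["THREAD.md", "sources/queue.md",
                "logs/results.tsv", "logs/rejected.md"]
# Hypothetical column set; pick whatever the evaluator actually emits.
RESULT_COLUMNS = ["timestamp", "candidate", "score", "status"]


def scaffold(root: Path) -> None:
    """Create the run directory; safe to call more than once."""
    for rel in LAYOUT_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for rel in LAYOUT_FILES:
        (root / rel).touch(exist_ok=True)
    results = root / "logs" / "results.tsv"
    if results.stat().st_size == 0:  # write the header only once
        with results.open("w", newline="", encoding="utf-8") as fh:
            csv.writer(fh, delimiter="\t", lineterminator="\n").writerow(RESULT_COLUMNS)


def append_result(root: Path, row: dict) -> None:
    """Append one result row; earlier rows are left untouched."""
    with (root / "logs" / "results.tsv").open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=RESULT_COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writerow(row)
```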

## Examples

**Example 1: Locked metric**

An agent optimizes `train.py`, but `prepare.py` owns data loading and evaluation. The agent can edit the model but cannot change the metric. Failed experiments are logged and rolled back.

**Example 2: Locked rubric**

An agent evaluates a new Anthropic or OpenAI engineering post, but the source curation rubric is locked for the run. If the source passes, the agent drafts a skill proposal. It cannot lower the rubric threshold to admit the source.

**Example 3: Auto-PR without auto-merge**

An agent prepares a branch and PR body after passing source, skill, and structure checks. The PR states unresolved risks and waits for human merge approval.

## Guidelines

1. Lock evaluators before starting the loop.
2. Keep editable surfaces narrow enough for reliable diffs.
3. Write durable logs before context compaction can erase state.
4. Report per-dimension scores instead of only aggregate scores.
5. Require source retrieval before citation.
6. Add novelty gates for broad search and pruning gates for complex stacks.
7. Prefer simplification when quality is equal.
8. Separate PR preparation from merge authority.
9. Revalidate harness changes against both the old and the new evaluators.
10. Treat stopped autonomous loops as harness failures, not agent personality quirks.

## Gotchas

1. **Mutable evaluator**: If the agent can edit the metric, it may optimize the benchmark instead of the task. Keep rubrics and eval code locked during the run.
2. **Chat-only memory**: Long runs fail after compaction when plans live only in conversation history. Write thread logs and result files from the start.
3. **No discard record**: Without rejected-attempt logs, agents repeat failed ideas. Preserve failures with enough detail to avoid rediscovery.
4. **Complexity accretion**: Agents stack changes and rarely remove them. Require pruning rounds and reward equal-quality simplification.
5. **Premature novelty claims**: Agents label recombinations as novel. Compare against existing repo skills, source queue, and rejected logs before claiming novelty.
6. **Monitor misreporting**: Monitoring agents can summarize stale or inconsistent state. Require them to cite the files or logs behind claims.
7. **Human approval ambiguity**: "Prepare a PR" is not "merge a PR." Make approval boundaries explicit in the harness.
8. **Volatile source drift**: Fast-moving lab claims age quickly. Put dated evidence in references and schedule revalidation.

## Integration

This skill connects to:

- `evaluation`: Rubrics and quality gates provide the locked feedback surface
- `advanced-evaluation`: Pairwise comparison and bias mitigation improve proposal review
- `filesystem-context`: Durable logs, scratchpads, and thread files preserve state
- `multi-agent-patterns`: Researcher, verifier, monitor, and writer agents need isolated contexts
- `tool-design`: Harness tools must expose clear contracts and recovery errors
- `project-development`: File-based pipelines and task-model fit analysis keep loops simple
- `hosted-agents`: Background execution needs sandbox, snapshot, and approval boundaries

## References

Internal references:
- `researcher/README.md` - Read when implementing the repo-native research-to-skill operating system
- `researcher/rubrics/harness-change.md` - Read when evaluating changes to an agent harness
- `researcher/runbooks/autonomous-research-loop.md` - Read when running a source-to-skill loop

External resources:
- Karpathy `autoresearch` - Constrained autonomous experiment loop with locked evaluation
- Prime Intellect autonomous nanoGPT speedrun - Durable scratchpads, handoffs, monitoring, and autonomy failure modes
- AlphaEvolve and FunSearch - LLM-generated candidates paired with systematic evaluators
- HELM and LM Evaluation Harness - Transparent, reproducible evaluation infrastructure

---

## Skill Metadata

**Created**: 2026-05-14
**Last Updated**: 2026-05-15
**Author**: Agent Skills for Context Engineering Contributors
**Version**: 1.1.0
