autolab-managed-experiment

$npx mdskill add huggingface/context-course/autolab-managed-experiment

Execute single Autolab benchmark experiments safely on Hugging Face.

  • Enables safe hypothesis testing against local promoted master.
  • Depends on terminal toolsets and Hugging Face Jobs.
  • Validates preflight checks before launching managed runs.
  • Delivers parsed metrics and local patch submission.
SKILL.md
.github/skills/autolab-managed-experimentView on GitHub ↗
---
name: autolab-managed-experiment
description: "Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master."
version: 1.0.0
metadata:
  hermes:
    category: autolab
    requires_toolsets: [terminal]
---

Use this for any single Autolab experiment that should result in exactly one
managed benchmark run.

## Workflow

1. Load the local operator env and refresh from local master:
   - `. ~/.autolab/credentials`
   - `uv run scripts/refresh_master.py --fetch-dag`
2. Edit only `train.py` for the single intended hypothesis.
3. Run preflight before launch:
   - `uv run scripts/hf_job.py preflight`
4. Launch exactly one managed experiment:
   - `uv run scripts/hf_job.py launch --mode experiment`
5. Stream logs and parse the final metric:
   - `uv run scripts/hf_job.py logs <JOB_ID> --follow --output /tmp/autolab-run.log`
   - `uv run scripts/parse_metric.py /tmp/autolab-run.log`
6. Record the run locally:
   - `uv run scripts/submit_patch.py --comment "..."`
7. Promotion is local and only happens if `val_bpb` beats current master.

## Guardrails

- Treat `train_orig.py` as the refreshed local-master base. If preflight reports
  multiple known hypothesis categories, stop and inspect the diff before
  launching.
- Ignore repo git `main` and `origin/main` when judging freshness. In this rig
  repo those refs describe control-plane history, not the benchmark master. The
  comparable base is whatever `refresh_master.py` just wrote into
  `train_orig.py`, `research/live/master.json`, and `research/results.tsv`.
- Never run `uv run scripts/hf_job.py launch --mode prepare` from an
  experiment-scoped worktree. `prepare` is shared bootstrap work, not
  per-experiment work.
- Do not launch a second experiment job for the same experiment unless you have a
  specific reason and intentionally override the duplicate check.
- If the workspace looks stale against the current local master, stop and rewrite
  the experiment rather than rationalizing the mismatch.

## Fast Checks

- `uv run scripts/hf_job.py preflight --json`
  Use this when you need to inspect the diff preview, active conflicts, or
  detected change categories programmatically.
- `uv run scripts/trackio_reporter.py summary --max-jobs 25`
  Use this to confirm the experiment id or hypothesis is not already active.
More from huggingface/context-course