prepare-submission-workflow
$ npx mdskill add UKGovernmentBEIS/inspect_evals/prepare-submission-workflow

Verify repo requirements and generate evaluation registration YAML.
- Validates upstream repositories for pyproject.toml and task decorators.
- Fetches repository files via WebFetch to confirm dependency declarations.
- Determines readiness by checking for missing project tables or decorators.
- Outputs a YAML file at register/<eval_name>/eval.yaml for submission.
SKILL.md
.github/skills/prepare-submission-workflow
---
name: prepare-submission-workflow
description: Prepare an evaluation for PR submission as an entry to the register. Use when user asks to prepare an eval for submission or finalize a PR. Trigger when the user asks you to run the "Prepare Evaluation For Submission" workflow.
---

# Prepare Eval For Submission

Since May 2026, new evaluations are submitted as **entries to the register**: the evaluation code lives in your own upstream repository, and you add a pointer to it here. Code is no longer added directly to `src/inspect_evals/`. If the user appears to be submitting evaluation code into the repo, direct them to [`register/README.md`](../../../register/README.md) for the full process.

## Workflow Steps

To prepare an evaluation for submission as a pull request:

### 1. Verify upstream repo requirements

The upstream repo must:

- Have a `pyproject.toml` with a `[project]` table so it can be installed via `uv sync`
- Declare `inspect_ai` as a dependency
- Define each task with the `@task` decorator from `inspect_ai`

Ask the user whether their upstream repo meets these requirements. Offer to check for them: if they provide the GitHub repository URL, fetch the repo's `pyproject.toml` and task files (e.g. via `WebFetch` on the raw GitHub URLs) to verify the requirements are met.

If any requirement is not met, tell the user what needs to be fixed upstream before they can register.

### 2. Gather information and create `register/<eval_name>/eval.yaml`

Skip this step if `register/<eval_name>/eval.yaml` already exists.

Use [`register/example_eval.yaml`](../../../register/example_eval.yaml) as the template; it documents every field. Don't ask the user field-by-field; instead, derive what you can from the upstream repo first, then ask one batched question for what's missing.

**Hints on what to derive from the upstream repo (don't ask):**

- `source.repository_url`: from step 1.
- `source.repository_commit`: fetch the latest commit SHA on the default branch (must be a 40-char SHA, not a tag or branch).
- `tasks[].name` and `tasks[].task_path`: locate every `@task`-decorated function in the repo and record the function name and file path.
- `title`: from the upstream README heading or `pyproject.toml` `[project].name`.
- `description`: draft from the upstream README; keep to one short paragraph since the generated README links back upstream.
- `source.maintainers`: defaults to the repo owner; only override if the repo is org-owned and the real maintainers are individuals.
- `tags`: propose based on the eval's domain (e.g. `Coding`, `games`, `tools`). The upstream repo name is added automatically, so don't include it.

Use [`register/example_eval.yaml`](../../../register/example_eval.yaml) to determine what additional questions are needed.

Show the user the drafted YAML for confirmation before writing the file. Do **not** set `id`; it is auto-injected from the directory name.

### 3. Run validation

```bash
make check
```

This validates the `eval.yaml` and auto-generates a `README.md` next to it. The README is fully generated from `eval.yaml`; do not edit it by hand. Because the generated page defers to upstream for details, make sure the upstream repo's README covers the dataset, scorer, task parameters, and how the eval was validated.

### 4. Create a changelog fragment

```bash
uv run scriv create
```

### 5. Open a PR

Use the [PR template](../../.github/PULL_REQUEST_TEMPLATE.md). The reviewer will ping anyone listed under `source.maintainers` for acknowledgement before merging.
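Before running `make check`, the field rules from step 2 can be sanity-checked on the drafted entry. The sketch below is illustrative only and is not the register's validator: `precheck_entry` is a hypothetical helper, it takes the entry as an already-parsed dictionary, and it covers only the fields named in the hints above. The authoritative schema is whatever `register/example_eval.yaml` and `make check` enforce.

```python
import re

# A full Git commit SHA-1 is 40 hexadecimal characters.
FULL_SHA = re.compile(r"[0-9a-fA-F]{40}")


def precheck_entry(entry: dict) -> list[str]:
    """Return problems in a drafted eval.yaml entry (hypothetical pre-check)."""
    problems = []

    # `id` is auto-injected from the directory name, so it must not be set.
    if "id" in entry:
        problems.append("remove `id`; it is auto-injected from the directory name")

    source = entry.get("source") or {}
    if not source.get("repository_url"):
        problems.append("source.repository_url is missing")

    # Tags and branch names are rejected: only a 40-char SHA passes.
    commit = str(source.get("repository_commit", ""))
    if not FULL_SHA.fullmatch(commit):
        problems.append("source.repository_commit must be a 40-char SHA")

    tasks = entry.get("tasks") or []
    if not tasks:
        problems.append("no tasks listed")
    for index, task in enumerate(tasks):
        for key in ("name", "task_path"):
            if not task.get(key):
                problems.append(f"tasks[{index}].{key} is missing")

    return problems
```

An empty list means the draft passes these basic checks; anything else is worth fixing before showing the YAML to the user for confirmation.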