eval-quality-workflow

Name: eval-quality-workflow
Author: UKGovernmentBEIS/inspect_evals

$ npx mdskill add UKGovernmentBEIS/inspect_evals/eval-quality-workflow

Refactor or audit single evaluations against checklist standards.

Corrects evaluation code to match EVALUATION_CHECKLIST.md requirements.
Operates within src/inspect_evals/ directory for targeted fixes.
Selects fix or review mode based on user command intent.
Outputs compliance status or refactored evaluation code directly.

SKILL.md

.github/skills/eval-quality-workflow

---
name: eval-quality-workflow
description: Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
---

# Evaluation Quality — Fix or Review

This skill covers two closely related workflows for a single evaluation in `src/inspect_evals/`:

- **Fix An Evaluation**: Refactor the evaluation to comply with EVALUATION_CHECKLIST.md
- **Review An Evaluation**: Assess compliance without making changes

## Identifying the Evaluation

If the user has given you a name, that takes priority. If you were just building an evaluation, or the user has uncommitted code for one specific evaluation, you can assume that's the correct one. If you are not confident which evaluation to look at, ask the user.
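The precedence described above can be sketched as a small function. This is only an illustration: `resolve_eval_name` and its three parameters are hypothetical names, not part of this skill or of `inspect_evals`, and the relative order of the "just built" and "uncommitted code" checks is an assumption, since the text does not rank them.

```python
from typing import Optional, Sequence


def resolve_eval_name(
    user_supplied: Optional[str],
    just_built: Optional[str],
    uncommitted_evals: Sequence[str],
) -> Optional[str]:
    """Return the evaluation to work on, or None if the user must be asked.

    All inputs are hypothetical: how they are gathered (conversation
    context, version-control status, ...) is left to the agent.
    """
    # A name given by the user always takes priority.
    if user_supplied:
        return user_supplied
    # An evaluation that was just being built is assumed to be the target.
    if just_built:
        return just_built
    # Uncommitted code touching exactly one evaluation is also a safe guess.
    unique = sorted(set(uncommitted_evals))
    if len(unique) == 1:
        return unique[0]
    # Not confident which evaluation is meant: the caller should ask the user.
    return None
```

A `None` result corresponds to the "ask the user" case.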

## Fix An Evaluation

Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to refactor the evaluation to meet these standards.

Do not attempt the evaluation report. Tell the user in SUMMARY.md that producing evaluation reports automatically requires a separate workflow (the `/eval-report-workflow` skill). You should still perform the initial tests to ensure the evaluation runs at all. Test with the model used in the example commands in the evaluation's README. For example: `uv run inspect eval inspect_evals/<eval_name> --model openai/gpt-5-nano --limit 2`.
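One way to pick the test model is to read the first `--model` value out of the README's example commands and reuse it for the short test run. The sketch below assumes the README contains a command in the same shape as the example above; `model_from_readme`, `smoke_test_command`, and the `my_eval` placeholder are hypothetical names used only for illustration.

```python
import re
from typing import List, Optional


def model_from_readme(readme_text: str) -> Optional[str]:
    """Return the first `--model` value found in the README text, if any."""
    # Stop at whitespace or a backtick so inline-code commands parse cleanly.
    match = re.search(r"--model[ =]([^\s`]+)", readme_text)
    return match.group(1) if match else None


def smoke_test_command(eval_name: str, model: str, limit: int = 2) -> List[str]:
    """Build the argument list for a short test run of one evaluation."""
    return [
        "uv", "run", "inspect", "eval", f"inspect_evals/{eval_name}",
        "--model", model, "--limit", str(limit),
    ]


# Illustrative README line; "my_eval" is a placeholder, not a real evaluation.
readme = "`uv run inspect eval inspect_evals/my_eval --model openai/gpt-5-nano`"
model = model_from_readme(readme)
if model is not None:
    print(" ".join(smoke_test_command("my_eval", model)))
```

The returned list can be passed to `subprocess.run` as-is. If no `--model` value is found, it is safer to ask the user than to guess a model.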

Your task is over when you have examined every single checklist item except for the evaluation report and completed each one that you are capable of. Do a final pass over NOTES.md and UNCERTAINTIES.md, then create a SUMMARY.md file and summarise what you have done. Then inform the user that your task is finished.

## Review An Evaluation

Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to review the evaluation against the agent-checkable standards without making changes.

1. Set up the working directory:
   a. If the user gives specific instructions about any step, those instructions override the ones given here.
   b. If there is no evaluation name, ask the user for one.
   c. The evaluation name should be the eval folder name plus its version (from the `@task` function's version argument). For instance, GPQA version 1.1.2 becomes `gpqa_1_1_2`. If a folder with this exact name already exists, append a numbered suffix, for example `gpqa_1_1_2_analysis2`. This name will be referred to as `<eval_name>`.
   d. Create a folder called `agent_artefacts/<eval_name>/review` if it isn't present.
   e. Whenever you create a .md file as part of this workflow, assume it is made in `agent_artefacts/<eval_name>/review`.
   f. Copy EVALUATION_CHECKLIST.md to the folder.
   g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
2. Read our EVALUATION_CHECKLIST.md and linked documents.
3. For each item in the [agent runnable checks](EVALUATION_CHECKLIST.md#agent-runnable-checks), go over all possibly relevant files in the evaluation. Work through the items from top to bottom, using linked documents for context where necessary. It is important to go over every single item in the checklist! Note any issues you find in NOTES.md in the following format:
   a. **Standard**: Describe the standard that has not been met.
   b. **Issue**: Describe the issue.
   c. **Location**: If possible, write the file and lines where the issue occurs. If the issue is not localised by line, write only the file. If the issue is not localised by file, write the evaluation name here.
   d. **Fix**: What fix you would recommend. Go into as much detail as you feel confident about. It's okay if you don't have a strong solution to a given problem, but if you do, describe your intended solution.
   e. **Comment**: Write a comment that can be posted as a GitHub comment and that gives all the relevant information from above. Each comment should begin with (Agent), to make it clear where the comment comes from. We will provide functionality to post these as GitHub comments through the `gh` CLI in future, so make your comments compatible with this. The comments should be informative yet polite: the contributor may not have been the one to ask for this review, and external contributors have varying degrees of experience.
4. Write a summary in SUMMARY.md of the evaluation's overall quality, how many issues you found, and major issues if any. The contributor won't see this unless they're the one who asked for this review, so you can be more blunt here.
5. Tell the user your task is done.
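Steps 1 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions, not tooling that ships with the repository: the function names are hypothetical, the `existing` collection stands for the folder names already present under `agent_artefacts/`, and continuing the suffix past `_analysis2` to `_analysis3` and beyond is an assumption extrapolated from the single example in the text.

```python
import shutil
from pathlib import Path
from typing import Collection


def review_dir_name(eval_folder: str, version: str, existing: Collection[str]) -> str:
    """Derive `<eval_name>` from the eval folder name and the task version."""
    base = f"{eval_folder}_{version.replace('.', '_')}"
    if base not in existing:
        return base
    # Name already taken: append a numbered suffix, starting at 2 (assumed).
    n = 2
    while f"{base}_analysis{n}" in existing:
        n += 1
    return f"{base}_analysis{n}"


def set_up_review_dir(root: Path, eval_name: str, checklist: Path) -> Path:
    """Create the review folder, copy the checklist, and start the note files."""
    review_dir = root / "agent_artefacts" / eval_name / "review"
    review_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(checklist, review_dir / "EVALUATION_CHECKLIST.md")
    for name in ("NOTES.md", "UNCERTAINTIES.md"):
        (review_dir / name).touch(exist_ok=True)
    return review_dir


def format_issue(standard: str, issue: str, location: str, fix: str, comment: str) -> str:
    """Render one NOTES.md entry using the five fields listed in step 3."""
    # Every comment must begin with "(Agent)" so its origin is clear.
    if not comment.startswith("(Agent)"):
        comment = f"(Agent) {comment}"
    return "\n".join([
        f"a. **Standard**: {standard}",
        f"b. **Issue**: {issue}",
        f"c. **Location**: {location}",
        f"d. **Fix**: {fix}",
        f"e. **Comment**: {comment}",
    ])
```

For example, `review_dir_name("gpqa", "1.1.2", existing=["gpqa_1_1_2"])` yields `gpqa_1_1_2_analysis2`, matching the naming example in step 1c.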
