eval-quality-workflow
$ npx mdskill add UKGovernmentBEIS/inspect_evals/eval-quality-workflow
Refactor or audit single evaluations against checklist standards.
- Corrects evaluation code to match EVALUATION_CHECKLIST.md requirements.
- Operates within src/inspect_evals/ directory for targeted fixes.
- Selects fix or review mode based on user command intent.
- Outputs compliance status or refactored evaluation code directly.
SKILL.md
.github/skills/eval-quality-workflow
---
name: eval-quality-workflow
description: Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
---
# Evaluation Quality — Fix or Review
This skill covers two closely related workflows for a single evaluation in `src/inspect_evals/`:
- **Fix An Evaluation**: Refactor the evaluation to comply with EVALUATION_CHECKLIST.md
- **Review An Evaluation**: Assess compliance without making changes
## Identifying the Evaluation
If the user has given you a name, that takes priority. If you were just building an evaluation, or the user has uncommitted code for one specific evaluation, you can assume that's the correct one. If you are not confident which evaluation to look at, ask the user.
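The priority order above can be sketched as a small decision function. This is only an illustration: the function name and its parameters are hypothetical, and the relative order of "just built" versus "uncommitted code" is an assumption, since the text does not rank them against each other.

```python
from typing import Optional, Sequence


def pick_evaluation(
    named_by_user: Optional[str],
    just_built: Optional[str],
    evals_with_uncommitted_code: Sequence[str],
) -> Optional[str]:
    """Return the evaluation to work on, or None if the user must be asked."""
    # A name given by the user always takes priority.
    if named_by_user:
        return named_by_user
    # Assumption: an evaluation you were just building is checked next.
    if just_built:
        return just_built
    # Uncommitted code only counts if it points at exactly one evaluation.
    candidates = sorted(set(evals_with_uncommitted_code))
    if len(candidates) == 1:
        return candidates[0]
    # Not confident: the caller should ask the user.
    return None
```

A `None` result corresponds to the "ask the user" case.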
## Fix An Evaluation
Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to refactor the evaluation to meet these standards.
1. Set up the working directory:
a. If the user provides specific instructions about any step, assume the user's instructions override these instructions.
b. If there is no evaluation name, ask the user for one.
c. The evaluation name should be the eval folder name plus its version (from the @task function's version argument), with dots replaced by underscores. For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If this exact folder name already exists, append a numbered suffix, for example "gpqa_1_1_2_analysis2". This name will be referred to as `<eval_name>`.
d. Create a folder called `agent_artefacts/<eval_name>/fix` if it isn't present.
e. Whenever you create a .md file as part of this workflow, assume it is made in `agent_artefacts/<eval_name>/fix`.
f. Copy EVALUATION_CHECKLIST.md to the folder.
g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
2. Go over each item in the EVALUATION_CHECKLIST, using the linked documents for context where necessary, going from top to bottom. It is important to go over every single item in the checklist!
a. For each checklist item, assess your confidence that you know what is being asked of you. Select Low, Medium, or High.
b. If you have High confidence, fix the evaluation to pass the checklist if needed, then edit EVALUATION_CHECKLIST.md to place a check next to it. It is acceptable to check off an item without making any changes if it already passes the requirement.
c. If you have Medium confidence, make a note of this in UNCERTAINTIES.md along with any questions you have, then do your best to solve it as per the High confidence workflow.
d. If you have Low confidence, make a note of this in UNCERTAINTIES.md along with any questions you have, then do not attempt to solve it and leave the checkmark blank.
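The confidence rules in step 2 can be summarised as a lookup from confidence level to actions. The sketch below is illustrative only: `Confidence`, `Plan`, and `plan_for` are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


@dataclass(frozen=True)
class Plan:
    note_in_uncertainties: bool  # record the item and questions in UNCERTAINTIES.md
    attempt_fix: bool            # try to bring the evaluation into compliance
    may_check_off: bool          # allowed to tick the item in EVALUATION_CHECKLIST.md


def plan_for(confidence: Confidence) -> Plan:
    """Map a confidence level to the actions described in step 2."""
    if confidence is Confidence.HIGH:
        return Plan(note_in_uncertainties=False, attempt_fix=True, may_check_off=True)
    if confidence is Confidence.MEDIUM:
        return Plan(note_in_uncertainties=True, attempt_fix=True, may_check_off=True)
    # Low confidence: note the uncertainty, do not attempt a fix, leave unchecked.
    return Plan(note_in_uncertainties=True, attempt_fix=False, may_check_off=False)
```

Note that `attempt_fix` does not mean a change is required: an item that already passes can be checked off without edits.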
Do not attempt the evaluation report. Tell the user in SUMMARY.md that producing evaluation reports automatically requires a separate workflow (the `/eval-report-workflow` skill). You should still perform the initial tests to ensure the evaluation runs at all. Test on the model used by the example commands in the evaluation's README. For example: `uv run inspect eval inspect_evals/<eval_name> --model openai/gpt-5-nano --limit 2`
Your task is over when you have examined every single checklist item except for the evaluation report and completed each one that you are capable of. Do a final pass over NOTES.md and UNCERTAINTIES.md, then create a SUMMARY.md file and summarise what you have done. Then inform the user that your task is finished.
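For reference, the working-directory setup from step 1 could look roughly like the following. This is a minimal sketch under stated assumptions: the helper names are hypothetical, EVALUATION_CHECKLIST.md is assumed to sit at the repository root, and continuing the suffix past "analysis2" (to "analysis3" and so on) is an extrapolation from the single example given.

```python
import shutil
from pathlib import Path
from typing import Set


def artefact_name(eval_folder: str, version: str, existing: Set[str]) -> str:
    """Build <eval_name> from the eval folder name and the @task version."""
    base = f"{eval_folder}_{version.replace('.', '_')}"
    if base not in existing:
        return base
    # Assumption: keep counting up if "_analysis2" is also taken.
    n = 2
    while f"{base}_analysis{n}" in existing:
        n += 1
    return f"{base}_analysis{n}"


def set_up_workdir(repo_root: Path, eval_name: str, mode: str = "fix") -> Path:
    """Create agent_artefacts/<eval_name>/<mode> and seed it with the working files."""
    workdir = repo_root / "agent_artefacts" / eval_name / mode
    workdir.mkdir(parents=True, exist_ok=True)
    # Assumption: the checklist lives at the repository root.
    shutil.copy(repo_root / "EVALUATION_CHECKLIST.md", workdir / "EVALUATION_CHECKLIST.md")
    for filename in ("NOTES.md", "UNCERTAINTIES.md"):
        (workdir / filename).touch(exist_ok=True)
    return workdir
```

The same helper covers the review workflow by passing `mode="review"`.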
## Review An Evaluation
Our standards are in EVALUATION_CHECKLIST.md, with links to BEST_PRACTICES.md and CONTRIBUTING.md. Your job is to review the evaluation against the agent-checkable standards without making changes.
1. Set up the working directory:
a. If the user provides specific instructions about any step, assume the user's instructions override these instructions.
b. If there is no evaluation name, ask the user for one.
c. The evaluation name should be the eval folder name plus its version (from the @task function's version argument), with dots replaced by underscores. For instance, GPQA version 1.1.2 becomes "gpqa_1_1_2". If this exact folder name already exists, append a numbered suffix, for example "gpqa_1_1_2_analysis2". This name will be referred to as `<eval_name>`.
d. Create a folder called `agent_artefacts/<eval_name>/review` if it isn't present.
e. Whenever you create a .md file as part of this workflow, assume it is made in `agent_artefacts/<eval_name>/review`.
f. Copy EVALUATION_CHECKLIST.md to the folder.
g. Create a NOTES.md file for miscellaneous helpful notes. Err on the side of taking lots of notes. Create an UNCERTAINTIES.md file to note any uncertainties.
2. Read our EVALUATION_CHECKLIST.md and linked documents.
3. For each item in the [agent runnable checks](EVALUATION_CHECKLIST.md#agent-runnable-checks), go over all possibly relevant files in the evaluation. Work from top to bottom, using linked documents for context where necessary. It is important to go over every single item in the checklist! Note any issues you find in NOTES.md using the following format:
a. **Standard**: Describe the standard that has not been met.
b. **Issue**: Describe the issue.
c. **Location**: If possible, write the file and lines where the issue occurs. If the issue is not localised by line, write only the file. If the issue is not localised by file, write the evaluation name here.
d. **Fix**: Describe the fix you would recommend. Go into as much detail as you feel confident giving. It's okay if you don't have a strong solution to a given problem, but if you do, describe your intended solution.
e. **Comment**: Write a comment that can be used as a GitHub comment and that gives all the relevant information from above. Each comment should begin with (Agent), to make it clear where the comment comes from. We will provide functionality to write these as GitHub comments in future through the 'gh' CLI, so make your comments compatible with this. The comments should be informative, yet polite: the contributor may not have been the one to ask for this review, and external contributors have varying degrees of experience.
4. Write a summary in SUMMARY.md of the evaluation's overall quality, how many issues you found, and major issues if any. The contributor won't see this unless they're the one who asked for this review, so you can be more blunt here.
5. Tell the user your task is done.
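The issue format from step 3 can be kept consistent with a small formatter. The sketch below is one possible layout rather than a required one: `ReviewIssue` and `to_markdown` are hypothetical names, and rendering each issue as a bullet list is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ReviewIssue:
    standard: str  # the standard that has not been met
    issue: str     # what is wrong
    location: str  # file and lines, file only, or the evaluation name
    fix: str       # recommended fix, in as much detail as you are confident giving
    comment: str   # text intended for a GitHub comment

    def to_markdown(self) -> str:
        """Render the issue as a NOTES.md entry with an (Agent)-prefixed comment."""
        comment = self.comment
        if not comment.startswith("(Agent)"):
            comment = f"(Agent) {comment}"
        return "\n".join(
            [
                f"- **Standard**: {self.standard}",
                f"- **Issue**: {self.issue}",
                f"- **Location**: {self.location}",
                f"- **Fix**: {self.fix}",
                f"- **Comment**: {comment}",
            ]
        )
```

Counting the `ReviewIssue` entries collected during the review gives the issue total for SUMMARY.md.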