ci-maintenance-workflow
$ npx mdskill add UKGovernmentBEIS/inspect_evals/ci-maintenance-workflow
Diagnose and repair failing tests or review PRs against standards.
- Fixes pytest failures, smoke tests, or adds slow markers to tests.
- Integrates with GitHub Actions and gh CLI for run diagnostics.
- Executes fixes based on provided URLs or agent-checkable standards.
- Delivers repaired code or PR reviews directly to the user.
SKILL.md
.github/skills/ci-maintenance-workflow
---
name: ci-maintenance-workflow
description: CI and GitHub Actions maintenance workflows — fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.
---
# CI Maintenance Workflows
This skill covers four CI/GitHub Actions-related workflows:
- **Fix A Failing Test**: Diagnose and fix a failing test from a CI run
- **Fix A Failing Smoke Test**: Diagnose and fix a failing smoke test (eval that fails with `--limit 0`)
- **Mark Slow Tests**: Add `@pytest.mark.slow` markers to tests exceeding thresholds
- **Review PR**: Automatically check a PR against agent-checkable standards
---
## Write a PR For A Failing Test
This workflow is for fixing failing **pytest tests** (not smoke tests — see "Fix A Failing Smoke Test" below for those).
1. The user should provide you with a link that looks like `https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<number>/job/<number2>`, or a test to fix. If they haven't, ask them for the URL of the failing GitHub Actions run, or for the test that should be fixed.
2. If a link is provided, you will be unable to view it directly, but you can use `gh run view` to see it.
3. Check out main and do a git pull. Make sure you're starting from a clean slate and not pulling in any changes from whatever you or the user was previously doing.
4. Create a branch `agent/<number>` for your changes, in inspect_evals. Note this is not the same as inspect-evals-actions where the above URL was from.
5. Identify the failing test(s) and any other problems. Verify the test currently fails in this environment by running it with `uv run pytest tests/<eval> --runslow`.
6. Fix the problems one at a time, verifying each fix locally with `uv run pytest tests/<eval> --runslow`.
7. **On every change, assess if a task version bump and changelog entry are required.** Read TASK_VERSIONING.md and PACKAGE_VERSIONING.md, then explicitly decide for this change. Default to bumping unless you can articulate why results remain comparable. Crash-fixes count: turning a sample that previously errored into one that now scores (correct or incorrect) changes which samples contribute to the aggregate, so the fix affects comparability even though no "scoring logic" was rewritten. If you bump, update the eval's README changelog and run `uv run scriv create` for a package-level fragment.
8. Perform linting as seen in .github/workflows/build.yml, and fix any problems that arise.
9. Commit your changes to the branch.
10. Before pushing, go over your changes and make sure all the changes that you have are high quality and that they are relevant to your finished solution. Remove any code that was used only for diagnosing the problem.
11. Push your changes.
12. Use the GitHub API to create a pull request with the change. Include a link to the failing test, which test failed, and how you fixed it. Include information on what you did to verify the fix worked. **Always include the full checklist from `.github/PULL_REQUEST_TEMPLATE.md` verbatim in the PR body**: the `Check PR template` CI job greps for the literal checklist substrings and fails the PR otherwise. This applies to bot/automated PRs too.
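Step 2 can be sketched as a small helper that turns the run URL from step 1 into a `gh run view` invocation. This is only an illustration: `RUN_URL` and `gh_view_command` are hypothetical names, and the sketch assumes the URL has exactly the shape quoted in step 1. The `--repo`, `--job`, and `--log-failed` flags are standard `gh run view` options.

```python
import re

# Matches the run URL shape quoted in step 1; the job segment is optional.
RUN_URL = re.compile(
    r"https://github\.com/(?P<repo>[^/]+/[^/]+)/actions/runs/(?P<run_id>\d+)"
    r"(?:/job/(?P<job_id>\d+))?"
)


def gh_view_command(url: str) -> list[str]:
    """Build a `gh run view` command that prints the logs of failed steps."""
    match = RUN_URL.match(url)
    if match is None:
        raise ValueError(f"not a GitHub Actions run URL: {url}")
    command = ["gh", "run", "view", "--repo", match["repo"], "--log-failed"]
    if match["job_id"]:
        # Job-scoped URL: ask for that job only.
        command += ["--job", match["job_id"]]
    else:
        # Run-scoped URL: pass the run ID as the positional argument.
        command.append(match["run_id"])
    return command
```

The returned list can be passed to `subprocess.run` as-is. Building a list rather than a shell string avoids quoting problems.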
---
## Fix A Failing Smoke Test
This workflow is for fixing evals that fail their **smoke test** (running with `--limit 0`). Smoke test failures are different from regular test failures — they often involve upstream changes to datasets, APIs, or dependencies, and the root cause is not always obvious from the error message alone. Take extra care to understand the problem before proposing a fix.
### Hard rules (read first, no exceptions)
- **Reproduce before fixing.** Run the smoke command locally and try to observe the same failure CI saw before changing any code.
- **Verify after fixing.** Rerun the same smoke command after your change and confirm it now runs cleanly.
- **Do not skip, xfail, or disable the eval as a "fix".** All of the following are forbidden:
  - Adding `@pytest.mark.skip`/`skipif`/`xfail`, `pytest.skip(...)`, `pytest.xfail(...)`, `unittest.skip(...)` to the eval's tests
  - Removing the eval from the smoke-test target list, the eval registry, `tools/run_evals.py` allowlists, or any CI matrix
  - Commenting out the failing solver/scorer/dataset-loading path
  - Wrapping the failing call in a bare `try/except` that swallows the error and returns a dummy result
  - Short-circuiting the eval to return success without doing the work (e.g. early-return on `--limit 0`)
  - Loosening a scorer or assertion to accept whatever the broken code currently produces
  - Replacing a real dataset or upstream call with a stub purely to make the smoke pass
If a candidate fix matches any of the above, it is not a fix. Treat the situation as "blocked" instead and use the blocked-PR path described under "Reporting" below.
### Reporting (when you can't reproduce, can't verify, or can't fix)
The smoke-fix workflow always produces a PR — even when you fail at one of the gates. Do not silently abandon the run, do not push a skip-based fallback, and do not just leave a comment somewhere. The PR is the report.
If you hit any of these situations, still raise a **draft** PR on your branch (use `git commit --allow-empty` if there are no code changes to push) with the standard structured body from "PR Description" below, populated as follows:
- **Could not reproduce the CI failure locally:** put `Unable to reproduce locally — <command run>, <local outcome>` in **How I validated**, and put your hypotheses about the env/dependency/dataset discrepancy in **My uncertainty**. Leave **My bugfix** as `None — blocked on reproduction`.
- **Reproduced but could not verify a fix (rerun still fails, or eval can't be run locally):** put `Unable to verify because <reason>` in **How I validated**, and explain in **My uncertainty**.
- **Root cause is genuinely unfixable here (e.g. dataset removed upstream with no replacement):** describe what you tried and why you stopped, in **My bugfix** and **My uncertainty**.
Mark the PR as draft (override the usual non-draft convention for this case) so it's clear a human needs to take over. Keep the `automated-smoke-test-fix` label so the existing triage flow picks it up.
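The two gates and the draft rule above can be read as a small decision function. The sketch below is one possible encoding, not part of the repository: `Outcome`, `classify`, and `is_draft` are hypothetical names, and it assumes the smoke command signals failure through a non-zero exit code. The "genuinely unfixable" case is a judgment call and is not derivable from exit codes, so it is left out.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FIXED = "fixed"
    BLOCKED_ON_REPRODUCTION = "blocked on reproduction"
    BLOCKED_ON_VERIFICATION = "blocked on verification"


def classify(repro_exit: int, verify_exit: Optional[int]) -> Outcome:
    """Map the two gate results to a PR outcome.

    repro_exit:  exit code of the smoke command before any change.
    verify_exit: exit code of the same command after the change,
                 or None if the rerun could not be performed.
    """
    if repro_exit == 0:
        # The failure was never observed locally, so no fix may be claimed.
        return Outcome.BLOCKED_ON_REPRODUCTION
    if verify_exit is None or verify_exit != 0:
        # Reproduced, but the rerun is missing or still failing.
        return Outcome.BLOCKED_ON_VERIFICATION
    return Outcome.FIXED


def is_draft(outcome: Outcome) -> bool:
    """Every blocked outcome is raised as a draft PR."""
    return outcome is not Outcome.FIXED
```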
### Investigation
1. The user should provide the name of the failing eval, or a link to the failing smoke test run. If they haven't, ask them.
2. If a link is provided, you will be unable to view it directly, but you can use `gh run view` to see it.
3. Check out main and do a git pull. Make sure you're starting from a clean slate.
4. Create a branch for your changes.
5. **Read the eval source code.** Find and read the source for the failing eval under `src/inspect_evals/`.
6. **Check the eval's commit history.** Run `git log --oneline -20 -- src/inspect_evals/<eval_name>/` to understand recent changes. This can reveal whether the eval was recently modified in ways that are relevant to the failure — for example, a parameter that was recently added or removed.
7. **Reproduce the failure (gate).** Run the eval with:
```bash
uv run python tools/run_evals.py --pred <eval_name> --fail-fast
```
Read the full error output carefully. Note the exact error message and traceback — it will populate the **The error message** field in the PR body.
If the eval runs cleanly locally (i.e. you cannot reproduce the CI failure), do NOT write a speculative fix. Follow the "Reporting" section above to raise a blocked draft PR; a PR that claims to fix a failure you never observed is not acceptable.
8. **Formulate a hypothesis.** Before writing any code, articulate:
- What the error message means
- How it connects to the eval's source code
- What specifically is broken and why
- Whether recent commits (from step 6) are relevant
If the error is ambiguous or could have multiple causes, investigate further before jumping to a fix. Check upstream dependencies, dataset availability, and API changes.
### Fix
1. **Implement the fix.** Make the minimal change needed to resolve the issue. Avoid speculative changes — only fix what is directly connected to the error. Re-read the "Hard rules" above before you start typing — if your candidate fix matches anything in the forbidden list, it is not a fix.
2. **Verify the fix (gate).** Run the eval again with the **same command** you used to reproduce:
```bash
uv run python tools/run_evals.py --pred <eval_name> --fail-fast
```
The eval must complete without errors. The command and outcome go in the **How I validated** field of the PR body. If it still fails, you have not fixed the bug — keep iterating, or follow the "Reporting" section above to raise a blocked draft PR. If you cannot run the eval at all (e.g. disk space, credentials, network), use the blocked-PR path with `Unable to validate because <reason>` in **How I validated** — do not claim verification you did not perform.
3. **On every change, assess if a task version bump and changelog entry are required.** Read TASK_VERSIONING.md and PACKAGE_VERSIONING.md, then explicitly decide for this change. Default to bumping unless you can articulate why results remain comparable. Crash-fixes count: turning a sample that previously errored into one that now scores (correct or incorrect) changes which samples contribute to the aggregate, so the fix affects comparability even though no "scoring logic" was rewritten. If you bump, update the eval's README changelog and run `uv run scriv create` for a package-level fragment.
4. Perform linting with `make check` and fix any problems that arise.
5. Commit your changes to the branch.
6. Before pushing, go over your changes and make sure they are all high quality and relevant to your finished solution. Remove any code that was used only for diagnosing the problem. Re-confirm you have not introduced any of the forbidden patterns from the "Hard rules" section above: run `git diff main...HEAD | grep -nE "skip|xfail"` and inspect every match (some matches will be unrelated; the goal is to catch any skip/xfail you added).
7. Push your changes.
### PR Description
1. Create a pull request using the GitHub API. The PR body **must** include the following sections:
```text
**The bug:** <What is broken — one sentence>
**The error message:** <The key error line(s) from the traceback>
**My hypothesis:** <Why this error occurs — connect the error to the code>
**My bugfix:** <What you changed and why it addresses the root cause>
**How I validated:** <Exact command(s) you ran and the outcome, or "Unable to validate because <reason>">
**My uncertainty:** <What you are not sure about, or "None — fix is straightforward and verified">
```
If you were unable to verify the fix, or your confidence is low, say so clearly in the uncertainty section. It is better to raise a PR that says "I think this is the issue but I could not verify" than to ship an unverified fix with false confidence.
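One way to make sure no section is dropped is to render the body from a typed record. The sketch below is illustrative only: `SmokeFixReport`, `LABELS`, and `render_pr_body` are hypothetical names, and the labels are copied from the template above.

```python
from dataclasses import dataclass, fields


@dataclass
class SmokeFixReport:
    bug: str
    error_message: str
    hypothesis: str
    bugfix: str
    validation: str
    uncertainty: str


# Section labels, in the order the template lists them.
LABELS = {
    "bug": "The bug",
    "error_message": "The error message",
    "hypothesis": "My hypothesis",
    "bugfix": "My bugfix",
    "validation": "How I validated",
    "uncertainty": "My uncertainty",
}


def render_pr_body(report: SmokeFixReport) -> str:
    """Render all six sections, refusing to emit an empty one."""
    lines = []
    for field in fields(report):
        value = getattr(report, field.name).strip()
        if not value:
            raise ValueError(f"section '{LABELS[field.name]}' must not be empty")
        lines.append(f"**{LABELS[field.name]}:** {value}")
    return "\n".join(lines)
```

Raising on an empty field forces blocked runs to state their `Unable to ...` reason explicitly rather than leaving a section blank.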
---
## Mark Slow Tests
This workflow adds `@pytest.mark.slow(seconds)` markers to tests that exceed the 10-second threshold in CI but are not already marked. This is necessary because our CI pipeline checks for unmarked slow tests and fails the build if any test takes >= 10 seconds without a marker.
### Workflow Steps
1. **Get the GitHub Actions URL**: The user should provide a URL like `https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<run_id>` or `https://github.com/ArcadiaImpact/inspect-evals-actions/actions/runs/<run_id>/job/<job_id>`. If they haven't, ask them for the URL of the failing GitHub Actions run.
2. **Extract slow test information**: Use the fetch_slow_tests.py script to get all slow tests and their recommended markers:
```bash
uv run python agent_artefacts/scripts/fetch_slow_tests.py <url_or_run_id> --threshold 10
```
This script:
- Fetches logs from all jobs in the GitHub Actions run
- Parses test timing information from the "Check for unmarked slow tests" step
- Aggregates durations across all Python versions and platforms
- Reports the maximum duration observed for each test
- Provides recommended `@pytest.mark.slow(N)` marker values (ceiling of max duration)
For JSON output (useful for programmatic processing):
```bash
uv run python agent_artefacts/scripts/fetch_slow_tests.py <url_or_run_id> --threshold 10 --json
```
Tests marked `[FAIL]` in the output exceeded 60 seconds and caused the CI to fail. All tests above the threshold (default 10s) should be marked.
3. **Check existing markers**: For each identified test, check if it already has a `@pytest.mark.slow` marker:
```bash
grep -E "def <test_name>|@pytest\.mark\.slow" tests/<path>
```
If the test already has a slow marker with a value >= the observed duration, skip it (per user instruction: "If the number is there but too high, leave it as is").
4. **Add slow markers**: For tests without markers, add `@pytest.mark.slow(N)` where N is the observed maximum duration rounded up to the nearest integer. Place the marker immediately before other decorators like `@pytest.mark.huggingface` or `@pytest.mark.asyncio`.
Example:
```python
import pytest

# Before
@pytest.mark.huggingface
def test_example():
    ...

# After: the slow marker goes above the other decorators
@pytest.mark.slow(45)
@pytest.mark.huggingface
def test_example():
    ...
```
5. **Run linting**: After adding markers, run linting to ensure proper formatting:
```bash
uv run ruff check --fix <modified_files>
uv run ruff format <modified_files>
```
6. **Commit**: Commit the changes with a descriptive message listing all tests that were marked:
```text
Add slow markers to tests exceeding 10s threshold
Tests marked:
- <test_path>::<test_name>: slow(N)
- ...
Co-Authored-By: Claude <assistant_name> <noreply@anthropic.com>
```
### Notes
- The slow marker value should be slightly higher than the observed maximum to account for CI variance. Rounding up to the nearest integer is usually sufficient.
- Parametrized tests (e.g., `test_foo[param1]`) share a single marker on the base test function. You should use the highest value found across all params for the slow marker.
TODO: What if one param is an order of magnitude higher than the rest? Should we recommend splitting it into its own test? pytest.mark.slow contents aren't that crucial, but still.
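The marker arithmetic described in these notes (maximum across jobs and parameters, rounded up) can be sketched as follows. This is not the logic of `fetch_slow_tests.py`, which is not shown here. `recommended_markers` is a hypothetical function, the `(test_id, seconds)` input shape is an assumption, and so is the `>=` comparison against the threshold.

```python
import math
from collections import defaultdict


def recommended_markers(timings, threshold: float = 10.0) -> dict[str, int]:
    """Compute slow(N) values from (test_id, seconds) pairs across all jobs."""
    slowest = defaultdict(float)
    for test_id, seconds in timings:
        # Parametrized ids like "test_foo[param1]" share the base function's marker.
        base = test_id.split("[", 1)[0]
        slowest[base] = max(slowest[base], seconds)
    # Round the maximum up to the nearest integer; drop tests under the threshold.
    return {
        test: math.ceil(seconds)
        for test, seconds in slowest.items()
        if seconds >= threshold
    }
```

A single slow parameter sets the marker for the whole function, which is the situation the TODO above raises.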
---
## Review PR According to Agent-Checkable Standards
This workflow is designed to run in CI (GitHub Actions) to automatically review pull requests against the agent-checkable standards in EVALUATION_CHECKLIST.md. It produces a SUMMARY.md file that is posted as a PR comment.
### Important Notes
- This workflow is optimized for automated CI runs, not interactive use.
- The SUMMARY.md file location is `/tmp/SUMMARY.md` (appropriate for GitHub runners).
- Unlike other workflows, this one does NOT create folders in agent_artefacts since it runs in ephemeral CI environments.
### Workflow Steps
1. **Identify if an evaluation was modified**: Look at the changed files in the PR to determine which evaluation(s) are affected. Use `git diff` commands to see what has changed. If an evaluation (found in src/inspect_evals) was modified, complete all steps in the workflow. If no evaluation was modified, skip any steps that say (Evaluation) in them.
2. **Read the repo context**: Read agent_artefacts/repo_context/REPO_CONTEXT.md if present. Use these lessons in your analysis.
3. **(Evaluation) Read the standards**: Read EVALUATION_CHECKLIST.md, focusing on the [Agent Runnable Checks](EVALUATION_CHECKLIST.md#agent-runnable-checks) section. These are the standards you will check against.
4. **(Evaluation) Check each standard**: For each evaluation modified in the PR, go through every item in the Agent Runnable Checks section:
- Read the relevant files in the evaluation
- Compare against the standard
- For any violations found, add an entry to `/tmp/NOTES.md` with:
  - **Standard**: Which standard was violated
  - **Issue**: Description of the issue
  - **Location**: `file_path:line_number` or `file_path` if not line-specific
  - **Recommendation**: What should be changed
Only worry about violations that exist within the code that was actually changed. Pre-existing violations that were not touched or worsened by this PR can be safely ignored.
5. **Examine test coverage**: Use the ensure-test-coverage skill and appropriate commands to identify any meaningful gaps in test coverage. You can skip this step if a previous step showed no tests were found, since this issue will already be pointed out.
6. **Apply code quality analysis**: Examine all files changed for quality according to BEST_PRACTICES.md. Perform similar analysis and note-taking as in Step 4. If you notice poor quality in a way that is not mentioned in BEST_PRACTICES.md, add this to your notes, and under **Standard**, write "No explicit standard - agent's opinion". Err on the side of including issues here, unless REPO_CONTEXT.md or another document tells you not to. We will be updating these documents over time, and it is easier for us to notice something too nitpicky than to notice the absence of something important.
7. **Check infrastructure standards**: Also apply the [Infrastructure Changes (Agent)](EVALUATION_CHECKLIST.md#infrastructure-changes-agent) check from EVALUATION_CHECKLIST.md — this applies to all PRs, not just eval submissions. It's a quick check: if high-impact files changed, did the relevant docs follow?
8. **Write the summary**: After checking all standards, create `/tmp/SUMMARY.md` by consolidating the issues from `/tmp/NOTES.md`:
```markdown
## Summary
Brief overview of what was reviewed and overall assessment.
## Issues Found
### [Standard Category Name]
**Issue**: Description of the issue
**Location**: `file_path:line_number` or `file_path` if not line-specific
**Recommendation**: What should be changed
(Repeat for each issue from NOTES.md)
## Notes
Any additional observations or suggestions that aren't violations but could improve the code.
```
If no issues are found, write a brief summary confirming the PR passes all agent-checkable standards. Do NOT include a list of passed checks - it is assumed that anything not mentioned is fine.
9. **Important**: Always write to `/tmp/SUMMARY.md` even if there are no issues. The CI workflow depends on this file existing.
End your comment with 'This is an automatic review performed by Claude Code. Any issues raised here should be fixed or justified, but a human review is still required in order for the PR to be merged.'
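Steps 8 and 9 can be sketched as a renderer plus an unconditional write. The names `Issue`, `render_summary`, and `write_summary` are hypothetical, and the sketch assumes the issues from the notes file have already been collected into objects. The headings and closing sentence are taken from the template and instruction above.

```python
from dataclasses import dataclass
from pathlib import Path

FOOTER = (
    "This is an automatic review performed by Claude Code. Any issues raised "
    "here should be fixed or justified, but a human review is still required "
    "in order for the PR to be merged."
)


@dataclass
class Issue:
    standard: str
    issue: str
    location: str
    recommendation: str


def render_summary(overview: str, issues: list[Issue], notes: str = "") -> str:
    """Render the SUMMARY.md layout; the Issues section is omitted when empty."""
    blocks = ["## Summary", overview]
    if issues:
        blocks.append("## Issues Found")
    for item in issues:
        blocks.append(
            f"### {item.standard}\n"
            f"**Issue**: {item.issue}\n"
            f"**Location**: `{item.location}`\n"
            f"**Recommendation**: {item.recommendation}"
        )
    if notes:
        blocks += ["## Notes", notes]
    blocks.append(FOOTER)
    return "\n\n".join(blocks) + "\n"


def write_summary(text: str, path: str = "/tmp/SUMMARY.md") -> None:
    # Always written, even with no issues, because the CI job expects the file.
    Path(path).write_text(text, encoding="utf-8")
```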
### What This Workflow Does NOT Check
- Human-required checks (these need manual review)
- Evaluation report quality (requires running the evaluation)
- Whether the evaluation actually works (requires execution)
This workflow only checks code-level standards that can be verified by reading the code.
More from UKGovernmentBEIS/inspect_evals
- `build-repo-context`: Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
- `check-trajectories-workflow`: Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
- `code-quality-fix-all`: Fix code quality issues identified in a code quality review stored in agent_artefacts/code_quality/<topic>/. Systematically addresses issues found by the code-quality-review-all skill for ANY code quality topic, with validation and testing at each step. Use when user asks to fix issues from a code quality review, or asks to fix issues from agent_artefacts/code_quality/<topic>.
- `code-quality-review-all`: Review all evaluations in the repository against a single code quality standard. Checks ALL evals against ONE standard for periodic quality reviews. Use when user asks to review/audit/check all evaluations for a specific topic or standard. Do NOT use for reviewing a single eval (use eval-quality-workflow instead) or for test coverage (use ensure-test-coverage instead).
- `create-eval`: Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository; they live in standalone repos. Use when user asks to create/implement/build a new evaluation.
- `ensure-test-coverage`: Ensure test coverage for a single evaluation - both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).
- `eval-quality-workflow`: Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
- `eval-report-workflow`: Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.
- `eval-validity-review`: Review a single evaluation's validity: whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).
- `generate-asset-actions`: Generate asset-actions.yaml from ASSETS.yaml by classifying assets into priority tiers. Use when the user asks to regenerate, update, or refresh the asset actions.