code-quality-review-all
$ npx mdskill add UKGovernmentBEIS/inspect_evals/code-quality-review-all

Audit all evaluations against one code quality standard.
- Enforces consistent quality across every evaluation in the repository.
- Depends on CONTRIBUTING.md and BEST_PRACTICES.md for standard definitions.
- Generates topic-specific documentation and structured review results.
- Outputs detailed summaries and markdown reports for quality assurance.
SKILL.md
.github/skills/code-quality-review-all
---
name: code-quality-review-all
description: Review all evaluations in the repository against a single code quality standard. Checks ALL evals against ONE standard for periodic quality reviews. Use when user asks to review/audit/check all evaluations for a specific topic or standard. Do NOT use for reviewing a single eval (use eval-quality-workflow instead) or for test coverage (use ensure-test-coverage instead).
---

# Review All Evaluations

Review all evaluations in the repository against a single code quality standard or topic. This workflow is useful for systematic code quality improvements and ensuring consistency across all evaluations.

## Workflow

### Setup

If not already provided, ask user for the topic from the CONTRIBUTING.md or BEST_PRACTICES.md that this review should be focused on. If not provided, come up with a short topic identifier in a file-safe format (e.g., `pytest_marks`, `import_patterns`, `test_coverage`).

Create or read existing directory structure:

- `<repo root>/agent_artefacts/code_quality/<topic_id>/` - Directory for this review topic
- `<repo root>/agent_artefacts/code_quality/<topic_id>/README.md` - Documentation for this specific topic
- `<repo root>/agent_artefacts/code_quality/<topic_id>/results.json` - Results of the review
- `<repo root>/agent_artefacts/code_quality/<topic_id>/SUMMARY.md` - Summary of the review

### README.md Structure

The `README.md` file should contain topic-specific information:

- **Topic Description**: What this review checks for and why it matters
- **Requirements**: Specific requirements from CONTRIBUTING.md or BEST_PRACTICES.md
- **Detection Strategy**: How to identify issues (patterns to look for, tools to use)
- **Commands**: Specific commands useful for this topic (grep patterns, pytest commands, etc.)
- **Good Examples**: Code snippets showing correct implementation
- **Bad Examples**: Code snippets showing common mistakes and how to fix them
- **Review Date**: When this documentation was created/updated

### results.json Structure

The `results.json` file should follow the template in `assets/results-template.json`. It contains one entry per evaluation in `<repo root>/src/inspect_evals/`, with status and issue details.

**Important**: The `issue_location` field should use paths relative to the repository root with forward slashes (e.g., `tests/foo/test_foo.py:42` or `src/inspect_evals/foo/bar.py:15`, not `C:\Users\...\test_foo.py:42` or `tests\foo\test_foo.py:42`).

### General Guidelines for All Topics

1. **Systematic Approach**: Review all evaluations in a consistent manner. Use scripts or automated tools where possible to ensure completeness.
2. **Clear Issue Reporting**: Each issue should include:
   - The specific file and line number where the issue occurs
   - A clear description of what's wrong
   - A concrete suggestion for how to fix it
   - The issue type for categorization
3. **Verification**: After identifying potential issues, verify a sample of them by reading the actual files to ensure accuracy.
4. **Statistics**: Provide summary statistics including:
   - Total evaluations reviewed
   - Number passing/failing
   - Breakdown of issue types
   - Most affected evaluations
5. **Prioritization**: Identify which issues are most critical or affect the most evaluations to help guide remediation efforts.
6. **Reusability**: Write scripts and documentation that can be rerun easily as the codebase evolves. Include any helper scripts in the topic directory.
7. **False Positives**: Be aware that automated detection may produce false positives. When possible, include logic to reduce these or document known limitations.

### Workflow Steps

1. Create the directory structure: `agent_artefacts/code_quality/<topic_id>/`
2. **Handle existing results.json** (if re-running review):
   - Read existing results.json
   - Note issues that have "fix_status" field (were attempted to be fixed)
   - After scanning current code state, compare with existing results
   - **Remove entries for issues that no longer exist** (they were successfully fixed)
   - Keep entries for issues that still exist, preserving "fix_status" if present
   - Add new entries for newly discovered issues
3. Use autolint to scan all evaluations: `uv run python tools/run_autolint.py --all-evals --check`. Parse its output to identify structural issues across evals. For topic-specific checks beyond autolint's scope, write targeted grep/AST scripts in the topic directory.
4. Organize findings by evaluation name
5. Write topic-specific documentation to `README.md`
6. Write results to `results.json` with relative paths:
   - Include all currently detected issues
   - **IMPORTANT**: Preserve "fix_status" field for issues that still exist
   - Remove issues that are no longer detected in the code
7. Create a `SUMMARY.md` file in the topic directory with:
   - Overview of findings
   - Key statistics
   - Most affected evaluations
   - Recommendations for remediation
   - Impact analysis
8. If you created helper scripts, save them in the topic directory for future use
9. Inform the user that the review is complete and where to find the results

## Skill Responsibilities

**This skill (code-quality-review-all) owns results.json** and has full control:

- Add new issues when detected
- Update existing issues if location/description changes
- **Remove issues that no longer exist** in the codebase
- Preserve "fix_status" field when updating issues
- Update evaluation status (pass/fail) based on current findings

**The code-quality-fix-all skill** has limited control:

- Can ONLY add/update the "fix_status" field on existing issues
- Cannot remove entries from results.json
- Relies on this skill to verify fixes and remove resolved issues

This separation ensures:

- Clear ownership of results.json
- Fix skill focuses on fixing, not determining what's fixed
- Review skill has authoritative view of current code state

## Expected Output

After running this workflow, you should have:

```text
agent_artefacts/code_quality/<topic_id>/
├── README.md          # Topic-specific documentation
├── results.json       # Detailed results for all evaluations
├── SUMMARY.md         # Executive summary
└── <helper_scripts>   # Optional: automated checker scripts
```
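
The re-run handling in the Workflow Steps (drop resolved issues, keep surviving ones with their "fix_status", add new ones) and the forward-slash rule for `issue_location` can be sketched as a small helper script. This is only an illustration under stated assumptions: the real schema lives in `assets/results-template.json`, which is not reproduced here, so the layout used below (evaluation name mapped to `status` and `issues`, with an `issue_type` key on each issue) and the function names are hypothetical. Only `issue_location` and `fix_status` are named in the skill text.

```python
def normalize_location(location: str, repo_root: str) -> str:
    """Return a repo-relative `path:line` string that uses forward slashes."""
    # Split off a trailing ":<line>" if present; a Windows drive colon is not a line number.
    path, sep, line = location.rpartition(":")
    if not sep or not line.isdigit():
        path, line = location, ""
    path = path.replace("\\", "/")
    root = repo_root.replace("\\", "/").rstrip("/")
    if root and path.startswith(root + "/"):
        path = path[len(root) + 1:]
    return f"{path}:{line}" if line else path


def issue_key(issue: dict) -> tuple[str, str]:
    # ASSUMPTION: an issue is identified by its type and location.
    return (issue["issue_type"], issue["issue_location"])


def merge_results(existing: dict, detected: dict) -> dict:
    """Combine a previous results.json with freshly detected issues.

    existing: eval name -> {"status": ..., "issues": [...]}  (hypothetical layout)
    detected: eval name -> list of issue dicts from the current scan
    """
    merged = {}
    for eval_name, issues in sorted(detected.items()):
        previous = {
            issue_key(old): old
            for old in existing.get(eval_name, {}).get("issues", [])
        }
        kept = []
        for issue in issues:
            issue = dict(issue)  # copy so the scan output is not mutated
            old = previous.get(issue_key(issue))
            if old is not None and "fix_status" in old:
                # The issue still exists: carry over the fix skill's annotation.
                issue["fix_status"] = old["fix_status"]
            kept.append(issue)
        # Issues present only in `existing` are dropped: they are no longer detected.
        merged[eval_name] = {"status": "fail" if kept else "pass", "issues": kept}
    return merged
```

Matching on the exact `issue_location` is deliberately simple. If a fix elsewhere in a file shifts line numbers, a surviving issue would be treated as new and lose its "fix_status", so a real checker would likely need a looser match (for example, file path plus issue type).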
More from UKGovernmentBEIS/inspect_evals
- build-repo-context: Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
- check-trajectories-workflow: Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
- ci-maintenance-workflow: CI and GitHub Actions maintenance workflows: fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.
- code-quality-fix-all: Fix code quality issues identified in a code quality review stored in agent_artefacts/code_quality/<topic>/. Systematically addresses issues found by the code-quality-review-all skill for ANY code quality topic, with validation and testing at each step. Use when user asks to fix issues from a code quality review, or asks to fix issues from agent_artefacts/code_quality/<topic>.
- create-eval: Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository; they live in standalone repos. Use when user asks to create/implement/build a new evaluation.
- ensure-test-coverage: Ensure test coverage for a single evaluation, both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).
- eval-quality-workflow: Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
- eval-report-workflow: Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.
- eval-validity-review: Review a single evaluation's validity: whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).
- generate-asset-actions: Generate asset-actions.yaml from ASSETS.yaml by classifying assets into priority tiers. Use when the user asks to regenerate, update, or refresh the asset actions.