build-repo-context

Name: build-repo-context
Author: UKGovernmentBEIS/inspect_evals

$npx mdskill add UKGovernmentBEIS/inspect_evals/build-repo-context

Extracts institutional knowledge from GitHub history into a shared context document.

Helps agents understand repo conventions and avoid common mistakes.
Integrates with GitHub CLI to fetch PRs, issues, and review comments.
Decides scope by checking existing document headers for date and PR ranges.
Outputs an updated markdown file containing distilled knowledge and patterns.

SKILL.md

.github/skills/build-repo-contextView on GitHub ↗

---
name: build-repo-context
description: Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
---

# Build Repo Context

Crawl GitHub history (PRs, issues, review comments) and distill institutional knowledge into `agent_artefacts/repo_context/REPO_CONTEXT.md`. This document helps worker agents understand repo conventions, common mistakes, and known tech debt before making changes.

## Workflow

### 1. Setup

1. Create `agent_artefacts/repo_context/` if it doesn't exist
2. Read existing `agent_artefacts/repo_context/REPO_CONTEXT.md` if present (will be updated, not replaced)

### 2. Identify What's New

Use the header of `REPO_CONTEXT.md` to determine what to process. The header contains the last-updated date and PR range (e.g., `PRs processed: #965-#1050`).

- **First run** (no `REPO_CONTEXT.md`): Fetch the most recent 50 merged PRs + all open issues
- **Incremental runs**: Fetch PRs merged after the highest PR number in the header, and issues updated since the last-updated date

Use the `gh` CLI to list candidates:

```bash
# First run: recent merged PRs
gh pr list --state merged --limit 50 --json number,title,labels,additions,deletions,reviewDecision,mergedAt

# Incremental: PRs merged since last crawl
gh pr list --state merged --search "merged:>YYYY-MM-DD" --limit 50 --json number,title,labels,additions,deletions,reviewDecision,mergedAt

# Open issues
gh issue list --state open --limit 100 --json number,title,labels,createdAt,updatedAt
```

### 3. Triage

Fast pass over PR titles and metadata. **Skip** these categories (they rarely contain design insights):

- Dependency bumps (titles matching `bump`, `update dependencies`, `renovate`, `dependabot`)
- Changelog-only updates (titles matching `changelog`, `scriv`)
- Bot-generated PRs with no review comments
- PRs with fewer than 5 lines changed and no review comments

**Prioritize** PRs that have:

- Review comments (especially multiple rounds — that's where design discussion lives)
- Changes touching shared utilities (`src/inspect_evals/utils/`, `CONTRIBUTING.md`, `BEST_PRACTICES.md`, `AGENTS.md`)

**Cap at 50 PRs per run** to keep execution time reasonable.

### 4. Extract

For each selected PR, fetch:

```bash
# PR body and metadata
gh pr view <N> --json body,title,labels,files,reviewDecision,comments,reviews

# Review comments (inline code review feedback)
gh api repos/{owner}/{repo}/pulls/<N>/comments --paginate

# Issue comments (general discussion)
gh api repos/{owner}/{repo}/issues/<N>/comments --paginate
```

For open issues, fetch body and comments similarly.

**Link traversal**: If a comment references another PR/issue (e.g., "see #123" or "fixed in #456"), continue to crawl recursively up to 3 hops in total. Do not recurse to an existing PR/issue in the chain to prevent loops.

### 5. Distill

This is the core intellectual work. For each PR/issue, extract **actionable insights** in these categories:

- **Design decisions**: What architectural choice was made and why? What alternatives were rejected?
- **Reviewer corrections**: What mistakes did reviewers catch? These reveal common pitfalls.
- **Established conventions**: What patterns were deliberately chosen that future contributors should follow?
- **Tech debt acknowledged**: What shortcuts were taken intentionally? What should NOT be "fixed" without discussion?
- **Common agent mistakes**: If review comments mention agent-generated code issues, capture the pattern.

**Quality requirements for each insight**:

- Must cite source PR/issue number (e.g., "Per PR #973...")
- Must be actionable ("Do X" / "Don't do Y"), not descriptive ("PR #123 added X")
- Must add nuance beyond what CONTRIBUTING.md and BEST_PRACTICES.md already state
- Must be relevant to future contributors, not just historically interesting
- Must be broadly applicable beyond a single issue or evaluation. If the context is excessively narrow, leave it out.
- Must reflect team convention, not a single maintainer's code style or proposal. If in doubt, leave it out.

**Skip**:

- Bot comments (dependabot, renovate, CI status checks)
- Feature announcements without design implications
- Trivial PRs (typo fixes, version bumps) unless they reveal a convention
- Duplicate insights already captured in REPO_CONTEXT.md

### 6. Merge Into REPO_CONTEXT.md

Integrate new insights into the existing document structure. **Do not just append** — place each insight in the appropriate section and deduplicate:

- If a new insight updates or supersedes an existing one, replace it
- If a section is getting too long, distill further (combine related insights)
- Update the header metadata (last updated date, PR watermark)
- Keep total document size between 500-1000 lines (aggressive distillation if over)

**Each insight appears in exactly one section** — do not repeat the same rule across multiple sections with different framing (see step 7).

### 7. Deduplicate & Consolidate

After merging, review the full document for **cross-section duplication**. This is critical — incremental runs naturally introduce duplication because the same convention surfaces in multiple PR reviews (e.g., "use `@pytest.mark.docker`" might appear as a reviewer correction, an established convention, AND a testing recipe).

**Process**:

1. For each insight, search the entire document for overlapping content. Look for insights that cover the same topic even if phrased differently.
2. Keep each insight in **exactly one location** — the most specific section that fits. Prefer this priority:
- "Rules & Conventions" for mandatory practices ("always do X", "never do Y")
- "Testing Recipes" for detailed how-to patterns (mock setup, test structure)
- "Known Tech Debt" for acknowledged issues that should not be fixed without discussion
- "CI/Tooling" for build/CI/tooling specifics
- "Open Issues" for bugs and design direction
3. Remove the duplicate occurrences, keeping the most complete/specific version.
4. Combine related insights that are split across bullets into a single, richer bullet.

**Common duplication patterns to watch for**:

- The same pytest marker rule appearing in both "Rules" and "Testing Recipes"
- Reviewer corrections that duplicate established conventions (merge into the convention)
- Agent mistakes that are just the inverse of an established convention (keep only the convention)
- API usage patterns appearing in both rules and recipes (keep the rule brief, detail in recipes)

## Bounding Rules

| Rule | Limit |
| ---------------------------- | ------------------------------------------- |
| First run scope | Most recent 50 merged PRs + all open issues |
| Incremental run scope | New items since last crawl |
| Max PRs per run | 50 |
| Link traversal depth | 3 hops |
| Target REPO_CONTEXT.md size | 500-1000 lines |
| Max issues per run | 100 |

## Insight Quality Guidelines

These are critical — the value of REPO_CONTEXT.md depends on insight quality:

1. **Every insight must cite its source** PR or issue number. It is acceptable to cite multiple sources for the same insight.
2. **Insights must be actionable**: "Do X" / "Don't do Y", not "PR #123 added X"
3. **Don't duplicate existing docs**: Only add nuance that CONTRIBUTING.md and BEST_PRACTICES.md miss
4. **Skip noise**: Bot comments, feature announcements without design implications, trivial PRs
5. **Focus on**: Reviewer corrections, design trade-offs, rejected alternatives, acknowledged tech debt, common agent mistakes
6. **Be specific**: "Use `hf_dataset()` wrapper instead of raw `load_dataset()` for HuggingFace datasets (PR #842)" is better than "Use the right dataset loading function"
7. **Date-stamp volatile insights**: If an insight might become stale (e.g., "Currently X is broken"), include the date so agents can verify

## Expected Output

After running this workflow:

```text
agent_artefacts/repo_context/
└── REPO_CONTEXT.md # Distilled institutional knowledge (committed)
```

## Verification Checklist

After each run, verify:

1. `REPO_CONTEXT.md` exists and has well-structured content
2. Insights cite source PR/issue numbers
3. Insights are actionable, not merely descriptive
4. **No duplicate insights across sections** — search for key terms (e.g., `sample ID`, `get_model`, `@pytest.mark`) and confirm each appears in exactly one place
5. Document stays under ~1000 lines
6. Header metadata (date, PR range) is updated
7. Incremental runs don't reprocess already-crawled PRs

More from UKGovernmentBEIS/inspect_evals