split-pr
$ npx mdskill add NVIDIA/Megatron-LM/split-pr
Split large PRs to minimize CODEOWNERS reviewer groups.
- Reduces review burden by fragmenting monolithic changes.
- Integrates with GitHub CLI and CODEOWNERS file parsing.
- Analyzes file paths to map dependencies on owner groups.
- Outputs optimized PR splits with minimal reviewer overlap.
SKILL.md
.github/skills/split-pr
---
name: split-pr
description: Split a PR into multiple PRs to reduce the number of required CODEOWNERS reviewer groups.
when_to_use: User asks to split a PR, reduce reviewer groups, or break up a large PR; 'too many CODEOWNERS', 'split this PR', 'break up PR', 'reduce reviewers needed'.
user_invocable: true
argument: "<pr-url-or-number>"
---

# Split PR by CODEOWNERS Groups

Split a large pull request into multiple smaller PRs, where each PR touches the fewest possible CODEOWNERS reviewer groups. The goal is to reduce review burden: a PR that only touches `megatron/core/` needs only the core reviewers, while a PR that also touches `examples/`, `tools/`, and `megatron/training/` pulls in many additional groups.

## Workflow

### 1. Analyze the PR

1. Fetch the PR details: `gh pr view <number> --repo NVIDIA/Megatron-LM --json title,body,headRefName,author` and `gh pr diff <number> --repo NVIDIA/Megatron-LM --stat`. Also determine the current GitHub user with `gh api user --jq .login`.
2. Parse `.github/CODEOWNERS` to build a mapping from file path patterns to owner groups.
3. For each changed file in the PR, determine which CODEOWNERS groups would be required to review it.
4. Build a summary table grouped by CODEOWNERS group, showing which files pull in which groups.
5. Count the total number of distinct reviewer groups the PR currently requires.

### 2. Propose a split that minimizes reviewer groups per PR

The primary optimization goal: **minimize the number of CODEOWNERS reviewer groups required for each resulting PR**.

Strategy:

1. Cluster files by their CODEOWNERS groups. Files owned by the same set of groups naturally belong together.
2. Identify the largest cluster; this becomes the first (and usually largest) PR.
3. Remaining files form one or more additional PRs, each ideally requiring only one or two reviewer groups.
4. If a split creates a dependency (e.g., PR B uses symbols renamed in PR A), the dependent PR must be merged after the first. Note this explicitly.
5. Each PR must be independently mergeable to main: no broken imports, no missing symbols. Backward-compatible aliases and re-export stubs in the first PR can make this possible.

Present the proposed split as a table:

- PR name/description
- Files included
- CODEOWNERS groups required
- Dependencies on other PRs (if any)

Wait for user approval before proceeding.

### 3. Execute the split (after user approval)

For each new PR:

1. Create a new branch from the appropriate base (`main`, or a dependency PR's branch).
2. Extract the relevant changes: `git diff upstream/main..<source-branch> -- <file paths> | git apply`.
3. Stage, commit with a clear message, and push to the user's fork.
4. Create the PR as a **draft** (per repo contributing guidelines).
5. If the original PR needs to be narrowed in scope, confirm with the user before force-pushing.
6. Report all PR URLs when done.

## Important guidelines

- Always create PRs as **drafts** and push to the user's fork, never directly to upstream.
- Backward-compatible changes (aliases, re-exports, deprecation shims) should go in the first PR so subsequent PRs can depend on them.
- Test files should go with the production code they test, not in a separate PR.
- Prefer a single clean commit per split PR over replaying the original commit history.
- If a file is hard to categorize (e.g., it touches two groups), ask the user which PR it should go in.
- If the current GitHub user is not the author of the original PR, each new PR's description must explicitly credit the original author (e.g., "Original changes by @<author> in #<number>").
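
Steps 1 and 2 of the workflow (mapping each changed file to its CODEOWNERS groups, then clustering files that share the same group set) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the helper names, the `@example-org/...` team handles, and the sample file paths are all hypothetical, and the matcher handles only directory prefixes and simple globs rather than the full CODEOWNERS pattern syntax.

```python
from collections import defaultdict
from fnmatch import fnmatch


def parse_codeowners(text):
    """Return (pattern, owners) rules in file order, skipping comments and blanks."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, frozenset(owners)))
    return rules


def matches(pattern, path):
    """Simplified matcher (assumption): root-anchored prefixes and fnmatch globs only."""
    anchored = pattern.lstrip("/")
    if anchored.endswith("/"):
        return path.startswith(anchored)
    return fnmatch(path, anchored) or path.startswith(anchored + "/")


def owners_for(path, rules):
    """In CODEOWNERS, the last matching rule takes precedence."""
    owners = frozenset()
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners
    return owners


def cluster_by_owners(paths, rules):
    """Group changed files by the exact set of owner groups they require."""
    clusters = defaultdict(list)
    for path in paths:
        clusters[owners_for(path, rules)].append(path)
    # Largest cluster first, since it becomes the first PR in the proposal.
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)


# Hypothetical CODEOWNERS content and changed-file list, for illustration only.
CODEOWNERS_TEXT = """
megatron/core/      @example-org/core
megatron/training/  @example-org/training
examples/           @example-org/examples
"""
changed_files = [
    "megatron/core/a.py",
    "megatron/core/b.py",
    "megatron/training/c.py",
    "examples/run.sh",
]

rules = parse_codeowners(CODEOWNERS_TEXT)
clusters = cluster_by_owners(changed_files, rules)
total_groups = set().union(*(owners for owners, _ in clusters))

print(f"Original PR requires {len(total_groups)} reviewer groups")
for owners, files in clusters:
    print(sorted(owners), files)
```

Keying each cluster on the full `frozenset` of owners keeps a file that needs two groups separate from files that need only one of them, which makes the hard-to-categorize cases visible so they can be raised with the user.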
More from NVIDIA/Megatron-LM
- build-and-dependency: Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock.
- cicd: CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation.
- create-issue: Investigate a failing GitHub Actions run or job and create a GitHub issue for the failure.
- linting-and-formatting: Linting and formatting for Megatron-LM. Covers running autoformat.sh, tools (ruff, black, isort, pylint, mypy), and code style rules.
- nightly-sync: Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.
- onboard-gb200-1node-tests: Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.
- respond-to-issue: Research and draft a response to a GitHub issue or question from an external contributor.
- run-on-slurm: How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
- testing: Test system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.