build-and-dependency
$
npx mdskill add NVIDIA/Megatron-LM/build-and-dependencySpin up isolated Megatron-LM environments with pinned dependencies.
- Resolves CUDA, PyTorch, and native extension installation failures.
- Integrates Docker, uv, and CI container infrastructure.
- Executes container launch commands based on dependency state.
- Delivers ready-to-run development shells with locked versions.
SKILL.md
.github/skills/build-and-dependencyView on GitHub ↗
---
name: build-and-dependency
description: Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock.
when_to_use: Adding, removing, or updating a dependency; editing pyproject.toml or uv.lock; uv.lock merge conflict; setting up a dev environment; pulling or building the CI container; container build errors; uv errors; 'how do I install', 'uv sync fails', 'ModuleNotFoundError'.
---
# Build & Dependency Guide
The core principle: **build and develop inside containers** — the CI container
ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions
(TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.
---
## Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine,
and optional components like ModelOpt and DeepEP. Installing these on a bare host
is fragile and hard to reproduce. The project ships Dockerfiles that pin every
dependency.
**Use the container as your development environment.** This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.
---
## dev vs lts
Two image variants exist, controlled by the `IMAGE_TYPE` build arg and the
`container::lts` PR label:
| Variant | Base image pin | uv group | When used |
|---------|---------------|----------|-----------|
| **`dev`** | `docker/.ngc_version.dev` | `dev` | Default — CI, local development, most PRs |
| **`lts`** | `docker/.ngc_version.lts` | `lts` | Stability testing; excludes ModelOpt and other bleeding-edge extras |
**Use `dev` for everything unless you have a specific reason to test `lts`.**
CI runs `dev` by default; attach `container::lts` to a PR only when verifying
compatibility with the stable stack (e.g. a dependency upgrade that must not
break LTS users). The `@pytest.mark.flaky_in_dev` marker skips tests in the
`dev` environment; `@pytest.mark.flaky` skips them in `lts`.
---
## Step 1 — Acquire an Image
**Option A — NVIDIA-internal: pull a CI-built image**
> ⚠️ Requires access to the internal GitLab instance.
> See @tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).
The internal GitLab CI publishes images to its container registry.
Derive the registry host from your configured `gitlab` remote — the same
host you use for `trigger_internal_ci.py`:
```bash
# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
```
**Option B — Build from scratch (works for everyone)**
> ⚠️ `Dockerfile.ci.dev` has two stages: `main` and `jet`. The `jet` stage
> requires an internal build secret and will fail without it. Always pass
> `--target main` to stop at the public stage.
```bash
# dev image (default)
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
--build-arg IMAGE_TYPE=dev \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local .
# lts image
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
--build-arg IMAGE_TYPE=lts \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local-lts .
```
Which image variant is used is controlled by the PR label `container::lts`;
absent that label, `dev` is used.
---
## Step 2 — Launch the Container
**Option A — Local Docker runtime**
```bash
docker run --rm --gpus all \
-v $(pwd):/workspace \
-w /workspace \
megatron-lm:local \
bash -c "<your command>"
```
**Option B — Slurm cluster (for those without a local Docker runtime)**
NVIDIA clusters typically use [Pyxis](https://github.com/NVIDIA/pyxis) +
[enroot](https://github.com/NVIDIA/enroot). Request an interactive session:
```bash
srun \
--nodes=1 --gpus-per-node=8 \
--container-image megatron-lm:local \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash
```
For clusters that require a `.sqsh` archive first:
```bash
enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
--nodes=1 --gpus-per-node=8 \
--container-image $(pwd)/megatron-lm.sqsh \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash
```
---
## Dependency Management
Dependencies are declared in `pyproject.toml`. The venv lives at `/opt/venv`
inside the container (already on `PATH`).
> **All `uv` operations must be run inside the container.**
> Never run `uv sync` / `uv pip install` on the host.
### uv Dependency Groups
| Group | Purpose |
|-------|---------|
| `training` | Runtime training extras |
| `dev` | Full dev environment (TransformerEngine, ModelOpt, …) |
| `lts` | LTS-safe subset (no ModelOpt) |
| `test` | pytest, coverage, nemo-run |
| `linting` | ruff, black, isort, pylint |
| `build` | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
```bash
# Full dev + test environment
uv sync --locked --group dev --group test
# Linting only
uv sync --locked --only-group linting
# LTS environment
uv sync --locked --group lts --group test
```
Several dependencies are sourced directly from git (TransformerEngine, nemo-run,
FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked `uv.lock` file
pins exact revisions; update it with `uv lock` when changing `pyproject.toml`.
### Adding a New Dependency
Follow this three-step workflow:
1. **Acquire a container image** — see [Step 1](#step-1--acquire-an-image) above.
2. **Launch the container interactively** — see [Step 2](#step-2--launch-the-container) above.
3. **Update the lock file inside the container**, then commit it:
```bash
# Inside the container:
uv add <package> # adds to pyproject.toml and resolves
uv lock # regenerates uv.lock
# Exit the container, then on the host:
git add pyproject.toml uv.lock
git commit -S -s -m "build: add <package> dependency"
```
### Resolving a merge conflict in uv.lock
`uv.lock` is machine-generated; never resolve conflicts manually. Instead:
```bash
git checkout origin/main -- uv.lock # take main's version as the base
# then inside the container:
uv lock # re-resolve on top of your pyproject.toml changes
```
---
## Common Pitfalls
| Problem | Cause | Fix |
|---------|-------|-----|
| `uv sync --locked` fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside the container and commit updated lock |
| `ModuleNotFoundError` after pip install | pip installed outside the uv-managed venv | Use `uv add` and `uv sync`, never bare `pip install` |
| `uv: command not found` inside container | Wrong container image | Use the `megatron-lm` image built from `Dockerfile.ci.dev` |
| `No space left on device` during uv ops | Cache fills container's `/root/.cache/` | Mount a host cache dir via `-v $HOME/.cache/uv:/root/.cache/uv` |
| `docker build` fails with secret-related error | `Dockerfile.ci.dev` has a `jet` stage that requires an internal secret | Add `--target main` to stop before the `jet` stage |
| `access forbidden` when pulling | Registry URL includes an explicit port (e.g. `:5005`) | Use `${GITLAB_HOST}/adlr/...` with no port — the sed extracts the hostname only |
More from NVIDIA/Megatron-LM
- cicdCI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation.
- create-issueInvestigate a failing GitHub Actions run or job and create a GitHub issue for the failure.
- linting-and-formattingLinting and formatting for Megatron-LM. Covers running autoformat.sh, tools (ruff, black, isort, pylint, mypy), and code style rules.
- nightly-syncDomain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.
- onboard-gb200-1node-testsOnboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.
- respond-to-issueResearch and draft a response to a GitHub issue or question from an external contributor.
- run-on-slurmHow to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
- split-prSplit a PR into multiple PRs to reduce the number of required CODEOWNERS reviewer groups.
- testingTest system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.