perf-cuda-graphs

Name: perf-cuda-graphs
Author: NVIDIA-NeMo/Megatron-Bridge
$npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-cuda-graphs
Capture GPU operations to reduce host-driver overhead.
Eliminates latency from repeated kernel launches during training.
Integrates with Megatron Bridge and Transformer Engine APIs.
Selects graph scope based on workload type and constraints.
Generates optimized replayable sequences for specific modules.
SKILL.md
.github/skills/perf-cuda-graphsView on GitHub ↗
---
name: perf-cuda-graphs
description: Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
when_to_use: Reducing host-driver overhead via CUDA graphs, or tracing a crash or regression to a CUDA graph config change; 'cuda_graph_impl', 'full iteration graph', 'TE scoped graph', 'graphed callables', 'CUDA graph capture'.
---

# CUDA Graphs

Stable docs: @docs/training/cuda-graphs.md
Card: @skills/perf-cuda-graphs/card.yaml

## What It Is

CUDA graphs capture GPU operations once and replay them with minimal
host-driver overhead. Bridge supports two implementations:

| `cuda_graph_impl` | Mechanism | Scope support |
|---|---|---|
| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` |
| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |

## Quick Decision

Start with TE-scoped graphs for most training workloads:

- dense models: `attn`, then optionally `mlp`
- dropless MoE: `attn moe_router moe_preprocess`
- VLMs: the same dropless-MoE scope, but only after the real-data path is stable

Use `local` + `full_iteration` only when you specifically want full-iteration
capture and can satisfy the tighter constraints.

For recompute-heavy workloads:

- TE-scoped graphs pair naturally with selective recompute
- full recompute usually pushes you toward `local` full-iteration graphs or away
  from graphs entirely

Related docs:

- @docs/training/cuda-graphs.md
- @docs/training/activation-recomputation.md

## Enablement

### Local full-iteration graph

```python
cfg.model.cuda_graph_impl = "local"
cfg.model.cuda_graph_scope = ["full_iteration"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
cfg.rerun_state_machine.check_for_nan_in_loss = False
cfg.ddp.check_for_nan_in_grad = False
```

### TE scoped graph (dense model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn"]           # or ["attn", "mlp"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### TE scoped graph (MoE model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### Performance harness CLI

```bash
python scripts/performance/run_performance_workload.py \
  --cuda_graph_impl transformer_engine \
  --cuda_graph_scope attn moe_router moe_preprocess \
  ...
```

Valid CLI values live in `scripts/performance/argument_parser.py`:
- `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]`
- `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]`

### Required constraints

- `use_te_rng_tracker = True` (enforced in `gpt_provider.py`)
- `full_iteration` scope only with `cuda_graph_impl = "local"`
- `full_iteration` scope requires `check_for_nan_in_loss = False`
- Do not combine `moe` scope and `moe_router` scope
- Tensor shapes must be static (fixed seq_length, fixed micro_batch_size)
- MoE token-dropless routing limits graphable scope to dense modules
- With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set
  `NCCL_GRAPH_REGISTER=0` (MCore enforces for local impl on arch < sm_100;
  TE impl asserts unconditionally)
- CPU offloading is incompatible with CUDA graphs
- `moe_preprocess` scope requires `moe_router` scope to also be set

### Practical bring-up order

1. Stabilize the eager run first.
2. Fix sequence length and micro-batch size.
3. Enable the narrowest useful graph scope.
4. Confirm replay is active and memory is still acceptable.
5. Only then widen scope or combine with overlap features.

## Code Anchors

### Bridge config and validation

```1524:1531:src/megatron/bridge/training/config.py
        # CUDA graph scope validation: check_for_nan_in_loss must be disabled with full_iteration graph
        if self.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in self.model.cuda_graph_scope:
            assert not self.rerun_state_machine.check_for_nan_in_loss, (
                "check_for_nan_in_loss must be disabled when using full_iteration CUDA graph. "
                "Set rerun_state_machine.check_for_nan_in_loss=False."
            )
        if self.model.cuda_graph_impl == "none":
            self.model.cuda_graph_scope = []
```

### TE RNG tracker requirement

```213:216:src/megatron/bridge/models/gpt_provider.py
        if self.cuda_graph_impl != "none":
            assert getattr(self, "use_te_rng_tracker", False), (
                "Transformer engine's RNG tracker is required for cudagraphs, it can be "
                "enabled with use_te_rng_tracker=True'."
```

### Graph creation and capture in training loop

```231:255:src/megatron/bridge/training/train.py
    # Capture CUDA Graphs.
    cuda_graph_helper = None
    if model_config.cuda_graph_impl == "transformer_engine":
        cuda_graph_helper = TECudaGraphHelper(...)
    # ...
    if config.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in config.model.cuda_graph_scope:
        forward_backward_func = FullCudaGraphWrapper(
            forward_backward_func, cuda_graph_warmup_steps=config.model.cuda_graph_warmup_steps
        )
```

### TE graph capture after warmup

```338:350:src/megatron/bridge/training/train.py
        # Capture CUDA Graphs after warmup.
        if (
            model_config.cuda_graph_impl == "transformer_engine"
            and cuda_graph_helper is not None
            and not cuda_graph_helper.graphs_created()
            and global_state.train_state.step - start_iteration == model_config.cuda_graph_warmup_steps
        ):
            if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
                disable_forward_pre_hook(model, param_sync=False)
            cuda_graph_helper.create_cudagraphs()
            if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
                enable_forward_pre_hook(model)
                cuda_graph_helper.cuda_graph_set_manual_hooks()
```

### RNG initialization

```199:206:src/megatron/bridge/training/initialize.py
        _set_random_seed(
            rng_config.seed,
            rng_config.data_parallel_random_init,
            rng_config.te_rng_tracker,
            rng_config.inference_rng_tracker,
            use_cudagraphable_rng=(model_config.cuda_graph_impl != "none"),
            pg_collection=pg_collection,
        )
```

### Delayed wgrad + CUDA graph interaction

```522:555:src/megatron/bridge/training/comm_overlap.py
            cuda_graph_scope = getattr(model_cfg, "cuda_graph_scope", []) or []
            # ... scope parsing ...
            if wgrad_in_graph_scope:
                assert is_te_min_version("2.12.0"), ...
                assert model_cfg.gradient_accumulation_fusion, ...
                if attn_scope_enabled:
                    assert not model_cfg.add_bias_linear and not model_cfg.add_qkv_bias, ...
```

### Perf harness override helper

```102:124:scripts/performance/utils/overrides.py
def _set_cuda_graph_overrides(
    recipe, cuda_graph_impl=None, cuda_graph_scope=None
):
    # Sets impl, scope, and auto-enables te_rng_tracker
```

### Graph cleanup

```1414:1441:src/megatron/bridge/training/train.py
def _delete_cuda_graphs(cuda_graph_helper):
    # Deletes FullCudaGraphWrapper and TE graph objects to free NCCL buffers
```

### MCore classes (in 3rdparty/Megatron-LM)

- `CudaGraphManager`: `megatron/core/transformer/cuda_graphs.py`
- `TECudaGraphHelper`: `megatron/core/transformer/cuda_graphs.py`
- `FullCudaGraphWrapper`: `megatron/core/full_cuda_graph.py`
- `CudaGraphScope` enum: `megatron/core/transformer/enums.py`

### Positive recipe anchors

- `scripts/performance/configs/deepseek/deepseek_workload_base_configs.py`
- `scripts/performance/configs/qwen/qwen3_workload_base_configs.py`
- `scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py`

### Tests

| File | Coverage |
|---|---|
| `tests/unit_tests/training/test_config.py` | `full_iteration` NaN-check constraint |
| `tests/unit_tests/training/test_comm_overlap.py` | `delay_wgrad` + CUDA graph interaction |
| `tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py` | TE autocast with CUDA graphs |
| `tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py` | End-to-end local and TE graph smoke tests |
| `tests/unit_tests/recipes/kimi/test_kimi_k2.py` | TE + CUDA graph recipe config |
| `tests/unit_tests/recipes/gpt/test_gpt3_175b.py` | TE + CUDA graph recipe config |
| `tests/unit_tests/recipes/qwen_vl/test_qwen25_vl_recipes.py` | VLM CUDA graph settings |

## Pitfalls

1. **TE RNG tracker is mandatory**: Setting `cuda_graph_impl` without
   `use_te_rng_tracker=True` and `rng.te_rng_tracker=True` will assert
   in the provider.

2. **`full_iteration` requires NaN checks disabled**: The entire fwd+bwd is
   captured, so loss-NaN checking cannot inspect intermediate values.

3. **MoE scope restrictions**: `moe` scope and `moe_router` scope are
   mutually exclusive. Token-dropless MoE can only graph `moe_router` and
   `moe_preprocess`, not the full expert dispatch.

4. **Memory overhead**: CUDA graphs pin all intermediate buffers for the
   graph's lifetime (no memory reuse). TE scoped graphs add a few GB;
   full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1`
   compounds overhead since each stage holds its own graph.

5. **Delayed wgrad interaction**: When `delay_wgrad_compute=True` and
   attention or MoE router is in `cuda_graph_scope`, additional constraints
   apply: TE >= 2.12.0, `gradient_accumulation_fusion=True`, and no
   attention bias.

6. **Variable-length sequences break graphs**: Sequence lengths must be
   constant across steps. Use padded packed sequences if packing is needed.

7. **Graph cleanup is required**: CUDA graph objects hold NCCL buffer
   references. Bridge handles this in `_delete_cuda_graphs()` at the end
   of training, but early exits must call it explicitly.

8. **Older GPU architectures**: On GPUs with compute capability < 10.0
   (pre-Blackwell), set `NCCL_GRAPH_REGISTER=0` when using
   `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Enforced in MCore
   `CudaGraphManager` (cuda_graphs.py:1428) and `TECudaGraphHelper`
   (cuda_graphs.py:1697). The TE impl asserts unconditionally regardless
   of arch.

9. **CPU offloading incompatible**: CUDA graphs cannot be used with CPU
   offloading. Enforced in MCore `transformer_config.py:1907`.

10. **MoE recompute + moe_router scope**: MoE recompute is not supported
    with `moe_router` CUDA graph scope when using `cuda_graph_impl =
    "transformer_engine"`. Enforced in MCore `transformer_config.py:1977`.

11. **Layer-level recompute requires `full_iteration` scope**: Using
    `recompute_granularity="full"` with `recompute_num_layers` (recompute N
    whole transformer layers) is incompatible with TE-scoped graphs. MCore
    calls this "full" granularity even though you're selecting how many
    layers — the name refers to recomputing the full layer, not full model.
    Any TE-scoped scope (`attn`, `mlp`, `moe_router`, etc.) will assert:
    `AssertionError: full recompute is only supported with full iteration CUDA graph.`
    This commonly hits FP8 configs that default to TE-scoped graphs (e.g.
    `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` uses `cuda_graph_impl=
    "transformer_engine"`, `cuda_graph_scope="mlp"`). Fix: use submodule
    recompute (`recompute_granularity="selective"` + `recompute_modules`),
    disable CUDA graphs, or switch to `local` + `full_iteration`. Enforced
    in MCore `transformer_config.py:2001-2005`. See also
    @skills/perf-activation-recompute/SKILL.md.

12. **Benchmark numbers are workload-specific**: graph wins are usually real
    when host overhead is visible, but the exact gain depends on batch shape,
    PP depth, recompute, and whether the eager baseline was already optimized.

## Verification

### Unit tests

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py -k "cuda_graph" \
  tests/unit_tests/training/test_comm_overlap.py -k "cuda_graph" \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cuda_graph" -q
```

### Functional smoke test (requires GPU)

```bash
uv run python -m pytest \
  tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py -q
```

### Success criteria

- Unit tests pass, covering config validation for both `local` and
  `transformer_engine` implementations.
- Functional test completes training steps with both CUDA graph
  implementations.
- No NCCL errors or illegal memory access in logs.