perf-moe-long-context

$ npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-moe-long-context

Optimize long-context MoE training with proven scaling patterns.

  • Resolves OOM and throughput degradation during extended sequence training.
  • Integrates with Megatron Bridge for CP sizing and selective recompute.
  • Draws from DSV3 and Qwen3 experiments to guide dispatcher choices.
  • Delivers actionable configurations for memory-efficient GPU utilization.

SKILL.md

.github/skills/perf-moe-long-context
---
name: perf-moe-long-context
description: Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.
when_to_use: Training MoE at long sequence lengths, or investigating a commit that caused long-context MoE OOM or degraded throughput; 'long context MoE', '128k tokens', 'CP sizing for long sequences', 'selective recompute long context', 'MoE long-context OOM'.
---

# MoE Long-Context Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-long-context/card.yaml

## What Changes At Long Context

Once sequence length moves well past the 4K-class regime, attention memory and
activation residency become the dominant constraints. For MoE models, that
usually means you need some combination of:

- context parallelism
- selective recompute
- lower precision
- CPU offload for optimizer state
- a dispatcher and PP layout that do not waste the smaller remaining DP budget

## Rounded Scaling Patterns

### DSV3 on H100

The DSV3 long-context runs show a stable pattern:

- selective recompute works better than full recompute once you move past the
  shortest contexts
- throughput stays in a fairly narrow band from mid-length through very long
  contexts if CP is increased appropriately
- the trade-off shifts from "memory fit" to "GPU-count feasibility" as CP grows

In other words, long context does not immediately collapse utilization if the
layout is chosen well, but it does consume the DP budget very quickly.
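
The DP squeeze is easy to see with a little arithmetic. The sketch below assumes the usual Megatron-style decomposition in which the attention-side world size factors as TP x CP x PP x DP. The 1024-GPU job and the TP and PP values are illustrative placeholders, not figures from the DSV3 runs.

```python
def dp_size(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Data-parallel size left over after TP, CP, and PP are fixed.

    Assumes world_size = TP * CP * PP * DP (Megatron-style layout).
    """
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*CP*PP={model_parallel}"
        )
    return world_size // model_parallel


# Illustrative values only: a hypothetical 1024-GPU job with TP=1 and PP=8.
WORLD_SIZE, TP, PP = 1024, 1, 8
dp_by_cp = {cp: dp_size(WORLD_SIZE, TP, cp, PP) for cp in (8, 16, 32, 64, 128)}

for cp, dp in dp_by_cp.items():
    print(f"CP={cp:>3}  DP={dp}")
```

Each doubling of CP halves DP, so on this hypothetical job DP reaches 1 at CP=128.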

### Qwen3-Next on GB200

Qwen3-Next behaves more like a memory-sensitive medium-scale model:

- 8K and 32K remain practical with moderate CP
- 64K is possible, but the throughput drop is noticeable and memory becomes
  much tighter
- pipeline layout and grouped-GEMM improvements matter almost as much as CP

### Qwen3 235B on GB200

Qwen3 235B shows that long context can still be efficient on NVL72 systems when
TP, CP, and HybridEP are coordinated. The best 128K-class configurations are
not just "fit-only" recipes; they can remain highly efficient if routing,
parallelism, and recompute are balanced.

## CP Sizing Rules Of Thumb

1. **Start from a 4K shard target**: a good first guess is
   `CP ~= seq_len / 4096`, then round to a practical power-of-two layout.

2. **Keep DP alive if possible**: long-context scaling becomes brittle once CP,
   EP, TP, and PP together squeeze DP down to the floor.

3. **Prefer selective recompute**: recompute modules such as `up_proj`, `norm`,
   `moe`, `moe_act`, or `mlp` before reaching for full recompute.

4. **Avoid SDPA-heavy recompute at very long context**: recomputing attention
   internals can add a lot of work for less memory benefit than recomputing
   smaller MoE and MLP-side modules.

5. **Use TP as another lever on NVL72 systems**: GB200 and GB300 runs can
   sometimes trade some CP for TP while still staying efficient.

6. **Assume GBS will need to shrink**: as CP rises and DP falls, you may need
   to reduce global batch size (GBS) or accept more gradient accumulation (GA)
   steps per optimizer step.
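
Rules 1 and 6 can be sketched as two small helpers. This is a minimal illustration, not part of Megatron Bridge: `suggest_cp` and `grad_accum_steps` are hypothetical names, the rounding picks the nearest power of two on a log scale, and the GA formula assumes the standard relation GBS = MBS x DP x GA.

```python
import math


def suggest_cp(seq_len: int, shard_tokens: int = 4096) -> int:
    """First-guess CP size: seq_len / shard_tokens, rounded to a power of two."""
    raw = max(1.0, seq_len / shard_tokens)
    return 2 ** round(math.log2(raw))


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Gradient-accumulation steps implied by GBS = MBS * DP * GA."""
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"GBS={gbs} is not divisible by MBS*DP={mbs * dp}")
    return gbs // (mbs * dp)


# 128K and 256K are taken here to mean 131072 and 262144 tokens.
print(suggest_cp(131072))  # 32
print(suggest_cp(262144))  # 64

# Hypothetical batch settings: the same GBS needs 4x the GA when DP drops 8 -> 2.
print(grad_accum_steps(gbs=64, mbs=1, dp=8))  # 8
print(grad_accum_steps(gbs=64, mbs=1, dp=2))  # 32
```

Treat the result as a starting point only. Per rule 5, NVL72 systems may land on a much smaller CP by shifting part of the sharding to TP.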

## Representative Config Families

### DSV3 at 128K on H100

```text
TP=1  CP=32  EP=32  PP=8  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```

### DSV3 at 256K on H100

```text
TP=1  CP=64  EP=32  PP=8  EDP=2  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
```

### Qwen3 235B at 128K on GB200

```text
TP=4  CP=4  EP=32  PP=4  VPP=12
Precision: BF16 or MXFP8
Dispatcher: HybridEP
Recompute: moe_act, norm
CUDA Graph: attn + moe_router + moe_preprocess
```
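
As a sanity check on these families, the sketch below derives the smallest GPU count each layout could run on, plus the DP and expert-DP (EDP) sizes at that count. It rests on several assumptions not stated above: expert tensor parallelism is 1, the attention side factors as TP x CP x PP x DP, the expert side factors as ETP x EP x PP x EDP over the same ranks, and 128K and 256K mean 131072 and 262144 tokens. The `Layout` class is hypothetical. The outputs are lower bounds, not the GPU counts used in the experiments.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    seq_len: int
    tp: int
    cp: int
    ep: int
    pp: int
    etp: int = 1  # assumption: expert TP is not listed in the configs above

    def min_gpus(self) -> int:
        """Smallest world size divisible by both the dense and expert layouts."""
        dense = self.tp * self.cp * self.pp
        expert = self.etp * self.ep * self.pp
        return math.lcm(dense, expert)

    def dp(self, world_size: int) -> int:
        return world_size // (self.tp * self.cp * self.pp)

    def edp(self, world_size: int) -> int:
        return world_size // (self.etp * self.ep * self.pp)

    def tokens_per_cp_rank(self) -> int:
        return self.seq_len // self.cp


LAYOUTS = [
    Layout("DSV3 128K / H100", 131072, tp=1, cp=32, ep=32, pp=8),
    Layout("DSV3 256K / H100", 262144, tp=1, cp=64, ep=32, pp=8),
    Layout("Qwen3 235B 128K / GB200", 131072, tp=4, cp=4, ep=32, pp=4),
]

for layout in LAYOUTS:
    n = layout.min_gpus()
    print(
        f"{layout.name}: min_gpus={n} DP={layout.dp(n)} EDP={layout.edp(n)} "
        f"tokens/CP-rank={layout.tokens_per_cp_rank()}"
    )
```

Under these assumptions the 256K DSV3 layout comes out at EDP=2, consistent with the `EDP=2` listed for that config, and both DSV3 layouts sit at DP=1 at their minimum GPU count.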

## Recompute And CUDA Graph Guidance

For long-context MoE training:

- start with selective recompute
- add CUDA graphs only after the shapes and routing path are stable
- keep sequence length and MBS fixed when using CUDA graphs
- if the run depends on highly dynamic batches, prefer eager execution
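
One way to confirm that sequence length and MBS really are fixed before enabling CUDA graphs is to scan a sample of micro-batches for shape drift. The helper below is a hypothetical, framework-free sketch: it treats each micro-batch as a list of token-id lists and is not a Megatron Bridge API.

```python
from typing import Sequence


def find_shape_mismatches(
    batches: Sequence[Sequence[Sequence[int]]], mbs: int, seq_len: int
) -> list[int]:
    """Return indices of micro-batches whose shape is not exactly (mbs, seq_len)."""
    bad = []
    for i, batch in enumerate(batches):
        if len(batch) != mbs or any(len(row) != seq_len for row in batch):
            bad.append(i)
    return bad


# Toy data: MBS=2, seq_len=4. Batch 1 has a short row, batch 2 has an extra row.
sample = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[1, 2, 3], [5, 6, 7, 8]],
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 9, 9, 9]],
]
print(find_shape_mismatches(sample, mbs=2, seq_len=4))  # [1, 2]
```

If the check reports any mismatch, either pad to a fixed shape upstream or stay on eager execution.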

Useful references:

- @docs/training/activation-recomputation.md
- @skills/perf-cuda-graphs/SKILL.md

## Pitfalls

1. **CP does not replace EP or PP**: it adds another dimension; it does not make
   the others disappear.

2. **A good 4K baseline can still be a bad long-context baseline**: routing mode,
   recompute choice, and offload strategy often need to change.

3. **GPU-count feasibility becomes the real constraint**: very long context can
   look fine in an isolated recipe, then become impossible once EP and PP are
   accounted for across the full model.

4. **CUDA graphs need static shapes**: variable-length batches and opportunistic
   padding strategies can silently break the path.

5. **Container and kernel support matters more at 128K+**: long-context paths
   tend to rely on newer kernels and more recent bug fixes than short-context
   bring-up does.
