perf-moe-hardware-configs

Name: perf-moe-hardware-configs
Author: NVIDIA-NeMo/Megatron-Bridge

$npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-moe-hardware-configs

Select MoE training configs for hardware and model families.

Provides throughput bands and parallelism patterns for specific platforms.
Depends on hardware topology and model family specifications.
Matches strategies to platform strengths like H100 or GB200.
Delivers concise tuning stacks for immediate implementation.

SKILL.md

.github/skills/perf-moe-hardware-configsView on GitHub ↗

---
name: perf-moe-hardware-configs
description: Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.
when_to_use: Hardware-specific MoE playbooks or throughput estimates; 'MoE on H100', 'GB200 config', 'expected throughput', 'MoE hardware playbook', 'parallelism for B200'.
---

# MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-hardware-configs/card.yaml

## Quick Platform Playbook

| Platform | Typical MoE strategy | What usually matters most |
|---|---|---|
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

## Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker
moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
|---|---|---|---|
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |

## Representative Config Families

### DSV3 on H100

```text
Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```

### DSV3 on B200

```text
Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning
```

### DSV3 on GB200 or GB300

```text
Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```

### Qwen3 235B on H100

```text
Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```

### Qwen3 235B on GB200

```text
Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```

### Qwen3-Next 80B on GB200

```text
Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```

## Cross-Cutting Patterns

### PP layout

- `E` = embedding
- `t` = transformer
- `m` = MTP
- `L` = loss
- `|` = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the
combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.

### Recompute strategy

| Memory pressure | Starting point |
|---|---|
| low | none or a very narrow selective set |
| moderate | `moe_act`, `mlp`, `norm`, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |

### Environment variables

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```

### CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the
needle almost as much as a dispatcher swap. Treat them as first-class tuning
work, not as afterthoughts.

## Pitfalls

1. **Do not cargo-cult a tracker row**: the winning config usually depends on
   routing mode, container, and PP layout as much as on hardware name.

2. **Container quality matters**: large regressions can come from the software
   stack rather than the model recipe.

3. **VPP must be intentional**: a bad VPP split can erase the gain from a better
   dispatcher.

4. **Compare absolute throughput, not only MFU**: MFU can mislead when switching
   between BF16, FP8, and other precision modes.

5. **Force-balance routing is the safer benchmark default**: keep routing mode
   fixed when comparing hardware or dispatcher stacks.

More from NVIDIA-NeMo/Megatron-Bridge

Skill	Description
adding-model-support	Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples.
mlm-bridge-training	Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.
multi-node-slurm	Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
parity-testing	Structured framework for verifying numerical parity of HF<->MCore weight conversions. References existing tools and the add-model-support skill.
perf-activation-recompute	Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.
perf-cpu-offloading	Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
perf-cuda-graphs	Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
perf-expert-parallel-overlap	Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
perf-hybrid-context-parallel	Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
perf-megatron-fsdp	Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.