perf-moe-comm-overlap

$ npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-moe-comm-overlap

Optimize MoE throughput by overlapping expert parallel communication.

  • Reduces dispatch and combine latency in large MoE models.
  • Requires Megatron Bridge with expert model parallelism enabled.
  • Activates when token dispatch time is visible in profiles.
  • Outputs configuration flags for comm_overlap settings.

SKILL.md

.github/skills/perf-moe-comm-overlap
---
name: perf-moe-comm-overlap
description: MoE expert-parallel communication overlap in Megatron Bridge. Covers dispatch/combine overlap, flex dispatcher backends, and expert wgrad scheduling.
when_to_use: Tuning MoE communication overlap, or tracing a MoE throughput regression to a comm-overlap config change; 'overlap_moe_expert_parallel_comm', 'MoE dispatch overlap', 'flex dispatcher', 'DeepEP overlap', 'expert wgrad scheduling'.
---

# MoE Communication Overlap

For the higher-level overview, see:

- @docs/training/communication-overlap.md
- @skills/perf-moe-comm-overlap/card.yaml

## Quick Decision

Use MoE communication overlap when:

- `EP > 1`
- token dispatch or combine time is visible in the profile
- the run is already correct and you are now tuning throughput

Avoid turning it on as an early bring-up step. It is easier to validate after
the dispatcher, routing mode, and recompute plan are already stable.

## Enablement

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True

# Optional: delayed wgrad for additional overlap
cfg.comm_overlap.delay_wgrad_compute = True

# IMPORTANT: disable shared expert overlap when using dispatch overlap
cfg.model.moe_shared_expert_overlap = False
```

### Prerequisites

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- Precision: BF16 or FP16
- If PP is used, VPP (`virtual_pipeline_model_parallel_size`) must be set (non-`None`)
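
The prerequisites above can be written as a pre-flight check. This is a minimal sketch, not the validation code in `comm_overlap.py`: `MoEOverlapInputs` is a hypothetical stand-in class, and the `bf16`, `fp16`, and `pipeline_model_parallel_size` field names are assumptions chosen to mirror the precision and PP settings mentioned in the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MoEOverlapInputs:
    """Hypothetical stand-in for the relevant model settings (not a Bridge class)."""

    expert_model_parallel_size: int = 1
    num_moe_experts: int = 1
    moe_token_dispatcher_type: str = "allgather"  # example of a non-qualifying value
    bf16: bool = False
    fp16: bool = False
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None


def unmet_prerequisites(m: MoEOverlapInputs) -> List[str]:
    """Return the prerequisites from the list above that are not satisfied."""
    problems = []
    if m.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if m.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if m.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append('moe_token_dispatcher_type must be "alltoall" or "flex"')
    if not (m.bf16 or m.fp16):
        problems.append("precision must be BF16 or FP16")
    if (
        m.pipeline_model_parallel_size > 1
        and m.virtual_pipeline_model_parallel_size is None
    ):
        problems.append("virtual_pipeline_model_parallel_size must be set when PP is used")
    return problems


if __name__ == "__main__":
    candidate = MoEOverlapInputs(
        expert_model_parallel_size=8,
        num_moe_experts=64,
        moe_token_dispatcher_type="alltoall",
        bf16=True,
        pipeline_model_parallel_size=2,  # PP in use, VPP left unset
    )
    for problem in unmet_prerequisites(candidate):
        print(problem)
```

Returning a list rather than raising on the first failure lets you see every unmet prerequisite in one pass before launching a job.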

### Flex dispatcher activation

Setting `moe_flex_dispatcher_backend` alone does **not** activate flex dispatch.
You must also set `moe_token_dispatcher_type = "flex"`.
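
To make the dependency between the two settings concrete, here is a small sketch. The helper `flex_backend_in_effect` is hypothetical, and a `SimpleNamespace` stands in for the model config so the snippet runs without Megatron Bridge installed.

```python
from types import SimpleNamespace


def flex_backend_in_effect(model_cfg):
    """Return the flex backend name only if flex dispatch is actually selected."""
    if getattr(model_cfg, "moe_token_dispatcher_type", None) != "flex":
        return None
    return getattr(model_cfg, "moe_flex_dispatcher_backend", None)


# Stand-in for the model config; a real run would use the recipe's config object.
model_cfg = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)
print(flex_backend_in_effect(model_cfg))  # None: backend is set, dispatcher is still alltoall

model_cfg.moe_token_dispatcher_type = "flex"
print(flex_backend_in_effect(model_cfg))  # deepep
```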

## Recompute And CUDA Graph Interaction

- Full recompute is not a good companion for the overlap path.
- `delay_wgrad_compute` adds further constraints if CUDA-graph scopes include
  attention or MoE-router work.
- In practice, selective recompute is the safer pairing when overlap is enabled.
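
A possible pairing of overlap with selective recompute is sketched below. The field names `recompute_granularity`, `recompute_method`, and `recompute_num_layers` follow Megatron-Core's `TransformerConfig`; that they are set through `cfg.model` is an assumption here, and the `SimpleNamespace` objects are stand-ins so the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-in config; in a real run `cfg` comes from a recipe.
cfg = SimpleNamespace(comm_overlap=SimpleNamespace(), model=SimpleNamespace())

# Dispatch/combine overlap, with shared expert overlap disabled.
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.model.moe_shared_expert_overlap = False

# Selective recompute instead of full recompute (assumed placement on cfg.model).
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_method = None       # only meaningful for full recompute
cfg.model.recompute_num_layers = None   # only meaningful for full recompute
```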

## Code Anchors

- Overlap validation: `src/megatron/bridge/training/comm_overlap.py`
- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py`
- Config: `src/megatron/bridge/training/config.py`
- Unit tests: `tests/unit_tests/training/test_comm_overlap.py`
- DeepEP tests: `tests/unit_tests/training/test_deepep.py`

## Pitfalls

1. **Shared expert overlap conflict**: `moe_shared_expert_overlap` and
   `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert
   overlap when using the dispatch overlap path.

2. **PP without VPP**: MoE overlap requires VPP when pipeline parallelism is
   active. Without it, the overlap scheduling cannot interleave correctly.

3. **Flex != backend flag**: `moe_flex_dispatcher_backend="deepep"` alone
   does nothing if `moe_token_dispatcher_type` is still `"alltoall"`.

4. **Conservative recipe defaults**: Most public recipes leave MoE overlap
   disabled. You need to explicitly enable it via overrides.

5. **Performance gains are workload-dependent**: overlap helps most when dispatch
   communication is already a visible slice of step time. It is not guaranteed
   to help every small or lightly loaded EP run.
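
Pitfall 5 can be made concrete with a rough upper bound. The sketch below assumes the dispatch/combine time measured in the profile is fully exposed (serial) in the baseline and that overlap hides some fraction of it; the function name and the example timings are hypothetical, not measurements.

```python
def overlap_speedup_bound(step_ms, exposed_comm_ms, hidden_fraction=1.0):
    """Upper bound on step-time speedup if overlap hides part of the exposed comm.

    step_ms:          baseline step time from the profile
    exposed_comm_ms:  dispatch + combine time not already overlapped
    hidden_fraction:  share of that time overlap manages to hide (0..1)
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    if not 0 <= exposed_comm_ms < step_ms:
        raise ValueError("exposed_comm_ms must be in [0, step_ms)")
    if not 0 <= hidden_fraction <= 1:
        raise ValueError("hidden_fraction must be in [0, 1]")
    new_step_ms = step_ms - exposed_comm_ms * hidden_fraction
    return step_ms / new_step_ms


# Dispatch/combine is 20% of the step: overlap can matter.
print(round(overlap_speedup_bound(1000, 200), 3))  # 1.25

# Dispatch/combine is 3% of the step: little to gain even with perfect overlap.
print(round(overlap_speedup_bound(1000, 30), 3))   # 1.031
```

Real gains will usually be lower than this bound, since overlap rarely hides all communication and can add scheduling overhead.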

## Verification

Look for overlap-related log messages during initialization. The comm overlap
validation in `comm_overlap.py` raises if the prerequisites are not met, so a
clean startup with `overlap_moe_expert_parallel_comm = True` indicates that the
prerequisites passed and the overlap path is enabled.
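
A clean startup is more informative when paired with a dump of the effective flag values. The helper below is a hypothetical sketch: it reads only the attributes named in this document and uses `SimpleNamespace` stand-ins, so it makes no claim about what Megatron Bridge itself logs.

```python
from types import SimpleNamespace


def overlap_flag_summary(cfg):
    """Collect the overlap-related settings named in this guide; missing ones are None."""
    comm = getattr(cfg, "comm_overlap", None)
    model = getattr(cfg, "model", None)
    return {
        "overlap_moe_expert_parallel_comm": getattr(
            comm, "overlap_moe_expert_parallel_comm", None
        ),
        "delay_wgrad_compute": getattr(comm, "delay_wgrad_compute", None),
        "moe_shared_expert_overlap": getattr(model, "moe_shared_expert_overlap", None),
        "moe_token_dispatcher_type": getattr(model, "moe_token_dispatcher_type", None),
    }


# Stand-in config for illustration only.
cfg = SimpleNamespace(
    comm_overlap=SimpleNamespace(overlap_moe_expert_parallel_comm=True),
    model=SimpleNamespace(
        moe_shared_expert_overlap=False,
        moe_token_dispatcher_type="alltoall",
    ),
)
for name, value in overlap_flag_summary(cfg).items():
    print(f"{name} = {value}")
```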
