perf-tp-dp-comm-overlap

$ npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-tp-dp-comm-overlap

Configure communication overlap for Megatron-Bridge performance.

  • Enables TP, DP, and PP communication overlap in distributed training.
  • Integrates with Megatron-Bridge config knobs and mixed precision settings.
  • Selects overlap strategies based on hardware specs and workload patterns.
  • Provides verification steps and code anchors for implementation.

SKILL.md

.github/skills/perf-tp-dp-comm-overlap
---
name: perf-tp-dp-comm-overlap
description: Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
when_to_use: Enabling TP/DP/PP comm overlap, or tracing a throughput regression to a comm overlap config change; 'overlap_param_gather', 'overlap_grad_reduce', 'sequence-parallel overlap', 'TP overlap', 'DP overlap', 'comm overlap'.
---

# TP / DP / PP Communication Overlap Skill

For stable background and recommendation level, see:

- @docs/training/communication-overlap.md

## Enablement

Minimal Bridge override:

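```python
from megatron.bridge.training.comm_overlap import CommOverlapConfig

# `cfg` is assumed to be an already constructed Bridge training config.

# TP overlap requires TP >= 2 with sequence parallelism enabled.
cfg.model.tensor_model_parallel_size = 4
cfg.model.sequence_parallel = True

# PP > 1 with VPP > 1 is the case where Bridge auto-selects overlap_p2p_comm.
cfg.model.pipeline_model_parallel_size = 4
cfg.model.virtual_pipeline_model_parallel_size = 2

cfg.comm_overlap = CommOverlapConfig(
    tp_comm_overlap=True,
)

# DP overlap: overlap gradient reduce and parameter gather with compute.
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True
```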

Optional TP preset:

```python
from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048
```

Precision knobs belong to the mixed precision config:

```python
cfg.mixed_precision.grad_reduce_in_fp32 = False
cfg.mixed_precision.fp8_param_gather = False
```

## Code Anchors

Bridge overlap gating:

```439:449:src/megatron/bridge/training/comm_overlap.py
if self.user_comm_overlap_cfg.tp_comm_overlap is True:
    if model_cfg.tensor_model_parallel_size < 2:
        ...
    elif not model_cfg.sequence_parallel:
        ...
    elif not HAVE_TE:
        ...
```
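
The gating chain above can be read as a small decision function. The following is a minimal standalone sketch, not the Bridge implementation: `tp_overlap_blocker` and its arguments are hypothetical names, and the elided branches (`...`) are assumed to turn TP overlap off, as Pitfall 1 describes.

```python
from typing import Optional


def tp_overlap_blocker(
    tp_size: int, sequence_parallel: bool, have_te: bool
) -> Optional[str]:
    """Return the first reason TP overlap would be turned off, or None.

    Hypothetical helper that mirrors the if/elif order of the excerpt above.
    """
    if tp_size < 2:
        return "tensor_model_parallel_size < 2"
    if not sequence_parallel:
        return "sequence_parallel is False"
    if not have_te:
        return "Transformer Engine is unavailable"
    return None


if __name__ == "__main__":
    # TP=4 with sequence parallelism and TE present: nothing blocks overlap.
    print(tp_overlap_blocker(4, True, True))
    # Same setup without sequence parallelism: overlap would be dropped.
    print(tp_overlap_blocker(4, False, True))
```

Because the checks run in order, only the first failing prerequisite is reported, which matches the `if`/`elif` structure of the excerpt.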

PP overlap selection:

```451:458:src/megatron/bridge/training/comm_overlap.py
if model_cfg.pipeline_model_parallel_size > 1:
    if vp_size > 1:
        comm_overlap_cfg.overlap_p2p_comm = True
        comm_overlap_cfg.batch_p2p_comm = False
    else:
        comm_overlap_cfg.overlap_p2p_comm = False
        comm_overlap_cfg.batch_p2p_comm = True
```
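
As a rough illustration of the selection above, the same branching can be written as a pure function. This is a sketch under stated assumptions: `select_p2p_flags` is a hypothetical name, an unset virtual pipeline size is assumed to behave like 1, and the `PP <= 1` case returns `None` because the excerpt does not show it.

```python
from typing import Dict, Optional


def select_p2p_flags(
    pp_size: int, vp_size: Optional[int]
) -> Optional[Dict[str, bool]]:
    """Mirror the PP overlap branch of the excerpt (hypothetical helper)."""
    vp = vp_size or 1  # assumption: None is treated as "no virtual pipeline"
    if pp_size <= 1:
        return None  # not covered by the excerpt above
    if vp > 1:
        # Interleaved schedule: overlap P2P communication with compute.
        return {"overlap_p2p_comm": True, "batch_p2p_comm": False}
    # PP without virtual pipeline: batch P2P communication instead.
    return {"overlap_p2p_comm": False, "batch_p2p_comm": True}


if __name__ == "__main__":
    print(select_p2p_flags(4, 2))
    print(select_p2p_flags(4, None))
```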

DP overlap defaults:

```572:579:src/megatron/bridge/training/comm_overlap.py
if self.data_parallel_size > 1:
    comm_overlap_cfg.bucket_size = 128 * 1024 * 1024
    comm_overlap_cfg.overlap_grad_reduce = True
    comm_overlap_cfg.overlap_param_gather = True
```
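
Since `bucket_size` counts parameters rather than bytes, a quick conversion helps when reasoning about memory. The sketch below is plain arithmetic; `bucket_mib` is a hypothetical helper, and the element sizes (2 bytes for bf16, 4 bytes for fp32) are the only inputs assumed beyond the excerpt.

```python
# Default from the excerpt above, interpreted as a parameter (element) count.
DEFAULT_BUCKET_SIZE = 128 * 1024 * 1024


def bucket_mib(num_elements: int, bytes_per_element: int) -> float:
    """Convert a bucket size in elements to MiB for a given element width."""
    return num_elements * bytes_per_element / (1024 * 1024)


if __name__ == "__main__":
    print(DEFAULT_BUCKET_SIZE)                 # 134217728 parameters
    print(bucket_mib(DEFAULT_BUCKET_SIZE, 2))  # 256.0 MiB at 2 bytes per element
    print(bucket_mib(DEFAULT_BUCKET_SIZE, 4))  # 512.0 MiB at 4 bytes per element
```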

Launch-time env tuning:

```570:609:src/megatron/bridge/recipes/run_plugins.py
executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections)
...
executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
```
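
To make the shape of this wiring concrete, here is a minimal sketch that builds the same three variables as a plain dictionary. `overlap_env_vars` is a hypothetical helper, the logic elided by `...` in the excerpt is not reproduced, and the values in the usage lines are placeholders rather than recommendations.

```python
from typing import Dict


def overlap_env_vars(
    cuda_device_max_connections: int, layernorm_sm_margin: int
) -> Dict[str, str]:
    """Build the launch-time variables named in the excerpt (hypothetical helper).

    Environment variable values must be strings, hence the str() calls.
    """
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": str(cuda_device_max_connections),
        "NVTE_FWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
        "NVTE_BWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for name, value in overlap_env_vars(1, 16).items():
        print(f"{name}={value}")
```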

## Pitfalls

1. TP overlap silently disables itself if `tensor_model_parallel_size < 2`, `sequence_parallel=False`, or Transformer Engine is unavailable.
2. PP overlap is not enabled for every PP configuration. Bridge auto-selects `overlap_p2p_comm=True` only when `PP > 1` and `VPP > 1`; with `PP > 1` and no virtual pipeline it selects `batch_p2p_comm=True` instead.
3. `bucket_size` is a parameter-count knob, not a byte-size knob: the default `128 * 1024 * 1024` means 134,217,728 parameters, not 128 MiB.
4. Set `grad_reduce_in_fp32` and `fp8_param_gather` through mixed precision rather than tuning them first as standalone DDP settings.
5. `CUDA_DEVICE_MAX_CONNECTIONS` and LayerNorm SM margin are launch-time plugin settings, not `CommOverlapConfig` fields.

## Verification

Use the checked-in overlap unit coverage first:

```bash
uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q
```

Optional second check if `nemo_run` is available:

```bash
uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q
```

Success criteria:

- first command reports `26 passed`
- second command validates plugin-owned env wiring when not skipped
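
If the first criterion needs to be checked from a script, the pass count can be pulled out of the pytest summary line. This is a minimal sketch: `passed_count` is a hypothetical helper, and the sample strings are illustrative, not captured output from the test suite.

```python
import re
from typing import Optional


def passed_count(pytest_output: str) -> Optional[int]:
    """Return N from an 'N passed' summary, or None if no such text exists."""
    match = re.search(r"(\d+) passed", pytest_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Illustrative summary lines in the format pytest prints with -q.
    print(passed_count("26 passed in 3.21s") == 26)
    print(passed_count("1 failed, 25 passed in 3.40s") == 26)
```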

More from NVIDIA-NeMo/Megatron-Bridge

| Skill | Description |
| --- | --- |
| adding-model-support | Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples. |
| mlm-bridge-training | Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples. |
| multi-node-slurm | Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation. |
| parity-testing | Structured framework for verifying numerical parity of HF<->MCore weight conversions. References existing tools and the add-model-support skill. |
| perf-activation-recompute | Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute. |
| perf-cpu-offloading | Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer. |
| perf-cuda-graphs | Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules. |
| perf-expert-parallel-overlap | Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP. |
| perf-hybrid-context-parallel | Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. |
| perf-megatron-fsdp | Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. |