perf-hybrid-context-parallel
$ npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-hybrid-context-parallel
Configure hierarchical context parallelism for scaling beyond KV heads.
- Enables scaling context parallelism beyond KV heads in Megatron-Bridge.
- Integrates with Megatron-LM config and Transformer Engine >= 1.12.0.
- Validates constraints like sequence length and product of sizes.
- Provides code anchors and verification steps for safe deployment.
SKILL.md (.github/skills/perf-hybrid-context-parallel)
---
name: perf-hybrid-context-parallel
description: Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
when_to_use: Scaling context parallelism beyond KV heads, or investigating a commit that changed CP config and caused OOM or a regression; 'hierarchical_context_parallel_sizes', 'a2a+p2p', 'hybrid context parallel', 'CP beyond KV heads', 'multi-level CP'.
---
# Hybrid / Hierarchical Context Parallel Skill
For what HCP is, when to use it, and the decision tree (a2a+p2p vs pure a2a vs p2p), see:
- @docs/training/hybrid-context-parallel.md
- @skills/perf-hybrid-context-parallel/card.yaml
## Enablement
Minimal Bridge override:
```python
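cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
# [a2a group size, p2p group size]; the product must equal context_parallel_size.
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
# HCP is only set up on the MPU path; the decentralized-PG path leaves it unset.
cfg.dist.use_decentralized_pg = False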
```
Required constraints:
- `prod(hierarchical_context_parallel_sizes) == context_parallel_size`
- `seq_length % (2 * context_parallel_size) == 0`
- Transformer Engine `>= 1.12.0`
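The three constraints above can be checked before launching a job. The following is a minimal sketch, not part of Megatron-Bridge: `check_hcp_constraints` is a hypothetical helper, and the Transformer Engine version is passed in as a plain string rather than queried from an installed package.

```python
import math
import re


def check_hcp_constraints(cp_size, hcp_sizes, seq_length, te_version):
    """Return a list of constraint violations (empty when all constraints hold).

    Hypothetical pre-flight helper; it mirrors the constraints listed above
    and does not call into Megatron-LM or Transformer Engine.
    """
    problems = []

    # prod(hierarchical_context_parallel_sizes) == context_parallel_size
    if math.prod(hcp_sizes) != cp_size:
        problems.append(
            f"prod({hcp_sizes}) = {math.prod(hcp_sizes)} != context_parallel_size {cp_size}"
        )

    # seq_length % (2 * context_parallel_size) == 0
    if seq_length % (2 * cp_size) != 0:
        problems.append(f"seq_length {seq_length} is not divisible by {2 * cp_size}")

    # Transformer Engine >= 1.12.0 (compare only the leading numeric components)
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", te_version)
    if match is None:
        problems.append(f"cannot parse Transformer Engine version {te_version!r}")
    elif tuple(int(g or 0) for g in match.groups()) < (1, 12, 0):
        problems.append(f"Transformer Engine {te_version} is older than 1.12.0")

    return problems


if __name__ == "__main__":
    for problem in check_hcp_constraints(4, [2, 4], 8190, "1.11.0"):
        print("constraint violated:", problem)
```

With the values from the minimal override (`cp_size=4`, `hcp_sizes=[2, 2]`) and a sequence length of 8192, the helper returns an empty list.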
## Code Anchors
Upstream config and validation:
```45:54:3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""
hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify
the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
groups of two levels, so the first value of the list indicates the group size of the a2a
communication type, and the second value indicates the group size of the p2p communication
type.
"""
```
```428:433:3rdparty/Megatron-LM/megatron/training/arguments.py
if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
        "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"
```
Bridge MPU path:
```613:648:src/megatron/bridge/training/initialize.py
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()
```
Bridge decentralized-PG path:
```503:524:src/megatron/bridge/training/initialize.py
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)
```
## Implementation Map
### Config definition
`hierarchical_context_parallel_sizes` is declared in `ModelParallelConfig`:
```
# 3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
hierarchical_context_parallel_sizes: Optional[list[int]] = None
# First value = a2a group size, second value = p2p group size.
# Product must equal context_parallel_size.
```
`cp_comm_type` is declared in `TransformerConfig`:
```
# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
cp_comm_type: Optional[Union[str, List[str]]] = None
# Can be per-layer (List[str]) or uniform (str).
# Values: "p2p", "all_gather", "a2a", "a2a+p2p"
```
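Because `cp_comm_type` may be a single string or a per-layer list, a pre-flight check has to handle both shapes. A minimal sketch, assuming the four values listed above are the only valid ones; `needs_hcp_sizes` is a hypothetical helper, not an MCore function:

```python
from typing import List, Optional, Union

# Values taken from the comment above; treated here as the complete set (assumption).
VALID_CP_COMM_TYPES = {"p2p", "all_gather", "a2a", "a2a+p2p"}


def needs_hcp_sizes(cp_comm_type: Optional[Union[str, List[str]]]) -> bool:
    """Return True if any layer uses a2a+p2p, i.e. HCP sizes must be set."""
    if cp_comm_type is None:
        return False

    # Normalize the uniform (str) form to a one-element list.
    per_layer = [cp_comm_type] if isinstance(cp_comm_type, str) else list(cp_comm_type)

    unknown = [value for value in per_layer if value not in VALID_CP_COMM_TYPES]
    if unknown:
        raise ValueError(f"unknown cp_comm_type value(s): {unknown}")

    return "a2a+p2p" in per_layer


if __name__ == "__main__":
    print(needs_hcp_sizes("a2a+p2p"))
    print(needs_hcp_sizes(["p2p", "a2a"]))
```

Normalizing to a list first avoids relying on substring matching for the string form, so a value such as `"a2a"` is never confused with `"a2a+p2p"`.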
### Validation (MCore)
`TransformerConfig.__post_init__` enforces that `a2a+p2p` requires HCP sizes and that their product equals `context_parallel_size`.
### Process group creation
`parallel_state.initialize_model_parallel` creates hierarchical CP sub-groups when HCP sizes are provided via `create_hierarchical_groups`.
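To build intuition for what two-level sub-groups look like, the sketch below partitions a flat list of CP ranks in plain Python. It assumes the first level (a2a) groups contiguous ranks and the second level (p2p) strides across them; that layout is an assumption for illustration, and the authoritative grouping is whatever `create_hierarchical_groups` produces. `hierarchical_subgroups` is a hypothetical name.

```python
import math


def hierarchical_subgroups(cp_ranks, sizes):
    """Partition cp_ranks into sub-groups for each hierarchy level.

    Illustration only: level 0 is assumed to be innermost (contiguous ranks),
    and each later level strides over the levels before it.
    """
    assert math.prod(sizes) == len(cp_ranks), "product of sizes must equal CP size"

    groups_per_level = []
    stride = 1  # distance between neighbouring members of a group at this level
    for size in sizes:
        span = stride * size  # number of ranks covered by one block at this level
        level_groups = []
        for block_start in range(0, len(cp_ranks), span):
            for offset in range(stride):
                level_groups.append(
                    [cp_ranks[block_start + offset + i * stride] for i in range(size)]
                )
        groups_per_level.append(level_groups)
        stride = span
    return groups_per_level


if __name__ == "__main__":
    a2a_groups, p2p_groups = hierarchical_subgroups([0, 1, 2, 3], [2, 2])
    print("a2a groups:", a2a_groups)
    print("p2p groups:", p2p_groups)
```

Under this assumed layout, four CP ranks with sizes `[2, 2]` give a2a groups `[0, 1]` and `[2, 3]`, and p2p groups `[0, 2]` and `[1, 3]`.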
### TE integration
`TEDotProductAttention` passes the hierarchical groups to Transformer Engine when `a2a+p2p` is used. Requires **Transformer Engine >= 1.12.0**.
## Pitfalls
1. **Not `hybrid_context_parallel`**: `a2a+p2p` (hierarchical CP) and upstream `hybrid_context_parallel=True` are separate features. The latter balances packed/variable-length workloads.
2. **Bridge HCP is MPU-only today**: If `use_decentralized_pg=True`, Bridge initializes flat CP groups and leaves HCP unset.
3. **No checked-in Bridge recipe** currently exercises HCP directly.
4. **Single-GPU load helpers** clear `hierarchical_context_parallel_sizes`.
5. **Silently broken training on older MCore**: Current MCore asserts if you use `a2a+p2p` without setting `hierarchical_context_parallel_sizes`. Older versions silently disabled CP communication: each rank attended only to its local chunk, which produced artificially high throughput but completely broken gradients.
6. **Product must match**: `prod(hierarchical_context_parallel_sizes)` must exactly equal `context_parallel_size`. A mismatch triggers an assertion.
7. **Verify in logs**: Look for the process group initialization output. You should see `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created. If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active.
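Pitfall 2 is easy to miss because nothing fails: the run simply uses flat CP. A minimal guard, assuming a config object with the `model` and `dist` attributes used in the Enablement override; `hcp_ignored_by_decentralized_pg` is a hypothetical helper and `SimpleNamespace` stands in for the real config class:

```python
from types import SimpleNamespace


def hcp_ignored_by_decentralized_pg(cfg) -> bool:
    """Return True when HCP sizes are set but the decentralized-PG path is selected."""
    wants_hcp = bool(getattr(cfg.model, "hierarchical_context_parallel_sizes", None))
    decentralized = bool(getattr(cfg.dist, "use_decentralized_pg", False))
    return wants_hcp and decentralized


# Stand-in config for illustration; the real Bridge config object is not constructed here.
cfg = SimpleNamespace(
    model=SimpleNamespace(
        context_parallel_size=4,
        cp_comm_type="a2a+p2p",
        hierarchical_context_parallel_sizes=[2, 2],
    ),
    dist=SimpleNamespace(use_decentralized_pg=True),
)

if hcp_ignored_by_decentralized_pg(cfg):
    print("warning: use_decentralized_pg=True leaves HCP unset; flat CP groups will be used")
```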
## Verification
No dedicated Bridge end-to-end test exists yet for HCP (see @skills/perf-hybrid-context-parallel/card.yaml
`follow_up_validation`). Use the existing unit tests and log inspection instead.
Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:
```bash
uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q
```
For a manual smoke check, launch a 4-GPU run with a small recipe and
`cp_comm_type=a2a+p2p` plus `hierarchical_context_parallel_sizes=[2,2]`:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2
```
Success criteria:
- Logs show `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created
- Training completes at least one step without error
- Failure signal: if you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active
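The log check can be automated. A minimal sketch, assuming the run's output was captured as text; the two marker strings come from the criteria above, while `hcp_status` and its return labels are hypothetical. The hierarchical marker is tested first because `CONTEXT_PARALLEL_GROUP` is a substring of it.

```python
def hcp_status(log_text: str) -> str:
    """Classify a training log by which CP process-group markers it contains."""
    # Check the longer marker first: the flat-CP marker is a substring of it.
    if "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS" in log_text:
        return "hcp_active"
    if "CONTEXT_PARALLEL_GROUP" in log_text:
        return "flat_cp_only"
    return "no_cp_marker_found"


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_hcp_log.py < train.log
    print(hcp_status(sys.stdin.read()))
```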