perf-cpu-offloading
$
npx mdskill add NVIDIA-NeMo/Megatron-Bridge/perf-cpu-offloadingReduce GPU memory by offloading activations or optimizer states to CPU.
- Prevents out-of-memory crashes in large MoE models with deep inference.
- Integrates with Megatron Bridge and HybridDeviceOptimizer for state management.
- Selects strategy based on model size, parallelism, and memory constraints.
- Returns configuration parameters for activation or optimizer offloading.
SKILL.md
.github/skills/perf-cpu-offloadingView on GitHub ↗
---
name: perf-cpu-offloading
description: Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
when_to_use: Enabling CPU offload to reduce GPU memory, or investigating a commit that changed CPU offloading config and caused OOM or a crash; 'cpu_offloading', 'optimizer_cpu_offload', 'optimizer_offload_fraction', 'HybridDeviceOptimizer', 'move optimizer to CPU'.
---
# CPU Offloading
## References
- Stable docs: @docs/training/cpu-offloading.md
- Structured metadata: @skills/perf-cpu-offloading/card.yaml
## What It Is
Two independent mechanisms to move data from GPU to CPU memory:
| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `model.cpu_offloading*` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `optimizer.optimizer_cpu_offload` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |
## Quick Decision
| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading — activation offloading is blocked by PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable — offloading always adds overhead |
| CUDA graphs are needed | Only optimizer offloading — activation offloading is incompatible |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |
## Enablement
### Optimizer CPU offloading (recommended for large models)
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```
CLI overrides:
```bash
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True
```
### Activation CPU offloading (small/medium models only)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
cfg.model.cuda_graph_impl = "none"
```
## Config Parameter Reference
### Optimizer offloading
| Parameter | Default | Description |
|-----------|---------|-------------|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use `torch.optim` instead of fused optimizer for CPU portion |
### Activation offloading
| Parameter | Default | Description |
|-----------|---------|-------------|
| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |
## Compatibility And Constraints
### Activation offloading
- `pipeline_model_parallel_size` must be 1
- `recompute_granularity` must be `None`
- Cannot combine with `fine_grained_activation_offloading`
- Cannot combine with CUDA graphs
- `cpu_offloading_num_layers` must be in `[0, num_layers-1)`
### Optimizer offloading
- Requires `use_distributed_optimizer = True` (default in most recipes)
- No PP, recompute, or CUDA graph restrictions
- `optimizer_offload_fraction` must be in `[0.0, 1.0]`
### Practical: large MoE models
Activation offloading is blocked for Qwen3-30B-A3B and similar large MoE
models. The PP=1 constraint means each GPU holds all 48 layers; model
weights + optimizer states alone (~70 GB) exceed H100 80 GB capacity.
## Minimal Working Config
### Optimizer offload (50%, balanced)
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 0.5
```
### Optimizer offload (100% + overlap, max savings)
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```
### Activation offload (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
### Weight offload only (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = False
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
### Both activations and weights (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
Weight offloading and activation offloading share the same constraints (PP=1,
no recompute, no CUDA graphs). Weight offloading has not been tested in
the Qwen3-30B-A3B experiments — the measured results cover optimizer
offloading only.
## Minimal Runnable Command
```bash
uv run python scripts/training/run_recipe.py \
--recipe qwen3_30b_a3b_pretrain_config \
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
train.train_iters=20 \
train.global_batch_size=8 \
train.micro_batch_size=1
```
## Verification
### Unit tests
```bash
uv run python -m pytest \
tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cpu_offload" \
tests/unit_tests/peft/test_utils.py -k "cpu_offload" -q
```
### Success criteria
- Config validation passes for the selected offloading mode
- Training completes without OOM or NCCL errors
- Loss matches the non-offloaded baseline (max delta < 0.001)
- Memory usage drops proportionally to offload fraction
## Code Anchors
### MCore activation offload constraints
```1296:1310:3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
if self.cpu_offloading and (
self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
):
raise ValueError(...)
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
raise ValueError(
"Currently there is no support for Pipeline parallelism with CPU offloading"
)
if self.cpu_offloading and self.recompute_granularity is not None:
raise ValueError(
"CPU offloading does not work when activation recomputation is enabled"
)
```
### MCore CUDA graph incompatibility
```1943:1944:3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
if self.cpu_offloading:
raise ValueError("CUDA graphs not supported with CPU offloading.")
```
### MCore fine-grained offloading mutual exclusion
```1427:1430:3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
if self.fine_grained_activation_offloading:
assert (
not self.cpu_offloading
), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."
```
### MCore HybridDeviceOptimizer instantiation
```480:518:3rdparty/Megatron-LM/megatron/core/optimizer/__init__.py
if config.optimizer_cpu_offload:
# ... setup cpu/gpu optimizer classes ...
optimizer = HybridDeviceOptimizer(
param_groups,
offload_fraction=config.optimizer_offload_fraction,
cpu_optimizer_cls=cpu_optimizer_cls,
gpu_optimizer_cls=gpu_optimizer_cls,
overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
pin_cpu_grads=config.pin_cpu_grads,
pin_cpu_params=config.pin_cpu_params,
)
```
### Bridge CUDA graph guard
```232:234:src/megatron/bridge/models/gpt_full_te_layer_autocast_spec.py
assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"
```
### Bridge activation offloading in PEFT
```621:631:src/megatron/bridge/peft/utils.py
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
x.activation_offloading = True
x, _ = self.linear_in(x)
x = self.activation(x)
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
x.activation_offloading = True
x, _ = self.linear_out(x)
```
### MCore model_parallel_config fields
```3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
cpu_offloading: bool = False
cpu_offloading_num_layers: int = 0
cpu_offloading_activations: bool = True
cpu_offloading_weights: bool = False
cpu_offloading_double_buffering: bool = False
cpu_offloading_retain_pinned_cpu_buffers: bool = False
```
### MCore optimizer offload config
```3rdparty/Megatron-LM/megatron/core/optimizer/optimizer_config.py
optimizer_cpu_offload: bool = False
optimizer_offload_fraction: float = 0.0
use_torch_optimizer_for_cpu_offload: bool = False
overlap_cpu_optimizer_d2h_h2d: bool = False
```
## Failure Diagnosis
| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload + PP > 1 | Check `pipeline_model_parallel_size` | Set PP=1 or use optimizer offloading |
| `CPU offloading does not work when activation recomputation is enabled` | Activation offload + recompute | Check `recompute_granularity` | Set `recompute_granularity=null` |
| `fine_grained_activation_offloading cannot be enabled with cpu_offloading` | Both offloading modes enabled | Check both flags | Use one or the other |
| `CUDA graphs not supported with CPU offloading` | CUDA graphs + activation offload | Check `cuda_graph_impl` | Set `cuda_graph_impl="none"` |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |
## Known Limitations
- Activation offloading requires PP=1, making it impractical for large models
(30B+ MoE) that need pipeline parallelism.
- Optimizer offloading throughput penalty scales linearly (~1.9x at 25%,
~4.2x at 100% for Qwen3-30B-A3B).
- D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is
the dominant bottleneck.
- `fine_grained_activation_offloading` is a separate module-level approach
that works with PP > 1 but cannot be combined with layer-level
`cpu_offloading`.
More from NVIDIA-NeMo/Megatron-Bridge
- adding-model-supportGuide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples.
- mlm-bridge-trainingRun Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.
- multi-node-slurmConvert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
- parity-testingStructured framework for verifying numerical parity of HF<->MCore weight conversions. References existing tools and the add-model-support skill.
- perf-activation-recomputeValidate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.
- perf-cuda-graphsValidate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
- perf-expert-parallel-overlapValidate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
- perf-hybrid-context-parallelOperational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
- perf-megatron-fsdpOperational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
- perf-memory-tuningTechniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.