---
name: ptq
description: This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.
---
# ModelOpt Post-Training Quantization
Produce a quantized checkpoint from a pretrained model. **Read `examples/llm_ptq/README.md` first** — it has the support matrix, CLI flags, and accuracy guidance.
## Step 1 — Environment
Read `skills/common/environment-setup.md` and `skills/common/workspace-management.md`. After completing them you should know:
- Whether the ModelOpt source is available
- Whether execution is local or remote (plus the cluster config if remote)
- Which execution environment applies: SLURM, Docker + GPU, or bare GPU
- Whether the launcher is available
- Which workspace to use
## Step 2 — Is the model supported?
Check the support table in `examples/llm_ptq/README.md` for verified HF models.
- **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
- **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)
## Step 2.5 — Check for model-specific dependencies
If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:
```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```
**Known dependency patterns:**
| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |
If extra deps are needed:
- **Launcher (4B)**: set `EXTRA_PIP_DEPS` in the task's `environment` section — `ptq.sh` installs them automatically
- **Manual (4A)**: `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
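
The dependency check in this step can also be scripted. The following is a minimal sketch, not part of ModelOpt: `uses_remote_code`, `top_level_packages`, and `extra_pip_deps` are hypothetical helper names, and `KNOWN_DEPS` only mirrors the table above. Unlike the `grep`, it parses the source with `ast`, so it also catches imports nested inside functions or `try` blocks.

```python
import ast

# Import name -> pip package, copied from the table above (extend as needed).
KNOWN_DEPS = {"mamba_ssm": "mamba-ssm", "causal_conv1d": "causal-conv1d"}


def uses_remote_code(config: dict) -> bool:
    """True when the parsed config.json carries a non-empty `auto_map`."""
    return bool(config.get("auto_map"))


def top_level_packages(source: str) -> set[str]:
    """Collect root package names of all absolute imports in a source string."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from .utils import x`
            packages.add(node.module.split(".")[0])
    return packages


def extra_pip_deps(source: str) -> list[str]:
    """Pip packages to install for the known import patterns found in `source`."""
    return sorted(KNOWN_DEPS[p] for p in top_level_packages(source) if p in KNOWN_DEPS)
```

In practice, `config` would come from `json.load` on the model's `config.json`, and `source` from reading each `modeling_*.py` file in the model directory.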
## Step 3 — Choose quantization format
**First**, check for a model-specific recipe:
```bash
ls modelopt_recipes/models/ 2>/dev/null
```
If a model-specific recipe exists, use `--recipe <path>` — it may contain tuned settings.
**If no model-specific recipe**, choose a format based on GPU (details in `examples/llm_ptq/README.md`):
- **Blackwell** (B100/B200/GB200): `nvfp4` variants
- **Hopper** (H100/H200) or older: `fp8` or `int4_awq`
Use `--qformat <name>` (e.g., `--qformat nvfp4`). Format definitions: `modelopt/torch/quantization/config.py`. General PTQ recipes in `modelopt_recipes/general/ptq/` correspond to the same formats — `--qformat` is the simpler way to use them.
> NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
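
As a rough illustration of the GPU-based choice, the sketch below maps a GPU name string to candidate `--qformat` values. `suggest_qformats` and `BLACKWELL_TAGS` are hypothetical names, and the substring matching is an assumption about how GPU names are spelled. The README remains the authority on which `nvfp4` variant to pick.

```python
# Substrings taken from the GPU list above; "BLACKWELL" covers names that
# spell out the architecture instead of a model number.
BLACKWELL_TAGS = ("B100", "B200", "GB200", "BLACKWELL")


def suggest_qformats(gpu_name: str) -> list[str]:
    """Return candidate --qformat values for a GPU name string."""
    name = gpu_name.upper()
    if any(tag in name for tag in BLACKWELL_TAGS):
        return ["nvfp4"]
    # Hopper or older, per the guidance above
    return ["fp8", "int4_awq"]
```

The GPU name can be read from `nvidia-smi --query-gpu=name --format=csv,noheader` on the machine that will serve the model. That is the machine that matters, since calibration hardware and inference hardware may differ.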
## Step 4 — Run PTQ
**Goal: checkpoint on disk** (`.safetensors` + `config.json`).
For **listed models** (4A/4B): run full calibration directly (`--calib_size 512`).
For **unlisted models** (4C): run a smoke test first (`--calib_size 4`), wait for success, then full calibration.
### Which path?
```text
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                 │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                 │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                 │          Bare GPU (local or remote)? → MANUAL (4A)
                 │
                 └→ NOT LISTED ──→ UNLISTED MODEL (4C)
```
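
The same decision tree, written as a small function for readers who prefer code. This is an illustrative sketch: `choose_path` and the environment labels (`"slurm"`, `"local_docker"`, `"remote_docker"`, `"bare_gpu"`) are hypothetical names, not something ModelOpt or the launcher defines.

```python
def choose_path(listed: bool, environment: str) -> str:
    """Map the decision tree above to a step label: "4A", "4B", or "4C"."""
    if not listed:
        return "4C"  # unlisted models always take the investigation path
    if environment in {"slurm", "local_docker"}:
        return "4B"  # launcher handles SLURM and local Docker
    if environment in {"remote_docker", "bare_gpu"}:
        return "4A"  # manual execution
    raise ValueError(f"unknown environment: {environment!r}")
```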
### 4A — Direct: supported model, manual execution
```bash
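# Install ModelOpt with the HuggingFace extras, then the example's requirements
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
# Quantize <model> to <format> and write the checkpoint to <output>
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>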
```
Run `python examples/llm_ptq/hf_ptq.py --help` for all options.
For remote: use `remote_run` from `remote_exec.sh` (see `skills/common/remote-execution.md`).
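
When the run is driven from a script, the calibration schedule from the start of Step 4 (full calibration for listed models, a smoke test first for unlisted ones) can be sketched as below. Only the flags shown in the command above are used. `calib_sizes`, `build_ptq_command`, and the example paths are hypothetical.

```python
import shlex


def calib_sizes(listed: bool) -> list[int]:
    """Listed models go straight to full calibration; unlisted ones smoke-test first."""
    return [512] if listed else [4, 512]


def build_ptq_command(model: str, qformat: str, export_path: str, calib_size: int) -> list[str]:
    """Assemble the hf_ptq.py invocation as an argv list."""
    return [
        "python", "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


# Placeholder paths for illustration only.
for size in calib_sizes(listed=False):
    cmd = build_ptq_command("/models/my-model", "fp8", "/outputs/my-model-fp8", size)
    print(shlex.join(cmd))
```

Passing each argv list to `subprocess.run(cmd, check=True)` would stop at the first failing run, so a failed smoke test never triggers the full calibration.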
### 4B — Launcher: supported model on SLURM or local Docker
Write a YAML config using `common/hf_ptq/hf_ptq.sh`. See `references/launcher-guide.md` for the full template.
```bash
cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes
```
The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).
### 4C — Unlisted model
Follow `references/unsupported-models.md`. It walks through investigating the model, patching ModelOpt if needed, and running `hf_ptq.py`. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see `skills/common/slurm-setup.md` and `references/slurm-setup-ptq.md`.
### Monitoring
After job submission, register the job and set up monitoring per the **monitor skill**.
## Step 5 — Verify output
```bash
ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors
```
Report the path and size to the user.
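
For a scripted version of this check, a minimal sketch follows. `summarize_checkpoint` is a hypothetical helper. It only confirms that `config.json` and at least one `.safetensors` file exist and totals the file sizes. It says nothing about whether the quantization itself is correct.

```python
from pathlib import Path


def summarize_checkpoint(output_path: str) -> dict:
    """Check for the expected files and total up the checkpoint size."""
    root = Path(output_path)
    files = [p for p in root.iterdir() if p.is_file()]
    shards = sorted(p.name for p in files if p.suffix == ".safetensors")
    return {
        "has_config": (root / "config.json").is_file(),
        "safetensors": shards,
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

An empty `safetensors` list or `has_config == False` means the export did not finish, and the job log is the next place to look.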
### Post-quantization validation
Validate the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps.
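
To make the "silently missed layers" failure concrete, the sketch below lists module names that no pattern covers. It assumes glob-style matching as implemented by Python's `fnmatch`, and the module names are hypothetical. It is not the validation script from `references/checkpoint-validation.md`.

```python
from fnmatch import fnmatchcase


def unmatched_modules(module_names, patterns):
    """Return module names that none of the wildcard patterns cover."""
    return [n for n in module_names if not any(fnmatchcase(n, p) for p in patterns)]


# Hypothetical names: one conventional MLP projection and one expert weight
# that lives outside any "mlp" submodule.
names = [
    "model.layers.0.mlp.down_proj",
    "model.layers.0.experts.0.down_proj",
]
print(unmatched_modules(names, ["*mlp*"]))
```

Here the expert projection is the only name reported, because its path never contains `mlp`. Running a check like this against the real module names before export is cheaper than finding the gap at deployment.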
**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
## Key API Rules
- `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
- Call `mto.enable_huggingface_checkpointing()` **before** quantization
- Wildcard `*gate*` matches too broadly — use `*mlp.gate*` or `*router*`
- VLMs: `hf_ptq.py` auto-extracts the language model via `extract_and_prepare_language_model_from_vl()` — no manual VLM handling needed in most cases
- FP8 checkpoints: prefer `_QuantFP8Linear` (lazy dequant) over `FineGrainedFP8Config(dequantize=True)` which wastes ~2x memory. See `references/unsupported-models.md` for details
- Custom quantizer names must end with `_input_quantizer` or `_weight_quantizer`
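
Two of these rules are easy to check in isolation. The sketch below assumes glob-style matching as in Python's `fnmatch` and uses hypothetical module and quantizer names. It shows why `*gate*` over-matches in a MoE block and how the required quantizer suffix can be tested.

```python
from fnmatch import fnmatchcase

# Hypothetical module names from a MoE block: a router and an expert projection.
ROUTER = "model.layers.0.mlp.gate"
EXPERT_PROJ = "model.layers.0.mlp.experts.0.gate_proj"


def matches(pattern: str) -> list[str]:
    """Names among the two examples that the wildcard pattern selects."""
    return [n for n in (ROUTER, EXPERT_PROJ) if fnmatchcase(n, pattern)]


def valid_quantizer_name(name: str) -> bool:
    """Custom quantizer names must end with one of the two required suffixes."""
    return name.endswith(("_input_quantizer", "_weight_quantizer"))


print(matches("*gate*"))      # selects the router and the expert projection
print(matches("*mlp.gate*"))  # selects the router only
```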
## Common Pitfalls
- **Model-specific dependencies**: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
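
The transformers-version and `HF_TOKEN` pitfalls can be caught before submitting a job. The following is a minimal preflight sketch with hypothetical helper names. It compares only the leading numeric version components, which is an intentional simplification.

```python
import os
import re


def version_tuple(version: str) -> tuple[int, ...]:
    """Leading numeric components, padded to three: "4.57.0.dev0" -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric piece such as "dev0"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_transformers_upgrade(required: str, installed: str) -> bool:
    """True when the installed version is older than the one in config.json."""
    return version_tuple(installed) < version_tuple(required)


def has_hf_token(env=None) -> bool:
    """True when HF_TOKEN is set and non-empty in the given environment mapping."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN"))
```

`required` would be the `transformers_version` field from the model's `config.json`, and `installed` can be read with `importlib.metadata.version("transformers")`.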
## References
| Reference | When to read |
| --- | --- |
| `skills/common/environment-setup.md` | Step 1: always |
| `skills/common/workspace-management.md` | Step 1: always |
| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: validate quantization pattern matches recipe |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
| `modelopt/torch/quantization/config.py` | Step 3: format definitions |
| `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
| `modelopt_recipes/` | Step 3: pre-built recipes |