kernel-tileir-optimization
$
npx mdskill add NVIDIA/skills/kernel-tileir-optimizationOptimize Triton kernels for NVIDIA Blackwell TileIR backend.
- Enhances performance on sm_100+ GPUs via CGA-level tile representations.
- Integrates with nvtriton and standard Triton import syntax.
- Activates automatically when ENABLE_TILE environment variable is set.
- Delivers autotuned configurations including occupancy and TMA descriptors.
SKILL.md
.github/skills/kernel-tileir-optimizationView on GitHub ↗
---
name: kernel-tileir-optimization
description: >
Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+).
Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers
kernel classification (dot-related, norm-like, elementwise, reduction), type-specific
transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR",
"add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode",
"occupancy tuning". Kernels use standard `import triton`; TileIR activates via
ENABLE_TILE=1 when nvtriton is installed.
compatibility: >
Requires Blackwell GPU (sm_100+) for TileIR execution. Supports development on any GPU.
license: Apache-2.0
metadata:
author: NVIDIA Corporation
---
# Triton TileIR Optimization
Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs.
This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.
## Principles
### TileIR vs PTX Backend
TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code
using CGA-level (Cooperative Grid Array) tile representations. Critical differences:
| Parameter | PTX Backend | TileIR Backend |
|-----------|-------------|----------------|
| `num_warps` | Strict directive | **Ignored** (compiler decides) |
| `num_stages` | Strict directive | Cost hint (compiler optimizes) |
| `occupancy` | Not available | **Critical** tuning param (1-32) |
| `num_ctas` | Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |
**Key implication**: Do not tune `num_warps` for TileIR -- focus on `occupancy` instead.
### Triton Package Landscape
Three packages share `import triton`:
| Package | Source | Use Case |
|---------|--------|----------|
| `pytorch-triton` | PyTorch wheel | `torch.compile`, standard kernels |
| `triton` | OpenAI PyPI | Official Triton from triton-lang.org |
| nvtriton | [Triton-to-tile-IR](https://github.com/triton-lang/Triton-to-tile-IR) | TileIR backend for Blackwell |
Only one triton package should be installed at a time. "Converting to TileIR" means
adding TileIR-specific configs, NOT changing imports. TileIR activates via `ENABLE_TILE=1`.
### When TileIR Applies
TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the
specialist still adds TileIR-optimized configs that standard triton safely ignores,
enabling future deployment.
**Expected speedups** (with nvtriton on Blackwell):
| Kernel Type | Speedup | Key Lever |
|-------------|---------|-----------|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |
## Workflow
Five-phase workflow: compatibility, classify, transform, validate, benchmark.
### Phase 1: Compatibility Test (ENABLE_TILE=0)
Verify the kernel works in PTX mode before applying TileIR optimizations.
```bash
python scripts/tileir_check.py
```
Then use the **kernel-triton-writing** skill's `verify_kernel.py` to verify with `ENABLE_TILE=0`:
```bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```
### Phase 2: Classify Kernel
Determine kernel type to select the optimization strategy.
```bash
python scripts/classify_kernel.py --file kernel.py
```
Classification decision tree:
```
Contains tl.dot()?
YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
NO --> Has reduction + normalization?
YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
NO --> Point-wise only?
YES --> element-wise: occupancy (1-16) + num_stages (2-4)
NO --> reduction: high occupancy + num_warps
```
### Phase 3: Apply Transformations
Classify and apply optimizations in one step:
```bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
```
Output JSON includes `optimized_code` and `changes_applied` fields.
**Type-specific transformations:**
**Dot-related** (highest priority):
1. Convert `tl.load`/`tl.store` to TMA descriptors (MANDATORY). See `references/tma-conversion.md`.
2. Add 2CTA configs (`num_ctas=2`) with SM oversubscription guard in pre-hook.
3. Add occupancy (1, 2, 4) and extended num_stages (4, 6).
4. Use larger block sizes (256x256, 256x128).
**Norm-like** (LayerNorm, Softmax, RMSNorm):
- Add occupancy (2, 4), num_warps (4, 8). No TMA needed.
**Element-wise** (ReLU, GELU, Add, Mul, Exp):
- Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.
**Reduction** (Sum, Mean, Max):
- Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).
Gate TileIR-specific configs for sm_100+:
```python
import torch
def get_configs_with_gating(pre_hook=None):
configs = get_baseline_configs()
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
configs.extend(get_tileir_specific_configs(pre_hook))
return configs
```
See `references/config-templates.md` for complete config templates per kernel type.
### Phase 4: TileIR Validation (ENABLE_TILE=1)
Use the **kernel-triton-writing** skill's `verify_kernel.py` to verify the optimized kernel with TileIR backend:
```bash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```
Set `ENABLE_TILE=1` before running. Check: numerical correctness, no compilation errors,
TMA/2CTA patterns compile successfully.
### Phase 5: Benchmark
Use `triton.testing.do_bench()` (as documented in the **perf-workload-profiling** skill) to compare PTX (`ENABLE_TILE=0`) vs TileIR (`ENABLE_TILE=1`).
Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.
## Scripts
### tileir_check.py
Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):
```bash
python scripts/tileir_check.py
```
Returns JSON: `nvtriton_installed`, `tileir_active`, `blackwell_gpu`, `gpu_capability`, `recommendation`.
### classify_kernel.py
Classify kernel type and optionally apply TileIR optimizations:
```bash
# Classify only
python scripts/classify_kernel.py --file kernel.py
# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'
```
Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`.
With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.
## Error Handling
### Common Pitfalls
**TMA descriptor errors** (dot-related kernels):
- Always pass `pre_hook=tma_set_block_size_hook` to config generation -- without it,
TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
- For GEMM: pass `b.T.contiguous()` in wrapper and use `tl.dot(a, b.T, accumulator)`
in kernel. Transposition mismatch produces incorrect results silently.
**2CTA oversubscription**:
- Adjust SM count in pre-hook when using `num_ctas=2`:
```python
if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
```
**Config function signatures**:
- ALL config helper functions MUST accept `pre_hook=None`, even if unused.
Without it: `TypeError: get_autotune_configs() takes 0 positional arguments`.
**Hardware gating**:
- Gate TileIR configs with `torch.cuda.get_device_capability()[0] >= 10`.
TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.
**API availability**:
- Use `1.0 / (1.0 + tl.exp(-x))` instead of `tl.sigmoid(x)` -- not available in
all Triton versions including some nvtriton builds.
**Performance tuning**:
- Do not over-tune `num_warps` -- TileIR ignores it. Focus on `occupancy`.
- Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
- Benchmark across small/medium/large inputs; one-size configs underperform.
- For exp/log heavy kernels, enable approximate math:
```bash
export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1
```
### When to Abort
Stop and report if:
1. **No triton installed** -- cannot proceed.
2. **Compatibility test fails** -- kernel has syntax/runtime errors before optimization.
3. **TileIR validation fails** -- optimized kernel produces wrong results.
4. **No speedup** -- TileIR version is slower than PTX baseline (with nvtriton).
5. **Not Blackwell GPU** -- still add configs for future deployment, but skip
ENABLE_TILE testing and benchmarking.
### Output Format
After optimization, return:
```
## TileIR Optimization: kernel_name
### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]
### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Transformations Applied
- [List of transformations]
### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |
### Output
File: kernel_name_tileir.py
```