kernel-tileir-optimization

Name: kernel-tileir-optimization
Author: NVIDIA/skills

$npx mdskill add NVIDIA/skills/kernel-tileir-optimization

Optimize Triton kernels for NVIDIA Blackwell TileIR backend.

Enhances performance on sm_100+ GPUs via CGA-level tile representations.
Integrates with nvtriton and standard Triton import syntax.
Activates automatically when ENABLE_TILE environment variable is set.
Delivers autotuned configurations including occupancy and TMA descriptors.

SKILL.md

.github/skills/kernel-tileir-optimizationView on GitHub ↗

---
name: kernel-tileir-optimization
description: >
  Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+).
  Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers
  kernel classification (dot-related, norm-like, elementwise, reduction), type-specific
  transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR",
  "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode",
  "occupancy tuning". Kernels use standard `import triton`; TileIR activates via
  ENABLE_TILE=1 when nvtriton is installed.
compatibility: >
  Requires Blackwell GPU (sm_100+) for TileIR execution. Supports development on any GPU.
license: Apache-2.0
metadata:
  author: NVIDIA Corporation
---

# Triton TileIR Optimization

Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs.
This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.

## Principles

### TileIR vs PTX Backend

TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code
using CGA-level (Cooperative Grid Array) tile representations. Critical differences:

| Parameter | PTX Backend | TileIR Backend |
|-----------|-------------|----------------|
| `num_warps` | Strict directive | **Ignored** (compiler decides) |
| `num_stages` | Strict directive | Cost hint (compiler optimizes) |
| `occupancy` | Not available | **Critical** tuning param (1-32) |
| `num_ctas` | Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |

**Key implication**: Do not tune `num_warps` for TileIR -- focus on `occupancy` instead.

### Triton Package Landscape

Three packages share `import triton`:

| Package | Source | Use Case |
|---------|--------|----------|
| `pytorch-triton` | PyTorch wheel | `torch.compile`, standard kernels |
| `triton` | OpenAI PyPI | Official Triton from triton-lang.org |
| nvtriton | [Triton-to-tile-IR](https://github.com/triton-lang/Triton-to-tile-IR) | TileIR backend for Blackwell |

Only one triton package should be installed at a time. "Converting to TileIR" means
adding TileIR-specific configs, NOT changing imports. TileIR activates via `ENABLE_TILE=1`.

### When TileIR Applies

TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the
specialist still adds TileIR-optimized configs that standard triton safely ignores,
enabling future deployment.

**Expected speedups** (with nvtriton on Blackwell):

| Kernel Type | Speedup | Key Lever |
|-------------|---------|-----------|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |

## Workflow

Five-phase workflow: compatibility, classify, transform, validate, benchmark.

### Phase 1: Compatibility Test (ENABLE_TILE=0)

Verify the kernel works in PTX mode before applying TileIR optimizations.

```bash
python scripts/tileir_check.py
```

Then use the **kernel-triton-writing** skill's `verify_kernel.py` to verify with `ENABLE_TILE=0`:

```bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```

### Phase 2: Classify Kernel

Determine kernel type to select the optimization strategy.

```bash
python scripts/classify_kernel.py --file kernel.py
```

Classification decision tree:

```
Contains tl.dot()?
  YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
  NO  --> Has reduction + normalization?
            YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
            NO  --> Point-wise only?
                      YES --> element-wise: occupancy (1-16) + num_stages (2-4)
                      NO  --> reduction: high occupancy + num_warps
```

### Phase 3: Apply Transformations

Classify and apply optimizations in one step:

```bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
```

Output JSON includes `optimized_code` and `changes_applied` fields.

**Type-specific transformations:**

**Dot-related** (highest priority):
1. Convert `tl.load`/`tl.store` to TMA descriptors (MANDATORY). See `references/tma-conversion.md`.
2. Add 2CTA configs (`num_ctas=2`) with SM oversubscription guard in pre-hook.
3. Add occupancy (1, 2, 4) and extended num_stages (4, 6).
4. Use larger block sizes (256x256, 256x128).

**Norm-like** (LayerNorm, Softmax, RMSNorm):
- Add occupancy (2, 4), num_warps (4, 8). No TMA needed.

**Element-wise** (ReLU, GELU, Add, Mul, Exp):
- Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.

**Reduction** (Sum, Mean, Max):
- Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).

Gate TileIR-specific configs for sm_100+:

```python
import torch

def get_configs_with_gating(pre_hook=None):
    configs = get_baseline_configs()
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs
```

See `references/config-templates.md` for complete config templates per kernel type.

### Phase 4: TileIR Validation (ENABLE_TILE=1)

Use the **kernel-triton-writing** skill's `verify_kernel.py` to verify the optimized kernel with TileIR backend:

```bash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```

Set `ENABLE_TILE=1` before running. Check: numerical correctness, no compilation errors,
TMA/2CTA patterns compile successfully.

### Phase 5: Benchmark

Use `triton.testing.do_bench()` (as documented in the **perf-workload-profiling** skill) to compare PTX (`ENABLE_TILE=0`) vs TileIR (`ENABLE_TILE=1`).

Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.

## Scripts

### tileir_check.py

Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):

```bash
python scripts/tileir_check.py
```

Returns JSON: `nvtriton_installed`, `tileir_active`, `blackwell_gpu`, `gpu_capability`, `recommendation`.

### classify_kernel.py

Classify kernel type and optionally apply TileIR optimizations:

```bash
# Classify only
python scripts/classify_kernel.py --file kernel.py

# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations

# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'
```

Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`.
With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.

## Error Handling

### Common Pitfalls

**TMA descriptor errors** (dot-related kernels):
- Always pass `pre_hook=tma_set_block_size_hook` to config generation -- without it,
  TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
- For GEMM: pass `b.T.contiguous()` in wrapper and use `tl.dot(a, b.T, accumulator)`
  in kernel. Transposition mismatch produces incorrect results silently.

**2CTA oversubscription**:
- Adjust SM count in pre-hook when using `num_ctas=2`:
  ```python
  if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
      nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
  ```

**Config function signatures**:
- ALL config helper functions MUST accept `pre_hook=None`, even if unused.
  Without it: `TypeError: get_autotune_configs() takes 0 positional arguments`.

**Hardware gating**:
- Gate TileIR configs with `torch.cuda.get_device_capability()[0] >= 10`.
  TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.

**API availability**:
- Use `1.0 / (1.0 + tl.exp(-x))` instead of `tl.sigmoid(x)` -- not available in
  all Triton versions including some nvtriton builds.

**Performance tuning**:
- Do not over-tune `num_warps` -- TileIR ignores it. Focus on `occupancy`.
- Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
- Benchmark across small/medium/large inputs; one-size configs underperform.
- For exp/log heavy kernels, enable approximate math:
  ```bash
  export TILEIR_ENABLE_APPROX=1
  export TILEIR_ENABLE_FTZ=1
  ```

### When to Abort

Stop and report if:

1. **No triton installed** -- cannot proceed.
2. **Compatibility test fails** -- kernel has syntax/runtime errors before optimization.
3. **TileIR validation fails** -- optimized kernel produces wrong results.
4. **No speedup** -- TileIR version is slower than PTX baseline (with nvtriton).
5. **Not Blackwell GPU** -- still add configs for future deployment, but skip
   ENABLE_TILE testing and benchmarking.

### Output Format

After optimization, return:

```
## TileIR Optimization: kernel_name

### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]

### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Transformations Applied
- [List of transformations]

### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |

### Output
File: kernel_name_tileir.py
```

More from NVIDIA/skills

Skill	Description
accessing-mlflow	Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
ad-add-fusion-transformation	>
ad-conf-check	>
ad-graph-dump	>
ad-model-onboard	>
ad-pipeline-failure-pr	>
add-benchmark	>
aiq-deploy	\|
aiq-research	\|
byob	Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.