rocm-kernels

Name: rocm-kernels
Author: huggingface/kernels
$npx mdskill add huggingface/kernels/rocm-kernels
Optimize AMD GPU kernels for diffusers and transformers pipelines.
Accelerate RMSNorm, RoPE 3D, and GEGLU operations on MI355X hardware.
Integrates with Triton, ROCm, diffusers, and transformers libraries.
Selects kernels based on user-specified task type and target architecture.
Delivers ready-to-inject code patterns and autotuned performance benchmarks.
SKILL.md
.github/skills/rocm-kernelsView on GitHub ↗
---
name: rocm-kernels
description: "Provides guidance for writing and benchmarking optimized Triton kernels for AMD GPUs (MI355X, R9700) on ROCm, targeting HuggingFace diffusers (LTX-Video, SD3, FLUX) and transformers. Core kernels: RMSNorm, RoPE 3D, GEGLU, AdaLN. Includes XCD swizzle, autotune, diffusers integration patterns, and LTX-Video pipeline injection."
disable-model-invocation: false
user-invocable: true
allowed-tools: "Read, Grep, Glob, Bash"
argument-hint: "kernel type: rmsnorm, rope, rope-3d, geglu, adaln, gemm, benchmark, diffusers, transformers, ltx-video, huggingface-kernels, get_kernel, autotune, xcd-swizzle"
---

# ROCm Triton Kernels for Diffusers & Transformers

This skill provides patterns and guidance for developing optimized Triton kernels targeting AMD GPUs (MI355X, R9700) on ROCm, for use with HuggingFace **diffusers** (LTX-Video, SD3, FLUX) and **transformers** libraries.

## Quick Start

### Diffusers (LTX-Video)

**Inject optimized kernels into LTX-Video pipeline:**
```python
import os
os.environ['TRITON_HIP_USE_BLOCK_PINGPONG'] = '1'
os.environ['TRITON_HIP_USE_ASYNC_COPY'] = '1'

from diffusers import LTXPipeline
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # ROCm uses same API via HIP
inject_optimized_kernels(pipe)  # BEFORE CPU offloading
pipe.enable_model_cpu_offload()
```

**For a minimal integration example (~150 lines):**
```bash
python scripts/ltx_kernel_injection_example.py
```

### Isolated Kernel Micro-benchmarks
```bash
# All 4 kernels: correctness + performance + bandwidth
python scripts/benchmark_kernels.py

# Single kernel
python scripts/benchmark_kernels.py --kernel rmsnorm
python scripts/benchmark_kernels.py --kernel rope
python scripts/benchmark_kernels.py --kernel geglu
python scripts/benchmark_kernels.py --kernel adaln
```

### End-to-End Pipeline Benchmark
```bash
# Compare baseline vs Triton vs torch.compile
python scripts/benchmark_e2e.py --mode all

# Quick test
python scripts/benchmark_e2e.py --mode triton --num-frames 9 --steps 5

# Save results for comparison
python scripts/benchmark_e2e.py --mode all --output-json results.json
```

## Target Model: LTX-Video

### Architecture Overview

| Component | Class | Has Weight | Count | Kernel |
|-----------|-------|------------|-------|--------|
| `transformer_blocks.*.norm1` | RMSNorm | **No** (elementwise_affine=False) | 56 | RMSNorm |
| `transformer_blocks.*.norm2` | RMSNorm | **No** | 56 | RMSNorm |
| `transformer_blocks.*.attn1.norm_q` | torch.nn.RMSNorm | Yes | 28 | RMSNorm |
| `transformer_blocks.*.attn1.norm_k` | torch.nn.RMSNorm | Yes | 28 | RMSNorm |
| `transformer_blocks.*.ff` | FeedForward | - | 28 | **GELU** (not GEGLU!) |
| Rotary position encoding | LTXVideoRotaryPosEmbed | - | 1 | RoPE 3D |

**Total RMSNorm modules: 168** (56 with weights, 112 without)

### Target Kernels

| Kernel | Use Case | Input Layout | Key Challenge |
|--------|----------|-------------|---------------|
| **RMSNorm** | Normalization | `[..., hidden_size]` | Weight may be None; 168 instances |
| **RoPE 3D** | Video position encoding | `[batch, t*h*w, heads, head_dim]` | 3D → temporal + spatial decomposition |
| **GEGLU** | Gated activation (SD3/FLUX) | `[batch, seq, 2*hidden]` → `[batch, seq, hidden]` | Gate/value split |
| **AdaLN** | Conditioned normalization (DiT) | `norm(x) * weight * (1+scale) + shift` | Fused norm + condition |

## Supported Hardware

| GPU | Architecture | Wave Size | LDS/CU | Mem BW | Key Feature | Verified |
|-----|-------------|-----------|--------|--------|-------------|:--------:|
| **MI355X** | CDNA3+ (gfx950) | Wave64 | **160 KB** | 8 TB/s | 32 XCDs, XCD Swizzle for GEMM | Yes |
| **R9700** | RDNA4 (gfx1201) | **Wave32** | 64 KB | ~608 GB/s | 256B cacheline, inference-focused | Yes |

> See [MI355X guide](references/mi355x-optimization-guide.md) | [R9700 guide](references/r9700-optimization-guide.md)

## When This Skill Applies

Use this skill when:
- Writing Triton kernels for **RMSNorm, RoPE, GEGLU, AdaLN** on AMD GPUs
- Integrating custom kernels with **diffusers** pipelines (LTX-Video, SD3, FLUX)
- Benchmarking kernel performance against PyTorch baseline on ROCm
- Optimizing existing kernels for MI355X or R9700 architecture
- Debugging ROCm/HIP-specific kernel issues

## Critical ROCm Constraints

### Things That DON'T Work on AMD

```python
# FORBIDDEN - CUDA only, NOT available on ROCm
tl.libdevice.tanh(x)          # Use manual formula below
tl.libdevice.log1p(x)         # Use: tl.log(1.0 + x)
tl.math.tanh(x)               # Also NOT available on ROCm Triton

# Manual tanh (ONLY reliable method on ROCm):
e2x = tl.exp(2.0 * x)
tanh_x = (e2x - 1.0) / (e2x + 1.0)

# FORBIDDEN - Triton limitations on ROCm
break / continue               # Use: tl.where()
min(a, b) / max(a, b)          # Use: tl.minimum(a, b) / tl.maximum(a, b)
```

### Mandatory Environment Variables

```python
import os
os.environ['TRITON_HIP_USE_BLOCK_PINGPONG'] = '1'
os.environ['TRITON_HIP_USE_ASYNC_COPY'] = '1'
```

## Core Kernel Implementations

### 1. RMSNorm (Core Optimization Target)

Row-wise reduction pattern. **168 instances** in LTX-Video, ~5% of total compute.

**CRITICAL: Do NOT autotune BLOCK_D.** Autotune may pick `BLOCK_D < D`, causing partial row processing and wrong results. Always compute `BLOCK_D = triton.next_power_of_2(D)` in the Python wrapper.

```python
@triton.jit
def rmsnorm_kernel(
    x_ptr, weight_ptr, out_ptr,
    stride_x, D,
    eps: tl.constexpr,
    HAS_WEIGHT: tl.constexpr,
    BLOCK_D: tl.constexpr,
):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_D)
    mask = offs < D
    x = tl.load(x_ptr + row * stride_x + offs, mask=mask, other=0.0).to(tl.float32)

    variance = tl.sum(x * x, axis=0) / D
    rms_inv = tl.rsqrt(variance + eps)

    if HAS_WEIGHT:
        w = tl.load(weight_ptr + offs, mask=mask, other=1.0).to(tl.float32)
        out = x * rms_inv * w
    else:
        out = x * rms_inv

    tl.store(out_ptr + row * stride_x + offs, out.to(x.dtype), mask=mask)


def triton_rmsnorm(x, weight=None, eps=1e-6):
    x_2d = x.contiguous().view(-1, x.shape[-1])
    out = torch.empty_like(x_2d)
    M, D = x_2d.shape
    has_weight = weight is not None
    if not has_weight:
        weight = torch.empty(0, device=x.device)

    BLOCK_D = triton.next_power_of_2(D)
    num_warps = 4 if BLOCK_D <= 1024 else (8 if BLOCK_D <= 4096 else 16)
    rmsnorm_kernel[(M,)](
        x_2d, weight, out, x_2d.stride(0), D, eps, has_weight,
        BLOCK_D=BLOCK_D, num_warps=num_warps, num_stages=2,
    )
    return out.view_as(x)
```

**LTX-Video pitfall: Weight may be None!**
```python
has_weight = hasattr(module, 'weight') and module.weight is not None
```

### 2. RoPE 3D (Video Position Encoding)

Element-wise pattern. LTX-Video splits `head_dim` into temporal + spatial components.

**CRITICAL: cos/sin have shape `[seq_len, head_dim]`.** When grid flattens batch dimension (`batch * seq_len`), use `pid_s % seq_len` to index cos/sin, otherwise batch > 1 causes OOB GPU crash.

```python
@triton.jit
def rope_3d_kernel(
    qk_ptr, cos_ptr, sin_ptr, out_ptr,
    seq_len, num_heads, head_dim,
    stride_s, stride_h, stride_d,
    BLOCK_HD: tl.constexpr,
):
    pid_s = tl.program_id(0)  # batch * seq_len
    pid_h = tl.program_id(1)  # head index
    half_dim = head_dim // 2
    offs = tl.arange(0, BLOCK_HD)
    mask = offs < half_dim

    base = pid_s * stride_s + pid_h * stride_h
    x0 = tl.load(qk_ptr + base + offs, mask=mask, other=0.0).to(tl.float32)
    x1 = tl.load(qk_ptr + base + half_dim + offs, mask=mask, other=0.0).to(tl.float32)

    seq_idx = pid_s % seq_len  # wrap for batch > 1
    cos_val = tl.load(cos_ptr + seq_idx * head_dim + offs, mask=mask, other=1.0).to(tl.float32)
    sin_val = tl.load(sin_ptr + seq_idx * head_dim + offs, mask=mask, other=0.0).to(tl.float32)

    out0 = x0 * cos_val - x1 * sin_val
    out1 = x0 * sin_val + x1 * cos_val

    tl.store(out_ptr + base + offs, out0.to(x0.dtype), mask=mask)
    tl.store(out_ptr + base + half_dim + offs, out1.to(x0.dtype), mask=mask)


def triton_rope_3d(qk, cos, sin):
    qk = qk.contiguous()
    out = torch.empty_like(qk)
    batch, seq_len, num_heads, head_dim = qk.shape
    qk_flat = qk.view(batch * seq_len, num_heads, head_dim)
    out_flat = out.view(batch * seq_len, num_heads, head_dim)
    BLOCK_HD = triton.next_power_of_2(head_dim // 2)
    num_warps = 4 if BLOCK_HD <= 64 else 8
    rope_3d_kernel[(batch * seq_len, num_heads)](
        qk_flat, cos, sin, out_flat,
        seq_len, num_heads, head_dim,
        qk_flat.stride(0), qk_flat.stride(1), qk_flat.stride(2),
        BLOCK_HD=BLOCK_HD, num_warps=num_warps, num_stages=2,
    )
    return out
```

### 3. GEGLU (For SD3/FLUX, NOT LTX-Video)

Element-wise gated activation. Input `[batch, seq, 2*hidden]` → Output `[batch, seq, hidden]`.

**Same BLOCK_SIZE rule: compute dynamically, do NOT autotune.**

```python
@triton.jit
def geglu_kernel(
    input_ptr, output_ptr,
    stride_in, stride_out, hidden_size,
    BLOCK_H: tl.constexpr,
):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_H)
    mask = offs < hidden_size

    gate = tl.load(input_ptr + row * stride_in + offs, mask=mask, other=0.0).to(tl.float32)
    value = tl.load(input_ptr + row * stride_in + hidden_size + offs, mask=mask, other=0.0).to(tl.float32)

    # GELU approx — manual tanh (tl.math.tanh NOT available on ROCm)
    k = 0.7978845608028654  # sqrt(2/pi)
    tanh_arg = k * (gate + 0.044715 * gate * gate * gate)
    e2x = tl.exp(2.0 * tanh_arg)
    tanh_val = (e2x - 1.0) / (e2x + 1.0)
    gate_gelu = 0.5 * gate * (1.0 + tanh_val)
    result = gate_gelu * value

    tl.store(output_ptr + row * stride_out + offs, result.to(gate.dtype), mask=mask)


def triton_geglu(x):
    x = x.contiguous()
    *batch_dims, double_h = x.shape
    hidden_size = double_h // 2
    x_2d = x.view(-1, double_h)
    M = x_2d.shape[0]
    out = torch.empty(M, hidden_size, device=x.device, dtype=x.dtype)
    BLOCK_H = triton.next_power_of_2(hidden_size)
    num_warps = 4 if BLOCK_H <= 1024 else (8 if BLOCK_H <= 4096 else 16)
    geglu_kernel[(M,)](
        x_2d, out, x_2d.stride(0), out.stride(0), hidden_size,
        BLOCK_H=BLOCK_H, num_warps=num_warps, num_stages=2,
    )
    return out.view(*batch_dims, hidden_size)
```

**Warning: LTX-Video uses GELU, NOT GEGLU.** GEGLU is for SD3/FLUX.

### 4. AdaLN (Adaptive Layer Normalization for DiT)

Fused normalization + conditioning: `norm(x) * weight * (1 + scale) + shift`

**Same BLOCK_D rule: compute dynamically.**

```python
@triton.jit
def adaln_kernel(
    x_ptr, weight_ptr, scale_ptr, shift_ptr, out_ptr,
    stride_x, stride_cond, D,
    eps: tl.constexpr,
    BLOCK_D: tl.constexpr,
):
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK_D)
    mask = offs < D
    x = tl.load(x_ptr + row * stride_x + offs, mask=mask, other=0.0).to(tl.float32)

    variance = tl.sum(x * x, axis=0) / D
    rms_inv = tl.rsqrt(variance + eps)
    x_norm = x * rms_inv

    w = tl.load(weight_ptr + offs, mask=mask, other=1.0).to(tl.float32)
    scale = tl.load(scale_ptr + row * stride_cond + offs, mask=mask, other=0.0).to(tl.float32)
    shift = tl.load(shift_ptr + row * stride_cond + offs, mask=mask, other=0.0).to(tl.float32)

    out = x_norm * w * (1.0 + scale) + shift
    tl.store(out_ptr + row * stride_x + offs, out.to(x.dtype), mask=mask)


def triton_adaln(x, weight, scale, shift, eps=1e-6):
    x_flat = x.contiguous().view(-1, x.shape[-1])
    scale_flat = scale.contiguous().view(-1, x.shape[-1])
    shift_flat = shift.contiguous().view(-1, x.shape[-1])
    out = torch.empty_like(x_flat)
    M, D = x_flat.shape
    BLOCK_D = triton.next_power_of_2(D)
    num_warps = 4 if BLOCK_D <= 1024 else (8 if BLOCK_D <= 4096 else 16)
    adaln_kernel[(M,)](
        x_flat, weight, scale_flat, shift_flat, out,
        x_flat.stride(0), scale_flat.stride(0), D, eps,
        BLOCK_D=BLOCK_D, num_warps=num_warps, num_stages=2,
    )
    return out.view_as(x)
```

## Diffusers Integration

> **See [diffusers-integration.md](references/diffusers-integration.md) for the complete guide.**

### Minimal Integration Pattern

```python
def patch_rmsnorm_modules(model):
    """Patch all RMSNorm modules to use custom Triton kernel."""
    for name, module in model.named_modules():
        if type(module).__name__ == 'RMSNorm':
            eps = getattr(module, 'eps', 1e-6)
            has_weight = hasattr(module, 'weight') and module.weight is not None
            if has_weight:
                def make_forward(mod, epsilon):
                    def forward(x):
                        return triton_rmsnorm(x, mod.weight, eps=epsilon)
                    return forward
                module.forward = make_forward(module, eps)
            else:
                def make_forward(epsilon):
                    def forward(x):
                        w = torch.ones(x.shape[-1], device=x.device, dtype=x.dtype)
                        return triton_rmsnorm(x, w, eps=epsilon)
                    return forward
                module.forward = make_forward(eps)

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")
patch_rmsnorm_modules(pipe.transformer)
pipe.enable_model_cpu_offload()
```

### Diffusers Critical Pitfalls

1. **RMSNorm weight may be None** — LTX-Video uses `elementwise_affine=False`
2. **Diffusers RMSNorm != torch.nn.RMSNorm** — Use `type(module).__name__` not `isinstance()`
3. **LTX-Video uses GELU, not GEGLU** — Don't patch GEGLU for LTX-Video
4. **Inject BEFORE CPU offloading** — `inject_kernels()` then `enable_model_cpu_offload()`

## Performance Expectations

### Micro-benchmark Results (MI355X, BF16)

| Kernel | Avg Speedup | Best Config Speedup | Status |
|--------|:-----------:|:-------------------:|:------:|
| **RMSNorm** | **1.71x** | 2.44x ([4×4096×3072]) | PASS |
| **RoPE 3D** | **1.21x** | 1.52x ([2×4096×16×128]) | PASS |
| **GEGLU** | **1.43x** | 2.13x ([4×4096×8192]) | PASS |
| **AdaLN** | **2.22x** | 2.77x ([4×4096×3072]) | PASS |

RMSNorm bandwidth utilization: 3554 GB/s (MI355X theoretical: 8 TB/s, ~44%).

### End-to-End LTX-Video (MI355X, 25 frames, 30 steps)

| Mode | Time (s) | Per Step (s) | Peak Mem (GB) | Speedup |
|------|:--------:|:------------:|:-------------:|:-------:|
| baseline | 1.20 | 0.040 | 18.58 | 1.00x |
| **triton** | **0.98** | **0.033** | **18.58** | **1.22x** |
| torch.compile | 0.78 | 0.026 | 18.58 | 1.54x |

**Key finding**: MI355X Triton E2E speedup (22%) is significantly higher than H100 CUDA reference (6%), because MI355X's default PyTorch RMSNorm path has more room for optimization.

### Micro-benchmark Results (R9700, BF16)

| Kernel | Avg Speedup | Best Config Speedup | Status |
|--------|:-----------:|:-------------------:|:------:|
| **RMSNorm** | **2.90x** | 3.97x ([1×8192×2048]) | PASS |
| **RoPE 3D** | **2.09x** | 2.38x ([1×1024×16×64]) | PASS |
| **GEGLU** | **1.69x** | 1.93x ([2×1024×8192]) | PASS |
| **AdaLN** | **3.00x** | 3.67x ([4×4096×3072]) | PASS |

RMSNorm bandwidth utilization: 483 GB/s (R9700 theoretical: ~608 GB/s, **~79%**).

R9700 speedups are higher than MI355X because PyTorch's default RDNA4 backend is less mature, leaving more room for Triton optimization. The bandwidth utilization (79%) is also significantly better than MI355X (44%).

### End-to-End LTX-Video (R9700, 25 frames, 30 steps)

| Mode | Time (s) | Per Step (s) | Peak Mem (GB) | Speedup |
|------|:--------:|:------------:|:-------------:|:-------:|
| baseline (mean of 3) | 6.91 | 0.231 | 18.58 | 1.00x |
| **triton (mean of 3)** | **6.10** | **0.203** | **18.58** | **1.13x** |
| torch.compile (single run) | 5.05 | 0.168 | 18.58 | 1.37x |

Reviewer-facing benchmark files for this comparison live in `examples/ltx-video-benchmark/`, including:

- Summary table with `gen_time_s`, `time_per_step_s`, `peak_memory_gb`, and `speedup`
- Consolidated JSON results in `examples/ltx-video-benchmark/benchmark_results.json`
- OpenCode run outputs in `examples/ltx-video-benchmark/trace/opencode_live/results.json`
- OpenCode parsed trace in `examples/ltx-video-benchmark/trace/opencode_live/opencode_trace_result.json`

### R9700 Additional Validation

| Test | Result |
|------|--------|
| Transformers injection (TinyLlama 1.1B) | PASS — 45 RMSNorm patched, 99.9 tokens/s |
| HuggingFace Kernels Hub integration | PASS — Hub kernel loads and runs on ROCm |
| Local Triton vs Hub kernel (small shape) | Local **5.92x** vs Hub 1.27x (lower launch overhead) |
| Local Triton vs Hub kernel (large shape) | Local 3.59x vs Hub 3.57x (comparable) |
| num_warps sweep (2/4/8/16/32) | Default heuristic (4/8/16) is near-optimal; nw=32 always worst |
| rocprof kernel fusion analysis | Triton fuses 4 PyTorch kernels (pow+mean+rsqrt+mul) into 1 |

### CUDA Reference (H100, for comparison)

| Shape | Custom (ms) | PyTorch (ms) | Speedup |
|:---|:---:|:---:|:---:|
| [1×1024×2048] | 0.019 | 0.065 | **3.37x** |
| [2×4096×3072] | 0.087 | 0.208 | **2.41x** |

H100 E2E: ~6% (RMSNorm is ~5% of total compute).

### Optimization Targets

| Kernel | MI355X | R9700 | Target | Priority |
|--------|:------:|:-----:|:------:|:--------:|
| RMSNorm | 1.71x | 2.90x | >3x (R9700) | P0 — MI355X bandwidth util (44%→60%+) |
| AdaLN | 2.22x | 3.00x | >3.5x (R9700) | P1 — already strong on both |
| GEGLU | 1.43x | 1.69x | >2x | P1 — tanh overhead |
| RoPE 3D | 1.21x | 2.09x | >2.5x (R9700) | P2 — small head_dim launch overhead |

## Common Issues on ROCm

| Issue | Symptom | Fix |
|-------|---------|-----|
| **Autotune BLOCK_D** | Wrong results (max_abs 4-8+) | **Never autotune BLOCK_D.** Use `triton.next_power_of_2(D)` |
| **RoPE batch OOB** | GPU crash (`Memory access fault`) | Use `pid_s % seq_len` for cos/sin indexing |
| `tl.libdevice` | Not found on AMD | Use manual math formulas |
| `tl.tanh` / `tl.math.tanh` | Not on ROCm | Manual: `e2x=exp(2x); (e2x-1)/(e2x+1)` |
| Python min/max | Runtime error | `tl.minimum()`/`tl.maximum()` |
| LDS overflow | HIP OOM | Reduce num_stages to 2 |
| Weight is None | AttributeError | Check `elementwise_affine` |
| isinstance() miss | RMSNorm not patched | Use `type(module).__name__` |

> See [troubleshooting.md](references/troubleshooting.md) for all common issues.

## Performance Profiling

```bash
rocprof --stats python your_kernel.py
rocprofv3 -i metrics.txt python your_kernel.py
rocm-bandwidth-test
rocminfo | grep -E "Name|Compute Unit|Wavefront"
```

## See Also

### Benchmark & Test Scripts
- [benchmark_kernels.py](scripts/benchmark_kernels.py) - Micro-benchmark all 4 kernels (correctness + perf + bandwidth)
- [benchmark_e2e.py](scripts/benchmark_e2e.py) - End-to-end LTX-Video pipeline benchmark (baseline vs Triton vs compile)
- [sweep_num_warps.py](scripts/sweep_num_warps.py) - num_warps sweep for R9700 Wave32 optimization
- [ltx_kernel_injection_example.py](scripts/ltx_kernel_injection_example.py) - Minimal diffusers injection example
- [transformers_injection_example.py](scripts/transformers_injection_example.py) - Minimal transformers injection example
- [huggingface_kernels_example.py](scripts/huggingface_kernels_example.py) - HuggingFace Kernels Hub integration example

### Integration Guides
- [diffusers-integration.md](references/diffusers-integration.md) - LTX-Video pipeline integration
- [transformers-integration.md](references/transformers-integration.md) - LLaMA/Mistral/Qwen integration
- [huggingface-kernels-integration.md](references/huggingface-kernels-integration.md) - HuggingFace Kernels Hub (`get_kernel`)
- [kernel-templates.md](references/kernel-templates.md) - Complete Triton kernel templates (incl. GEMM with XCD Swizzle)

### GPU Optimization Guides
- [mi355x-optimization-guide.md](references/mi355x-optimization-guide.md) - MI355X (gfx950) deep dive
- [r9700-optimization-guide.md](references/r9700-optimization-guide.md) - R9700 (RDNA4) deep dive

### Reference
- [troubleshooting.md](references/troubleshooting.md) - Common issues and solutions
- [kernelbench-classification.md](references/kernelbench-classification.md) - KernelBench operator taxonomy
- [skill-evaluation-methodology.md](references/skill-evaluation-methodology.md) - How to evaluate and improve skills
- [kernel-agent-knowledge-base.md](references/kernel-agent-knowledge-base.md) - Knowledge from kernel-agent project

### External Resources
- [Triton Documentation](https://triton-lang.org/)
- [ROCm Documentation](https://rocm.docs.amd.com/)
- [HuggingFace Kernels Hub](https://huggingface.co/kernels-community)
- [LTX-Video on HuggingFace](https://huggingface.co/Lightricks/LTX-Video)
- [HuggingFace Diffusers](https://huggingface.co/docs/diffusers/en/index)