xpu-kernels
$
npx mdskill add huggingface/kernels/xpu-kernelsOptimize Triton kernels for Intel GPUs using Xe-Forge trial loops.
- Generates high-performance kernels for Battlemage and Arc Pro B50 hardware.
- Integrates Xe-Forge framework with Bash execution and kernel analysis tools.
- Executes automated trial loops to benchmark and validate kernel variants.
- Delivers optimized code through iterative analysis and finalization steps.
SKILL.md
.github/skills/xpu-kernelsView on GitHub ↗
---
name: xpu-kernels
description: "Provides guidance for writing, optimizing, and benchmarking Triton kernels for Intel XPU GPUs (Battlemage/Arc Pro B50) using the Xe-Forge optimization framework. Includes an LLM-driven trial-loop workflow (analyze, validate, benchmark, profile, finalize), XPU-specific patterns (tensor descriptors, GRF mode, tile swizzling), KernelBench fused kernels, and Flash Attention."
disable-model-invocation: false
user-invocable: true
allowed-tools: "Read, Grep, Glob, Bash"
argument-hint: "kernel type: gemm, reduction, flash-attention, optimize, benchmark, tensor-descriptors, xe-forge"
---
# XPU Triton Kernels for Intel GPUs
This skill provides patterns and guidance for developing optimized Triton kernels targeting Intel XPU GPUs (Battlemage/Arc Pro B50). It integrates the [Xe-Forge](https://github.com/IntelLabs/Xe-Forge) optimization framework — an LLM-driven loop that transforms PyTorch code into fast Triton kernels.
## Quick Start
### Optimize a Kernel (Xe-Forge Workflow)
The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.
```bash
# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py
# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py
# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py
# 4. Benchmark correctness + performance
python scripts/benchmark.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py my_kernel.py
# 5. Profile with VTune (optional)
python scripts/xpu_profiler.py my_kernel.py
# 6. Finalize best trial
python scripts/trial_manager.py finalize 70_Gemm_Sigmoid optimized_triton.py
```
## Supported Hardware
| GPU | Architecture | XVEs | Mem BW | Key Feature | Verified |
|-----|-------------|------|--------|-------------|:--------:|
| **Battlemage G21 / Arc Pro B50** | Xe2 | 128 | ~500 GB/s | Tensor descriptors, GRF 256 | Yes |
> See the [Intel XPU Backend for Triton](https://github.com/intel/intel-xpu-backend-for-triton) for supported hardware.
## When This Skill Applies
Use this skill when:
- Optimizing PyTorch operations into Triton kernels for **Intel XPU**
- Writing GEMM, fused kernels, reductions, or Flash Attention for Intel GPUs
- Running the **Xe-Forge optimization loop** (analyze → validate → benchmark → profile → finalize)
- Benchmarking kernel performance against PyTorch baseline on XPU
## Xe-Forge Optimization Workflow
Transform PyTorch code into optimized Triton kernels for Intel XPU. Kernels must be numerically equivalent and faster than baseline.
### Configuration — Read `config.yaml` first
At the start of every session, read `scripts/config.yaml`. It controls:
- **`max_trials`** — hard cap on optimization trials; always run all of them (use this instead of hardcoded "10")
- **`vtune_enabled`** — if `false`, skip ALL VTune profiling steps (Step 3.6 and profiler-related decisions)
- **`vtune_bin`** — path to the VTune binary (also settable via `VTUNE_BIN` env var)
### Rules — Never Violate
1. **ONLY create** Triton kernel files (`test_kernels/*_triton.py` or trial files `t<trial_id>.py`).
2. **NEVER create** benchmark scripts, test scripts, helper utilities, or any other Python files.
3. **NEVER write custom scripts** to measure performance or test correctness — ONLY use `scripts/benchmark.py`.
4. If a tool fails, **STOP and report the error**. Do NOT work around it with custom scripts.
5. Generated kernels must be **self-contained** — all helper functions inline.
6. You **MUST run all `max_trials` trials** from `config.yaml`. Do NOT stop early due to plateau — LLM sampling can discover new ideas at any point. The only valid early stop is speedup > 5x.
### Mandatory Tools
**CRITICAL — Single-XPU serialization**: There is only ONE XPU on this machine. You MUST NOT run multiple GPU workloads in parallel. `benchmark.py` and `xpu_profiler.py` must execute strictly one at a time — concurrent GPU jobs produce wrong results. CPU-only tools (`analyze_kernel.py`, `validate_triton.py`, `trial_manager.py`) are safe to parallelize with each other and with anything else.
| Tool | Command | Purpose |
|------|---------|---------|
| **Analyze** | `python scripts/analyze_kernel.py <file>` | Static analysis: operations, shapes, fusion opportunities |
| **Validate** | `python scripts/validate_triton.py <file>` | Syntax + constraint checks before GPU time |
| **Benchmark** | `python scripts/benchmark.py <baseline> <triton> [--triton-baseline] [--baseline-us <cached>]` | Correctness + performance via ai-bench |
| **Profile** | `python scripts/xpu_profiler.py <file>` | VTune GPU hardware counters + recommendations |
| **Init trials** | `python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]` | Initialize trial tracking |
| **Save trial** | `python scripts/trial_manager.py save <kernel_name> <file> [--parent <parent_id>] [--strategy "..."]` | Save trial to tree |
| **Record result** | `python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass\|fail> --speedup <float> --baseline_us <float> --triton_us <float>` | Record benchmark result |
| **Check status** | `python scripts/trial_manager.py status <kernel_name>` | View trial tree |
| **Best trial** | `python scripts/trial_manager.py best <kernel_name>` | Get best trial |
| **Baseline time** | `python scripts/trial_manager.py baseline-us <kernel_name>` | Cached baseline time for `--baseline-us` |
| **Finalize** | `python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py` | Copy best trial to output |
### Workflow Steps
#### Step 1: Analyze
- Read the baseline source file. Identify shapes, dtypes, operations, fusion opportunities.
- If baseline is PyTorch: run `python scripts/analyze_kernel.py <pytorch_file>`.
- If baseline is Triton (`--triton-baseline`): skip `analyze_kernel.py` (it only supports PyTorch). Read the Triton file directly.
- Read relevant knowledge base files: start with `references/correctness.yaml` and `references/xpu_optimizations.yaml`.
- Read `references/implementation_reference.md` for templates and the Model class pattern.
#### Step 2: Initialize
```bash
python scripts/trial_manager.py init <kernel_name> <baseline_file> [--triton-baseline]
```
#### Step 3: Trial Loop (always run all `max_trials` from config.yaml)
For each trial:
1. **Write kernel** — start from templates or modify previous trial. See `references/implementation_reference.md`.
2. **Validate** — `python scripts/validate_triton.py <triton_file>` (fix until passing; doesn't count as a trial).
3. **Save** — `python scripts/trial_manager.py save <kernel_name> <triton_file> --parent <parent_id> --strategy "description"`. Omit `--parent` for the first trial (t0).
4. **Benchmark** (MANDATORY every trial):
- **Trial t0:** `python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline]` (measures both baseline and triton).
- **Trials t1+:** Get cached baseline via `python scripts/trial_manager.py baseline-us <kernel_name>`, then run `python scripts/benchmark.py <baseline_file> <triton_file> [--triton-baseline] --baseline-us <cached_value>` (skips baseline perf, saves time).
- **After `finalize`:** Re-run `benchmark.py` without `--baseline-us` for final accurate comparison.
5. **Record** — `python scripts/trial_manager.py result <kernel_name> <trial_id> --validation pass --correctness <pass|fail> --speedup <float> --baseline_us <float> --triton_us <float>` (runtimes from benchmark output).
6. **Profile (MANDATORY after t1, if `vtune_enabled` is true in config.yaml)** — Run `python scripts/xpu_profiler.py <triton_file>` after your first benchmarked trial. Use its output to guide subsequent trial strategies. Run again if speedup plateaus after 2+ additional trials. **Skip this step entirely if `vtune_enabled` is false.**
7. **Decide next action** (use profiler output from step 6 to inform decisions):
- Speedup > 5x → stop (excellent), finalize
- Speedup improved → continue on this branch, try next optimization level
- Speedup regressed → branch back to best trial, try different strategy
- Correctness failed → fix on same branch
- Profiler says low occupancy (if vtune_enabled) → increase tile sizes, check `references/xpu_optimizations.yaml`
- Profiler says overhead kernels dominate (if vtune_enabled) → pre-pack to bf16, see `references/optimization_levels.yaml`
- Plateau → do NOT stop. Try a fundamentally different approach (different algorithm, tiling, fusion strategy). LLM sampling can discover new ideas.
- See `references/optimization_strategies.md` for the full "try harder" decision tree
#### Step 4: Finalize
```bash
python scripts/trial_manager.py finalize <kernel_name> <name>_triton.py
```
### Reference Docs — Read During Step 1
| Doc | Contents |
|-----|----------|
| `references/implementation_reference.md` | Code templates, Model class pattern, GEMM example |
| `references/optimization_strategies.md` | Strategy reference, optimization levels, checklist |
| `references/workflow_details.md` | Detailed workflow, decision tree, benchmarking/validation details |
| `references/correctness.yaml` | Critical constraints to avoid bugs |
| `references/xpu_optimizations.yaml` | XPU-specific patterns (tensor descriptors, GRF, swizzling) |
| `references/fusion_patterns.yaml` | When to fuse vs split operations |
| `references/optimization_levels.yaml` | Progressive optimization with "try harder" decision tree |
### Existing Baselines Are Naive
The `test_kernels/*.py` Triton files (non-pytorch) are **unoptimized baselines**. They use manual pointer arithmetic, lack autotune, and miss XPU optimizations. Do NOT copy their patterns. Use `references/implementation_reference.md` instead.
## Core XPU Kernel Patterns
### Tensor Descriptors (Preferred on XPU)
Tensor descriptors produce better address generation and memory access codegen than block pointers on Intel XPU.
```python
desc = tl.make_tensor_descriptor(
base=ptr, shape=[M, N],
strides=[stride_m, stride_n],
block_shape=[BLOCK_M, BLOCK_N],
)
block = tl.load(desc, [pid_m, pid_n], boundary_check=(0, 1))
```
### GRF Mode '256'
Use the large register file for compute-heavy kernels:
```python
@triton.autotune(
configs=[triton.Config({'BLOCK_M': 256, 'BLOCK_N': 256}, num_warps=32)],
key=['M', 'N', 'K'],
)
@triton.jit(launch_metadata=lambda *args, **kwargs: {'grf_mode': '256'})
def kernel(...):
...
```
### Tile Swizzling
Use 1D grid with GROUP_SIZE_M for L2 locality:
```python
grid = lambda META: (triton.cdiv(M, META['BLOCK_M']) * triton.cdiv(N, META['BLOCK_N']),)
# Inside kernel:
pid = tl.program_id(0)
num_pid_n = tl.cdiv(N, BLOCK_N)
group_id = pid // (GROUP_SIZE_M * num_pid_n)
```
### bf16 Inputs with fp32 Accumulation
```python
a = tl.load(a_desc, [pid_m, k], boundary_check=(0, 1))
b = tl.load(b_desc, [k, pid_n], boundary_check=(0, 1))
acc += tl.dot(a.to(tl.bfloat16), b.to(tl.bfloat16), acc=acc) # fp32 accumulator
```
## Critical XPU Constraints
- **NO default values** for `@triton.autotune` meta-parameters in kernel signature
- **1D grid** when using tile swizzling (GROUP_SIZE_M)
- **`boundary_check`** uses dimension indices `(0, 1)`, not booleans
- **Cast batch indices** to `int64` before stride multiplication
- **Prefer tensor descriptors** over block pointers for all new XPU kernels
- **Do NOT mix** block pointer and tensor descriptor APIs on same operation
- **Pre-zero output buffers** when using atomic accumulation
- Model class must be compatible with ai-bench (`nn.Module` with `nn.Linear`)
- Match `get_inputs()`, `get_init_inputs()`, and module-level constants from `*_pytorch.py`
> Full constraint list: [correctness.yaml](references/correctness.yaml)
## Performance Results
Measured on Intel Battlemage G21 / Arc Pro B50 (128 XVEs). All runtimes are median of benchmark trials.
### KernelBench Level 2 — Fused Kernels (bf16)
Speedup is vs. PyTorch eager baseline. Includes GEMM+Sigmoid+Scaling, GEMM+GELU+Softmax, Conv+BatchNorm+ReLU, and other fused patterns.
### Flash Attention Forward (fp16)
Baseline is the flash attention kernel from the Intel XPU Triton backend; speedup is vs. that kernel across multiple sequence lengths.
> Full results: see the [Xe-Forge repository](https://github.com/IntelLabs/Xe-Forge).
## Common Issues
| Issue | Symptom | Fix |
|-------|---------|-----|
| **Autotune BLOCK_D** | Wrong results (max_abs 4-8+) | **Never autotune BLOCK_D.** Use `triton.next_power_of_2(D)` |
| Python min/max | Runtime error | `tl.minimum()`/`tl.maximum()` |
## Project Structure
```
xpu-kernels/
├── SKILL.md # This file (skill definition + workflow)
├── manifest.txt # Files included in this skill
│
├── scripts/ # Standalone CLI tools
│ ├── analyze_kernel.py # PyTorch → operations, shapes, fusion opportunities
│ ├── validate_triton.py # Syntax + constraint checks
│ ├── benchmark.py # Correctness + performance via ai-bench
│ ├── trial_manager.py # Tree-structured trial management
│ ├── xpu_profiler.py # VTune GPU hardware counters
│ ├── config.py # Shared configuration loader
│ ├── config.yaml # Session config (max_trials, vtune)
│ └── requirements.txt # Python dependencies
│
└── references/ # Knowledge base + integration guides
├── correctness.yaml # Hard constraints for XPU Triton
├── xpu_optimizations.yaml # Tensor descriptors, GRF, swizzling
├── implementation_reference.md # Code templates, Model class pattern
├── implementation_reference.md # Code templates, Model class pattern
├── optimization_strategies.md # Strategy reference + "try harder" tree
├── optimization_levels.yaml # Progressive L1-L5 optimization levels
├── workflow_details.md # Detailed workflow and decision tree
├── fusion_patterns.yaml # When to fuse vs split
├── memory_patterns.yaml # Access patterns and coalescing
├── dtype_optimizations.yaml # Mixed precision choices
├── persistent_kernel_patterns.yaml # Stream K and persistent kernels
├── kernel-templates.md # Triton kernel templates for XPU
└── kernelbench-classification.md # KernelBench operator taxonomy
```
## See Also
### Xe-Forge Tools
- [analyze_kernel.py](scripts/analyze_kernel.py) — Static analysis of PyTorch reference
- [validate_triton.py](scripts/validate_triton.py) — Pre-benchmark constraint checks
- [benchmark.py](scripts/benchmark.py) — Correctness + performance measurement
- [xpu_profiler.py](scripts/xpu_profiler.py) — VTune GPU hardware counters
- [trial_manager.py](scripts/trial_manager.py) — Branching trial tree management
### XPU Optimization References
- [correctness.yaml](references/correctness.yaml) — Critical constraints
- [xpu_optimizations.yaml](references/xpu_optimizations.yaml) — Tensor descriptors, GRF, swizzling
- [optimization_strategies.md](references/optimization_strategies.md) — Strategy reference
- [optimization_levels.yaml](references/optimization_levels.yaml) — Progressive L1-L5 levels
- [implementation_reference.md](references/implementation_reference.md) — Code templates
### Other References
- [kernelbench-classification.md](references/kernelbench-classification.md) — KernelBench operator taxonomy
### External Resources
- [Xe-Forge Repository](https://github.com/IntelLabs/Xe-Forge)
- [AI-Bench](https://github.com/libxsmm/AI-bench) — Benchmark harness for correctness + performance
- [Intel XPU Backend for Triton](https://github.com/intel/intel-xpu-backend-for-triton)
- [Triton Language Guide](https://triton-lang.org/)
More from huggingface/kernels
- cpu-kernelsProvides guidance for writing, optimizing, and benchmarking C++ CPU kernels with SIMD intrinsics (AVX2/AVX512) for the Hugging Face kernels ecosystem. Includes a two-phase workflow: Phase 1 correctness (generic → AVX2) and Phase 2 performance exploration (AVX512 with branching trial loop), runtime CPU dispatch, OpenMP threading, and brgemm integration for GEMM-heavy kernels.
- cuda-kernelsProvides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Includes integration with HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels. Includes benchmarking scripts to compare kernel performance against baseline implementations.
- rocm-kernelsProvides guidance for writing and benchmarking optimized Triton kernels for AMD GPUs (MI355X, R9700) on ROCm, targeting HuggingFace diffusers (LTX-Video, SD3, FLUX) and transformers. Core kernels: RMSNorm, RoPE 3D, GEGLU, AdaLN. Includes XCD swizzle, autotune, diffusers integration patterns, and LTX-Video pipeline injection.