kernel-cute-writing

Name: kernel-cute-writing
Author: NVIDIA/skills
$npx mdskill add NVIDIA/skills/kernel-cute-writing
Generate optimized GPU kernels via CuTe DSL Python API.
Enables element-wise, GEMM, and tensor core kernel implementation.
Depends on cute.jit, cutlass.cute, and CUTLASS 4.x.
Executes only when users request kernel writing or implementation.
Delivers JIT-compiled CUDA kernels through MLIR and ptxas.
SKILL.md

.github/skills/kernel-cute-writingView on GitHub ↗
---
name: kernel-cute-writing
description: >
  Write and implement GPU kernels using NVIDIA CuTe DSL (CUTLASS 4.x Python
  API) — NOT for Triton, CUDA C++, or conceptual explanations.
  Trigger only when the user wants to write or implement a kernel, not when
  asking questions about CuTe DSL concepts or layouts.
  CuTe DSL uses cute.jit/cute.kernel decorators and cutlass.cute imports.
  Covers element-wise kernels, GEMM patterns, reductions, memory hierarchy
  (global/shared/register/TMA), MMA tensor core operations, software
  pipelining, and framework integration.
license: Apache-2.0
metadata:
  author: NVIDIA Corporation
---

# CuTe DSL

CuTe DSL is a Python-based domain-specific language for GPU kernel development,
part of CUTLASS 4.x. It provides Python abstractions over CUTLASS C++ templates
with JIT compilation to optimized CUDA kernels via MLIR and ptxas.

## When to Use

**Triggers:**
- Writing CUDA kernels in Python (element-wise, GEMM, custom ops)
- Optimizing GPU memory access patterns (vectorized loads, TMA, shared memory)
- Building tensor core (MMA) kernels for Ampere/Hopper/Blackwell
- Integrating custom GPU kernels with PyTorch or JAX
- Prototyping high-performance kernels without C++ metaprogramming

**Symptoms (wrong tool otherwise):**
- Need shared memory coordination or tensor core MMA → use CuTe DSL (not Triton for complex patterns)
- Need simple element-wise ops with no shared memory → CuTe DSL or Triton both work
- Need to call existing CUTLASS C++ kernels → use CUTLASS C++ APIs instead
- Need reductions, scans, or non-GEMM collective ops → consider CUB/Thrust

**Keywords:** cute, cutlass, cute.jit, cute.kernel, from_dlpack, zipped_divide,
TiledMMA, TiledCopy, TMA, WGMMA, tcgen05, pipeline, mbarrier

## Requirements

| Requirement | Detail |
|-------------|--------|
| Platform | Linux x86_64 only |
| Python | 3.10–3.13 |
| GPU | NVIDIA Ampere+ (SM80, SM90, SM100) |
| CUDA Driver | ≥ 575.51.03 (Toolkit 12.9 compat) |
| Install | `pip install nvidia-cutlass-dsl` |
| Optional | `apache-tvm-ffi`, `torch-c-dlpack-ext` |

## Workflows

### Workflow 0: Starting from Examples (Recommended)

For any non-trivial kernel (GEMM, attention, pipelined, fused ops), start by
finding the most similar existing example to use as a **starting point** — study
its structure, then rework it for your use case. Do not copy examples verbatim;
they target specific dtypes, architectures, and problem shapes that likely differ.

1. **Pick the closest example** from the index below.
   **Prefer examples matching the target GPU architecture** (check with
   `torch.cuda.get_device_capability()`) when the operation is similar.

   Fetch via `web_fetch` with base URL
   `https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL`

   | Operation | Arch | Example path (append to base URL) |
   |-----------|------|-----------------------------------|
   | Element-wise add | SM80 | `ampere/elementwise_add.py` |
   | Element-wise + autotune | SM80 | `ampere/elementwise_add_autotune.py` |
   | Element-wise apply | SM80 | `ampere/elementwise_apply.py` |
   | SGEMM (scalar) | SM80 | `ampere/sgemm.py` |
   | Tensor-core GEMM | SM80 | `ampere/tensorop_gemm.py` |
   | Flash Attention v2 | SM80 | `ampere/flash_attention_v2.py` |
   | HSTU Attention | SM80 | `ampere/hstu_attention.py` |
   | Shared memory allocator | SM80 | `ampere/smem_allocator.py` |
   | CTA norm (LayerNorm) | SM90 | `hopper/cta_norm.py` |
   | Dense GEMM | SM90 | `hopper/dense_gemm.py` |
   | Dense GEMM persistent | SM90 | `hopper/dense_gemm_persistent.py` |
   | Flash MHA | SM90 | `hopper/fmha.py` |
   | Dense GEMM | SM100 | `blackwell/dense_gemm.py` |
   | Dense GEMM persistent | SM100 | `blackwell/dense_gemm_persistent.py` |
   | Dense GEMM + alpha/beta | SM100 | `blackwell/dense_gemm_alpha_beta_persistent.py` |
   | RMSNorm | SM100 | `blackwell/rmsnorm.py` |
   | Reduce | SM100 | `blackwell/reduce.py` |
   | Flash MHA | SM100 | `blackwell/fmha.py` |
   | Grouped GEMM | SM100 | `blackwell/grouped_gemm.py` |
   | Mamba2 SSD | SM100 | `blackwell/mamba2_ssd/` |
   | GEMM tutorial (notebook) | SM100 | `notebooks/tour_to_sol_gemm.ipynb` |

   **Example:** To fetch the Hopper dense GEMM:
   ```bash
   web_fetch https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL/hopper/dense_gemm.py
   ```

2. **Read reference materials first** — before diving into example code, read
   the relevant `references/` docs to understand the patterns and APIs:
   - For GEMM: `references/patterns-gemm.md` (3-level tiling, epilogue fusion,
     `cute.compile` with `mark_layout_dynamic`, shared memory layouts)
   - For reductions: `references/patterns-reduction.md` (warp reductions,
     `cute.compile` cache pattern)
   - For element-wise: `references/patterns-elementwise.md` (variations A–E)
   - Always: `references/api-arch.md` (available APIs, arch-specific caveats)

   This gives you the conceptual foundation so you can rework the example
   intelligently rather than trying to copy-paste complex pipelines.

3. **Fetch and study the example source** — read for structure, not to copy:
   - Identify: decorators, tiling strategy, shared memory usage, mainloop flow
   - Note which dtype/arch it targets (many examples are fp16/bf16-specific)
   - Check if it uses APIs tied to a specific arch (TMA → SM90+, tcgen05 → SM100)

4. **Rework for the user's workload** (do not copy-paste):
   - Change shapes, data types, tile sizes to match requirements
   - Replace compute logic (epilogue, activation fusion) as needed
   - If dtype differs (e.g., example is fp16, need fp32), expect vectorization
     and layout changes — the scalar-loop patterns in `references/` may be a
     better starting point than adapting a vectorized example
   - **Runtime wrapper must be lightweight**: `kernel_fn()` should only call
     `from_dlpack()` + the compiled kernel. Never allocate intermediate tensors,
     copy data, or re-compile per call — these belong in one-time setup
   - Apply optimizations from this skill's reference docs

   **⛔ Blackwell/Hopper GEMM + extra tensors — STOP:**
   If the target GPU is SM90+ (Hopper/Blackwell) **and** the GEMM requires
   extra tensors beyond A, B, C in the epilogue (e.g., bias vector, activation
   inputs), **do not attempt it**. These examples use TMA descriptors for all
   data movement — adding tensors requires modifying TMA descriptor setup,
   which is prohibitively complex. Instead, tell the user this limitation and
   suggest a **two-kernel approach**: run the GEMM kernel as-is, then apply
   bias + activation in a separate element-wise kernel (Workflow 1).
   Plain GEMM (just A×B→C with scalar alpha/beta) on Hopper/Blackwell is fine.

5. **Validate and benchmark** using companion scripts:
   ```bash
   python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3
   python scripts/benchmark_kernel.py kernel.py
   ```
   The kernel file must export `kernel_fn`, `reference_fn`, and `get_inputs()`.

**When to skip examples:** Pure element-wise operations (Workflow 1) have
complete patterns in `references/patterns-elementwise.md` — no need to fetch
external examples.

**Reduction kernels** (softmax, layernorm, RMSNorm): Use
`references/patterns-reduction.md` which provides complete, proven patterns
for float32 reductions using scalar loops + butterfly shuffle + shared memory.

### Workflow 1: Element-wise Kernel

For unary/binary/in-place operations that map inputs to outputs 1:1.

1. **Determine kernel structure**: inputs/outputs count, tensor rank, target arch
2. **Select pattern** from `references/patterns-elementwise.md` (Variations A–E)
3. **Write kernel** applying all four invariant principles:
   - P1: `from_dlpack(tensor, assumed_align=16)` for vector loads
   - P2: Derive `vec_size` from `element_type.width`
   - P3: `cute.zipped_divide(mA, tiler)` for coalesced access
   - P4: `cutlass.dynamic_expr(thread_idx < total)` for bounds
4. **Critical rules**: No early return, no `a * 2` (use `a + a`), no `cute.math.sigmoid`
5. **Pre-compile with `cute.compile()`**: Always pre-compile the kernel once
   using `cute.compile()` so that `kernel_fn` calls the compiled object, not
   `@cute.jit` directly.  Without pre-compilation, every call recompiles
   (~20-50ms overhead).  Use `.mark_layout_dynamic()` so a single compiled
   kernel handles arbitrary input shapes without recompilation:
   ```python
   # Compile once with dynamic layouts — works for any shape
   fake_x = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"),
                         assumed_align=16).mark_layout_dynamic()
   fake_out = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"),
                           assumed_align=16).mark_layout_dynamic()
   compiled_kernel = cute.compile(host_fn, fake_x, fake_out)

   def kernel_fn(x):
       out = torch.empty_like(x)
       compiled_kernel(from_dlpack(x, assumed_align=16).mark_layout_dynamic(),
                       from_dlpack(out, assumed_align=16).mark_layout_dynamic())
       return out
   ```
6. **Verify correctness** using companion script:
   ```bash
   python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3
   ```
   The kernel file must export `kernel_fn`, `reference_fn`, and `get_inputs()`.
7. **Benchmark** using companion script:
   ```bash
   python scripts/benchmark_kernel.py kernel.py
   ```

### Workflow 2: GEMM Kernel

For matrix multiplication with tiling, shared memory, and tensor cores.

1. **Define problem**: shapes (M, N, K), data types, target architecture
2. **Choose tiling**: CTA tile (bM, bN, bK), pipeline stages, cluster shape
3. **Three-level partitioning** (see `references/patterns-gemm.md`):
   - Level 1: CTA tiling with `local_tile()`
   - Level 2: Copy partitioning (global → shared) with `TiledCopy`
   - Level 3: Compute partitioning (shared → register) with `TiledMMA`
4. **Shared memory**: Use swizzled layouts (`make_smem_layout_atom`) to avoid bank conflicts
5. **Mainloop**: K-tile loop with copy → sync → MMA → sync
6. **Pipeline**: Use `PipelineTmaAsync` (Hopper) or `PipelineTmaUmma` (Blackwell).
   ⚠️ TMA-based pipelines manage data movement via TMA descriptors — adding
   extra tensors (bias, activation inputs) to the epilogue requires modifying
   descriptor setup, which is prohibitively complex. See the stop condition in
   Workflow 0 step 4.
7. **Epilogue**: Predicated store with alpha/beta scaling
8. **Pre-compile with `cute.compile()`**: Always pre-compile the GEMM kernel
   so `kernel_fn` calls the compiled object, not `@cute.jit` directly.
   Without pre-compilation, every call recompiles (~20-50ms overhead).
9. **Autotune**: Search over tile sizes, cluster shapes, pipeline depths

### Workflow 3: Framework Integration

For wrapping CuTe DSL kernels as PyTorch/JAX custom operators.

1. **Write kernel** using Workflow 1 or 2
2. **Create wrapper**: Accept `torch.Tensor`, convert via `from_dlpack`, call host fn
3. **For production**: Compile with TVM FFI for zero-overhead tensor passing:
   ```python
   compiled = cute.compile(host_fn, *fake_tensors, options="--enable-tvm-ffi")
   compiled(torch_a, torch_b)  # Direct torch.Tensor, no from_dlpack
   ```
4. **For deployment**: Use AOT compilation → export to `.o` → load at runtime

### Workflow 4: Debugging & Profiling

1. **Set environment**: `CUTE_DSL_PRINT_IR=1`, `CUTE_DSL_KEEP_PTX=1`
2. **Use `cute.printf()`** for runtime values (not Python `print`)
3. **Inspect generated code**: `compiled.__ptx__`, `compiled.__mlir__`
4. **Profile**: Enable `CUTE_DSL_LINEINFO=1`, use Nsight Compute/Systems
5. **Debug memory**: Run with `compute-sanitizer python script.py`

## Output Formats

A typical CuTe DSL kernel project:

```
kernel_dir/
  kernel.py          # @cute.kernel + @cute.jit functions
  test_kernel.py     # Correctness test vs PyTorch reference
  bench_kernel.py    # Benchmark with cute.compile() setup
```

**Success indicators:**
- Correctness test passes (`torch.testing.assert_close`)
- Nsight shows vector loads (LDG.128/LDG.256), not scalar loads
- For GEMM: tensor core utilization > 80% in Nsight Compute

## Companion Script Contract

Kernel files used with `scripts/verify_kernel.py` and `scripts/benchmark_kernel.py`
must export three names:

- `kernel_fn(*inputs)` — the CuTe DSL kernel wrapper (calls `cute.compile` + runs kernel)
- `reference_fn(*inputs)` — PyTorch reference implementation (same signature)
- `get_inputs()` — returns a list of CUDA tensors for testing

```python
# Example kernel.py contract
import torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

def kernel_fn(x):
    out = torch.empty_like(x)
    # ... call compiled cute kernel ...
    return out

def reference_fn(x):
    return torch.nn.functional.gelu(x)

def get_inputs():
    return [torch.randn(1024, 512, dtype=torch.float16, device="cuda")]
```

## Examples

### Example: 2D Unary Element-wise (ReLU)

```python
import torch, cutlass, cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

@cute.kernel
def relu_kernel(gA: cute.Tensor, gC: cute.Tensor):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()
    idx = bidx * bdim + tidx
    m, n = gA.shape[1]
    total = m * n
    if cutlass.dynamic_expr(idx < total):
        a = gA[(None, (idx // n, idx % n))].load()
        gC[(None, (idx // n, idx % n))] = cute.where(a > 0, a, 0)

@cute.jit
def relu_host(mA: cute.Tensor, mC: cute.Tensor):
    vec = 16 // (mA.element_type.width // 8)
    gA = cute.zipped_divide(mA, (1, vec))
    gC = cute.zipped_divide(mC, (1, vec))
    T = 256
    N = cute.size(gA.shape[1])
    relu_kernel(gA, gC).launch(grid=((N+T-1)//T,1,1), block=(T,1,1))

x = torch.randn(1024, 512, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
relu_host(from_dlpack(x, assumed_align=16), from_dlpack(out, assumed_align=16))
```

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `MLIR function requires a Context` | Called @kernel from Python | Launch via @cute.jit host function |
| `DSLAstPreprocessorError` on return | Early return in @kernel | Use `if cutlass.dynamic_expr(cond):` |
| Type mismatch on store | `a * 2` promotes FP16→FP32 | Use `a + a` or `.to(cutlass.Float16)` |
| `could not get source code` | Kernel in `exec()` context | Write to file and import |
| Scalar loads in Nsight | Missing alignment hint | Add `assumed_align=16` to `from_dlpack` |
| `Missing required argument` | Not all @jit params passed | Pass ALL declared parameters |
| `AttributeError: sigmoid` | No `cute.math.sigmoid` | Use `1.0/(1.0+cute.math.exp(-x))` |

See `references/troubleshooting.md` for the full error table and limitations.

**Debugging rule:** Never delete kernel.py during debugging. Use `backup_file`
to save a checkpoint, then `edit_file` to iterate. If stuck, `revert_file` to
restore the backup. A partially-working kernel is always better than no kernel.

## Finding More Information

### Tier 1: This File (SKILL.md)

Workflows above cover element-wise kernels, GEMM, framework integration, and
debugging. Search this file first for procedural questions.

### Tier 2: references/ Directory

Grep for keywords across `references/`. Headers are grep-friendly.

| File | Content |
|------|---------|
| `concepts-architecture.md` | Core abstractions, terminology, compilation pipeline |
| `concepts-layouts.md` | Layout algebra: composition, complement, divide, swizzle |
| `concepts-tensors.md` | Tensor types, partitioning, tiling, predication |
| `concepts-mma.md` | MMA atoms, TiledMMA, per-architecture tensor core ops |
| `patterns-getting-started.md` | Installation, decorators, first kernel walkthrough |
| `patterns-elementwise.md` | Invariant principles, pattern variations, reference impl |
| `patterns-gemm.md` | 3-level tiling, shared memory, pipelining, autotuning |
| `patterns-memory.md` | from_dlpack, TMA, cp.async, TMEM, copy atoms |
| `patterns-compilation.md` | Control flow, JIT caching, TVM FFI, AOT compilation |
| `patterns-pipeline.md` | Producer-consumer, pipeline classes, barriers, warp specialization |
| `api-core.md` | cute module: layouts, tensors, math, copy, gemm, printing |
| `api-arch.md` | cute.arch: thread indexing, sync, atomics, memory ops |
| `api-nvgpu.md` | cute.nvgpu: warp/warpgroup/cpasync/tcgen05 MMA and copy |
| `api-runtime-utils.md` | Runtime: from_dlpack, fake tensors, utils, schedulers |
| `troubleshooting.md` | Debugging, env vars, common errors, limitations, FAQ |

**How to search:** Grep for your keyword across `references/`. Read only the
file and section that Grep points to.

### Tier 3: Original Documentation

If Tiers 1–2 don't answer, consult the source:
- **Web**: https://docs.nvidia.com/cutlass/latest/
- **GitHub**: https://github.com/NVIDIA/cutlass
- Fetch specific doc pages or search for "CUTLASS CuTe DSL <topic>"
- Consider distilling the answer back into references/
More from NVIDIA/skills

Skill	Description
accessing-mlflow	Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
ad-add-fusion-transformation	>
ad-conf-check	>
ad-graph-dump	>
ad-model-onboard	>
ad-pipeline-failure-pr	>
add-benchmark	>
aiq-deploy	\|
aiq-research	\|
byob	Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.