perf-nsight-compute-analysis

Name: perf-nsight-compute-analysis
Author: NVIDIA/skills

$npx mdskill add NVIDIA/skills/perf-nsight-compute-analysis

Diagnose CUDA kernel bottlenecks using NVIDIA Nsight Compute data.

Identifies compute-bound, memory-bound, or latency-bound kernel issues.
Requires NVIDIA GPU with ncu installed for profiling access.
Classifies performance limits via SOL% and roofline analysis.
Delivers structured reports on occupancy, stalls, and cache metrics.

SKILL.md

.github/skills/perf-nsight-compute-analysisView on GitHub ↗

---
name: perf-nsight-compute-analysis
tags: [profiling]
description: >
  Analyze ncu (NVIDIA Nsight Compute) profiling output: SOL% bottleneck
  classification, roofline analysis, occupancy diagnosis, memory hierarchy
  analysis, warp stall analysis, metric interpretation, and programmatic
  .ncu-rep report analysis. NOT for kernel writing or code generation,
  Nsight Systems (nsys), host-side profiling, or system-level profiling.
compatibility: Requires NVIDIA GPU with ncu (Nsight Compute) installed
license: Apache-2.0
metadata:
  author: NVIDIA Corporation
---

# Nsight Compute Analysis

NVIDIA Nsight Compute (`ncu`) profiles individual CUDA kernels to determine
why they are slow and what to optimize. It measures GPU throughput as a
percentage of theoretical peak (Speed of Light / SOL%), enabling systematic
bottleneck classification and targeted optimization.

## When to Use

Reach for this skill when you encounter:

- **Triggers**: User wants to profile a CUDA kernel, analyze `ncu` output,
  interpret `.ncu-rep` reports, or optimize GPU kernel performance
- **Symptoms**: Kernel running slower than expected, low GPU utilization,
  need to classify compute-bound vs memory-bound, occupancy issues
- **Keywords**: "ncu", "nsight compute", "SOL%", "speed of light", "kernel
  profiling", "compute-bound", "memory-bound", "latency-bound", "occupancy",
  "roofline", "warp stalls", "cache hit rate", "ncu-rep"

Do NOT use this skill for:
- System-level profiling (use Nsight Systems / `nsys` instead)
- CUDA API tracing or CPU-GPU timeline analysis (use `nsys`)
- GPU monitoring without profiling (use `nvidia-smi`)

## Requirements

| Dependency | Version | Notes |
|------------|---------|-------|
| CUDA Toolkit | >=11.0 | Includes `ncu` |
| `ncu` binary | Match CUDA version | Or set `$NCU` env var |
| NVIDIA GPU | Kepler+ | Volta+ recommended |

Permissions: `ncu` may require `sudo`, `CAP_SYS_ADMIN`, or `--privileged`
in containers. Check with `ncu -v` first.

## Principles

### Data Integrity

This is a data-driven analysis system. **Every number you present must have
an authoritative source.** Follow these rules without exception:

1. **Quote before you interpret.** When presenting metrics from ncu output,
   always show the actual ncu command you ran AND the relevant raw output
   (CSV lines, metric values) before stating any numeric conclusion.
2. **Never fabricate metrics.** If ncu fails, returns unexpected output, or
   you cannot run it, say so explicitly. Do not invent plausible-looking
   numbers. An honest "profiling failed" is better than fabricated data.
3. **Attribute every value.** For each metric you cite (SOL%, duration,
   occupancy, throughput), the reader must be able to trace it back to a
   specific line in the raw ncu output you showed.

### SOL% Mental Model

Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak:
- **Compute SOL%** = actual compute throughput / peak compute throughput
- **Memory SOL%** = actual memory throughput / peak memory throughput

A kernel cannot saturate both simultaneously. The higher metric reveals the bottleneck type. Use this as the primary classification signal.

### Classification Thresholds

| Compute % | Memory % | Bottleneck | Next Step |
|-----------|----------|------------|-----------|
| >60 | <40 | **Compute-bound** | ComputeWorkloadAnalysis section |
| <40 | >60 | **Memory-bound** | MemoryWorkloadAnalysis section |
| <40 | <40 | **Latency-bound** | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | **Balanced** | Profile deeper with detailed sections |

Additional signals:
- Duration <10us with many launches -> **Launch-overhead bound** (use nsys first)
- Both <40% but occupancy >50% -> **Instruction-bound** (check InstructionStats)

### SOL% Performance Levels

| SOL% | Level | Action |
|------|-------|--------|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |

### Section-First Profiling

Always use targeted `--section` flags instead of bulk `--set` collection. Individual sections are faster and more surgical. Only escalate to `--set basic` or `--set detailed` when broad exploration is needed.

### ncu vs nsys

| Tool | Scope | Overhead | Purpose |
|------|-------|----------|---------|
| **nsys** | System-level | 5-10% | Find which kernels to optimize |
| **ncu** | Kernel-level | 10-100x slower | Understand why a kernel is slow |

Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.

## Workflow

**Choose your path based on the request:**
- **Knowledge query** (what metrics to use, --section vs --set, how to filter kernels):
  Answer directly from Principles, Command Reference, and References below. Do NOT run ncu.
- **Quick diagnosis** (classify bottleneck, check SOL%): Step 1 only. Escalate if user wants more.
- **Specific diagnosis** (bank conflicts, register pressure, occupancy): Quick SOL% check (Step 1),
  then go directly to the relevant section in Step 2.
- **Deep analysis** (detailed report, optimization recommendations): Full Steps 1-5.
  Present the complete structured report with all key metrics (SOL%, duration,
  occupancy) in your final response — do not split the report across messages
  or replace it with a brief summary.

### Step 0: Verify ncu

```bash
ncu -v
# Or: $NCU -v
```

If not found, ensure CUDA toolkit is installed or set `NCU` env var to the binary path.

### Step 1: SOL% Diagnosis

Always start with SpeedOfLight to classify the bottleneck:

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"KERNEL" \
    --launch-skip 5 --launch-count 3 \
    -- COMMAND
```

Read `Compute (SM) Throughput` and `Memory Throughput` from the output. Classify using the thresholds above.

### Step 2: Escalate with Targeted Sections

Based on Step 1 classification, add sections:

| Classification | Sections to Add |
|----------------|-----------------|
| Compute-bound | `ComputeWorkloadAnalysis` |
| Memory-bound | `MemoryWorkloadAnalysis` |
| Latency-bound | `LaunchStats`, `Occupancy` |
| Warp stalls | `WarpStateStats`, `SchedulerStats` |
| Need instruction breakdown | `InstructionStats` |

Always include `LaunchStats` and `Occupancy` when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.

Example -- memory-bound deep dive:
```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding_lookup" \
    --launch-count 3 \
    -- python script.py
```

Example -- compute-bound deep dive:
```bash
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
    --kernel-name regex:"gemm" \
    --launch-count 3 -- python script.py
```

Example -- occupancy investigation:
```bash
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
    --kernel-name regex:"small_kernel" \
    -- python script.py
```

### Step 3: Roofline Analysis (Optional)

For visual understanding of compute vs memory balance:

```bash
ncu --section SpeedOfLight_RooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND
```

For precision-specific hierarchical roofline:

```bash
# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND
```

Interpretation: kernel left of ridge point = memory-bound; right = compute-bound;
far below both roofs = latency/occupancy issue. See `references/roofline-analysis.md`.

### Step 4: Interpret and Optimize

1. Identify the dominant bottleneck from SOL% classification
2. Look up detailed analysis and optimization strategies in `references/bottleneck-guide.md`
3. Apply highest-impact optimization first
4. Re-profile to validate improvement and detect bottleneck shifts

### Step 5: Validate

Re-profile the same kernel after optimization:

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"optimized_kernel" \
    --launch-count 3 \
    -- python optimized_script.py
```

Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?

## Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)

JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:

1. **Warmup first**: Run the kernel 3-5 times to complete JIT compilation and autotuning, then `torch.cuda.synchronize()`.
2. **Use profiler markers**: Bracket the measured region with `cudaProfilerStart()`/`cudaProfilerStop()`.
3. **Use `--profile-from-start off`** so ncu only captures the marked region:

```python
# Warmup (JIT + autotuning)
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()

# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
    torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
```

```bash
ncu --profile-from-start off --section SpeedOfLight --csv \
    --kernel-name regex:"target_kernel" \
    --launch-count 3 -- python script.py
```

Alternative: use `--launch-skip N` to skip autotuning launches. See
`references/advanced-profiling.md` for NVTX range and replay mode alternatives.

## Programmatic Report Analysis

Extract metrics from `.ncu-rep` files using the `ncu_report` Python module
(in `extras/python/` of the Nsight Compute installation):

```python
import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:
        name = action.name()
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()

        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"

        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
```

See `references/python-report-api.md` for the full API (IContext, IRange, IAction, IMetric classes).

## Output Formats

**CSV output** (for scripting and automated analysis):
```bash
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND   # All metrics flat
```

**Report files** (for later analysis):
```bash
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw            # Export to CSV
```

**Key CSV columns:**

| Column | Meaning |
|--------|---------|
| `Kernel Name` | CUDA kernel function name |
| `Duration` | Execution time (nanoseconds) |
| `Compute (SM) Throughput` | % of peak compute |
| `Memory Throughput` | % of peak memory bandwidth |
| `Achieved Occupancy` | Active warps / max warps (%) |

**Success indicators:**
- SOL% values present in output -> profiling succeeded
- Duration values reasonable (not 0 or extremely large)
- Multiple launches captured when `--launch-count > 1`

## Examples

### Example: Classify a GEMM Kernel

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"gemm" \
    --launch-skip 5 --launch-count 3 \
    -- python train.py
```

Output:
```
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
```

Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step:
check tensor core usage with `--section ComputeWorkloadAnalysis`.

### Example: Diagnose a Memory-Bound Embedding Kernel

```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding" \
    --launch-count 3 -- python train.py
```

Check L1/L2 cache hit rates and coalescing efficiency in output. Low hit rates
suggest poor data locality; low coalescing efficiency suggests scattered access.

## Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| `ncu: command not found` | Not in PATH | `export PATH=$PATH:/usr/local/cuda/bin` or set `$NCU` |
| `Permission denied` | Needs elevated privileges | `sudo ncu ...` or `--cap-add=SYS_ADMIN` in containers |
| No kernels captured | Name regex doesn't match | Run without `--kernel-name` first to see actual names |
| Profiling extremely slow | Using `--set full` or many sections | Use `--section SpeedOfLight` only; reduce `--launch-count` |
| Autotuning pollutes results | JIT kernel warmup captured | Use `--profile-from-start off` with profiler markers |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with `--section InstructionStats`; verify dimensions align to 8/16 |
| Report file too large | `--set full` with many kernels | Use targeted sections; limit with `--kernel-name` and `--launch-count` |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| `ncu` hangs on MPI app | Dependent kernels across ranks | Use `--communicator=tcp --lockstep-kernel-launch` |

## Finding More Information

### Tier 1: This File (SKILL.md)

You are reading it now. The section-first workflow and error table above cover
the most common profiling tasks. Search this file first.

### Tier 2: references/ Directory

Grep for keywords across `references/` -- headers are grep-friendly:

- `references/cli-reference.md` -- Complete CLI options, filtering, output formats
- `references/metrics-guide.md` -- Hardware model, metric naming, key metrics
- `references/sections-guide.md` -- All `--section` names, when to use each
- `references/bottleneck-guide.md` -- Per-bottleneck root causes and optimization
- `references/memory-analysis.md` -- Memory hierarchy, cache analysis, coalescing
- `references/roofline-analysis.md` -- Roofline charts and interpretation
- `references/advanced-profiling.md` -- Replay modes, MPI, CUDA graphs, PM sampling, customization
- `references/python-report-api.md` -- `ncu_report` Python module API

**How to search:**
1. `Grep` for your keyword across `references/`
2. `Read` only the file that Grep points to

### Tier 3: Official Documentation

If Tiers 1-2 don't answer:
- [Profiling Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html) -- Metrics, hardware model, analysis concepts
- [CLI Reference](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html) -- Full CLI options
- [Python Report Interface](https://docs.nvidia.com/nsight-compute/PythonReportInterface/index.html) -- `ncu_report` API
- [Customization Guide](https://docs.nvidia.com/nsight-compute/CustomizationGuide/index.html) -- Section files, rules

WebFetch or WebSearch these URLs for the latest content. Consider distilling
new findings back into `references/`.

More from NVIDIA/skills

Skill	Description
accessing-mlflow	Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
ad-add-fusion-transformation	>
ad-conf-check	>
ad-graph-dump	>
ad-model-onboard	>
ad-pipeline-failure-pr	>
add-benchmark	>
aiq-deploy	\|
aiq-research	\|
byob	Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.