exec-local-compile

Name: exec-local-compile
Author: NVIDIA/skills

$npx mdskill add NVIDIA/skills/exec-local-compile

Compile TensorRT-LLM locally on compute nodes with GPUs.

Enables building TensorRT-LLM from source on compute nodes.
Depends on Docker containers and visible NVIDIA GPUs.
Executes build scripts using TensorRT installation paths.
Outputs compiled binaries directly to the container filesystem.

SKILL.md

.github/skills/exec-local-compileView on GitHub ↗

---
name: exec-local-compile
description: Compile TensorRT-LLM on a compute node inside a Docker container. Use this when already on a compute node with GPUs visible.
license: Apache-2.0
metadata:
  author: NVIDIA Corporation
---

# Compile TensorRT-LLM (Local / Compute Node)

Compile TensorRT-LLM from source on a compute node inside a Docker container.

## When to Use

| Scenario | Use This Skill? |
|----------|----------------|
| On a compute node with GPUs visible (`nvidia-smi` works) | Yes |
| On a SLURM login node (no GPUs) | No — use `exec-slurm-compile` instead |

## Prerequisites

- You are inside a Docker/enroot container on a compute node
- `nvidia-smi` succeeds (GPUs visible)
- `/usr/local/tensorrt` exists (TensorRT installation in the container)

## Instructions

### Step 1: Verify Environment

Run `nvidia-smi` to confirm you are on a compute node with GPU access.

### Step 2: Locate the Codebase

`cd` to the TensorRT-LLM repository. If the path is not provided by the user, ask for it.

### Step 3: (Optional) Checkout Branch

If the user specifies a branch (e.g., "compile ToT"), checkout and pull:
```bash
git checkout main && git pull
```

### Step 4: Build

Run the build command (**incremental by default** — omit `-c`/`--clean` unless explicitly requested or the incremental build fails):

```bash
./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks -ccache -a "<arch>" -f --nvtx
```

Replace `<arch>` with the target GPU architecture (see Architecture Reference below). If not specified by the user, auto-detect from `nvidia-smi`.

### Step 5: Install

```bash
pip install -e .[devel]
```

### Step 6: Verify

```bash
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```

## Build Flags

| Flag | Description |
|------|-------------|
| `--trt_root /usr/local/tensorrt` | TensorRT installation path (standard in NVIDIA containers) |
| `--benchmarks` | Build the C++ benchmarks |
| `-a "<arch>"` | Target GPU architecture(s) |
| `--nvtx` | Enable NVTX markers for profiling |
| `-ccache` | Use ccache for faster recompilation |
| `-f` / `--fast_build` | Skip some kernels for faster dev compilation. **Always use for dev builds.** |
| `-c` / `--clean` | Clean build directory before building. Only when needed (see below). |
| `--skip_building_wheel` | Build in-place without creating a wheel file |
| `--no-venv` | Skip virtual environment creation |

## Architecture Reference

| Value | GPU Family |
|-------|-----------|
| `"100-real"` | Blackwell (B200, GB200) |
| `"90-real"` | Hopper (H100, H200) |
| `"89-real"` | Ada Lovelace (L40S) |
| `"80-real"` | Ampere (A100) |
| `"90;100-real"` | Multiple architectures |

## Incremental vs. Clean Builds

**Default to incremental builds** — CMake only recompiles changed files, saving significant time.

Use a **clean build** (`-c`) only when:
- The user explicitly requests a clean/fresh build
- An incremental build fails with linker errors, stale object files, or CMake cache issues
- Major branch changes (e.g., rebasing across many commits) that may invalidate the build cache
- Build system files changed (`CMakeLists.txt`, `*.cmake`)

More from NVIDIA/skills

Skill	Description
accessing-mlflow	Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
ad-add-fusion-transformation	>
ad-conf-check	>
ad-graph-dump	>
ad-model-onboard	>
ad-pipeline-failure-pr	>
add-benchmark	>
aiq-deploy	\|
aiq-research	\|
byob	Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.