perf-host-optimization

Name: perf-host-optimization
Author: NVIDIA/skills

$npx mdskill add NVIDIA/skills/perf-host-optimization

Optimize TensorRT-LLM CPU latency with iterative profiling.

Reduces host overhead when GPU utilization drops below sixty percent.
Integrates line_profiler and planned nsys system-level tracing.
Executes profile-analyze-optimize-validate cycles to validate improvements.
Delivers actionable recommendations for PyExecutor throughput gains.

SKILL.md

.github/skills/perf-host-optimizationView on GitHub ↗

---
name: perf-host-optimization
description: Profiles and optimizes TensorRT-LLM host/CPU overhead using line_profiler (with nsys support planned). Runs iterative profile-analyze-optimize-validate rounds. Use when GPU utilization is low or optimizing PyExecutor throughput.
license: Apache-2.0
tags:
  - optimization
  - profiling
  - host-overhead
  - line-profiler
  - inference
metadata:
  author: NVIDIA Corporation
---

# Host Performance Optimization Skill

Automates detection and optimization of host-side (CPU) overhead in TensorRT-LLM's PyTorch backend.

## When to Use

- GPU utilization is low during inference (CPU bottleneck suspected)
- User asks to reduce host overhead or CPU latency
- Optimizing PyExecutor throughput (requests/sec)
- Need line-by-line profiling of specific Python functions

### Detecting a CPU Bottleneck

line_profiler measures *where* CPU time is spent but not *whether* CPU is the bottleneck.
When user asks for confirmation, or there is no clear ending conditions for optimizations, use **nsys** (system-level trace) to confirm CPU is the limiting factor:

| Indicator (from nsys) | Threshold | Meaning |
|----------------------|-----------|---------|
| GPU idle gaps between kernels | >30% of step time | CPU can't feed GPU fast enough |
| `cudaLaunchKernel` API time | >10% of total time | Kernel launch overhead |
| NVTX `_prepare_tp_inputs` range | >50% of step time | Input preparation dominates |
| GPU SM utilization | <60% with no comm/sync bottleneck | CPU-bound inference |

If nsys report is not available, use a rough heuristic: if doubling the batch size does not
proportionally increase GPU utilization or throughput, CPU overhead is likely the bottleneck.

### Using Analysis Skill Results

If the `perf-host-analysis` skill has already been run, use its output to skip the confirmation step and prioritize targets:

1. **Detection verdict**: If YES with host_prep_confirmed, start with `_prepare_tp_inputs`.
2. **NVTX triage** (from Root Cause): The `top_regressing_ops` in the handoff data block maps NVTX range names to source functions. Profile the function with the largest absolute delta first.
3. **Cross-function triage**: When the top NVTX regression is NOT in `_prepare_tp_inputs` (e.g., `_fetch_new_requests`, `broadcast_requests`, `_update_requests`), target that function's source file directly instead of defaulting to `_prepare_tp_inputs`. See [references/trtllm-nvtx-ranges.md](../perf-host-analysis/references/trtllm-nvtx-ranges.md) for the NVTX-to-source mapping.

---

## Profiling Setup

### line_profiler (Primary Method)

**Environment Variables:**
- `TLLM_LINE_PROFILER_ENABLED=True` — Enable the profiler
- `TLLM_LINE_PROFILER_PATH` — Output file path
- `TLLM_LINE_PROFILER_FUNCTIONS` — Additional functions to profile (comma-separated)

**Function specification format:**
```bash
# Class methods: module.path.ClassName.method_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs"

# Standalone functions: module.path::function_name
TLLM_LINE_PROFILER_FUNCTIONS="tensorrt_llm._torch.pyexecutor.sampler::_group_requests_by_strategy_key"

# Multiple functions (comma-separated)
TLLM_LINE_PROFILER_FUNCTIONS="module.Class.method1,module.Class.method2"
```

### CPU Affinity (Environment Factor)

CPU core affinity can significantly affect host overhead measurements,
especially on multi-socket systems (e.g., B300). Pinning processes to cores
near the GPU's NUMA node reduces cross-socket memory access latency.

- Check current affinity: `taskset -p <pid>` or `numactl --show`
- Pin to local NUMA node: `numactl --cpunodebind=<node> --membind=<node>`
- **Impact**: Up to 2x difference in host overhead on B300 systems

When comparing profiling results across runs, ensure CPU affinity is consistent.
Do not externally modify the affinity, unless user requires to do this to examine the affects of this part.
Document the affinity setting in each round's report if it varies.

### Workspace & Suffix Management

Each profiling run should have a unique suffix to track progress across rounds:
```bash
EXTRA_SUFFIX=round0_baseline bash profile.sh
EXTRA_SUFFIX=round1_eliminate_redundant_iter bash profile.sh
```

---

## Autonomous Optimization Loop

Run N rounds (default 3) of the following cycle:

```
FOR round = 1 to MAX_ROUNDS:

  1. PROFILE (with Drill-Down)
  2. ANALYZE (Multi-Option)
  3. OPTIMIZE (Apply Change)
  4. TEST (Unit Test Validation)
  5. VALIDATE (Re-Profile)
  6. REPORT

END FOR → FINAL SUMMARY
```

### Phase 1: PROFILE (with Drill-Down)

- Run workload with profiler enabled
- Parse output: identify functions with highest Total time and lines with highest % Time
- **CRITICAL: Drill down into sub-functions that are not yet profiled** (see below)

#### Drill-Down Profiling

The default profiler covers top-level executor functions but **not all sub-functions**. When a profiled function shows most time in a single sub-call, you must drill down.

**When:** A single line consumes >80% of a function's time calling an unprofiled sub-function:
```
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
  2848      4100  59200000000.0  14439024.4   98.7      output = self.model_engine.forward(...)
```

**How:**
1. Identify the sub-function's full qualified path (e.g., `tensorrt_llm._torch.pyexecutor.model_engine.PyTorchModelEngine._prepare_tp_inputs`)
2. Add it to `TLLM_LINE_PROFILER_FUNCTIONS`
3. Re-profile to get line-level data inside it
4. Now analyze the **inner** hotspots

For common drill-down targets, see [references/hot-path-files.md](references/hot-path-files.md).

### Phase 2: ANALYZE (Multi-Option)

For the chosen hotspot:

1. **Identify** the top hotspots by **absolute time** (not just %) within the target function
2. **Classify** each hotspot by type. Summary table:

| Type | Indicators | Severity |
|------|------------|----------|
| **SYNC** | `.item()`, `.cpu()`, `synchronize()` | Critical |
| **ALLOC** | `torch.zeros/empty/tensor()` in loops, `.clone()` | High |
| **PYLOOP** | `for x in collection:` with many iterations | High |
| **REDUNDANT_ITER** | Multiple passes over the same collection | High |
| **DEAD_WORK** | Object construction whose results are always discarded | High |
| **CONTAINER** | Dict/set lookups in hot loops | Medium |
| **FUNCALL** | Repeated method/property calls | Medium |
| **GIL** | Lock/queue contention | Medium |
| **GC** | Periodic latency spikes, non-deterministic pauses | Low |
| **COMPUTE** | Actual computation (may not be optimizable) | Low |

For detailed classification with code examples, see [references/hotspot-classification.md](references/hotspot-classification.md).

3. **Propose 2-4 optimization options** in a table:

| Option | Description | Estimated Savings | Risk | Complexity |
|--------|-------------|-------------------|------|------------|
| A | ... | ... | Low/Med/High | ... |
| B | ... | ... | ... | ... |

4. **Select the best option** and explain reasoning (prefer high-savings + low-risk)

For optimization patterns by type, see [references/optimization-patterns.md](references/optimization-patterns.md).

### Phase 3: OPTIMIZE (Apply Change)

- Apply the selected code change with Edit tool
- **One optimization per round** — keep changes minimal and targeted
- Record the exact change (file, line range, before/after) for potential rollback

### Phase 4: TEST (Unit Test Validation)

**Mandatory** after each optimization. Find and run related UTs to verify correctness.

**Finding related tests:**
```bash
# Search by modified file name
grep -rl "model_engine\|PyTorchModelEngine" tests/unittest/_torch/executor/

# Search by modified function name
grep -rl "_prepare_tp_inputs\|prepare_inputs" tests/
```

**Running tests:**
```bash
# Run specific test file with stop-on-first-failure
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py -v -x --timeout=120

# Run specific test method
pytest tests/unittest/_torch/executor/test_pytorch_model_engine.py::PyTorchModelEngineTestCase::test_position_id_preparation -v -x
```

For the full UT-to-file mapping, see [references/hot-path-files.md](references/hot-path-files.md).

**If tests fail:**
1. Read the failure message
2. Rollback immediately (`git checkout -- <file>`)
3. Analyze why the optimization broke correctness
4. Try the next-best option from Phase 2

### Phase 5: VALIDATE (Re-Profile)

- Re-run profiler with identical workload, using suffix `round<N>_<description>`
- Compare three things:
  1. Did the **target hotspot** time decrease?
  2. Did the **overall function** Total time decrease?
  3. Did **benchmark metrics** (TPOT, throughput) improve?

**If regression detected** (function time increased or metrics worsened):
- The "optimization" may have triggered a CPython pitfall — see [references/optimization-patterns.md](references/optimization-patterns.md) (CPython Pitfalls section)
- Rollback and try the next-best option from Phase 2

### Phase 6: REPORT

Log for this round:
- Round number
- Hotspot location (file:line) and classification
- Optimization applied (with before/after code summary)
- Time delta: function Total time before → after
- Benchmark delta: TPOT, throughput before → after

---

## Reading Profile Output

```
Timer unit: 1e-06 s
Total time: 1.234 s
File: /path/to/file.py
Function: my_function at line 100

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   100                                           def my_function(self):
   101       500    890000.0   1780.0     72.1       result = tensor.item()
   102       500    234567.0    469.1     19.0       return result
```

**How to read effectively:**
1. Start with **Total time** for each function — this is the overall budget
2. Sort lines mentally by **absolute Time**, not just % Time (3% of a 60s function = 1.8s)
3. Check **Hits count** to understand iteration patterns:
   - Hits = 2 × expected count → `for x in range(1):` loop overhead (2 hits = enter + exit check)
   - Hits ≫ expected → the line is inside a nested loop
4. Look for **repeated patterns**: if 10 lines each take 3% in a loop body, the loop itself costs 30%

---

## Stopping Criteria

Stop the optimization loop when:
1. **Iteration limit reached**: Completed N rounds (default 3)
2. **No actionable hotspots**: Top hotspots are pure GPU compute (COMPUTE type)
3. **Diminishing returns**: < 5% improvement in last 2 rounds
4. **Risk threshold**: Further optimizations require architectural changes (e.g., Cython, struct-of-arrays)
5. **Test failures**: Cannot find an optimization that passes UTs

**Primary success metric**: Benchmark throughput (requests/sec or tokens/sec) as measured by the profiling script. line_profiler time reductions are leading indicators, but throughput is the ground truth — a function-level speedup that doesn't improve throughput is not a real win.

---

## Final Summary Output

The final report should include:
- **Rounds executed**: Number of profile-optimize cycles completed
- **Cumulative improvement**: Total host time reduction (percentage and absolute)
- **Benchmark metrics**: Before/after comparison table (TPOT, throughput, ITL, E2EL)
- **Optimizations applied**: List of changes with file:line locations and classification
- **Failed attempts**: Any optimizations tried and reverted (with why)
- **Remaining hotspots**: Top bottlenecks that couldn't be optimized (with classification)
- **Recommendations**: Suggested follow-up for architectural changes if needed

For a concrete multi-round example, see [references/examples.md](references/examples.md).

---

## Reference Files

| File | Contents |
|------|----------|
| [references/optimization-patterns.md](references/optimization-patterns.md) | Pattern catalog by hotspot type + CPython pitfalls |
| [references/hotspot-classification.md](references/hotspot-classification.md) | Extended per-type indicators and code examples |
| [references/hot-path-files.md](references/hot-path-files.md) | Key file tables, drill-down targets, UT mapping |
| [references/examples.md](references/examples.md) | Usage examples and multi-round walkthrough |
| [trtllm-nvtx-ranges.md](../perf-host-analysis/references/trtllm-nvtx-ranges.md) | TRT-LLM NVTX range reference (from analysis skill) — maps range names to source functions |

More from NVIDIA/skills

Skill	Description
accessing-mlflow	Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
ad-add-fusion-transformation	>
ad-conf-check	>
ad-graph-dump	>
ad-model-onboard	>
ad-pipeline-failure-pr	>
add-benchmark	>
aiq-deploy	\|
aiq-research	\|
byob	Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.