tensorrt

Name: tensorrt
Author: mkurman/zorai

$ npx mdskill add mkurman/zorai/tensorrt

Optimize deep learning inference for maximum GPU throughput.

Accelerates production models with FP16 and INT8 quantization.
Integrates with ONNX, PyTorch, TensorFlow, and Triton.
Automates kernel tuning and layer fusion for peak performance.
Delivers benchmarked engine files ready for deployment.

SKILL.md

.github/skills/tensorrt

---
name: tensorrt
description: "NVIDIA TensorRT — deep learning inference optimizer. FP16/INT8/INT4 quantization, kernel auto-tuning, layer fusion, and dynamic shapes. Max throughput on NVIDIA GPUs for production inference."
tags: [inference-optimization, quantized-inference, nvidia-deployment, engine-building, tensorrt]
---
## Overview

TensorRT is NVIDIA's high-performance inference optimizer and runtime for deploying deep learning models on NVIDIA GPUs. Use it when you need lower latency, higher throughput, FP16/INT8 optimization, or production GPU serving from ONNX or TensorFlow/PyTorch exports.

## When to Use

Use this skill when:
- a model already works in PyTorch/TensorFlow but inference is too slow,
- you need FP16 or INT8 deployment on NVIDIA GPUs,
- you are deploying vision, NLP, or embedding models in production,
- you want to serve optimized engines via Triton Inference Server,
- or you need to benchmark GPU inference carefully instead of guessing.

## Install / Environment

TensorRT is usually installed via NVIDIA packages, Docker images, or NGC containers rather than plain pip.

Typical paths:

```bash
# TensorRT itself: work inside an NGC PyTorch or TensorRT container

# ONNX tooling: validating and simplifying the graph often helps before conversion
uv pip install onnx onnxruntime onnxsim polygraphy
```

## Fastest Common Workflow: ONNX -> TensorRT Engine

1. Export model to ONNX.
2. Validate ONNX with ONNX Runtime.
3. Build TensorRT engine with `trtexec`.
4. Benchmark latency/throughput.
5. Integrate engine into app or Triton.

## Export from PyTorch to ONNX

```python
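import torch

# `model` is assumed to be your trained torch.nn.Module, defined elsewhere.
model = model.eval().cuda()  # keep the model on the same device as the dummy input
dummy = torch.randn(1, 3, 224, 224, device='cuda')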

torch.onnx.export(
    model,
    dummy,
    'model.onnx',
    input_names=['input'],
    output_names=['logits'],
    dynamic_axes={'input': {0: 'batch'}, 'logits': {0: 'batch'}},
    opset_version=17,
)
```

## Validate the ONNX model first

```bash
python - <<'PY'
import onnx
m = onnx.load('model.onnx')
onnx.checker.check_model(m)
print('ONNX OK')
PY
```
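
The checker only confirms that the graph is well formed. The workflow also calls for running the model through ONNX Runtime, so a numerical comparison against the original framework is a natural next step. The following is a minimal sketch, not a definitive harness: `run_onnx`, `max_abs_diff`, `outputs_match`, and the `atol` default are illustrative names and values, and `onnxruntime` is imported lazily so the comparison helpers can be used without it.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, atol=1e-4):
    """True when every element differs by at most atol (an assumed tolerance)."""
    return max_abs_diff(reference, candidate) <= atol


def run_onnx(onnx_path, batch):
    """Run one float32 NumPy batch through ONNX Runtime on CPU.

    Returns the first model output as a NumPy array.
    """
    import onnxruntime as ort  # imported here so the helpers above stay dependency-free

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: batch})[0]
```

To use it, flatten both outputs (for example with `.ravel().tolist()`) and pass the framework output as `reference` and the result of `run_onnx('model.onnx', batch)` as `candidate`.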

## Build an FP16 engine

```bash
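# Older releases used --workspace=4096; it was deprecated in favor of --memPoolSize (sizes in MiB).
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.plan \
  --fp16 \
  --memPoolSize=workspace:4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x224x224 \
  --maxShapes=input:32x3x224x224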
```
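
The three shape flags define the range of batch sizes the engine accepts: `--minShapes` and `--maxShapes` bound it, and `--optShapes` is the shape TensorRT tunes for. When builds are scripted, it can help to generate the command from one set of parameters so the three specs cannot drift apart. A possible sketch follows; `shape_arg` and `build_trtexec_cmd` are hypothetical helper names, and only flags already used in the command above are emitted.

```python
import shlex


def shape_arg(input_name, batch, dims):
    """Format one trtexec shape spec such as input:8x3x224x224."""
    return f"{input_name}:" + "x".join(str(d) for d in (batch, *dims))


def build_trtexec_cmd(onnx_path, engine_path, input_name, dims,
                      min_batch, opt_batch, max_batch, fp16=True):
    """Return the trtexec argument list for a dynamic-batch engine build."""
    if not (min_batch <= opt_batch <= max_batch):
        raise ValueError("expected min_batch <= opt_batch <= max_batch")
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    cmd += [
        "--minShapes=" + shape_arg(input_name, min_batch, dims),
        "--optShapes=" + shape_arg(input_name, opt_batch, dims),
        "--maxShapes=" + shape_arg(input_name, max_batch, dims),
    ]
    return cmd


cmd = build_trtexec_cmd("model.onnx", "model_fp16.plan", "input",
                        (3, 224, 224), min_batch=1, opt_batch=8, max_batch=32)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to build
```

Picking `opt_batch` close to the batch size you expect in production is usually the sensible default, since that is the shape the tuner targets.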

## INT8 optimization

Use INT8 only when you have either:
- representative calibration data, or
- a quantization-aware graph, that is, one prepared and exported with explicit Q/DQ (QuantizeLinear/DequantizeLinear) nodes.

```bash
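# Without Q/DQ nodes in the graph or a calibration cache (--calib=<cache file>),
# treat this engine as a timing experiment only, not as an accuracy-preserving build.
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_int8.plan \
  --int8 \
  --fp16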
```
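
Because INT8 can cost accuracy, one way to make the decision explicit is to gate it on a measured metric. The sketch below assumes you have already evaluated the same task metric (for example top-1 accuracy) on the FP16 and INT8 engines; `choose_precision` and the `max_drop` default of 0.01 are illustrative assumptions, not values from TensorRT.

```python
def choose_precision(fp16_metric, int8_metric, max_drop=0.01):
    """Pick 'int8' only if the metric drop versus FP16 stays within max_drop.

    Both metrics are assumed to be "higher is better" and on the same scale.
    """
    drop = fp16_metric - int8_metric
    return "int8" if drop <= max_drop else "fp16"


# Hypothetical numbers, shown only to illustrate the two outcomes.
print(choose_precision(0.761, 0.758))  # small drop: keep the INT8 engine
print(choose_precision(0.761, 0.612))  # large drop: fall back to FP16
```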

## Benchmarking

```bash
trtexec --loadEngine=model_fp16.plan --shapes=input:8x3x224x224
```

Check:
- mean latency
- throughput
- GPU memory
- whether kernels are actually using Tensor Cores
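
`trtexec` prints its own summary, but when you time the engine inside your application it is useful to compute the same figures from raw per-batch latencies. A minimal sketch, assuming latencies are collected in milliseconds; `summarize_latency` and its field names are hypothetical, and the p99 uses the nearest-rank method.

```python
import math
import statistics


def summarize_latency(latencies_ms, batch_size, warmup=0):
    """Summarize per-batch latencies (ms) after dropping `warmup` initial runs."""
    samples = list(latencies_ms)[warmup:]
    if not samples:
        raise ValueError("no samples left after dropping warm-up iterations")
    ordered = sorted(samples)
    mean_ms = statistics.fmean(samples)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        # Items per second: each batch of `batch_size` items takes mean_ms on average.
        "throughput_per_s": batch_size * 1000.0 / mean_ms,
    }
```

Reporting the tail (p99) next to the mean matters for serving, because a good average can hide occasional slow batches.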

## Triton deployment

Recommended production layout:

```text
model_repository/
  my_model/
    1/
      model.plan
    config.pbtxt
```

Minimal `config.pbtxt`:

```text
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```
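
Placing the files by hand is error-prone once several models and versions are involved. The following sketch copies an engine and a config into the layout shown above; `install_engine` is a hypothetical helper, and it assumes the config text is written separately (for example the minimal `config.pbtxt` above).

```python
from pathlib import Path
import shutil


def install_engine(repo_root, model_name, engine_path, config_text, version=1):
    """Copy an engine and its config into a Triton-style model repository.

    Produces <repo_root>/<model_name>/<version>/model.plan and
    <repo_root>/<model_name>/config.pbtxt.
    """
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # The engine is renamed to model.plan, the file name used in the layout above.
    shutil.copyfile(engine_path, version_dir / "model.plan")
    (model_dir / "config.pbtxt").write_text(config_text)
    return model_dir
```

A call such as `install_engine('model_repository', 'my_model', 'model_fp16.plan', config_text)` would reproduce the tree above, with a new `version` number for each rebuilt engine.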

## Common failure modes

- ONNX export contains ops TensorRT does not support -> simplify the graph or change the export path
- dynamic shapes missing -> engine only works for one batch/shape; rebuild with min/opt/max shapes
- INT8 accuracy collapse -> calibrate properly or stay on FP16
- preprocessing mismatch -> model seems broken but input normalization is wrong
- engine built on one GPU architecture and reused on an incompatible target -> rebuild the engine on the target GPU and TensorRT version
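
The last failure mode is easier to catch if each engine carries a record of where it was built. A minimal sketch, assuming you obtain the GPU name and TensorRT version yourself (for example from `nvidia-smi` and `tensorrt.__version__`); `write_build_record` and `engine_matches_target` are hypothetical helpers, and the exact-match check is deliberately strict.

```python
import json
from pathlib import Path


def write_build_record(engine_path, build_cmd, gpu_name, trt_version):
    """Write <engine_path>.json next to the engine with its build details."""
    record = {
        "engine": Path(engine_path).name,
        "build_cmd": build_cmd,
        "gpu_name": gpu_name,
        "tensorrt_version": trt_version,
    }
    sidecar = Path(str(engine_path) + ".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


def engine_matches_target(sidecar_path, gpu_name, trt_version):
    """True only if the recorded GPU name and TensorRT version both match exactly."""
    record = json.loads(Path(sidecar_path).read_text())
    return (record["gpu_name"] == gpu_name
            and record["tensorrt_version"] == trt_version)
```

Comparing GPU names rejects some pairs that would in fact work, which errs on the side of a rebuild rather than a load failure at serving time.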

## Verification checklist

- compare TensorRT output vs PyTorch on the same test batch
- measure top-1/top-k or task metric after conversion
- benchmark multiple batch sizes, not just batch=1
- test warm and cold runs separately
- save build commands alongside the engine artifact
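
For the first two checks on a classifier, top-1 agreement between the two sets of logits is a quick signal before running the full task metric. The sketch below works on plain nested lists (one row of logits per sample), so outputs from either framework need converting first; `argmax` and `top1_agreement` are illustrative names.

```python
def argmax(row):
    """Index of the largest value in one row of logits."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(reference_logits, candidate_logits):
    """Fraction of samples where both models predict the same top-1 class."""
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("batches have different sizes")
    if not reference_logits:
        raise ValueError("empty batch")
    matches = sum(argmax(r) == argmax(c)
                  for r, c in zip(reference_logits, candidate_logits))
    return matches / len(reference_logits)
```

Agreement below 1.0 is not automatically a bug, since near-tied logits can flip under FP16 or INT8, but a large gap usually points to a preprocessing or calibration problem.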