tensorrt
$ npx mdskill add mkurman/zorai/tensorrt
Optimize deep learning inference for maximum GPU throughput.
- Accelerates production models with FP16 and INT8 quantization.
- Integrates with ONNX, PyTorch, TensorFlow, and Triton.
- Automates kernel tuning and layer fusion for peak performance.
- Delivers benchmarked engine files ready for deployment.
SKILL.md (.github/skills/tensorrt)
---
name: tensorrt
description: "NVIDIA TensorRT — deep learning inference optimizer. FP16/INT8/INT4 quantization, kernel auto-tuning, layer fusion, and dynamic shapes. Max throughput on NVIDIA GPUs for production inference."
tags: [inference-optimization, quantized-inference, nvidia-deployment, engine-building, tensorrt]
---
## Overview
TensorRT is NVIDIA's high-performance inference optimizer and runtime for deploying deep learning models on NVIDIA GPUs. Use it when you need lower latency, higher throughput, FP16/INT8 optimization, or production GPU serving from ONNX or TensorFlow/PyTorch exports.
## When to Use
Use this skill when:
- a model already works in PyTorch/TensorFlow but inference is too slow,
- you need FP16 or INT8 deployment on NVIDIA GPUs,
- you are deploying vision, NLP, or embedding models in production,
- you want to serve optimized engines via Triton Inference Server,
- or you need to benchmark GPU inference carefully instead of guessing.
## Install / Environment
TensorRT is usually installed via NVIDIA packages, Docker images, or NGC containers rather than plain pip.
Typical paths:
```bash
# Start from an NGC PyTorch or TensorRT container.
# Then add ONNX tooling; graph simplification (onnxsim) often helps before conversion.
uv pip install onnx onnxruntime onnxsim polygraphy
```
## Fastest Common Workflow: ONNX -> TensorRT Engine
1. Export model to ONNX.
2. Validate ONNX with ONNX Runtime.
3. Build TensorRT engine with `trtexec`.
4. Benchmark latency/throughput.
5. Integrate engine into app or Triton.
## Export from PyTorch to ONNX
```python
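import torch

# `model` is assumed to be an already-loaded torch.nn.Module.
model = model.eval().cuda()  # eval mode, on the same device as the dummy input
dummy = torch.randn(1, 3, 224, 224, device='cuda')

torch.onnx.export(
    model,
    dummy,
    'model.onnx',
    input_names=['input'],
    output_names=['logits'],
    # Mark the batch dimension as dynamic so the engine can accept a range of batch sizes.
    dynamic_axes={'input': {0: 'batch'}, 'logits': {0: 'batch'}},
    opset_version=17,
)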
```
## Validate the ONNX model first
```bash
python - <<'PY'
import onnx
m = onnx.load('model.onnx')
onnx.checker.check_model(m)
print('ONNX OK')
PY
```
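The checker only validates the graph structure, while step 2 of the workflow also calls for running the model in ONNX Runtime. The following is a minimal sketch, assuming the `model.onnx` from the export step has one input and one output and that the CPU execution provider is enough for a smoke test. The function names `onnxruntime_smoke_test` and `batch_dim_is_dynamic` are illustrative, not part of any library.

```python
def onnxruntime_smoke_test(onnx_path="model.onnx", batch_sizes=(1, 8)):
    """Run random inputs through ONNX Runtime and return {batch: output shape}."""
    # Imported lazily so the helper below can be used without these packages.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    shapes = {}
    for batch in batch_sizes:
        x = np.random.rand(batch, 3, 224, 224).astype(np.float32)
        (logits,) = session.run(None, {input_name: x})  # assumes a single output
        shapes[batch] = tuple(logits.shape)
    return shapes


def batch_dim_is_dynamic(shapes):
    """True when each output's leading dimension equals the batch size fed in."""
    return all(shape[0] == batch for batch, shape in shapes.items())


# Typical use, on a machine that has model.onnx and onnxruntime installed:
#   shapes = onnxruntime_smoke_test("model.onnx")
#   assert batch_dim_is_dynamic(shapes), shapes
```

If the second batch size fails or returns a batch of 1, the `dynamic_axes` export setting did not take effect, and the TensorRT engine will be limited to one shape as well.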
## Build an FP16 engine
```bash
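# Newer trtexec releases take the workspace limit (MiB) via --memPoolSize; older ones used --workspace=4096.
trtexec --onnx=model.onnx --saveEngine=model_fp16.plan --fp16 --memPoolSize=workspace:4096 \
  --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224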
```
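When the build runs from a script, it can help to generate the shape flags from one place so that min, opt, and max stay consistent. This is a sketch under stated assumptions: `trtexec_build_cmd` is a hypothetical helper, it handles a single dynamic-batch input, and it emits only flags that already appear in the command above.

```python
import shlex


def trtexec_build_cmd(onnx_path, engine_path, input_name, chw,
                      min_batch, opt_batch, max_batch, fp16=True):
    """Return the trtexec argument list for one input with a dynamic batch dimension."""
    if not min_batch <= opt_batch <= max_batch:
        raise ValueError("expected min_batch <= opt_batch <= max_batch")

    def shape(batch):
        # e.g. input:8x3x224x224
        return f"{input_name}:" + "x".join(str(d) for d in (batch, *chw))

    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    cmd += [
        f"--minShapes={shape(min_batch)}",
        f"--optShapes={shape(opt_batch)}",
        f"--maxShapes={shape(max_batch)}",
    ]
    return cmd


cmd = trtexec_build_cmd("model.onnx", "model_fp16.plan", "input", (3, 224, 224), 1, 8, 32)
print(shlex.join(cmd))
# To actually build: subprocess.run(cmd, check=True) on a machine with trtexec on PATH.
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues, and the printed form can be saved next to the engine.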
## INT8 optimization
Use INT8 only when you have either:
- representative calibration data, or
- a graph exported from quantization-aware training (with Q/DQ nodes).
```bash
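# With no calibration cache (--calib=<file>) and no Q/DQ nodes, treat this engine as timing-only, not accuracy-ready.
trtexec --onnx=model.onnx --saveEngine=model_int8.plan --int8 --fp16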
```
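To see why the calibration range matters, here is a toy illustration of symmetric per-tensor INT8 quantization in plain Python. It is a simplification: TensorRT's own calibrators choose the range with their own algorithms, and the activation values below are made up for the example.

```python
def quantize_int8(values, amax):
    """Symmetric per-tensor INT8: scale = amax / 127, round, clamp to [-127, 127]."""
    scale = amax / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


def max_abs_error(values, amax):
    """Worst-case round-trip error for a given calibration range."""
    quantized, scale = quantize_int8(values, amax)
    return max(abs(v - d) for v, d in zip(values, dequantize(quantized, scale)))


activations = [0.02, -0.5, 1.3, 2.9, -3.0]            # made-up sample values
good = max_abs_error(activations, amax=3.0)           # range matches the data
too_wide = max_abs_error(activations, amax=300.0)     # resolution wasted on unused range
too_narrow = max_abs_error(activations, amax=0.5)     # large values get clipped
print(good, too_wide, too_narrow)
```

A range that is too wide or too narrow produces errors orders of magnitude larger than a well-matched one, which is the mechanism behind an INT8 accuracy collapse when calibration data is unrepresentative.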
## Benchmarking
```bash
trtexec --loadEngine=model_fp16.plan --shapes=input:8x3x224x224
```
Check:
- mean latency
- throughput
- GPU memory
- whether kernels are actually using Tensor Cores
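If you also time inference from your own application code, the same numbers can be summarized with a few lines of standard-library Python. This is a sketch, not a replacement for the `trtexec` report: `summarize_latency` is a hypothetical helper, and the latencies below are invented to show a slow cold first run followed by warm runs.

```python
import statistics


def summarize_latency(latencies_ms, batch_size, warmup=0):
    """Summarize per-batch latencies in milliseconds, skipping the first `warmup` runs."""
    timed = sorted(latencies_ms[warmup:])
    if not timed:
        raise ValueError("no timed runs left after warmup")
    mean_ms = statistics.fmean(timed)
    p99_index = min(len(timed) - 1, int(0.99 * len(timed)))
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(timed),
        "p99_ms": timed[p99_index],
        # Samples per second: batch size divided by mean latency in seconds.
        "throughput_per_s": batch_size * 1000.0 / mean_ms,
    }


runs_ms = [50.0, 4.0, 4.0, 4.0, 4.0]  # invented: first run is cold
warm = summarize_latency(runs_ms, batch_size=8, warmup=1)
cold_included = summarize_latency(runs_ms, batch_size=8)
print(warm)
print(cold_included)
```

Reporting warm and cold figures separately avoids a single outlier run dominating the mean.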
## Triton deployment
Recommended production layout:
```text
model_repository/
  my_model/
    1/
      model.plan
    config.pbtxt
```
Minimal `config.pbtxt`:
```text
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```
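Staging the repository by hand is easy to get wrong, since `model.plan` sits inside the version directory while `config.pbtxt` sits one level up. A minimal sketch of a staging helper follows; `stage_triton_model` is a hypothetical name, and it assumes the engine file and config text already exist.

```python
import shutil
from pathlib import Path


def stage_triton_model(repo_root, model_name, engine_path, config_text, version=1):
    """Copy an engine to <repo>/<model>/<version>/model.plan and write config.pbtxt."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    # The engine goes inside the numbered version directory.
    shutil.copyfile(engine_path, version_dir / "model.plan")
    # The config lives next to the version directories, not inside them.
    (model_dir / "config.pbtxt").write_text(config_text)
    return model_dir


# Typical use:
#   stage_triton_model("model_repository", "my_model", "model_fp16.plan",
#                      Path("config.pbtxt").read_text())
```

Note that `max_batch_size` in the config should not exceed the batch size in the `--maxShapes` profile used when the engine was built.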
## Common failure modes
- ONNX export contains ops TensorRT does not support -> simplify the graph or change the export path
- dynamic shapes missing -> engine only works for one batch/shape
- INT8 accuracy collapse -> calibrate properly or stay on FP16
- preprocessing mismatch -> model seems broken but input normalization is wrong
- engine built on one GPU architecture and reused on an incompatible target -> rebuild the engine on the target GPU and TensorRT version
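
The last failure mode can be caught before deployment by recording where an engine was built and comparing that against the target. The sketch below is an assumption-laden illustration: `engine_mismatches` and the metadata keys are hypothetical, and the example values are made up. In practice the values could come from `torch.cuda.get_device_name(0)`, `torch.cuda.get_device_capability(0)`, and `tensorrt.__version__`.

```python
def engine_mismatches(build_info, target_info,
                      keys=("gpu_name", "compute_capability", "tensorrt_version")):
    """Return the metadata keys whose values differ between build and target machines."""
    return [key for key in keys if build_info.get(key) != target_info.get(key)]


# Made-up example values for illustration only.
build_info = {"gpu_name": "GPU-A", "compute_capability": (8, 0), "tensorrt_version": "X.Y.Z"}
same_target = dict(build_info)
other_target = {"gpu_name": "GPU-B", "compute_capability": (8, 6), "tensorrt_version": "X.Y.Z"}

print(engine_mismatches(build_info, same_target))
print(engine_mismatches(build_info, other_target))
```

An empty list means the recorded environment matches; anything else is a signal to rebuild the engine on the target rather than copy the `.plan` file across.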
## Verification checklist
- compare TensorRT output vs PyTorch on the same test batch
- measure top-1/top-k or task metric after conversion
- benchmark multiple batch sizes, not just batch=1
- test warm and cold runs separately
- save build commands alongside the engine artifact
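
The first two checks can be sketched in plain Python. The helpers `max_abs_diff` and `top1_agreement` are hypothetical names, they take logits as nested lists (a NumPy array or CPU tensor can be converted with `.tolist()`), and the numbers below are invented to stand in for real PyTorch and TensorRT outputs.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two batches of logits."""
    return max(
        abs(r - c)
        for ref_row, cand_row in zip(reference, candidate)
        for r, c in zip(ref_row, cand_row)
    )


def top1_agreement(reference, candidate):
    """Fraction of rows where both outputs select the same argmax class."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(r) == argmax(c) for r, c in zip(reference, candidate))
    return matches / len(reference)


# Invented logits for a batch of two samples and three classes.
pytorch_logits = [[0.10, 2.00, -1.00], [1.50, 0.20, 0.30]]
trt_logits = [[0.11, 1.98, -1.02], [1.49, 0.21, 0.30]]

print(max_abs_diff(pytorch_logits, trt_logits))
print(top1_agreement(pytorch_logits, trt_logits))
```

Small numeric drift is expected with FP16 and INT8, so the agreement rate and the task metric are usually more informative than the raw difference alone.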