huggingface-tgi

$ npx mdskill add mkurman/zorai/huggingface-tgi

Serve LLMs instantly with Hugging Face TGI.

  • Uses continuous batching and tensor parallelism for fast inference.
  • Loads models from the Hugging Face model hub and exposes an OpenAI-compatible API.
  • Supports quantization and flash attention to reduce memory use and latency.
  • Streams text responses to any standard OpenAI client.

---
name: huggingface-tgi
description: "HuggingFace Text Generation Inference (TGI). High-performance LLM serving with continuous batching, tensor parallelism, watermarking, and OpenAI-compatible API. Native HF model hub integration."
tags: [tgi, llm-inference, huggingface, serving, text-generation, api, zorai]
---
## Overview

Text Generation Inference (TGI) is a production-ready LLM serving solution from Hugging Face. It provides optimized inference with continuous batching, quantization (GPTQ, AWQ), tensor parallelism, flash attention, and an OpenAI-compatible API.
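
Quantization and tensor parallelism are selected when the server starts. The sketch below is one way to assemble the launch command, assuming the launcher options `--quantize` and `--num-shard`; the helper name `build_tgi_command` is hypothetical and not part of TGI. Note that `awq` and `gptq` expect a checkpoint that was already quantized in that format.

```python
import shlex


def build_tgi_command(model_id, quantize=None, num_shard=None, host_port=8080):
    """Assemble a `docker run` argv list for a TGI container (illustrative helper)."""
    cmd = [
        "docker", "run", "--gpus", "all",
        "-p", f"{host_port}:80",
        "ghcr.io/huggingface/text-generation-inference:latest",
        "--model-id", model_id,
    ]
    if quantize is not None:
        # Assumed launcher option, e.g. "awq" or "gptq".
        cmd += ["--quantize", quantize]
    if num_shard is not None:
        # Assumed launcher option: number of GPUs to shard the model across.
        cmd += ["--num-shard", str(num_shard)]
    return cmd


# Print a copy-pasteable command that shards the model across two GPUs.
print(shlex.join(build_tgi_command("Qwen/Qwen2.5-1.5B-Instruct", num_shard=2)))
```

Returning a list rather than a string keeps the command safe to pass directly to `subprocess.run` without shell quoting issues.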

## Installation

```bash
# Docker deployment (recommended)
docker run --gpus all -p 8080:80 \
  -v "$HOME/models:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-1.5B-Instruct
```
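
The container needs time to download and load the model before it accepts requests. A minimal sketch of a readiness check follows, assuming the server answers `GET /health` with HTTP 200 once it is ready and is published on port 8080 as in the command above. The helper names `http_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:8080/health", timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=http_probe, timeout=300.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Usage: call wait_until_ready() once before sending the first request.
```

The `probe`, `sleep`, and `clock` parameters are injectable so the loop can be exercised without a running server.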

## Client

```python
from openai import OpenAI

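# The OpenAI client requires a non-empty API key; a placeholder works
# when the TGI server is running without authentication.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="tgi",  # placeholder name; the server runs the model it was launched with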
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

## Streaming

```python
# Reuses the `client` created in the Client section.
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None (for example on the final chunk), so fall back to "".
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
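
When the full text is needed after streaming finishes, the chunks can be accumulated while they are displayed. The sketch below assumes chunks shaped like those in the loop above (`choices[0].delta.content`); `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a live server so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=None):
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Some chunks may carry no choices at all; skip them.
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_token is not None:
                on_token(text)  # e.g. print each piece as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo"),
        SimpleNamespace(choices=[])]
print(collect_stream(demo))  # prints "Hello"
```

With a live server, pass the `stream` object directly, for example `text = collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))`.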

## References
- [TGI docs](https://huggingface.co/docs/text-generation-inference)
- [TGI GitHub](https://github.com/huggingface/text-generation-inference)
