vllm
$ npx mdskill add mkurman/zorai/vllm
Serve LLMs at high speed with vLLM's optimized inference engine.
- Delivers rapid responses through continuous batching and PagedAttention.
- Integrates with OpenAI-compatible clients and supports major model families.
- Supports speculative decoding to reduce generation latency.
- Returns generated text through the HTTP API or the Python API.
SKILL.md
.github/skills/vllm
---
name: vllm
description: "Fast LLM inference engine. PagedAttention, continuous batching, tensor parallelism, speculative decoding, and prefix caching. OpenAI-compatible API server. Supports Llama, Mistral, Qwen, DeepSeek, and hundreds of models."
tags: [llm-serving, paged-attention, openai-compatible-server, high-throughput-inference, vllm]
---
## Overview
vLLM is a high-throughput, memory-efficient LLM inference engine. It features PagedAttention (block-based KV-cache management with near-zero memory waste), continuous batching, tensor parallelism, speculative decoding, prefix caching, and an OpenAI-compatible API server.
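To make the "near-zero memory waste" claim concrete, here is a toy calculation, not vLLM code. It compares reserving a contiguous KV-cache region of maximum length per sequence against allocating fixed-size blocks on demand, which is the idea behind PagedAttention. The block size of 16 tokens, the maximum length of 2048, the sequence lengths, and all function names are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE = 16   # tokens per KV-cache block (illustrative assumption)
MAX_LEN = 2048    # per-sequence reservation in the contiguous scheme (assumption)


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold num_tokens."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Unused token slots when allocating block by block.

    Only the last block can be partially filled, so the waste is
    always smaller than block_size.
    """
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


def wasted_slots_contiguous(num_tokens: int, max_len: int = MAX_LEN) -> int:
    """Unused token slots when max_len slots are reserved up front."""
    return max_len - num_tokens


seq_lens = [37, 512, 5]  # hypothetical current lengths of three sequences
paged_waste = sum(wasted_slots_paged(n) for n in seq_lens)
contiguous_waste = sum(wasted_slots_contiguous(n) for n in seq_lens)
print(f"paged: {paged_waste} wasted slots, contiguous: {contiguous_waste}")
```

In this sketch the paged scheme wastes at most `BLOCK_SIZE - 1` slots per sequence, regardless of how long the sequence could eventually grow.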
## Installation
```bash
uv pip install vllm
```
## Offline Inference
```python
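from vllm import LLM, SamplingParams

# Load the model once; weights are fetched from Hugging Face on first use.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# generate() takes a list of prompts and returns one result per prompt.
outputs = llm.generate(["What is the capital of France?"], params)
for o in outputs:
    print(o.outputs[0].text)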
```
## API Server
```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000
# In a second terminal, query the OpenAI-compatible endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```
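The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # matches --port 8000 above
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"    # must match the served model name


def build_chat_request(prompt: str, model: str = MODEL,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text.

    Requires the server to be running; raises URLError otherwise.
    """
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example (only works while the server is up):
# print(chat("Hello!"))
```

Because the endpoint follows the OpenAI schema, an OpenAI-compatible client library pointed at the same base URL should work as well.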
## Multi-GPU
```python
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B", tensor_parallel_size=2)  # shard across 2 GPUs
```
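When picking `tensor_parallel_size`, vLLM generally expects the model's attention-head count to be divisible by it, so not every GPU count is usable. The helper below is a hypothetical pre-flight check, not part of vLLM. The head count of 32 is an assumed value for illustration and should be confirmed against the model's `config.json`.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """List the shard counts up to num_gpus that evenly divide the head count."""
    return [
        tp for tp in range(1, num_gpus + 1)
        if num_attention_heads % tp == 0
    ]


# Assumed: a model with 32 attention heads on a machine with 4 GPUs.
candidates = valid_tensor_parallel_sizes(num_attention_heads=32, num_gpus=4)
print(candidates)
```

Under these assumptions a 3-GPU split is rejected, while 1, 2, and 4 remain as options.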
## References
- [vLLM docs](https://docs.vllm.ai/)
- [vLLM GitHub](https://github.com/vllm-project/vllm)