rt-vlm
$ npx mdskill add NVIDIA-AI-Blueprints/video-search-and-summarization/rt-vlm
Generate dense captions and alerts for video files or live streams.
- Processes stored video files and RTSP streams into actionable insights.
- Integrates with RTVI VLM microservice, OpenAI-compatible models, and Kafka.
- Executes caption generation, file uploads, and live stream management.
- Delivers results via SSE streams and publishes alerts to Kafka topics.
---
name: rt-vlm
description: >
Use this skill when working with the RTVI VLM or RT-VLM microservice API on VSS 3.1.
Generate dense captions and alerts for stored video files and live RTSP streams via
`/v1/generate_captions_alerts`; upload media via `/v1/files`; add and remove live
streams with `/v1/streams/add` and `/v1/streams/delete/{stream_id}`; call
OpenAI-compatible `/v1/chat/completions`; consume Kafka caption, incident, and error
topics; or debug rtvi-vlm responses. For deployment, read
`references/deploy-rt-vlm-service.md` first.
license: Apache-2.0
metadata:
version: "3.1.0"
github-url: "https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization"
tags: "nvidia blueprint operational deployment"
---
# RTVI VLM Usage API (VSS 3.1)
RTVI VLM is NVIDIA's real-time vision-language microservice: decode video (file or
RTSP) → segment into chunks → run a VLM (`cosmos-reason1`, `cosmos-reason2`, or any
OpenAI-compatible model) → stream dense captions back over SSE/HTTP and publish
captions + incident alerts + errors to Kafka. Use this skill whenever you need to hit
any `/v1/...` endpoint on the VSS 3.1 rtvi-vlm microservice: caption generation, file
upload, live-stream management, health checks, NIM-compatible chat completions,
Prometheus metrics. API reference: <https://docs.nvidia.com/vss/latest/real-time-vlm-api.html>.
## Setup
```bash
export BASE_URL="http://localhost:8000" # RTVI VLM host:port — matches $RTVI_VLM_PORT in compose
export API_KEY="$NGC_API_KEY" # Bearer token (NGC key works if the service was deployed with NGC auth)
```
Every request below uses `Authorization: Bearer $API_KEY`. Health endpoints
(`/v1/health/*`, `/v1/ready`, `/v1/live`, `/v1/startup`) typically work without auth.
**Smoke test before use:**
```bash
curl -fsS "$BASE_URL/v1/health/ready" && \
curl -fsS -H "Authorization: Bearer $API_KEY" "$BASE_URL/v1/models" | jq
```
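Because the service can answer 503 while the model is still loading (see the Error Reference), a script may want to poll the readiness probe before sending work. The following is a minimal sketch, assuming `/v1/health/ready` returns a non-2xx status until startup completes; `wait_until` and its arguments are hypothetical helper names, not part of the service.

```bash
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND at least once and at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Succeeds as soon as COMMAND succeeds.
wait_until() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (not run here): poll readiness every 5 s for up to 5 minutes.
# wait_until 60 5 curl -fsS "$BASE_URL/v1/health/ready"
```

The attempt count and delay are arbitrary starting points; first-time model downloads may need a longer window.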
## Quick Start — dense captions from a local video
```bash
# 1. Upload the video, capture its file id
FILE_ID=$(curl -fsS -X POST "$BASE_URL/v1/files" \
-H "Authorization: Bearer $API_KEY" \
-F "file=@/path/to/warehouse.mp4" \
-F "purpose=vision" \
-F "media_type=video" | jq -r '.id')
# 2. Generate captions + alerts (SSE stream of chunked responses)
curl -N -X POST "$BASE_URL/v1/generate_captions_alerts" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d "{
\"id\": \"$FILE_ID\",
\"prompt\": \"Write a concise dense caption for each 10-second segment of this warehouse video.\",
\"model\": \"cosmos-reason1\",
\"chunk_duration\": 10,
\"stream\": true
}"
```
## Endpoints
### Captions
> Generate VLM captions and alerts for videos and live streams.
#### `POST /v1/generate_captions_alerts` — Generate VLM captions (and alerts) for video/stream
**Required:**
| Field | Type | Description |
|-------|------|-------------|
| `id` | string \| array | UUID of a previously-uploaded file, or id of an active live stream. Accepts a list of ids for batch |
| `prompt` | string | User prompt to the VLM (e.g. dense-caption instruction) |
| `model` | string | Model name — see `GET /v1/models` |
**Key optional fields:**
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `system_prompt` | string | — | System prompt; use `<think></think><answer></answer>` tags to enable reasoning on Cosmos Reason |
| `enable_reasoning` | boolean | false | Turn on reasoning for Cosmos Reason models |
| `enable_audio` | boolean | false | Transcribe audio (via Riva) and fold into captions |
| `chunk_duration` | integer | — | Segment video into N-second chunks (`0` = no chunking) |
| `chunk_overlap_duration` | integer | 0 | Overlap between consecutive chunks |
| `num_frames_per_second_or_fixed_frames_chunk` | number | — | FPS (if `use_fps_for_chunking=true`) or fixed frames per chunk |
| `use_fps_for_chunking` | boolean | false | Interpret above as FPS vs. fixed-frame count |
| `vlm_input_width` / `vlm_input_height` | int | — | Resize frames before inference (0 = native) |
| `media_info` | object | — | `{"start_offset_ms": ..., "end_offset_ms": ...}` to process a slice of a file (not live streams) |
| `stream` | boolean | false | SSE: emit per-chunk caption deltas as `data:` events (recommended for long videos) |
| `max_tokens` / `temperature` / `top_p` / `top_k` / `seed` / `ignore_eos` | | | Standard sampling controls |
| `response_format` | object | — | Query response format object |
| `mm_processor_kwargs` | object | — | Extra kwargs for the multimodal processor (e.g. size, shortest/longest edge) |
```bash
curl -N -X POST "$BASE_URL/v1/generate_captions_alerts" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"id": "123e4567-e89b-12d3-a456-426614174000",
"prompt": "Dense-caption this warehouse video, one sentence per 10s chunk.",
"model": "cosmos-reason1",
"chunk_duration": 10,
"stream": true
}'
```
**Response (200, SSE when `stream=true`):** each event payload carries `start_ts`, `end_ts`,
and `content`; a final `{"status": "completed"}` event ends the stream.
**Response (200, non-stream):** `{ "id", "object": "caption", "choices": [{...}], "usage": {...} }`.
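To post-process the SSE output in a script, the `data:` framing has to be removed first. Here is a minimal sketch, assuming each event is a single `data: {...}` line and that the final event contains `"status"` and `"completed"` as described in this document; `sse_payloads` is a hypothetical helper name.

```bash
# sse_payloads: read an SSE stream on stdin, print one JSON payload per
# line, and stop at the terminal {"status": "completed"} event.
sse_payloads() {
  local line payload cr
  cr=$(printf '\r')
  while IFS= read -r line || [ -n "$line" ]; do
    line="${line%"$cr"}"              # tolerate CRLF line endings
    case "$line" in
      data:*)
        payload="${line#data:}"
        payload="${payload# }"        # drop the optional space after "data:"
        case "$payload" in
          *'"status"'*'"completed"'*) return 0 ;;
        esac
        printf '%s\n' "$payload"
        ;;
    esac
  done
}

# Example (not run here):
# curl -N ... "$BASE_URL/v1/generate_captions_alerts" | sse_payloads | jq -r '.content'
```

Blank separator lines and any non-`data:` fields are ignored. Multi-line `data:` events, if the server ever emits them, would need a fuller SSE parser.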
#### `DELETE /v1/generate_captions_alerts/{stream_id}` — Stop caption generation for a live stream
Stops inference while leaving the stream registered. Pair with
`DELETE /v1/streams/delete/{stream_id}` to also un-register the RTSP source.
```bash
curl -X DELETE "$BASE_URL/v1/generate_captions_alerts/$STREAM_ID" -H "Authorization: Bearer $API_KEY"
```
### Files
> Upload and manage media files consumed by `/v1/generate_captions_alerts`.
#### `POST /v1/files` — Upload a media file (multipart)
```bash
curl -X POST "$BASE_URL/v1/files" -H "Authorization: Bearer $API_KEY" \
-F "file=@./video.mp4" -F "purpose=vision" -F "media_type=video"
```
**Response:** `{ "id", "object": "file", "bytes", "created_at", "filename", "purpose" }`.
#### `GET /v1/files?purpose=vision` — List uploaded files
#### `GET /v1/files/{file_id}` — File metadata
#### `GET /v1/files/{file_id}/content` — Download original file content
#### `DELETE /v1/files/{file_id}` — Delete file (releases asset storage)
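The list, metadata, download, and delete endpoints above all take the same base URL and bearer header, so a small wrapper keeps calls short. This is a sketch only: `rtvi` and the `DRY_RUN` switch are hypothetical names introduced for illustration, and it assumes `BASE_URL` and `API_KEY` are exported as in Setup.

```bash
# rtvi METHOD PATH [extra curl args...]
# Adds the base URL and bearer token to a request.
# With DRY_RUN=1 it prints the request line instead of sending it.
rtvi() {
  local method="$1"
  local path="$2"
  shift 2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s %s%s\n' "$method" "$BASE_URL" "$path"
    return 0
  fi
  curl -fsS -X "$method" "$BASE_URL$path" \
    -H "Authorization: Bearer $API_KEY" "$@"
}

# Examples (not run here):
# rtvi GET "/v1/files?purpose=vision"                  # list uploaded files
# rtvi GET "/v1/files/$FILE_ID"                        # file metadata
# rtvi GET "/v1/files/$FILE_ID/content" -o copy.mp4    # download the original
# rtvi DELETE "/v1/files/$FILE_ID"                     # delete, release storage
```

Extra arguments are passed straight to `curl`, so `-o`, `-F`, or `-d` work as usual.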
### Live Stream
> RTSP stream lifecycle.
#### `POST /v1/streams/add` — Register one or more RTSP streams
**Required per stream:** `liveStreamUrl` (must start with `rtsp://`), `description`.
Optional: `username`, `password`, `sensor_name`, and placement metadata
(`place_name`, `place_type`, `place_lat`, `place_lon`, `place_alt`,
`place_coordinate_x`, `place_coordinate_y`).
```bash
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://cam:8554/live","description":"warehouse cam 1"}]}' \
| jq -r '.results[0].id')
```
#### `GET /v1/streams/get-stream-info` — List active streams
#### `DELETE /v1/streams/delete/{stream_id}` — Remove a single stream
#### `DELETE /v1/streams/delete-batch` — Remove many (`{"stream_ids":[...]}`)
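The batch delete takes a JSON body of the form `{"stream_ids":[...]}`. Below is a minimal sketch for building that body from shell arguments, assuming the ids are plain tokens that need no JSON escaping; `stream_ids_payload` is a hypothetical helper name.

```bash
# stream_ids_payload ID [ID...]
# Prints {"stream_ids":["ID",...]} for DELETE /v1/streams/delete-batch.
# Ids are inserted verbatim, so they must not contain quotes or backslashes.
stream_ids_payload() {
  local sep=""
  local id
  printf '{"stream_ids":['
  for id in "$@"; do
    printf '%s"%s"' "$sep" "$id"
    sep=","
  done
  printf ']}\n'
}

# Example (not run here):
# curl -X DELETE "$BASE_URL/v1/streams/delete-batch" \
#   -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
#   -d "$(stream_ids_payload "$STREAM_A" "$STREAM_B")"
```

If ids might contain special characters, building the body with `jq -n` is the safer route.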
### NIM Compatible
> OpenAI-compatible endpoints for interop with OpenAI/NVIDIA-API clients.
#### `POST /v1/chat/completions` — OpenAI-compatible chat (text + multimodal)
**Required:** `messages`, `model`. Text-only requests omit `id` / `video_url` / `image_url`.
```bash
curl -X POST "$BASE_URL/v1/chat/completions" -H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"cosmos-reason1","messages":[{"role":"user","content":"Summarize this scene."}]}'
```
#### `POST /v1/completions` — OpenAI-compatible legacy completions
#### `GET /v1/version` — `{ "version": "3.1.0-..." }`
#### `GET /v1/license` — license text
#### `GET /v1/manifest` — NIM manifest
#### `GET /v1/health/live` · `GET /v1/health/ready` — NIM-style probes
### Models · Metadata · Metrics · Health Check
#### `GET /v1/models` — List loaded VLMs: `{ "data": [{ "id", "object": "model", "owned_by" }] }`
#### `GET /v1/metadata` — Service metadata (build, release, image tag)
#### `GET /v1/metrics` — Prometheus metrics (plain text)
#### `GET /v1/ready` · `GET /v1/live` · `GET /v1/startup` — Kubernetes-style probes
---
## Common Workflows
The four scenarios from the VSS 3.1 RT-VLM Usage Skill requirements.
### 1. Dense captions from a stored video file
```bash
# Upload → capture file id → generate captions (SSE stream)
FILE_ID=$(curl -fsS -X POST "$BASE_URL/v1/files" \
-H "Authorization: Bearer $API_KEY" \
-F "file=@warehouse.mp4" -F "purpose=vision" -F "media_type=video" | jq -r '.id')
curl -N -X POST "$BASE_URL/v1/generate_captions_alerts" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$FILE_ID\",
\"prompt\": \"Describe warehouse events in 1 sentence per 10s chunk.\",
\"model\": \"cosmos-reason1\",
\"chunk_duration\": 10,
\"stream\": true
}"
# When done, free storage:
curl -X DELETE "$BASE_URL/v1/files/$FILE_ID" -H "Authorization: Bearer $API_KEY"
```
### 2. Dense captions from an RTSP live stream
```bash
# Register the stream
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://10.0.0.5:8554/warehouse","description":"warehouse cam"}]}' \
| jq -r '.results[0].id')
# Start continuous caption generation (runs until stream stops or DELETE)
curl -N -X POST "$BASE_URL/v1/generate_captions_alerts" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$STREAM_ID\",
\"prompt\": \"Describe each event; start each sentence with a timestamp.\",
\"model\": \"cosmos-reason1\",
\"chunk_duration\": 10,
\"num_frames_per_second_or_fixed_frames_chunk\": 2,
\"use_fps_for_chunking\": true,
\"stream\": true
}" &
# Tear down when finished:
curl -X DELETE "$BASE_URL/v1/generate_captions_alerts/$STREAM_ID" -H "Authorization: Bearer $API_KEY"
curl -X DELETE "$BASE_URL/v1/streams/delete/$STREAM_ID" -H "Authorization: Bearer $API_KEY"
```
**Dense captions with alerts on an RTSP stream** (Kafka must be enabled on the server):
```bash
# Pre-req: the container was started with:
# RTVI_VLM_KAFKA_ENABLED=true
# RTVI_VLM_KAFKA_TOPIC=vision-llm-messages
# RTVI_VLM_KAFKA_INCIDENT_TOPIC=vision-llm-events-incidents
# RTVI_VLM_ERROR_MESSAGE_TOPIC=vision-llm-errors
# HOST_IP=<kafka-host>
STREAM_ID=$(curl -fsS -X POST "$BASE_URL/v1/streams/add" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d '{"streams":[{"liveStreamUrl":"rtsp://10.0.0.5:8554/warehouse","description":"warehouse cam"}]}' \
| jq -r '.results[0].id')
curl -N -X POST "$BASE_URL/v1/generate_captions_alerts" \
-H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
-d "{
\"id\": \"$STREAM_ID\",
\"prompt\": \"You are a warehouse monitoring system. Describe the scene in one sentence, then on a new line output exactly:\\nAnomaly Detected: Yes/No\\nReason: <one sentence>\\nFlag an anomaly if any worker is missing a hard hat or high-vis vest.\",
\"system_prompt\": \"Answer the user's question correctly in yes or no.\",
\"model\": \"cosmos-reason2\",
\"chunk_duration\": 60,
\"chunk_overlap_duration\": 10,
\"stream\": true
}"
```
**Consume alerts from Kafka** (when using the VSS foundational Kafka container).
Kafka values are NvSchema protobuf payloads, so use `print.value=false` for a
clean validation pass that shows timestamp, key, and headers without dumping
binary payload bytes:
```bash
docker exec mdx-kafka kafka-console-consumer \
--bootstrap-server 127.0.0.1:9092 \
--topic vision-llm-events-incidents \
--from-beginning \
--timeout-ms 5000 \
--max-messages 10 \
--property print.timestamp=true \
--property print.key=true \
--property print.headers=true \
--property print.value=false
```
If Kafka is not running in the VSS `mdx-kafka` container, use the Kafka CLI from
the host running the broker:
```bash
kafka-console-consumer \
--bootstrap-server "$HOST_IP:9092" \
--topic vision-llm-events-incidents \
--from-beginning \
--timeout-ms 5000 \
--max-messages 10 \
--property print.timestamp=true \
--property print.key=true \
--property print.headers=true \
--property print.value=false
```
Incident protobuf (`ext.proto :: Incident`) key fields: `sensorId`, `timestamp`, `end`,
`objectIds`, `frameIds`, `place`, `analyticsModule`, `category`, `isAnomaly` (`true` for
alerts), `llm` (nested VisionLLM), `info` map including `triggerPhrase`, `verdict`,
`requestId`, `chunkIdx`, `streamId`, `alertCategory` (if the deployment supports the
`alert_category` query field — post-3.1).
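The Gotchas section states that an alert fires when the VLM response contains the token `yes` or `true`, case-insensitively. When tuning a prompt, it can help to check sample responses locally before watching the incident topic. The sketch below assumes whole-word matching, which may differ from the server's actual matching logic; `would_trigger_alert` is a hypothetical helper name.

```bash
# would_trigger_alert TEXT
# Succeeds if TEXT contains "yes" or "true" as a whole word, ignoring case.
# Approximation only: the server-side matcher is not reproduced here.
would_trigger_alert() {
  printf '%s\n' "$1" | grep -Eiwq 'yes|true'
}

# Example:
# would_trigger_alert "Anomaly Detected: Yes" && echo "would publish an incident"
```

A check like this also exposes accidental triggers: a caption such as "It is true that the aisle is clear" would match, which is one reason to constrain the response format tightly.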
### 3. Kafka workflows (alerts + message bus)
Dense captioning with alerts on an RTSP stream and the HTTP-vs-Kafka response model are documented in [`references/kafka-workflows.md`](references/kafka-workflows.md).
## Error Reference
| Code | Meaning | Common Cause |
|------|---------|--------------|
| 400 | Bad Request | Missing required field (`id`, `prompt`, `model`); unsupported `media_type`; unknown `model` name |
| 401 | Unauthorized | Missing/invalid `Authorization: Bearer $API_KEY` — or wrong key format (expect `nvapi-...`) |
| 404 | Not Found | `file_id` deleted; `stream_id` not registered; wrong endpoint path (note: `{stream_id}` is required on `DELETE /v1/streams/delete/{stream_id}`) |
| 413 | Payload Too Large | Uploaded file exceeds server `MAX_FILE_SIZE`; increase or pre-chunk the video |
| 422 | Unprocessable Entity | Pydantic schema violation — e.g. `use_fps_for_chunking=true` without `num_frames_per_second_or_fixed_frames_chunk`; stream ids supplied to a file-only field like `media_info` |
| 429 | Rate Limited | Too many concurrent streams — raise `VLM_BATCH_SIZE` or spread across instances |
| 500 | Server Error | VLM inference exception (OOM, model unavailable) — check `docker logs rtvi-vlm-*` |
| 503 | Service Busy | Startup not complete (model still downloading) or upstream NIM dependency unhealthy |
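In a client script, the table above suggests two groups of status codes: those that may clear up with time (429, 503) and those that need a changed request or an operator. Here is a minimal sketch of that split; treating only 429 and 503 as retryable is an assumption made for illustration, and `is_retryable` is a hypothetical helper name.

```bash
# is_retryable HTTP_STATUS
# Succeeds for statuses the table describes as load- or startup-related.
is_retryable() {
  case "$1" in
    429|503) return 0 ;;
    *)       return 1 ;;
  esac
}

# Example (not run here): capture the status code with curl's -w option.
# status=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL/v1/health/ready")
# is_retryable "$status" && echo "service busy, retry later"
```

A 500 is left out deliberately, since an out-of-memory failure usually repeats until the request or deployment changes.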
---
## Gotchas
- **3.1 GA endpoint is `/v1/generate_captions_alerts`, not `/v1/generate_captions`.** The rename lands in a post-3.1 build. For VSS 3.1 releases (`rtvi_vlm/26.01.x`–`26.02.3`), always use the `_alerts` suffix. `https://docs.nvidia.com/vss/latest/real-time-vlm-api.html` is the canonical reference.
- **No URL-based input in 3.1 GA** — the `url`/`media_type`/`creation_time` fields were added post-3.1. You **must** upload via `POST /v1/files` first and then pass the returned `id`.
- **Alert trigger = the tokens `"yes"` or `"true"` in the VLM response (case-insensitive)**. There is no per-request alert flag. Design prompts with an explicit `Anomaly Detected: Yes/No` line and set `system_prompt` to constrain the model to Yes/No answers (per the VSS docs). Every chunk is published to `KAFKA_TOPIC`; matched chunks additionally go to `KAFKA_INCIDENT_TOPIC` with `isAnomaly=true`, `info["triggerPhrase"]` set to the matched tokens, and `info["verdict"]="confirmed"`.
- **No `alert_category` query field in the 3.1 OpenAPI spec.** The Kafka incident topic defaults `incident.category = "vlm-alert"` on 3.1. Post-3.1 builds expose an optional `alert_category` request field to override `incident.category`.
- **Kafka topics are server-side config, not per-request.** The `KAFKA_*` env vars (via compose `RTVI_VLM_KAFKA_*` rewrites) are fixed at container start — clients can't override topics on a per-request basis. Kafka publish is *additive* to the HTTP response, never a replacement.
- **`stream=true` returns Server-Sent Events, not chunked JSON.** Use `curl -N` (no buffering). Each event is `data: {"content": "...", "start_ts": ..., "end_ts": ...}\n\n`, terminated by `data: {"status":"completed"}\n\n`. Without `stream=true` the server buffers until the full video is processed — fine for short clips (<1 min), avoid for live streams.
- **`chunk_duration=0` disables chunking** — the entire video is sent to the VLM as one shot. Only meaningful for short clips; long videos will OOM or exceed `max_model_len`.
- **Default frame budget caps at `VLLM_MM_PROCESSOR_VIDEO_NUM_FRAMES` (256).** Requesting FPS that implies >256 frames per chunk is silently capped; drop FPS or shorten `chunk_duration` to stay within budget.
- **`enable_reasoning` requires a Cosmos Reason model.** Passing it with Qwen3-VL or other non-reasoning models is a no-op.
- **`/v1/metrics` requires auth**, unlike `/v1/health/*`. Prometheus scrapers need the Bearer token.
- **File upload is multipart, not JSON.** Use `-F file=@path -F purpose=vision -F media_type=video`; a `-d` body returns 422.
- **Live-stream lifecycle requires two deletes to fully tear down:** `DELETE /v1/generate_captions_alerts/{stream_id}` stops inference; `DELETE /v1/streams/delete/{stream_id}` un-registers the stream. Skipping the second leaks RTSP connection resources.
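The frame-budget gotcha above is easy to check before sending a request. The sketch below assumes that the frames implied per chunk equal FPS times `chunk_duration` and that the cap is the 256 quoted in this document; `frames_per_chunk` is a hypothetical helper name and handles integer FPS only.

```bash
# frames_per_chunk FPS CHUNK_SECONDS [LIMIT]
# Prints the frame count a chunk implies; returns non-zero if it exceeds
# LIMIT (default 256, the cap quoted in this document).
frames_per_chunk() {
  local fps="$1"
  local seconds="$2"
  local limit="${3:-256}"
  local frames=$((fps * seconds))
  printf '%s\n' "$frames"
  [ "$frames" -le "$limit" ]
}

# Example:
# frames_per_chunk 2 10   # prints 20, within budget
# frames_per_chunk 5 60   # prints 300, over budget (non-zero exit)
```

When the check fails, lowering FPS or shortening `chunk_duration` brings the request back under the cap instead of relying on silent truncation.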