---
name: rag-blueprint
description: "NVIDIA RAG Blueprint — deploy, configure, troubleshoot, and manage. Handles any RAG action: deploy, install, start, enable, disable, toggle, change, configure, troubleshoot, debug, fix, shutdown, stop, or tear down any RAG feature or service (VLM, guardrails, query rewriting, models, search, ingestion, observability, summarization, and more)."
argument-hint: deploy RAG | enable feature | disable feature | configure | troubleshoot | shutdown
allowed-tools: Bash(echo *), Bash(nvidia-smi *), Bash(curl *), Bash(docker ps *), Bash(docker exec *), Bash(docker info *), Bash(docker --version *), Bash(docker compose version *), Bash(docker logs *), Bash(docker system *), Bash(kubectl get *), Bash(kubectl describe *), Bash(kubectl version *), Bash(kubectl logs *), Bash(helm version *), Bash(helm list *), Bash(git rev-parse *), Bash(git describe *), Bash(git status *), Bash(python3 --version *), Bash(pip3 show *), Bash(df *), Bash(du *), Bash(cat /proc/*), Bash(cat /etc/os-release *), Bash(ss *), Bash(netstat *), Bash(ls *), Bash(grep *), Bash(lsof *), Bash(ps aux *), Read, Grep, Glob
license: Apache-2.0
metadata:
  author: nvidia-rag-team
  version: "1.0"
---
# NVIDIA RAG Blueprint
## Autonomy Principles
- Auto-detect everything: GPU, VRAM, drivers, Docker, CUDA, disk, OS, ports, existing services, NGC key, repo state.
- If it can be checked with a command, check it — don't ask the user.
- Ask only when user action is required: providing an API key, confirming data deletion, or choosing between equally valid options.
- Once analysis is done, route to the correct workflow and execute.
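A minimal preflight sketch of the auto-detection above, assuming bash and the commands already listed in `allowed-tools`. The `probe` helper is hypothetical, and the `NGC_API_KEY` variable name is an assumption. Each probe prints a `NO_*`-style marker instead of failing, so the script is safe to run on a host with nothing installed. Port 8081 is the `rag-server` health port used in the verify step.

```bash
#!/usr/bin/env bash
# Preflight sketch (hypothetical helper names): detect the facts listed under
# "Autonomy Principles" without asking the user.

# probe LABEL FALLBACK CMD...: print CMD's output under a header, or FALLBACK
# if CMD fails or prints nothing. stderr is discarded.
probe() {
  local label="$1" fallback="$2" out
  shift 2
  if ! out="$("$@" 2>/dev/null)" || [ -z "$out" ]; then
    out="$fallback"
  fi
  printf '=== %s ===\n%s\n' "$label" "$out"
}

probe GPU         NO_GPU        nvidia-smi --query-gpu=index,name,memory.total,memory.used,driver_version --format=csv,noheader
probe CUDA        NO_CUDA       sh -c "nvidia-smi | grep -o 'CUDA Version: [0-9.]*'"
probe DOCKER      NO_DOCKER     docker info --format '{{.ServerVersion}}'
probe COMPOSE     NO_COMPOSE    docker compose version --short
probe DISK        NO_DF         df -h .
probe OS          UNKNOWN_OS    grep PRETTY_NAME /etc/os-release
probe "PORT 8081" NOT_LISTENING sh -c "ss -ltn | grep ':8081 '"
probe REPO        NOT_A_REPO    sh -c 'git rev-parse --show-toplevel && git describe --tags --always --dirty'
# Report only whether the key is set; never echo its value.
probe NGC         NO_NGC_KEY    sh -c '[ -n "$NGC_API_KEY" ] && echo SET'
```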
## Intent Detection
Determine what the user wants and route immediately:
| User Intent | Action |
|-------------|--------|
| Deploy, install, set up, start RAG | Read and follow `references/deploy.md` |
| Configure, enable, change, toggle a feature | Use the **Configure** section below |
| Troubleshoot, debug, fix, error, unhealthy | Read and follow `references/troubleshoot.md` |
| Stop, shutdown, tear down, clean up | Read and follow `references/shutdown.md` |

If the intent is ambiguous, infer from context (e.g., "RAG isn't working" → troubleshoot; "get RAG running" → deploy). Only ask if genuinely unclear.
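The agent should infer intent from context, but the routing table can be approximated with keyword matching. A hypothetical `route_intent` sketch: troubleshooting keywords are checked first so that "fix the deployment" is not routed to deploy, and anything unmatched falls through to `ASK_USER`.

```bash
#!/usr/bin/env bash
# route_intent TEXT: hypothetical keyword approximation of the Intent Detection table.
# Order matters: troubleshoot > shutdown > configure > deploy.
route_intent() {
  local text
  text="$(printf '%s' "$*" | tr '[:upper:]' '[:lower:]')"
  case "$text" in
    *troubleshoot*|*debug*|*fix*|*error*|*unhealthy*|*"isn't working"*|*"not working"*)
      echo "references/troubleshoot.md" ;;
    *stop*|*shutdown*|*"shut down"*|*"tear down"*|*"clean up"*)
      echo "references/shutdown.md" ;;
    *configure*|*enable*|*disable*|*change*|*toggle*)
      echo "Configure section" ;;
    *deploy*|*install*|*"set up"*|*start*|*running*)
      echo "references/deploy.md" ;;
    *)
      echo "ASK_USER" ;;
  esac
}
```

For example, `route_intent "RAG isn't working"` prints `references/troubleshoot.md`, and `route_intent "get RAG running"` prints `references/deploy.md`.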
---
## Configure
Requires a running RAG deployment. If services are not running, deploy first via `references/deploy.md`.
Match the user's request to a reference file, then read and follow it:
| Feature Keywords | Reference |
|-----------------|-----------|
| VLM, VLM embeddings, image captioning | `references/configure/vlm.md` |
| NeMo Guardrails | `references/configure/guardrails.md` |
| Query rewriting, decomposition, multi-turn | `references/configure/query-and-conversation.md` |
| Ingestion (text-only, audio, Nemotron Parse, OCR, batch CLI, NV-Ingest, volume mount, performance) | `references/configure/ingestion.md` |
| Search, retrieval, hybrid search, multi-collection, metadata, filters, reranker, topK, accuracy/performance | `references/configure/search-and-retrieval.md` |
| LLM/embedding/ranking model changes, vector DB, Milvus/Elasticsearch auth, service keys, model profiles, ports/GPU | `references/configure/models-and-infrastructure.md` |
| Reasoning, self-reflection, prompts, generation params (tokens, temperature, citations), per-request LLM params | `references/configure/reasoning-and-generation.md` |
| Summarization | `references/configure/summarization.md` |
| Observability (tracing, Zipkin, Grafana, Prometheus) | `references/configure/observability.md` |
| Multimodal query (image + text) | `references/configure/multimodal-query.md` |
| Data catalog (collection/document metadata) | `references/configure/data-catalog.md` |
| User interface (UI settings) | `references/configure/user-interface.md` |
| API reference (endpoints, schemas) | `references/configure/api-reference.md` |
| Evaluation (RAGAS metrics) | `references/configure/evaluation.md` |
| MCP server & client, agent toolkit | `references/configure/mcp.md` |
| Migration (version upgrades) | `references/configure/migration.md` |
| Notebooks (setup and catalog) | `references/configure/notebooks.md` |
### Configure Flow
1. Match the user's request to a reference file from the table above.
2. Detect what's running:
```bash
echo "=== NIM ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(nim-llm|nemoretriever-embedding|nemoretriever-ranking|nemo-vlm|nemotron-vlm)' || echo "NO_LOCAL_NIMS"
echo "=== RAG ===" && docker ps --format '{{.Names}}' 2>/dev/null | grep -iE '(rag-server|ingestor-server|milvus)' || echo "NO_DOCKER_RAG"
echo "=== K8S ===" && kubectl get pods -n rag 2>/dev/null | head -5 | grep . || echo "NO_K8S"
echo "=== LIBRARY ===" && ps aux 2>/dev/null | grep -E '(nvidia_rag|uvicorn.*rag)' | grep -v grep || echo "NO_LIBRARY"
```
3. Use this table to determine platform, deployment type, and where config lives:
| Local NIMs running? | RAG services running? | Deployment Type | Config Location |
|---------------------|-----------------------|-----------------|-----------------|
| Yes (Docker) | Any | Self-hosted | `deploy/compose/.env` |
| No | Yes (Docker) | NVIDIA-hosted | `deploy/compose/nvdev.env` |
| Yes (K8s pods) | Any | Self-hosted | `values.yaml` (NIM sections) |
| No | Yes (K8s pods) | NVIDIA-hosted | `values.yaml` (envVars) |
| — | Library processes | Library mode | `notebooks/config.yaml` |
| No | No | Not running | Deploy first via `references/deploy.md` |

Tell the user what you detected and ask them to confirm. Example: "I see local NIM containers running (nim-llm-ms, nemoretriever-embedding-ms), so this is a self-hosted deployment. The config file is `deploy/compose/.env`. Correct?"
4. Check current feature state before changing anything — read the config location from step 3, then cross-check the live service:
- Docker: `docker exec rag-server env 2>/dev/null | grep -E "<VAR_NAME>"`
- Helm: `kubectl get pod -n rag -l app=rag-server -o jsonpath='{.items[0].spec.containers[0].env}' 2>/dev/null`
If the config file and live service disagree, tell the user the service has stale config and will need a restart.
5. If the feature needs extra GPUs, check availability against hardware restrictions (see below):
```bash
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader 2>/dev/null || echo "NO_GPU"
```
6. Read the reference file and apply changes:
- **Docker**: edit the env file (uncomment to enable, re-comment to disable — the env file is the source of truth). Then restart the affected service:
```bash
source <env-file> && docker compose -f deploy/compose/<compose-file> up -d
```
| Service | Compose File |
|---------|-------------|
| rag-server | `docker-compose-rag-server.yaml` |
| ingestor-server | `docker-compose-ingestor-server.yaml` |
| milvus, etcd, minio | `vectordb.yaml` |
| NIM containers (LLM, embedding, ranking, VLM, OCR) | `nims.yaml` |
| guardrails | `docker-compose-nemo-guardrails.yaml` |
| observability (Grafana, Prometheus, Zipkin) | `observability.yaml` |
- **Helm**: edit `values.yaml`, then upgrade: `helm upgrade rag <chart> -n rag -f values.yaml`
- **Library**: edit `notebooks/config.yaml`, then restart the Python process
7. Verify:
- Docker: `docker ps --format "table {{.Names}}\t{{.Status}}" | head -20; curl -s "http://localhost:8081/v1/health?check_dependencies=true" 2>/dev/null | head -1`
- Helm: `kubectl get pods -n rag; kubectl rollout status deployment/rag-server -n rag --timeout=120s`
- Library: `curl -s http://localhost:8081/v1/health 2>/dev/null | head -1`
8. If the restart fails, read `references/troubleshoot.md`. If multiple features are requested, repeat from step 1 for each.
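Steps 3 and 4 can be sketched in bash as follows. `classify_deployment` mirrors the rows of the step 3 table, with local NIMs taking precedence over RAG services; its arguments (`docker`, `k8s`, or `none` for NIMs and RAG services, `yes`/`no` for library processes) are a hypothetical encoding of the step 2 output. `check_drift` compares one variable between an env file and the live `rag-server` container; it assumes plain `VAR=value` lines and does not expand `${...}` defaults.

```bash
#!/usr/bin/env bash
# Hypothetical helpers sketching Configure Flow steps 3-4.

# classify_deployment NIMS RAG LIBRARY -> "Deployment type|Config location"
#   NIMS, RAG: docker | k8s | none     LIBRARY: yes | no
classify_deployment() {
  local nims="$1" rag="$2" library="$3"
  if [ "$nims" = docker ]; then
    echo "Self-hosted|deploy/compose/.env"
  elif [ "$nims" = k8s ]; then
    echo "Self-hosted|values.yaml (NIM sections)"
  elif [ "$rag" = docker ]; then
    echo "NVIDIA-hosted|deploy/compose/nvdev.env"
  elif [ "$rag" = k8s ]; then
    echo "NVIDIA-hosted|values.yaml (envVars)"
  elif [ "$library" = yes ]; then
    echo "Library mode|notebooks/config.yaml"
  else
    echo "Not running|deploy first via references/deploy.md"
  fi
}

# check_drift VAR ENV_FILE: compare the last uncommented assignment of VAR in
# ENV_FILE (quotes stripped) with VAR inside the rag-server container.
check_drift() {
  local var="$1" env_file="$2" file_val live_val
  file_val="$(grep -E "^(export )?${var}=" "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2- | tr -d "\"'")"
  live_val="$(docker exec rag-server env 2>/dev/null | grep -E "^${var}=" | cut -d= -f2-)"
  if [ "$file_val" = "$live_val" ]; then
    echo "IN_SYNC ${var}=${file_val}"
  else
    echo "STALE ${var}: file='${file_val}' live='${live_val}' (restart needed)"
  fi
}
```

For example, `classify_deployment docker docker no` prints `Self-hosted|deploy/compose/.env`, and `check_drift ENABLE_VLM deploy/compose/.env` (variable name illustrative) reports whether the service needs a restart.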
### When User Says "Configure" Without Specifics
Run steps 2–3 above, then read the identified config file to list what's currently enabled:
```bash
grep -E "^(export )?(ENABLE_|APP_)" <config-file> 2>/dev/null | sort
```
Summarize what's running and enabled, then ask which feature to change.
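The `grep` above lists only uncommented lines, so features disabled by re-commenting them (the env-file convention from step 6) do not appear. A minimal sketch that lists both, assuming disabled flags stay in the file as commented-out `VAR=value` lines; `list_flags` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# list_flags ENV_FILE: print ENABLE_*/APP_* assignments as "active" or
# "disabled" (commented out), with any leading "#" and "export " removed.
list_flags() {
  awk '
    /^[ \t]*#?[ \t]*(export )?(ENABLE_|APP_)[A-Za-z0-9_]*=/ {
      state = ($0 ~ /^[ \t]*#/) ? "disabled" : "active"
      line = $0
      sub(/^[ \t]*#?[ \t]*(export )?/, "", line)
      printf "%-8s %s\n", state, line
    }' "$1" | sort -b -k2
}

# Example: list_flags deploy/compose/.env
```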
---
## Hardware Restrictions
Read `docs/support-matrix.md` for current GPU requirements per deployment mode.
Read `docs/service-port-gpu-reference.md` for port mappings and GPU assignments.
| GPU | Feature Restrictions |
|-----|---------------------|
| B200 | No VLM, no Guardrails, no Nemotron Parse. May need multi-GPU LLM (`LLM_MS_GPU_ID`). |
| RTX PRO 6000 | No Nemotron Parse. No Audio on Helm. |
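A minimal sketch that applies the table above to the GPUs `nvidia-smi` reports. The restrictions are hard-coded from this table (treat `docs/support-matrix.md` as authoritative), and matching on the reported GPU name string is an assumption.

```bash
#!/usr/bin/env bash
# gpu_restrictions NAME: feature restrictions for one GPU, copied from the
# Hardware Restrictions table. The B200 pattern requires " B200" or a leading
# "B200" so that names merely containing "B200" (e.g. "GB200") are not caught.
gpu_restrictions() {
  case "$1" in
    *" B200"*|B200*)  echo "No VLM, no Guardrails, no Nemotron Parse; may need multi-GPU LLM (LLM_MS_GPU_ID)" ;;
    *"RTX PRO 6000"*) echo "No Nemotron Parse; no Audio on Helm" ;;
    *)                echo "No restrictions listed here; check docs/support-matrix.md" ;;
  esac
}

gpus="$(nvidia-smi --query-gpu=index,name --format=csv,noheader 2>/dev/null)" || gpus=""
if [ -z "$gpus" ]; then
  echo "NO_GPU"
else
  # Each line looks like "0, NVIDIA B200": split on the comma, trim the space.
  while IFS=, read -r idx name; do
    name="${name# }"
    echo "GPU ${idx} (${name}): $(gpu_restrictions "$name")"
  done <<< "$gpus"
fi
```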