langsmith-observability

Name: langsmith-observability
Author: Orchestra-Research/AI-Research-SKILLs

$npx mdskill add Orchestra-Research/AI-Research-SKILLs/langsmith-observability

Development platform for debugging, evaluating, and monitoring language models and AI applications.

SKILL.md

.github/skills/langsmith-observabilityView on GitHub ↗

---
name: langsmith-observability
description: LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Observability, LangSmith, Tracing, Evaluation, Monitoring, Debugging, Testing, LLM Ops, Production]
dependencies: [langsmith>=0.2.0]
---

# LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

## When to use LangSmith

**Use LangSmith when:**
- Debugging LLM application issues (prompts, chains, agents)
- Evaluating model outputs systematically against datasets
- Monitoring production LLM systems
- Building regression testing for AI features
- Analyzing latency, token usage, and costs
- Collaborating on prompt engineering

**Key features:**
- **Tracing**: Capture inputs, outputs, latency for all LLM calls
- **Evaluation**: Systematic testing with built-in and custom evaluators
- **Datasets**: Create test sets from production traces or manually
- **Monitoring**: Track metrics, errors, and costs in production
- **Integrations**: Works with OpenAI, Anthropic, LangChain, LlamaIndex

**Use alternatives instead:**
- **Weights & Biases**: Deep learning experiment tracking, model training
- **MLflow**: General ML lifecycle, model registry focus
- **Arize/WhyLabs**: ML monitoring, data drift detection

## Quick start

### Installation

```bash
pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true
```

### Basic tracing with @traceable

```python
from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")
```

### OpenAI wrapper (automatic tracing)

```python
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

## Core concepts

### Runs and traces

A **run** is a single execution unit (LLM call, chain, tool). Runs form hierarchical **traces** showing the full execution flow.

```python
from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")
```

### Projects

Projects organize related runs. Set via environment or code:

```python
import os
os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass
```

## Client API

```python
from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)
```

## Datasets and evaluation

### Create dataset

```python
from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
```

### Run evaluation

```python
from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")
```

### Built-in evaluators

```python
from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
```

## Advanced tracing

### Tracing context

```python
from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()
```

### Manual runs

```python
from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})
```

### Process inputs/outputs

```python
def sanitize_inputs(inputs: dict) -> dict:
    if "password" in inputs:
        inputs["password"] = "***"
    return inputs

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
```

### Sampling

```python
import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling
```

## LangChain integration

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
```

## Production monitoring

### Hub prompts

```python
from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
```

### Async client

```python
from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs
```

### Feedback collection

```python
from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
```

## Testing integration

### Pytest integration

```python
from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()
```

### Evaluation in CI/CD

```python
from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops
    assert results.aggregate_metrics["accuracy"] >= 0.9, \
        f"Accuracy {results.aggregate_metrics['accuracy']} below threshold"
```

## Best practices

1. **Structured naming** - Use consistent project/run naming conventions
2. **Add metadata** - Include version, environment, user info
3. **Sample in production** - Use sampling rate to control volume
4. **Create datasets** - Build test sets from interesting production cases
5. **Automate evaluation** - Run evaluations in CI/CD pipelines
6. **Monitor costs** - Track token usage and latency trends

## Common issues

**Traces not appearing:**
```python
import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(client.list_projects())  # Should work
```

**High latency from tracing:**
```python
# Enable background batching (default)
from langsmith import Client
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"
```

**Large payloads:**
```python
# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass
```

## References

- **[Advanced Usage](references/advanced-usage.md)** - Custom evaluators, distributed tracing, hub prompts
- **[Troubleshooting](references/troubleshooting.md)** - Common issues, debugging, performance

## Resources

- **Documentation**: https://docs.smith.langchain.com
- **Python SDK**: https://github.com/langchain-ai/langsmith-sdk
- **Web App**: https://smith.langchain.com
- **Version**: 0.2.0+
- **License**: MIT

More from Orchestra-Research/AI-Research-SKILLs

Skill	Description
academic-plotting	Generates publication-quality figures for ML papers from research context. Given a paper section or description, extracts system components and relationships to generate architecture diagrams via Gemini. Given experiment results or data, auto-selects chart type and generates data-driven figures via matplotlib/seaborn. Use when creating any figure for a conference paper.
ara-compiler	Compiles any research input — PDF papers, GitHub repositories, experiment logs, code directories, or raw notes — into a complete Agent-Native Research Artifact (ARA) with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph, and grounded evidence. Use when ingesting a paper or codebase into a structured, machine-executable knowledge package, building an ARA from scratch, or converting research outputs into a falsifiable, agent-traversable form.
ara-research-manager	Records research provenance as a post-task epilogue, scanning conversation history at the end of a coding or research session to extract decisions, experiments, dead ends, claims, heuristics, and pivots, and writing them into the ara/ directory with user-vs-AI provenance tags. Use as a session epilogue — never during execution — to maintain a faithful, auditable trace of how a research project actually evolved.
ara-rigor-reviewer	Performs ARA Seal Level 2 semantic epistemic review on Agent-Native Research Artifacts, scoring six dimensions (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and producing a constructive, severity-ranked report with a Strong Accept-to-Reject recommendation. Use after Level 1 structural validation passes, when an ARA needs an objective epistemic critique before publication or release.
autogpt-agents	Autonomous AI agent platform for building and deploying continuous agents. Use when creating visual workflow agents, deploying persistent autonomous agents, or building complex multi-step AI automation systems.
autoresearch	Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
awq-quantization	Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
blip-2-vision-language	Vision-language pre-training framework bridging frozen image encoders and LLMs. Use when you need image captioning, visual question answering, image-text retrieval, or multimodal chat with state-of-the-art zero-shot performance.
brainstorming-research-ideas	Guides researchers through structured ideation frameworks to discover high-impact research directions. Use when exploring new problem spaces, pivoting between projects, or seeking novel angles on existing work.
constitutional-ai	Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.