read-eval-logs

$ npx mdskill add UKGovernmentBEIS/inspect_evals/read-eval-logs

View and analyse Inspect evaluation log files using the Python API and CLI commands.

  • Analyses evaluation logs without pre-written scripts.
  • Works with Inspect evaluation logs through the Python API.
  • Runs commands to list, dump, convert, and view logs.
  • Returns results as JSON output or through the interactive viewer.
SKILL.md
.github/skills/read-eval-logs
---
name: read-eval-logs
description: View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.
---

# Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

## Quick Reference

### CLI Commands

```bash
# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema
```

### Interactive Log Viewer

```bash
# Start the interactive log viewer (updates automatically)
uv run inspect view
```

## Python API

### Key Imports

```python
from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)
```

### Listing Logs

```python
# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)
```

### Reading Logs

```python
# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)
```

### Reading Samples

```python
# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")
```

### Filtering Samples

```python
# Read only samples with errors using summaries for filtering
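log_file = "path/to/logfile.eval"
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        # Load the full sample only when its summary reports an error
        errors.append(
            read_eval_log_sample(log_file, id=summary.id, epoch=summary.epoch)
        )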
```
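
The same two-pass idea (scan the cheap summaries, then load only the samples you need) works for any condition, not just errors. Here is a minimal sketch, assuming only that each summary exposes the `id`, `epoch`, and `error` attributes used above. `select_sample_keys` and `has_error` are hypothetical helpers, not part of `inspect_ai`:

```python
from typing import Any, Callable, Iterable


def select_sample_keys(
    summaries: Iterable[Any],
    predicate: Callable[[Any], bool],
) -> list[tuple[Any, int]]:
    """Return (id, epoch) pairs for the summaries that match `predicate`.

    Hypothetical helper: it relies only on the `id` and `epoch`
    attributes that the summaries in the example above expose.
    """
    return [(s.id, s.epoch) for s in summaries if predicate(s)]


def has_error(summary: Any) -> bool:
    """True when the summary records a sample-level error."""
    return summary.error is not None
```

Each returned pair can then be passed to `read_eval_log_sample(log_file, id=..., epoch=...)` so that only the matching samples are read in full.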

## EvalLog Structure

The `EvalLog` object contains:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `version` | `int` | File format version (currently 2) |
| `status` | `str` | `"started"`, `"success"`, `"cancelled"`, or `"error"` |
| `eval` | `EvalSpec` | Task, model, creation time, config |
| `plan` | `EvalPlan` | Solvers and generation config |
| `results` | `EvalResults` | Aggregate scores and metrics |
| `stats` | `EvalStats` | Runtime, model usage statistics |
| `error` | `EvalError` | Error info if `status == "error"` |
| `samples` | `list[EvalSample]` | Individual samples (if not header_only) |
| `reductions` | `list[EvalSampleReductions]` | Multi-epoch reductions |
| `location` | `str` | URI where log was read from |

### Always Check Status

```python
log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")
```
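
When a log did not finish successfully, it helps to report why before moving on. The sketch below is one possible way to branch on every status value listed in the table above. It assumes only the `status`, `error`, and `results` fields from that table, and `describe_status` is a hypothetical name:

```python
from typing import Any


def describe_status(log: Any) -> str:
    """Summarise a log's status in one line.

    Assumes `log` exposes the `status`, `error`, and `results` fields
    listed in the EvalLog table; this helper is hypothetical.
    """
    if log.status == "success":
        n = len(log.results.scores) if log.results else 0
        return f"success: {n} scorer(s) reported"
    if log.status == "error":
        message = log.error.message if log.error else "no error details"
        return f"error: {message}"
    # "started" and "cancelled" logs may hold partial data only.
    return f"{log.status}: results may be missing or incomplete"
```

Reading with `header_only=True` is enough for this check, since none of these fields depend on the samples.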

## EvalSample Structure

Each sample contains:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `id` | `int \| str` | Unique sample ID |
| `epoch` | `int` | Epoch number |
| `input` | `str \| list[ChatMessage]` | Sample input |
| `target` | `str \| list[str]` | Expected target(s) |
| `messages` | `list[ChatMessage]` | Full conversation history |
| `output` | `ModelOutput` | Model's output |
| `scores` | `dict[str, Score]` | Scores from scorers |
| `metadata` | `dict[str, Any]` | Sample metadata |
| `store` | `dict[str, Any]` | State at end of execution |
| `events` | `list[Event]` | Transcript events |
| `error` | `EvalError` | Error if sample failed |
| `total_time` | `float` | Total sample runtime |
| `model_usage` | `dict[str, ModelUsage]` | Token usage |
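
As an illustration of working with the `scores` field, the following sketch flattens one sample's scores into a plain dictionary. It assumes each `Score` carries its result in a `value` attribute (not shown in the table above), and `score_values` is a hypothetical helper:

```python
from typing import Any


def score_values(sample: Any) -> dict[str, Any]:
    """Map scorer name -> score value for a single sample.

    Assumption: each Score object exposes its result as `.value`.
    Returns an empty dict when the sample has no scores.
    """
    return {name: score.value for name, score in (sample.scores or {}).items()}
```

A dictionary like this is convenient for printing or for collecting rows while streaming samples.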

## Common Analysis Patterns

### Get Aggregate Metrics

```python
log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")
```

### Analyse Failed Samples

```python
log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")
```

### Extract Model Usage

```python
log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")
```
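
To total usage across several runs, the per-log loop above can be folded into an accumulator. This is a rough sketch that assumes only the `stats.model_usage` mapping and the `input_tokens` / `output_tokens` attributes used above. `total_usage` is a hypothetical helper:

```python
from collections import defaultdict
from typing import Any, Iterable


def total_usage(logs: Iterable[Any]) -> dict[str, dict[str, int]]:
    """Sum input and output tokens per model across several logs.

    Hypothetical helper; header-only logs are sufficient as input.
    """
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0}
    )
    for log in logs:
        for model, usage in log.stats.model_usage.items():
            totals[model]["input_tokens"] += usage.input_tokens
            totals[model]["output_tokens"] += usage.output_tokens
    return dict(totals)
```

For example, `total_usage(read_eval_log(info, header_only=True) for info in list_eval_logs())` would cover every log in the default directory without loading any samples.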

### Compare Multiple Runs

```python
logs = list_eval_logs(filter=lambda log: log.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
```
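
If a ranked comparison is more useful than a printout, the loop above can be turned into a small function. The sketch below follows the same convention of reading the metric from the first scorer, and skips logs without results or without the named metric. `rank_runs` is a hypothetical helper:

```python
from typing import Any, Iterable


def rank_runs(
    logs: Iterable[Any], metric_name: str = "accuracy"
) -> list[tuple[str, float]]:
    """Return (model, metric value) pairs sorted from best to worst.

    Hypothetical helper: reads `metric_name` from the first scorer only,
    mirroring the loop above, and assumes higher values are better.
    """
    rows = []
    for log in logs:
        if not (log.results and log.results.scores):
            continue  # no aggregate results recorded for this run
        metric = log.results.scores[0].metrics.get(metric_name)
        if metric is not None:
            rows.append((log.eval.model, metric.value))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Descending order assumes a higher-is-better metric such as accuracy; flip `reverse` for error-style metrics.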

### Find Retryable Logs

```python
all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")
```

## Log File Formats

| Type | Description |
| ---- | ----------- |
| `.eval` | Binary format, typically about 1/8 the size of JSON, fast incremental access |
| `.json` | Text format, human-readable, slower for large files |

Both formats are fully supported by the API and can be intermixed.

## Working with Large Logs

For logs too large to fit in memory:

1. **Use `.eval` format** - supports compression and incremental access
2. **Read header only** - `read_eval_log(log_file, header_only=True)`
3. **Stream samples** - `read_eval_log_samples()` yields one at a time
4. **Use summaries** - `read_eval_log_sample_summaries()` for quick overview
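
The streaming approach in step 3 pairs naturally with running aggregates, so no sample needs to stay in memory after it has been inspected. Here is a minimal sketch, assuming only that each sample has an `error` attribute. `error_rate` is a hypothetical helper:

```python
from typing import Any, Iterable


def error_rate(samples: Iterable[Any]) -> float:
    """Fraction of samples with an error, computed in a single pass.

    Hypothetical helper: accepts any iterable, including a generator,
    and keeps only two counters in memory.
    """
    total = failed = 0
    for sample in samples:
        total += 1
        if sample.error is not None:
            failed += 1
    return failed / total if total else 0.0
```

It could be called as `error_rate(read_eval_log_samples(log_file, all_samples_required=False))` to cover incomplete logs as well.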

## Modifying Logs

```python
# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)
```

## Environment Variables

| Variable | Description |
| -------- | ----------- |
| `INSPECT_LOG_DIR` | Default log directory (default: `./logs`) |
| `INSPECT_EVAL_LOG_FILE_PATTERN` | Log filename pattern (e.g., `{task}_{model}_{id}`) |
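
Scripts that need to know where logs live can mirror the fallback described in the table. The sketch below covers only that documented default and makes no attempt to reproduce whatever else Inspect itself consults when resolving the directory. `default_log_dir` is a hypothetical helper:

```python
import os
from typing import Mapping, Optional


def default_log_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return INSPECT_LOG_DIR if set and non-empty, else "./logs".

    Hypothetical helper. Returns a string rather than a Path so that
    non-local locations (for example a URI) are passed through untouched.
    """
    env = os.environ if env is None else env
    return env.get("INSPECT_LOG_DIR") or "./logs"
```

The result can be passed straight to `list_eval_logs(log_dir=default_log_dir())`.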

## Related Tools

- **[Inspect Scout](https://meridianlabs-ai.github.io/inspect_scout/)** - Transcript analysis
- **[Inspect Viz](https://meridianlabs-ai.github.io/inspect_viz/)** - Data visualization
- **Log Dataframes** - Extract pandas DataFrames from logs (see `inspect_ai.analysis`)