investigate-dataset
$
npx mdskill add UKGovernmentBEIS/inspect_evals/investigate-datasetExplore HuggingFace, CSV, and JSON datasets for structure and quality.
- Analyzes raw data files to reveal schema, fields, and integrity issues.
- Integrates with HuggingFace Hub, pandas, and inspect_ai libraries.
- Selects analysis methods based on detected file format and content patterns.
- Outputs structured reports detailing discovered columns and data anomalies.
SKILL.md
.github/skills/investigate-datasetView on GitHub ↗
---
name: investigate-dataset
description: Investigate datasets from HuggingFace, CSV, or JSON files to understand their structure, fields, and data quality. Trigger whenever you need to explore or inspect a dataset yourself without using pre-written scripts.
---
# Investigate Dataset
This workflow helps you explore and understand datasets used in evaluations. It covers HuggingFace datasets, CSV files, and JSON/JSONL files.
## Key Concepts
For detailed information on Inspect's dataset types (`datasets.Dataset` vs `inspect_ai.dataset.Dataset`), the `hf_dataset()` pipeline, caching behaviour, and test utilities, see `references/inspect-dataset-patterns.md`.
### Common Patterns in Evals
Evals typically define:
- `DATASET_PATH`: HuggingFace repo path (e.g., `"qiaojin/PubMedQA"`)
- `DATASET_REVISION`: Optional git revision/tag for reproducibility
- `record_to_sample()`: Function converting raw records to `Sample` objects
## Prerequisites
- Access to the evaluation code to find dataset configuration
- Python environment with `datasets`, `pandas`, and `inspect_ai` installed
## Steps
### 1. Identify the Dataset Source
Look for these patterns in the evaluation code:
```python
# HuggingFace dataset
DATASET_PATH = "org/dataset-name"
DATASET_REVISION = "v1.0" # optional
hf_dataset(path=DATASET_PATH, name="subset", split="train", ...)
# CSV dataset
csv_dataset("path/to/file.csv", ...)
load_csv_dataset("https://example.com/file.csv", eval_name="myeval", ...)
# JSON/JSONL dataset
json_dataset("path/to/file.json", ...)
load_json_dataset("https://example.com/file.jsonl", eval_name="myeval", ...)
```
### 2. Load the Raw Dataset
For investigation, load the raw data directly (not through Inspect's `sample_fields` transformation). Use standard `datasets.load_dataset()` for HuggingFace, `pd.read_csv()` for CSV, or `pd.read_json()` for JSON/JSONL. For gated datasets, ensure `HF_TOKEN` is set or run `huggingface-cli login`.
### 3. Explore Structure and Quality
Use standard pandas/datasets methods to explore:
- **Schema**: `ds.features` (HF) or `df.dtypes` (pandas)
- **Shape**: `len(ds)`, `ds.column_names` (HF) or `df.info()`, `df.columns` (pandas)
- **Sample data**: `ds[:3]` (HF) or `df.head()` (pandas)
- **Missing values**: Check for `None`, empty strings, empty lists
- **Duplicates**: Check ID uniqueness if an ID field exists
- **Value distributions**: `value_counts()` for categorical columns, length stats for text fields
For converting an Inspect `Dataset` (which has no `.to_pandas()`) to a DataFrame, see `references/inspect-dataset-patterns.md`.
### 4. Understand the Sample Conversion
Look at the `record_to_sample` function to understand how raw data maps to Inspect samples. Key questions:
- Which fields become `input`? Are they combined/formatted?
- What is the `target` format? (letter, text, JSON, etc.)
- Are there `choices` for multiple choice?
- What goes into `metadata`?
- Are any records filtered out?
### 5. Test the Inspect Loading Pipeline
See `references/inspect-dataset-patterns.md` for the pattern to load through Inspect's `hf_dataset()` and verify sample conversion works correctly.
## Quick Reference Commands
```bash
# View HF dataset info without downloading
uv run python -c "from datasets import load_dataset_builder; b = load_dataset_builder('org/name'); print(b.info)"
# List available configs/subsets
uv run python -c "from datasets import get_dataset_config_names; print(get_dataset_config_names('org/name'))"
# List available splits
uv run python -c "from datasets import load_dataset; print(load_dataset('org/name', split=None).keys())"
```
## Caching and Troubleshooting
For cache locations (HuggingFace native, Inspect AI, Inspect Evals), force re-download commands, and test utilities, see `references/inspect-dataset-patterns.md`.
- **Gated dataset**: Run `huggingface-cli login` or set `HF_TOKEN`
- **Rate limited**: The `hf_dataset` wrapper in `inspect_evals.utils.huggingface` has built-in retry with backoff
- **Large dataset**: Use `streaming=True` or `split="train[:1000]"` for sampling
- **Missing revision**: Check the dataset's "Files and versions" tab on HuggingFace