exploring-codebases

$ npx mdskill add oaustegard/claude-skills/exploring-codebases

Orchestrates structural and semantic analysis to orient agents on unfamiliar codebases.

  • Helps agents understand repository scope, language distribution, and feature intent.
  • Integrates tree-sitting for structural inventory and featuring for semantic documentation.
  • Decides analysis depth by distinguishing between broad exploration and targeted queries.
  • Delivers progressive disclosure through directory overviews and feature synthesis reports.

SKILL.md

.github/skills/exploring-codebases

---
name: exploring-codebases
description: >-
  First-encounter codebase orientation. Chains tree-sitting (structural
  inventory) and featuring (feature synthesis) into an EDA workflow for
  unfamiliar repositories. Use when someone says "explore this repo",
  "what does this do", "I just cloned this", "help me understand this
  codebase", or when starting work on an unfamiliar repository. This is
  the divergent "what's here?" skill — for targeted "where is X?" queries,
  use searching-codebases instead.
metadata:
  version: 1.0.0
---

# Exploring Codebases

Exploratory code analysis for unfamiliar repositories. This skill is a
**workflow**, not a tool — it orchestrates tree-sitting (structural) and
featuring (semantic) into a progressive disclosure sequence.

## Dependencies

- **tree-sitting** — AST-powered code navigation (structural inventory)
- **featuring** — Feature documentation generator (what/why layer)

```bash
uv venv /home/claude/.venv 2>/dev/null
uv pip install tree-sitter-language-pack fastmcp --python /home/claude/.venv/bin/python
```

## Workflow

### Phase 1: Structural Inventory (tree-sitting)

Get oriented — what's here, how big, what languages?

```bash
cd /mnt/skills/user/tree-sitting/scripts
/home/claude/.venv/bin/python -c "
import sys; sys.path.insert(0, '.')
from engine import cache

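# One-time scan: builds the index that the overview and later queries read
stats = cache.scan('/path/to/repo')
# Directory tree with per-directory file counts, symbol counts, and languages
print(cache.tree_overview())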
"
```

This gives you the directory tree with file counts, symbol counts, and
languages per directory. The initial scan takes ~700ms for a 250-file
repo; all subsequent queries are sub-millisecond.
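
If the tree-sitting engine is not available yet, a rough file-type census can still answer "how big, what languages?". The sketch below uses only the standard library and is not part of tree-sitting; the function name `extension_counts` and the `SKIP_DIRS` set are assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path

# Hypothetical skip list for illustration; adjust to the repo at hand.
SKIP_DIRS = {".git", "node_modules", ".venv", "__pycache__"}


def extension_counts(root: str) -> Counter:
    """Count files per extension under root, ignoring SKIP_DIRS."""
    counts: Counter = Counter()
    root_path = Path(root)
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        # Skip anything that sits below an excluded directory.
        if SKIP_DIRS.intersection(path.relative_to(root_path).parts[:-1]):
            continue
        counts[path.suffix or "(no extension)"] += 1
    return counts
```

Calling `extension_counts('/path/to/repo').most_common(10)` lists the ten most frequent extensions, which is usually enough to guess the dominant languages before running the full scan.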

### Phase 2: Drill Into Structure

Follow what looks interesting. Use tree-sitting queries to build understanding:

```bash
/home/claude/.venv/bin/python -c "
import sys; sys.path.insert(0, '/mnt/skills/user/tree-sitting/scripts')
from engine import cache

# Already scanned — these are instant
print(cache.dir_overview('src/core'))       # Files + top symbols in a directory
print(cache.find_symbol('*Handler*'))       # Glob search across codebase
print(cache.file_symbols('src/api/routes.py'))  # Full API of a single file
print(cache.get_source('handle_request'))   # Read a specific implementation
"
```

**Heuristics for what to drill into first:**
- Directories with high symbol counts relative to file counts (dense logic)
- Entry point patterns: `main`, `cli`, `app`, `server`, `routes`, `handler`
- Files with many imports (integration points)
- The root directory's top-level files (often config + entry points)
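
The first two heuristics can be sketched as small helpers. This is an illustration only: the `dir_stats` shape (directory mapped to file count and symbol count) is hypothetical and would be filled in by hand from the overview output, and neither function is part of tree-sitting.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Entry-point name patterns taken from the heuristics list above.
ENTRY_PATTERNS = ["main*", "cli*", "app*", "server*", "routes*", "handler*"]


def rank_by_density(dir_stats: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Sort directories by symbols per file, densest first.

    dir_stats maps directory -> (file_count, symbol_count); this shape is
    an assumption, not something the engine is known to return.
    """
    ranked = [
        (name, symbols / files)
        for name, (files, symbols) in dir_stats.items()
        if files > 0  # avoid dividing by zero for empty directories
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def looks_like_entry_point(filename: str) -> bool:
    """True if the file's base name starts with an entry-point pattern."""
    stem = PurePosixPath(filename).stem.lower()
    return any(fnmatchcase(stem, pattern) for pattern in ENTRY_PATTERNS)
```

The prefix match is deliberately loose (`application.py` matches `app*`), so treat hits as candidates to inspect rather than confirmed entry points.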

### Phase 3: Feature Synthesis (featuring)

Once you understand the structure, generate the "what does it DO?" layer:

```bash
/home/claude/.venv/bin/python /mnt/skills/user/featuring/scripts/gather.py /path/to/repo \
  --skip tests,.github,node_modules --source-budget 8000
```

Read the gather output, then synthesize `_FEATURES.md` following the featuring
skill's format. This is the LLM step — identify capabilities, group symbols
into features, write user-facing descriptions.

### Phase 4: Targeted Deep Dives

With structural inventory + feature map in hand, use tree-sitting's
`get_source()` to read specific implementations where the feature
narrative needs verification or where behavior isn't clear from signatures.

```bash
/home/claude/.venv/bin/python -c "
import sys; sys.path.insert(0, '/mnt/skills/user/tree-sitting/scripts')
from engine import cache

# Read implementations that matter
print(cache.get_source('authenticate'))
print(cache.references('AuthToken'))
"
```

## When to Use This vs Other Skills

| Situation | Use |
|-----------|-----|
| "I just cloned this, what is it?" | **exploring-codebases** (this skill) |
| "Where is the retry logic?" | searching-codebases |
| "Find all files matching `class.*Error`" | searching-codebases |
| "Show me the symbols in auth.py" | tree-sitting directly |
| "Document what this codebase does" | featuring directly |

Exploring is the **divergent** skill — you don't know what you're looking
for yet. Searching is the **convergent** skill — you know what you want,
you need to find it.

## Output

The exploration produces understanding, not necessarily files. Its
outputs, when warranted, are:

- `_FEATURES.md` — top-down feature documentation (via featuring)
- Mental model of codebase structure, entry points, and architecture

## Scaling

For large repos (>100 files), use `--skip` aggressively in Phase 1 to
exclude tests, vendored code, generated files, and docs. Focus the initial
scan on `src/` or the primary source directory. Expand scope as needed.
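
One way to choose what to exclude is to check which commonly skipped directories actually exist at the top level of the repo. The sketch below assumes a hand-picked name set (`COMMON_SKIP_NAMES`, hypothetical) and returns a comma-separated string in the same form as the `--skip tests,.github,node_modules` value shown in Phase 3.

```python
from pathlib import Path

# Hypothetical set of directory names that commonly hold tests, vendored
# code, generated output, or docs. Extend it for the repo being explored.
COMMON_SKIP_NAMES = {
    "tests", "test", "vendor", "third_party", "node_modules",
    "dist", "build", "docs", ".github",
}


def skip_argument(repo: str) -> str:
    """Return a comma-separated list of top-level directories to skip."""
    present = sorted(
        entry.name
        for entry in Path(repo).iterdir()
        if entry.is_dir() and entry.name in COMMON_SKIP_NAMES
    )
    return ",".join(present)
```

Only top-level directories are checked, so nested vendored code still needs to be spotted from the directory overview.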

For monorepos, treat each package/service as a separate exploration.
Generate per-subsystem `_FEATURES.md` files linked from a root index.
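
A root index can be assembled mechanically once the per-subsystem files exist. The sketch below is one possible layout, not a format defined by the featuring skill: the heading text and the `build_root_index` name are assumptions.

```python
from pathlib import Path


def build_root_index(repo: str, filename: str = "_FEATURES.md") -> str:
    """Return markdown linking every per-subsystem feature file under repo."""
    root = Path(repo)
    lines = ["# Feature Index", ""]
    for feature_file in sorted(root.rglob(filename)):
        if feature_file.parent == root:
            continue  # a root-level file is not a subsystem entry
        rel = feature_file.relative_to(root).as_posix()
        subsystem = feature_file.parent.relative_to(root).as_posix()
        lines.append(f"- [{subsystem}]({rel})")
    return "\n".join(lines) + "\n"
```

The function only returns the markdown text; where to write it (for example a root-level `_FEATURES.md`) is left to the caller.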
