searching-codebases

$npx mdskill add oaustegard/claude-skills/searching-codebases

Search codebases instantly using regex or natural language.

  • Locate implementations and patterns across local or remote repositories.
  • Requires ripgrep and optionally tree-sitting for structural context.
  • Routes queries automatically between regex and semantic search modes.
  • Returns expanded function definitions and direct code locations.
SKILL.md
.github/skills/searching-codebasesView on GitHub ↗
---
name: searching-codebases
description: >-
  Find code by regex pattern or natural language concept in any codebase.
  Auto-routes between n-gram indexed regex search (2-20x faster than ripgrep)
  and TF-IDF semantic search. Expands results to full functions via tree-sitting
  AST data. Accepts GitHub URLs, local directories, uploaded files/archives, or
  project knowledge. Use when asked to find implementations, search for patterns,
  or answer "where is X" / "how does Y work" about code. Triggers on "search
  this repo", "find where X is", "grep for", "what handles Y", regex patterns,
  or natural-language questions about code. This is the convergent "find X" skill
  — for first-encounter orientation, use exploring-codebases instead.
metadata:
  version: 2.0.0
---

# Searching Codebases

Find code in any codebase by pattern or concept. One entry point, two
search strategies, automatic routing.

## Prerequisites

```bash
uv tool install ripgrep
```

tree-sitting (for structural context expansion) installs automatically when
the `--expand` flag is used.

## Primary Command

```bash
SKILL_DIR=/mnt/skills/user/searching-codebases

python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]
```

SOURCE is any of:
- Local directory path
- GitHub URL (downloads tarball automatically)
- `uploads` (uses `/mnt/user-data/uploads/`)
- `project` (uses `/mnt/project/`)
- Path to a `.zip` or `.tar.gz` archive

## Search Modes

**Regex mode** (patterns, identifiers, literal text):
```bash
python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"
```

**Semantic mode** (concepts, natural language):
```bash
python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"
```

Auto-detection: short queries and code-like tokens → regex. Multi-word
natural language → semantic. Override with `--regex` or `--semantic`.

## Options

- `--regex` / `--semantic`: Force search mode
- `--expand`: Return full function bodies via tree-sitting AST context
- `--benchmark`: Compare indexed regex vs brute-force ripgrep
- `--branch NAME`: Git branch for GitHub URLs (default: main)
- `--skip DIRS`: Comma-separated directories to skip
- `--json`: Machine-readable output
- `-v`: Show index stats and query routing decisions

## How It Works

**Regex search** builds a sparse n-gram inverted index over all files.
Queries are decomposed into literal fragments, looked up in the index
to identify candidate files (typically 90-99% reduction), then verified
with ripgrep. Frequency-weighted n-grams make rare character sequences
more selective.

**Semantic search** builds a TF-IDF index over code chunks (functions,
classes, structural entries). Queries are ranked by cosine similarity.

**Context expansion** (`--expand`) uses tree-sitting's AST cache to
identify function/class boundaries, returning complete structural units
rather than line fragments. On first use, tree-sitting scans the repo
(~700ms for 250 files); subsequent expansions are sub-millisecond.

**Small codebases** (< 20 files) skip indexing entirely — direct ripgrep is
faster when there's nothing to narrow.

## Mixed Queries

Multiple queries can use different modes in a single invocation. Each query
is auto-routed independently, and indexes are built once per mode:

```bash
python3 $SKILL_DIR/scripts/search.py ./repo \
  "class.*Error" \
  "error recovery strategy" \
  "def retry"
```

## Dependencies

- **tree-sitting**: Provides AST-based context expansion for `--expand`.
  Not required — search works without it, just with less structural context
  in results.
- **ripgrep**: Required for regex verification. Install via `uv tool install ripgrep`.
- **scikit-learn**: Required for semantic mode. Installs automatically.

## When to Use

- **Known target**: "where is the retry logic?", "find all error handlers"
- **Pattern matching**: regex across large codebases with indexed speedup
- **Concept search**: "authentication flow", "database connection pooling"
- **Cross-reference**: find all callers/users of a specific function

## When NOT to Use

- **First encounter**: "what does this repo do?" → use exploring-codebases
- **Repos under ~10 files**: just read them directly
- **Exact symbol lookup**: `find_symbol('ClassName')` via tree-sitting is simpler
- **Structural overview**: use tree-sitting's `tree_overview()` / `dir_overview()`

## Files

- `scripts/search.py` — Entry point, query routing, output formatting
- `scripts/resolve.py` — Input source resolution (GitHub, uploads, archives)
- `scripts/context.py` — tree-sitting-based AST context expansion
- `scripts/ngram_index.py` — Sparse n-gram inverted index, regex decomposition
- `scripts/sparse_ngrams.py` — Core n-gram algorithms, frequency weights
- `scripts/code_rag.py` — TF-IDF semantic search over code chunks
More from oaustegard/claude-skills