searching-codebases
$ npx mdskill add oaustegard/claude-skills/searching-codebases
Search codebases instantly using regex or natural language.
- Locate implementations and patterns across local or remote repositories.
- Requires ripgrep and optionally tree-sitting for structural context.
- Routes queries automatically between regex and semantic search modes.
- Returns expanded function definitions and direct code locations.
SKILL.md
.github/skills/searching-codebases
---
name: searching-codebases
description: >-
  Find code by regex pattern or natural language concept in any codebase.
  Auto-routes between n-gram indexed regex search (2-20x faster than ripgrep)
  and TF-IDF semantic search. Expands results to full functions via tree-sitting
  AST data. Accepts GitHub URLs, local directories, uploaded files/archives, or
  project knowledge. Use when asked to find implementations, search for patterns,
  or answer "where is X" / "how does Y work" about code. Triggers on "search
  this repo", "find where X is", "grep for", "what handles Y", regex patterns,
  or natural-language questions about code. This is the convergent "find X" skill;
  for first-encounter orientation, use exploring-codebases instead.
metadata:
  version: 2.0.0
---
# Searching Codebases
Find code in any codebase by pattern or concept. One entry point, two
search strategies, automatic routing.
## Prerequisites
```bash
uv tool install ripgrep
```
tree-sitting (for structural context expansion) installs automatically when
the `--expand` flag is used.
## Primary Command
```bash
SKILL_DIR=/mnt/skills/user/searching-codebases
python3 $SKILL_DIR/scripts/search.py SOURCE "query1" ["query2" ...] [OPTIONS]
```
SOURCE is any of:
- Local directory path
- GitHub URL (downloads tarball automatically)
- `uploads` (uses `/mnt/user-data/uploads/`)
- `project` (uses `/mnt/project/`)
- Path to a `.zip` or `.tar.gz` archive
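The SOURCE forms listed above can be told apart with a few string checks. The following is only an illustrative sketch: `classify_source` and its return labels are hypothetical, and the actual resolution logic in `scripts/resolve.py` may differ.

```python
def classify_source(source: str) -> str:
    """Guess which kind of SOURCE was given (hypothetical helper, not the skill's API)."""
    if source == "uploads":
        return "uploads"      # maps to /mnt/user-data/uploads/
    if source == "project":
        return "project"      # maps to /mnt/project/
    if source.startswith(("https://github.com/", "http://github.com/")):
        return "github"       # tarball download would happen here
    if source.endswith((".zip", ".tar.gz")):
        return "archive"      # needs extraction before searching
    return "directory"        # anything else is treated as a local path


if __name__ == "__main__":
    for s in ["./repo", "uploads", "https://github.com/oaustegard/claude-skills", "src.tar.gz"]:
        print(s, "->", classify_source(s))
```

Checking the two keywords first means a local directory that happens to be named `uploads` would need to be passed as `./uploads` under this sketch.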
## Search Modes
**Regex mode** (patterns, identifiers, literal text):
```bash
python3 $SKILL_DIR/scripts/search.py ./repo "def handle_error"
python3 $SKILL_DIR/scripts/search.py ./repo "class.*Exception" --regex
python3 $SKILL_DIR/scripts/search.py ./repo "TODO|FIXME|HACK"
```
**Semantic mode** (concepts, natural language):
```bash
python3 $SKILL_DIR/scripts/search.py ./repo "retry logic with backoff" --semantic
python3 $SKILL_DIR/scripts/search.py ./repo "authentication flow"
python3 $SKILL_DIR/scripts/search.py ./repo "error handling strategy"
```
Auto-detection: short queries and code-like tokens → regex. Multi-word
natural language → semantic. Override with `--regex` or `--semantic`.
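To make the routing rule concrete, here is a minimal sketch of such a heuristic. It is an assumption about how auto-detection could work, not the code in `scripts/search.py`; `route_query`, `CODE_HINTS`, and `CODE_KEYWORDS` are hypothetical names.

```python
import re

# Characters and digraphs that are common in regexes and code but rare in plain English.
CODE_HINTS = re.compile(r"[\\|*+()\[\]{}^$_]|::|->")
# Leading keywords that suggest the query is a code fragment (illustrative list).
CODE_KEYWORDS = {"def", "class", "function", "import"}


def route_query(query: str) -> str:
    """Return "regex" or "semantic" for a query (illustrative heuristic only)."""
    words = query.split()
    if CODE_HINTS.search(query):
        return "regex"        # regex metacharacters or identifier-like tokens
    if len(words) <= 1:
        return "regex"        # a single token is treated as a literal or identifier
    if words[0] in CODE_KEYWORDS:
        return "regex"        # e.g. "def retry"
    return "semantic"         # multi-word natural language


if __name__ == "__main__":
    for q in ["def handle_error", "class.*Exception", "authentication flow"]:
        print(f"{q!r} -> {route_query(q)}")
```

Any heuristic of this kind will misroute some queries, which is why the `--regex` and `--semantic` overrides exist.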
## Options
- `--regex` / `--semantic`: Force search mode
- `--expand`: Return full function bodies via tree-sitting AST context
- `--benchmark`: Compare indexed regex vs brute-force ripgrep
- `--branch NAME`: Git branch for GitHub URLs (default: main)
- `--skip DIRS`: Comma-separated directories to skip
- `--json`: Machine-readable output
- `-v`: Show index stats and query routing decisions
## How It Works
**Regex search** builds a sparse n-gram inverted index over all files.
Queries are decomposed into literal fragments, looked up in the index
to identify candidate files (typically 90-99% reduction), then verified
with ripgrep. Frequency-weighted n-grams make rare character sequences
more selective.
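The index-then-verify idea can be sketched with a plain trigram index. This is a simplification: it uses fixed-length trigrams instead of the sparse, frequency-weighted n-grams described above, and Python's `re` stands in for ripgrep in the verification step. All function names are hypothetical.

```python
import re
from collections import defaultdict

BAIL = set("|()[]{}\\")   # constructs this sketch does not analyse: no narrowing
OPTIONAL = set("*?")      # quantifiers that make the preceding character optional
BREAK = set(".^$+")       # metacharacters that end a literal run


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file paths containing it."""
    index = defaultdict(set)
    for path, content in files.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def required_literals(pattern, min_len=3):
    """Literal runs that every match must contain (conservative; may return [])."""
    if any(ch in BAIL for ch in pattern):
        return []
    runs, current = [], []
    for ch in pattern:
        if ch in OPTIONAL:
            if current:
                current.pop()             # that character may occur zero times
            runs.append("".join(current))
            current = []
        elif ch in BREAK:
            runs.append("".join(current))
            current = []
        else:
            current.append(ch)
    runs.append("".join(current))
    return [run for run in runs if len(run) >= min_len]


def candidate_files(index, paths, pattern):
    """Keep only files that contain every trigram of every required literal."""
    candidates = set(paths)
    for literal in required_literals(pattern):
        for gram in trigrams(literal):
            candidates &= index.get(gram, set())
    return candidates


def search(files, pattern):
    """Narrow with the index, then verify each candidate line by line."""
    index = build_index(files)
    regex = re.compile(pattern)
    hits = []
    for path in sorted(candidate_files(index, files, pattern)):
        for lineno, line in enumerate(files[path].splitlines(), 1):
            if regex.search(line):
                hits.append((path, lineno))
    return hits
```

The index can only rule files out, never in, so a pattern the decomposer cannot analyse (alternation, groups, character classes) falls back to scanning every file rather than risking a missed match.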
**Semantic search** builds a TF-IDF index over code chunks (functions,
classes, structural entries). Queries are ranked by cosine similarity.
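The ranking step can be illustrated in pure Python. This is a sketch under stated assumptions: the skill itself relies on scikit-learn, whereas this version computes smoothed TF-IDF weights and cosine similarity by hand so it runs with the standard library only. The tokenizer and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-cased alphabetic tokens; snake_case splits on the underscore."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed IDF
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(chunks, query):
    """Return chunk names ordered by similarity to the query, dropping zero scores."""
    names = list(chunks)
    vectors, idf = tfidf_vectors([tokenize(chunks[name]) for name in names])
    # Query terms never seen in the corpus carry no weight.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, vec), name) for name, vec in zip(names, vectors)]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


if __name__ == "__main__":
    chunks = {
        "retry_with_backoff": "def retry_with_backoff(fn): sleep backoff retry attempts",
        "parse_config": "def parse_config(path): load yaml config file",
    }
    print(rank(chunks, "retry logic with backoff"))
```

Because TF-IDF matches on shared vocabulary, a query only finds chunks whose identifiers, comments, or strings use at least one of its words.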
**Context expansion** (`--expand`) uses tree-sitting's AST cache to
identify function/class boundaries, returning complete structural units
rather than line fragments. On first use, tree-sitting scans the repo
(~700ms for 250 files); subsequent expansions are sub-millisecond.
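The effect of expansion, turning a matching line into its enclosing structural unit, can be shown for Python sources with the standard-library `ast` module. This is a stand-in for illustration only: it does not use tree-sitting or its cache, and `enclosing_definition` is a hypothetical name.

```python
import ast
from typing import Optional


def enclosing_definition(source: str, lineno: int) -> Optional[str]:
    """Return the text of the innermost function or class containing a 1-based line."""
    tree = ast.parse(source)
    best = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.lineno <= lineno <= node.end_lineno:
                # Among all enclosing definitions, the innermost one starts last.
                if best is None or node.lineno > best.lineno:
                    best = node
    if best is None:
        return None           # the line is at module level
    lines = source.splitlines()
    return "\n".join(lines[best.lineno - 1:best.end_lineno])


if __name__ == "__main__":
    code = "class A:\n    def f(self):\n        x = 1\n        return x\n"
    print(enclosing_definition(code, 3))
```

This sketch reparses the file on every call; caching the parsed tree per file is what would make repeated expansions cheap, in the spirit of the AST cache described above.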
**Small codebases** (< 20 files) skip indexing entirely — direct ripgrep is
faster when there's nothing to narrow.
## Mixed Queries
Multiple queries can use different modes in a single invocation. Each query
is auto-routed independently, and indexes are built once per mode:
```bash
python3 $SKILL_DIR/scripts/search.py ./repo \
  "class.*Error" \
  "error recovery strategy" \
  "def retry"
```
## Dependencies
- **tree-sitting**: Provides AST-based context expansion for `--expand`.
Not required — search works without it, just with less structural context
in results.
- **ripgrep**: Required for regex verification. Install via `uv tool install ripgrep`.
- **scikit-learn**: Required for semantic mode. Installs automatically.
## When to Use
- **Known target**: "where is the retry logic?", "find all error handlers"
- **Pattern matching**: regex across large codebases with indexed speedup
- **Concept search**: "authentication flow", "database connection pooling"
- **Cross-reference**: find all callers/users of a specific function
## When NOT to Use
- **First encounter**: "what does this repo do?" → use exploring-codebases
- **Repos under ~10 files**: just read them directly
- **Exact symbol lookup**: `find_symbol('ClassName')` via tree-sitting is simpler
- **Structural overview**: use tree-sitting's `tree_overview()` / `dir_overview()`
## Files
- `scripts/search.py` — Entry point, query routing, output formatting
- `scripts/resolve.py` — Input source resolution (GitHub, uploads, archives)
- `scripts/context.py` — tree-sitting-based AST context expansion
- `scripts/ngram_index.py` — Sparse n-gram inverted index, regex decomposition
- `scripts/sparse_ngrams.py` — Core n-gram algorithms, frequency weights
- `scripts/code_rag.py` — TF-IDF semantic search over code chunks