cluster-documents

Name: cluster-documents
Author: dandye/ai-runbooks

$ npx mdskill add dandye/ai-runbooks/cluster-documents

Group documents by topic using similarity analysis.

Organizes large document collections to find redundancies.
Depends on text normalization and vector embedding generation.
Uses clustering algorithms to group documents by similarity.
Delivers a structured report with optional visualizations.

SKILL.md

.github/skills/cluster-documents

---
name: cluster-documents
description: Automated content similarity and grouping analysis. Groups related documents by topic, purpose, or content similarity.
required_roles:
  scribe: roles/scribe.viewer
personas: [information-architect, data-analyst, researcher]
---

# Document Clustering Skill

Analyze a repository of documents to group them based on content similarity, topic, or purpose. This skill helps organize large collections, identify redundancies, and discover relationships.

## Inputs

- `PATH` - The repository to analyze (e.g., "/repository")
- `SIMILARITY_THRESHOLD` - (Optional) Float (0.0-1.0), threshold for grouping (default: 0.8)
- `VISUALIZATION` - (Optional) Boolean, whether to generate a visual representation (default: false)
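
The inputs above could be captured in a small validated container. This is a minimal sketch: the class name `ClusterInputs` and its field names are hypothetical, and only the defaults and the 0.0-1.0 range come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInputs:
    """Hypothetical holder for the skill's inputs (names are illustrative)."""

    path: str                          # PATH: repository to analyze
    similarity_threshold: float = 0.8  # SIMILARITY_THRESHOLD default
    visualization: bool = False        # VISUALIZATION default

    def __post_init__(self):
        # Reject thresholds outside the documented 0.0-1.0 range.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("SIMILARITY_THRESHOLD must be between 0.0 and 1.0")


inputs = ClusterInputs(path="/repository")
```

Validating the threshold up front keeps an out-of-range value from silently producing one giant cluster or none at all.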

## Workflow

### Step 1: Text Processing

Ingest documents from `PATH`.
- Normalize text (remove stop words, apply stemming or lemmatization).
- Generate embeddings or TF-IDF vectors for each document.
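
As an illustration of this step, the sketch below builds TF-IDF vectors using only the Python standard library. It assumes documents are already loaded into a `{name: text}` dict, uses a deliberately tiny stop-word list, and skips stemming and lemmatization. The function names and the smoothed IDF formula are choices made for this example, not something the skill prescribes.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Map {name: text} to {name: {term: weight}} with L2-normalized weights."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        # Smoothed IDF so terms present in every document keep a nonzero weight.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[name] = {term: w / norm for term, w in vec.items()}
    return vectors
```

Sparse dicts keep the example dependency-free. A production pipeline would more likely use a vectorizer or embedding model from an NLP library.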

### Step 2: Clustering Analysis

Apply clustering algorithms (e.g., K-Means, DBSCAN) to the document vectors.
- Group documents whose similarity meets or exceeds `SIMILARITY_THRESHOLD`.
- Identify outliers or unique documents.
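
One simple way to apply a similarity threshold is sketched below. Note that this is neither K-Means nor DBSCAN: it swaps in a plain single-link grouping, where any two documents with cosine similarity at or above the threshold land in the same cluster, and documents linked to nothing are reported as outliers. The function names and the sparse `{term: weight}` vector format are assumptions for this example.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sum(w * w for w in u.values()) ** 0.5
    norm_v = sum(w * w for w in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_by_threshold(vectors, threshold=0.8):
    """Return (clusters, outliers) using single-link grouping via union-find."""
    parent = {name: name for name in vectors}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Link every pair whose similarity meets the threshold.
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    outliers = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, outliers
```

Single-link grouping can chain loosely related documents together through intermediates, and the pairwise comparison is quadratic in the number of documents, so this is best read as a small-collection illustration.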

### Step 3: Cluster Labeling

Analyze the centroid or representative terms of each cluster to assign a meaningful label (Topic).
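
A minimal sketch of centroid-based labeling follows, assuming each document is a sparse `{term: weight}` vector. The helper name `label_cluster` and the `top_n` parameter are hypothetical. The label is simply the highest-weighted centroid terms joined together.

```python
from collections import defaultdict


def label_cluster(member_vectors, top_n=3):
    """Label a cluster with the highest-weighted terms of its centroid."""
    centroid = defaultdict(float)
    for vec in member_vectors:
        for term, weight in vec.items():
            # Averaging the member vectors gives the cluster centroid.
            centroid[term] += weight / len(member_vectors)
    # Sort by descending weight, breaking ties alphabetically for stable output.
    ranked = sorted(centroid.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(term for term, _ in ranked[:top_n])
```

Term lists like this are only a starting point. A human reviewer or a summarization step may still be needed to turn them into a readable topic name.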

### Step 4: Output Generation

Generate the clustering report.
- If `VISUALIZATION` is true, generate scatter-plot coordinates or dendrogram data.

## Required Outputs

A `CLUSTERING_REPORT` object containing:
- **Cluster List**: ID, Label, and List of Documents in each cluster.
- **Redundancy Report**: Sets of highly similar documents (potential duplicates).
- **Visualization Data**: (If requested) Coordinates for plotting.
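
One possible shape for this object is sketched below as Python dataclasses serialized to JSON. The class and field names are hypothetical, the sample file names are made up for illustration, and the skill does not specify an exact schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Cluster:
    id: int
    label: str
    documents: list


@dataclass
class ClusteringReport:
    clusters: list = field(default_factory=list)           # Cluster List
    redundancy_report: list = field(default_factory=list)  # sets of near-duplicates
    visualization_data: Optional[dict] = None               # only when requested


# Illustrative values only; these documents do not exist anywhere.
report = ClusteringReport(
    clusters=[Cluster(id=0, label="kubernetes, helm", documents=["deploy.md", "helm.md"])],
    redundancy_report=[["deploy.md", "deploy-copy.md"]],
)
report_json = json.dumps(asdict(report), indent=2)
```

Leaving `visualization_data` as `None` when `VISUALIZATION` is false makes it explicit that no coordinates were produced.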

## Quick Reference

- **Purpose**: Organize unstructured content and find duplicates.
- **Techniques**: Text Mining, NLP, Vector Space Models.