cluster-documents

$ npx mdskill add dandye/ai-runbooks/cluster-documents

Group documents by topic using similarity analysis.

  • Organizes large document collections to find redundancies.
  • Depends on text normalization and vector embedding generation.
  • Uses clustering algorithms to group documents by similarity.
  • Delivers a structured report with optional visualizations.
SKILL.md
.github/skills/cluster-documents
---
name: cluster-documents
description: Automated content similarity and grouping analysis. Groups related documents by topic, purpose, or content similarity.
required_roles:
  scribe: roles/scribe.viewer
personas: [information-architect, data-analyst, researcher]
---

# Document Clustering Skill

Analyze a repository of documents to group them based on content similarity, topic, or purpose. This skill helps organize large collections, identify redundancies, and discover relationships.

## Inputs

- `PATH` - The repository to analyze (e.g., "/repository")
- `SIMILARITY_THRESHOLD` - (Optional) Float in the range 0.0-1.0, the minimum similarity required to group documents (default: 0.8)
- `VISUALIZATION` - (Optional) Boolean, whether to generate a visual representation (default: false)
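
The inputs above could be captured and validated as in the following minimal sketch. The `ClusterInputs` class and its field names are hypothetical, introduced only for illustration; the defaults and the 0.0-1.0 range come from the list above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ClusterInputs:
    """Hypothetical container for the three skill inputs."""

    path: Path
    similarity_threshold: float = 0.8  # documented default
    visualization: bool = False        # documented default

    def __post_init__(self) -> None:
        # The skill documents the threshold as a float between 0.0 and 1.0.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError(
                "SIMILARITY_THRESHOLD must be between 0.0 and 1.0, "
                f"got {self.similarity_threshold}"
            )


# Only PATH is required; the optional inputs fall back to their defaults.
inputs = ClusterInputs(path=Path("/repository"))
```

Rejecting an out-of-range threshold up front avoids silently producing one giant cluster or none at all later in the workflow.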

## Workflow

### Step 1: Text Processing

Ingest documents from `PATH`.
- Normalize text (remove stop words, apply stemming or lemmatization).
- Generate embeddings or TF-IDF vectors for each document.
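
As an illustration of this step, here is a minimal standard-library sketch of the TF-IDF option. The function names, the tiny stop-word list, the smoothed IDF formula, and the sample documents are all assumptions made for the example; stemming and lemmatization are omitted, and a real pipeline might use an NLP library or an embedding model instead.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def normalize(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return one sparse, L2-normalized TF-IDF vector per document."""
    tokenized = {name: normalize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        # Smoothed IDF keeps a nonzero weight for terms found in every document.
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[name] = {t: w / norm for t, w in vec.items()}
    return vectors


# Hypothetical sample documents standing in for files read from PATH.
docs = {
    "a.md": "The firewall rules for the network.",
    "b.md": "Network firewall rules and policies.",
    "c.md": "Recipe for baking bread.",
}
vectors = tfidf_vectors(docs)
```

Normalizing each vector to unit length means the cosine similarity between two documents reduces to a plain dot product.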

### Step 2: Clustering Analysis

Apply a clustering algorithm (e.g., K-Means or DBSCAN) to the document vectors.
- Group documents whose similarity meets or exceeds `SIMILARITY_THRESHOLD`.
- Flag outliers: documents that do not fit any cluster.
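
One way to apply a similarity threshold directly is sketched below. Note that this is neither K-Means nor DBSCAN: it is a simple single-link grouping over cosine similarity, chosen here only because it maps cleanly onto `SIMILARITY_THRESHOLD`. The function names and the hand-written sample vectors are assumptions.

```python
from itertools import combinations


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sum(w * w for w in u.values()) ** 0.5
    norm_v = sum(w * w for w in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_by_threshold(vectors, threshold=0.8):
    """Single-link grouping: link any pair with similarity >= threshold."""
    parent = {name: name for name in vectors}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    # Groups of one are reported as outliers rather than clusters.
    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    outliers = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, outliers


# Hypothetical document vectors (term -> weight).
vectors = {
    "a.md": {"firewall": 1.0, "rules": 1.0},
    "b.md": {"firewall": 1.0, "rules": 0.9},
    "c.md": {"bread": 1.0, "recipe": 1.0},
}
clusters, outliers = cluster_by_threshold(vectors, threshold=0.8)
```

Single-link grouping can chain loosely related documents together through intermediaries, so a density-based method such as DBSCAN may be a better fit for large or noisy collections.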

### Step 3: Cluster Labeling

Analyze the centroid or representative terms of each cluster to assign a meaningful label (Topic).
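
A minimal sketch of centroid-based labeling, assuming each document is a sparse term-weight vector: average the member vectors and take the heaviest terms as the label. The `label_cluster` name, the `top_n` parameter, and the sample weights are hypothetical.

```python
from collections import defaultdict


def label_cluster(member_vectors: list[dict[str, float]], top_n: int = 3) -> str:
    """Average the member vectors into a centroid and join its heaviest terms."""
    centroid = defaultdict(float)
    for vec in member_vectors:
        for term, weight in vec.items():
            centroid[term] += weight / len(member_vectors)
    # Sort by descending weight, then alphabetically so ties are deterministic.
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return ", ".join(term for term, _ in ranked[:top_n])


# Two hypothetical member documents of one cluster.
label = label_cluster(
    [
        {"firewall": 0.8, "rules": 0.5, "network": 0.3},
        {"firewall": 0.7, "rules": 0.6, "policy": 0.2},
    ],
    top_n=2,
)
```

Term lists like this are a starting point; a human reviewer or a language model could turn them into a more readable topic name.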

### Step 4: Output Generation

Generate the clustering report.
- If `VISUALIZATION` is true, also generate data for a scatter plot or dendrogram.

## Required Outputs

A `CLUSTERING_REPORT` object containing:
- **Cluster List**: ID, Label, and List of Documents in each cluster.
- **Redundancy Report**: Sets of highly similar documents (potential duplicates).
- **Visualization Data**: (If requested) Coordinates for plotting.
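
The shape of such a report might look like the sketch below. The class and field names are assumptions chosen to mirror the three items above; the skill itself does not prescribe a schema or a serialization format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Cluster:
    id: int
    label: str
    documents: list[str]


@dataclass
class ClusteringReport:
    """Hypothetical layout for the CLUSTERING_REPORT object."""

    clusters: list[Cluster]
    # Each entry is one set of highly similar documents (potential duplicates).
    redundancy: list[list[str]] = field(default_factory=list)
    # Document name -> (x, y) coordinates; None unless VISUALIZATION is true.
    visualization: Optional[dict[str, tuple[float, float]]] = None


# Illustrative values only, not real analysis results.
report = ClusteringReport(
    clusters=[Cluster(id=0, label="firewall, rules", documents=["a.md", "b.md"])],
    redundancy=[["a.md", "b.md"]],
)
report_json = json.dumps(asdict(report), indent=2)
```

Keeping the visualization field optional lets the same structure serve both the default run and runs where plotting data was requested.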

## Quick Reference

- **Purpose**: Organize unstructured content and find duplicates.
- **Techniques**: Text Mining, NLP, Vector Space Models.