code-clone-assistant

$npx mdskill add terrylica/cc-skills/code-clone-assistant

Detects and refactors code duplication using PMD CPD and Semgrep

  • Identifies code clones, DRY violations, and copy-paste duplication
  • Uses PMD CPD for exact matches and Semgrep for pattern-based clones
  • Analyzes token and AST patterns across files to find similar or identical code
  • Provides actionable refactoring suggestions and highlights duplicate locations
SKILL.md
.github/skills/code-clone-assistantView on GitHub ↗
---
name: code-clone-assistant
description: Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.
allowed-tools: Read, Grep, Bash, Edit, Write
---

# Code Clone Assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

## Tools

- **PMD CPD v7.17.0+**: Exact duplicate detection
- **Semgrep v1.140.0+**: Pattern-based detection

**Tested**: October 2025 - 30 violations detected across 3 sample files
**Coverage**: ~3x more violations than using either tool alone

---

## When to Use This Skill

Use this skill when:

- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code

---

## Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect       | PMD CPD                          | Semgrep                          |
| ------------ | -------------------------------- | -------------------------------- |
| **Detects**  | Exact copy-paste duplicates      | Similar patterns with variations |
| **Scope**    | Across files ✅                  | Within/across files (Pro only)   |
| **Matching** | Token-based (ignores formatting) | Pattern-based (AST matching)     |
| **Rules**    | ❌ No custom rules               | ✅ Custom rules                  |

**Result**: Using both finds ~3x more DRY violations.

### Clone Types

| Type   | Description                     | PMD CPD         | Semgrep     |
| ------ | ------------------------------- | --------------- | ----------- |
| Type-1 | Exact copies                    | ✅ Default      | ✅          |
| Type-2 | Renamed identifiers             | ✅ `--ignore-*` | ✅          |
| Type-3 | Near-miss with variations       | ⚠️ Partial      | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌              | ❌          |

---

## Quick Start Workflow

```bash
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```

---

---

## Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, **always check for accepted exceptions before recommending refactoring**.

### When Duplication Is Acceptable

| Pattern                                         | Why Acceptable                                                                                                                                                                                                    | Example                                                            |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **Generation-per-directory experiments**        | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible.                                                    | SQL templates, sweep scripts where each `gen{NNN}/` is independent |
| **SQL templates with placeholder substitution** | SQL has no import/include mechanism. Templates use `sed` placeholder replacement (`__PLACEHOLDER__`), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| **Protocol/schema boilerplate**                 | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract.                                                                           | NDJSON telemetry line construction in wrapper scripts              |
| **Test fixtures and golden files**              | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies.                                                                              | Test setup code, expected output snapshots                         |

### How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:

1. **Report it** — always show the user what was found (lines, tokens, files)
2. **Flag as accepted** — explicitly state it matches a known exception pattern
3. **Explain why** — cite the specific reason refactoring is not recommended
4. **Do NOT recommend refactoring** — this is the key difference from actionable findings

**Example output format**:

```
Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.

  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
```

### Project-Level Exception Configuration

Projects can declare accepted exception patterns in their `CLAUDE.md`:

```markdown
## Code Clone Exceptions

- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation
```

When this section exists in a project's `CLAUDE.md`, the code-clone-assistant should check it before classifying findings.

---

## Reference Documentation

For detailed information, see:

- [Detection Commands](./references/detection-commands.md) - PMD CPD and Semgrep command details
- [Complete Workflow](./references/complete-workflow.md) - Detection, analysis, and presentation phases
- [Refactoring Strategies](./references/refactoring-strategies.md) - Approaches for addressing violations

---

## Troubleshooting

| Issue                      | Cause                        | Solution                                         |
| -------------------------- | ---------------------------- | ------------------------------------------------ |
| PMD CPD not found          | Not installed or not in PATH | `brew install pmd` or download from PMD releases |
| Semgrep timeout            | Large codebase scan          | Use `--exclude` to limit scope                   |
| No duplicates detected     | minimum-tokens too high      | Lower `--minimum-tokens` value (try 15)          |
| Too many false positives   | minimum-tokens too low       | Increase `--minimum-tokens` (try 30+)            |
| Language not recognized    | Wrong `-l` flag              | Check PMD CPD supported languages list           |
| SARIF parse error          | Semgrep output malformed     | Upgrade Semgrep to latest version                |
| Memory error on large repo | Java heap too small          | Set `PMD_JAVA_OPTS=-Xmx4g`                       |
| Missing clone rules file   | Custom rules not created     | Create `clone-rules.yaml` or use default config  |


## Post-Execution Reflection

After this skill completes, check before closing:

1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.
More from terrylica/cc-skills