code-clone-assistant
$
npx mdskill add terrylica/cc-skills/code-clone-assistantDetects and refactors code duplication using PMD CPD and Semgrep
- Identifies code clones, DRY violations, and copy-paste duplication
- Uses PMD CPD for exact matches and Semgrep for pattern-based clones
- Analyzes token and AST patterns across files to find similar or identical code
- Provides actionable refactoring suggestions and highlights duplicate locations
SKILL.md
.github/skills/code-clone-assistantView on GitHub ↗
---
name: code-clone-assistant
description: Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.
allowed-tools: Read, Grep, Bash, Edit, Write
---
# Code Clone Assistant
Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).
> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
## Tools
- **PMD CPD v7.17.0+**: Exact duplicate detection
- **Semgrep v1.140.0+**: Pattern-based detection
**Tested**: October 2025 - 30 violations detected across 3 sample files
**Coverage**: ~3x more violations than using either tool alone
---
## When to Use This Skill
Use this skill when:
- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code
---
## Why Two Tools?
PMD CPD and Semgrep detect different clone types:
| Aspect | PMD CPD | Semgrep |
| ------------ | -------------------------------- | -------------------------------- |
| **Detects** | Exact copy-paste duplicates | Similar patterns with variations |
| **Scope** | Across files ✅ | Within/across files (Pro only) |
| **Matching** | Token-based (ignores formatting) | Pattern-based (AST matching) |
| **Rules** | ❌ No custom rules | ✅ Custom rules |
**Result**: Using both finds ~3x more DRY violations.
### Clone Types
| Type | Description | PMD CPD | Semgrep |
| ------ | ------------------------------- | --------------- | ----------- |
| Type-1 | Exact copies | ✅ Default | ✅ |
| Type-2 | Renamed identifiers | ✅ `--ignore-*` | ✅ |
| Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌ | ❌ |
---
## Quick Start Workflow
```bash
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md
# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif
# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity
# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```
---
---
## Accepted Exceptions (Known Intentional Duplication)
Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, **always check for accepted exceptions before recommending refactoring**.
### When Duplication Is Acceptable
| Pattern | Why Acceptable | Example |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **Generation-per-directory experiments** | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each `gen{NNN}/` is independent |
| **SQL templates with placeholder substitution** | SQL has no import/include mechanism. Templates use `sed` placeholder replacement (`__PLACEHOLDER__`), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| **Protocol/schema boilerplate** | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts |
| **Test fixtures and golden files** | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |
### How to Report Accepted Exceptions
When clone detection finds duplication that matches an accepted exception pattern:
1. **Report it** — always show the user what was found (lines, tokens, files)
2. **Flag as accepted** — explicitly state it matches a known exception pattern
3. **Explain why** — cite the specific reason refactoring is not recommended
4. **Do NOT recommend refactoring** — this is the key difference from actionable findings
**Example output format**:
```
Code Clone Analysis Results
PMD CPD Findings:
Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
gen610_template.sql:33 ↔ gen710_template.sql:38
Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
Reason: Each generation is immutable. Shared CTEs would break
experiment provenance and reproducibility.
Clone 2: 36 lines (478 tokens) — metrics aggregation
gen610_template.sql:207 ↔ gen710_template.sql:244
Status: ACCEPTED EXCEPTION (SQL template without include mechanism)
Actionable Findings: 0
Accepted Exceptions: 2
```
### Project-Level Exception Configuration
Projects can declare accepted exception patterns in their `CLAUDE.md`:
```markdown
## Code Clone Exceptions
- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation
```
When this section exists in a project's `CLAUDE.md`, the code-clone-assistant should check it before classifying findings.
---
## Reference Documentation
For detailed information, see:
- [Detection Commands](./references/detection-commands.md) - PMD CPD and Semgrep command details
- [Complete Workflow](./references/complete-workflow.md) - Detection, analysis, and presentation phases
- [Refactoring Strategies](./references/refactoring-strategies.md) - Approaches for addressing violations
---
## Troubleshooting
| Issue | Cause | Solution |
| -------------------------- | ---------------------------- | ------------------------------------------------ |
| PMD CPD not found | Not installed or not in PATH | `brew install pmd` or download from PMD releases |
| Semgrep timeout | Large codebase scan | Use `--exclude` to limit scope |
| No duplicates detected | minimum-tokens too high | Lower `--minimum-tokens` value (try 15) |
| Too many false positives | minimum-tokens too low | Increase `--minimum-tokens` (try 30+) |
| Language not recognized | Wrong `-l` flag | Check PMD CPD supported languages list |
| SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version |
| Memory error on large repo | Java heap too small | Set `PMD_JAVA_OPTS=-Xmx4g` |
| Missing clone rules file | Custom rules not created | Create `clone-rules.yaml` or use default config |
## Post-Execution Reflection
After this skill completes, check before closing:
1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.
More from terrylica/cc-skills
- academic-pdf-to-gfmConvert academic PDF papers to GitHub-renderable GFM markdown with math equations. TRIGGERS - PDF, GitHub markdown, math
- adaptive-wfo-epochAdaptive epoch selection for Walk-Forward Optimization. TRIGGERS - WFO epoch, epoch selection, WFE optimization, overfitting epochs.
- adr-code-traceabilityAdd ADR references to code for traceability. TRIGGERS - ADR traceability, code reference, document decision in code.
- adr-graph-easy-architectASCII architecture diagrams for ADRs via graph-easy. TRIGGERS - ADR diagram, architecture diagram, ASCII diagram.
- agent-reach>
- agentic-process-monitorMonitor background processes from Claude Code using sentinel files, heartbeat liveness, and subagent polling. Best practices and.
- alpha-forge-preshipAlpha Forge quality gates for PR review - RNG determinism, URL validation, parameter validation, manifest sync.
- article-extractorExtract MQL5 articles and documentation. TRIGGERS - MQL5 articles, MetaTrader docs, mql5.com resources.
- ascii-diagram-validatorValidate ASCII diagram alignment in markdown. TRIGGERS - diagram alignment, ASCII art, box-drawing diagrams.
- asciinema-analyzerSemantic analysis of asciinema recordings. TRIGGERS - analyze cast, keyword extraction, find patterns in recordings.