---
name: academic-pdf-redaction
description: Redact text from PDF documents for blind review anonymization
---
# PDF Redaction for Blind Review
Redact identifying information from academic papers for blind review.
## CRITICAL RULES
1. **PRESERVE References section** - Self-citations MUST remain intact
2. **ONLY redact specific text matches** - Never redact entire pages/regions
3. **VERIFY output** - Check that 80%+ of the original text remains (`verify_redaction` below raises an error under 70%)
## Common Pitfalls to AVOID
```python
# ❌ WRONG - This removes ALL text from the page:
for block in page.get_text("blocks"):
    page.add_redact_annot(fitz.Rect(block[:4]))

# ❌ WRONG - Drawing rectangles over text (the text underneath stays extractable):
page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0, 0, 0))

# ✅ CORRECT - Only redact specific search matches:
for rect in page.search_for("John Smith"):
    page.add_redact_annot(rect)
page.apply_redactions()  # annotations only mark text; this call removes it
```
## Patterns to Redact (Before References Only)
**IMPORTANT: Use FULL names/phrases, not partial matches!**
- ✅ "John Smith" (full name)
- ❌ "Smith" (partial - would also match in-text citations such as "Smith et al.", which must stay readable)
1. **Author names** - FULL names only (e.g., "John Smith", not just "Smith")
2. **Affiliations** - Universities, companies (e.g., "Duke University")
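3. **Email addresses** - Pattern: `*@*.edu`, `*@*.com` (resolve these to the literal addresses first; `page.search_for` matches plain strings and does not expand wildcards)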
4. **Venue names** - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop")
5. **arXiv identifiers** - Pattern: `arXiv:XXXX.XXXXX`
6. **DOIs** - Pattern: `10.XXXX/...`
7. **Acknowledgement names** - Names in "Acknowledgements" section
8. **Equal contribution footnotes** - e.g., "Equal contribution", "* Equal contribution"
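
Because `page.search_for` only finds literal strings, the wildcard-style patterns above (emails, arXiv identifiers, DOIs) have to be turned into concrete strings before they can be redacted. A minimal sketch of one way to do that with regular expressions follows. The three regexes and the helper name `literal_matches` are assumptions for illustration, not part of the skill, and they will miss unusual forms such as grouped addresses like `{a,b}@uni.edu`.

```python
import re

# Illustrative regexes (assumptions); tune them for the paper at hand.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ARXIV_RE = re.compile(r"arXiv:\d{4}\.\d{4,5}(?:v\d+)?")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def literal_matches(text: str) -> list[str]:
    """Return the concrete strings in `text` that match the identifier patterns."""
    found = []
    for regex in (EMAIL_RE, ARXIV_RE, DOI_RE):
        for match in regex.findall(text):
            match = match.rstrip(".,;)")  # drop trailing sentence punctuation
            if match and match not in found:
                found.append(match)
    return found
```

The returned strings can be appended to the `patterns` list for each page before References, for example `patterns + literal_matches(page.get_text())`.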
## PyMuPDF (fitz) - Recommended Approach
```python
import os
import re

import fitz

# A line that is only a (possibly numbered) "References" heading.
# A plain substring test also fires on words like "preferences".
REFERENCES_HEADING = re.compile(
    r"^\s*(\d+\.?\s+)?references\s*$", re.IGNORECASE | re.MULTILINE
)


def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]):
    """Redact specific patterns from PDF using PyMuPDF."""
    doc = fitz.open(input_path)

    # Find References page - stop redacting there
    references_page = None
    for i, page in enumerate(doc):
        if REFERENCES_HEADING.search(page.get_text()):
            references_page = i
            break

    for page_num, page in enumerate(doc):
        if references_page is not None and page_num >= references_page:
            continue  # Skip References section
        for pattern in patterns:
            # ONLY redact exact search matches
            for rect in page.search_for(pattern):
                page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()

    # dirname is "" for a bare filename, and os.makedirs("") raises
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    doc.save(output_path)
    doc.close()

    # MUST verify after saving
    verify_redaction(input_path, output_path)
```
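The function above skips the whole page on which References begins, so an Acknowledgements paragraph that shares that page is left untouched. One possible way to handle that page is to restrict the search to the area above the heading via the `clip` argument of `page.search_for`. The sketch below is an assumption-laden illustration: the helper names and the heading regex are hypothetical, it assumes a single-column layout, and it assumes the heading is the first line of its own text block.

```python
import re

# Assumed heading shapes: "References", "7 References", "7. References", "Bibliography".
HEADING_RE = re.compile(r"^\s*(\d+\.?\s+)?(references|bibliography)\s*$", re.IGNORECASE)


def references_heading_top(page):
    """Return the y-coordinate of the top of the References heading, or None.

    `page` is expected to behave like a PyMuPDF page: get_text("blocks")
    yields tuples (x0, y0, x1, y1, text, block_no, block_type).
    """
    for x0, y0, x1, y1, text, *_ in page.get_text("blocks"):
        lines = text.strip().splitlines()
        if lines and HEADING_RE.match(lines[0]):
            return y0
    return None


def redact_above_heading(page, patterns):
    """Redact matches that sit above the References heading; return the count."""
    top = references_heading_top(page)
    if top is None:
        return 0  # no heading on this page: nothing to do here
    r = page.rect
    clip = (r.x0, r.y0, r.x1, top)  # rect-like: everything above the heading
    count = 0
    for pattern in patterns:
        for rect in page.search_for(pattern, clip=clip):
            page.add_redact_annot(rect, fill=(0, 0, 0))
            count += 1
    if count:
        page.apply_redactions()
    return count
```

Calling `redact_above_heading(doc[references_page], patterns)` before saving would cover the upper part of that page while leaving the reference entries alone. In two-column papers a purely vertical cut is not sufficient, so the result should be checked by eye.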
## REQUIRED: Verification Function
**Always run this after ANY redaction to catch errors early:**
```python
import fitz


def verify_redaction(original_path, output_path):
    """Verify redaction didn't corrupt the PDF."""
    orig = fitz.open(original_path)
    redc = fitz.open(output_path)
    try:
        orig_len = sum(len(p.get_text()) for p in orig)
        redc_len = sum(len(p.get_text()) for p in redc)
        if orig_len == 0:
            # No text layer (e.g. a scanned PDF): the ratio below is undefined
            raise ValueError("Original PDF has no extractable text")
        print(f"Original: {len(orig)} pages, {orig_len} chars")
        print(f"Redacted: {len(redc)} pages, {redc_len} chars")
        print(f"Retained: {redc_len/orig_len:.1%}")

        # DEFENSIVE CHECKS - fail fast if something went wrong
        if len(redc) != len(orig):
            raise ValueError(f"Page count changed: {len(orig)} -> {len(redc)}")
        if redc_len < 1000:
            raise ValueError(f"PDF corrupted: only {redc_len} chars remain!")
        if redc_len < orig_len * 0.7:
            raise ValueError(f"Too much removed: kept only {redc_len/orig_len:.0%}")
    finally:
        # Close both documents even when a check fails
        orig.close()
        redc.close()
    print("✓ Verification passed")
```