pdf-ocr

$ npx mdskill add TerminalSkills/skills/pdf-ocr

Extract text from scanned PDFs using OCR for digitization and readability

  • Solves the problem of extracting text from image-based or scanned PDF documents
  • Uses Tesseract OCR with Python libraries pdf2image and pytesseract
  • Analyzes document quality and applies preprocessing for low-resolution scans
  • Returns per-page text output with confidence scores, multi-language support, and multi-page support

SKILL.md

---
name: pdf-ocr
description: >-
  Extract text from scanned PDFs using optical character recognition. Use when a
  user asks to OCR a PDF, read a scanned document, extract text from an image
  PDF, digitize a scanned file, convert a scanned PDF to text, or read text from
  a photograph of a document. Supports multiple languages and handles low-quality
  scans.
license: Apache-2.0
compatibility: "Requires Python 3.9+, Tesseract OCR and Poppler installed, and the pytesseract, pdf2image, Pillow, and pdfplumber packages"
metadata:
  author: terminal-skills
  version: "1.0.0"
  category: documents
  tags: ["ocr", "pdf", "scanned", "text-extraction", "tesseract"]
  use-cases:
    - "Extract text from scanned paper documents saved as PDFs"
    - "Digitize old contracts or records that are image-only PDFs"
    - "Read text from PDFs in multiple languages including CJK scripts"
  agents: [claude-code, openai-codex, gemini-cli, cursor]
---

# PDF OCR

## Overview

Extract readable text from scanned or image-based PDF documents using optical character recognition (OCR). This skill converts PDF pages to images, runs OCR to detect text, and outputs clean, structured text page by page. It handles multi-page documents, multiple languages, and low-quality scans (by preprocessing the page images before OCR).

## Instructions

When a user asks to OCR a scanned PDF or extract text from an image-based PDF, follow these steps:

### Step 1: Check if OCR is actually needed

First, attempt normal text extraction. If the PDF already contains selectable text, OCR is unnecessary:

```python
import pdfplumber

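def check_text_content(pdf_path):
    """Return True if any of the first three pages has a usable text layer."""
    with pdfplumber.open(pdf_path) as pdf:
        # Sampling the first pages keeps the check fast; a mixed PDF with
        # scanned pages later on will still be reported as having text.
        for page in pdf.pages[:3]:
            text = page.extract_text()
            # extract_text() may return None or "" for image-only pages
            if text and len(text.strip()) > 50:
                return True  # Has extractable text, OCR not needed
    return False  # Image-only PDF, needs OCR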
```

### Step 2: Install and verify dependencies

Ensure the required tools are available:

```bash
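# Install the Tesseract OCR engine and Poppler (pdf2image calls Poppler's pdftoppm)
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr poppler-utils
# macOS:
brew install tesseract poppler

# Install Python packages (pdfplumber is used for the text-layer check in Step 1)
pip install pytesseract pdf2image Pillow pdfplumber

# For additional languages (Ubuntu/Debian):
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-jpn  # Japanese
# macOS: brew install tesseract-lang  # all language packs

# Verify the installation and list the installed languages
tesseract --version
tesseract --list-langs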
```
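
The verification half of this step can also be done from Python. The following is a minimal sketch, not part of the original skill: `missing_dependencies` is a hypothetical helper, and it assumes the executables are named `tesseract` and `pdftoppm` (the Poppler converter that pdf2image calls) and are on `PATH`.

```python
import importlib.util
import shutil


def missing_dependencies(binaries=("tesseract", "pdftoppm"),
                         modules=("pytesseract", "pdf2image", "PIL")):
    """Return the names of required executables and modules that were not found.

    The default names are assumptions: Pillow is imported as "PIL", and
    "pdftoppm" ships with Poppler.
    """
    # shutil.which returns None when an executable is not on PATH
    missing = [name for name in binaries if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be imported
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing
```

Calling `missing_dependencies()` before Step 3 and reporting any returned names lets the user fix the environment before a long OCR run starts.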

### Step 3: Convert PDF pages to images

```python
from pdf2image import convert_from_path

def pdf_to_images(pdf_path, dpi=300):
    images = convert_from_path(pdf_path, dpi=dpi)
    return images
```

Use 300 DPI for standard documents. Increase to 400-600 DPI for small text or low-quality scans.
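
At 300 DPI and above, holding every page image in memory at once can become expensive for long documents. One possible approach, sketched here under assumptions, is to convert a few pages at a time using the `first_page` and `last_page` parameters of `convert_from_path`. The helper names `page_batches` and `iter_page_images` and the default batch size of 10 are hypothetical.

```python
def page_batches(total_pages, batch_size=10):
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def iter_page_images(pdf_path, dpi=300, batch_size=10):
    """Yield page images batch by batch instead of loading the whole PDF at once."""
    # Imported lazily so page_batches can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages of the current batch are rendered and held in memory
        yield from convert_from_path(pdf_path, dpi=dpi,
                                     first_page=first, last_page=last)
```

Because `iter_page_images` is a generator, it can be passed wherever a list of page images is iterated, such as the OCR loop in Step 5.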

### Step 4: Preprocess images for better accuracy

Apply preprocessing to improve OCR quality:

```python
from PIL import Image, ImageFilter, ImageEnhance

def preprocess_image(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Increase contrast
    enhancer = ImageEnhance.Contrast(gray)
    enhanced = enhancer.enhance(2.0)
    # Sharpen
    sharpened = enhanced.filter(ImageFilter.SHARPEN)
    # Binarize (threshold)
    threshold = 150
    binary = sharpened.point(lambda x: 255 if x > threshold else 0)
    return binary
```
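
The fixed threshold of 150 above may not suit scans that are unusually dark or faded. As an alternative (a swapped-in technique, not what the skill itself prescribes), the threshold can be derived from the image histogram with Otsu's method. This is a minimal pure-Python sketch; `otsu_threshold` is a hypothetical helper that expects a 256-bin grayscale histogram such as the one returned by `Image.histogram()` for an `'L'` mode image.

```python
def otsu_threshold(histogram):
    """Pick the gray level that maximizes between-class variance (Otsu's method)."""
    total = sum(histogram)
    if total == 0:
        return 128  # Empty histogram: fall back to the midpoint (an arbitrary choice)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0      # Number of pixels at or below the candidate level
    sum_bg = 0         # Sum of gray values at or below the candidate level
    best_threshold, best_variance = 0, -1.0
    for level, count in enumerate(histogram):
        weight_bg += count
        sum_bg += level * count
        if weight_bg == 0:
            continue   # No dark pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break      # All pixels are in the dark class; nothing left to split
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold
```

Inside `preprocess_image`, this could replace the constant: `threshold = otsu_threshold(sharpened.histogram())`, followed by the same `point` call.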

### Step 5: Run OCR on each page

```python
import pytesseract

def ocr_pages(images, lang='eng'):
    results = []
    for i, image in enumerate(images):
        processed = preprocess_image(image)
        text = pytesseract.image_to_string(processed, lang=lang)
        results.append({
            "page": i + 1,
            "text": text.strip(),
            "confidence": get_confidence(processed, lang)
        })
    return results

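def get_confidence(image, lang='eng'):
    data = pytesseract.image_to_data(image, lang=lang, output_type=pytesseract.Output.DICT)
    # 'conf' values may be ints, floats, or numeric strings depending on the
    # pytesseract version; -1 marks non-text boxes, so keep only positive scores.
    confidences = [float(c) for c in data['conf'] if float(c) > 0]
    return sum(confidences) / len(confidences) if confidences else 0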
```
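
The per-page confidence values can be used to point the user at pages that deserve a manual check. The sketch below assumes the result dictionaries produced by `ocr_pages`; the helper name `pages_needing_review` and the cutoff of 90 are assumptions chosen for illustration, not values defined by the skill.

```python
def pages_needing_review(results, threshold=90.0):
    """Return the page numbers whose mean OCR confidence is below the threshold."""
    return [page["page"] for page in results if page["confidence"] < threshold]


# Illustrative input in the shape returned by ocr_pages
sample_results = [
    {"page": 1, "text": "first page", "confidence": 96.0},
    {"page": 2, "text": "second page", "confidence": 88.0},
]
flagged = pages_needing_review(sample_results)  # flagged == [2]
```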

### Step 6: Output the results

Combine and format the extracted text. Save as a text file or return directly:

```python
def save_results(results, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        for page in results:
            f.write(f"--- Page {page['page']} (confidence: {page['confidence']:.0f}%) ---\n")
            f.write(page['text'] + "\n\n")
    return output_path
```
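
The examples in this document name output files after the input PDF with a `_text.txt` suffix. A small sketch of that naming convention follows; `default_output_path` is a hypothetical helper, and creating the output directory on demand is an assumption about the desired behavior.

```python
from pathlib import Path


def default_output_path(pdf_path, output_dir=None):
    """Map 'scan_contract.pdf' to 'scan_contract_text.txt', optionally in output_dir."""
    pdf_path = Path(pdf_path)
    file_name = f"{pdf_path.stem}_text.txt"
    if output_dir is None:
        # Keep the text file next to the source PDF
        return pdf_path.with_name(file_name)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # Create the folder if needed
    return output_dir / file_name
```

The returned path can be passed directly as `output_path` to `save_results`.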

## Examples

### Example 1: OCR a scanned contract

**User request:** "Extract text from this scanned contract scan_contract.pdf"

**Actions taken:**
1. Check for existing text layer - none found, OCR needed
2. Convert 5 pages to images at 300 DPI
3. Preprocess and run OCR in English

**Output:**
```
OCR completed for scan_contract.pdf (5 pages)

Page-by-page confidence:
  Page 1: 96% confidence
  Page 2: 94% confidence
  Page 3: 91% confidence
  Page 4: 95% confidence
  Page 5: 88% confidence (lower quality scan detected)

Output saved to: scan_contract_text.txt (4,230 words extracted)

Note: Page 5 had lower image quality. Review that page for accuracy.
```

### Example 2: OCR a multi-language document

**User request:** "Read this scanned document, it's in German: rechnung.pdf"

**Actions taken:**
1. Verify tesseract-ocr-deu language pack is installed
2. Convert pages to images at 300 DPI
3. Run OCR with `lang='deu'`

**Output:**
```
OCR completed for rechnung.pdf (2 pages) using German language model

  Page 1: 93% confidence
  Page 2: 95% confidence

Extracted 812 words. Output saved to: rechnung_text.txt
```

### Example 3: Batch OCR multiple scanned files

**User request:** "OCR all the scanned PDFs in the ./receipts/ folder"

**Actions taken:**
1. Find all PDF files in ./receipts/ (found 12 files)
2. Check each for existing text layer
3. Run OCR on the 10 files that need it

**Output:**
```
Batch OCR complete: 12 files processed

  Already had text: 2 files (skipped)
  OCR completed:    10 files
  Average confidence: 92%

Output files saved to ./receipts/ocr_output/
  receipt_001_text.txt (97% confidence)
  receipt_002_text.txt (94% confidence)
  ...
  receipt_010_text.txt (85% confidence - review recommended)
```
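
A batch run like the one above could be driven by a small loop. This is a sketch under assumptions: `batch_ocr` is a hypothetical helper, and the two callables it takes stand in for `check_text_content` from Step 1 and a wrapper around Steps 3 to 6 that returns a mean confidence for one file.

```python
from pathlib import Path


def batch_ocr(folder, has_text, run_ocr):
    """OCR every PDF in `folder` that lacks a text layer and summarize the run.

    has_text(path) -> bool and run_ocr(path) -> float (mean confidence) are
    supplied by the caller, so this function itself has no OCR dependencies.
    """
    summary = {"skipped": [], "confidence": {}}
    # Note: the "*.pdf" pattern is case-sensitive on most Linux file systems
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        if has_text(pdf_path):
            summary["skipped"].append(pdf_path.name)  # Already has a text layer
        else:
            summary["confidence"][pdf_path.name] = run_ocr(pdf_path)
    scores = list(summary["confidence"].values())
    summary["average_confidence"] = sum(scores) / len(scores) if scores else None
    return summary
```

Passing the checks in as callables keeps the loop easy to test and lets the caller decide how each file is processed and where its output is written.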

## Guidelines

- Always check for existing text content before running OCR. Many PDFs already have a text layer.
- Use 300 DPI as the default resolution. Increase for small fonts or poor quality scans.
- Report confidence scores per page so users know which pages may need manual review.
- For multi-language documents, specify the correct Tesseract language code. Multiple languages can be combined: `lang='eng+deu'`.
- Preprocess images before OCR: grayscale conversion, contrast enhancement, and binarization significantly improve accuracy.
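- For rotated scans, detect the page orientation first (for example with `pytesseract.image_to_osd`) and rotate the image upright before OCR. Slightly skewed scans additionally benefit from deskewing.
- Process large PDFs page by page (or in small batches) to limit memory use. `convert_from_path` accepts `first_page` and `last_page` for this purpose.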
- Common Tesseract language codes: eng (English), deu (German), fra (French), spa (Spanish), jpn (Japanese), chi_sim (Chinese Simplified), kor (Korean).
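
The combined language string mentioned above (`lang='eng+deu'`) can be built from a list of codes. The helper below is a hypothetical sketch; the optional `available` argument is an assumption about how a caller might validate the codes, for instance against the list returned by `pytesseract.get_languages()`.

```python
def build_lang(codes, available=None):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    # dict.fromkeys removes duplicates while preserving the original order
    unique = list(dict.fromkeys(code.strip() for code in codes if code.strip()))
    if not unique:
        raise ValueError("at least one language code is required")
    if available is not None:
        missing = [code for code in unique if code not in available]
        if missing:
            raise ValueError(f"language data not installed: {', '.join(missing)}")
    return "+".join(unique)
```

The result can be passed as the `lang` argument of `ocr_pages`, for example `ocr_pages(images, lang=build_lang(["eng", "deu"]))`.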
