databricks-unstructured-pdf-generation

Name: databricks-unstructured-pdf-generation
Author: databricks/databricks-agent-skills

$npx mdskill add databricks/databricks-agent-skills/databricks-unstructured-pdf-generation

Convert HTML to PDF and store reports in Unity Catalog.

Creates test documents, demos, and evaluation datasets.
Integrates with plutoprint, Unity Catalog, and Python scripts.
Executes parallel conversion and uploads based on file timestamps.
Delivers PDFs to storage and generates JSON test questions.

SKILL.md

.github/skills/databricks-unstructured-pdf-generationView on GitHub ↗

---
name: databricks-unstructured-pdf-generation
description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets."
---

# PDF Generation from HTML

Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.

## Workflow

1. Write HTML files to `./raw_data/html/` (write multiple files in parallel for speed)
2. Convert HTML → PDF using `<SKILL_ROOT>/scripts/pdf_generator.py` (parallel conversion)
3. Upload PDFs to Unity Catalog volume using `databricks fs cp`
4. Generate `doc_questions.json` with test questions for each document

> **Path convention:** `<SKILL_ROOT>` below = the directory containing this SKILL.md. Resolve to the absolute install path (e.g. `~/.claude/skills/databricks-unstructured-pdf-generation`). `./raw_data/...` paths are relative to your own project cwd.

## Dependencies

```bash
uv pip install plutoprint
```

## Step 1: Write HTML Files

```bash
mkdir -p ./raw_data/html
```

Write HTML documents to `./raw_data/html/filename.html`. Use subdirectories to organize (structure is preserved).

## Step 2: Convert to PDF

```bash
# Convert entire folder (parallel, 4 workers)
python <SKILL_ROOT>/scripts/pdf_generator.py convert --input ./raw_data/html --output ./raw_data/pdf
```

Skips files where PDF exists and is newer than HTML. Use `--force` to reconvert all.

## Step 3: Upload to Volume

`databricks fs` requires the `dbfs:` scheme prefix even for UC Volume paths. `-r` copies the *contents* of the source directory into the target (the source directory name is not preserved), so files land directly under `raw_data/`.

```bash
databricks fs cp -r --overwrite ./raw_data/pdf dbfs:/Volumes/my_catalog/my_schema/raw_data
```

## Step 4: Generate Test Questions

Create `./raw_data/pdf/pdf_eval_questions.json` with questions for Knowledge Assistant evaluation or MAS:

```json
{
  "api_errors_guide.pdf": {
    "question": "What is the solution for error ERR-4521?",
    "expected_fact": "Call /api/v2/auth/refresh with refresh_token before the 3600s TTL expires"
  },
  "installation_manual.pdf": {
    "question": "What port does the service use by default?",
    "expected_fact": "Port 8443 for HTTPS, configurable via CONFIG_PORT environment variable"
  }
}
```

This JSON can be used to build KA test cases and validate retrieval accuracy.

## Document Content Guidelines

When generating documents for Knowledge Assistant testing or demos:

- **Multi-page documents**: Each PDF should be several pages with substantial content
- **Specific error codes and solutions**: Include product-specific error codes, causes, and resolution steps
- **Technical details**: API endpoints, configuration parameters, version numbers, specific commands
- **Simple CSS**: Keep styling minimal for fast HTML creation and reliable PDF conversion
- **Queryable facts**: Include details a KA must read the document to answer (not general knowledge)

**Good document types:**
- Product user manuals with troubleshooting sections
- API error reference guides (error codes, causes, solutions)
- Installation/configuration guides with specific steps
- Technical specifications with version-specific details

**Example content:** Instead of generic "Connection failed" errors, write:
- "Error ERR-4521: OAuth token expired. Cause: Token TTL exceeded 3600s default. Solution: Call `/api/v2/auth/refresh` with your refresh_token before expiration. See Section 4.2 for token lifecycle management."

## CLI Reference

```
python <SKILL_ROOT>/scripts/pdf_generator.py convert [OPTIONS]

  --input, -i     Input HTML file or folder (required)
  --output, -o    Output folder for PDFs (required)
  --force, -f     Force reconvert (ignore timestamps)
  --workers, -w   Parallel workers (default: 4)
```

## Folder Structure

Subfolder structure is preserved:

```
./raw_data/html/                    ./raw_data/pdf/
├── report.html             →       ├── report.pdf
├── quarterly/                      ├── quarterly/
│   └── q1.html             →       │   └── q1.pdf
└── legal/                          └── legal/
    └── terms.html          →           └── terms.pdf
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| "plutoprint not installed" | `uv pip install plutoprint` |
| PDF looks wrong | Check HTML/CSS syntax |
| "Volume does not exist" | `databricks volumes create CATALOG SCHEMA VOLUME_NAME MANAGED` (four separate positional args, not `catalog.schema.volume`) |

More from databricks/databricks-agent-skills