markdown-new

$npx mdskill add TerminalSkills/skills/markdown-new

Convert public URLs to clean Markdown for LLMs using markdown.new

  • Extract readable content from web pages for summarization or analysis
  • Uses markdown.new service to strip ads, navigation, and boilerplate
  • Chooses between auto, browser, or direct methods for content extraction
  • Returns clean Markdown via API response for immediate use or storage

SKILL.md

.github/skills/markdown-newView on GitHub ↗
---
name: markdown-new
description: >-
  Convert any public URL into clean, LLM-ready Markdown using the markdown.new
  service. Use for content extraction, RAG ingestion, article summarization,
  research, archiving, and token-efficient web reading.
license: Apache-2.0
compatibility: "No special requirements"
metadata:
  author: terminal-skills
  version: "1.0.0"
  category: automation
  tags: ["markdown", "web-scraping", "content-extraction", "url-to-markdown", "rag"]
---

# markdown-new

Convert public web pages into clean Markdown via [markdown.new](https://markdown.new) — a free hosted service that strips navigation, ads, and boilerplate, returning only the readable content.

## When to Use

- Extracting article text for summarization or analysis
- Building RAG pipelines that ingest web content
- Archiving pages in a readable format
- Reducing token usage compared to raw HTML or full browser snapshots
- Research workflows where you need clean text from multiple URLs

## API

### Prefix Mode (simplest)

Prepend `https://markdown.new/` to any URL:

```bash
# Basic conversion
curl -s 'https://markdown.new/https://example.com/article'

# With options
curl -s 'https://markdown.new/https://example.com?method=browser&retain_images=true'
```

### POST Mode (recommended for automation)

```bash
curl -s -X POST https://markdown.new/ \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "method": "auto",
    "retain_images": false
  }'
```

### Parameters

| Parameter | Values | Default | Description |
|-----------|--------|---------|-------------|
| `method` | `auto`, `ai`, `browser` | `auto` | Conversion pipeline |
| `retain_images` | `true`, `false` | `false` | Keep image links in output |

### Method Selection

- **`auto`** — fastest; lets the service pick the best pipeline. Use first.
- **`ai`** — forces Workers AI HTML-to-Markdown conversion. Good for well-structured HTML.
- **`browser`** — headless browser rendering. Use for JavaScript-heavy SPAs and pages where `auto` misses content.

**Strategy:** Always try `auto` first. Fall back to `browser` only when output is incomplete or empty.

### Response Headers

The service returns useful metadata in response headers:

- `x-markdown-tokens` — estimated token count of the output
- `x-rate-limit-remaining` — requests remaining in current window

## Usage Patterns

### Single Page Extraction

```python
"""fetch_article.py — Extract a single article as Markdown."""
import requests

def fetch_markdown(url: str, method: str = "auto") -> str:
    """Convert a URL to clean Markdown.

    Args:
        url: Public HTTP/HTTPS URL to convert.
        method: Conversion method — "auto", "ai", or "browser".

    Returns:
        Markdown string of the page content.
    """
    resp = requests.post(
        "https://markdown.new/",
        json={"url": url, "method": method, "retain_images": False},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# Extract an article
content = fetch_markdown("https://example.com/blog/post-title")
print(f"Extracted {len(content)} chars")
```

### Batch Extraction with Rate Limiting

```python
"""batch_extract.py — Extract multiple URLs with rate limiting."""
import time
import requests

def batch_extract(urls: list[str], delay: float = 0.5) -> dict[str, str]:
    """Extract Markdown from multiple URLs with rate limiting.

    Args:
        urls: List of public URLs to convert.
        delay: Seconds to wait between requests to respect rate limits.

    Returns:
        Dict mapping URL to extracted Markdown content.
    """
    results = {}
    for url in urls:
        try:
            resp = requests.post(
                "https://markdown.new/",
                json={"url": url, "method": "auto"},
                timeout=30,
            )
            if resp.status_code == 429:  # Rate limited
                print(f"Rate limited, waiting 60s...")
                time.sleep(60)
                resp = requests.post(
                    "https://markdown.new/",
                    json={"url": url, "method": "auto"},
                    timeout=30,
                )
            resp.raise_for_status()
            results[url] = resp.text
        except Exception as e:
            print(f"Failed {url}: {e}")
            results[url] = ""
        time.sleep(delay)  # Respect rate limits
    return results
```

### Shell One-Liner

```bash
# Quick article extraction — pipe to file or another tool
curl -s 'https://markdown.new/https://example.com/article' > article.md

# Extract and count tokens (rough estimate: words / 0.75)
curl -s 'https://markdown.new/https://example.com/article' | wc -w
```

### Node.js

```javascript
// fetch-markdown.js — URL to Markdown in Node.js
async function fetchMarkdown(url, method = 'auto') {
  const resp = await fetch('https://markdown.new/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, method, retain_images: false }),
  });

  if (resp.status === 429) {
    throw new Error('Rate limited — wait and retry');
  }

  if (!resp.ok) {
    throw new Error(`Conversion failed: ${resp.status}`);
  }

  return resp.text();
}
```

## Limits and Best Practices

- **Rate limit:** ~500 requests/day per IP. Monitor `x-rate-limit-remaining` header.
- **429 responses** mean you've hit the limit — back off and retry after a delay.
- **Public URLs only** — the service cannot access authenticated or private pages.
- **Respect robots.txt** and copyright when extracting content.
- **Verify critical extractions** — output is not guaranteed complete for every page.
- **Use `auto` first**, fall back to `browser` for JS-heavy pages.
- **Disable `retain_images`** when you only need text — reduces output size.

## Combining with Other Tools

- Pair with **whisper** for multimedia research (audio transcription + article extraction)
- Feed output into **langchain** or **langgraph** for RAG pipelines
- Use with **elasticsearch** to build a searchable content index
- Combine with **sox** / **yt-dlp** for multi-format content ingestion

More from TerminalSkills/skills