crawl4ai

$npx mdskill add TerminalSkills/skills/crawl4ai

Extract structured data from websites for AI applications using Crawl4AI

  • Solve the problem of unstructured web data for LLM training and RAG pipelines
  • Uses Python, AsyncWebCrawler, and CrawlerRunConfig for efficient crawling
  • Leverages JavaScript rendering, CSS selectors, and LLM-powered extraction
  • Delivers clean markdown and structured data through session-managed results
SKILL.md
.github/skills/crawl4aiView on GitHub ↗
---
name: crawl4ai
description: >-
  You are an expert in Crawl4AI, the open-source web crawler built for AI
  applications. You help developers extract clean, structured data from
  websites for LLM training, RAG pipelines, and content analysis — with
  automatic markdown conversion, JavaScript rendering, CSS-based extraction,
  LLM-powered structured extraction, and session management for multi-page
  crawling.
license: Apache-2.0
compatibility: ''
metadata:
  author: terminal-skills
  version: 1.0.0
  category: AI & Machine Learning
  tags:
    - web-scraping
    - crawling
    - llm
    - markdown
    - extraction
    - ai
    - python
---

# Crawl4AI — LLM-Friendly Web Crawler

You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.

## Core Capabilities

### Basic Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://docs.example.com/getting-started",
        config=CrawlerRunConfig(
            word_count_threshold=10,       # Skip blocks with <10 words
            cache_mode=CacheMode.ENABLED,
        ),
    )

    print(result.markdown)                 # Clean markdown (LLM-ready)
    print(result.cleaned_html)             # Cleaned HTML
    print(result.media["images"])          # Extracted images
    print(result.links["internal"])        # Internal links
    print(result.links["external"])        # External links
    print(result.metadata)                 # Title, description, keywords
```

### LLM-Powered Structured Extraction

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    instruction="Extract all products from this page",
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://shop.example.com/keyboards",
        config=CrawlerRunConfig(extraction_strategy=extraction),
    )

    products = [Product(**p) for p in json.loads(result.extracted_content)]
    for p in products:
        print(f"{p.name}: ${p.price} — {p.rating}★")
```

### CSS-Based Extraction (No LLM)

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",       # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

extraction = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://shop.example.com/all",
        config=CrawlerRunConfig(
            extraction_strategy=extraction,
            js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
            wait_for=".product-card:nth-child(20)",                     # Wait for 20 items
        ),
    )
    products = json.loads(result.extracted_content)
```

### Multi-Page Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    # Session-based: maintains cookies, auth state
    session_id = "docs-crawl"

    # Login first
    await crawler.arun(
        url="https://docs.example.com/login",
        config=CrawlerRunConfig(
            session_id=session_id,
            js_code="""
                document.querySelector('#email').value = 'user@example.com';
                document.querySelector('#password').value = 'pass';
                document.querySelector('form').submit();
            """,
            wait_for="#dashboard",
        ),
    )

    # Crawl authenticated pages
    urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
    for url in urls:
        result = await crawler.arun(
            url=url,
            config=CrawlerRunConfig(session_id=session_id),
        )
        # Save markdown for RAG indexing
        save_to_knowledge_base(url, result.markdown)
```

## Installation

```bash
pip install crawl4ai
crawl4ai-setup                             # Install Playwright browsers
```

## Best Practices

1. **Markdown output** — Use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
2. **CSS extraction** — Use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
3. **LLM extraction** — Use `LLMExtractionStrategy` for unstructured pages; Pydantic schema ensures valid output
4. **JavaScript rendering** — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
5. **Sessions** — Use `session_id` for multi-page crawls; maintains cookies, auth state across requests
6. **Caching** — Enable `CacheMode.ENABLED` for development; avoid re-crawling the same pages
7. **Wait conditions** — Use `wait_for` CSS selectors to ensure content is loaded before extraction
8. **Rate limiting** — Add delays between requests; respect robots.txt; be a good citizen
More from TerminalSkills/skills