web-scraper

Name: web-scraper
Author: guia-matthieu/clawfu-skills

$ npx mdskill add guia-matthieu/clawfu-skills/web-scraper

Extracts structured data from websites for competitor research, lead generation, and content audits using BeautifulSoup and requests.

Helps with collecting pricing, product listings, contact information, and monitoring website changes.
Integrates with BeautifulSoup, requests, pandas, click, and lxml for web scraping and data processing.
Uses analysis frameworks to structure data and identify opportunities based on user-defined strategic priorities.
Presents results as usable structured data, such as extracted elements or links, for further agent processing.

SKILL.md

.github/skills/web-scraper

---
name: web-scraper
description: "Extract structured data from websites. Use when: collecting competitor pricing; scraping product listings; extracting contact information; gathering research data; monitoring website changes"
license: MIT
metadata:
  author: ClawFu
  version: 1.0.0
  mcp-server: "@clawfu/mcp-skills"
---

# Web Scraper

> Extract structured data from websites using BeautifulSoup and requests - turn any webpage into usable data.

## When to Use This Skill

- **Competitor research** - Scrape pricing, features, positioning
- **Lead generation** - Extract contact info from directories
- **Content audit** - Pull headings, links, metadata
- **Price monitoring** - Track competitor pricing changes
- **Data collection** - Gather research data from multiple sources


## What Claude Does vs What You Decide

| Claude Does | You Decide |
|-------------|------------|
| Structures analysis frameworks | Strategic priorities |
| Synthesizes market data | Competitive positioning |
| Identifies opportunities | Resource allocation |
| Creates strategic options | Final strategy selection |
| Suggests implementation approaches | Execution decisions |

## Dependencies

```bash
pip install beautifulsoup4 requests pandas click lxml
```

## Commands

### Scrape Elements
```bash
python scripts/main.py scrape https://example.com --selector "h1,h2,p"
python scripts/main.py scrape https://example.com --selector ".product-price"
```

### Extract Links
```bash
python scripts/main.py links https://example.com
python scripts/main.py links https://example.com --internal-only
```
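
The internals of `scripts/main.py` are not shown here, so the following is only a minimal sketch of how link extraction with an internal-only filter could work using BeautifulSoup. The function name `extract_links` and the sample HTML are hypothetical, and "internal" is assumed to mean "same host as the base URL".

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, base_url: str, internal_only: bool = False) -> list[str]:
    """Return absolute, de-duplicated http(s) links found in `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    base_host = urlparse(base_url).netloc
    links: list[str] = []
    for anchor in soup.select("a[href]"):
        url = urljoin(base_url, anchor["href"])  # resolve relative hrefs
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if internal_only and parts.netloc != base_host:
            continue  # assumption: internal means same host as base_url
        if url not in links:
            links.append(url)
    return links


sample = """
<a href="/pricing">Pricing</a>
<a href="https://example.com/blog">Blog</a>
<a href="https://other.test/page">Partner</a>
<a href="mailto:hi@example.com">Mail</a>
"""
print(extract_links(sample, "https://example.com", internal_only=True))
# → ['https://example.com/pricing', 'https://example.com/blog']
```

Comparing hosts exactly treats subdomains such as `blog.example.com` as external; a real tool may choose a looser rule.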

### Extract Emails
```bash
python scripts/main.py emails https://example.com
python scripts/main.py emails https://example.com --depth 2
```
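
As a rough illustration of what email extraction involves, the sketch below pulls addresses out of page text with a regular expression. It is not the skill's actual implementation: `EMAIL_RE` and `extract_emails` are hypothetical names, the pattern is deliberately simple rather than RFC-complete, and the crawling implied by `--depth` is not covered.

```python
import re

# Simplified pattern (an assumption): local part, "@", then a dotted domain
# ending in an alphabetic TLD of at least two letters.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(text: str) -> list[str]:
    """Return unique, lower-cased email addresses found in `text`, sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


page = (
    '<a href="mailto:Sales@Example.com">sales@example.com</a> '
    "or write to support@mail.example.co.uk. Ignore user@localhost"
)
print(extract_emails(page))
# → ['sales@example.com', 'support@mail.example.co.uk']
```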

### Extract Structured Data
```bash
python scripts/main.py structured https://example.com/article --schema article
python scripts/main.py structured https://example.com/product --schema product
```

## Examples

### Example 1: Scrape Competitor Pricing
```bash
python scripts/main.py scrape https://competitor.com/pricing --selector ".price,.plan-name"

# Output:
# Extracted 6 elements
# 1. Starter - $29/mo
# 2. Pro - $99/mo
# 3. Enterprise - Contact us
```
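
One plausible reading of the output above is that the six matched elements are three plan names and three prices, shown as pairs. The sketch below pairs them in document order with BeautifulSoup; the sample markup and the `extract_plans` helper are assumptions for illustration, not the script's actual code.

```python
from bs4 import BeautifulSoup


def extract_plans(html: str) -> list[tuple[str, str]]:
    """Pair each .plan-name with the .price at the same position (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    names = [el.get_text(strip=True) for el in soup.select(".plan-name")]
    prices = [el.get_text(strip=True) for el in soup.select(".price")]
    # zip() silently drops unmatched items, so this assumes one price per plan.
    return list(zip(names, prices))


pricing_html = """
<div class="plan"><h3 class="plan-name">Starter</h3><span class="price">$29/mo</span></div>
<div class="plan"><h3 class="plan-name">Pro</h3><span class="price">$99/mo</span></div>
<div class="plan"><h3 class="plan-name">Enterprise</h3><span class="price">Contact us</span></div>
"""
for number, (name, price) in enumerate(extract_plans(pricing_html), start=1):
    print(f"{number}. {name} - {price}")
```

If a page lists a plan without a price, pairing by position goes out of step; selecting each plan container first and reading the name and price inside it is more robust.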

### Example 2: Extract Article Content
```bash
python scripts/main.py structured https://blog.example.com/post --schema article

# Output: article_data.json
# {
#   "title": "How to Scale Your Startup",
#   "author": "Jane Doe",
#   "date": "2024-01-15",
#   "content": "...",
#   "word_count": 1523
# }
```
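
To show how fields like those in the JSON above might be derived, here is a minimal sketch using BeautifulSoup. The markup conventions it relies on (an `<h1>` title, a `meta[name="author"]` tag, a `<time datetime>` element, paragraphs inside `<article>`) are assumptions; the real `article` schema may read different sources.

```python
from bs4 import BeautifulSoup


def extract_article(html: str) -> dict:
    """Build an article record from common HTML conventions (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    author = soup.select_one('meta[name="author"]')
    date = soup.select_one("time[datetime]")
    body = soup.select_one("article") or soup  # fall back to the whole page
    content = " ".join(p.get_text(" ", strip=True) for p in body.select("p"))
    return {
        "title": title.get_text(strip=True) if title else None,
        "author": author.get("content") if author else None,
        "date": date["datetime"] if date else None,
        "content": content,
        "word_count": len(content.split()),  # whitespace-separated words
    }


article_html = """
<html><head><meta name="author" content="Jane Doe"></head>
<body><article>
<h1>How to Scale Your Startup</h1>
<time datetime="2024-01-15">Jan 15, 2024</time>
<p>Start small.</p><p>Measure everything twice.</p>
</article></body></html>
"""
print(extract_article(article_html))
```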

## CSS Selector Reference

| Selector | Description | Example |
|----------|-------------|---------|
| `tag` | Element type | `h1`, `p`, `div` |
| `.class` | Class name | `.price`, `.title` |
| `#id` | Element ID | `#main-content` |
| `tag.class` | Tag with class | `div.product` |
| `tag[attr]` | Has attribute | `a[href]` |
| `parent > child` | Direct child | `ul > li` |
| `tag1, tag2` | Multiple | `h1, h2, h3` |
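
The selectors in the table can be tried directly with BeautifulSoup's `select()`. The snippet below is a small sketch against made-up HTML; the `texts` helper is only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="main-content">
  <h1 class="title">Plans</h1>
  <ul>
    <li><a href="/starter">Starter</a></li>
    <li><a>No link</a></li>
  </ul>
  <div class="product"><span class="price">$29</span></div>
  <p class="product">Not a div</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector: str) -> list[str]:
    """Return the stripped text of every element matching `selector`."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


print(texts("h1"))                    # tag            → ['Plans']
print(texts(".price"))                # .class         → ['$29']
print(len(soup.select("#main-content")))  # #id        → 1
print(texts("div.product"))           # tag.class      → ['$29']
print(texts("a[href]"))               # tag[attr]      → ['Starter']
print(texts("ul > li"))               # parent > child → ['Starter', 'No link']
print(texts("h1, .price"))            # multiple       → ['Plans', '$29']
```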

## Ethical Scraping Guidelines

1. **Check robots.txt** - Respect the site's crawling rules
2. **Rate limit** - Don't overload servers (1-2 requests per second at most)
3. **Identify yourself** - Send a descriptive User-Agent
4. **Cache requests** - Don't re-scrape unchanged pages
5. **Terms of Service** - Confirm that the site permits scraping
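
The first three guidelines can be sketched with the standard library alone. The example below parses a robots.txt body with `urllib.robotparser` and spaces out requests with a small limiter. `USER_AGENT`, `build_robots`, and `RateLimiter` are hypothetical names, and the robots.txt text is made up; in a live scraper the parser would load the site's real file via `set_url()` and `read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; a real one should name the tool and a contact.
USER_AGENT = "example-research-bot/1.0 (+mailto:you@example.com)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


robots = build_robots("User-agent: *\nDisallow: /private/\n")
limiter = RateLimiter(min_interval=1.0)  # about 1 request per second

for url in ["https://example.com/pricing", "https://example.com/private/report"]:
    if not robots.can_fetch(USER_AGENT, url):
        print("skipping (disallowed):", url)
        continue
    limiter.wait()
    # The HTTP call would go here, for example with requests:
    # requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print("would fetch:", url)
```

Caching and Terms of Service checks are left out of this sketch; the latter needs a human reading of the site's terms.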

## Skill Boundaries

### What This Skill Does Well
- Structuring strategic analysis
- Identifying market opportunities
- Creating strategic frameworks
- Synthesizing competitive data

### What This Skill Cannot Do
- Replace market research
- Guarantee strategic success
- Know proprietary competitor info
- Make executive decisions

## Related Skills

- [competitor-monitor](../competitor-monitor/) - Monitor competitor changes
- [pdf-extractor](../pdf-extractor/) - Extract from PDFs

## Skill Metadata


- **Mode**: centaur
```yaml
category: automation
subcategory: data-extraction
dependencies: [beautifulsoup4, requests, pandas]
difficulty: intermediate
time_saved: 5+ hours/week
```