scrapeninja

Name: scrapeninja
Author: vm0-ai/vm0-skills
$ npx mdskill add vm0-ai/vm0-skills/scrapeninja
Execute high-performance web scraping via ScrapeNinja API
Extracts text and data from any URL instantly
Integrates with ScrapeNinja API for backend execution
Selects rendering mode based on site complexity
Returns structured JSON with status and body length
SKILL.md
.github/skills/scrapeninja
---
name: scrapeninja
description: ScrapeNinja API for web scraping. Use when user mentions "ScrapeNinja",
  "scrape", "web scraping", or data extraction.
---

## Troubleshooting

If requests fail, run `zero doctor check-connector --env-name SCRAPENINJA_TOKEN` or `zero doctor check-connector --url https://scrapeninja.p.rapidapi.com/scrape --method POST`.
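
Before running the checks above, it can help to confirm that the token variable is set at all. This is a minimal sketch; `require_token` is a hypothetical helper name, not part of the `zero` CLI or the ScrapeNinja API:

```bash
# Hypothetical helper: fail early when SCRAPENINJA_TOKEN is missing or empty.
require_token() {
  if [ -z "${SCRAPENINJA_TOKEN:-}" ]; then
    echo "SCRAPENINJA_TOKEN is not set" >&2
    return 1
  fi
}

# Example:
# require_token && echo "token present"
```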

## How to Use

### 1. Basic Scrape (Non-JS, Fast)

Fast scraping with a Chrome TLS fingerprint and no JavaScript execution:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, url: .info.finalUrl, bodyLength: (.body | length)}'
```

**With custom headers and retries:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "headers": ["Accept-Language: en-US"],
  "retryNum": 3,
  "timeout": 15
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```
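
The request file can also be generated rather than written by hand, which keeps URLs containing quotes or other special characters correctly escaped. This is a sketch using `jq -n`; the function name `build_scrape_request` is hypothetical:

```bash
# Hypothetical helper: print a /scrape request body as compact JSON.
# Usage: build_scrape_request URL RETRY_NUM TIMEOUT_SECONDS
build_scrape_request() {
  jq -n -c \
    --arg url "$1" \
    --argjson retryNum "$2" \
    --argjson timeout "$3" \
    '{url: $url, retryNum: $retryNum, timeout: $timeout}'
}

# Example (writes the same file the curl commands read):
# build_scrape_request "https://example.com" 3 15 > /tmp/scrapeninja_request.json
```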

### 2. Scrape with JavaScript Rendering

For JavaScript-heavy sites (React, Vue, etc.):

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "waitForSelector": "h1",
  "timeout": 20
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, bodyLength: (.body | length)}'
```

**With screenshot:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "screenshot": true
}
```

Then run:

```bash
# Get screenshot URL from response
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq -r '.info.screenshot'
```

### 3. Geo-Based Proxy Selection

Use proxies from specific regions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "geo": "eu"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq .info
```

Available geos: `us`, `eu`, `br` (Brazil), `fr` (France), `de` (Germany), `4g-eu`
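
A typo in `geo` is easy to miss, so a script can check the value against the list above before sending a request. This is a sketch; `is_valid_geo` is a hypothetical helper, and the accepted values are only those named in this document:

```bash
# Hypothetical helper: succeed only for geo values listed in this document.
is_valid_geo() {
  case "$1" in
    us|eu|br|fr|de|4g-eu) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# is_valid_geo "eu" || echo "unknown geo" >&2
```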

### 4. Smart Retries

Retry on specific HTTP status codes or text patterns:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "retryNum": 3,
  "statusNotExpected": [403, 429, 503],
  "textNotExpected": ["captcha", "Access Denied"]
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```
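
To make the two retry conditions concrete, the following sketch mimics them locally. It only illustrates the idea: the actual matching happens on ScrapeNinja's side, and details such as case sensitivity are assumptions here. `should_retry` is a hypothetical name.

```bash
# Hypothetical local mirror of the request above:
#   statusNotExpected: [403, 429, 503]
#   textNotExpected:   ["captcha", "Access Denied"]
# Usage: should_retry STATUS BODY_TEXT
# Returns 0 (retry) or 1 (accept). Case-sensitive substring matching is assumed.
should_retry() {
  case "$1" in
    403|429|503) return 0 ;;
  esac
  case "$2" in
    *captcha*|*"Access Denied"*) return 0 ;;
  esac
  return 1
}
```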

### 5. Extract Data with Cheerio

Extract structured JSON using Cheerio extractor functions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://news.ycombinator.com",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return $(\".titleline > a\").slice(0,5).map((i,el) => ({title: $(el).text(), url: $(el).attr(\"href\")})).get(); }"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.extractor'
```
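
Escaping a JavaScript function inside a JSON string by hand is error-prone. One option is to keep the extractor in its own file and let `jq` do the escaping. This is a sketch; `build_extractor_request` and the `extractor.js` filename are hypothetical:

```bash
# Hypothetical helper: embed the contents of a JS file as the "extractor" string.
# Usage: build_extractor_request URL JS_FILE
build_extractor_request() {
  jq -n \
    --arg url "$1" \
    --arg extractor "$(cat "$2")" \
    '{url: $url, extractor: $extractor}'
}

# Example:
# build_extractor_request "https://news.ycombinator.com" extractor.js > /tmp/scrapeninja_request.json
```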

### 6. Intercept AJAX Requests

Capture XHR/fetch responses whose URL matches a mask:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "catchAjaxHeadersUrlMask": "api/data"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.info.catchedAjax'
```

### 7. Block Resources for Speed

Speed up JS rendering by blocking images (`blockImages`) and CSS/fonts (`blockMedia`):

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "blockImages": true,
  "blockMedia": true
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```

## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `/scrape` | Fast non-JS scraping with Chrome TLS fingerprint |
| `/scrape-js` | Full Chrome browser with JS rendering |
| `/v2/scrape-js` | Enhanced JS rendering for protected sites (APIRoad only) |

## Request Parameters

### Common Parameters (all endpoints)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | URL to scrape |
| `headers` | string[] | - | Custom HTTP headers |
| `retryNum` | int | 1 | Number of retry attempts |
| `geo` | string | `us` | Proxy geo: us, eu, br, fr, de, 4g-eu |
| `proxy` | string | - | Custom proxy URL (overrides geo) |
| `timeout` | int | 10 (`/scrape`) / 16 (`/scrape-js`) | Timeout per attempt in seconds |
| `textNotExpected` | string[] | - | Text patterns that trigger retry |
| `statusNotExpected` | int[] | [403, 502] | HTTP status codes that trigger retry |
| `extractor` | string | - | Cheerio extractor function |

### JS Rendering Parameters (`/scrape-js`, `/v2/scrape-js`)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `waitForSelector` | string | - | CSS selector to wait for |
| `postWaitTime` | int | - | Extra wait time after load (1-12s) |
| `screenshot` | bool | true | Take page screenshot |
| `blockImages` | bool | false | Block image loading |
| `blockMedia` | bool | false | Block CSS/fonts loading |
| `catchAjaxHeadersUrlMask` | string | - | URL pattern to intercept AJAX |
| `viewport` | object | 1920x1080 | Custom viewport size |
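
Since `postWaitTime` is documented with a 1-12 second range, a client-side check can catch out-of-range values before a request is sent. This is a sketch; `is_valid_post_wait` is a hypothetical name, and how the API itself reacts to out-of-range values is not covered here:

```bash
# Hypothetical helper: succeed only for whole numbers in the documented 1-12 range.
is_valid_post_wait() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 12 ]
}

# Example:
# is_valid_post_wait 5 || echo "postWaitTime must be between 1 and 12" >&2
```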

## Response Format

```json
{
  "info": {
    "statusCode": 200,
    "finalUrl": "https://example.com",
    "headers": ["content-type: text/html"],
    "screenshot": "base64-encoded-png",
    "catchedAjax": {
      "url": "https://example.com/api/data",
      "method": "GET",
      "body": "...",
      "status": 200
    }
  },
  "body": "<html>...</html>",
  "extractor": { "extracted": "data" }
}
```
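
A saved response in this shape can be inspected offline. The sketch below assumes the response was written to a file; `summarize_response` and the `/tmp/scrapeninja_response.json` path are hypothetical:

```bash
# Hypothetical helper: print "<statusCode> <body length>" for a saved response.
# Usage: summarize_response RESPONSE_FILE
summarize_response() {
  jq -r '"\(.info.statusCode) \(.body | length)"' "$1"
}

# Example (after redirecting a curl response into the file):
# summarize_response /tmp/scrapeninja_response.json
```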

## Guidelines

1. **Start with `/scrape`**: Use the fast non-JS endpoint first; switch to `/scrape-js` only if the page needs JavaScript rendering.
2. **Retries**: Set `retryNum` to 2-3 for unreliable sites.
3. **Geo Selection**: Use `eu` for European sites and `us` for American sites.
4. **Extractors**: Test extractors at https://scrapeninja.net/cheerio-sandbox/ before using them in requests.
5. **Blocked Sites**: For Cloudflare/Datadome protected sites, use `/v2/scrape-js` via APIRoad.
6. **Screenshots**: Set `screenshot: false` to speed up JS rendering (screenshots are enabled by default).
7. **Rate Limits**: Check your plan limits on the RapidAPI/APIRoad dashboard.
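
The first guideline can be sketched as a small wrapper: call `/scrape`, and fall back to `/scrape-js` only when the returned body lacks some text you expect on the rendered page. The helper names `post_json` and `scrape_with_fallback`, and the idea of checking for a marker string, are assumptions for illustration rather than part of the ScrapeNinja API:

```bash
API_BASE="https://scrapeninja.p.rapidapi.com"

# POST a request file to an endpoint path such as /scrape or /scrape-js.
# Usage: post_json ENDPOINT_PATH REQUEST_FILE
post_json() {
  curl -s -X POST "${API_BASE}$1" \
    --header "Content-Type: application/json" \
    --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" \
    -d @"$2"
}

# Hypothetical helper: try /scrape first; if the returned body does not
# contain MARKER (or the response is not valid JSON), retry on /scrape-js.
# Usage: scrape_with_fallback REQUEST_FILE MARKER
scrape_with_fallback() {
  response="$(post_json /scrape "$1")"
  if printf '%s' "$response" \
      | jq -e --arg m "$2" '(.body // "") | contains($m)' >/dev/null 2>&1; then
    printf '%s\n' "$response"
  else
    post_json /scrape-js "$1"
  fi
}

# Example:
# scrape_with_fallback /tmp/scrapeninja_request.json "<h1" | jq '.info.statusCode'
```

The fallback costs a second request, so it suits sites where you do not know in advance whether JavaScript rendering is required.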

## Tools

- **Playground**: https://scrapeninja.net/scraper-sandbox
- **Cheerio Sandbox**: https://scrapeninja.net/cheerio-sandbox
- **cURL Converter**: https://scrapeninja.net/curl-to-scraper