scrapeninja
$ npx mdskill add vm0-ai/vm0-skills/scrapeninja
Execute high-performance web scraping via ScrapeNinja API
- Extracts text and data from any URL instantly
- Integrates with ScrapeNinja API for backend execution
- Selects rendering mode based on site complexity
- Returns structured JSON with status and body length
SKILL.md
.github/skills/scrapeninja
---
name: scrapeninja
description: ScrapeNinja API for web scraping. Use when user mentions "ScrapeNinja",
"scrape", "web scraping", or data extraction.
---
## Troubleshooting
If requests fail, run `zero doctor check-connector --env-name SCRAPENINJA_TOKEN` or `zero doctor check-connector --url https://scrapeninja.p.rapidapi.com/scrape --method POST`
## How to Use
### 1. Basic Scrape (Non-JS, Fast)
Fetch the raw HTML with a Chrome TLS fingerprint and no JavaScript execution:
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com"
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, url: .info.finalUrl, bodyLength: (.body | length)}'
```
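Every example in this document repeats the same `curl` invocation and changes only the endpoint. A minimal sketch of a wrapper follows; `scrapeninja_post` is a hypothetical helper name, not part of ScrapeNinja, and it assumes `SCRAPENINJA_TOKEN` is already exported:

```bash
# Hypothetical helper (not part of ScrapeNinja): POST a request file to an endpoint.
# Usage: scrapeninja_post <endpoint> [request-file]
scrapeninja_post() {
  local endpoint="$1"                                   # e.g. scrape or scrape-js
  local request_file="${2:-/tmp/scrapeninja_request.json}"
  curl -s -X POST "https://scrapeninja.p.rapidapi.com/${endpoint}" \
    --header "Content-Type: application/json" \
    --header "X-RapidAPI-Key: ${SCRAPENINJA_TOKEN}" \
    -d "@${request_file}"
}
```

With this defined, the basic scrape could be written as `scrapeninja_post scrape | jq '.info.statusCode'`.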
**With custom headers and retries:**
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "headers": ["Accept-Language: en-US"],
  "retryNum": 3,
  "timeout": 15
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```
### 2. Scrape with JavaScript Rendering
For JavaScript-heavy sites (React, Vue, etc.):
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "waitForSelector": "h1",
  "timeout": 20
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, bodyLength: (.body | length)}'
```
**With screenshot:**
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "screenshot": true
}
```
Then run:
```bash
# Get screenshot URL from response
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq -r '.info.screenshot'
```
### 3. Geo-Based Proxy Selection
Use proxies from specific regions:
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "geo": "eu"
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq .info
```
Available geos: `us`, `eu`, `br` (Brazil), `fr` (France), `de` (Germany), `4g-eu`
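When the geo comes from user input or a script argument, it may help to validate it before writing the request file. The guard below is a sketch: `is_supported_geo` is a hypothetical name, and it accepts only the values listed above, which the service may extend over time.

```bash
# Hypothetical guard: succeed only for the geo values listed in this document.
is_supported_geo() {
  case "$1" in
    us|eu|br|fr|de|4g-eu) return 0 ;;
    *) echo "unsupported geo: $1" >&2; return 1 ;;
  esac
}
```

A script could call `is_supported_geo "$GEO" || exit 1` before building the request.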
### 4. Smart Retries
Retry on specific HTTP status codes or text patterns:
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "retryNum": 3,
  "statusNotExpected": [403, 429, 503],
  "textNotExpected": ["captcha", "Access Denied"]
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```
### 5. Extract Data with Cheerio
Extract structured JSON using Cheerio extractor functions:
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://news.ycombinator.com",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return $(\".titleline > a\").slice(0,5).map((i,el) => ({title: $(el).text(), url: $(el).attr(\"href\")})).get(); }"
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.extractor'
```
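Hand-escaping the extractor into a single JSON string is error-prone. One alternative, sketched below, keeps the function in a readable file and lets `jq` do the escaping. The path `/tmp/scrapeninja_extractor.js` is an arbitrary choice, and the sketch assumes the API accepts an extractor that spans several lines (the newlines are encoded as `\n` in the JSON string):

```bash
# Keep the extractor as readable JavaScript; the quoted 'EOF' stops the shell
# from expanding $ or backslashes inside it.
cat > /tmp/scrapeninja_extractor.js <<'EOF'
function(input, cheerio) {
  let $ = cheerio.load(input);
  return $(".titleline > a").slice(0, 5).map((i, el) => ({
    title: $(el).text(),
    url: $(el).attr("href")
  })).get();
}
EOF

# jq performs the JSON string escaping (quotes, newlines) when building the request.
jq -n \
  --arg url "https://news.ycombinator.com" \
  --arg extractor "$(cat /tmp/scrapeninja_extractor.js)" \
  '{url: $url, extractor: $extractor}' > /tmp/scrapeninja_request.json
```

The resulting `/tmp/scrapeninja_request.json` can be sent with the same `curl` command as above.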
### 6. Intercept AJAX Requests
Capture XHR/fetch responses:
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "catchAjaxHeadersUrlMask": "api/data"
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.info.catchedAjax'
```
### 7. Block Resources for Speed
Speed up JS rendering by blocking images and media (CSS and fonts):
Write to `/tmp/scrapeninja_request.json`:
```json
{
  "url": "https://example.com",
  "blockImages": true,
  "blockMedia": true
}
```
Then run:
```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```
## API Endpoints
| Endpoint | Description |
|----------|-------------|
| `/scrape` | Fast non-JS scraping with Chrome TLS fingerprint |
| `/scrape-js` | Full Chrome browser with JS rendering |
| `/v2/scrape-js` | Enhanced JS rendering for protected sites (APIRoad only) |
## Request Parameters
### Common Parameters (all endpoints)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | URL to scrape |
| `headers` | string[] | - | Custom HTTP headers |
| `retryNum` | int | 1 | Number of retry attempts |
| `geo` | string | `us` | Proxy geo: us, eu, br, fr, de, 4g-eu |
| `proxy` | string | - | Custom proxy URL (overrides geo) |
| `timeout` | int | 10 (`/scrape`) / 16 (JS endpoints) | Timeout per attempt in seconds |
| `textNotExpected` | string[] | - | Text patterns that trigger retry |
| `statusNotExpected` | int[] | [403, 502] | HTTP status codes that trigger retry |
| `extractor` | string | - | Cheerio extractor function |
### JS Rendering Parameters (`/scrape-js`, `/v2/scrape-js`)
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `waitForSelector` | string | - | CSS selector to wait for |
| `postWaitTime` | int | - | Extra wait time after load (1-12s) |
| `screenshot` | bool | true | Take page screenshot |
| `blockImages` | bool | false | Block image loading |
| `blockMedia` | bool | false | Block CSS/fonts loading |
| `catchAjaxHeadersUrlMask` | string | - | URL pattern to intercept AJAX |
| `viewport` | object | 1920x1080 | Custom viewport size |
## Response Format
```json
{
  "info": {
    "statusCode": 200,
    "finalUrl": "https://example.com",
    "headers": ["content-type: text/html"],
    "screenshot": "base64-encoded-png",
    "catchedAjax": {
      "url": "https://example.com/api/data",
      "method": "GET",
      "body": "...",
      "status": 200
    }
  },
  "body": "<html>...</html>",
  "extractor": { "extracted": "data" }
}
```
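Because `info.statusCode` reports the target site's status, a script usually needs to inspect the response body rather than rely on `curl`'s exit code. A minimal sketch, assuming the response was saved to a file; `scrape_ok` is a hypothetical name, and treating "HTTP 200 with a non-empty `body`" as success is an assumption you may want to tighten:

```bash
# Hypothetical check: succeed when a saved response looks like a successful scrape.
# Usage: scrape_ok <response-file>
scrape_ok() {
  # jq -e exits non-zero when the expression evaluates to false or null.
  jq -e '.info.statusCode == 200 and ((.body // "") | length) > 0' "$1" > /dev/null
}
```

For example, save a response with `curl ... > /tmp/scrapeninja_response.json` and then run `scrape_ok /tmp/scrapeninja_response.json || echo "scrape failed"`.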
## Guidelines
1. **Start with `/scrape`**: Use the fast non-JS endpoint first and switch to `/scrape-js` only if the page needs JavaScript rendering
2. **Retries**: Set `retryNum` to 2-3 for unreliable sites
3. **Geo Selection**: Use `eu` for European sites, `us` for American sites
4. **Extractors**: Test extractors at https://scrapeninja.net/cheerio-sandbox/
5. **Blocked Sites**: For Cloudflare/Datadome protected sites, use `/v2/scrape-js` via APIRoad
6. **Screenshots**: Set `screenshot: false` to speed up JS rendering
7. **Rate Limits**: Check your plan limits on RapidAPI/APIRoad dashboard
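The first guideline can be sketched as a fallback loop. `scrape_with_fallback` is a hypothetical helper, and the switch condition (non-200 status or empty body) is an assumption rather than an API rule: a JavaScript-heavy page can still return HTTP 200 with an empty shell, so a real check might also look for an expected selector or text.

```bash
# Hypothetical helper: try /scrape first, then fall back to /scrape-js.
# Usage: scrape_with_fallback [request-file]
scrape_with_fallback() {
  local request_file="${1:-/tmp/scrapeninja_request.json}"
  local endpoint response
  for endpoint in scrape scrape-js; do
    response="$(curl -s -X POST "https://scrapeninja.p.rapidapi.com/${endpoint}" \
      --header "Content-Type: application/json" \
      --header "X-RapidAPI-Key: ${SCRAPENINJA_TOKEN}" \
      -d "@${request_file}")"
    # Stop at the first endpoint that returns HTTP 200 with a non-empty body.
    if printf '%s' "$response" |
      jq -e '.info.statusCode == 200 and ((.body // "") | length) > 0' > /dev/null 2>&1; then
      printf '%s\n' "$response"
      return 0
    fi
  done
  echo "both /scrape and /scrape-js failed" >&2
  return 1
}
```

The function prints the first successful response, so it can be piped into `jq` like the other examples, for instance `scrape_with_fallback | jq '.info.finalUrl'`.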
## Tools
- **Playground**: https://scrapeninja.net/scraper-sandbox
- **Cheerio Sandbox**: https://scrapeninja.net/cheerio-sandbox
- **cURL Converter**: https://scrapeninja.net/curl-to-scraper