bright-data
`npx mdskill add vm0-ai/vm0-skills/bright-data`
Execute large-scale web scraping and proxy services for data collection.
- Enables gathering profiles, posts, and comments from Twitter, Reddit, YouTube, Instagram, TikTok, and LinkedIn.
- Depends on the Bright Data API to trigger asynchronous or synchronous data collection jobs.
- Selects scraping targets based on user-specified URLs or platform requirements.
- Delivers results via JSON snapshots or immediate responses depending on request size.
---
name: bright-data
description: Bright Data proxy and web scraping API. Use when user mentions "Bright Data", "proxy", "web scraping at scale", or data collection.
---
## Troubleshooting
If requests fail, run `zero doctor check-connector --env-name BRIGHTDATA_TOKEN` or `zero doctor check-connector --url https://api.brightdata.com/datasets/v3/trigger --method POST`
## Social Media Scraping
Bright Data supports scraping these social media platforms:
| Platform | Profiles | Posts | Comments | Reels/Videos |
|----------|----------|-------|----------|--------------|
| Twitter/X | ✅ | ✅ | - | - |
| Reddit | - | ✅ | ✅ | - |
| YouTube | ✅ | ✅ | ✅ | - |
| Instagram | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ✅ | - |
| LinkedIn | ✅ | ✅ | - | - |
## How to Use
### 1. Trigger Scraping (Asynchronous)
Trigger a data collection job and get a `snapshot_id` for later retrieval.
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/username"},
{"url": "https://twitter.com/username2"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak"
}
```
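In a script, the `snapshot_id` usually needs to land in a variable for the later progress and download calls. The following is a minimal sketch of that step, assuming the flat response shape shown above. The function names `extract_snapshot_id` and `trigger_job` are hypothetical helpers, not part of any Bright Data tooling, and `sed` is used only to avoid extra dependencies (with `jq` installed, `jq -r '.snapshot_id'` would be sturdier).

```bash
# Pull the snapshot_id value out of a trigger response read from stdin.
# Assumes the flat {"snapshot_id": "..."} shape shown in the response above.
extract_snapshot_id() {
  sed -n 's/.*"snapshot_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical wrapper: trigger a job and print only the snapshot ID.
# Usage: snapshot_id=$(trigger_job <dataset-id> /tmp/brightdata_request.json)
trigger_job() {
  local dataset_id="$1" request_file="$2"
  curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=${dataset_id}" \
    -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "@${request_file}" | extract_snapshot_id
}
```

An empty result from `trigger_job` suggests the response did not contain a `snapshot_id`, in which case printing the raw response is the quickest way to see what went wrong.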
### 2. Trigger Scraping (Synchronous)
Get results immediately in the response (for small requests).
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
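The asynchronous and synchronous calls above differ only in the endpoint path (`trigger` versus `scrape`). One way to avoid repeating the full `curl` command is a small wrapper; the sketch below assumes nothing beyond the two endpoints already shown, and `bd_post` is a hypothetical name chosen for illustration.

```bash
# Hypothetical helper: POST a request file to either dataset endpoint.
# Usage: bd_post <scrape|trigger> <dataset-id> <request-file>
bd_post() {
  local endpoint="$1" dataset_id="$2" request_file="$3"
  case "$endpoint" in
    scrape|trigger) ;;
    *) echo "endpoint must be 'scrape' or 'trigger'" >&2; return 2 ;;
  esac
  curl -s -X POST "https://api.brightdata.com/datasets/v3/${endpoint}?dataset_id=${dataset_id}" \
    -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "@${request_file}"
}
```

For example, `bd_post scrape <dataset-id> /tmp/brightdata_request.json` would issue the same request as the synchronous example above.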
### 3. Monitor Progress
Check the status of a scraping job (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak",
  "dataset_id": "gd_xxxxx",
  "status": "running"
}
```
Status values: `running`, `ready`, `failed`
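Since a job moves from `running` to either `ready` or `failed`, the progress check is typically wrapped in a polling loop. The sketch below is one possible shape for that loop, not an official client: `extract_status` and `wait_for_snapshot` are hypothetical names, the 10-second interval and 60-attempt cap are arbitrary defaults, and the `sed` parsing assumes the flat response shown above.

```bash
# Extract the "status" field from a progress response on stdin.
# sed keeps this dependency-free; it assumes the flat response shape shown above.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the snapshot is ready (returns 0), failed (returns 1),
# or the attempt budget runs out (returns 2).
# Usage: wait_for_snapshot <snapshot-id> [interval-seconds] [max-attempts]
wait_for_snapshot() {
  local snapshot_id="$1" interval="${2:-10}" max_attempts="${3:-60}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(curl -s "https://api.brightdata.com/datasets/v3/progress/${snapshot_id}" \
      -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | extract_status)
    case "$status" in
      ready)  return 0 ;;
      failed) echo "snapshot ${snapshot_id} failed" >&2; return 1 ;;
      *)      sleep "$interval" ;;  # still running, or an unrecognized status
    esac
    attempt=$((attempt + 1))
  done
  echo "gave up on ${snapshot_id} after ${max_attempts} polls" >&2
  return 2
}
```

Distinct return codes let a calling script tell a failed job apart from one that is simply taking longer than the polling budget.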
### 4. Download Results
Once status is `ready`, download the collected data (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
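For larger snapshots it is often more practical to save the download to a file than to print it. The following sketch assumes only the download URL shown above and the `json` and `csv` values listed for `format` under Common Parameters; `download_snapshot` and the default output path are hypothetical.

```bash
# Hypothetical helper: save a ready snapshot to a file and print the file path.
# Usage: download_snapshot <snapshot-id> [json|csv] [output-file]
download_snapshot() {
  local snapshot_id="$1" format="${2:-json}"
  local out_file="${3:-/tmp/brightdata_${snapshot_id}.${format}}"
  case "$format" in
    json|csv) ;;
    *) echo "unsupported format: $format" >&2; return 2 ;;
  esac
  # -f makes curl exit non-zero on HTTP error responses instead of saving the error body.
  curl -sf "https://api.brightdata.com/datasets/v3/snapshot/${snapshot_id}?format=${format}" \
    -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
    -o "$out_file" && echo "$out_file"
}
```

Printing the path on success makes the helper easy to chain, for example `file=$(download_snapshot <snapshot-id> csv)`.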
### 5. List Snapshots
Get all your snapshots:
```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {snapshot_id, dataset_id, status}'
```
### 6. Cancel Snapshot
Cancel a running job (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
## Platform-Specific Examples
### Twitter/X - Scrape Profile
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/elonmusk"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `x_id`, `profile_name`, `biography`, `is_verified`, `followers`, `following`, `profile_image_link`
### Twitter/X - Scrape Posts
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/username/status/123456789"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `post_id`, `text`, `replies`, `likes`, `retweets`, `views`, `hashtags`, `media`
### Reddit - Scrape Subreddit Posts
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Parameters:** `url`, `sort_by` (new/top/hot)
**Returns:** `post_id`, `title`, `description`, `num_comments`, `upvotes`, `date_posted`, `community`
### Reddit - Scrape Comments
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `comment_id`, `user_posted`, `comment_text`, `upvotes`, `replies`
### YouTube - Scrape Video Info
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `title`, `views`, `likes`, `num_comments`, `video_length`, `transcript`, `channel_name`
### YouTube - Search by Keyword
Write to `/tmp/brightdata_request.json`:
```json
[
{"keyword": "artificial intelligence", "num_of_posts": 50}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
### YouTube - Scrape Comments
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `comment_text`, `likes`, `replies`, `username`, `date`
### Instagram - Scrape Profile
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.instagram.com/username"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `followers`, `post_count`, `profile_name`, `is_verified`, `biography`
### Instagram - Scrape Posts
Write to `/tmp/brightdata_request.json`:
```json
[
  {
    "url": "https://www.instagram.com/username",
    "num_of_posts": 20,
    "start_date": "01-01-2024",
    "end_date": "12-31-2024"
  }
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
## Account Management
### Check Account Status
```bash
curl -s "https://api.brightdata.com/status" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
**Response:**
```json
{
  "status": "active",
  "customer": "hl_xxxxxxxx",
  "can_make_requests": true,
  "ip": "x.x.x.x"
}
```
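The `can_make_requests` field lends itself to a preflight check before starting a large job. The sketch below is one way to do that, assuming the response shape shown above; `can_make_requests` (the shell function) and `preflight_check` are hypothetical names.

```bash
# Succeeds only if the status response on stdin reports "can_make_requests": true.
can_make_requests() {
  grep -Eq '"can_make_requests"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical preflight: stop early if the account cannot make requests.
preflight_check() {
  if curl -s "https://api.brightdata.com/status" \
       -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | can_make_requests; then
    echo "account ok"
  else
    echo "account cannot make requests" >&2
    return 1
  fi
}
```

A script could then guard its scraping calls with `preflight_check || exit 1`.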
### Get Active Zones
```bash
curl -s "https://api.brightdata.com/zone/get_active_zones" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {name, type}'
```
### Get Bandwidth Usage
```bash
curl -s "https://api.brightdata.com/customer/bw" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
## Getting Dataset IDs
To use the scraping features, you need a `dataset_id`:
1. Go to the [Bright Data Control Panel](https://brightdata.com/cp/datasets)
2. Create a new Web Scraper dataset or select an existing one
3. Choose the platform (Twitter, Reddit, YouTube, etc.)
4. Copy the `dataset_id` from the dataset settings
Dataset IDs can also be found in the bandwidth usage API response under the `data` field keys (e.g., `v__ds_api_gd_xxxxx` where `gd_xxxxx` is your dataset ID).
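As a rough illustration of the key pattern just described, the filter below lists the dataset IDs found in a bandwidth usage response. It is a sketch under two assumptions: keys follow the `v__ds_api_gd_xxxxx` shape mentioned above, and the ID portion is alphanumeric. `list_dataset_ids` is a hypothetical name.

```bash
# List unique dataset IDs found in keys shaped like v__ds_api_gd_xxxxx (input on stdin).
# Assumption: the part after "gd_" contains only letters and digits.
list_dataset_ids() {
  grep -Eo 'v__ds_api_gd_[A-Za-z0-9]+' | sed 's/^v__ds_api_//' | sort -u
}

# Usage (requires network access and a valid token):
#   curl -s "https://api.brightdata.com/customer/bw" \
#     -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | list_dataset_ids
```

Because the filter matches on text rather than parsing JSON, it works regardless of how deeply the keys are nested, at the cost of also matching the pattern if it appears inside a value.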
## Common Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Earliest date to include (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | Latest date to include (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format (query parameter on snapshot download) | `json`, `csv` |
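The MM-DD-YYYY date format is easy to get wrong when dates come from tools that emit ISO dates. A minimal conversion sketch follows; `to_bd_date` is a hypothetical helper, and it checks only the shape of the input, not whether the date is a real calendar date.

```bash
# Convert an ISO date (YYYY-MM-DD) to the MM-DD-YYYY form used by start_date/end_date.
# Only the digit pattern is validated, not the calendar validity of the date.
to_bd_date() {
  local iso="$1" year month day
  case "$iso" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "expected YYYY-MM-DD, got: $iso" >&2; return 1 ;;
  esac
  year=${iso%%-*}                    # text before the first dash
  day=${iso##*-}                     # text after the last dash
  month=${iso#*-}; month=${month%-*} # text between the two dashes
  printf '%s-%s-%s\n' "$month" "$day" "$year"
}
```

For example, `to_bd_date 2024-12-31` prints `12-31-2024`, which matches the `end_date` example in the table.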
## Rate Limits
- Batch mode: up to 100 concurrent requests
- Maximum input size: 1 GB per batch
- Exceeding either limit returns an HTTP `429` error
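A common way to handle `429` responses is to retry with increasing delays. The sketch below shows that pattern around `curl`; it is not Bright Data's documented retry policy. `curl_with_retry` is a hypothetical name, and the five attempts and 2-second starting delay are arbitrary choices.

```bash
# Run a curl request, retrying with exponential backoff while the API answers 429.
# Prints the final HTTP status code; the response body is saved to <body-file>.
# Usage: curl_with_retry <body-file> <curl args...>
curl_with_retry() {
  local out_file="$1"; shift
  local max_attempts=5 delay=2 attempt=1 http_code
  while [ "$attempt" -le "$max_attempts" ]; do
    # -o saves the body; -w prints only the HTTP status code on stdout.
    http_code=$(curl -s -o "$out_file" -w '%{http_code}' "$@")
    if [ "$http_code" != "429" ]; then
      echo "$http_code"
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "rate limited (attempt ${attempt}/${max_attempts}); retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  echo "429"
  return 1
}
```

The function returns success for any non-429 status, so callers still need to inspect the printed code for other errors such as `401` or `500`.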
## Guidelines
1. **Create datasets first**: Use the Control Panel to create scraper datasets
2. **Use async for large jobs**: Use `/trigger` for discovery and batch operations
3. **Use sync for small jobs**: Use `/scrape` for single URL quick lookups
4. **Check status before download**: Poll `/progress` until status is `ready`
5. **Respect rate limits**: Don't exceed 100 concurrent requests
6. **Date format**: Use MM-DD-YYYY for date parameters
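Guidelines 2 and 3 can be expressed as a small decision in a script: one URL goes to `/scrape`, several go to `/trigger`. The sketch below illustrates this along with building the request body from a list of URLs. Both function names are hypothetical, and the JSON builder assumes the URLs contain no double quotes or backslashes, since it performs no escaping.

```bash
# Build a JSON array of {"url": ...} objects from the arguments.
# Assumption: URLs contain no double quotes or backslashes (no JSON escaping is done).
build_url_request() {
  local sep="" url
  printf '['
  for url in "$@"; do
    printf '%s{"url": "%s"}' "$sep" "$url"
    sep=", "
  done
  printf ']\n'
}

# Pick the endpoint per the guidelines: a single URL uses the synchronous /scrape,
# several URLs use the asynchronous /trigger.
pick_endpoint() {
  if [ "$#" -eq 0 ]; then
    echo "at least one URL is required" >&2
    return 1
  elif [ "$#" -eq 1 ]; then
    echo "scrape"
  else
    echo "trigger"
  fi
}
```

For example, `build_url_request "$@" > /tmp/brightdata_request.json` followed by `pick_endpoint "$@"` yields the request file and the endpoint name to use in the `curl` commands shown earlier.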