bright-data

$npx mdskill add vm0-ai/vm0-skills/bright-data

Execute large-scale web scraping and proxy services for data collection.

  • Enables gathering profiles, posts, and comments from Twitter, Reddit, YouTube, Instagram, TikTok, and LinkedIn.
  • Depends on Bright Data API for triggering asynchronous or synchronous data collection jobs.
  • Selects scraping targets based on user-specified URLs or platform requirements.
  • Delivers results via JSON snapshots or immediate responses depending on request size.
SKILL.md
.github/skills/bright-dataView on GitHub ↗
---
name: bright-data
description: Bright Data proxy and web scraping API. Use when user mentions "Bright
  Data", "proxy", "web scraping at scale", or data collection.
---

## Troubleshooting

If requests fail, run `zero doctor check-connector --env-name BRIGHTDATA_TOKEN` or `zero doctor check-connector --url https://api.brightdata.com/datasets/v3/trigger --method POST`

## Social Media Scraping

Bright Data supports scraping these social media platforms:

| Platform | Profiles | Posts | Comments | Reels/Videos |
|----------|----------|-------|----------|--------------|
| Twitter/X | ✅ | ✅ | - | - |
| Reddit | - | ✅ | ✅ | - |
| YouTube | ✅ | ✅ | ✅ | - |
| Instagram | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ✅ | - |
| LinkedIn | ✅ | ✅ | - | - |

## How to Use

### 1. Trigger Scraping (Asynchronous)

Trigger a data collection job and get a `snapshot_id` for later retrieval.

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username"},
  {"url": "https://twitter.com/username2"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak"
}
```

### 2. Trigger Scraping (Synchronous)

Get results immediately in the response (for small requests).

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

### 3. Monitor Progress

Check the status of a scraping job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak",
  "dataset_id": "gd_xxxxx",
  "status": "running"
}
```

Status values: `running`, `ready`, `failed`

### 4. Download Results

Once status is `ready`, download the collected data (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

### 5. List Snapshots

Get all your snapshots:

```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {snapshot_id, dataset_id, status}'
```

### 6. Cancel Snapshot

Cancel a running job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

## Platform-Specific Examples

### Twitter/X - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/elonmusk"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `x_id`, `profile_name`, `biography`, `is_verified`, `followers`, `following`, `profile_image_link`

### Twitter/X - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username/status/123456789"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `post_id`, `text`, `replies`, `likes`, `retweets`, `views`, `hashtags`, `media`

### Reddit - Scrape Subreddit Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Parameters:** `url`, `sort_by` (new/top/hot)

**Returns:** `post_id`, `title`, `description`, `num_comments`, `upvotes`, `date_posted`, `community`

### Reddit - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `comment_id`, `user_posted`, `comment_text`, `upvotes`, `replies`

### YouTube - Scrape Video Info

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `title`, `views`, `likes`, `num_comments`, `video_length`, `transcript`, `channel_name`

### YouTube - Search by Keyword

Write to `/tmp/brightdata_request.json`:

```json
[
  {"keyword": "artificial intelligence", "num_of_posts": 50}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

### YouTube - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `comment_text`, `likes`, `replies`, `username`, `date`

### Instagram - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.instagram.com/username"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `followers`, `post_count`, `profile_name`, `is_verified`, `biography`

### Instagram - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {
    "url": "https://www.instagram.com/username",
    "num_of_posts": 20,
    "start_date": "01-01-2024",
    "end_date": "12-31-2024"
  }
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

## Account Management

### Check Account Status

```bash
curl -s "https://api.brightdata.com/status" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

**Response:**
```json
{
  "status": "active",
  "customer": "hl_xxxxxxxx",
  "can_make_requests": true,
  "ip": "x.x.x.x"
}
```

### Get Active Zones

```bash
curl -s "https://api.brightdata.com/zone/get_active_zones" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {name, type}'
```

### Get Bandwidth Usage

```bash
curl -s "https://api.brightdata.com/customer/bw" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

## Getting Dataset IDs

To use the scraping features, you need a `dataset_id`:

1. Go to [Bright Data Control Panel](https://brightdata.com/cp/datasets)
2. Create a new Web Scraper dataset or select an existing one
3. Choose the platform (Twitter, Reddit, YouTube, etc.)
4. Copy the `dataset_id` from the dataset settings

Dataset IDs can also be found in the bandwidth usage API response under the `data` field keys (e.g., `v__ds_api_gd_xxxxx` where `gd_xxxxx` is your dataset ID).

## Common Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Filter by date (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | Filter by date (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format | `json`, `csv` |

## Rate Limits

- Batch mode: up to 100 concurrent requests
- Maximum input size: 1GB per batch
- Exceeding limits returns `429` error

## Guidelines

1. **Create datasets first**: Use the Control Panel to create scraper datasets
2. **Use async for large jobs**: Use `/trigger` for discovery and batch operations
3. **Use sync for small jobs**: Use `/scrape` for single URL quick lookups
4. **Check status before download**: Poll `/progress` until status is `ready`
5. **Respect rate limits**: Don't exceed 100 concurrent requests
6. **Date format**: Use MM-DD-YYYY for date parameters
More from vm0-ai/vm0-skills