theharvester

Name: theharvester
Author: TerminalSkills/skills
$npx mdskill add TerminalSkills/skills/theharvester
Stealthly harvest emails, subdomains, and IPs from public sources.
Gathers corporate contacts and exposed assets without triggering alerts.
Integrates with Google, LinkedIn, Shodan, Hunter, and DNSDumpster.
Executes queries against multiple OSINT databases simultaneously.
Outputs structured lists of emails, domains, and IP ranges.
SKILL.md
.github/skills/theharvesterView on GitHub ↗
---
name: theharvester
description: >-
  Passive email, subdomain, and IP harvesting from public sources using theHarvester. Use when:
  gathering corporate email lists, enumerating subdomains passively, pre-engagement recon, finding
  exposed employee contacts without triggering alerts.
license: Apache-2.0
compatibility: "Python 3.9+"
metadata:
  author: terminal-skills
  version: "1.0.0"
  category: research
  tags: [theharvester, email, subdomain, passive, harvesting]
  use-cases:
    - "Harvest all email addresses for a target domain from public sources"
    - "Enumerate subdomains passively before an authorized penetration test"
    - "Find employee contacts via LinkedIn, Hunter, and search engines"
    - "Gather IP ranges and hostnames associated with an organization"
  agents: [claude-code, openai-codex, gemini-cli, cursor]
---

# theHarvester

## Overview

theHarvester is a passive OSINT tool that aggregates information about a target domain from multiple public sources. It finds email addresses, subdomains, hostnames, and IP ranges without making any direct requests to the target — making it ideal for stealth recon during the pre-engagement phase of penetration tests or OSINT investigations.

**Sources include:** Google, Bing, DuckDuckGo, LinkedIn, Shodan, Hunter.io, CertSpotter, DNSDumpster, VirusTotal, and more.

## Instructions

### Step 1: Install theHarvester

```bash
# Option 1: pip (in a virtual environment recommended)
pip install theHarvester

# Option 2: Clone from GitHub (most up-to-date)
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements/base.txt

# Option 3: Docker
docker pull ghcr.io/laramies/theharvester
docker run ghcr.io/laramies/theharvester -d example.com -b google
```

### Step 2: Basic usage

```bash
# Syntax: theHarvester -d <domain> -b <source> [options]
# -d  target domain
# -b  data source(s)
# -l  limit results (default: 500)
# -f  output filename (supports XML and JSON)
# -n  DNS lookup on discovered hosts
# -v  verify host via DNS resolution

# Search a single source
theHarvester -d example.com -b google

# Search all available sources
theHarvester -d example.com -b all

# Limit results, enable DNS lookup, save output
theHarvester -d example.com -b google,bing,linkedin -l 200 -n -f results_example

# Run from cloned repo
python3 theHarvester.py -d example.com -b all -l 500 -f output
```

### Step 3: Choose sources strategically

```bash
# Email harvesting — best sources
theHarvester -d example.com -b google,bing,hunter,linkedin

# Subdomain enumeration — best sources
theHarvester -d example.com -b certspotter,dnsdumpster,virustotal,shodan

# Comprehensive (slower, uses all sources)
theHarvester -d example.com -b all -l 1000 -f full_recon_example

# LinkedIn employee discovery (requires LinkedIn API key in api-keys.yaml)
theHarvester -d example.com -b linkedin -l 200
```

### Step 4: Configure API keys

```yaml
# api-keys.yaml (place in theHarvester directory or specify with -c flag)
apikeys:
  hunter:
    key: YOUR_HUNTER_IO_KEY
  shodan:
    key: YOUR_SHODAN_KEY
  virustotal:
    key: YOUR_VIRUSTOTAL_KEY
  binaryedge:
    key: YOUR_BINARYEDGE_KEY
  fullhunt:
    key: YOUR_FULLHUNT_KEY
  securityTrails:
    key: YOUR_SECURITYTRAILS_KEY
  github:
    key: YOUR_GITHUB_TOKEN
```

### Step 5: Parse and process output with Python

```python
import json
import subprocess
import re

def run_harvester(domain, sources="google,bing,certspotter,dnsdumpster", limit=500):
    """Run theHarvester and return parsed results."""
    output_file = f"harvester_{domain.replace('.', '_')}"
    cmd = [
        "theHarvester",
        "-d", domain,
        "-b", sources,
        "-l", str(limit),
        "-f", output_file,
    ]
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    print(result.stdout)

    # Parse JSON output
    json_file = f"{output_file}.json"
    try:
        with open(json_file) as f:
            data = json.load(f)
        return data
    except FileNotFoundError:
        # Fall back to parsing stdout
        return parse_stdout(result.stdout)

def parse_stdout(output):
    """Extract emails, hosts, and IPs from raw stdout."""
    emails = set(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', output))
    # Filter out false positives
    emails = {e for e in emails if not e.endswith(('.png', '.jpg', '.css', '.js'))}

    hosts = set(re.findall(r'[\w\.-]+\.\w{2,}', output))
    ips = set(re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', output))

    return {"emails": list(emails), "hosts": list(hosts), "ips": list(ips)}

def deduplicate_and_report(data, domain):
    """Clean and summarize harvested data."""
    emails = sorted(set(data.get("emails", [])))
    hosts = sorted(set(data.get("hosts", [])))
    ips = sorted(set(data.get("ips", [])))

    # Filter to target domain
    domain_emails = [e for e in emails if domain in e]
    domain_hosts = [h for h in hosts if domain in h]

    print(f"\n=== Harvest Report: {domain} ===")
    print(f"Emails found:    {len(domain_emails)}")
    print(f"Subdomains:      {len(domain_hosts)}")
    print(f"IP addresses:    {len(ips)}")

    if domain_emails:
        print("\nEmails:")
        for e in domain_emails[:20]:
            print(f"  {e}")

    if domain_hosts:
        print("\nSubdomains:")
        for h in domain_hosts[:20]:
            print(f"  {h}")

    return {
        "emails": domain_emails,
        "subdomains": domain_hosts,
        "ips": ips,
    }

# Usage
results = run_harvester("target-company.com", sources="google,bing,certspotter,hunter")
clean = deduplicate_and_report(results, "target-company.com")

# Save cleaned results
with open("clean_results.json", "w") as f:
    json.dump(clean, f, indent=2)
```

### Step 6: Combine with other tools

```bash
# Pass discovered subdomains to nmap (only with explicit authorization)
theHarvester -d example.com -b all -f hosts
cat hosts.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for host in data.get('hosts', []):
    print(host)
" > subdomains.txt

# Feed subdomains into amass for deeper DNS enumeration
cat subdomains.txt | amass enum -df - -passive

# Check emails against breach databases
cat emails.txt | while read email; do
    curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/$email" \
         -H "hibp-api-key: YOUR_HIBP_KEY"
done
```

## Available Sources Reference

| Source | Data Type | API Key Required |
|--------|-----------|-----------------|
| `google` | Emails, subdomains | No |
| `bing` | Emails, subdomains | No |
| `duckduckgo` | Emails, subdomains | No |
| `linkedin` | Employees, emails | Optional |
| `hunter` | Emails | Yes |
| `certspotter` | Subdomains (SSL certs) | No |
| `dnsdumpster` | Subdomains, IPs | No |
| `virustotal` | Subdomains | Yes |
| `shodan` | IPs, open ports | Yes |
| `securitytrails` | Subdomains, DNS | Yes |
| `github` | Emails, code | Yes |
| `binaryedge` | IPs, services | Yes |

## Guidelines

- **Always get authorization** before running theHarvester against a target — passive does not mean invisible. Data queries may be logged by third-party services.
- **Rate limits**: Without API keys, theHarvester relies on scraping search engines which may throttle or block requests. Add API keys for reliable results.
- **Combine sources**: No single source is complete. Use multiple sources and deduplicate.
- **Email format detection**: Once you have a few emails (e.g., `jsmith@corp.com`, `john.smith@corp.com`), infer the naming convention and use it to generate a target list.
- **DNS verification**: Always use `-n` or `-v` to verify discovered hosts are live before reporting.