bio-batch-downloads
$
npx mdskill add GPTomics/bioSkills/bio-batch-downloadsEfficiently download large NCBI datasets using batching and rate limiting
- Solve bulk sequence download challenges for large query results
- Uses BioPython's Bio.Entrez with NCBI history server and API key
- Applies rate limiting and retry logic to avoid API throttling
- Delivers results via batched efetch requests with controlled delays
SKILL.md
.github/skills/bio-batch-downloadsView on GitHub ↗
---
name: bio-batch-downloads
description: Download large datasets from NCBI efficiently using history server, batching, and rate limiting. Use when performing bulk sequence downloads, handling large query results, or production-scale data retrieval.
tool_type: python
primary_tool: Bio.Entrez
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, Entrez Direct 21.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Batch Downloads
**"Download thousands of sequences from NCBI"** → Search NCBI with history server, then batch-fetch results with rate limiting and retry logic.
- Python: `Entrez.esearch()`, `Entrez.efetch()` with `usehistory='y'` (BioPython)
Download large numbers of records from NCBI efficiently using the history server, batching, and proper rate limiting.
## Required Setup
```python
from Bio import Entrez
import time
Entrez.email = 'your.email@example.com' # Required by NCBI
Entrez.api_key = 'your_api_key' # Recommended for large downloads
```
## Rate Limits
| Authentication | Requests/Second | Delay Between |
|---------------|-----------------|---------------|
| Email only | 3 | 0.34 seconds |
| Email + API key | 10 | 0.1 seconds |
Get an API key at: https://www.ncbi.nlm.nih.gov/account/settings/
## History Server
The history server stores search results on NCBI servers, enabling efficient batch retrieval without re-sending large ID lists.
### How It Works
1. Search with `usehistory='y'`
2. Get `WebEnv` (session ID) and `query_key` (result set ID)
3. Fetch results in batches using these identifiers
4. Results stay available for ~15 minutes
```python
# Search with history
handle = Entrez.esearch(db='nucleotide', term='human[orgn] AND mRNA[fkey]', usehistory='y')
search = Entrez.read(handle)
handle.close()
webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])
print(f"Found {total} records, stored in history")
```
## Core Pattern: Batch Download
**Goal:** Download all records matching a search query, handling NCBI rate limits and large result sets.
**Approach:** Search with history server enabled to store results on NCBI, then fetch in batches using `WebEnv`/`query_key` with appropriate delays.
**Reference (BioPython 1.83+):**
```python
from Bio import Entrez, SeqIO
import time
Entrez.email = 'your.email@example.com'
def batch_download(db, term, output_file, rettype='fasta', batch_size=500):
handle = Entrez.esearch(db=db, term=term, usehistory='y')
search = Entrez.read(handle)
handle.close()
webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])
print(f"Downloading {total} records...")
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
print(f" Fetching {start+1}-{min(start+batch_size, total)}...")
handle = Entrez.efetch(
db=db,
rettype=rettype,
retmode='text',
retstart=start,
retmax=batch_size,
webenv=webenv,
query_key=query_key
)
out.write(handle.read())
handle.close()
time.sleep(0.34) # Rate limiting (no API key)
print(f"Saved to {output_file}")
```
## Code Patterns
### Download All Search Results
```python
from Bio import Entrez
import time
Entrez.email = 'your.email@example.com'
Entrez.api_key = 'your_api_key' # Optional
def download_search_results(db, term, output_file, rettype='fasta', batch_size=500):
# Search with history server
handle = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
search = Entrez.read(handle)
handle.close()
webenv = search['WebEnv']
query_key = search['QueryKey']
total = int(search['Count'])
if total == 0:
print("No records found")
return
delay = 0.1 if Entrez.api_key else 0.34
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
end = min(start + batch_size, total)
print(f"Downloading {start+1}-{end} of {total}")
attempts = 3
for attempt in range(attempts):
try:
handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
out.write(handle.read())
handle.close()
break
except Exception as e:
if attempt < attempts - 1:
print(f" Retry {attempt+1}: {e}")
time.sleep(5)
else:
raise
time.sleep(delay)
print(f"Downloaded {total} records to {output_file}")
download_search_results('nucleotide', 'human[orgn] AND insulin[gene] AND mRNA[fkey]', 'insulin_mrna.fasta')
```
### Download by ID List
```python
def download_by_ids(db, ids, output_file, rettype='fasta', batch_size=200):
total = len(ids)
delay = 0.1 if Entrez.api_key else 0.34
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
batch = ids[start:start+batch_size]
print(f"Downloading {start+1}-{start+len(batch)} of {total}")
handle = Entrez.efetch(db=db, id=','.join(batch), rettype=rettype, retmode='text')
out.write(handle.read())
handle.close()
time.sleep(delay)
print(f"Downloaded {total} records to {output_file}")
# Example with list of IDs
ids = ['NM_007294', 'NM_000059', 'NM_000546', 'NM_001126112', 'NM_004985']
download_by_ids('nucleotide', ids, 'genes.fasta')
```
### Post IDs to History (EPost)
For very large ID lists, post them to the history server first:
```python
def post_and_download(db, ids, output_file, rettype='fasta', batch_size=500):
# Post IDs to history server
handle = Entrez.epost(db=db, id=','.join(ids))
result = Entrez.read(handle)
handle.close()
webenv = result['WebEnv']
query_key = result['QueryKey']
total = len(ids)
delay = 0.1 if Entrez.api_key else 0.34
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
end = min(start + batch_size, total)
print(f"Fetching {start+1}-{end} of {total}")
handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
out.write(handle.read())
handle.close()
time.sleep(delay)
print(f"Downloaded {total} records")
```
### Download GenBank with Progress
```python
from Bio import Entrez, SeqIO
from io import StringIO
import time
def download_genbank_records(term, output_file, batch_size=100):
Entrez.email = 'your.email@example.com'
# Search
handle = Entrez.esearch(db='nucleotide', term=term, usehistory='y')
search = Entrez.read(handle)
handle.close()
webenv, query_key = search['WebEnv'], search['QueryKey']
total = int(search['Count'])
records = []
for start in range(0, total, batch_size):
print(f"Fetching {start+1}-{min(start+batch_size, total)} of {total}")
handle = Entrez.efetch(db='nucleotide', rettype='gb', retmode='text',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
batch_records = list(SeqIO.parse(handle, 'genbank'))
handle.close()
records.extend(batch_records)
time.sleep(0.34)
SeqIO.write(records, output_file, 'genbank')
print(f"Saved {len(records)} GenBank records")
return records
```
### Download with Retry Logic
```python
import time
from urllib.error import HTTPError
def robust_download(db, term, output_file, rettype='fasta', batch_size=500, max_retries=3):
handle = Entrez.esearch(db=db, term=term, usehistory='y')
search = Entrez.read(handle)
handle.close()
webenv, query_key = search['WebEnv'], search['QueryKey']
total = int(search['Count'])
delay = 0.1 if Entrez.api_key else 0.34
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
for retry in range(max_retries):
try:
handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
data = handle.read()
handle.close()
if data.strip():
out.write(data)
break
except HTTPError as e:
if e.code == 429: # Rate limit
wait = 10 * (retry + 1)
print(f"Rate limited, waiting {wait}s...")
time.sleep(wait)
elif retry == max_retries - 1:
raise
else:
time.sleep(5)
time.sleep(delay)
print(f"Downloaded to {output_file}")
```
### Stream to File (Memory Efficient)
```python
def stream_download(db, term, output_file, rettype='fasta', batch_size=1000):
handle = Entrez.esearch(db=db, term=term, usehistory='y')
search = Entrez.read(handle)
handle.close()
webenv, query_key = search['WebEnv'], search['QueryKey']
total = int(search['Count'])
downloaded = 0
with open(output_file, 'w') as out:
for start in range(0, total, batch_size):
handle = Entrez.efetch(db=db, rettype=rettype, retmode='text',
retstart=start, retmax=batch_size,
webenv=webenv, query_key=query_key)
# Stream chunks to file
while True:
chunk = handle.read(8192)
if not chunk:
break
out.write(chunk)
handle.close()
downloaded = min(start + batch_size, total)
print(f"Progress: {downloaded}/{total} ({100*downloaded/total:.1f}%)")
time.sleep(0.34)
```
## Batch Size Guidelines
| Database | rettype | Recommended Batch |
|----------|---------|-------------------|
| nucleotide | fasta | 500-1000 |
| nucleotide | gb | 100-200 |
| protein | fasta | 500-1000 |
| protein | gp | 100-200 |
| pubmed | abstract | 1000-2000 |
| pubmed | xml | 200-500 |
Smaller batches for GenBank/XML (more data per record).
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `HTTPError 429` | Rate limit exceeded | Increase delay, use API key |
| `HTTPError 400` | Invalid WebEnv/query_key | Session expired, re-search |
| Incomplete data | Connection interrupted | Add retry logic |
| Memory error | Batch too large | Reduce batch_size |
| Empty response | No more records | Check total vs start |
## Decision Tree
```
Need to download many NCBI records?
├── Have search query?
│ └── Use esearch with usehistory='y', then batch efetch
├── Have list of IDs?
│ ├── < 200 IDs? → Direct efetch with comma-separated IDs
│ └── >= 200 IDs? → Use epost, then batch efetch
├── Need records as Biopython objects?
│ └── Parse each batch with SeqIO
├── Downloading > 10,000 records?
│ └── Use streaming to avoid memory issues
└── Getting rate limited?
└── Get API key, add retry logic
```
## Related Skills
- entrez-search - Build search queries for batch downloads
- entrez-fetch - Single record fetching
- sra-data - For raw sequencing data (use SRA toolkit instead)
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.