bio-sra-data
$ npx mdskill add GPTomics/bioSkills/bio-sra-data
Fetch raw sequencing reads from NCBI SRA as FASTQ files
- Retrieves fastq sequences for specific sra accessions efficiently
- Depends on fasterq-dump and the official sra toolkit software
- Validates installed versions match before executing download commands
- Outputs raw sequence data directly to local storage or streams
---
name: bio-sra-data
description: Download sequencing data from NCBI SRA using the SRA toolkit. Use when downloading FASTQ files from SRA accessions, prefetching large datasets, or validating SRA downloads.
tool_type: cli
primary_tool: sra-tools
---
## Version Compatibility
Reference examples tested with: Biopython 1.83+, Entrez Direct 21.0+, SRA Toolkit 3.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
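The version check described above can be sketched in Python as follows. This is a minimal illustration, not part of the toolkit: the helper names (`parse_version`, `missing_tools`, `tool_version`, `meets_minimum`) are hypothetical, and the parser assumes only that the `--version` output contains an `X.Y.Z` string somewhere.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first X.Y.Z found in text as a tuple of ints, or None."""
    match = re.search(r'(\d+)\.(\d+)\.(\d+)', text)
    return tuple(int(part) for part in match.groups()) if match else None


def missing_tools(tools):
    """Return the entries of `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def tool_version(tool):
    """Run `<tool> --version` and parse a version from stdout or stderr."""
    result = subprocess.run([tool, '--version'], capture_output=True, text=True)
    return parse_version(result.stdout + result.stderr)


def meets_minimum(version, minimum=(3, 0, 0)):
    """True if a parsed version tuple is at least `minimum`."""
    return version is not None and version >= minimum
```

A caller might first run `missing_tools(['prefetch', 'fasterq-dump', 'vdb-validate'])` and only call `tool_version('fasterq-dump')` once that list is empty.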
# SRA Data
Download raw sequencing data from the Sequence Read Archive using the SRA toolkit.
**"Download FASTQ from SRA"** → Fetch raw sequencing reads from an SRA accession as FASTQ files.
- CLI: `fasterq-dump SRR_ACCESSION` (SRA Toolkit)
- Python: `subprocess.run(['fasterq-dump', accession])` or `Entrez.efetch()` for metadata
## Installation
```bash
# macOS
brew install sratoolkit
# Ubuntu/Debian
sudo apt install sra-toolkit
# conda (recommended)
conda install -c bioconda sra-tools
# Verify installation
fasterq-dump --version
```
## Core Commands
### fasterq-dump - Download FASTQ (Recommended)
Fast, multithreaded FASTQ extraction. Preferred over `fastq-dump`.
```bash
# Download single SRA run as FASTQ
fasterq-dump SRR12345678
# Output: SRR12345678.fastq (single-end)
# Or: SRR12345678_1.fastq, SRR12345678_2.fastq (paired-end)
```
**Key Options:**
| Option | Description | Example |
|--------|-------------|---------|
| `-O` / `--outdir` | Output directory | `-O ./fastq/` |
| `-o` / `--outfile` | Output filename | `-o sample.fastq` |
| `-e` / `--threads` | Number of threads | `-e 8` |
| `-p` / `--progress` | Show progress bar | `-p` |
| `-S` / `--split-files` | Write each read of a spot to its own file | `-S` |
| `-3` / `--split-3` | Split paired reads and write unpaired reads to a separate file (default) | `-3` |
| `--skip-technical` | Skip technical reads | `--skip-technical` |
| `-t` / `--temp` | Temp directory | `-t /tmp` |
| `-f` / `--force` | Overwrite existing | `-f` |
```bash
# Common usage with options
fasterq-dump SRR12345678 -O ./data/ -e 8 -p --skip-technical
# Force split files (paired-end)
fasterq-dump SRR12345678 -S -O ./data/
```
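Because the output file names depend on the library layout (`SRR..._1.fastq` and `SRR..._2.fastq` for paired-end, `SRR....fastq` for single-end), a script usually has to check which files actually appeared. A minimal sketch, assuming the naming shown above; `find_fastq_outputs` is a hypothetical helper, not an SRA Toolkit function:

```python
from pathlib import Path


def find_fastq_outputs(accession, outdir='.'):
    """Classify fasterq-dump output for one accession by file name.

    Returns ('paired' | 'single' | 'missing', list_of_paths).
    """
    outdir = Path(outdir)
    mates = [outdir / f'{accession}_1.fastq', outdir / f'{accession}_2.fastq']
    single = outdir / f'{accession}.fastq'
    if all(path.exists() for path in mates):
        # An extra un-suffixed file next to the mates holds unpaired reads
        extra = [single] if single.exists() else []
        return 'paired', mates + extra
    if single.exists():
        return 'single', [single]
    return 'missing', []
```

Downstream steps can branch on the first element of the returned tuple instead of assuming a layout up front.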
### prefetch - Download SRA Files First
For large files or unreliable connections, prefetch first, then convert.
```bash
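# Prefetch SRA file (SRA Toolkit 3.0+ writes ./SRR12345678/SRR12345678.sra
# unless a user repository is configured; older releases used ~/ncbi/)
prefetch SRR12345678
# Then convert to FASTQ by path
fasterq-dump ./SRR12345678/SRR12345678.sra
# Or convert by accession
fasterq-dump SRR12345678 # Will find the prefetched file when run from the same directory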
```
**Prefetch Options:**
| Option | Description |
|--------|-------------|
| `-O` / `--output-directory` | Download location |
| `-p` / `--progress` | Show progress |
| `-f` / `--force` | Re-download if exists (takes a value, e.g. `-f yes`) |
| `-X` / `--max-size` | Max file size to download (e.g., `50G`) |
```bash
# Prefetch with size limit
prefetch SRR12345678 --max-size 100G -p
# Prefetch multiple accessions
prefetch SRR12345678 SRR12345679 SRR12345680
# Prefetch from a list file
prefetch --option-file accessions.txt
```
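When prefetch is driven from Python, the same options can be assembled into an argument list. The sketch below uses only the flags listed in the table above; `read_accessions` and `build_prefetch_cmd` are hypothetical helper names, and the comment-skipping rule (`#` at the start of a line) is an assumption about how the list file is written.

```python
def read_accessions(lines):
    """Return accessions from an iterable of lines, skipping blanks and '#' comments."""
    accessions = []
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith('#'):
            accessions.append(entry)
    return accessions


def build_prefetch_cmd(accessions, max_size=None, outdir=None, progress=True):
    """Build a prefetch argument list suitable for subprocess.run()."""
    cmd = ['prefetch']
    if max_size:
        cmd += ['--max-size', max_size]
    if outdir:
        cmd += ['-O', outdir]
    if progress:
        cmd.append('-p')
    return cmd + list(accessions)
```

For example, `build_prefetch_cmd(read_accessions(open('accessions.txt')), max_size='100G')` produces a list that can be passed directly to `subprocess.run`.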
### vdb-validate - Verify Downloads
Check integrity of downloaded SRA files.
```bash
# Validate a downloaded file
vdb-validate SRR12345678
# Merge stderr into stdout so the validation report can be piped or logged
vdb-validate SRR12345678 2>&1
```
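In a pipeline, the useful signal from `vdb-validate` is its exit status. A minimal sketch of a wrapper that treats a non-zero status as a failed validation; `validate_run` and its `runner` parameter are hypothetical, with `runner` there only so the function can be exercised without the toolkit installed.

```python
import subprocess


def validate_run(target, runner=subprocess.run):
    """Run vdb-validate on an accession or .sra path.

    Returns (ok, report), where ok is True when the exit status is 0
    and report is the combined stdout/stderr text.
    """
    result = runner(['vdb-validate', str(target)], capture_output=True, text=True)
    report = (result.stdout or '') + (result.stderr or '')
    return result.returncode == 0, report
```

A caller would typically skip conversion and re-run `prefetch` when `ok` is false.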
### sra-stat - Get Run Statistics
Get information about an SRA run without downloading.
```bash
# Basic stats
sra-stat --quick SRR12345678
# Detailed XML output
sra-stat --xml SRR12345678
```
## Configuration
### vdb-config - Configure SRA Toolkit
Set up cache location and other settings.
```bash
# Interactive configuration
vdb-config -i
# Set cache directory
vdb-config --set /repository/user/main/public/root=/path/to/cache
# Check current configuration
vdb-config --cfg
```
### Cache Location
Default: `~/ncbi/` on Linux/macOS
```bash
# Create dedicated cache
mkdir -p /data/sra_cache
vdb-config --set /repository/user/main/public/root=/data/sra_cache
```
## Code Patterns
### Download Single Run
```bash
#!/bin/bash
SRR="SRR12345678"
OUTDIR="./fastq"
mkdir -p "$OUTDIR"
fasterq-dump "$SRR" -O "$OUTDIR" -e 8 -p
```
### Download Multiple Runs
```bash
#!/bin/bash
# From a list of accessions
while read -r SRR; do
    echo "Downloading $SRR..."
    fasterq-dump "$SRR" -O ./fastq/ -e 4 -p
done < accessions.txt
```
### Prefetch Then Convert (Large Files)
```bash
#!/bin/bash
SRR="SRR12345678"
# Prefetch first (resumable)
prefetch $SRR -p
# Validate
vdb-validate $SRR
# Convert to FASTQ
fasterq-dump $SRR -O ./fastq/ -e 8 -p
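# Optionally remove the prefetched .sra file (its location depends on toolkit
# version and configuration; SRA Toolkit 3.0+ defaults to ./$SRR/)
rm -f "./${SRR}/${SRR}.sra"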
```
### Batch Download Script
**Goal:** Download, validate, and convert multiple SRA accessions from a list file in a single automated run.
**Approach:** Loop through accessions, prefetch each .sra file for resumable downloading, validate integrity with vdb-validate, then convert to FASTQ with fasterq-dump.
```bash
#!/bin/bash
# download_sra.sh - Download multiple SRA runs
ACCESSIONS="$1"
OUTDIR="${2:-./fastq}"
THREADS="${3:-4}"
mkdir -p "$OUTDIR"
while read -r SRR; do
    # Skip blank lines and comments
    if [[ -z "$SRR" ]] || [[ "$SRR" == \#* ]]; then
        continue
    fi
    echo "Processing $SRR..."
    # Prefetch
    prefetch "$SRR" -p -O "$OUTDIR"
    # Validate
    if ! vdb-validate "${OUTDIR}/${SRR}/${SRR}.sra" 2>/dev/null; then
        echo "Validation failed for $SRR, skipping..."
        continue
    fi
    # Convert
    fasterq-dump "${OUTDIR}/${SRR}/${SRR}.sra" -O "$OUTDIR" -e "$THREADS" -p
    # Cleanup .sra
    rm -rf "${OUTDIR:?}/${SRR}"
    echo "Completed $SRR"
done < "$ACCESSIONS"
```
### Python Wrapper
```python
import subprocess
import os

def download_sra(accession, outdir='.', threads=4, skip_technical=True):
    os.makedirs(outdir, exist_ok=True)
    cmd = ['fasterq-dump', accession, '-O', outdir, '-e', str(threads), '-p']
    if skip_technical:
        cmd.append('--skip-technical')
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"fasterq-dump failed: {result.stderr}")
    return result.stdout

# Download a run
download_sra('SRR12345678', outdir='./data', threads=8)
```
### Find SRA Accessions with Entrez
**Goal:** Discover SRA run accessions for a BioProject or search query without browsing the SRA website.
**Approach:** Search the SRA database via Entrez, then fetch run info in CSV format and parse out the run accessions (SRR IDs).
```python
import csv
import io
from Bio import Entrez
Entrez.email = 'your.email@example.com'

def find_sra_runs(term, max_results=100):
    handle = Entrez.esearch(db='sra', term=term, retmax=max_results)
    search = Entrez.read(handle)
    handle.close()
    if not search['IdList']:
        return []
    handle = Entrez.efetch(db='sra', id=','.join(search['IdList']), rettype='runinfo', retmode='text')
    runinfo = handle.read()
    handle.close()
    # Some Biopython versions return bytes here
    if isinstance(runinfo, bytes):
        runinfo = runinfo.decode('utf-8')
    # Parse the CSV and read the 'Run' column (the run accession);
    # skip header rows that can repeat between records
    reader = csv.DictReader(io.StringIO(runinfo))
    return [row['Run'] for row in reader if row.get('Run') and row['Run'] != 'Run']

# Find runs for a project
runs = find_sra_runs('PRJNA123456[bioproject]')
print(f"Found {len(runs)} runs")
```
## SRA Accession Types
| Prefix | Type | Description |
|--------|------|-------------|
| SRR | Run | Individual sequencing run |
| SRX | Experiment | Experimental design |
| SRS | Sample | Biological sample |
| SRP | Project/Study | Research project |
| PRJNA | BioProject | NCBI BioProject ID |
| SAMN | BioSample | NCBI BioSample ID |
Use Run accessions (SRR*) with fasterq-dump.
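Since only run accessions work with fasterq-dump, it can help to check the prefix before launching a download. The sketch below covers only the prefixes in the table above (accessions from other archives are not handled); `accession_type` and `is_run` are hypothetical helper names.

```python
import re

# Prefix-to-type mapping taken from the accession table
ACCESSION_TYPES = {
    'SRR': 'Run',
    'SRX': 'Experiment',
    'SRS': 'Sample',
    'SRP': 'Project/Study',
    'PRJNA': 'BioProject',
    'SAMN': 'BioSample',
}
_ACCESSION_RE = re.compile(r'^(SRR|SRX|SRS|SRP|PRJNA|SAMN)\d+$')


def accession_type(accession):
    """Return the accession type for a known prefix, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    return ACCESSION_TYPES[match.group(1)] if match else None


def is_run(accession):
    """True if the accession looks like a run (SRR...) accession."""
    return accession_type(accession) == 'Run'
```

Anything that is not a run (for example a BioProject ID) first needs to be resolved to its SRR accessions, for instance with the Entrez search shown earlier.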
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `item not found` | Invalid accession | Check accession exists |
| `disk full` | Insufficient space | Check temp and output dirs |
| `timeout` | Network issues | Use prefetch first |
| `path not found` | Bad output path | Create output directory |
| `permission denied` | Cache permission | Check vdb-config |
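The `disk full` case can often be caught before a download starts by checking free space on the temp and output directories. A minimal sketch using the standard library; `has_free_space` is a hypothetical helper, and the required size is an assumption the caller must supply (for example from the run size reported by `sra-stat`, plus headroom for temporary files).

```python
import shutil


def has_free_space(path, required_gb):
    """True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3
```

Checking both the `-t` temp directory and the `-O` output directory matters because they may sit on different filesystems.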
## Comparison: fasterq-dump vs fastq-dump
| Feature | fasterq-dump | fastq-dump |
|---------|--------------|------------|
| Speed | Fast (multithreaded) | Slow (single-threaded) |
| Memory | Higher | Lower |
| Progress | Built-in | None |
| Recommended | Yes | Legacy only |
Prefer `fasterq-dump` unless memory or temporary disk space is constrained.
## Decision Tree
```
Need SRA sequencing data?
├── Know the SRR accession?
│ └── fasterq-dump SRR... -O ./fastq/ -p
├── Large file (>20GB)?
│ └── prefetch first, then fasterq-dump
├── Multiple runs?
│ └── Loop through accessions or use prefetch --option-file
├── Need to find accessions?
│ └── Search SRA database with Entrez
├── Download interrupted?
│ └── prefetch supports resume
└── Verify integrity?
└── vdb-validate SRR...
```
## Related Skills
- entrez-search - Search SRA database to find accessions
- sequence-io - Read downloaded FASTQ files with Biopython
- sequence-io/paired-end-fastq - Handle paired R1/R2 files
- alignment-files - Align downloaded reads