bio-workflows-fastq-to-variants
$
npx mdskill add GPTomics/bioSkills/bio-workflows-fastq-to-variantsOrchestrate DNA sequencing workflow from FASTQ files to variant calls
- Call variants from whole-genome or exome sequencing data
- Uses BWA, bcftools, GATK, and fastp for alignment and variant calling
- Applies quality checks at each step to ensure data integrity
- Outputs variant calls in VCF format with quality metrics
SKILL.md
.github/skills/bio-workflows-fastq-to-variantsView on GitHub ↗
---
name: bio-workflows-fastq-to-variants
description: End-to-end DNA sequencing workflow from FASTQ files to variant calls. Covers QC, alignment with BWA, BAM processing, and variant calling with bcftools or GATK HaplotypeCaller. Use when calling variants from raw sequencing reads.
tool_type: cli
primary_tool: bcftools
workflow: true
depends_on:
- read-qc/fastp-workflow
- read-alignment/bwa-alignment
- alignment-files/alignment-sorting
- alignment-files/duplicate-handling
- variant-calling/variant-calling
- variant-calling/filtering-best-practices
qc_checkpoints:
- after_qc: "Q30 >85%, adapter content <1%"
- after_alignment: "Mapping rate >95%, properly paired >90%"
- after_dedup: "Duplication rate <30% for WGS, <50% for exome"
- after_calling: "Ti/Tv ratio ~2.1 for WGS, dbSNP overlap >95%"
---
## Version Compatibility
Reference examples tested with: BWA-MEM2 2.2.1+, Ensembl VEP 111+, GATK 4.5+, bcftools 1.19+, fastp 0.23+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# FASTQ to Variants Workflow
**"Call variants from my whole-genome or exome FASTQ files"** → Orchestrate fastp QC, BWA-MEM2 alignment, duplicate marking, BQSR, GATK HaplotypeCaller variant calling, and VQSR/hard filtering to produce filtered VCF output.
Complete pipeline from raw DNA sequencing FASTQ files to filtered variant calls.
## Workflow Overview
```
FASTQ files
|
v
[1. QC & Trimming] -----> fastp
|
v
[2. Alignment] ---------> bwa-mem2
|
v
[3. BAM Processing] ----> sort, markdup, index
|
v
[4. Variant Calling] ---> bcftools (primary) or GATK
|
v
[5. Filtering] ---------> Quality filters
|
v
Filtered VCF
```
## Primary Path: BWA + bcftools
### Step 1: Quality Control with fastp
```bash
# Single sample
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 50 \
--html sample_fastp.html
# Batch processing
for sample in sample1 sample2 sample3; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html qc/${sample}_fastp.html
done
```
**QC Checkpoint 1:** Check fastp reports
- Q30 bases >85% (DNA typically higher quality than RNA)
- Adapter content <1%
- No unusual GC distribution
### Step 2: BWA-MEM2 Alignment
```bash
# Index reference (once)
bwa-mem2 index reference.fa
# Align with read group info
for sample in sample1 sample2 sample3; do
bwa-mem2 mem -t 8 \
-R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA\tLB:lib1" \
reference.fa \
trimmed/${sample}_R1.fq.gz \
trimmed/${sample}_R2.fq.gz \
| samtools view -bS - > aligned/${sample}.bam
done
```
**QC Checkpoint 2:** Check alignment stats
```bash
samtools flagstat aligned/${sample}.bam
```
- Mapped reads >95%
- Properly paired >90%
### Step 3: BAM Processing
```bash
for sample in sample1 sample2 sample3; do
# Sort by coordinate
samtools sort -@ 8 -o aligned/${sample}.sorted.bam aligned/${sample}.bam
# Mark duplicates (samtools method)
samtools fixmate -m aligned/${sample}.sorted.bam - | \
samtools sort -@ 8 - | \
samtools markdup -@ 8 - aligned/${sample}.markdup.bam
# Index
samtools index aligned/${sample}.markdup.bam
# Cleanup intermediate
rm aligned/${sample}.bam aligned/${sample}.sorted.bam
done
```
**QC Checkpoint 3:** Check duplication rate
```bash
samtools flagstat aligned/${sample}.markdup.bam | grep "duplicates"
```
- WGS: <30% duplicates
- Exome/targeted: <50% duplicates
### Step 4: Variant Calling with bcftools
```bash
# Single sample calling
bcftools mpileup -Ou -f reference.fa aligned/sample1.markdup.bam | \
bcftools call -mv -Oz -o variants/sample1.vcf.gz
# Multi-sample calling (joint calling)
bcftools mpileup -Ou -f reference.fa \
aligned/sample1.markdup.bam \
aligned/sample2.markdup.bam \
aligned/sample3.markdup.bam | \
bcftools call -mv -Oz -o variants/cohort.vcf.gz
bcftools index variants/cohort.vcf.gz
```
### Step 5: Variant Filtering
```bash
# Basic quality filter
bcftools filter -Oz \
-e 'QUAL<20 || DP<10 || MQ<30' \
-o variants/cohort.filtered.vcf.gz \
variants/cohort.vcf.gz
# More stringent filter
bcftools filter -Oz \
-e 'QUAL<30 || DP<10 || DP>200 || MQ<40 || MQB<0.1' \
-s "LowQual" \
-o variants/cohort.filtered.vcf.gz \
variants/cohort.vcf.gz
# Stats
bcftools stats variants/cohort.filtered.vcf.gz > variants/vcf_stats.txt
```
**QC Checkpoint 4:** Check variant stats
- Ti/Tv ratio ~2.1 for whole genome
- Ti/Tv ratio ~2.8-3.0 for exome
- >95% overlap with dbSNP for known sites
## Alternative Path: BWA + GATK HaplotypeCaller
### Step 4 Alternative: GATK Variant Calling (DRAGEN Mode -- Recommended)
```bash
gatk CreateSequenceDictionary -R reference.fa
samtools faidx reference.fa
# DRAGEN-GATK mode: no BQSR needed, improved STR handling and QUAL calibration
gatk HaplotypeCaller \
-R reference.fa \
-I aligned/sample1.markdup.bam \
-O variants/sample1.g.vcf.gz \
-ERC GVCF \
--dragen-mode
```
### Step 4 Alternative: GATK Standard Mode (with BQSR)
```bash
gatk BaseRecalibrator \
-R reference.fa \
-I aligned/sample1.markdup.bam \
--known-sites dbsnp.vcf.gz \
-O recal_data.table
gatk ApplyBQSR \
-R reference.fa \
-I aligned/sample1.markdup.bam \
--bqsr-recal-file recal_data.table \
-O aligned/sample1.recal.bam
gatk HaplotypeCaller \
-R reference.fa \
-I aligned/sample1.recal.bam \
-O variants/sample1.g.vcf.gz \
-ERC GVCF
# Joint genotyping (for multiple samples)
gatk GenomicsDBImport \
-V variants/sample1.g.vcf.gz \
-V variants/sample2.g.vcf.gz \
-V variants/sample3.g.vcf.gz \
--genomicsdb-workspace-path genomicsdb \
-L intervals.bed
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://genomicsdb \
-O variants/cohort.vcf.gz
```
### Step 5 Alternative: GATK Variant Filtering
```bash
# Hard filtering (for small cohorts)
gatk VariantFiltration \
-R reference.fa \
-V variants/cohort.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "LowQD" \
--filter-expression "FS > 60.0" --filter-name "HighFS" \
--filter-expression "MQ < 40.0" --filter-name "LowMQ" \
--filter-expression "MQRankSum < -12.5" --filter-name "LowMQRS" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "LowRPRS" \
-O variants/cohort.filtered.vcf.gz
# VQSR (for large cohorts >30 samples)
gatk VariantRecalibrator \
-R reference.fa \
-V variants/cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O cohort.snp.recal \
--tranches-file cohort.snp.tranches
gatk ApplyVQSR \
-R reference.fa \
-V variants/cohort.vcf.gz \
-O variants/cohort.vqsr.vcf.gz \
--recal-file cohort.snp.recal \
--tranches-file cohort.snp.tranches \
-mode SNP \
--truth-sensitivity-filter-level 99.5
```
## Parameter Recommendations
| Step | Parameter | WGS | Exome/Targeted |
|------|-----------|-----|----------------|
| bwa-mem2 | -t | 8-16 | 8 |
| samtools markdup | - | Required | Required |
| bcftools mpileup | -d | 250 (default) | 1000 |
| bcftools mpileup | -q | 20 | 20 |
| bcftools filter | QUAL | >20 | >30 |
| bcftools filter | DP | >10, <2x mean | >20 |
| GATK | intervals | - | Target BED |
## Choosing a Variant Caller
| Criterion | bcftools | GATK HaplotypeCaller | DeepVariant |
|-----------|----------|---------------------|-------------|
| Speed | Fastest | Moderate | Slowest (without GPU) |
| SNP accuracy | Good | Good | Highest |
| Indel accuracy | Moderate (weak in homopolymers) | Good (local assembly) | Highest |
| Joint calling | Basic multi-sample | GVCF workflow (scales to 100k+) | GLnexus (scales to 250k+) |
| Best for | Quick analysis, non-model organisms | Clinical pipelines, GATK ecosystem | Highest accuracy, GPU available |
| VQSR | Not supported | For large cohorts (>30 samples) | Not needed (use GLnexus) |
| DRAGEN mode | N/A | Recommended for single-sample (skips BQSR) | N/A |
For human germline: DeepVariant provides the highest accuracy. GATK DRAGEN-mode is the recommended standard-mode alternative (no BQSR needed). bcftools is best for exploratory work, non-model organisms, or when computational resources are limited.
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Low mapping rate | Wrong reference, contamination | Verify reference genome version |
| High duplication | PCR over-amplification, low input | Check library prep, may need more input DNA |
| Low Ti/Tv | False positives | Increase quality filters |
| Missing variants | Too stringent filters, low depth | Relax filters, check coverage |
| Many indels at homopolymers | Sequencing errors | Filter homopolymer regions |
## Complete Pipeline Script
```bash
#!/bin/bash
set -e
# Configuration
THREADS=8
REF="reference.fa"
SAMPLES="sample1 sample2 sample3"
OUTDIR="results"
mkdir -p ${OUTDIR}/{trimmed,aligned,variants,qc}
echo "=== Step 1: QC with fastp ==="
for sample in $SAMPLES; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html ${OUTDIR}/qc/${sample}_fastp.html \
-w ${THREADS}
done
echo "=== Step 2: Alignment with bwa-mem2 ==="
for sample in $SAMPLES; do
bwa-mem2 mem -t ${THREADS} \
-R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
${REF} \
${OUTDIR}/trimmed/${sample}_R1.fq.gz \
${OUTDIR}/trimmed/${sample}_R2.fq.gz | \
samtools view -@ ${THREADS} -bS - > ${OUTDIR}/aligned/${sample}.bam
done
echo "=== Step 3: BAM Processing ==="
for sample in $SAMPLES; do
samtools fixmate -@ ${THREADS} -m ${OUTDIR}/aligned/${sample}.bam - | \
samtools sort -@ ${THREADS} - | \
samtools markdup -@ ${THREADS} - ${OUTDIR}/aligned/${sample}.markdup.bam
samtools index ${OUTDIR}/aligned/${sample}.markdup.bam
rm ${OUTDIR}/aligned/${sample}.bam
done
echo "=== Step 4: Joint Variant Calling ==="
bcftools mpileup -Ou -f ${REF} ${OUTDIR}/aligned/*.markdup.bam | \
bcftools call -mv -Oz -o ${OUTDIR}/variants/cohort.vcf.gz
bcftools index ${OUTDIR}/variants/cohort.vcf.gz
echo "=== Step 5: Filtering ==="
bcftools filter -Oz \
-e 'QUAL<20 || DP<10 || MQ<30' \
-o ${OUTDIR}/variants/cohort.filtered.vcf.gz \
${OUTDIR}/variants/cohort.vcf.gz
bcftools index ${OUTDIR}/variants/cohort.filtered.vcf.gz
echo "=== Stats ==="
bcftools stats ${OUTDIR}/variants/cohort.filtered.vcf.gz > ${OUTDIR}/variants/stats.txt
echo "=== Pipeline Complete ==="
echo "Filtered VCF: ${OUTDIR}/variants/cohort.filtered.vcf.gz"
```
## Related Skills
- read-qc/fastp-workflow - Detailed QC options
- read-alignment/bwa-alignment - BWA-MEM2 parameters
- alignment-files/duplicate-handling - Duplicate marking details
- variant-calling/variant-calling - bcftools calling options
- variant-calling/gatk-variant-calling - GATK HaplotypeCaller details
- variant-calling/filtering-best-practices - Advanced filtering strategies
- variant-calling/variant-annotation - Annotate variants with VEP
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.