bio-workflows-atacseq-pipeline
$
npx mdskill add GPTomics/bioSkills/bio-workflows-atacseq-pipelineExecutes end-to-end ATAC-seq analysis from FASTQ to differential accessibility.
- Processes raw FASTQ data through alignment, peak calling, and footprinting.
- Depends on MACS3, fastp, Bowtie2, and deepTools for core operations.
- Validates QC metrics before proceeding to the next pipeline stage.
- Delivers peak sets, accessibility differences, and TF binding predictions.
SKILL.md
.github/skills/bio-workflows-atacseq-pipelineView on GitHub ↗
---
name: bio-workflows-atacseq-pipeline
description: End-to-end ATAC-seq workflow from FASTQ files to differential accessibility and TF footprinting. Covers alignment, peak calling with MACS3, QC metrics, and optional TOBIAS footprinting. Use when running end-to-end ATAC-seq analysis from FASTQ to differential accessibility.
tool_type: mixed
primary_tool: MACS3
workflow: true
depends_on:
- read-qc/fastp-workflow
- read-alignment/bowtie2-alignment
- alignment-files/duplicate-handling
- atac-seq/atac-peak-calling
- atac-seq/atac-qc
- atac-seq/consensus-peakset
- atac-seq/differential-accessibility
- atac-seq/footprinting
- atac-seq/motif-deviation
- atac-seq/nucleosome-positioning
qc_checkpoints:
- after_qc: "Q30 >85%, adapter content <5%"
- after_alignment: "Mapping rate >80%, mitochondrial <20%"
- after_peaks: "FRiP >20%, TSS enrichment >5"
- after_footprinting: "Motif enrichment validates TF activity"
---
## Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, MACS3 3.0+, bedtools 2.31+, deepTools 3.5+, fastp 0.23+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# ATAC-seq Pipeline
**"Run end-to-end ATAC-seq analysis from FASTQ to differential accessibility"** → Orchestrate QC, Bowtie2 alignment, MACS3 peak calling, FRiP/TSS enrichment QC, differential accessibility, and optional TOBIAS footprinting.
Complete workflow from raw ATAC-seq FASTQ files to accessibility peaks, differential analysis, and TF footprinting.
## Workflow Overview
```
FASTQ files
|
v
[1. QC & Trimming] -----> fastp (Nextera adapters)
|
v
[2. Alignment] ---------> Bowtie2
|
v
[3. BAM Processing] ----> filter, shift, dedup
|
v
[4. Peak Calling] ------> MACS3
|
v
[5. QC] ----------------> TSS enrichment, FRiP, fragment size
|
v
[6. Differential] ------> DiffBind (optional)
|
v
[7. Footprinting] ------> TOBIAS (optional)
|
v
Accessibility peaks + TF activity
```
## Primary Path: Bowtie2 + MACS3
### Step 1: Quality Control with fastp
```bash
# ATAC-seq uses Nextera adapters
NEXTERA_R1="CTGTCTCTTATACACATCT"
NEXTERA_R2="CTGTCTCTTATACACATCT"
for sample in sample1 sample2 sample3; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
--adapter_sequence ${NEXTERA_R1} \
--adapter_sequence_r2 ${NEXTERA_R2} \
--qualified_quality_phred 20 \
--length_required 25 \
--html qc/${sample}_fastp.html
done
```
### Step 2: Alignment with Bowtie2
```bash
# Build index (once)
bowtie2-build genome.fa bt2_index/genome
# Align with ATAC-seq specific settings
for sample in sample1 sample2 sample3; do
bowtie2 -p 8 -x bt2_index/genome \
-1 trimmed/${sample}_R1.fq.gz \
-2 trimmed/${sample}_R2.fq.gz \
--very-sensitive \
--no-mixed --no-discordant \
-X 2000 \
2> aligned/${sample}.log | \
samtools view -@ 4 -bS -q 30 -f 2 - | \
samtools sort -@ 4 -o aligned/${sample}.bam
done
```
### Step 3: BAM Processing
ATAC-seq requires special processing: removing mitochondrial reads, shifting reads for Tn5 insertion, and removing duplicates.
```bash
for sample in sample1 sample2 sample3; do
# Remove mitochondrial reads
samtools view -h aligned/${sample}.bam | \
grep -v chrM | \
samtools view -b - > aligned/${sample}.noMT.bam
# Mark and remove duplicates
samtools fixmate -m aligned/${sample}.noMT.bam - | \
samtools sort - | \
samtools markdup -r - aligned/${sample}.dedup.bam
samtools index aligned/${sample}.dedup.bam
# Shift reads for Tn5 (+ strand +4bp, - strand -5bp)
alignmentSieve -b aligned/${sample}.dedup.bam \
-o aligned/${sample}.shifted.bam \
--ATACshift \
-p 8
samtools index aligned/${sample}.shifted.bam
done
```
Alternative manual Tn5 shift with bedtools:
```bash
# Convert to BED and shift
bedtools bamtobed -i aligned/${sample}.dedup.bam | \
awk 'BEGIN{OFS="\t"} {if($6=="+"){$2=$2+4} else if($6=="-"){$3=$3-5} print}' | \
sort -k1,1 -k2,2n > aligned/${sample}.shifted.bed
```
### Step 4: Peak Calling with MACS3
`-f BAMPE` mode silently IGNORES `--shift/--extsize`; use `-f BAM` for the ENCODE-style shift-extend pattern, or omit `--shift/--extsize` when using BAMPE. See atac-seq/atac-peak-calling for the full ENCODE 4 IDR + pseudoreplicate pipeline.
```bash
# ENCODE 4 pattern: shift-extend on single-end-ified reads (-f BAM)
macs3 callpeak \
-t aligned/sample1.shifted.bam \
-f BAM \
-g hs \
-n sample1 \
--outdir peaks \
--nomodel --shift -75 --extsize 150 \
--keep-dup all \
-p 0.01
# For calling on all samples together (pooled)
macs3 callpeak \
-t aligned/*.shifted.bam \
-f BAM \
-g hs \
-n consensus \
--outdir peaks \
--nomodel --shift -75 --extsize 150 \
--keep-dup all \
-p 0.01
```
For consensus peakset construction (Corces 2018 iterative overlap, 501 bp fixed-width), see atac-seq/consensus-peakset.
### Step 5: ATAC-seq QC
```bash
# TSS enrichment (using deepTools)
computeMatrix reference-point \
-S bigwig/sample1.bw \
-R genes.bed \
--referencePoint TSS \
-a 2000 -b 2000 \
-o tss_matrix.gz
plotProfile -m tss_matrix.gz -o qc/tss_enrichment.pdf
# Fragment size distribution
samtools view aligned/sample1.dedup.bam | \
awk '{print sqrt($9^2)}' | \
sort | uniq -c | \
awk '{print $2"\t"$1}' > qc/fragment_sizes.txt
# FRiP calculation
total=$(samtools view -c aligned/sample1.shifted.bam)
in_peaks=$(bedtools intersect -a aligned/sample1.shifted.bam \
-b peaks/sample1_peaks.narrowPeak -u | samtools view -c)
echo "FRiP: $(echo "scale=4; $in_peaks/$total" | bc)"
```
**QC Checkpoint:** Assess ATAC quality
- TSS enrichment score >5 (ideally >10)
- FRiP >20%
- Nucleosome-free (<100bp) and mono/di-nucleosome peaks visible
### Step 6: Differential Accessibility with DiffBind
```r
library(DiffBind)
# Create sample sheet
samples <- data.frame(
SampleID = c('control_1', 'control_2', 'treated_1', 'treated_2'),
Condition = c('control', 'control', 'treated', 'treated'),
Replicate = c(1, 2, 1, 2),
bamReads = c('aligned/control_1.shifted.bam', 'aligned/control_2.shifted.bam',
'aligned/treated_1.shifted.bam', 'aligned/treated_2.shifted.bam'),
Peaks = c('peaks/control_1_peaks.narrowPeak', 'peaks/control_2_peaks.narrowPeak',
'peaks/treated_1_peaks.narrowPeak', 'peaks/treated_2_peaks.narrowPeak')
)
# Create DBA object
dba <- dba(sampleSheet = samples)
# Count reads in peaks
dba <- dba.count(dba)
# Normalize
dba <- dba.normalize(dba)
# Contrast
dba <- dba.contrast(dba, categories = DBA_CONDITION)
# Differential analysis
dba <- dba.analyze(dba)
# Report
report <- dba.report(dba)
write.csv(as.data.frame(report), 'differential_peaks.csv')
# Visualization
dba.plotMA(dba)
dba.plotVolcano(dba)
```
### Step 7: TF Footprinting with TOBIAS
```bash
# Correct Tn5 bias
TOBIAS ATACorrect \
-b aligned/sample1.shifted.bam \
-g genome.fa \
-p peaks/consensus_peaks.narrowPeak \
--outdir footprinting \
--cores 8
# Score footprints
TOBIAS ScoreBigwig \
--signal footprinting/sample1_corrected.bw \
--regions peaks/consensus_peaks.narrowPeak \
--output footprinting/sample1_footprints.bw \
--cores 8
# Bind detection
TOBIAS BINDetect \
--motifs motifs.jaspar \
--signals footprinting/sample1_footprints.bw \
--genome genome.fa \
--peaks peaks/consensus_peaks.narrowPeak \
--outdir footprinting/bindetect \
--cores 8
# Differential footprinting (two conditions)
TOBIAS BINDetect \
--motifs motifs.jaspar \
--signals footprinting/control_footprints.bw footprinting/treated_footprints.bw \
--genome genome.fa \
--peaks peaks/consensus_peaks.narrowPeak \
--outdir footprinting/differential \
--cores 8
```
## Parameter Recommendations
| Step | Parameter | Value |
|------|-----------|-------|
| fastp | adapter | Nextera (CTGTCTCTTATACACATCT) |
| Bowtie2 | -X | 2000 (max insert size) |
| samtools | -q | 30 (MAPQ filter) |
| MACS3 | --shift | -75 (for Tn5 shift) |
| MACS3 | --extsize | 150 |
| MACS3 | -q | 0.01-0.05 |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| High mitochondrial | Normal for ATAC | Filter chrM reads |
| Low TSS enrichment | Poor library, overdigestion | Check Tn5 concentration |
| Many small peaks | Tn5 insertion noise | Increase -q threshold |
| No nucleosome periodicity | Overdigestion | Adjust Tn5:DNA ratio |
## Complete Pipeline Script
```bash
#!/bin/bash
set -e
THREADS=8
INDEX="bt2_index/genome"
GENOME="genome.fa"
SAMPLES="sample1 sample2 sample3"
OUTDIR="atac_results"
mkdir -p ${OUTDIR}/{trimmed,aligned,peaks,qc,bigwig}
# Step 1: QC
for sample in $SAMPLES; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--adapter_sequence CTGTCTCTTATACACATCT \
--html ${OUTDIR}/qc/${sample}_fastp.html -w ${THREADS}
done
# Step 2-3: Align and process
for sample in $SAMPLES; do
bowtie2 -p ${THREADS} -x ${INDEX} \
-1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--very-sensitive --no-mixed --no-discordant -X 2000 \
2> ${OUTDIR}/qc/${sample}_bowtie2.log | \
samtools view -@ ${THREADS} -bS -q 30 -f 2 - | \
grep -v chrM | \
samtools fixmate -m - - | \
samtools sort -@ ${THREADS} - | \
samtools markdup -r - - | \
alignmentSieve --ATACshift -b /dev/stdin -o ${OUTDIR}/aligned/${sample}.bam
samtools index ${OUTDIR}/aligned/${sample}.bam
done
# Step 4: Peak calling
macs3 callpeak -t ${OUTDIR}/aligned/*.bam -f BAMPE -g hs \
-n consensus --outdir ${OUTDIR}/peaks \
--nomodel --shift -75 --extsize 150 -q 0.01
echo "Pipeline complete. Peaks: ${OUTDIR}/peaks/consensus_peaks.narrowPeak"
```
## Related Skills
- atac-seq/atac-peak-calling - MACS3 / Genrich / HMMRATAC details, ENCODE 4 IDR
- atac-seq/atac-qc - TSS enrichment, FRiP, NRF/PBC1/PBC2 details
- atac-seq/consensus-peakset - Corces 2018 iterative-overlap fixed-width consensus
- atac-seq/differential-accessibility - DiffBind / csaw / DESeq2; spike-in normalization
- atac-seq/footprinting - TOBIAS three-step; per-TF failure modes
- atac-seq/motif-deviation - chromVAR for motif accessibility variability
- atac-seq/nucleosome-positioning - V-plot, NucleoATAC, +1 nucleosome
- atac-seq/single-cell-atac - For scATAC instead of bulk
- atac-seq/co-accessibility - Cicero cis-regulatory inference
- atac-seq/enhancer-gene-linking - ABC, ENCODE-rE2G enhancer-gene mapping
- atac-seq/deep-learning-atac - chromBPNet variant effect prediction
- atac-seq/allele-specific-accessibility - WASP + caQTL mapping
- chip-seq/peak-annotation - Annotate ATAC peaks to genes
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.