bio-workflow-management-snakemake-workflows
$
npx mdskill add GPTomics/bioSkills/bio-workflow-management-snakemake-workflowsBuild reproducible bioinformatics pipelines with Snakemake
- Automates multi-step analyses with make-like dependency tracking
- Integrates Python workflows, BWA, FastQC, and SLURM cluster execution
- Decides execution by parsing rule syntax and resolving wildcards
- Delivers reproducible pipelines ready for HPC or cloud deployment
SKILL.md
.github/skills/bio-workflow-management-snakemake-workflowsView on GitHub ↗
---
name: bio-workflow-management-snakemake-workflows
description: Build reproducible bioinformatics pipelines with Snakemake using rules, wildcards, and automatic dependency resolution. Use when creating Python-based workflows, automating multi-step analyses with make-like dependency tracking, or running pipelines on HPC clusters with SLURM.
tool_type: python
primary_tool: Snakemake
---
## Version Compatibility
Reference examples tested with: BWA 0.7.17+, FastQC 0.12+, MultiQC 1.21+, Nextflow 23.10+, Salmon 1.10+, Snakemake 8.0+, bcftools 1.19+, fastp 0.23+, pandas 2.2+, samtools 1.19+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Snakemake Workflows
**"Build a reproducible bioinformatics pipeline with Snakemake"** → Define analysis steps as rules with input/output declarations, automatic dependency resolution via wildcards, and cluster execution support for HPC/cloud environments.
- Python: Snakefile rule syntax with `expand()`, `wildcards`, and `config` for parameterization
Compatible with Snakemake 7.x, 8.x, and 9.x. For Snakemake 8.0+, use `--executor` instead of `--cluster`.
## Basic Rule Structure
```python
# Snakefile
rule all:
input:
expand("results/{sample}_counts.txt", sample=SAMPLES)
rule align:
input:
r1 = "data/{sample}_R1.fq.gz",
r2 = "data/{sample}_R2.fq.gz",
index = "ref/genome.fa"
output:
bam = "aligned/{sample}.bam"
threads: 8
shell:
"bwa mem -t {threads} {input.index} {input.r1} {input.r2} | "
"samtools sort -@ {threads} -o {output.bam}"
```
## Config File
```yaml
# config.yaml
samples:
- sample1
- sample2
- sample3
reference: "ref/genome.fa"
threads: 8
```
```python
# Snakefile
configfile: "config.yaml"
SAMPLES = config["samples"]
REF = config["reference"]
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
```
## Wildcards and Expand
```python
# Define samples
SAMPLES = ["A", "B", "C"]
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]
# Expand for all combinations
rule all:
input:
expand("results/{sample}_{chrom}.vcf", sample=SAMPLES, chrom=CHROMOSOMES)
# Access wildcard in rule
rule process:
input:
"data/{sample}.bam"
output:
"results/{sample}.vcf"
params:
sample = lambda wildcards: wildcards.sample
shell:
"bcftools call -s {params.sample} {input} > {output}"
```
## Python Integration
```python
rule analyze:
input:
counts = "data/{sample}_counts.txt"
output:
results = "results/{sample}_de.csv"
run:
import pandas as pd
counts = pd.read_csv(input.counts, sep='\t')
# Process data
results = counts.groupby('gene').sum()
results.to_csv(output.results)
```
## Conda Environments
```python
rule fastqc:
input:
"data/{sample}.fq.gz"
output:
"qc/{sample}_fastqc.html"
conda:
"envs/qc.yaml"
shell:
"fastqc {input} -o qc/"
```
```yaml
# envs/qc.yaml
channels:
- bioconda
- conda-forge
dependencies:
- fastqc=0.12.1
- multiqc=1.14
```
## Container Support
```python
rule align:
input:
r1 = "data/{sample}_R1.fq.gz"
output:
bam = "aligned/{sample}.bam"
container:
"docker://biocontainers/bwa:v0.7.17"
shell:
"bwa mem {input} | samtools sort -o {output}"
```
```bash
# Run with Singularity
snakemake --use-singularity --singularity-args "-B /data"
```
## Resource Management
```python
rule memory_intensive:
input:
"data/{sample}.bam"
output:
"results/{sample}.vcf"
threads: 4
resources:
mem_mb = 16000,
time = "4:00:00",
gpu = 1
shell:
"variant_caller --threads {threads} {input} > {output}"
```
## Cluster Execution
```yaml
# cluster.yaml
__default__:
partition: normal
time: "1:00:00"
mem: 4G
align:
partition: normal
time: "8:00:00"
mem: 32G
cpus-per-task: 8
```
```bash
# Snakemake 8.x with executor plugin
pip install snakemake-executor-plugin-slurm
snakemake --executor slurm --jobs 100
# Snakemake 7.x cluster execution
snakemake --cluster "sbatch --partition={cluster.partition} \
--time={cluster.time} --mem={cluster.mem}" \
--cluster-config cluster.yaml --jobs 100
# Or use profile (both versions)
snakemake --profile slurm
```
## Checkpoints
```python
# For rules that determine downstream files dynamically
checkpoint split_samples:
input:
"data/all_samples.txt"
output:
directory("split/")
shell:
"split_script.py {input} {output}"
def get_split_files(wildcards):
checkpoint_output = checkpoints.split_samples.get(**wildcards).output[0]
return expand("split/{sample}.txt", sample=glob_wildcards("split/{sample}.txt").sample)
rule aggregate:
input:
get_split_files
output:
"results/aggregated.txt"
shell:
"cat {input} > {output}"
```
## Modular Workflows
```python
# Snakefile
include: "rules/qc.smk"
include: "rules/align.smk"
include: "rules/call.smk"
rule all:
input:
rules.qc_all.input,
rules.call_all.input
```
```python
# rules/qc.smk
rule fastqc:
input: "data/{sample}.fq.gz"
output: "qc/{sample}_fastqc.html"
shell: "fastqc {input} -o qc/"
rule qc_all:
input: expand("qc/{sample}_fastqc.html", sample=SAMPLES)
```
## Temporary and Protected Files
```python
rule align:
input:
"data/{sample}.fq.gz"
output:
bam = temp("aligned/{sample}.unsorted.bam"), # Auto-deleted
sorted = protected("aligned/{sample}.bam") # Write-protected
shell:
"bwa mem {input} > {output.bam} && "
"samtools sort {output.bam} -o {output.sorted}"
```
## Benchmarking
```python
rule heavy_computation:
input:
"data/{sample}.bam"
output:
"results/{sample}.txt"
benchmark:
"benchmarks/{sample}.tsv"
shell:
"heavy_tool {input} > {output}"
```
## Logging
```python
rule process:
input:
"data/{sample}.bam"
output:
"results/{sample}.txt"
log:
"logs/{sample}.log"
shell:
"(command {input} > {output}) 2> {log}"
```
## Run Commands
```bash
# Dry run (show what would be executed)
snakemake -n
# Run with 8 cores
snakemake --cores 8
# Force rerun of specific rule
snakemake --forcerun align
# Run until specific target
snakemake results/sample1.vcf
# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdf
# Generate report
snakemake --report report.html
```
## Complete RNA-seq Workflow
```python
configfile: "config.yaml"
SAMPLES = config["samples"]
rule all:
input:
"results/multiqc_report.html",
"results/deseq2_results.csv"
rule fastp:
input:
r1 = "data/{sample}_R1.fq.gz",
r2 = "data/{sample}_R2.fq.gz"
output:
r1 = "trimmed/{sample}_R1.fq.gz",
r2 = "trimmed/{sample}_R2.fq.gz",
json = "qc/{sample}_fastp.json"
threads: 4
shell:
"fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2} "
"--json {output.json} --thread {threads}"
rule salmon_quant:
input:
r1 = "trimmed/{sample}_R1.fq.gz",
r2 = "trimmed/{sample}_R2.fq.gz",
index = config["salmon_index"]
output:
directory("salmon/{sample}")
threads: 8
shell:
"salmon quant -i {input.index} -l A -1 {input.r1} -2 {input.r2} "
"-o {output} --threads {threads}"
rule multiqc:
input:
expand("qc/{sample}_fastp.json", sample=SAMPLES),
expand("salmon/{sample}", sample=SAMPLES)
output:
"results/multiqc_report.html"
shell:
"multiqc qc/ salmon/ -o results/"
rule deseq2:
input:
quants = expand("salmon/{sample}", sample=SAMPLES),
metadata = config["metadata"]
output:
"results/deseq2_results.csv"
script:
"scripts/deseq2.R"
```
## Related Skills
- workflow-management/nextflow-pipelines - Nextflow alternative
- workflows/rnaseq-to-de - RNA-seq workflow
- read-qc/fastp-workflow - QC in pipelines
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.