bio-workflows-rnaseq-to-de
$
npx mdskill add GPTomics/bioSkills/bio-workflows-rnaseq-to-deRuns end-to-end RNA-seq analysis from FASTQ to differential expression results
- Solves the task of analyzing RNA-seq data from raw reads to gene expression insights
- Uses fastp, STAR/featureCounts, Salmon, and DESeq2 for processing and analysis
- Applies quality checks at QC, quantification, and differential expression stages
- Delivers visualizations and DESeq2 results for downstream interpretation
SKILL.md
.github/skills/bio-workflows-rnaseq-to-deView on GitHub ↗
---
name: bio-workflows-rnaseq-to-de
description: End-to-end RNA-seq workflow from FASTQ files to differential expression results. Covers QC, quantification (Salmon or STAR+featureCounts), and DESeq2 analysis with visualization. Use when running RNA-seq from FASTQ to DE results.
tool_type: mixed
primary_tool: DESeq2
workflow: true
depends_on:
- read-qc/fastp-workflow
- rna-quantification/alignment-free-quant
- rna-quantification/tximport-workflow
- differential-expression/deseq2-basics
- differential-expression/de-visualization
qc_checkpoints:
- after_qc: "Q30 >80%, adapter content <5%"
- after_quant: "Mapping rate >70%, >10M reads mapped"
- after_de: "Dispersion fit reasonable, no sample outliers"
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, STAR 2.7.11+, Salmon 1.10+, Subread 2.0+, fastp 0.23+, ggplot2 3.5+, kallisto 0.50+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# RNA-seq to Differential Expression Workflow
**"Run RNA-seq analysis from FASTQ to differentially expressed genes"** → Orchestrate fastp QC, STAR/HISAT2 alignment, featureCounts/Salmon quantification, DESeq2 differential expression, shrinkage estimation, and results visualization (volcano, MA plots).
Complete pipeline from raw FASTQ files to differential expression results.
## Workflow Overview
```
FASTQ files
|
v
[1. QC & Trimming] -----> fastp
|
v
[2. Quantification] ----> Salmon (recommended) or STAR + featureCounts
|
v
[3. Import to R] -------> tximport (for Salmon) or direct counts
|
v
[4. DE Analysis] -------> DESeq2
|
v
[5. Visualization] -----> Volcano, MA, heatmaps
|
v
Significant gene list
```
## Primary Path: Salmon + DESeq2
### Step 1: Quality Control with fastp
```bash
# Single sample
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 35 \
--html sample_fastp.html
# Batch processing
for sample in sample1 sample2 sample3; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html qc/${sample}_fastp.html
done
```
**QC Checkpoint 1:** Check fastp reports
- Q30 bases >80%
- Adapter content <5%
- Duplication rate reasonable for library type
### Step 2: Salmon Quantification
```bash
# Build index (once per transcriptome)
salmon index -t transcriptome.fa -i salmon_index -k 31
# Quantify each sample
for sample in sample1 sample2 sample3; do
salmon quant -i salmon_index \
-l A \
-1 trimmed/${sample}_R1.fq.gz \
-2 trimmed/${sample}_R2.fq.gz \
-o quants/${sample} \
--validateMappings \
--gcBias \
--seqBias \
-p 8
done
```
**QC Checkpoint 2:** Check Salmon logs
- Mapping rate >70%
- >10 million reads mapped
### Step 3: Import with tximport
```r
library(tximport)
library(DESeq2)
# Create tx2gene mapping (Ensembl example)
tx2gene <- read.csv('tx2gene.csv') # columns: TXNAME, GENEID
# List quantification files
samples <- c('sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6')
files <- file.path('quants', samples, 'quant.sf')
names(files) <- samples
# Import transcript-level estimates
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# Create sample metadata
coldata <- data.frame(
condition = factor(c('control', 'control', 'control', 'treated', 'treated', 'treated')),
row.names = samples
)
```
### Step 4: DESeq2 Analysis
```r
# Create DESeqDataSet from tximport
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
# Pre-filter low count genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# Set reference level
dds$condition <- relevel(dds$condition, ref = 'control')
# Run DESeq2
dds <- DESeq(dds)
# Get results with shrinkage
res <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
# Summary
summary(res)
```
**QC Checkpoint 3:** Check DESeq2 diagnostics
- Dispersion plot shows expected trend
- PCA separates conditions
- No severe outliers in sample distances
### Step 5: Visualization and Export
```r
library(ggplot2)
library(pheatmap)
library(ggrepel)
# Volcano plot
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$significant <- res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1
ggplot(res_df, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c('grey', 'red')) +
theme_minimal() +
labs(title = 'Volcano Plot', x = 'Log2 Fold Change', y = '-Log10 P-value')
# Heatmap of top genes
vsd <- vst(dds, blind = FALSE)
top_genes <- head(order(res$padj), 50)
pheatmap(assay(vsd)[top_genes,], scale = 'row', show_rownames = FALSE)
# Export significant genes
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(sig_genes), 'significant_genes.csv')
```
## Alternative Path: STAR + featureCounts + DESeq2
### Step 2 Alternative: STAR Alignment
```bash
# Build STAR index (once)
STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles genome.fa \
--sjdbGTFfile genes.gtf \
--sjdbOverhang 100 \
--runThreadN 8
# Align each sample
for sample in sample1 sample2 sample3; do
STAR --genomeDir star_index \
--readFilesIn trimmed/${sample}_R1.fq.gz trimmed/${sample}_R2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix aligned/${sample}_ \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--runThreadN 8
done
```
### Step 3 Alternative: featureCounts
```bash
# Count reads per gene
featureCounts -T 8 -p --countReadPairs \
-a genes.gtf \
-o counts.txt \
aligned/*_Aligned.sortedByCoord.out.bam
```
### Step 4 Alternative: Load Counts Directly
```r
# Load featureCounts output
counts <- read.table('counts.txt', header = TRUE, row.names = 1, skip = 1)
counts <- counts[, 6:ncol(counts)] # Remove annotation columns
colnames(counts) <- gsub('_Aligned.sortedByCoord.out.bam', '', colnames(counts))
# Create DESeqDataSet directly
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
```
## Parameter Recommendations
| Step | Parameter | Recommendation |
|------|-----------|----------------|
| fastp | --qualified_quality_phred | 20 (standard) |
| fastp | --length_required | 35 for 2x100, 50 for 2x150 |
| Salmon | -l | A (auto-detect library type) |
| Salmon | --gcBias | Enable for better accuracy |
| STAR | --sjdbOverhang | read_length - 1 |
| featureCounts | -s | 0=unstranded, 1=stranded, 2=reversely stranded |
| DESeq2 | lfcShrink type | apeglm (recommended) |
| DESeq2 | alpha | 0.05 (standard significance) |
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Low mapping rate (<50%) | Wrong reference, contamination | Check species, run FastQ Screen |
| High duplication | Low complexity library, over-sequencing | Check library prep, may be normal for low-input |
| No DE genes | Low power, batch effects | Add replicates, include batch in design |
| All genes DE | Normalization issue, sample swap | Check sample metadata, rerun normalization |
| Outlier samples | Technical failure, sample swap | Remove or investigate, check PCA |
## Complete Bash Pipeline Script
```bash
#!/bin/bash
set -e
THREADS=8
SAMPLES="sample1 sample2 sample3 sample4 sample5 sample6"
SALMON_INDEX="salmon_index"
OUTDIR="results"
mkdir -p ${OUTDIR}/{trimmed,quants,qc}
# Step 1: QC and trim
for sample in $SAMPLES; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html ${OUTDIR}/qc/${sample}_fastp.html \
-w ${THREADS}
done
# Step 2: Quantify
for sample in $SAMPLES; do
salmon quant -i ${SALMON_INDEX} -l A \
-1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
-o ${OUTDIR}/quants/${sample} \
--validateMappings --gcBias -p ${THREADS}
done
echo "Quantification complete. Run R script for DE analysis."
```
## Complete R Analysis Script
```r
library(tximport)
library(DESeq2)
library(apeglm)
library(ggplot2)
library(pheatmap)
# Configuration
samples <- c('sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6')
conditions <- c('control', 'control', 'control', 'treated', 'treated', 'treated')
quant_dir <- 'results/quants'
# Import
tx2gene <- read.csv('tx2gene.csv')
files <- file.path(quant_dir, samples, 'quant.sf')
names(files) <- samples
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# DESeq2
coldata <- data.frame(condition = factor(conditions), row.names = samples)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10,]
dds$condition <- relevel(dds$condition, ref = 'control')
dds <- DESeq(dds)
# Results
res <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
sig <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
cat('Significant genes:', nrow(sig), '\n')
write.csv(as.data.frame(sig), 'significant_genes.csv')
```
## Related Skills
- read-qc/fastp-workflow - Detailed QC options and parameters
- rna-quantification/alignment-free-quant - Salmon and kallisto details
- rna-quantification/tximport-workflow - tximport options and tx2gene creation
- differential-expression/deseq2-basics - Complete DESeq2 reference
- differential-expression/de-visualization - Advanced visualization options
- pathway-analysis/go-enrichment - Next step: functional enrichment
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.