bio-de-deseq2-basics
$
npx mdskill add GPTomics/bioSkills/bio-de-deseq2-basicsExecute RNA-seq differential expression analysis with DESeq2.
- Analyzes RNA-seq count data to identify statistically significant gene expression changes.
- Depends on DESeq2, apeglm, BiocManager, and R/Bioconductor environments.
- Adapts code patterns to match installed package versions and API signatures.
- Outputs log fold change shrinkage results and DESeqDataSet objects for downstream interpretation.
SKILL.md
.github/skills/bio-de-deseq2-basicsView on GitHub ↗
---
name: bio-de-deseq2-basics
description: Perform differential expression analysis using DESeq2 in R/Bioconductor. Use for analyzing RNA-seq count data, creating DESeqDataSet objects, running the DESeq workflow, and extracting results with log fold change shrinkage. Use when performing DE analysis with DESeq2.
tool_type: r
primary_tool: DESeq2
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, Salmon 1.10+, edgeR 4.0+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# DESeq2 Basics
Differential expression analysis using DESeq2 for RNA-seq count data.
## Required Libraries
```r
library(DESeq2)
library(apeglm) # For lfcShrink with type='apeglm'
```
## Installation
```r
if (!require('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('DESeq2')
BiocManager::install('apeglm')
```
## Creating DESeqDataSet
**Goal:** Construct a DESeqDataSet object from various input formats for DE analysis.
**Approach:** Wrap count data and sample metadata into the DESeq2 container, specifying the experimental design formula.
**"Load my RNA-seq counts into DESeq2"** → Create a DESeqDataSet from a count matrix, SummarizedExperiment, or tximport object with sample metadata and a design formula.
### From Count Matrix
```r
# counts: matrix with genes as rows, samples as columns
# coldata: data frame with sample metadata (rownames must match colnames of counts)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
```
### From SummarizedExperiment
```r
library(SummarizedExperiment)
dds <- DESeqDataSet(se, design = ~ condition)
```
### From tximport (Salmon/Kallisto)
```r
library(tximport)
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
```
## Standard DESeq2 Workflow
**Goal:** Run the complete DESeq2 pipeline from raw counts to shrunken log fold change estimates.
**Approach:** Create dataset, pre-filter low-count genes, set reference level, run size factor estimation + dispersion estimation + Wald test, then apply LFC shrinkage.
**"Find differentially expressed genes between treated and control"** → Test for significant expression changes between conditions using negative binomial models with empirical Bayes shrinkage.
```r
# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
# Pre-filter low count genes (recommended)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# Set reference level for condition
dds$condition <- relevel(dds$condition, ref = 'control')
# Run DESeq2 pipeline (estimateSizeFactors, estimateDispersions, nbinomWaldTest)
dds <- DESeq(dds)
# Get results
res <- results(dds)
# Apply log fold change shrinkage (recommended for visualization/ranking)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
```
## Design Formulas
**Goal:** Specify the experimental design to model biological and nuisance variables.
**Approach:** Build R formula objects that encode condition, batch, and interaction terms for the GLM.
```r
# Simple two-group comparison
design = ~ condition
# Controlling for batch effects
design = ~ batch + condition
# Interaction model
design = ~ genotype + treatment + genotype:treatment
# Multi-factor without interaction
design = ~ genotype + treatment
```
## Specifying Contrasts
**Goal:** Extract results for specific pairwise or complex comparisons from a fitted DESeq2 model.
**Approach:** Use coefficient names or contrast vectors to define which groups to compare.
```r
# See available coefficients
resultsNames(dds)
# Results by coefficient name
res <- results(dds, name = 'condition_treated_vs_control')
# Results by contrast (compare specific levels)
res <- results(dds, contrast = c('condition', 'treated', 'control'))
# Contrast with list format (for complex designs)
res <- results(dds, contrast = list('conditionB', 'conditionA'))
```
## Log Fold Change Shrinkage
**Goal:** Reduce noisy fold change estimates for low-count genes to improve ranking and visualization.
**Approach:** Apply empirical Bayes shrinkage (apeglm, ashr, or normal) to moderate log fold changes toward zero.
```r
# apeglm method (default, recommended)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
# ashr method (alternative)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'ashr')
# normal method (original, less recommended)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'normal')
```
## Setting Significance Thresholds
**Goal:** Control the stringency of differential expression calls using adjusted p-value and fold change cutoffs.
**Approach:** Set alpha for multiple testing correction and optionally apply a minimum log fold change threshold.
```r
# Default: padj < 0.1
res <- results(dds)
# Custom alpha threshold
res <- results(dds, alpha = 0.05)
# With log fold change threshold
res <- results(dds, lfcThreshold = 1) # |log2FC| > 1
```
## Accessing DESeq2 Results
**Goal:** Retrieve, filter, and sort DE results for downstream use.
**Approach:** Extract results as a data frame, subset by significance, and order by p-value or fold change.
```r
# Summary of results
summary(res)
# Get significant genes
sig <- subset(res, padj < 0.05)
# Order by adjusted p-value
resOrdered <- res[order(res$padj),]
# Order by log fold change
resOrdered <- res[order(abs(res$log2FoldChange), decreasing = TRUE),]
# Convert to data frame
res_df <- as.data.frame(res)
```
## Result Columns
| Column | Description |
|--------|-------------|
| `baseMean` | Mean of normalized counts across all samples |
| `log2FoldChange` | Log2 fold change (treatment vs control) |
| `lfcSE` | Standard error of log2 fold change |
| `stat` | Wald statistic |
| `pvalue` | Raw p-value |
| `padj` | Adjusted p-value (Benjamini-Hochberg) |
## Normalization and Counts
**Goal:** Obtain normalized expression values suitable for visualization and cross-sample comparison.
**Approach:** Extract size-factor-normalized counts or apply variance-stabilizing / rlog transformations.
```r
# Get normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
# Get size factors
sizeFactors(dds)
# Variance stabilizing transformation (for visualization)
vsd <- vst(dds, blind = FALSE)
# Regularized log transformation (alternative, slower)
rld <- rlog(dds, blind = FALSE)
```
## Multi-Factor Designs
**Goal:** Account for batch or other nuisance variables while testing the effect of interest.
**Approach:** Include batch as a covariate in the design formula so DESeq2 adjusts for it during testing.
```r
# Design with batch correction
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)
# Extract condition effect (controlling for batch)
res <- results(dds, name = 'condition_treated_vs_control')
```
## Interaction Models
**Goal:** Identify genes whose response to treatment differs between genotypes (or other factor combinations).
**Approach:** Fit a model with interaction terms and test the interaction coefficient for significance.
```r
# Interaction between genotype and treatment
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ genotype + treatment + genotype:treatment)
dds <- DESeq(dds)
# Test interaction term
res_interaction <- results(dds, name = 'genotypeKO.treatmentdrug')
# Or use contrast for difference of differences
res_interaction <- results(dds, contrast = list(
c('genotypeKO.treatmentdrug'),
c()
))
```
## Likelihood Ratio Test
**Goal:** Test whether a factor (e.g., condition) explains significant variance compared to a reduced model.
**Approach:** Compare full and reduced GLMs using a likelihood ratio test instead of Wald tests.
```r
# Compare full vs reduced model
dds <- DESeq(dds, test = 'LRT', reduced = ~ batch)
# Results from LRT
res <- results(dds)
```
## Pre-Filtering Strategies
**Goal:** Remove uninformative genes to reduce multiple testing burden and improve statistical power.
**Approach:** Apply count-based filters requiring minimum expression across a threshold number of samples.
```r
# Remove genes with low counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# Keep genes with at least n counts in at least k samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]
# Filter by expression level
keep <- rowMeans(counts(dds, normalized = TRUE)) >= 10
dds <- dds[keep,]
```
## Working with Existing Objects
```r
# Update design formula
design(dds) <- ~ batch + condition
dds <- DESeq(dds)
# Subset samples
dds_subset <- dds[, dds$group == 'A']
# Subset genes
dds_genes <- dds[rownames(dds) %in% gene_list,]
```
## Exporting Results
**Goal:** Save DE results and normalized counts to files for sharing or downstream tools.
**Approach:** Convert results to data frames and write as CSV files.
```r
# Write to CSV
write.csv(as.data.frame(resOrdered), file = 'deseq2_results.csv')
# Write normalized counts
write.csv(as.data.frame(normalized_counts), file = 'normalized_counts.csv')
```
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| "design matrix not full rank" | Confounded variables or missing levels | Check coldata for confounding |
| "counts matrix should be integers" | Non-integer counts (e.g., from tximport) | Use DESeqDataSetFromTximport() |
| "all samples have 0 counts" | Gene filtering issue | Check count matrix format |
| "factor levels not in colData" | Typo in design formula | Verify column names in coldata |
## Deprecated Features
| Feature | Status | Alternative |
|---------|--------|-------------|
| No-replicate designs | Removed (v1.22) | Require biological replicates |
| `betaPrior = TRUE` | Deprecated | Use `lfcShrink()` instead |
| `rlog()` for large datasets | Not recommended | Use `vst()` for >100 samples |
## Quick Reference: Workflow Steps
```r
# 1. Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(counts, coldata, design = ~ condition)
# 2. Pre-filter
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# 3. Set reference level
dds$condition <- relevel(dds$condition, ref = 'control')
# 4. Run DESeq2
dds <- DESeq(dds)
# 5. Get results with shrinkage
res <- lfcShrink(dds, coef = resultsNames(dds)[2], type = 'apeglm')
# 6. Filter significant results
sig <- subset(res, padj < 0.05)
```
## Decision Guidance
### Shrinkage Method Selection
| Method | Use When | Limitation |
|--------|----------|------------|
| apeglm (default) | Coefficient-based comparisons (`coef=`) | Cannot use `contrast=` |
| ashr | Arbitrary contrasts needed; many coefficients | Slightly less aggressive shrinkage |
| normal | **Avoid** — over-shrinks large precise effects | Kept for backward compatibility only |
Shrinkage changes LFC estimates only, NOT p-values. Use shrunken LFCs for ranking (GSEA input, heatmap ordering) and visualization (volcano x-axis). Use un-shrunken p-values for significance calls.
### LRT vs Wald Test
| Scenario | Test |
|----------|------|
| Pairwise comparison (A vs B) | Wald (default) |
| Factor with >= 3 levels (any gene changing across conditions) | LRT with `reduced = ~ 1` |
| Time series (any temporal change) | LRT |
| Testing a specific coefficient direction | Wald |
LRT is omnibus (ANOVA-like). The LFC in LRT output is last-level-vs-reference, NOT the omnibus effect. Filter LRT results on padj only, not LFC.
### Why padj = NA
| Cause | baseMean | pvalue | padj |
|-------|----------|--------|------|
| Zero counts across all samples | 0 | NA | NA |
| Cook's distance outlier (automatic when any group >= 7 samples) | > 0 | NA | NA |
| Below independent filtering threshold | > 0 | numeric | NA |
Independent filtering optimizes a mean-expression cutoff at the `results()` step to maximize BH-adjusted rejections. This is separate from manual pre-filtering and uses the fitted model's information.
### Size Factor Alternatives
Default median-of-ratios assumes most genes are NOT differentially expressed.
| Scenario | Solution |
|----------|----------|
| High zero-inflation (single-cell) | `type = 'poscounts'` |
| Very small libraries | `type = 'iterate'` |
| Known stable reference genes available | `controlGenes` parameter |
| Prokaryotic stress (majority DE) | Spike-in normalization or `controlGenes` |
### Pre-filtering
```r
# Minimal (speed only; independent filtering handles statistical optimization)
keep <- rowSums(counts(dds)) >= 10
# Group-aware (recommended): require counts in at least the smallest group
keep <- rowSums(counts(dds) >= 10) >= min(table(dds$condition))
```
### Prokaryotic RNA-seq
Bacterial/archaeal experiments differ from eukaryotic in ways that affect DESeq2:
- Use non-spliced aligners (BWA-MEM, Bowtie2) — no introns in prokaryotes
- Polycistronic operons cause read-through between adjacent genes
- rRNA depletion is essential (80-95% rRNA without poly-A selection)
- Under stress conditions, a majority of genes may be DE, violating normalization assumptions — use spike-in normalization or `controlGenes` with known stable housekeeping genes
- KEGG organism codes are strain-specific (e.g., `pae` for P. aeruginosa PAO1); find codes with `clusterProfiler::search_kegg_organism()`
- Annotation: use Prokka/Bakta GFF files rather than Ensembl/biomaRt
### Choosing DESeq2 vs edgeR
| Scenario | Recommended | Rationale |
|----------|-------------|-----------|
| Standard bulk RNA-seq (n >= 3/group) | Either — expect ~90% overlap | Both perform well; concordance is high |
| Very small sample size (n = 2-3/group) | edgeR QL F-test | QL framework provides tighter FPR control with few replicates |
| Salmon/Kallisto quantification | DESeq2 via tximport | DESeqDataSetFromTximport handles offset matrix natively |
| Need LFC shrinkage for ranking/GSEA | DESeq2 | apeglm/ashr shrinkage built in; edgeR has no equivalent |
| Formal fold-change threshold testing | edgeR `glmTreat` | More flexible than DESeq2 `lfcThreshold` |
| Large datasets (>100 samples) | edgeR | Faster with C++ backend in v4 |
| Python-only environment | DESeq2 (via PyDESeq2) | No edgeR Python equivalent |
If overlap between DESeq2 and edgeR is < 60%, investigate filtering, normalization, or dispersion — discordance usually indicates a modeling issue.
### Python Alternative: PyDESeq2
```python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
dds = DeseqDataSet(counts=count_df, metadata=metadata, design='~condition')
dds.deseq2()
stat_res = DeseqStats(dds, contrast=('condition', 'treated', 'control'))
stat_res.summary()
results_df = stat_res.results_df
```
PyDESeq2 (scverse, v0.5+) supports Wald test, multi-factor designs, apeglm shrinkage. No LRT yet. Results closely match R DESeq2.
## Related Skills
- edger-basics - Alternative DE analysis with edgeR
- de-visualization - MA plots, volcano plots, heatmaps
- de-results - Extract and export significant genes
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.