bio-chipseq-differential-binding
$
npx mdskill add GPTomics/bioSkills/bio-chipseq-differential-bindingIdentifies differentially bound ChIP-seq regions between conditions
- Detects genomic regions with significant binding differences across experimental groups.
- Uses DiffBind, DESeq2, or PyDESeq2 for analysis based on input data type.
- Selects method by checking installed package versions and matching API signatures.
- Outputs normalized counts, p-values, fold-changes, and statistical significance metrics.
SKILL.md
.github/skills/bio-chipseq-differential-bindingView on GitHub ↗
---
name: bio-chipseq-differential-binding
description: Identifies differentially bound ChIP-seq regions between conditions using DiffBind (from BAMs), DESeq2, or PyDESeq2 (from count matrices). Handles normalization, statistical testing, and fold-change estimation with ChIP-seq-specific considerations. Use when comparing ChIP-seq binding between experimental conditions.
tool_type: mixed
primary_tool: DiffBind
---
## Version Compatibility
Reference examples tested with: DiffBind 3.12+, DESeq2 1.42+, edgeR 4.0+, PyDESeq2 0.4+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- Python: `pip show pydeseq2` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Differential Binding Analysis
**"Compare ChIP-seq binding between conditions"** → Identify genomic regions with statistically significant differences in transcription factor or histone mark occupancy between experimental groups.
- R (from BAMs): `DiffBind::dba()` → `dba.count()` → `dba.normalize()` → `dba.analyze()`
- R (from count matrix): `DESeq2::DESeqDataSetFromMatrix()` → `DESeq()` → `results()`
- Python (from count matrix): `pydeseq2.DeseqDataSet()` → `.deseq2()` → `DeseqStats()`
## Choosing an Approach
| Scenario | Recommended |
|----------|-------------|
| Have BAM + peak files | DiffBind (handles consensus peaks, counting, normalization) |
| Have a count matrix (peaks x samples) | DESeq2 directly (R) or PyDESeq2 (Python) |
| Need background normalization | DiffBind with `background=TRUE` or csaw |
| Suspect global binding changes | Custom size factors or spike-in normalization |
| Pre-defined regions (promoters, enhancers) | DESeq2/PyDESeq2 on counted regions |
DiffBind wraps DESeq2/edgeR internally. When a count matrix already exists, DiffBind adds no statistical value — its main contributions (consensus peak generation, summit re-centering, read counting) are already done.
## ChIP-seq Normalization
Normalization is the single most consequential analytical decision for differential binding. The choice of reference reads matters far more than algorithm (RLE vs TMM produce essentially identical results on the same reference reads).
### Reference Reads
| Reference | How applied | Assumption | Risk |
|-----------|-------------|------------|------|
| Reads in peaks | DESeq2 default `estimateSizeFactors()` on count matrix | Most peaks NOT differentially bound | Fails with global changes; can reverse conclusions |
| Full library | Total aligned reads from BAM | Sequencing depth is main technical variable | Conservative; DiffBind 3.0+ default |
| Background bins | 10-15 kb genomic bins via csaw | Background is stable across conditions | Robust to composition bias; requires BAMs |
| Spike-in | Exogenous chromatin (e.g., Drosophila) | Spike-in ratio is constant | Gold standard for global changes; requires spike-in protocol |
### When Default DESeq2 Normalization Works
Running `estimateSizeFactors()` on a ChIP-seq count matrix computes RLE normalization on reads-in-peaks. This works when:
- Most peaks are expected to be stable (typical TF ChIP-seq with localized changes)
- Changes are roughly symmetric (similar numbers of gained and lost peaks)
- Differential peaks are a minority of the total peak set
### When It Fails
- **Global binding changes**: EZH2 inhibitor reducing H3K27me3, BET inhibitor reducing BRD4, knockdown of the ChIPped factor
- **Asymmetric changes**: predominantly gained or lost, not balanced
- **Small peak sets** where differential peaks are a large fraction of total
**Diagnostic**: if the MA plot loess curve deviates substantially from y=0, normalization is not centering correctly. Verify biological plausibility of the gain/loss ratio.
### Custom Size Factors
When default normalization is inappropriate:
```r
# Full library size normalization (when total aligned reads are known)
library_sizes <- c(15e6, 14e6, 16e6, 15e6, 13e6, 14e6)
names(library_sizes) <- colnames(counts(dds))
sizeFactors(dds) <- library_sizes / mean(library_sizes)
# Use known stable peaks as normalization reference
stable_idx <- which(rownames(dds) %in% known_stable_peaks)
dds <- estimateSizeFactors(dds, controlGenes = stable_idx)
```
## From Count Matrix (R - DESeq2)
**Goal:** Identify differentially bound peaks from a pre-computed count matrix using DESeq2's negative binomial framework.
**Approach:** Load counts into DESeqDataSet, apply minimal ChIP-seq-appropriate filtering, run the DESeq2 pipeline, extract results at the desired significance threshold.
```r
library(DESeq2)
counts <- read.delim('counts.tsv', row.names = 1, check.names = FALSE)
coldata <- data.frame(
condition = factor(c(rep('treated', 3), rep('control', 3))),
row.names = colnames(counts)
)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
# ChIP-seq peaks are already enriched regions; filter less aggressively than RNA-seq
keep <- rowSums(counts(dds)) >= 1
dds <- dds[keep,]
dds$condition <- relevel(dds$condition, ref = 'control')
dds <- DESeq(dds)
# alpha matches intended significance threshold (optimizes independent filtering)
res <- results(dds, alpha = 0.05)
```
### Pre-filtering for ChIP-seq
Unlike RNA-seq where thousands of genes have zero expression, ChIP-seq peaks were called because they have signal. Aggressive filtering can remove truly differential peaks — a peak present in treatment but absent in control will have near-zero counts in control samples.
| Feature set size | Recommended filter |
|-----------------|-------------------|
| < 500 peaks | `rowSums(counts(dds)) >= 1` (remove only all-zero rows) |
| 500-5,000 peaks | `rowSums(counts(dds)) >= 5` |
| > 5,000 peaks | `rowSums(counts(dds)) >= 10` or standard RNA-seq thresholds |
### The Alpha Parameter
The `alpha` argument in `results()` controls independent filtering optimization. Setting it to match the intended significance threshold (e.g., `alpha = 0.05` when filtering at padj < 0.05) maximizes discoveries at that cutoff. Using the default `alpha = 0.1` while filtering at padj < 0.05 is slightly suboptimal but the effect is usually small.
### Log Fold Change Shrinkage
LFC shrinkage (apeglm, ashr) improves fold-change estimates for ranking and visualization but does NOT change p-values or padj. Always use padj from `results()` for significance calls.
```r
library(apeglm)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
```
## From Count Matrix (Python - PyDESeq2)
**Goal:** Run differential binding analysis in Python using PyDESeq2's implementation of the DESeq2 negative binomial framework.
**Approach:** Load count data into a DeseqDataSet (samples as rows, features as columns), run the pipeline, extract results with BH-adjusted p-values.
```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
from pydeseq2.default_inference import DefaultInference
counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)
metadata = pd.DataFrame(
{'condition': ['treated'] * 3 + ['control'] * 3},
index=counts.columns
)
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(counts=counts.T, metadata=metadata, design='~ condition',
refit_cooks=True, inference=inference)
dds.deseq2()
ds = DeseqStats(dds, contrast=['condition', 'treated', 'control'], inference=inference)
ds.summary()
results = ds.results_df
```
PyDESeq2 expects counts with samples as rows and features as columns — transpose a peaks-by-samples matrix with `.T`. The `contrast` argument takes `['variable', 'numerator', 'denominator']`, so positive log2FC means higher in the numerator condition.
## DiffBind Workflow (From BAMs)
**Goal:** Run the complete DiffBind pipeline from BAM and peak files to differential binding results.
**Approach:** Create a sample sheet linking BAMs, peaks, and metadata; load into DiffBind; count reads in consensus peaks; normalize; run statistical testing.
### Sample Sheet
```r
samples <- data.frame(
SampleID = c('ctrl_1', 'ctrl_2', 'treat_1', 'treat_2'),
Tissue = c('cell', 'cell', 'cell', 'cell'),
Factor = c('H3K4me3', 'H3K4me3', 'H3K4me3', 'H3K4me3'),
Condition = c('control', 'control', 'treatment', 'treatment'),
Replicate = c(1, 2, 1, 2),
bamReads = c('ctrl1.bam', 'ctrl2.bam', 'treat1.bam', 'treat2.bam'),
Peaks = c('ctrl1_peaks.narrowPeak', 'ctrl2_peaks.narrowPeak',
'treat1_peaks.narrowPeak', 'treat2_peaks.narrowPeak'),
PeakCaller = c('macs', 'macs', 'macs', 'macs')
)
write.csv(samples, 'samples.csv', row.names = FALSE)
```
### Load, Count, Normalize, Analyze
```r
library(DiffBind)
dba_obj <- dba(sampleSheet = 'samples.csv')
dba_obj <- dba.count(dba_obj, summits = 250, minOverlap = 2)
dba_obj <- dba.normalize(dba_obj)
dba_obj <- dba.contrast(dba_obj, design = '~ Condition')
dba_obj <- dba.analyze(dba_obj, method = DBA_DESEQ2)
```
### DiffBind Normalization Options
```r
# Default: full library size (conservative, recommended starting point)
dba_obj <- dba.normalize(dba_obj)
# Background normalization (robust to composition bias; uses csaw bins)
dba_obj <- dba.normalize(dba_obj, background = TRUE)
# Inspect applied normalization
dba.normalize(dba_obj, bRetrieve = TRUE)
```
| Method | When to use |
|--------|------------|
| `DBA_NORM_LIB` + full library (default) | Most analyses; conservative baseline |
| `DBA_NORM_RLE` + reads in peaks | Most peaks expected stable; typical TF ChIP |
| Background (`background=TRUE`) | Suspected composition bias or global changes |
| Spike-in (`spikein=TRUE`) | ChIP-Rx experiments with exogenous reference |
| Loess offsets (`offsets=TRUE`) | Abundance-dependent efficiency biases (use cautiously — can over-normalize) |
### Results and Export
```r
db_all <- dba.report(dba_obj, th = 1)
results_df <- as.data.frame(db_all)
write.csv(results_df, 'differential_binding.csv', row.names = FALSE)
# Significant peaks at FDR < 0.05
db_sig <- dba.report(dba_obj, th = 0.05)
# Export to BED
library(rtracklayer)
export(db_sig, 'diff_peaks.bed', format = 'BED')
```
### Visualization
```r
dba.plotPCA(dba_obj, DBA_CONDITION, label = DBA_ID)
dba.plotMA(dba_obj)
dba.plotVolcano(dba_obj)
dba.plotHeatmap(dba_obj, contrast = 1, correlations = FALSE)
```
## Multi-Factor Design
```r
# DiffBind with blocking factor
dba_obj <- dba.contrast(dba_obj, design = '~ Batch + Condition')
dba_obj <- dba.analyze(dba_obj)
# DESeq2 directly with batch correction
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, name = 'condition_treated_vs_control')
```
## DiffBind 3.0+ Notes
Key defaults changed in DiffBind 3.0:
- `summits=200` recenters peaks (was `FALSE`); set `summits=FALSE` for broad histone marks
- `dba.normalize()` is now required before analysis
- Blacklist filtering applied by default
- Full library size normalization replaces reads-in-peaks (more conservative)
- Use design formulas instead of `group1`/`group2` for contrasts
## Sample Sheet Columns
| Column | Required | Description |
|--------|----------|-------------|
| SampleID | Yes | Unique identifier |
| Condition | Yes | Experimental condition |
| Replicate | Yes | Replicate number |
| bamReads | Yes | Path to BAM file |
| Peaks | Yes | Path to peak file |
| PeakCaller | Yes | macs, bed, narrow |
| bamControl | No | Path to input BAM |
| Tissue | No | Tissue/cell type |
| Factor | No | ChIP target |
## Related Skills
- peak-calling - Generate input peak files
- peak-annotation - Annotate differential peaks to genes
- differential-expression/deseq2-basics - DESeq2 fundamentals for count-based testing
- differential-expression/edger-basics - Alternative statistical framework
- pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.