bio-chipseq-qc
$
npx mdskill add GPTomics/bioSkills/bio-chipseq-qcAssess ChIP-seq experiment quality using key metrics
- Quantify enrichment success via FRiP and cross-correlation analysis.
- Depends on deepTools, MACS3, Subread, bedtools, pybedtools, pysam.
- Decides actions by checking installed software versions first.
- Delivers results through CLI commands or Python scripts.
SKILL.md
.github/skills/bio-chipseq-qcView on GitHub ↗
---
name: bio-chipseq-qc
description: ChIP-seq quality control metrics including FRiP (Fraction of Reads in Peaks), cross-correlation analysis (NSC/RSC), library complexity, and IDR (Irreproducibility Discovery Rate) for replicate concordance. Use to assess experiment quality before downstream analysis. Use when assessing ChIP-seq data quality metrics.
tool_type: mixed
primary_tool: deepTools
---
## Version Compatibility
Reference examples tested with: MACS3 3.0+, Subread 2.0+, bedtools 2.31+, deepTools 3.5+, pybedtools 0.9+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# ChIP-seq Quality Control
**"Assess the quality of my ChIP-seq experiment"** → Compute FRiP, cross-correlation (NSC/RSC), library complexity, and IDR replicate concordance to evaluate enrichment success.
- CLI: `deeptools plotFingerprint`, `phantompeakqualtools run_spp.R`
- Python: `pysam` + `pybedtools` for custom QC metrics
Quality metrics for assessing ChIP-seq experiment success and replicate reproducibility.
## FRiP (Fraction of Reads in Peaks)
**Goal:** Quantify enrichment strength by measuring the proportion of reads falling within called peaks.
**Approach:** Count reads overlapping peak regions and divide by total mapped reads.
### Calculate FRiP with bedtools
```bash
# Count reads in peaks
reads_in_peaks=$(bedtools intersect -a chip.bam -b peaks.narrowPeak -u | samtools view -c -)
total_reads=$(samtools view -c -F 260 chip.bam)
# Calculate FRiP
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip"
```
### Calculate FRiP with featureCounts
```bash
# Convert peaks to SAF format
awk 'BEGIN{OFS="\t"} {print $4, $1, $2, $3, "."}' peaks.narrowPeak > peaks.saf
# Count reads in peaks
featureCounts -a peaks.saf -F SAF -o peak_counts.txt chip.bam
# FRiP from summary
grep -v "^#" peak_counts.txt.summary
```
### Calculate FRiP with pysam
```python
import pysam
import pybedtools
def calculate_frip(bam_file, peak_file):
bam = pysam.AlignmentFile(bam_file, 'rb')
total_reads = bam.count(read_callback=lambda r: not r.is_unmapped and not r.is_secondary)
peaks = pybedtools.BedTool(peak_file)
reads_in_peaks = 0
for peak in peaks:
reads_in_peaks += bam.count(peak.chrom, peak.start, peak.end)
frip = reads_in_peaks / total_reads
return frip
frip = calculate_frip('chip.bam', 'peaks.narrowPeak')
print(f'FRiP: {frip:.4f}')
```
### FRiP Thresholds
| Target | Minimum FRiP | Good FRiP |
|--------|--------------|-----------|
| TF (narrow) | 0.01 | > 0.05 |
| Histone (broad) | 0.10 | > 0.20 |
| H3K4me3 | 0.05 | > 0.15 |
| H3K27ac | 0.05 | > 0.10 |
## Cross-Correlation Analysis (NSC/RSC)
**Goal:** Assess ChIP enrichment quality by measuring strand cross-correlation signal.
**Approach:** Calculate correlation between forward and reverse strand read coverage at varying shifts to detect fragment-length enrichment.
### Run phantompeakqualtools
```bash
# Run SPP cross-correlation analysis
Rscript run_spp.R \
-c=chip.bam \
-savp=chip_cc.pdf \
-out=chip_cc.txt \
-odir=qc/
# Output columns:
# 1: filename
# 2: numReads
# 3: estFragLen (estimated fragment length)
# 4: corr_estFragLen
# 5: phantomPeak
# 6: corr_phantomPeak
# 7: argmin_corr (minimum strand shift)
# 8: min_corr
# 9: NSC (Normalized Strand Coefficient)
# 10: RSC (Relative Strand Coefficient)
# 11: QualityTag
```
### Interpret NSC and RSC
```bash
# Parse results
awk -F'\t' '{
print "Fragment length:", $3
print "NSC:", $9
print "RSC:", $10
print "Quality:", $11
}' chip_cc.txt
```
### NSC/RSC Thresholds
| Metric | Marginal | Acceptable | Ideal |
|--------|----------|------------|-------|
| NSC | < 1.05 | 1.05 - 1.1 | > 1.1 |
| RSC | < 0.8 | 0.8 - 1.0 | > 1.0 |
| QualityTag | -2 | 0 | 1 or 2 |
### Plot Cross-Correlation in R
```r
library(spp)
chip_data <- read.bam.tags('chip.bam')
binding_characteristics <- get.binding.characteristics(chip_data, srange=c(50, 500), bin=5)
# Cross-correlation plot
pdf('cc_plot.pdf')
plot(binding_characteristics$cross.correlation, type='l',
xlab='Strand shift', ylab='Cross-correlation')
abline(v=binding_characteristics$peak$x, col='red')
dev.off()
# Extract metrics
print(paste('Fragment length:', binding_characteristics$peak$x))
```
## Library Complexity (NRF, PBC1, PBC2)
**Goal:** Detect PCR amplification artifacts by measuring library complexity metrics.
**Approach:** Calculate the fraction of unique reads and positional redundancy to assess PCR bottlenecking.
### Calculate with bedtools
```bash
# NRF: Non-Redundant Fraction (unique reads / total reads)
total=$(samtools view -c -F 260 chip.bam)
unique=$(samtools view -F 260 chip.bam | cut -f1-4 | sort -u | wc -l)
nrf=$(echo "scale=4; $unique / $total" | bc)
echo "NRF: $nrf"
# PBC1: PCR Bottleneck Coefficient 1 (M1/Mdistinct)
# M1 = locations with exactly 1 read
# Mdistinct = distinct genomic locations
bedtools bamtobed -i chip.bam | \
awk '{print $1":"$2"-"$3}' | \
sort | uniq -c | \
awk '{
if($1==1) m1++
mdist++
} END {
print "M1:", m1
print "Mdistinct:", mdist
print "PBC1:", m1/mdist
}'
```
### Library Complexity Thresholds
| Metric | Severe | Mild | None |
|--------|--------|------|------|
| NRF | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC1 | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC2 | < 1 | 1 - 3 | > 3 |
## IDR (Irreproducibility Discovery Rate)
**Goal:** Assess replicate concordance by measuring consistency of ranked peak lists.
**Approach:** Compare signal-ranked peaks from two replicates using IDR statistical framework to identify reproducible peaks.
### Run IDR Analysis
```bash
# Call peaks on each replicate
macs3 callpeak -t rep1.bam -c input.bam -n rep1 -g hs
macs3 callpeak -t rep2.bam -c input.bam -n rep2 -g hs
# Sort by signal value (column 7)
sort -k7,7nr rep1_peaks.narrowPeak > rep1_sorted.narrowPeak
sort -k7,7nr rep2_peaks.narrowPeak > rep2_sorted.narrowPeak
# Run IDR
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
--input-file-type narrowPeak \
--rank signal.value \
--output-file idr_output.txt \
--plot idr_plot.pdf \
--log-output-file idr.log
```
### IDR with Pooled Peaks
```bash
# Call peaks on pooled data
samtools merge -f pooled.bam rep1.bam rep2.bam
macs3 callpeak -t pooled.bam -c input.bam -n pooled -g hs
# Run IDR with oracle
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
--peak-list pooled_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank signal.value \
--output-file idr_oracle.txt
```
### Interpret IDR Results
```bash
# Count peaks at different IDR thresholds
awk '$5 >= 540' idr_output.txt | wc -l # IDR < 0.05 (conservative)
awk '$5 >= 415' idr_output.txt | wc -l # IDR < 0.1 (optimal)
# IDR output columns:
# 1-3: chr, start, end
# 4: name
# 5: scaled IDR (-125 * log2(IDR))
# 6: strand
# 7: signal (from rep1)
# 8: signal (from rep2)
# 9: local IDR
# 10: global IDR
```
### IDR Self-Consistency Check
```bash
# Split one sample and check self-consistency
samtools view -s 0.5 chip.bam -b > pseudo_rep1.bam
samtools view -s 2.5 chip.bam -b > pseudo_rep2.bam
# Call peaks on pseudo-replicates
macs3 callpeak -t pseudo_rep1.bam -c input.bam -n pseudo1 -g hs
macs3 callpeak -t pseudo_rep2.bam -c input.bam -n pseudo2 -g hs
# Run IDR
idr --samples pseudo1_peaks.narrowPeak pseudo2_peaks.narrowPeak \
--input-file-type narrowPeak \
--output-file self_idr.txt
```
### IDR Quality Guidelines
| Comparison | Expected IDR Peaks | Notes |
|------------|-------------------|-------|
| True replicates | > 70% of pooled | Biological concordance |
| Pseudo-replicates | > 80% of sample | Technical consistency |
| Rep vs Pooled | ~100% of rep peaks | Subset relationship |
## deepTools QC Metrics
**Goal:** Visualize ChIP enrichment and sample correlation using deepTools fingerprint and correlation plots.
**Approach:** Generate cumulative read coverage curves and pairwise sample correlation matrices from BAM files.
### plotFingerprint
```bash
# Assess enrichment with fingerprint plot
plotFingerprint \
-b chip.bam input.bam \
--labels ChIP Input \
-o fingerprint.pdf \
--outRawCounts fingerprint.tab \
--outQualityMetrics fingerprint_qc.txt
# Good ChIP shows curve shifted right of diagonal
# Input follows diagonal
```
### computeMatrix and plotProfile
```bash
# TSS enrichment
computeMatrix reference-point \
-S chip.bw \
-R genes.bed \
--referencePoint TSS \
-a 3000 -b 3000 \
-o matrix.gz
plotProfile \
-m matrix.gz \
-o tss_enrichment.pdf \
--perGroup
```
### plotCorrelation
```bash
# Sample correlation
multiBamSummary bins \
-b rep1.bam rep2.bam rep3.bam \
-o results.npz
plotCorrelation \
-in results.npz \
--corMethod spearman \
--whatToPlot heatmap \
-o correlation.pdf \
--outFileCorMatrix correlation.tab
```
## Complete QC Pipeline
**Goal:** Run all major ChIP-seq QC metrics in a single automated script.
**Approach:** Combine FRiP, cross-correlation, library complexity, and fingerprint analysis into one pipeline.
```bash
#!/bin/bash
sample=$1
input=$2
peaks=$3
echo "=== ChIP-seq QC Report: $sample ===" > qc_report.txt
# FRiP
reads_in_peaks=$(bedtools intersect -a $sample -b $peaks -u | samtools view -c -)
total_reads=$(samtools view -c -F 260 $sample)
frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc)
echo "FRiP: $frip" >> qc_report.txt
# Cross-correlation
Rscript run_spp.R -c=$sample -out=cc.txt -odir=.
nsc=$(cut -f9 cc.txt)
rsc=$(cut -f10 cc.txt)
echo "NSC: $nsc" >> qc_report.txt
echo "RSC: $rsc" >> qc_report.txt
# Library complexity
unique=$(samtools view -F 260 $sample | cut -f1-4 | sort -u | wc -l)
nrf=$(echo "scale=4; $unique / $total_reads" | bc)
echo "NRF: $nrf" >> qc_report.txt
# Fingerprint
plotFingerprint -b $sample $input -o fingerprint.pdf --outQualityMetrics fingerprint_qc.txt
cat qc_report.txt
```
## QC Summary Table
| Metric | Tool | Ideal Value |
|--------|------|-------------|
| FRiP | bedtools/featureCounts | > 0.05 (TF), > 0.1 (histone) |
| NSC | phantompeakqualtools | > 1.1 |
| RSC | phantompeakqualtools | > 1.0 |
| NRF | samtools/bedtools | > 0.8 |
| PBC1 | bedtools | > 0.8 |
| IDR (replicates) | idr | > 70% concordance |
## Related Skills
- peak-calling - Call peaks before QC analysis
- alignment-files - BAM statistics and filtering
- differential-binding - Compare conditions after QC
- atac-seq/atac-qc - Similar QC for ATAC-seq
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.