bio-chipseq-peak-annotation

$npx mdskill add GPTomics/bioSkills/bio-chipseq-peak-annotation

Annotate ChIP-seq peaks to genomic features and genes

  • Assigns peaks to promoter, exon, intron, or intergenic categories
  • Integrates ChIPseeker, HOMER, pandas, and pyranges tools
  • Uses custom GTF files or pre-built annotation databases
  • Outputs peak classifications with signed distance to TSS
SKILL.md
.github/skills/bio-chipseq-peak-annotationView on GitHub ↗
---
name: bio-chipseq-peak-annotation
description: Annotate ChIP-seq peaks to genomic features and nearest genes. Classify peaks as promoter, exon, intron, or intergenic using ChIPseeker (R), HOMER annotatePeaks.pl (CLI), or Python (pandas/pyranges). Supports pre-built annotation databases and custom GTF files. Handles promoter definition, feature priority, category collapsing, and signed distance-to-TSS. Use when assigning genomic context to ChIP-seq peaks or linking peaks to target genes.
tool_type: mixed
primary_tool: ChIPseeker
---

## Version Compatibility

Reference examples tested with: ChIPseeker 1.38+, GenomicFeatures 1.54+, rtracklayer 1.62+, HOMER 4.11+, pyranges 0.0.129+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `annotatePeaks.pl` (HOMER prints version on run)

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Peak Annotation

**"Annotate my peaks to genes and genomic features"** -> Assign each peak to a genomic feature category (promoter, exon, intron, intergenic), find the nearest gene, and calculate signed distance to TSS.
- R: `ChIPseeker::annotatePeak(peaks, TxDb=txdb)`
- CLI: `annotatePeaks.pl peaks.bed hg38 -gtf annotation.gtf`
- Python: Parse GTF, compute intervals, classify by overlap

## Choosing an Annotation Approach

| Context | Recommended | Why |
|---------|-------------|-----|
| Standard genome, pre-built annotations available | ChIPseeker with TxDb package | Simplest setup; automatic gene symbol mapping via annoDb |
| Custom or project-specific GTF provided | ChIPseeker + makeTxDbFromGFF, HOMER -gtf, or Python | All three handle custom annotations; choose based on pipeline |
| HOMER already in pipeline | HOMER annotatePeaks.pl | Reuses tag directory; combined annotation + motif workflow |
| Fine-grained control over classification logic | Python | Full control over priority rules, distance calculation, output format |
| Quick annotation with standard categories | HOMER annotatePeaks.pl | Single command; no R environment required |

**Critical:** Always use the same annotation source for peak annotation as was used for the analysis. Mixing databases (e.g., UCSC knownGene TxDb with a GENCODE GTF alignment) produces gene name mismatches and incorrect feature assignments. When a specific GTF is provided, use it directly rather than a pre-built TxDb package.

## Coordinate Systems and TSS

BED files use 0-based half-open coordinates `[start, end)`. GTF files use 1-based closed coordinates `[start, end]`. Mixing these without conversion shifts annotations by one base.

**Peak center** (from BED): `(start + end) // 2`

**TSS from GTF:**
- Plus-strand genes: TSS = `start` (1-based) -> 0-based: `start - 1`
- Minus-strand genes: TSS = `end` (1-based) -> 0-based: `end`

**Signed distance:** Negative = upstream of TSS, positive = downstream.
- Plus-strand: `distance = peak_center - tss`
- Minus-strand: `distance = -(peak_center - tss)`

## Gene Assignment Conventions

Peak annotation involves two decisions: (1) which gene to assign, and (2) what genomic feature the peak overlaps. These can come from different genes, creating a decoupling problem that affects downstream interpretation.

### Nearest-TSS vs Host-Gene Assignment

| Approach | Gene assigned from | Feature assigned from | Tools |
|----------|-------------------|----------------------|-------|
| Nearest-TSS | Gene with closest TSS | Physical overlap at peak center | ChIPseeker default (`overlap='TSS'`), HOMER |
| Host-gene priority | Gene whose body contains the peak | Same gene's features | ChIPseeker `overlap='all'` |

**Nearest-TSS (default):** HOMER and ChIPseeker both use a two-step process by default. Step 1 finds the gene with the nearest TSS. Step 2 independently determines the genomic feature at the peak center. A peak inside gene A's intron but near gene B's TSS reports: nearest_gene=B, feature=intron -- but the intron belongs to gene A, not gene B.

**Host-gene priority:** ChIPseeker's `overlap='all'` parameter changes this. If a peak overlaps any part of a gene body, that gene is reported as nearest, coupling gene and feature. Peaks with no gene body overlap fall back to nearest TSS.

### Choosing a Convention

| Context | Convention | Rationale |
|---------|-----------|-----------|
| Distal TF binding (enhancers) | Nearest-TSS | Enhancers often regulate the nearest gene, not the gene they sit in |
| Histone marks in gene bodies (H3K36me3, H3K27me3) | Host-gene | Mark typically reflects host gene's transcriptional state |
| Promoter-associated marks (H3K4me3, H3K27ac) | Either | Most peaks are at promoters where both conventions agree |
| Custom annotation against a specific GTF | Host-gene | Consistent gene-feature coupling avoids misleading annotations |
| Reproducing published HOMER results | Nearest-TSS | Matches HOMER's default behavior |

When a task requests "nearest gene," clarify whether it means nearest by TSS distance or the gene whose feature the peak physically overlaps. For most annotation tasks where gene and feature should be consistent, use the host-gene convention.

## Annotation with ChIPseeker (R)

### Pre-built TxDb (Standard Genomes)

```r
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
peaks <- readPeakFile('peaks.narrowPeak')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-3000, 3000), annoDb = 'org.Hs.eg.db')
anno_df <- as.data.frame(peak_anno)
```

### Custom GTF Annotations

**Goal:** Annotate peaks using a project-specific GTF rather than a pre-built annotation package.

**Approach:** Build a TxDb from the GTF with `makeTxDbFromGFF()`, annotate peaks against it, then map gene symbols from the original GTF since custom TxDb objects lack the ID mappings that `annoDb` requires.

```r
library(ChIPseeker)
library(GenomicFeatures)
library(rtracklayer)

txdb <- makeTxDbFromGFF('genes.gtf.gz', format = 'gtf')
peaks <- readPeakFile('peaks.bed')
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000), overlap = 'all')
anno_df <- as.data.frame(peak_anno)

# Map gene symbols from GTF (annoDb does not work with custom TxDb)
gtf <- import('genes.gtf.gz')
gene_map <- unique(data.frame(
    gene_id = sub('\\..*', '', gtf$gene_id),
    symbol = gtf$gene_name, stringsAsFactors = FALSE))
gene_map <- gene_map[!is.na(gene_map$symbol), ]
anno_df$geneId_base <- sub('\\..*', '', anno_df$geneId)
anno_df$SYMBOL <- gene_map$symbol[match(anno_df$geneId_base, gene_map$gene_id)]
```

The version-suffix stripping (`sub('\\..*', '', ...)`) handles GENCODE gene IDs like `ENSG00000142192.25` where the TxDb may store the full ID but the GTF attribute has it without the version.

### Custom Promoter Definition

The `tssRegion` parameter defines the promoter window around each TSS:

| Window | Use case |
|--------|----------|
| c(-1000, 1000) | Strict core promoter |
| c(-2000, 2000) | Common custom definition |
| c(-3000, 3000) | ChIPseeker default; broader capture |
| c(-2000, 500) | Asymmetric; emphasizes upstream regulatory elements |

Match this to analysis requirements. Many studies define specific windows (e.g., 2kb symmetric) -- always check.

### Annotation Priority

ChIPseeker resolves overlapping features using `genomicAnnotationPriority`:

Default: `Promoter > 5'UTR > 3'UTR > Exon > Intron > Downstream > Intergenic`

A peak overlapping both a promoter of gene A and an intron of gene B receives "Promoter". To customize:

```r
peak_anno <- annotatePeak(peaks, TxDb = txdb, tssRegion = c(-2000, 2000),
    genomicAnnotationPriority = c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron',
                                   'Downstream', 'Intergenic'))
```

### Collapse Annotation Categories

ChIPseeker returns detailed subcategories. To collapse to four standard categories:

| ChIPseeker Output | Collapsed |
|-------------------|-----------|
| Promoter (<=1kb), Promoter (1-2kb), Promoter (2-3kb) | promoter |
| 5' UTR, 3' UTR, 1st Exon, Other Exon | exon |
| 1st Intron, Other Intron | intron |
| Downstream (<=300), Downstream (<=1kb), Distal Intergenic | intergenic |

```r
collapse_annotation <- function(ann) {
    ifelse(grepl('Promoter', ann), 'promoter',
    ifelse(grepl("5' UTR|3' UTR|Exon", ann), 'exon',
    ifelse(grepl('Intron', ann), 'intron', 'intergenic')))
}
anno_df$feature <- collapse_annotation(anno_df$annotation)
```

To suppress subcategories at the source:

```r
options(ChIPseeker.ignore_1st_exon = TRUE)
options(ChIPseeker.ignore_1st_intron = TRUE)
options(ChIPseeker.ignore_promoter_subcategory = TRUE)
```

### Export Results

```r
output <- data.frame(chr = anno_df$seqnames, start = anno_df$start, end = anno_df$end,
    nearest_gene = anno_df$SYMBOL, distance_to_tss = anno_df$distanceToTSS,
    feature = anno_df$feature)
write.table(output, 'annotations.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
```

## Annotation with HOMER (CLI)

### Standard Genome

```bash
annotatePeaks.pl peaks.bed hg38 > annotated.txt
```

### Custom GTF

```bash
# With installed genome
annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf > annotated.txt

# Without installed genome (annotation from GTF only)
annotatePeaks.pl peaks.bed none -gtf genes.gtf > annotated.txt
```

For gzipped GTFs, decompress first: `gunzip -k genes.gtf.gz`

### Annotation Statistics

```bash
annotatePeaks.pl peaks.bed hg38 -gtf genes.gtf -annStats ann_stats.txt > annotated.txt
```

### Parse HOMER Output

HOMER uses a two-part annotation process: (1) find the nearest TSS to assign a gene, and (2) independently classify the genomic feature at the peak center. The gene and annotation can come from different genes -- a peak in gene A's intron near gene B's TSS reports gene B with an intron annotation. This matches ChIPseeker's default `overlap='TSS'` behavior.

HOMER produces 19 tab-delimited columns. Key columns for annotation:

| Column | Name | Content |
|--------|------|---------|
| 8 | Annotation | Feature category (promoter-TSS, exon, intron, Intergenic) |
| 10 | Distance to TSS | Signed distance (negative = upstream) |
| 16 | Gene Name | Gene symbol |

```bash
awk -F'\t' 'NR>1 {print $2, $3, $4, $16, $10, $8}' OFS='\t' annotated.txt > summary.tsv
```

### HOMER Category Mapping

| HOMER Category | Collapsed |
|---------------|-----------|
| promoter-TSS | promoter |
| 5' UTR, 3' UTR, exon, non-coding | exon |
| intron | intron |
| Intergenic, TTS | intergenic |

**Limitation:** HOMER defines promoter as -1kb to +100bp from TSS; this window is not configurable via flags. For custom promoter windows, reclassify using the Distance to TSS column:

```bash
awk -F'\t' 'NR>1 {
    dist = ($10 < 0) ? -$10 : $10
    feat = (dist <= 2000) ? "promoter" : $8
    print $2, $3, $4, $16, $10, feat
}' OFS='\t' annotated.txt > reclassified.tsv
```

## Annotation with Python

### Parse GTF and Extract Gene Models

**Goal:** Build gene, exon, and TSS tables from a GTF file for peak annotation.

**Approach:** Parse GTF line by line, extract gene and exon features, compute TSS positions from strand and coordinates, converting from 1-based GTF to 0-based BED coordinates.

```python
import gzip, pandas as pd

def parse_gtf(gtf_path):
    records = []
    opener = gzip.open if gtf_path.endswith('.gz') else open
    with opener(gtf_path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            attrs = {}
            for item in fields[8].strip().rstrip(';').split(';'):
                item = item.strip()
                if ' ' in item:
                    key, val = item.split(' ', 1)
                    attrs[key] = val.strip('"')
            records.append({'chrom': fields[0], 'feature': fields[2],
                            'start': int(fields[3]) - 1, 'end': int(fields[4]),
                            'strand': fields[6], **attrs})
    return pd.DataFrame(records)

gtf = parse_gtf('genes.gtf.gz')
genes = gtf[gtf['feature'] == 'gene'].copy()
genes['tss'] = genes.apply(lambda r: r['start'] if r['strand'] == '+' else r['end'], axis=1)
exons = gtf[gtf['feature'] == 'exon']
```

### Annotate Peaks with Host-Gene Convention

**Goal:** For each peak, classify its genomic feature and assign it to the appropriate gene with consistent gene-feature coupling.

**Approach:** Check features in priority order (promoter > exon > intron > intergenic). For promoter, find the nearest TSS within the window. For exon or intron, assign the host gene whose body contains the peak. For intergenic, fall back to nearest TSS. Compute strand-aware signed distance relative to the assigned gene's TSS.

```python
peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
                     names=['chr', 'start', 'end', 'peak_id', 'score'])
peaks['center'] = (peaks['start'] + peaks['end']) // 2
promoter_window = 2000  # bp from TSS; match to analysis requirements

results = []
for _, peak in peaks.iterrows():
    chrom_genes = genes[genes['chrom'] == peak['chr']]
    chrom_exons = exons[exons['chrom'] == peak['chr']]
    abs_dists = (chrom_genes['tss'] - peak['center']).abs()
    nearest_tss_gene = chrom_genes.loc[abs_dists.idxmin()]

    # Feature classification with host-gene coupling: promoter > exon > intron > intergenic
    if abs_dists.min() <= promoter_window:
        feature, assigned = 'promoter', nearest_tss_gene
    else:
        exon_hits = chrom_exons[(chrom_exons['start'] <= peak['center']) & (peak['center'] < chrom_exons['end'])]
        gene_hits = chrom_genes[(chrom_genes['start'] <= peak['center']) & (peak['center'] < chrom_genes['end'])]
        if len(exon_hits) > 0:
            host_gene_name = exon_hits.iloc[0].get('gene_name', '')
            host = chrom_genes[chrom_genes['gene_name'] == host_gene_name]
            feature, assigned = 'exon', host.iloc[0] if len(host) > 0 else nearest_tss_gene
        elif len(gene_hits) > 0:
            closest_host = gene_hits.loc[(gene_hits['tss'] - peak['center']).abs().idxmin()]
            feature, assigned = 'intron', closest_host
        else:
            feature, assigned = 'intergenic', nearest_tss_gene

    raw_dist = peak['center'] - assigned['tss']
    signed_dist = -raw_dist if assigned['strand'] == '-' else raw_dist

    results.append({'peak_id': peak['peak_id'], 'chr': peak['chr'], 'start': peak['start'],
                    'end': peak['end'], 'nearest_gene': assigned['gene_name'],
                    'distance_to_tss': int(signed_dist), 'feature': feature})

result_df = pd.DataFrame(results)
result_df.to_csv('annotations.tsv', sep='\t', index=False)
```

When multiple genes overlap the peak center (common on opposite strands), the host gene with the closest TSS is selected as a tiebreaker.

### Alternative: pyranges

For projects with pyranges installed, GTF parsing is simpler:

```python
import pyranges as pr

gtf = pr.read_gtf('genes.gtf.gz')  # auto-converts to 0-based half-open
genes = gtf[gtf.Feature == 'gene']
peaks = pr.read_bed('peaks.bed')
nearest = peaks.nearest(genes)  # adds Distance column
```

Feature classification and signed distance still require manual logic on the result DataFrame.

## Visualization

```r
plotAnnoPie(peak_anno)
plotAnnoBar(peak_anno)
plotDistToTSS(peak_anno, title = 'Distribution of peaks relative to TSS')
```

### Compare Multiple Peak Sets

```r
peak_files <- list(H3K4me3 = 'h3k4me3_peaks.bed', H3K27ac = 'h3k27ac_peaks.bed')
anno_list <- lapply(peak_files, function(f) annotatePeak(readPeakFile(f), TxDb = txdb))
plotAnnoBar(anno_list)
plotDistToTSS(anno_list)
```

## Key Parameters

| Parameter (ChIPseeker) | Default | Description |
|------------------------|---------|-------------|
| tssRegion | c(-3000, 3000) | Promoter window around TSS |
| level | "transcript" | "transcript" or "gene"; gene-level merges all isoforms |
| genomicAnnotationPriority | Promoter > ... > Intergenic | Feature priority for overlapping annotations |
| overlap | "TSS" | "TSS": gene = nearest TSS (gene and feature can be from different genes). "all": gene = host gene if peak overlaps any gene body (coupled annotation). Use "all" when gene-feature consistency matters |
| sameStrand | FALSE | Restrict to same-strand genes only |
| addFlankGeneInfo | FALSE | Include neighboring gene information |

## Related Skills

- peak-calling - Generate peak files with MACS3 or HOMER
- motif-analysis - De novo and known motif enrichment in peak regions
- differential-binding - Compare peaks between conditions
- chipseq-visualization - Signal tracks, heatmaps, profile plots
- genome-intervals/gtf-gff-handling - Parse and convert GTF/GFF annotation files
- genome-intervals/proximity-operations - bedtools closest and window operations
- pathway-analysis/go-enrichment - Functional enrichment of peak-associated genes
More from GPTomics/bioSkills