bio-rna-quantification-featurecounts-counting

Name: bio-rna-quantification-featurecounts-counting
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-rna-quantification-featurecounts-counting

Counts RNA-seq reads per gene using featureCounts from BAM files

Solves the task of generating gene-level read counts for differential expression analysis
Relies on Subread featureCounts, STAR/HISAT2 BAM files, and GTF annotations
Uses GTF annotations to map reads to genes and count overlaps
Exports a gene-by-sample count matrix as a text file for downstream analysis

SKILL.md

.github/skills/bio-rna-quantification-featurecounts-countingView on GitHub ↗

---
name: bio-rna-quantification-featurecounts-counting
description: Count reads per gene from aligned BAM files using Subread featureCounts. Use when processing BAM files from STAR/HISAT2 to generate gene-level counts for DESeq2/edgeR.
tool_type: cli
primary_tool: featureCounts
---

## Version Compatibility

Reference examples tested with: DESeq2 1.42+, HISAT2 2.2.1+, STAR 2.7.11+, Subread 2.0+, edgeR 4.0+, pandas 2.2+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# featureCounts Counting

**"Count reads per gene from my BAM files"** → Assign aligned reads to genomic features using a GTF annotation to produce a gene-by-sample count matrix for DE analysis.
- CLI: `featureCounts -a genes.gtf -o counts.txt sample1.bam sample2.bam`

Count reads mapping to genomic features (genes, exons) from BAM files.

## Basic Usage

```bash
# Single sample
featureCounts -a annotation.gtf -o counts.txt aligned.bam

# Multiple samples (recommended - single matrix output)
featureCounts -a annotation.gtf -o counts.txt sample1.bam sample2.bam sample3.bam

# All BAMs in directory
featureCounts -a annotation.gtf -o counts.txt *.bam
```

## Paired-End Data

```bash
# Count fragments, not reads (required for paired-end)
featureCounts -p --countReadPairs -a annotation.gtf -o counts.txt *.bam

# Check proper pairs only
featureCounts -p --countReadPairs -B -C -a annotation.gtf -o counts.txt *.bam
```

**Flags:**
- `-p` - Input is paired-end
- `--countReadPairs` - Count fragments instead of reads
- `-B` - Only count properly paired reads
- `-C` - Don't count chimeric fragments

## Strand-Specific Libraries

```bash
# Unstranded (default)
featureCounts -s 0 -a annotation.gtf -o counts.txt *.bam

# Forward stranded (e.g., dUTP, NSR)
featureCounts -s 1 -a annotation.gtf -o counts.txt *.bam

# Reverse stranded (e.g., Illumina TruSeq, most common)
featureCounts -s 2 -a annotation.gtf -o counts.txt *.bam
```

**Determining strandedness:** Use `infer_experiment.py` from RSeQC or check library prep protocol.

## Feature Types

```bash
# Count at gene level (default)
featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt *.bam

# Count at transcript level
featureCounts -t exon -g transcript_id -a annotation.gtf -o counts.txt *.bam

# Count CDS only
featureCounts -t CDS -g gene_id -a annotation.gtf -o counts.txt *.bam
```

**Flags:**
- `-t` - Feature type in GTF (default: exon)
- `-g` - Meta-feature attribute (default: gene_id)

## Multi-Mapping Reads

```bash
# Discard multi-mappers (default, recommended for DE)
featureCounts -a annotation.gtf -o counts.txt *.bam

# Count multi-mappers (fractional)
featureCounts -M --fraction -a annotation.gtf -o counts.txt *.bam

# Count multi-mappers (full count to each location)
featureCounts -M -a annotation.gtf -o counts.txt *.bam
```

## Overlapping Features

```bash
# Discard reads overlapping multiple features (default)
featureCounts -a annotation.gtf -o counts.txt *.bam

# Count reads overlapping multiple features
featureCounts -O -a annotation.gtf -o counts.txt *.bam

# Fractional count for overlaps
featureCounts -O --fraction -a annotation.gtf -o counts.txt *.bam
```

## Performance Options

```bash
# Use multiple threads
featureCounts -T 8 -a annotation.gtf -o counts.txt *.bam

# Use less memory (slower)
featureCounts --largeBAM -a annotation.gtf -o counts.txt *.bam
```

## Output Files

featureCounts produces two files:

1. **counts.txt** - Main count matrix
```
Geneid  Chr     Start   End     Strand  Length  sample1.bam  sample2.bam
GENE1   chr1    100     500     +       400     1523         1891
GENE2   chr1    1000    2000    -       1000    892          756
```

2. **counts.txt.summary** - Assignment statistics
```
Status                  sample1.bam  sample2.bam
Assigned                1523456      1678234
Unassigned_Unmapped     12345        11234
Unassigned_NoFeatures   234567       245678
```

## Extract Count Matrix

```bash
# Remove first 6 columns (metadata) to get just counts
cut -f1,7- counts.txt | tail -n +2 > count_matrix.txt
```

## Python Processing

```python
import pandas as pd

counts = pd.read_csv('counts.txt', sep='\t', comment='#')
count_matrix = counts.set_index('Geneid').iloc[:, 5:]  # Skip metadata columns
count_matrix.columns = [c.replace('.bam', '') for c in count_matrix.columns]
count_matrix.to_csv('count_matrix.csv')
```

## R Processing

```r
counts <- read.table('counts.txt', header=TRUE, row.names=1, skip=1)
count_matrix <- counts[, 6:ncol(counts)]  # Skip metadata columns
colnames(count_matrix) <- gsub('.bam', '', colnames(count_matrix))
```

## Common Issues

**Low assignment rate:**
- Check strandedness setting (`-s`)
- Verify GTF matches reference genome version
- Check BAM alignment quality

**Zero counts for known expressed genes:**
- Ensure feature type matches GTF (`-t exon` vs `-t gene`)
- Check gene_id attribute name in GTF

## Related Skills

- alignment-files/sam-bam-basics - Input BAM file handling
- genome-intervals/gtf-gff-handling - GTF annotation files
- differential-expression/deseq2-basics - Downstream analysis with counts
- rna-quantification/count-matrix-qc - QC of count data