bio-alignment-filtering
$
npx mdskill add GPTomics/bioSkills/bio-alignment-filteringFilter alignments by quality and regions using samtools.
- Extract specific reads while removing low-quality data
- Depends on samtools CLI and pysam Python library APIs
- Decides actions based on flag bits, mapping scores, or genomic coordinates
- Delivers filtered BAM files via command-line output or programmatic iteration
SKILL.md
.github/skills/bio-alignment-filteringView on GitHub ↗
---
name: bio-alignment-filtering
description: Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
tool_type: cli
primary_tool: samtools
---
## Version Compatibility
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Alignment Filtering
**"Filter my BAM file to keep only high-quality reads"** → Select reads by FLAG bits, mapping quality, and genomic regions using samtools view or pysam.
- CLI: `samtools view` with `-F`/`-f`/`-q`/`-L` flags (samtools)
- Python: `pysam.AlignmentFile` iteration with attribute filters (pysam)
Filter alignments by flags, quality, and regions using samtools and pysam.
## Filter Flags
| Option | Description |
|--------|-------------|
| `-f FLAG` | Include reads with ALL bits set |
| `-F FLAG` | Exclude reads with ANY bits set |
| `-G FLAG` | Exclude reads with ALL bits set |
| `-q MAPQ` | Minimum mapping quality |
| `-L BED` | Include reads overlapping regions |
## Common FLAG Values
| Flag | Hex | Meaning |
|------|-----|---------|
| 1 | 0x1 | Paired |
| 2 | 0x2 | Proper pair |
| 4 | 0x4 | Unmapped |
| 8 | 0x8 | Mate unmapped |
| 16 | 0x10 | Reverse strand |
| 32 | 0x20 | Mate reverse strand |
| 64 | 0x40 | First in pair (read1) |
| 128 | 0x80 | Second in pair (read2) |
| 256 | 0x100 | Secondary alignment |
| 512 | 0x200 | Failed QC |
| 1024 | 0x400 | Duplicate |
| 2048 | 0x800 | Supplementary |
## Filter by FLAG
### Keep Only Mapped Reads
```bash
samtools view -F 4 -o mapped.bam input.bam
```
### Keep Only Unmapped Reads
```bash
samtools view -f 4 -o unmapped.bam input.bam
```
### Keep Only Properly Paired
```bash
samtools view -f 2 -o proper.bam input.bam
```
### Remove Duplicates
```bash
samtools view -F 1024 -o nodup.bam input.bam
```
### Remove Secondary and Supplementary
```bash
samtools view -F 2304 -o primary.bam input.bam
```
### Keep Only Primary Alignments
```bash
samtools view -F 256 -F 2048 -o primary.bam input.bam
# Or combined: -F 2304
```
### Keep Read1 Only
```bash
samtools view -f 64 -o read1.bam input.bam
```
### Keep Read2 Only
```bash
samtools view -f 128 -o read2.bam input.bam
```
### Forward Strand Only
```bash
samtools view -F 16 -o forward.bam input.bam
```
### Reverse Strand Only
```bash
samtools view -f 16 -o reverse.bam input.bam
```
## Filter by Mapping Quality
### Minimum MAPQ
```bash
samtools view -q 30 -o highqual.bam input.bam
```
### MAPQ and Mapped
```bash
samtools view -F 4 -q 30 -o filtered.bam input.bam
```
### Aligner-Aware MAPQ Thresholds
MAPQ scales differ by aligner; the same `-q 30` filter does different things. See sam-bam-basics for the full MAPQ-by-aligner table. Filtering recommendations:
| Aligner | "Drop ambiguous" | "High confidence" |
|---------|------------------|-------------------|
| BWA-MEM / BWA-MEM2 | `-q 1` | `-q 30` (or `-q 60` for unique only) |
| Bowtie2 | `-q 1` | `-q 23` (~99% accurate per Heng Li) |
| **STAR** | `-q 255` | `-q 255` (255 is the unique-mapped sentinel; -q 60 drops everything) |
| HISAT2 | `-q 1` | `-q 60` |
| minimap2 (DNA, long-read) | `-q 1` | `-q 60` |
| pbmm2 (PacBio) | `-q 1` | `-q 60` |
For Phred-scaled aligners (BWA, minimap2), MAPQ Q maps to ~10^(-Q/10) probability of wrong mapping. For STAR, the values 0/1/2/3/255 are sentinels, not probabilities.
### Drop Ambiguous Across Aligners (Universal)
```bash
samtools view -q 1 in.bam # exclude MAPQ=0; works for all aligners
```
## Filter by Region
### Single Region
```bash
samtools view -o region.bam input.bam chr1:1000000-2000000
```
### Multiple Regions
```bash
samtools view -o regions.bam input.bam chr1:1000-2000 chr2:3000-4000
```
### Regions from BED File
```bash
samtools view -L targets.bed -o targets.bam input.bam
```
### Combine Region and Quality
```bash
samtools view -q 30 -L targets.bed -o filtered.bam input.bam
```
## Combined Filters
### Standard Quality Filter
**Goal:** Produce a clean BAM containing only primary, mapped, non-duplicate reads with high mapping confidence.
**Approach:** Combine FLAG exclusion (-F for unmapped + secondary + duplicate + supplementary) with a MAPQ threshold.
**Reference (samtools 1.19+):**
```bash
samtools view -F 3332 -q 30 -o filtered.bam input.bam
# 3332 = 4 (unmapped) + 256 (secondary) + 1024 (duplicate) + 2048 (supplementary)
```
### Variant Calling Prep -- Assay-Aware
**Goal:** Choose a filter that matches what the downstream caller expects. Stripping supplementary alignments breaks SV callers; requiring proper-pair drops valid spliced RNA-seq reads.
| Assay / caller | Recommended filter | Why |
|----------------|-------------------|-----|
| Germline WGS short-variant (HaplotypeCaller, DeepVariant) | `-f 2 -F 3328 -q 20` | Primary, no dup, proper pair, MAPQ>=20 |
| Somatic short-variant (Mutect2, Strelka2) | `-F 3328 -q 1` | Drop only MAPQ=0; somatic callers handle low MAPQ; chimeric reads at SVs may carry real somatic SNVs |
| Long-read short-variant (clair3, DeepVariant ONT) | `-F 3328 -q 5` | Long-read MAPQ scale is lower |
| Long-read SV (Sniffles, cuteSV) | `-F 1024` only | **Keep supplementary** -- SA tag is the SV signal |
| Short-read SV (Manta, GRIDSS, Delly, SvABA) | `-F 1024` only | Same -- supplementary required |
| ChIP-seq peak calling | `-F 1804 -q 30` | Drop dup + secondary + supp + unmapped + mate-unmapped + QC-fail |
| ATAC-seq | `-F 1804 -q 30 -f 2` | Same plus proper pair |
| RNA-seq quantification (STAR) | `-q 255` | Unique only (STAR sentinel) |
| RNA-seq quantification (HISAT2) | `-F 256 -q 60` | Different aligner semantics |
| RNA-seq variant (after `SplitNCigarReads`) | `-F 3328 -q 20` | Standard germline after split-N-trim |
| Panel / amplicon | After `samtools ampliconclip`; `-F 1024 -q 20` | Primer overlap makes proper-pair unreliable |
| ctDNA / cfDNA (UMI) | After fgbio consensus; do not pre-filter raw | |
**Reference (samtools 1.19+):**
```bash
# Short-variant germline
samtools view -f 2 -F 3328 -q 20 -o clean.bam input.bam
# 3328 = 256 (secondary) + 1024 (duplicate) + 2048 (supplementary)
# SV calling: KEEP supplementary
samtools view -F 1024 -o sv_input.bam input.bam # NOT -F 2304 or -F 3328
# ChIP-seq / ATAC-seq common filter
samtools view -F 1804 -q 30 -o filtered.bam input.bam
# 1804 = 4 + 8 + 256 + 512 + 1024 = unmapped + mate-unmapped + secondary + QC-fail + duplicate
```
**Cost of getting this wrong:** filtering `-F 2304` or `-F 3328` before SV calling produces zero SV calls -- a single-flag mistake that silently invalidates the analysis.
## Subsample Reads (Deterministic, Pair-Consistent)
`samtools view -s SEED.FRAC` -- integer is the hash seed; fractional is the keep fraction. The hash is on QNAME, so:
1. Mate consistency: read1 and read2 are kept or dropped together.
2. Reproducibility: same seed + same fraction returns the same reads.
3. **Sequential downsampling requires different seeds.** `-s 1.5` then `-s 1.25` keeps a nested 5/8 of the original (not 12.5%). Use different integer seeds for independent samples.
```bash
# 10% with seed 42 (always the same reads; pair-consistent)
samtools view -s 42.1 -b -o subset.bam input.bam
# Sequential cuts with INDEPENDENT seeds
samtools view -s 1.5 -b in.bam > half1.bam
samtools view -s 2.25 -b half1.bam > quarter.bam # 12.5% of original
# Coverage-matching to a target read count
total=$(samtools view -c -F 2304 input.bam)
target=10000000
frac=$(awk -v t=$target -v n=$total 'BEGIN{printf "%.6f", t/n}')
samtools view -s "1.${frac#*.}" -b -o matched.bam input.bam
# Tumor-normal coverage matching (pull tumor down to normal)
normal_reads=$(samtools view -c -F 2308 normal.bam)
tumor_reads=$(samtools view -c -F 2308 tumor.bam)
if [ "$tumor_reads" -gt "$normal_reads" ]; then
frac=$(awk -v n=$normal_reads -v t=$tumor_reads 'BEGIN{printf "%.6f", n/t}')
samtools view -s "1.${frac#*.}" -b -o tumor_matched.bam tumor.bam
fi
```
A subsampled BAM without an integer seed (`-s 0.1`) is non-reproducible -- production pipelines should reject it.
## Expression Filtering
`samtools view -e EXPR` (or `--expr`, since samtools 1.16) supports arbitrary expression filtering on tags, FLAG, MAPQ, RNAME, CIGAR, etc. Powerful for filtering by `NM`, `AS`, `NH`, `cs`, etc. that the FLAG-based filters cannot reach:
```bash
# Reads with >=2 mismatches (NM tag)
samtools view -e '[NM] >= 2' in.bam
# Soft clip on the left, on chr1
samtools view -e 'cigar=~"^[0-9]+S" && rname=="chr1"' in.bam
# Combine with FLAG and MAPQ
samtools view -F 2308 -q 30 -e '[NM] <= 5 && [AS] >= 100' in.bam
# Drop reads with low mapped fraction (samtools-internal helpers)
samtools view -e 'sclen / qlen < 0.2' in.bam
```
Note: in samtools 1.16+, `![NM]` is true only if NM is missing (was buggy in earlier versions); NULL values from missing tags propagate through arithmetic.
## Filter by Read Group
```bash
samtools view -r library_A in.bam # single read group
samtools view -R rg_list.txt in.bam # multiple via file (one ID per line)
```
## pysam Python Alternative
### Basic Filtering
```python
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('filtered.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if read.is_unmapped:
continue
if read.mapping_quality < 30:
continue
if read.is_duplicate:
continue
outfile.write(read)
```
### Filter with Function
**Goal:** Apply a multi-criteria quality filter to produce clean alignments for downstream analysis.
**Approach:** Define a predicate checking mapped status, primary alignment, duplicate flag, and MAPQ; stream reads through it.
**Reference (pysam 0.22+):**
```python
import pysam
def passes_filter(read):
if read.is_unmapped:
return False
if read.is_secondary or read.is_supplementary:
return False
if read.is_duplicate:
return False
if read.mapping_quality < 30:
return False
return True
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('filtered.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if passes_filter(read):
outfile.write(read)
```
### Filter by Region
```python
import pysam
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('region.bam', 'wb', header=infile.header) as outfile:
for read in infile.fetch('chr1', 1000000, 2000000):
outfile.write(read)
```
### Filter from BED File
**Goal:** Extract only reads overlapping target regions defined in a BED file.
**Approach:** Parse BED into a list of (chrom, start, end) tuples, then fetch reads from each region and write to output.
**Reference (pysam 0.22+):**
```python
import pysam
def read_bed(bed_path):
regions = []
with open(bed_path) as f:
for line in f:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
regions.append((parts[0], int(parts[1]), int(parts[2])))
return regions
regions = read_bed('targets.bed')
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('targets.bam', 'wb', header=infile.header) as outfile:
for chrom, start, end in regions:
for read in infile.fetch(chrom, start, end):
outfile.write(read)
```
### Subsample (Pair-Consistent)
Hash on QNAME so mates stay together (a fresh `random.random()` per read drops mates inconsistently and breaks paired-end tools):
```python
import pysam
import zlib
fraction = 0.1
seed = 42
threshold = int(0xffffffff * fraction)
def template_hash(qname, seed):
return zlib.crc32(qname.encode()) ^ seed
with pysam.AlignmentFile('input.bam', 'rb') as infile:
with pysam.AlignmentFile('subset.bam', 'wb', header=infile.header) as outfile:
for read in infile:
if template_hash(read.query_name, seed) <= threshold:
outfile.write(read)
```
## Quick Reference
| Task | samtools command |
|------|------------------|
| Mapped only | `view -F 4` |
| Unmapped only | `view -f 4` |
| Properly paired | `view -f 2` |
| Primary only | `view -F 2304` |
| No duplicates | `view -F 1024` |
| High MAPQ | `view -q 30` |
| Region | `view file.bam chr1:1-1000` |
| BED regions | `view -L file.bed` |
| Subsample 10% (reproducible) | `view -s 42.1` |
| Standard filter | `view -F 3332 -q 30` |
## Common Filter Combinations
| Purpose | Flags |
|---------|-------|
| Clean reads | `-F 3332 -q 30` (mapped, primary, no dups, high qual) |
| Variant calling | `-f 2 -F 3328 -q 20` (proper pair, primary, no dups) |
| Coverage analysis | `-F 1284 -q 1` (mapped, primary, no dups) |
| Count unique | `-F 2304` (primary only) |
Flag breakdowns:
- 2304 = 256 + 2048 (secondary + supplementary)
- 3328 = 256 + 1024 + 2048 (secondary + duplicate + supplementary)
- 3332 = 4 + 256 + 1024 + 2048 (unmapped + secondary + duplicate + supplementary)
- 1284 = 4 + 256 + 1024 (unmapped + secondary + duplicate)
## Related Skills
- sam-bam-basics - FLAG semantics, MAPQ-by-aligner, secondary vs supplementary
- alignment-sorting - Sort before/after filtering
- alignment-indexing - Required for region filtering
- alignment-amplicon-clipping - Primer clipping for amplicon panels
- duplicate-handling - Mark duplicates before filtering
- bam-statistics - Check filter effects
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.
- bio-alignment-structuralAlign protein structures using Foldseek 3Di, TM-align, US-align, DALI, or Foldmason for structural MSA. Predict, score, and superpose backbone coordinates when sequence identity is below the twilight zone or remote-homology detection is required. Use when sequence MSA fails (<25% identity), when the dark proteome is the target, when AlphaFoldDB / ESM Atlas search is needed, or when structural superposition is the goal.