bio-read-alignment-bwa-alignment

Name: bio-read-alignment-bwa-alignment
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-read-alignment-bwa-alignment

Align DNA reads to reference genomes using BWA-MEM2.

Processes paired-end and single-end sequencing data for whole-genome analysis.
Depends on bwa-mem2 CLI and requires samtools for downstream sorting.
Executes alignment commands based on input file formats and read group needs.
Outputs SAM or BAM files containing mapped read positions and quality scores.

SKILL.md

.github/skills/bio-read-alignment-bwa-alignmentView on GitHub ↗

---
name: bio-read-alignment-bwa-alignment
description: Align DNA short reads to reference genomes using bwa-mem2, the faster successor to BWA-MEM. Use when aligning DNA short reads to a reference genome.
tool_type: cli
primary_tool: bwa-mem2
---

## Version Compatibility

Reference examples tested with: GATK 4.5+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# BWA-MEM2 Alignment

**"Align reads with BWA"** → Map DNA reads to a reference genome using BWA-MEM2, the standard aligner for whole-genome and exome sequencing.
- CLI: `bwa-mem2 mem -t 8 ref.fa R1.fq R2.fq | samtools sort -o aligned.bam`

## Build Index

```bash
# Index reference genome (required once)
bwa-mem2 index reference.fa

# Creates: reference.fa.0123, reference.fa.amb, reference.fa.ann, reference.fa.bwt.2bit.64, reference.fa.pac
```

## Basic Alignment

```bash
# Paired-end reads
bwa-mem2 mem -t 8 reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# Single-end reads
bwa-mem2 mem -t 8 reference.fa reads.fq.gz > aligned.sam
```

## Alignment with Read Groups

```bash
# Add read group information (required for GATK)
bwa-mem2 mem -t 8 \
    -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
    reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam
```

## Direct to Sorted BAM

```bash
# Pipe to samtools for sorted BAM output
bwa-mem2 mem -t 8 \
    -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -

# Index the BAM
samtools index aligned.sorted.bam
```

## Mark Duplicates Pipeline

**Goal:** Produce a duplicate-marked, sorted BAM file from raw reads in a single streaming pipeline.

**Approach:** Pipe BWA-MEM2 output through samtools fixmate (to add mate score tags), coordinate sort, and markdup in a single command chain to avoid intermediate files.

```bash
# Full pipeline: align, fixmate, sort, markdup
bwa-mem2 mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools fixmate -m -@ 4 - - | \
    samtools sort -@ 4 - | \
    samtools markdup -@ 4 - aligned.markdup.bam

samtools index aligned.markdup.bam
```

## Common Options

```bash
bwa-mem2 mem -t 8 \         # Threads
    -M \                     # Mark shorter split hits as secondary (Picard compatible)
    -Y \                     # Use soft clipping for supplementary alignments
    -K 100000000 \           # Process INT input bases in each batch
    -R '@RG\tID:s1\tSM:s1' \ # Read group
    reference.fa r1.fq r2.fq
```

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| -t | 1 | Number of threads |
| -k | 19 | Minimum seed length |
| -w | 100 | Band width for extension |
| -r | 1.5 | Re-seeding trigger ratio |
| -c | 500 | Skip seeds with more than INT hits |
| -A | 1 | Match score |
| -B | 4 | Mismatch penalty |
| -O | 6 | Gap open penalty |
| -E | 1 | Gap extension penalty |
| -M | off | Mark secondary alignments |

## Output Filters

```bash
# Filter unmapped and low quality
bwa-mem2 mem -t 8 reference.fa r1.fq r2.fq | \
    samtools view -@ 4 -bS -q 20 -F 4 - | \
    samtools sort -@ 4 -o aligned.filtered.bam -
```

## Split Read Alignment

```bash
# For SV detection, use -Y for soft clipping
bwa-mem2 mem -t 8 -Y reference.fa r1.fq r2.fq > aligned.sam
```

## Memory Requirements

- Index loading: ~10GB for human genome
- Per thread: ~1-2GB
- Typical human WGS: 30-50GB RAM with 8 threads

## BWA-MEM (Alternative)

```bash
# Build index
bwa index reference.fa

# Paired-end alignment
bwa mem -t 8 reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# With read groups
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# Direct to sorted BAM
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -
```

## BWA-MEM vs BWA-MEM2

| Feature | BWA-MEM | BWA-MEM2 |
|---------|---------|----------|
| Status | Active | Archived |
| Speed | 1x | 2-3x faster |
| Index format | .bwt | .bwt.2bit.64 |
| Results | Baseline | Nearly identical |
| Memory | ~5GB | ~10GB |

## Related Skills

- read-qc/fastp-workflow - Preprocess reads before alignment
- alignment-files/alignment-sorting - Post-alignment processing
- alignment-files/duplicate-handling - Mark duplicates
- variant-calling/variant-calling - Call variants from BAM