bio-read-alignment-hisat2-alignment
$ npx mdskill add GPTomics/bioSkills/bio-read-alignment-hisat2-alignment
Align RNA-seq reads efficiently with HISAT2 when memory is limited.
- Handles gene expression quantification workflows requiring splice-aware mapping.
- Depends on HISAT2 CLI and requires compatible samtools 1.19+ for post-processing.
- Executes alignment commands directly via CLI without external service dependencies.
- Outputs sorted BAM files ready for downstream analysis pipelines.
---
name: bio-read-alignment-hisat2-alignment
description: Align RNA-seq reads with HISAT2, a memory-efficient splice-aware aligner. Use when STAR's memory requirements are too high or for general RNA-seq alignment.
tool_type: cli
primary_tool: HISAT2
---
## Version Compatibility
Reference examples tested with: samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
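The version check described above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: `version_ge` is a hypothetical helper name, the comparison relies on GNU `sort -V`, and the first line of `samtools --version` is assumed to look like `samtools 1.19.2`.

```bash
#!/usr/bin/env bash
# Sketch: compare dotted version strings using GNU sort -V (assumes GNU coreutils).
# version_ge is a hypothetical helper name, not part of HISAT2 or samtools.
version_ge() {
    # Succeeds when $1 >= $2 in version order.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether each required tool is on PATH; do not abort if one is missing.
for tool in hisat2 samtools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: not found on PATH"
    fi
done

# Assumes the first line of `samtools --version` looks like "samtools 1.19.2".
if command -v samtools >/dev/null 2>&1; then
    installed=$(samtools --version | awk 'NR == 1 { print $2 }')
    if version_ge "$installed" "1.19"; then
        echo "samtools $installed satisfies 1.19+"
    else
        echo "samtools $installed is older than 1.19; confirm flags with --help"
    fi
fi
```

If the installed version is older than the tested one, the examples may still work, but each flag is worth confirming with `--help` first.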
# HISAT2 RNA-seq Alignment
**"Align RNA-seq reads with HISAT2"** → Map RNA-seq reads to a reference genome with splice-aware alignment. Suitable for gene expression quantification workflows.
- CLI: `hisat2 -x index -1 R1.fq -2 R2.fq | samtools sort -o aligned.bam`
## Build Index
```bash
# Basic index (no annotation)
hisat2-build -p 8 reference.fa hisat2_index
# Index with splice sites and exons (recommended)
hisat2_extract_splice_sites.py annotation.gtf > splice_sites.txt
hisat2_extract_exons.py annotation.gtf > exons.txt
hisat2-build -p 8 \
--ss splice_sites.txt \
--exon exons.txt \
reference.fa hisat2_index
```
## Basic Alignment
```bash
# Paired-end reads
hisat2 -p 8 -x hisat2_index \
-1 reads_1.fq.gz -2 reads_2.fq.gz \
-S aligned.sam
# Single-end reads
hisat2 -p 8 -x hisat2_index \
-U reads.fq.gz \
-S aligned.sam
```
## Direct to Sorted BAM
```bash
# Pipe to samtools
hisat2 -p 8 -x hisat2_index \
-1 r1.fq.gz -2 r2.fq.gz | \
samtools sort -@ 4 -o aligned.sorted.bam -
samtools index aligned.sorted.bam
```
## Stranded Libraries
```bash
# Forward stranded (e.g., Ligation)
hisat2 -p 8 -x hisat2_index \
--rna-strandness FR \
-1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
# Reverse stranded (e.g., dUTP, TruSeq - most common)
hisat2 -p 8 -x hisat2_index \
--rna-strandness RF \
-1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
# Single-end stranded (use F for forward, R for reverse)
hisat2 -p 8 -x hisat2_index \
--rna-strandness F \
-U reads.fq.gz -S aligned.sam
```
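When many libraries are processed in one script, the choice of `--rna-strandness` value can be centralized. The sketch below assumes a hypothetical helper, `strandness_value`, and illustrative labels (`forward`, `reverse`, `unstranded`, `paired`, `single`) that are conventions of this example, not HISAT2 options. It only prints the resulting command and does not run HISAT2.

```bash
#!/usr/bin/env bash
# Sketch: map a library description to the --rna-strandness value used above.
# strandness_value is a hypothetical helper; the protocol/layout labels are
# illustrative conventions, not HISAT2 options.
strandness_value() {
    local protocol=$1 layout=$2
    case "${protocol}:${layout}" in
        forward:paired)  echo "FR" ;;
        reverse:paired)  echo "RF" ;;
        forward:single)  echo "F" ;;
        reverse:single)  echo "R" ;;
        unstranded:*)    echo "" ;;
        *) echo "unknown library: ${protocol} ${layout}" >&2; return 1 ;;
    esac
}

# Build the argument list as an array so an unstranded library adds no flag.
strand_args=()
value=$(strandness_value reverse paired)
if [ -n "$value" ]; then
    strand_args=(--rna-strandness "$value")
fi

# Print the command that would be run (HISAT2 itself is not invoked here).
echo "hisat2 -p 8 -x hisat2_index ${strand_args[*]} -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam"
```

Keeping the flag in an array avoids passing an empty `--rna-strandness` argument for unstranded libraries.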
## Novel Splice Junction Discovery
```bash
# Output novel splice junctions
hisat2 -p 8 -x hisat2_index \
--novel-splicesite-outfile novel_splices.txt \
-1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
# Use known + novel junctions for subsequent alignments
hisat2 -p 8 -x hisat2_index \
--novel-splicesite-infile novel_splices.txt \
-1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
```
## Two-Pass Alignment (Manual)
**Goal:** Improve splice junction sensitivity by discovering novel junctions across all samples in a first pass, then realigning with the combined junction set.
**Approach:** Run HISAT2 on each sample to extract novel splice sites, merge and deduplicate junctions across samples, then realign all samples using the combined junction catalog.
```bash
# Pass 1: Discover junctions from all samples
for r1 in *_R1.fq.gz; do
    r2=${r1/_R1/_R2}
    base=$(basename "$r1" _R1.fq.gz)
    hisat2 -p 8 -x hisat2_index \
        --novel-splicesite-outfile "${base}_splices.txt" \
        -1 "$r1" -2 "$r2" -S /dev/null
done
# Combine and deduplicate junctions. The output name must not match
# *_splices.txt, or a rerun would read the previous combined file back in.
sort -u *_splices.txt > combined_junctions.txt
# Pass 2: Realign with all junctions
for r1 in *_R1.fq.gz; do
    r2=${r1/_R1/_R2}
    base=$(basename "$r1" _R1.fq.gz)
    hisat2 -p 8 -x hisat2_index \
        --novel-splicesite-infile combined_junctions.txt \
        -1 "$r1" -2 "$r2" | \
        samtools sort -@ 4 -o "${base}.sorted.bam" -
done
```
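The merge step above only deduplicates. If junctions seen in a single sample are considered unreliable, one possible refinement is to keep only junctions supported by a minimum number of samples. The sketch below assumes the per-sample files have four tab-separated columns (chromosome, left flank, right flank, strand). `merge_junctions` and the threshold of 2 are hypothetical choices for illustration, and the demo runs on synthetic data rather than real HISAT2 output.

```bash
#!/usr/bin/env bash
set -euo pipefail

# merge_junctions is a hypothetical helper: keep junctions present in at least
# <min_samples> of the given per-sample files. Assumes 4 tab-separated columns
# per line (chromosome, left flank, right flank, strand).
merge_junctions() {
    local min_samples=$1; shift
    local f
    for f in "$@"; do
        sort -u "$f"          # count each junction at most once per sample
    done | sort | uniq -c |
        awk -v min="$min_samples" '$1 >= min { print $2 "\t" $3 "\t" $4 "\t" $5 }' |
        sort -k1,1 -k2,2n -k3,3n
}

# Demo on synthetic junction files (not real HISAT2 output).
workdir=$(mktemp -d)
printf 'chr1\t100\t200\t+\nchr1\t300\t400\t-\n' > "$workdir/a_splices.txt"
printf 'chr1\t100\t200\t+\nchr2\t50\t90\t+\n'  > "$workdir/b_splices.txt"

# Only chr1:100-200(+) appears in both samples, so it is the only line kept.
merge_junctions 2 "$workdir"/*_splices.txt > "$workdir/combined_junctions.txt"
cat "$workdir/combined_junctions.txt"
```

A higher threshold trades sensitivity for specificity. Whether that is appropriate depends on the number of samples and how much rare splicing matters to the analysis.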
## Read Group Information
```bash
hisat2 -p 8 -x hisat2_index \
--rg-id sample1 \
--rg SM:sample1 \
--rg PL:ILLUMINA \
--rg LB:lib1 \
-1 r1.fq.gz -2 r2.fq.gz -S aligned.sam
```
## Downstream Quantification
```bash
# Output name-sorted BAM for htseq-count
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz | \
samtools sort -n -@ 4 -o aligned.namesorted.bam -
# Or coordinate-sorted for featureCounts
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz | \
samtools sort -@ 4 -o aligned.sorted.bam -
```
## Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| -p | 1 | Number of threads |
| -x | - | Index basename |
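| --rna-strandness | unstranded | Strand-specific library type: FR/RF (paired-end), F/R (single-end) |
| --dta | off | Report alignments tailored for transcript assemblers such as StringTie |
| --dta-cufflinks | off | Report alignments tailored for Cufflinks |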
| --min-intronlen | 20 | Minimum intron length |
| --max-intronlen | 500000 | Maximum intron length |
| -k | 5 | Max alignments to report |
## For StringTie/Cufflinks
```bash
# Use --dta for StringTie (replace with --dta-cufflinks when assembling with Cufflinks)
hisat2 -p 8 -x hisat2_index \
--dta \
-1 r1.fq.gz -2 r2.fq.gz | \
samtools sort -@ 4 -o aligned.sorted.bam -
```
## Alignment Summary
```bash
# HISAT2 prints summary to stderr
hisat2 -p 8 -x hisat2_index -1 r1.fq.gz -2 r2.fq.gz -S aligned.sam 2> summary.txt
```
Example (abbreviated; a full summary also reports discordant and single-mate alignments):
```
50000000 reads; of these:
  50000000 (100.00%) were paired; of these:
    2500000 (5.00%) aligned concordantly 0 times
    45000000 (90.00%) aligned concordantly exactly 1 time
    2500000 (5.00%) aligned concordantly >1 times
95.00% overall alignment rate
```
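For batch runs it can help to pull the overall rate out of each summary file and flag low values. The sketch below is one way to do that with `awk`. `overall_rate` is a hypothetical helper and `MIN_RATE=70` is an illustrative threshold, not a HISAT2 recommendation. The demo parses a synthetic summary that mirrors the example above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# overall_rate is a hypothetical helper: print the overall alignment rate
# (without the % sign) from a HISAT2 summary file.
overall_rate() {
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# Demo on a synthetic summary mirroring the example above.
summary=$(mktemp)
cat > "$summary" <<'EOF'
50000000 reads; of these:
  50000000 (100.00%) were paired; of these:
    2500000 (5.00%) aligned concordantly 0 times
    45000000 (90.00%) aligned concordantly exactly 1 time
    2500000 (5.00%) aligned concordantly >1 times
95.00% overall alignment rate
EOF

rate=$(overall_rate "$summary")
echo "overall alignment rate: ${rate}%"

# MIN_RATE is an illustrative threshold, not a HISAT2 recommendation.
MIN_RATE=70
if awk -v r="$rate" -v min="$MIN_RATE" 'BEGIN { exit !(r >= min) }'; then
    echo "PASS"
else
    echo "WARN: alignment rate below ${MIN_RATE}%"
fi
```

The numeric comparison is delegated to `awk` because shell `[ ... -ge ... ]` tests only handle integers.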
## Memory Comparison
| Aligner | Human Genome Memory |
|---------|-------------------|
| STAR | ~30GB |
| HISAT2 | ~8GB |
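
The table can be turned into a rough pre-flight check. In the sketch below, `suggest_aligner` is a hypothetical helper, and the 32 GB and 10 GB cut-offs are assumptions derived from the approximate figures above plus some headroom, not measured requirements. Reading `/proc/meminfo` works on Linux only.

```bash
#!/usr/bin/env bash
# suggest_aligner is a hypothetical helper. The cut-offs are assumptions based
# on the approximate figures in the table plus headroom, not measured limits.
suggest_aligner() {
    local mem_gb=$1
    if [ "$mem_gb" -ge 32 ]; then
        echo "STAR or HISAT2"
    elif [ "$mem_gb" -ge 10 ]; then
        echo "HISAT2"
    else
        echo "neither (consider alignment-free quantification)"
    fi
}

# Linux-only: MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
    echo "Detected ${total_gb} GB RAM: $(suggest_aligner "$total_gb")"
fi
```

Actual memory use depends on the genome, the index options, and the thread count, so the result is a hint rather than a guarantee.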
## Related Skills
- read-alignment/star-alignment - Alternative with more features
- rna-quantification/featurecounts-counting - Count aligned reads
- rna-quantification/alignment-free-quant - Skip alignment entirely
- differential-expression/deseq2-basics - Downstream DE analysis