bio-read-alignment-star-alignment

$npx mdskill add GPTomics/bioSkills/bio-read-alignment-star-alignment

Align RNA-seq reads with STAR for splice-aware genome mapping

  • Solve RNA-seq alignment requiring accurate splice junction detection
  • Uses STAR CLI, genome index, and FASTQ read files
  • Chooses two-pass mode to discover novel splice junctions
  • Produces sorted BAM files with aligned reads for downstream analysis
SKILL.md
.github/skills/bio-read-alignment-star-alignmentView on GitHub ↗
---
name: bio-read-alignment-star-alignment
description: Align RNA-seq reads with STAR (Spliced Transcripts Alignment to a Reference). Supports two-pass mode for novel splice junction discovery. Use when aligning RNA-seq data requiring splice-aware alignment.
tool_type: cli
primary_tool: STAR
---

## Version Compatibility

Reference examples tested with: STAR 2.7.11+, Subread 2.0+, fastp 0.23+, kallisto 0.50+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# STAR RNA-seq Alignment

**"Align RNA-seq reads with STAR"** → Map RNA-seq reads to a reference genome with fast, sensitive splice-aware alignment. Preferred for large datasets and downstream fusion/chimeric read detection.
- CLI: `STAR --runMode alignReads --genomeDir index/ --readFilesIn R1.fq R2.fq --outSAMtype BAM SortedByCoordinate`

## Generate Genome Index

```bash
# Basic index generation
STAR --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeDir star_index/ \
    --genomeFastaFiles reference.fa \
    --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 100    # Read length - 1
```

## Index with Specific Read Length

```bash
# For 150bp reads, use sjdbOverhang=149
STAR --runMode genomeGenerate \
    --runThreadN 8 \
    --genomeDir star_index_150/ \
    --genomeFastaFiles reference.fa \
    --sjdbGTFfile annotation.gtf \
    --sjdbOverhang 149
```

## Basic Alignment

```bash
# Paired-end alignment
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn reads_1.fq.gz reads_2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate
```

## Single-End Alignment

```bash
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn reads.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate
```

## Two-Pass Mode

```bash
# Two-pass mode for better novel junction detection
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --twopassMode Basic
```

## Quantification Mode

```bash
# Output gene counts (like featureCounts)
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts
```

Output: `sample_ReadsPerGene.out.tab` with columns:
1. Gene ID
2. Unstranded counts
3. Forward strand counts
4. Reverse strand counts

## ENCODE Options

```bash
# ENCODE recommended settings
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMattributes NH HI AS NM MD \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverReadLmax 0.04 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1
```

## Fusion Detection

```bash
# For chimeric/fusion detection
STAR --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn r1.fq.gz r2.fq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --chimSegmentMin 12 \
    --chimJunctionOverhangMin 8 \
    --chimOutType Junctions WithinBAM SoftClip \
    --chimMainSegmentMultNmax 1
```

## Output Files

| File | Description |
|------|-------------|
| *Aligned.sortedByCoord.out.bam | Sorted BAM file |
| *Log.final.out | Alignment summary statistics |
| *Log.out | Detailed log |
| *SJ.out.tab | Splice junctions |
| *ReadsPerGene.out.tab | Gene counts (if --quantMode) |
| *Chimeric.out.junction | Fusion candidates (if chimeric) |

## Memory Requirements

```bash
# Reduce memory for limited systems
STAR --genomeLoad NoSharedMemory \
    --limitBAMsortRAM 10000000000 \  # 10GB for sorting
    ...

# For very large genomes, limit during index generation
STAR --runMode genomeGenerate \
    --limitGenomeGenerateRAM 31000000000 \  # 31GB
    ...
```

## Shared Memory Mode

```bash
# Load genome into shared memory (for multiple samples)
STAR --genomeLoad LoadAndExit --genomeDir star_index/

# Run alignments (faster startup)
STAR --genomeLoad LoadAndKeep --genomeDir star_index/ ...

# Remove from memory when done
STAR --genomeLoad Remove --genomeDir star_index/
```

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| --runThreadN | 1 | Number of threads |
| --sjdbOverhang | 100 | Read length - 1 |
| --outFilterMultimapNmax | 10 | Max multi-mapping |
| --alignIntronMax | 0 | Max intron size |
| --outFilterMismatchNmax | 10 | Max mismatches |
| --outSAMtype | SAM | Output format |
| --quantMode | - | GeneCounts for counting |
| --twopassMode | None | Basic for two-pass |

## Related Skills

- rna-quantification/featurecounts-counting - Alternative counting
- rna-quantification/alignment-free-quant - Salmon/kallisto alternative
- differential-expression/deseq2-basics - Downstream DE analysis
- read-qc/fastp-workflow - Preprocess reads
More from GPTomics/bioSkills