---
name: bio-ribo-seq-riboseq-preprocessing
description: Preprocess ribosome profiling data including adapter trimming, size selection, rRNA removal, and alignment. Use when preparing Ribo-seq reads for downstream analysis of translation.
tool_type: cli
primary_tool: bowtie2
---
## Version Compatibility
Reference examples tested with: Bowtie2 2.5.3+, STAR 2.7.11+, cutadapt 4.4+, numpy 1.26+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
# Ribo-seq Preprocessing
**"Preprocess my ribosome profiling data"** → Trim adapters, size-select ribosome-protected fragments (typically 28-32 nt), remove rRNA contamination, and align to the genome or transcriptome for translation analysis.
- CLI: `cutadapt` → `bowtie2` or `sortmerna` (rRNA removal) → `STAR` (genome alignment) or `bowtie2` (transcriptome alignment)
## Workflow Overview
```
Raw Ribo-seq FASTQ
|
v
Adapter trimming (cutadapt)
|
v
Size selection (28-32 nt typical)
|
v
rRNA removal (SortMeRNA/bowtie2)
|
v
Alignment to genome (STAR) or transcriptome (bowtie2)
|
v
Quality filtered BAM
```
## Adapter Trimming
**Goal:** Remove 3' adapter sequences from ribosome footprint reads to recover the true insert.
**Approach:** Run cutadapt with the known adapter sequence and length filters to discard fragments outside the expected footprint range.
```bash
# Trim 3' adapter
cutadapt \
-a CTGTAGGCACCATCAAT \
-m 20 \
-M 40 \
-o trimmed.fastq.gz \
input.fastq.gz
```
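To make the trimming step concrete, the following is a minimal Python sketch of what 3' adapter removal does to a single read. It is a simplification under stated assumptions: it matches the adapter exactly, whereas cutadapt tolerates a small error rate by default, and the helper names `trim_3prime_adapter` and `passes_length_filter` are hypothetical, not part of cutadapt.

```python
ADAPTER = "CTGTAGGCACCATCAAT"  # same 3' adapter as in the cutadapt call


def trim_3prime_adapter(seq, adapter=ADAPTER, min_overlap=3):
    """Remove a 3' adapter, or a partial adapter at the read end (exact match only)."""
    # Full adapter found: keep only the bases upstream of it.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Otherwise look for an adapter prefix at the very end of the read,
    # longest candidate first, down to min_overlap bases.
    max_k = min(len(adapter) - 1, len(seq))
    for k in range(max_k, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]
    return seq


def passes_length_filter(seq, min_len=20, max_len=40):
    """Mirror the -m 20 / -M 40 bounds used in the trimming command."""
    return min_len <= len(seq) <= max_len
```

A read consisting of a 30 nt footprint followed by the adapter comes back as the 30 nt footprint, while an adapter dimer trims to an empty string and fails the length filter.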
## Size Selection
**Goal:** Retain only reads corresponding to ribosome-protected fragments (typically 28-32 nt).
**Approach:** Apply minimum and maximum length filters with cutadapt to select the footprint size range.
```bash
# Select ribosome footprint size range
# Typical: 28-32 nt (protected by ribosome)
cutadapt \
-m 28 \
-M 32 \
-o size_selected.fastq.gz \
trimmed.fastq.gz
```
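The same length filter can be sketched in plain Python, which is sometimes handy for reporting how many reads survive. This assumes a gzip-compressed, four-line-per-record FASTQ; `read_fastq` and `size_select` are illustrative names, not an existing API.

```python
import gzip


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a text-mode FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def size_select(in_path, out_path, min_len=28, max_len=32):
    """Keep reads with min_len <= length <= max_len; return (kept, total)."""
    kept = total = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        for header, seq, plus, qual in read_fastq(fin):
            total += 1
            if min_len <= len(seq) <= max_len:
                fout.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
                kept += 1
    return kept, total
```

Calling `size_select("trimmed.fastq.gz", "size_selected.fastq.gz")` returns the kept and total read counts, so the retained fraction can be logged alongside the output file.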
## rRNA Removal
**Goal:** Deplete ribosomal RNA reads that typically constitute the majority of a Ribo-seq library.
**Approach:** Align reads against rRNA reference databases using SortMeRNA or Bowtie2 and collect only unmapped (non-rRNA) reads.
```bash
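# Option 1: SortMeRNA (comprehensive)
# --aligned and --other are output prefixes; rename the non-rRNA FASTQ to
# non_rRNA.fastq.gz (or adjust the paths below) before alignment
sortmerna \
    --ref rRNA_databases/silva-bac-16s-id90.fasta \
    --ref rRNA_databases/silva-euk-18s-id95.fasta \
    --ref rRNA_databases/silva-euk-28s-id98.fasta \
    --reads size_selected.fastq.gz \
    --aligned rRNA_reads \
    --other non_rRNA_reads \
    --fastx \
    --threads 8

# Option 2: Bowtie2 to rRNA index
# --un-gz writes the unaligned (non-rRNA) reads gzip-compressed; plain --un
# would write uncompressed text despite the .gz file name
bowtie2 -x rRNA_index \
    -U size_selected.fastq.gz \
    --un-gz non_rRNA.fastq.gz \
    -S /dev/null \
    -p 8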
```
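A quick way to check how much of the library was rRNA is to compare read counts before and after depletion. The sketch below assumes four-line FASTQ records and accepts either plain or gzip-compressed files; the function names are hypothetical helpers, not part of SortMeRNA or Bowtie2.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTQ in text mode, detecting gzip by its two magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open_maybe_gzip(path) as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4


def rrna_fraction(input_fastq, non_rrna_fastq):
    """Fraction of input reads that were removed as rRNA."""
    total = count_reads(input_fastq)
    kept = count_reads(non_rrna_fastq)
    if total == 0:
        raise ValueError("input FASTQ contains no reads")
    return (total - kept) / total
```

For example, `rrna_fraction("size_selected.fastq.gz", "non_rRNA.fastq.gz")` returns the removed fraction as a value between 0 and 1.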
## Alignment to Transcriptome
**Goal:** Map cleaned ribosome footprint reads to the genome or transcriptome for positional analysis.
**Approach:** Align with STAR (spliced) or Bowtie2 (transcriptome) using stringent filters for uniquely mapped reads with few mismatches.
```bash
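# STAR alignment (spliced, unique mappers only, at most 2 mismatches)
# --alignIntronMax is left at its default: setting it to 1 would disable
# spliced alignment and lose footprints that span exon junctions
# Output: riboseq_Aligned.sortedByCoord.out.bam
STAR --runMode alignReads \
    --genomeDir STAR_index \
    --readFilesIn non_rRNA.fastq.gz \
    --readFilesCommand zcat \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 2 \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix riboseq_

# Or bowtie2 to transcriptome, piped into a coordinate-sorted BAM
bowtie2 -x transcriptome_index \
    -U non_rRNA.fastq.gz \
    --no-unal \
    -p 8 | \
    samtools sort -o aligned.bam -
samtools index aligned.bam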
```
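After a STAR run, the unique-mapping rate can be read from the `Log.final.out` summary (here `riboseq_Log.final.out`, given the prefix used above). The sketch below assumes the usual `label | value` layout of that file and the label `Uniquely mapped reads %`; check both against the log your STAR version writes.

```python
def parse_star_log(text):
    """Parse 'label | value' lines from a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def unique_mapping_percent(stats):
    """Return the uniquely mapped percentage as a float (label is an assumption)."""
    return float(stats["Uniquely mapped reads %"].rstrip("%"))
```

Typical use would be `unique_mapping_percent(parse_star_log(open("riboseq_Log.final.out").read()))`.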
## Quality Metrics
**Goal:** Assess preprocessing success by checking read length distribution and mapping rates.
**Approach:** Extract read lengths from the aligned BAM and run samtools flagstat to verify expected footprint sizes and mapping efficiency.
```bash
# Check read length distribution
samtools view aligned.bam | \
awk '{print length($10)}' | \
sort | uniq -c | sort -k2n
# Expected: Peak at 28-30 nt
# Check mapping rate
samtools flagstat aligned.bam
```
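To track the mapping rate programmatically, the text output of `samtools flagstat` can be parsed. This is a minimal sketch that assumes the default text layout (`N + M in total ...` and `N + M mapped (...)` lines) and counts QC-passed records only; `parse_flagstat` is a hypothetical helper name.

```python
import re

# Match the overall lines only, not "primary mapped" or other variants.
_TOTAL = re.compile(r"^(\d+) \+ (\d+) in total")
_MAPPED = re.compile(r"^(\d+) \+ (\d+) mapped \(")


def parse_flagstat(text):
    """Return total, mapped, and mapping rate from samtools flagstat text output."""
    total = mapped = None
    for line in text.splitlines():
        match = _TOTAL.match(line)
        if match:
            total = int(match.group(1))
        match = _MAPPED.match(line)
        if match:
            mapped = int(match.group(1))
    if total is None or mapped is None:
        raise ValueError("unrecognised flagstat output")
    rate = mapped / total if total else 0.0
    return {"total": total, "mapped": mapped, "rate": rate}
```

Note that the flagstat total counts alignment records, so secondary or supplementary alignments, if present, inflate it relative to the number of reads.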
## Python Preprocessing
```python
from collections import Counter

import pysam


def get_length_distribution(bam_path):
    '''Count mapped reads in a BAM file by read length'''
    lengths = Counter()
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam:
            if not read.is_unmapped:
                lengths[read.query_length] += 1
    return lengths


def filter_by_length(bam_in, bam_out, min_len=28, max_len=32):
    '''Write reads with min_len <= length <= max_len to a new BAM'''
    with pysam.AlignmentFile(bam_in, 'rb') as infile:
        # template=infile copies the header so the output stays valid
        with pysam.AlignmentFile(bam_out, 'wb', template=infile) as outfile:
            for read in infile:
                if min_len <= read.query_length <= max_len:
                    outfile.write(read)
```
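The `Counter` returned by `get_length_distribution` can be condensed into a few numbers for a quick sanity check. The sketch below is one possible summary, assuming a 28-32 nt footprint window; `summarize_lengths` is a hypothetical helper and needs only the standard library.

```python
from collections import Counter


def summarize_lengths(lengths, min_len=28, max_len=32):
    """Summarize a {read_length: count} mapping: total, modal length, window fraction."""
    total = sum(lengths.values())
    if total == 0:
        raise ValueError("no reads counted")
    in_window = sum(n for length, n in lengths.items()
                    if min_len <= length <= max_len)
    # Most frequent length; ties resolve to the shorter length.
    modal = min(lengths, key=lambda length: (-lengths[length], length))
    return {
        "total": total,
        "modal_length": modal,
        "fraction_in_window": in_window / total,
    }
```

A modal length outside the expected footprint range, or a low fraction inside the window, suggests revisiting adapter trimming or size selection.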
## Related Skills
- ribosome-periodicity - Validate preprocessing quality
- read-qc - General quality control
- read-alignment - Alignment concepts