bio-small-rna-seq-mirdeep2-analysis

Name: bio-small-rna-seq-mirdeep2-analysis
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-small-rna-seq-mirdeep2-analysis

Discover and quantify miRNAs from small RNA-seq data using miRDeep2

Identify novel miRNAs and profile known miRNAs in small RNA-seq datasets
Uses miRDeep2 CLI tools mapper.pl, miRDeep2.pl, and quantifier.pl
Maps reads to genome, scores precursor hairpins, and predicts miRNA candidates
Generates ARF files and miRNA abundance tables for downstream analysis

SKILL.md

.github/skills/bio-small-rna-seq-mirdeep2-analysisView on GitHub ↗

---
name: bio-small-rna-seq-mirdeep2-analysis
description: Discover novel miRNAs and quantify known miRNAs using miRDeep2 de novo prediction from small RNA-seq data. Use when identifying new miRNAs or performing comprehensive miRNA profiling with discovery.
tool_type: cli
primary_tool: miRDeep2
---

## Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# miRDeep2 Analysis

**"Discover novel miRNAs from my small RNA-seq data"** → Identify known and novel miRNAs by mapping reads to the genome and scoring precursor hairpin structures using a probabilistic model.
- CLI: `mapper.pl` for read mapping, `miRDeep2.pl` for de novo miRNA prediction

## Workflow Overview

```
Collapsed reads (FASTA)
    |
    v
mapper.pl ---------> Align to genome, create ARF file
    |
    v
miRDeep2.pl -------> Predict novel miRNAs, quantify known
    |
    v
quantifier.pl -----> Quantify known miRNAs only (optional)
```

## Step 1: Prepare Genome Index

**Goal:** Build a bowtie index from the reference genome for miRDeep2 read mapping.

**Approach:** Run bowtie-build on the genome FASTA to create the index files required by mapper.pl.

```bash
# Build bowtie index for miRDeep2 mapper
bowtie-build genome.fa genome_index
```

## Step 2: Map Reads with mapper.pl

**Goal:** Collapse identical reads and align them to the reference genome.

**Approach:** Use mapper.pl to clip adapters, filter by length, collapse duplicates, and map with bowtie to produce ARF alignment files.

```bash
# Collapse reads and map to genome
mapper.pl reads.fastq \
    -e \
    -h \
    -i \
    -j \
    -k TGGAATTCTCGGGTGCCAAGG \
    -l 18 \
    -m \
    -p genome_index \
    -s reads_collapsed.fa \
    -t reads_vs_genome.arf \
    -v

# Key options:
# -e: Input is FASTQ
# -h: Parse Illumina headers
# -k: Clip 3' adapter
# -l 18: Discard reads < 18 nt
# -m: Collapse reads
# -p: Bowtie index prefix
# -s: Output collapsed FASTA
# -t: Output ARF alignment file
```

## Step 3: Run miRDeep2 Prediction

**Goal:** Predict novel miRNAs and quantify known miRNAs from aligned small RNA reads.

**Approach:** Run miRDeep2.pl with collapsed reads, genome, alignments, and miRBase references to score candidate miRNA loci.

```bash
# Predict novel miRNAs
miRDeep2.pl \
    reads_collapsed.fa \
    genome.fa \
    reads_vs_genome.arf \
    mature_ref.fa \
    mature_other.fa \
    hairpin_ref.fa \
    -t Human \
    2> report.log

# Arguments:
# 1. Collapsed reads FASTA
# 2. Genome FASTA
# 3. Alignment ARF file
# 4. Known mature miRNAs (same species)
# 5. Known mature miRNAs (other species, for conservation)
# 6. Known hairpin precursors
# -t: Species for miRBase lookup
```

## Prepare miRBase References

**Goal:** Download and extract species-specific miRNA references from miRBase.

**Approach:** Fetch mature and hairpin FASTA files from miRBase, then grep species-specific entries by prefix.

```bash
# Download from miRBase
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hairpin.fa

# Extract species-specific sequences
grep -A1 ">hsa-" mature.fa > mature_human.fa
grep -A1 ">hsa-" hairpin.fa > hairpin_human.fa
```

## Step 4: Quantify Known miRNAs Only

**Goal:** Quantify expression of known miRNAs without running novel discovery.

**Approach:** Run quantifier.pl with hairpin and mature references against collapsed reads for fast quantification.

```bash
# If not doing novel discovery
quantifier.pl \
    -p hairpin_human.fa \
    -m mature_human.fa \
    -r reads_collapsed.fa \
    -t hsa

# Output: miRNAs_expressed_all_samples.csv
```

## Output Files

| File | Description |
|------|-------------|
| result_*.html | Interactive results report |
| result_*.csv | Predicted novel miRNAs with scores |
| miRNAs_expressed_all_samples*.csv | Expression quantification |
| pdfs_*.pdf | Secondary structure plots |

## Interpret miRDeep2 Scores

```
Score interpretation:
>10: High confidence novel miRNA
5-10: Medium confidence
1-5: Low confidence, needs validation
<1: Likely false positive

Key metrics:
- miRDeep2 score: Overall confidence
- Total read count: Expression level
- Mature/star ratio: Strand bias (expect asymmetry)
- Randfold p-value: Structural stability
```

## Parse Results in Python

**Goal:** Load miRDeep2 prediction and quantification results into pandas DataFrames.

**Approach:** Parse tab-delimited output files and filter novel miRNA predictions by confidence score threshold.

```python
import pandas as pd

def parse_mirdeep2_results(csv_path):
    '''Parse miRDeep2 novel miRNA predictions'''
    df = pd.read_csv(csv_path, sep='\t', skiprows=1)

    # Filter high-confidence predictions
    # Score > 10 indicates high confidence novel miRNA
    high_conf = df[df['miRDeep2 score'] > 10]

    return high_conf

# Parse quantification results
def parse_quantifier_output(csv_path):
    '''Parse quantifier.pl expression matrix'''
    df = pd.read_csv(csv_path, sep='\t')
    return df
```

## Related Skills

- smrna-preprocessing - Prepare reads for miRDeep2
- mirge3-analysis - Faster quantification alternative
- differential-mirna - DE analysis of miRNA counts