bio-small-rna-seq-mirdeep2-analysis
$
npx mdskill add GPTomics/bioSkills/bio-small-rna-seq-mirdeep2-analysisDiscover and quantify miRNAs from small RNA-seq data using miRDeep2
- Identify novel miRNAs and profile known miRNAs in small RNA-seq datasets
- Uses miRDeep2 CLI tools mapper.pl, miRDeep2.pl, and quantifier.pl
- Maps reads to genome, scores precursor hairpins, and predicts miRNA candidates
- Generates ARF files and miRNA abundance tables for downstream analysis
SKILL.md
.github/skills/bio-small-rna-seq-mirdeep2-analysisView on GitHub ↗
---
name: bio-small-rna-seq-mirdeep2-analysis
description: Discover novel miRNAs and quantify known miRNAs using miRDeep2 de novo prediction from small RNA-seq data. Use when identifying new miRNAs or performing comprehensive miRNA profiling with discovery.
tool_type: cli
primary_tool: miRDeep2
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# miRDeep2 Analysis
**"Discover novel miRNAs from my small RNA-seq data"** → Identify known and novel miRNAs by mapping reads to the genome and scoring precursor hairpin structures using a probabilistic model.
- CLI: `mapper.pl` for read mapping, `miRDeep2.pl` for de novo miRNA prediction
## Workflow Overview
```
Collapsed reads (FASTA)
|
v
mapper.pl ---------> Align to genome, create ARF file
|
v
miRDeep2.pl -------> Predict novel miRNAs, quantify known
|
v
quantifier.pl -----> Quantify known miRNAs only (optional)
```
## Step 1: Prepare Genome Index
**Goal:** Build a bowtie index from the reference genome for miRDeep2 read mapping.
**Approach:** Run bowtie-build on the genome FASTA to create the index files required by mapper.pl.
```bash
# Build bowtie index for miRDeep2 mapper
bowtie-build genome.fa genome_index
```
## Step 2: Map Reads with mapper.pl
**Goal:** Collapse identical reads and align them to the reference genome.
**Approach:** Use mapper.pl to clip adapters, filter by length, collapse duplicates, and map with bowtie to produce ARF alignment files.
```bash
# Collapse reads and map to genome
mapper.pl reads.fastq \
-e \
-h \
-i \
-j \
-k TGGAATTCTCGGGTGCCAAGG \
-l 18 \
-m \
-p genome_index \
-s reads_collapsed.fa \
-t reads_vs_genome.arf \
-v
# Key options:
# -e: Input is FASTQ
# -h: Parse Illumina headers
# -k: Clip 3' adapter
# -l 18: Discard reads < 18 nt
# -m: Collapse reads
# -p: Bowtie index prefix
# -s: Output collapsed FASTA
# -t: Output ARF alignment file
```
## Step 3: Run miRDeep2 Prediction
**Goal:** Predict novel miRNAs and quantify known miRNAs from aligned small RNA reads.
**Approach:** Run miRDeep2.pl with collapsed reads, genome, alignments, and miRBase references to score candidate miRNA loci.
```bash
# Predict novel miRNAs
miRDeep2.pl \
reads_collapsed.fa \
genome.fa \
reads_vs_genome.arf \
mature_ref.fa \
mature_other.fa \
hairpin_ref.fa \
-t Human \
2> report.log
# Arguments:
# 1. Collapsed reads FASTA
# 2. Genome FASTA
# 3. Alignment ARF file
# 4. Known mature miRNAs (same species)
# 5. Known mature miRNAs (other species, for conservation)
# 6. Known hairpin precursors
# -t: Species for miRBase lookup
```
## Prepare miRBase References
**Goal:** Download and extract species-specific miRNA references from miRBase.
**Approach:** Fetch mature and hairpin FASTA files from miRBase, then grep species-specific entries by prefix.
```bash
# Download from miRBase
wget https://www.mirbase.org/download/mature.fa
wget https://www.mirbase.org/download/hairpin.fa
# Extract species-specific sequences
grep -A1 ">hsa-" mature.fa > mature_human.fa
grep -A1 ">hsa-" hairpin.fa > hairpin_human.fa
```
## Step 4: Quantify Known miRNAs Only
**Goal:** Quantify expression of known miRNAs without running novel discovery.
**Approach:** Run quantifier.pl with hairpin and mature references against collapsed reads for fast quantification.
```bash
# If not doing novel discovery
quantifier.pl \
-p hairpin_human.fa \
-m mature_human.fa \
-r reads_collapsed.fa \
-t hsa
# Output: miRNAs_expressed_all_samples.csv
```
## Output Files
| File | Description |
|------|-------------|
| result_*.html | Interactive results report |
| result_*.csv | Predicted novel miRNAs with scores |
| miRNAs_expressed_all_samples*.csv | Expression quantification |
| pdfs_*.pdf | Secondary structure plots |
## Interpret miRDeep2 Scores
```
Score interpretation:
>10: High confidence novel miRNA
5-10: Medium confidence
1-5: Low confidence, needs validation
<1: Likely false positive
Key metrics:
- miRDeep2 score: Overall confidence
- Total read count: Expression level
- Mature/star ratio: Strand bias (expect asymmetry)
- Randfold p-value: Structural stability
```
## Parse Results in Python
**Goal:** Load miRDeep2 prediction and quantification results into pandas DataFrames.
**Approach:** Parse tab-delimited output files and filter novel miRNA predictions by confidence score threshold.
```python
import pandas as pd
def parse_mirdeep2_results(csv_path):
'''Parse miRDeep2 novel miRNA predictions'''
df = pd.read_csv(csv_path, sep='\t', skiprows=1)
# Filter high-confidence predictions
# Score > 10 indicates high confidence novel miRNA
high_conf = df[df['miRDeep2 score'] > 10]
return high_conf
# Parse quantification results
def parse_quantifier_output(csv_path):
'''Parse quantifier.pl expression matrix'''
df = pd.read_csv(csv_path, sep='\t')
return df
```
## Related Skills
- smrna-preprocessing - Prepare reads for miRDeep2
- mirge3-analysis - Faster quantification alternative
- differential-mirna - DE analysis of miRNA counts
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.