bio-outlier-splicing-detection
$
npx mdskill add GPTomics/bioSkills/bio-outlier-splicing-detectionDetects aberrant splicing in rare-disease patients
- Solves undiagnosed Mendelian disease via single-sample outlier detection.
- Depends on FRASER, OUTRIDER, LeafcutterMD, and DROP tools.
- Decides using beta-binomial autoencoders and Dirichlet-multinomial models.
- Delivers clinical diagnostics for cryptic splicing validation.
SKILL.md
.github/skills/bio-outlier-splicing-detectionView on GitHub ↗
---
name: bio-outlier-splicing-detection
description: Detects aberrant splicing in single rare-disease patients vs a control panel using FRASER 2.0 (Bioconductor; Beta-binomial autoencoder on Intron Jaccard Index, default delta cutoff 0.1, q hyperparameter), OUTRIDER (gene-level outlier expression via autoencoder denoising), LeafcutterMD (Dirichlet-multinomial outlier mode of LeafCutter for annotation-free junctions), and DROP (Snakemake pipeline integrating FRASER2 + OUTRIDER + monoallelic expression for clinical diagnostics). The statistical model is fundamentally different from differential splicing — single-sample-vs-cohort outlier detection rather than two-group comparison. Standard tool in EU rare-disease (Solve-RD) and NIH UDN programs. Use when applying RNA-seq to undiagnosed Mendelian disease, validating predicted splice variants in clinical samples, or detecting cryptic splicing in disease tissue.
tool_type: r
primary_tool: FRASER
---
## Version Compatibility
Reference examples tested with: FRASER 2.0 (>=1.99.0), OUTRIDER 1.20+, LeafcutterMD via leafcutter 0.2.9+, DROP 1.4+, R 4.4+, BiocManager 1.30+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Outlier Splicing Detection
For clinical RNA-seq diagnostics in rare disease, the question is not "what differs between groups?" but "what is aberrant in this single patient relative to a panel of unaffected samples?". The statistical framework is **single-sample-vs-cohort outlier detection**, fundamentally different from two-group differential splicing. Tools in this space are designed for clinical Mendelian diagnostic settings.
## Tool Taxonomy
| Tool | Statistic | Test target | Fails when |
|------|-----------|-------------|------------|
| FRASER 2.0 | Beta-binomial autoencoder on Intron Jaccard Index | Splicing outliers (per-sample, per-junction) | Cohort <20 samples; tissue mismatch |
| OUTRIDER | Autoencoder-denoised expression Z-score | Gene-level expression outliers (LoF, monoallelic) | Cohort <20 samples |
| LeafcutterMD | Dirichlet-multinomial outlier mode | Annotation-free intron usage | Beta-binomial fits poorly OR few controls |
| DROP | Snakemake pipeline | All of above + monoallelic expression | Pipeline complexity for small projects |
Core reference: **FRASER 2.0** for splicing outliers, **OUTRIDER** for expression outliers, **DROP** to combine. Standard tool in EU rare-disease programs (Solve-RD) and NIH UDN.
## Decision Tree by Diagnostic Scenario
| Scenario | Recommended approach |
|----------|----------------------|
| Single rare-disease patient + panel of n>=50 controls | FRASER 2.0 (Intron Jaccard Index) |
| Single patient + small panel (n=20-50) | FRASER 2.0 with auxiliary GTEx controls; tune q carefully |
| Patient + cohort <20 | Insufficient for outlier detection; consider differential or recruit more samples |
| Outlier expression suspected (loss of function, monoallelic) | OUTRIDER on same cohort |
| Annotation-free outlier (cryptic exon, novel junction) | LeafcutterMD |
| Integrated diagnostic pipeline (splicing + expression + MAE) | DROP |
| TDP-43 ALS post-mortem brain (cryptic exons) | FRASER 2.0; expect UNC13A, STMN2, ATG4B |
| SF3B1-mutant cancer sample | FRASER 2.0 with cohort-matched RNA-seq; expect cryptic 3'ss |
| Familial dysautonomia (ELP1) | FRASER 2.0 in fibroblast/iPSC; CNS tissue gives strongest signal |
| Stargardt deep-intronic ABCA4 | FRASER 2.0 in retina-relevant tissue |
| Solid tumor splicing biomarker | Differential splicing (n>=10 vs cohort) — see differential-splicing skill |
| RNA validation of SpliceAI hit | FRASER 2.0 + cross-reference with predicted variant location |
## When to Use Outlier vs Differential
**Outlier regime** (this skill):
- Single patient or small case series vs control panel
- Question: "What is aberrant in this patient?"
- Statistical model: single-sample p-value vs cohort distribution
**Differential regime** (differential-splicing skill):
- Two well-defined groups, n>=3 each
- Question: "What differs between groups?"
- Statistical model: two-group LRT or related
If you have n>=10 patients with a shared phenotype, prefer **differential** (more power); if single patient or heterogeneous case series, use **outlier**.
## FRASER 2.0 Workflow
**Goal:** Detect aberrant splicing in patient samples vs cohort using the Intron Jaccard Index.
**Approach:** Count split reads per junction, compute Intron Jaccard Index per intron, fit a Beta-binomial autoencoder to estimate expected values, then flag outliers by p-value and delta.
```r
library(FRASER); library(BiocParallel)
bam_files <- list.files('bams/', pattern='.bam$', full.names=TRUE)
sample_table <- data.frame(
sampleID = gsub('.bam', '', basename(bam_files)),
bamFile = bam_files,
pairedEnd = TRUE
)
settings <- FraserDataSet(
colData = sample_table,
workingDir = 'fraser_workdir',
name = 'rare_disease_cohort'
)
settings <- countRNAData(settings, BPPARAM = MulticoreParam(8))
fds <- calculatePSIValues(settings)
fds <- filterExpressionAndVariability(
fds,
minDeltaPsi = 0.0,
minExpressionInOneSample = 20,
quantile = 0.05,
quantileMinExpression = 1
)
fitMetrics(fds) <- 'jaccard'
currentType(fds) <- 'jaccard' # canonical setter for active metric in FRASER 2.0
fds <- FRASER(
fds,
q = c(jaccard = 10),
BPPARAM = MulticoreParam(8)
)
results <- results(
fds,
psiType = 'jaccard',
padjCutoff = 0.05,
deltaPsiCutoff = 0.1
)
patient_results <- results[results$sampleID == 'PATIENT_001', ]
patient_results <- patient_results[order(patient_results$padjust), ]
```
**FRASER 2.0 changes vs FRASER 1.x:**
- Default `psiType` changed from three metrics (psi5, psi3, theta) to single **Intron Jaccard Index**
- Default `deltaPsiCutoff` dropped from 0.3 to **0.1**
- Pseudocount and filtering parameter optimization
- Bioconductor package version >=1.99.0 == FRASER 2.0
`q = 10` is the autoencoder dimension hyperparameter. **Tune via `optimHyperParams(fds, type='jaccard')` for cohort-specific optimum** — too low: confounders not removed; too high: real signal absorbed.
## OUTRIDER for Gene-Level Outlier Expression
**Goal:** Detect genes with aberrantly high or low expression in patient samples.
**Approach:** Autoencoder denoising of expression matrix; outliers identified by Z-score and adjusted p-value.
```r
library(OUTRIDER); library(BiocParallel)
countTable <- read.table('counts.tsv', header=TRUE, row.names=1)
ods <- OutriderDataSet(countData = countTable)
ods <- filterExpression(ods, minCounts=TRUE, filterGenes=TRUE)
ods <- estimateBestQ(ods, BPPARAM = MulticoreParam(8))
ods <- OUTRIDER(ods, BPPARAM = MulticoreParam(8))
res <- results(ods, padjCutoff = 0.05, zScoreCutoff = 0)
patient_outliers <- res[res$sampleID == 'PATIENT_001', ]
```
OUTRIDER (Brechtmann 2018 *Am J Hum Genet*) catches loss-of-function alleles producing transcript collapse, monoallelic effects, and tissue-inappropriate expression — complements splice outlier detection.
## LeafcutterMD for Annotation-Free Outlier Intron Usage
**Goal:** Detect outlier intron usage relative to a control panel without annotation dependence.
**Approach:** Run LeafCutter in MD (Mahalanobis Distance) mode against the control panel.
```bash
for bam in *.bam; do
regtools junctions extract -a 8 -m 50 -s XS "$bam" -o "${bam%.bam}.junc"
done
ls *.junc > juncfiles.txt
python leafcutter_cluster_regtools.py -j juncfiles.txt -o leafcutter -m 50 -l 500000
leafcutterMD.R \
--num_threads 4 \
--output_prefix patient_outlier \
leafcutter_perind_numers.counts.gz
```
LeafcutterMD (Jenkinson 2020 *Bioinformatics*) reports per-sample p-values per intron-cluster; useful when FRASER's Beta-binomial model fits poorly or when novel-junction sensitivity matters.
## DROP Pipeline (Integrated Workflow)
**Goal:** Run FRASER2 + OUTRIDER + monoallelic expression in a unified Snakemake pipeline for clinical diagnostics.
**Approach:** DROP is distributed via **bioconda** (not PyPI). Install in a dedicated environment, then configure with patient + control sample sheets; pipeline handles QC, alignment, counting, autoencoding, and reporting.
```bash
# Install via bioconda (DROP is not on PyPI)
mamba create -n drop_env -c conda-forge -c bioconda drop --override-channels
conda activate drop_env
drop init my_diagnostic_run
cd my_diagnostic_run
# Edit config.yaml:
# - sample_table: samples.tsv (patient + controls)
# - aberrantSplicing: enabled
# - aberrantExpression: enabled
# - mae: enabled (monoallelic expression)
snakemake --cores 16 --use-conda
```
DROP (Yepez 2021 *Nat Protocols*) is the standard tool in EU rare-disease genome+RNA-seq programs (Solve-RD) and the NIH UDN. v1.4+ uses FRASER 2.0. The **MAE module** uses a custom z-score test on heterozygous SNPs from RNA-seq (allele-specific expression) — useful for catching dominant-negative or monoallelic LoF that splicing/expression outliers miss. Cohort >=30 samples recommended for confident outlier detection.
## Variant + Outlier Integration
**Goal:** Connect a candidate splice-altering DNA variant to RNA-level confirmation.
**Approach:** Cross-reference SpliceAI hits with FRASER2 outliers in the same sample.
```r
library(dplyr)
variants <- read.table('spliceai_hits.tsv', header=TRUE, sep='\t')
fraser_hits <- read.table('fraser_results.tsv', header=TRUE, sep='\t')
confirmed <- variants %>%
filter(delta_max >= 0.2) %>%
inner_join(
fraser_hits %>% filter(sampleID == 'PATIENT_001', padjust < 0.05),
by = c('chrom' = 'seqnames'),
relationship = 'many-to-many'
) %>%
filter(abs(pos - start) < 1000 | abs(pos - end) < 1000)
```
A SpliceAI hit + concordant FRASER2 outlier in the patient = strong PS3 functional evidence in the ACMG framework. This integration is the highest-value clinical pipeline step — converts a computational PP3 to functional PS3.
## Cohort Size and Power
| Cohort size | Power | Comment |
|-------------|-------|---------|
| n < 20 | Marginal | High FDR; consider GTEx tissue-matched controls as auxiliary |
| n = 20-50 | Acceptable | FRASER autoencoder can fit; tune q carefully |
| n >= 50 | Recommended | Standard clinical diagnostic cohort size |
| n >= 100 | Optimal | Tissue-matched and batch-matched gives best calibration |
GTEx-derived tissue-matched controls can supplement small in-house cohorts but introduce batch effects; use only when in-house n < 30 and document the pooling strategy.
## Tissue Choice for Mendelian RNA-seq
| Tissue | Pros | Cons | Genes captured |
|--------|------|------|----------------|
| Whole blood (PAXgene) | Easy, standard | Globin contamination; many disease genes silent | ~70-80% of clinical genes |
| Fibroblast (skin biopsy) | Reasonable expression | Requires culture; senescence variability | ~75-85% |
| Muscle biopsy | Best for muscular dystrophy | Invasive | ~85-90% for muscle disorders |
| iPSC-derived neuron / cardiomyocyte | Disease-relevant tissue | Cost, variability | ~95% if differentiation works |
| Urine sediment | Non-invasive | Low yield | ~50-60% |
For UDN-style cases: blood first, then fibroblast if blood lacks expression of candidate gene. Critical: **a negative blood RNA-seq doesn't rule out a candidate gene that's silent in blood** — verify gene expression with GTEx tissue panel before committing to the tissue.
## Hyperparameter Tuning
```r
fds <- estimateBestQ(fds, type='jaccard', useOHT=TRUE)
plotEncDimSearch(fds, type='jaccard')
```
The encoding dimension `q` should be where the loss curve plateaus. Too low: confounders not removed; too high: real signal absorbed.
For typical 50-100 sample cohorts, q=8-15 is the usual range. For very small cohorts (n=20-30), q=5-8.
## Per-Tool Failure Modes
### FRASER 2.0: Q Hyperparameter Mistuning
**Trigger:** Default q=10 used without tuning; or wrong q for cohort size.
**Mechanism:** Q is the autoencoder bottleneck dimension; too small → confounders leak into outlier signal; too large → real biological signal absorbed by autoencoder.
**Symptom:** Either no significant outliers (q too high) or many spurious calls clustering by batch (q too low).
**Fix:** Run `estimateBestQ(fds, type='jaccard', useOHT=TRUE)` (Optimal Hard Thresholding default; very fast); use the returned `bestQ(fds)` value. For exhaustive search, pass `useOHT=FALSE, q_param=c(2, 5, 10, 15)` and inspect `plotEncDimSearch` for the plateau.
### FRASER 2.0: Tissue Mismatch
**Trigger:** Patient sample from different tissue than majority of controls.
**Mechanism:** FRASER autoencoder learns tissue-specific expression patterns; tissue-mismatched patient appears as global outlier.
**Symptom:** Hundreds of "significant" outliers in patient; not biologically interpretable.
**Fix:** Strict tissue matching; if controls are mixed-tissue, use only controls from patient's tissue.
### OUTRIDER: Few Controls
**Trigger:** Cohort <20 samples.
**Mechanism:** Autoencoder needs sufficient samples to learn expression covariance; fails to fit at very small cohort sizes.
**Symptom:** Convergence warnings; uncalibrated p-values.
**Fix:** Pool with GTEx auxiliary controls; or use simpler outlier methods (z-score on log-CPM).
### LeafcutterMD: Cluster Count Limits
**Trigger:** Very few clusters in patient sample (low coverage or filtered out).
**Mechanism:** LeafcutterMD computes Mahalanobis distance over clusters; few observations → unstable distance.
**Symptom:** Inflated or deflated p-values; few significant calls.
**Fix:** Increase coverage; relax filtering (`-m 10` instead of 50); or switch to FRASER2.
### DROP: Snakemake Pipeline Failures
**Trigger:** Missing dependencies or incompatible R/Bioconductor versions.
**Mechanism:** DROP orchestrates many tools; version mismatches cascade through pipeline.
**Symptom:** Snakemake step fails partway through; cryptic R errors.
**Fix:** Use `--use-conda` flag for environment isolation; pin versions in environment yamls.
## Reconciliation: When Outlier Tools Disagree
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| FRASER2 sig, OUTRIDER not | Splicing change without expression collapse | Standard splicing outlier; report |
| OUTRIDER sig, FRASER2 not | Expression LoF without splicing change | Likely promoter / regulatory; not splicing |
| Both sig at same gene | LoF allele triggering NMD on splicing-disrupted transcript | Strong combined evidence; expect downstream |
| LeafcutterMD sig, FRASER2 not | Novel cryptic event not in annotation | High-priority novel finding; investigate |
| All tools null but biology suggests change | Underpowered cohort or wrong tissue | Verify gene expression in tissue; recruit larger cohort |
## Disease-Specific Expectations
| Condition | Expected outlier signature | Tissue |
|-----------|----------------------------|--------|
| ALS / FTD (TDP-43 loss) | Cryptic exons in UNC13A, STMN2, ATG4B | Post-mortem brain ONLY |
| SF3B1-mutant MDS / CLL / uveal melanoma | Aberrant 3'ss ~10-30nt upstream of canonical | Bone marrow / tumor tissue |
| Spinal muscular atrophy (untreated SMN2) | SMN exon 7 skipping | Fibroblast / iPSC-MN |
| Familial dysautonomia (ELP1 c.2204+6T>C) | ELP1 exon 20 skipping (>=99% in CNS, partial elsewhere) | iPSC-neuron > fibroblast > blood |
| Deep-intronic CFTR / USH2A / CEP290 | Pseudoexon inclusion | Cognate disease tissue (lung / retina) |
| Duchenne muscular dystrophy (DMD) | Out-of-frame exon skipping pattern | Muscle biopsy |
| Stargardt (ABCA4) deep-intronic | Pseudoexon in retina | Retinal organoid / iPSC-RPE |
For each, the gene must be expressed in the queried tissue. Verify with GTEx before assuming negative result rules out the gene.
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `FRASER: cohort too small` | n<10 | Pool with auxiliary controls; or recruit more patients |
| `FRASER: countRNAData failed on chromosome X` | BAM index missing or corrupted | Re-index BAMs; check `samtools idxstats` |
| `estimateBestQ: convergence not reached` | Default q range insufficient | Expand `q_param=c(2,5,10,15,20)`; or use `useOHT=TRUE` for the fast deterministic alternative |
| `OUTRIDER: estimateBestQ slow` | Default range tries q=2-30 | Restrict to expected range (`q=4:15`) |
| `DROP: snakemake job failed at FRASER` | DROP-FRASER version mismatch | Update DROP to latest; verify FRASER 2.0 compatibility |
| `LeafcutterMD: insufficient clusters` | Cluster filter too strict | Lower `-m` minimum cluster reads |
| `Variant integration: chrom format mismatch` | VCF uses 1, FRASER uses chr1 (or vice versa) | Standardize with `bcftools annotate --rename-chrs` |
## Common Pitfalls
- **Using bulk differential-splicing tools for n=1 vs cohort** — rMATS, leafcutter (regular), SUPPA2 are not designed for this. Use FRASER2 / LeafcutterMD.
- **Ignoring tissue choice** — clinical gene expression varies dramatically across blood / fibroblast / muscle. A negative blood RNA-seq doesn't rule out a candidate gene that's silent in blood.
- **Forgetting batch effects** — combining in-house and external (GTEx) controls introduces sequencing batch confounding; use ComBat or include batch as covariate.
- **Skipping the variant + outlier integration** — RNA-only outlier without DNA variant suggests cellular state or technical artifact; DNA-only prediction without RNA confirmation is supporting only (PP3, not PS3).
- **Treating all FRASER2 outliers as pathogenic** — many are benign tissue-specific variation. Filter against gnomAD splice constraint and disease gene panels.
- **Q hyperparameter not tuned** — default `q=10` works for ~50-100 sample cohorts; tune for outliers.
- **Wrong default delta cutoff for FRASER 1.x vs 2.0** — 1.x default 0.3, 2.0 default 0.1; document which version.
## Quality Thresholds
| Metric | Recommendation | Source |
|--------|----------------|--------|
| Cohort size | n>=50 (ideal); n>=20 (minimum) | Solve-RD / UDN convention |
| FRASER 2.0 padj | < 0.05 | Standard |
| FRASER 2.0 delta Jaccard | >= 0.1 (default in v2.0) | Scheller 2023 *AJHG* |
| OUTRIDER padj | < 0.05 | Brechtmann 2018 *AJHG* |
| OUTRIDER zScore | abs >= 2 | Conservative |
| Sequencing depth | >=50M PE reads/sample | Standard for AS analysis |
| Tissue match between patient and controls | Required | Critical for FRASER2 calibration |
| Batch match | Strongly recommended | Reduces autoencoder confounding |
## Related Skills
- splice-variant-prediction - SpliceAI / Pangolin for in-silico prediction; integration target
- differential-splicing - When testing multiple patients vs controls (>=10 vs cohort)
- splicing-qc - Library / depth / tissue prerequisites
- variant-calling/clinical-interpretation - ACMG/AMP framework integration
- workflows/clinical-trial-pipeline - Trial-grade RNA-seq diagnostics
## References
- Mertes et al 2021 *Nat Commun* - FRASER 1.x
- Scheller et al 2023 *Am J Hum Genet* - FRASER 2.0 (Intron Jaccard Index)
- Brechtmann et al 2018 *Am J Hum Genet* - OUTRIDER
- Jenkinson et al 2020 *Bioinformatics* - LeafcutterMD
- Yepez et al 2021 *Nat Protocols* - DROP pipeline
- Cummings et al 2017 *Sci Transl Med* - RNA-seq for muscular dystrophy diagnostics
- Kremer et al 2017 *Nat Commun* - RNA-seq for mitochondrial disease
- Brown et al 2022 *Nature* - UNC13A cryptic exon (TDP-43 / ALS)
- Klim et al 2019 *Nat Neurosci* - STMN2 cryptic splicing (ALS)
- Darman et al 2015 *Cell Rep* - SF3B1 cryptic 3'ss
- Walker et al 2023 *Am J Hum Genet* - ClinGen SVI splicing recommendations
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.