bio-ecological-genomics-species-delimitation
$
npx mdskill add GPTomics/bioSkills/bio-ecological-genomics-species-delimitationDelimits species boundaries from DNA barcoding using multi-method consensus.
- Resolves cryptic species complexes and validates taxonomic assignments.
- Integrates ASAP, bPTP, GMYC, BPP, and delimtools for analysis.
- Applies integrative taxonomy best practices to determine species boundaries.
- Outputs comparative results across distance, tree, and coalescent methods.
SKILL.md
.github/skills/bio-ecological-genomics-species-delimitationView on GitHub ↗
---
name: bio-ecological-genomics-species-delimitation
description: Delimits species boundaries from molecular data using distance-based (ASAP), tree-based (bPTP, GMYC), and coalescent (BPP) methods. Compares multiple delimitation results with delimtools. Use when delineating putative species from DNA barcoding data, resolving cryptic species complexes, or validating taxonomic assignments. Emphasizes multi-method consensus following integrative taxonomy best practice.
tool_type: mixed
primary_tool: ASAP
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, numpy 1.26+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Species Delimitation
**"Delineate species boundaries from my DNA barcoding data"** → Apply multiple delimitation methods (distance-based ASAP, tree-based bPTP/GMYC, coalescent BPP) to molecular data and compare results for multi-method consensus following integrative taxonomy best practice.
- CLI: ASAP web tool or standalone for distance-based partitioning
- Python: bPTP via `PTP-pyqt5` for Bayesian branching-rate analysis
- R: `splits::gmyc()` for coalescent/speciation transition model
Delimits putative species from molecular data using complementary distance-based, tree-based, and coalescent methods.
## Overview of Methods
| Method | Input | Approach | Strengths | Limitations |
|--------|-------|----------|-----------|-------------|
| ASAP | Aligned sequences | Distance-based partitioning | Fast, automatic, ranks partitions | Single locus only |
| bPTP | Rooted phylogeny | Bayesian branching rates | Bayesian support values | Over-splits with structure |
| GMYC | Ultrametric tree | Coalescent/speciation transition | Well-established theory | Requires ultrametric tree |
| BPP | Multi-locus alignment | Full coalescent model | Multi-locus, statistically rigorous | Computationally intensive |
## ASAP (Assemble Species by Automatic Partitioning)
**Goal:** Partition aligned barcode sequences into putative species using pairwise genetic distances.
**Approach:** Run ASAP with a substitution model (K2P, p-distance) and rank candidate partitions by asap-score.
Distance-based method that finds the optimal partition of sequences into putative species:
```bash
# ASAP web interface: https://bioinfo.mnhn.fr/abi/public/asap/asapweb.html
# Upload aligned FASTA, select distance model
# Command-line ASAP (if available)
ASAP -i aligned_sequences.fasta -d K2P -o asap_results/
# -d K2P: Kimura 2-parameter distance (standard for COI barcoding)
# -d p: uncorrected p-distance (for closely related taxa)
# -d JC: Jukes-Cantor (simplest substitution model)
```
### Interpreting ASAP Results
```
# ASAP ranks partitions by asap-score (lower = better)
# Reports: partition rank, number of groups, asap-score, p-value, threshold distance
# Always examine top 5-10 partitions, not just the best one
# Large gaps between consecutive asap-scores indicate robust partitioning
```
### Python ASAP-Style Analysis
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
import numpy as np
alignment = AlignIO.read('aligned_sequences.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
names = [r.id for r in alignment]
n = len(names)
dist_array = np.zeros((n, n))
for i in range(n):
for j in range(n):
dist_array[i][j] = dm[names[i], names[j]]
condensed = squareform(dist_array)
Z = linkage(condensed, method='average')
# Scan thresholds to identify barcode gap
# COI barcode gap typically at 2-3% for animals
thresholds = np.arange(0.01, 0.10, 0.005)
for t in thresholds:
clusters = fcluster(Z, t=t, criterion='distance')
n_clusters = len(set(clusters))
print(f'Threshold {t:.3f}: {n_clusters} groups')
```
## bPTP (Bayesian Poisson Tree Processes)
**Goal:** Delimit species by detecting the transition from speciation to coalescent branching rates on a phylogeny.
**Approach:** Run Bayesian PTP MCMC on a rooted tree to obtain posterior support values for each species partition.
Tree-based method that models speciation as a branching rate shift:
```bash
# iTaxoTools PTP (Python 3 compatible)
pip install PTP-pyqt5
# Run bPTP from command line
python -m PTP.PTP -t rooted_tree.nwk -o bptp_results \
--type bayesian --ngen 100000 --burnin 0.1 --seed 42
# --type bayesian: Bayesian MCMC (preferred over ML for support values)
# --ngen 100000: MCMC generations; increase for large trees
# --burnin 0.1: discard first 10% of samples
```
### Interpreting bPTP Output
```
# Support values for each species partition:
# > 0.95: strong support
# 0.80-0.95: moderate support
# < 0.80: weak support, boundary uncertain
# bPTP can over-split when populations have strong geographic structure
# Compare against distance-based methods to identify over-splitting
```
## GMYC (Generalized Mixed Yule Coalescent)
**Goal:** Identify the threshold on an ultrametric tree where branching shifts from interspecific (Yule) to intraspecific (coalescent) processes.
**Approach:** Prepare an ultrametric tree with chronos, run single-threshold GMYC, evaluate with a likelihood ratio test, and extract species assignments.
Models the transition between interspecific (Yule) and intraspecific (coalescent) branching rates on an ultrametric tree:
```r
library(splits)
library(ape)
# Step 1: Prepare ultrametric tree
# GMYC requires an ultrametric (time-calibrated) tree
tree <- read.tree('rooted_tree.nwk')
# chronos() for quick ultrametric conversion (penalized likelihood)
# lambda=1: smoothing parameter; higher = more clock-like
ultrametric_tree <- chronos(tree, lambda = 1)
class(ultrametric_tree) <- 'phylo'
# Verify ultrametric property
is.ultrametric(ultrametric_tree)
# Step 2: Run single-threshold GMYC
gmyc_result <- gmyc(ultrametric_tree, method = 'single')
# Summary: number of entities, confidence interval, likelihood ratio test
summary(gmyc_result)
# Number of putative species (ML entities)
cat('GMYC species:', gmyc_result$entity[1], '\n')
# Likelihood ratio test p-value
# p < 0.05: significant speciation-coalescent transition detected
cat('LR test p-value:', gmyc_result$p.value[1], '\n')
# Step 3: Extract species assignments
species_list <- spec.list(gmyc_result)
cat('Species assignments:\n')
print(species_list)
```
### Multiple-Threshold GMYC
```r
# Allows different coalescent-to-speciation transitions across the tree
# Better for datasets with variable population sizes or divergence times
gmyc_multi <- gmyc(ultrametric_tree, method = 'multiple')
summary(gmyc_multi)
```
### GMYC Visualization
```r
# Color-coded tree by species partition
pdf('gmyc_species_tree.pdf', width = 10, height = 12)
plot(gmyc_result, cex = 0.6)
dev.off()
# Support surface: likelihood across threshold positions
pdf('gmyc_support.pdf', width = 8, height = 5)
plot(gmyc_result, type = 'support')
dev.off()
```
## BPP (Bayesian Phylogenetics and Phylogeography)
**Goal:** Perform rigorous multi-locus species delimitation under the multispecies coalescent model.
**Approach:** Configure BPP A10 analysis (joint delimitation + species tree estimation) with priors on theta and tau, then run MCMC to obtain posterior probabilities for each delimitation model.
Full multispecies coalescent model for rigorous species delimitation:
```
# BPP control file for species delimitation (A10 analysis)
# A10: joint species delimitation and species tree estimation
seed = 12345
seqfile = alignment.phy
Imapfile = imap.txt # maps individuals to putative species
outfile = bpp_results.txt
mcmcfile = bpp_mcmc.txt
speciesdelimitation = 1 # 1 = delimitation mode
speciestree = 1 # 1 = estimate species tree
speciesmodelprior = 1 # 1 = uniform prior on all rooted trees
# species&tree: species names, sample sizes, starting tree
species&tree = 4 speciesA speciesB speciesC speciesD
10 8 12 6
((speciesA, speciesB), (speciesC, speciesD));
# Prior on theta (ancestral population size)
# thetaprior = alpha beta: inverse-gamma(alpha, beta)
# alpha=3, beta=0.004: mean=0.002 (typical for moderate diversity)
thetaprior = 3 0.004
# Prior on tau (divergence time)
# tauprior = alpha beta: inverse-gamma(alpha, beta)
# alpha=3, beta=0.002: mean=0.001 (relatively recent divergence)
tauprior = 3 0.002
# MCMC settings
nsample = 100000
sampfreq = 2
burnin = 50000
```
### Interpreting BPP Results
```
# Posterior probability for each delimitation model
# PP > 0.95: strong support for that delimitation
# PP 0.50-0.95: moderate support
# PP < 0.50: uncertain, consider lumping species
# Always run multiple independent chains to check convergence
```
## Multi-Method Comparison
**Goal:** Evaluate concordance across multiple delimitation methods to identify robust consensus species boundaries.
**Approach:** Compare partition vectors from ASAP, bPTP, and GMYC using the adjusted Rand index and count species per method.
Compare delimitation results across methods to identify consensus species:
```r
library(ape)
# Partition vectors from each method (species assignment per individual)
asap_partition <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
bptp_partition <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5)
gmyc_partition <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
# Compare with adjusted Rand index
# ARI = 1: perfect agreement; ARI = 0: random; ARI < 0: less than random
library(fossil)
cat('ASAP vs bPTP ARI:', adj.rand.index(asap_partition, bptp_partition), '\n')
cat('ASAP vs GMYC ARI:', adj.rand.index(asap_partition, gmyc_partition), '\n')
cat('bPTP vs GMYC ARI:', adj.rand.index(bptp_partition, gmyc_partition), '\n')
# Consensus: individuals assigned to the same species by >= 2 methods
comparison <- data.frame(
individual = paste0('ind_', 1:10),
ASAP = asap_partition,
bPTP = bptp_partition,
GMYC = gmyc_partition
)
# Count number of species per method
cat('\nSpecies counts:\n')
cat(' ASAP:', length(unique(asap_partition)), '\n')
cat(' bPTP:', length(unique(bptp_partition)), '\n')
cat(' GMYC:', length(unique(gmyc_partition)), '\n')
```
## Ultrametric Tree Preparation
**Goal:** Convert a maximum-likelihood phylogeny into an ultrametric (time-calibrated) tree suitable for GMYC and other coalescent methods.
**Approach:** Apply penalized likelihood (chronos) with a relaxed clock model, then verify and fix near-ultrametric rounding errors.
Most tree-based methods require ultrametric input:
```r
library(ape)
library(phytools)
tree <- read.tree('ml_tree.nwk')
# Method 1: chronos (penalized likelihood, fastest)
# model='correlated': rates correlated across branches (relaxed clock)
ultra_tree <- chronos(tree, model = 'correlated')
# Method 2: BEAST2 (Bayesian, most rigorous but slow)
# Produces ultrametric tree with credibility intervals
# Run externally, import the MCC tree
# Verify
is.ultrametric(ultra_tree)
# Fix near-ultrametric trees with small rounding errors
ultra_tree <- force.ultrametric(ultra_tree, method = 'extend')
```
## Related Skills
- edna-metabarcoding - Generate barcode sequences from eDNA
- conservation-genetics - Population-level genetic assessment
- phylogenetics/tree-io - Tree input/output for delimitation methods
- phylogenetics/modern-tree-inference - Phylogenetic tree construction
- database-access/entrez-fetch - Retrieve barcode sequences from GenBank
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.