bio-phylo-modern-tree-inference
$ npx mdskill add GPTomics/bioSkills/bio-phylo-modern-tree-inference
Builds maximum likelihood phylogenetic trees with model selection and support assessment
- Infers publication-quality ML trees from sequence alignments
- Uses IQ-TREE2 and RAxML-NG with ModelFinder and ModelTest-NG
- Automates model selection, bootstrapping, and topology testing
- Delivers trees with branch support and concordance factors
---
name: bio-phylo-modern-tree-inference
description: Build maximum likelihood phylogenetic trees using IQ-TREE2 and RAxML-NG with expert model selection, branch support assessment, and topology testing. Use when inferring publication-quality ML trees, selecting substitution models, interpreting bootstrap and concordance factor support, or running partitioned phylogenomic analyses.
tool_type: cli
primary_tool: IQ-TREE2
---
## Version Compatibility
Reference examples tested with: IQ-TREE 2.2+, RAxML-NG 1.2+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `iqtree2 --version` then `iqtree2 --help` to confirm flags
- CLI: `raxml-ng --version` then `raxml-ng --help` to confirm flags
If commands fail, introspect the installed version and adapt flags rather than retrying.
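
The version check above can be wrapped in a small helper so that a pipeline stops early when a tool is missing. This is a minimal sketch: `check_tool` is a hypothetical helper name, not part of either program, and it assumes the tool answers `--version` as shown above.

```bash
# check_tool is a hypothetical helper, not part of IQ-TREE2 or RAxML-NG.
# It prints the first line of `<tool> --version`, or fails if the tool is absent.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    "$1" --version 2>&1 | head -n 1
  else
    echo "$1: not found on PATH" >&2
    return 1
  fi
}

for tool in iqtree2 raxml-ng; do
  check_tool "$tool" || echo "skipping examples that need $tool"
done
```

The printed version string can then be compared by eye against the versions listed above before any flags are copied.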
# Modern ML Tree Inference
**"Infer a maximum likelihood tree from my alignment"** -> Build publication-quality ML trees with automatic substitution model selection, ultrafast bootstrap support, concordance factors, and topology testing.
- CLI: `iqtree2 -s alignment.fa -m MFP -B 1000 -alrt 1000 -bnni` (IQ-TREE2)
- CLI: `raxml-ng --all --msa alignment.fa --model GTR+G --bs-trees 100` (RAxML-NG)
## IQ-TREE2 vs RAxML-NG Decision
| Factor | IQ-TREE2 | RAxML-NG |
|--------|----------|----------|
| Model selection | Built-in ModelFinder | External ModelTest-NG |
| Ultrafast bootstrap | UFBoot2 | No |
| Branch lengths | Good | More accurate |
| Concordance factors | Built-in (gCF/sCF) | No |
| Very large trees (>1000 taxa) | Good | Better |
| Transfer bootstrap | No | Yes |
| Partition models | Extensive | Good |
**Default recommendation:** IQ-TREE2 for most workflows (integrated model selection, UFBoot2, concordance factors). Use RAxML-NG when precise branch lengths matter, for very large trees, or when using transfer bootstrap for rogue-taxon-prone datasets.
**Best practice for important results:** Run both tools and compare topologies.
## Model Selection
### Use ModelFinder (`-m MFP`), Not `-m TEST`
**Goal:** Select the substitution model that best describes the evolutionary process in the alignment.
**Approach:** ModelFinder tests the standard models plus FreeRate (+R) rate heterogeneity. `-m MF` stops after model selection; `-m MFP` continues straight into tree inference under the best-fit model. The older `-m TEST` mimics jModelTest/ProtTest-style selection and does not evaluate FreeRate models.
```bash
# Recommended: ModelFinder Plus (includes FreeRate models)
iqtree2 -s alignment.fasta -m MFP -B 1000 -alrt 1000 -bnni -T AUTO
# Model selection only (no tree inference)
iqtree2 -s alignment.fasta -m MF -T AUTO
# Partition model with automatic merging
iqtree2 -s concat.fasta -p partitions.nex -m MFP+MERGE -B 1000 -bnni -T AUTO
```
### Rate Heterogeneity Models
| Model | Description | When Selected |
|-------|-------------|---------------|
| +G4 | Discrete gamma (4 categories) | Standard default; sufficient for most datasets |
| +I+G4 | Invariant sites + gamma | Often selected despite theoretical identifiability concerns; safe to use |
| +R4/+R5 | FreeRate model | Better fit for large datasets; relaxes gamma assumption |
| +R (auto k) | FreeRate with automatic categories | **Only tested by `-m MFP`**, not `-m TEST` |
FreeRate models can absorb rate variation from long tails that discretized gamma cannot. For large datasets or datasets with complex rate variation, FreeRate often fits better.
### BIC vs AIC for Model Selection
BIC is the IQ-TREE default and recommended for most analyses. BIC penalizes complexity more heavily than AIC, reducing overfitting risk. AIC tends to select overly complex models.
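
The difference in penalty can be seen with a small calculation. The sketch below uses the textbook definitions AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the sample size (taken here as the number of alignment sites, which is an assumption about how the criterion is applied). The function name `ic` and all numbers are hypothetical.

```bash
# ic LNL K N -> prints AIC and BIC for one model (hypothetical helper)
#   LNL = log-likelihood, K = free parameters, N = number of alignment sites
ic() {
  awk -v lnl="$1" -v k="$2" -v n="$3" 'BEGIN {
    printf "AIC=%.2f BIC=%.2f\n", 2*k - 2*lnl, k*log(n) - 2*lnl
  }'
}

ic -10000 10 1000   # simpler model (made-up values)
ic  -9970 25 1000   # richer model with a better likelihood
```

With these made-up values the richer model has the lower AIC (19990.00 vs 20020.00), while the simpler model has the lower BIC (20069.08 vs 20112.69), which illustrates why BIC is the more conservative choice.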
### DNA Model Hierarchy
| Model | Free Parameters | When Appropriate |
|-------|----------------|------------------|
| JC69 | 0 | Almost never in practice; null model |
| K2P/K80 | 1 (kappa) | Very closely related sequences with balanced composition |
| HKY85 | 4 | Moderate divergence, single-gene analyses |
| GTR | 8 | Default; almost always selected by model testing |
The real decision is usually not GTR vs HKY but which rate heterogeneity model (+G vs +I+G vs +R).
### Protein Models
Let ModelFinder choose. For deep phylogenies, profile mixture models (C10-C60 in IQ-TREE, CAT in PhyloBayes) can outperform fixed-matrix models by capturing site-specific amino acid preferences.
```bash
# Protein with ModelFinder
iqtree2 -s protein.fasta -m MFP -B 1000 -bnni -st AA -T AUTO
# Protein with mixture model (for deep phylogenies)
iqtree2 -s protein.fasta -m LG+C60+F+G -B 1000 -bnni -st AA -T AUTO
```
## Branch Support Assessment
### Standard Analysis: UFBoot2 + SH-aLRT
```bash
# Recommended for most analyses
iqtree2 -s alignment.fasta -m MFP -B 1000 -alrt 1000 -bnni -T AUTO --seed 12345
```
The `-bnni` flag further optimizes each bootstrap tree with NNI on the resampled alignment, which reduces the risk of overestimated support under severe model violations. Pass it explicitly rather than relying on a version-specific default, and confirm it with `iqtree2 --help`.
### Interpreting Support Values
| UFBoot2 | SH-aLRT | Interpretation |
|---------|---------|----------------|
| >= 95 | >= 80 | Strong support |
| 80-94 | 70-79 | Moderate support |
| < 80 | < 70 | Weak support |
**Critical nuance:** UFBoot values are NOT comparable to standard bootstrap. UFBoot >= 95 corresponds roughly to standard bootstrap >= 70. Do not apply the traditional >= 70 threshold to UFBoot values.
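
The thresholds in the table can be applied to a tree file with a few lines of shell. This is a rough sketch under two assumptions: branch labels are written as `SH-aLRT/UFBoot` when both `-alrt` and `-B` are used (check the `.iqtree` report of your run), and a branch is only called strong or moderate when both values clear the respective thresholds. `classify_support` and the example tree are hypothetical.

```bash
# classify_support SH_ALRT UFBOOT -> strong | moderate | weak (hypothetical helper)
classify_support() {
  awk -v a="$1" -v u="$2" 'BEGIN {
    if (u >= 95 && a >= 80)      print "strong"
    else if (u >= 80 && a >= 70) print "moderate"
    else                         print "weak"
  }'
}

# Made-up Newick string; internal branch labels follow each ")" as SH-aLRT/UFBoot
tree='((A:0.1,B:0.2)98.5/100:0.05,(C:0.1,D:0.3)72.1/85:0.02,E:0.4);'
printf '%s\n' "$tree" | grep -oE '\)[0-9.]+/[0-9.]+' | tr -d ')' |
while IFS=/ read -r alrt ufboot; do
  printf '%s\t%s\t%s\n' "$alrt" "$ufboot" "$(classify_support "$alrt" "$ufboot")"
done
```

For a real run, replace the `tree` variable with the contents of the `.treefile`.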
### When Low Support Matters
- Low support on backbone branches: genuine topological uncertainty. Investigate with concordance factors or coalescent methods
- Low support on recent divergences within a well-sampled clade: may reflect insufficient data rather than genuine uncertainty
- Low support throughout the tree: suspect rapid radiation, incomplete lineage sorting (ILS), hybridization, or inadequate data
### Transfer Bootstrap (RAxML-NG)
For large trees (>1000 taxa), transfer bootstrap expectation (TBE) is less sensitive to rogue taxa than standard bootstrap:
```bash
raxml-ng --all --msa alignment.fasta --model GTR+G --bs-trees autoMRE --bs-metric tbe
```
### Concordance Factors
Concordance factors quantify agreement among loci (gCF) and sites (sCF), complementing bootstrap:
```bash
# After obtaining gene trees and species tree
iqtree2 -t species.treefile --gcf gene_trees.treefile -s concat.fasta --scf 100
# Likelihood-based sCF (more accurate; requires IQ-TREE 2.2.2 or later)
iqtree2 -t species.treefile --gcf gene_trees.treefile -s concat.fasta --scfl 100
```
| Metric | Interpretation |
|--------|---------------|
| gCF/sCF > 50% | Majority of loci/sites support this branch |
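| sCF ~ 33% | Completely equivocal (the three resolutions around the branch are equally supported by sites); gCF has no such floor and can drop to 0% |
| gCF << sCF | Often indicates gene tree estimation error (e.g., short, low-information loci) rather than genuine discordance |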
| sCF < 33% | A different topology is better supported at this node |
For publication, report UFBoot + SH-aLRT at minimum; add concordance factors for phylogenomic datasets.
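
The interpretation rules above can be turned into a quick screening pass. The sketch below reads a simplified, hypothetical three-column table (`branch_id`, `gCF`, `sCF`, tab-separated) rather than the tool's own output format, and the 20-point gap used to call "gCF << sCF" is an arbitrary illustrative cutoff, not a published threshold.

```bash
# flag_cf [GAP] : reads "branch_id<TAB>gCF<TAB>sCF" on stdin (hypothetical layout)
# and prints a note per branch. GAP is an illustrative cutoff (default 20).
flag_cf() {
  awk -F'\t' -v gap="${1:-20}" '{
    note = "no clear signal"
    if ($3 < 33)                  note = "sCF<33: sites favor another resolution"
    else if ($3 - $2 > gap)       note = "gCF<<sCF: check gene tree estimation error"
    else if ($2 > 50 && $3 > 50)  note = "majority of loci and sites agree"
    printf "%s\t%s\n", $1, note
  }'
}

printf '11\t85.0\t78.2\n12\t12.5\t61.0\n13\t30.1\t29.4\n' | flag_cf
```

Branches flagged this way are candidates for closer inspection, not automatic conclusions.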
## Partitioned Analysis
For multi-gene concatenated datasets where genes evolve at different rates:
```bash
# Edge-linked proportional (recommended default)
iqtree2 -s concat.fasta -p partitions.nex -m MFP -B 1000 -bnni -T AUTO
# Edge-unlinked (independent branch lengths per partition)
# Most general but parameter-rich; risk of overfitting with missing data
iqtree2 -s concat.fasta -Q partitions.nex -m MFP -B 1000 -bnni -T AUTO
# With automatic partition merging
iqtree2 -s concat.fasta -p partitions.nex -m MFP+MERGE -B 1000 -bnni -T AUTO
```
| Flag | Model | Recommendation |
|------|-------|----------------|
| `-q` | Edge-linked equal | Unrealistic; not recommended |
| `-p` | Edge-linked proportional | **Recommended default** |
| `-Q` | Edge-unlinked | Justified when different genes have different relative rates across lineages (heterotachy) |
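
The commands above expect a partition file. As a sketch, a NEXUS `sets` block can be generated from a list of gene names and alignment lengths, assuming the genes were concatenated in that order. `make_partitions` and the gene names are hypothetical.

```bash
# make_partitions: reads "name length" pairs on stdin and prints a NEXUS sets block
# with consecutive, non-overlapping site ranges (hypothetical helper).
make_partitions() {
  awk 'BEGIN { print "#nexus"; print "begin sets;"; start = 1 }
       { stop = start + $2 - 1
         printf "    charset %s = %d-%d;\n", $1, start, stop
         start = stop + 1 }
       END { print "end;" }'
}

printf 'geneA 300\ngeneB 450\ngeneC 120\n' | make_partitions
```

Redirect the output to `partitions.nex` to use it with `-p` or `-Q`, and check that the last range ends at the total alignment length.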
## Topology Testing
### AU Test (Approximately Unbiased)
**Goal:** Test whether alternative tree topologies are significantly worse than the best tree.
**Approach:** Compare candidate topologies using the AU test, which uses multiscale bootstrapping to correct for the selection bias of comparing several trees at once. It is generally preferred over the more conservative SH test.
```bash
# Compare candidate topologies
iqtree2 -s alignment.fasta -m GTR+F+R3 --trees candidates.treefile --test-au --test 10000 -n 0
```
- p-AU >= 0.05: Tree cannot be rejected
- p-AU < 0.05: Tree is significantly worse
Use topology tests for evaluating specific competing hypotheses, not for fishing through thousands of random topologies.
## Long Branch Attraction (LBA) Awareness
LBA causes distantly related long-branched taxa to group together artifactually due to model misspecification. It affects ML and Bayesian methods, not just parsimony.
**Detection signs:**
- Two long-branched taxa group together when they should not
- Removing one long-branch taxon causes the other to move
- Switching to a more complex model changes the placement
**Mitigation (in order of effectiveness):**
1. Add taxa that break long branches (most effective)
2. Use site-heterogeneous models (C60 in IQ-TREE, CAT-GTR in PhyloBayes)
3. Remove saturated sites (3rd codon positions, hypervariable regions)
4. Use amino acids instead of nucleotides for coding regions
5. Recode the data: RY-coding for nucleotides, Dayhoff-6 recoding for proteins
When LBA is suspected, consider running PhyloBayes with CAT-GTR model (see bayesian-inference skill).
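
As an illustration of the recoding step in the mitigation list, the sketch below RY-codes a nucleotide FASTA file: purines (A, G) become R and pyrimidines (C, T, U) become Y. `ry_recode` is a hypothetical helper, and how the recoded alignment should be passed to a tree program (for example as a two-state data type) depends on the tool and is not covered here.

```bash
# ry_recode: purines (A, G) -> R, pyrimidines (C, T, U) -> Y (hypothetical helper).
# FASTA header lines pass through; gaps and other symbols are left unchanged.
ry_recode() {
  awk '/^>/ { print; next }
       { s = toupper($0); gsub(/[AG]/, "R", s); gsub(/[CTU]/, "Y", s); print s }'
}

printf '>taxonA\nACGTTGCA\n>taxonB\nac-tNgca\n' | ry_recode
```

Recoding discards transition information, so it is worth comparing the resulting topology against the unrecoded analysis rather than replacing it.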
## IQ-TREE2 Output Files
| File | Description |
|------|-------------|
| `.treefile` | Best ML tree (Newick) |
| `.iqtree` | Full report with model parameters |
| `.contree` | Consensus tree with support values |
| `.splits.nex` | Bootstrap splits (Nexus) |
| `.model.gz` | ModelFinder results for all tested models (also serves as its checkpoint) |
| `.log` | Run log |
| `.ckp.gz` | Checkpoint for resuming |
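
The selected model can be pulled out of the `.iqtree` report for reuse in later commands. This sketch assumes the report contains a line of the form `Best-fit model according to BIC: <model>`; `best_model` is a hypothetical helper, and the demo runs on a stand-in file rather than real output.

```bash
# best_model FILE -> prints the model named on the "Best-fit model" line
# (hypothetical helper; assumes that line format in the .iqtree report)
best_model() {
  sed -n 's/^Best-fit model according to [A-Za-z]*: *//p' "$1" | head -n 1
}

# Demo on a stand-in report; a real one is written by iqtree2 as <prefix>.iqtree
report="$(mktemp)"
printf 'ModelFinder\n-----------\n\nBest-fit model according to BIC: GTR+F+R3\n' > "$report"
model="$(best_model "$report")"
echo "$model"
rm -f "$report"
```

The captured value can then be passed as `-m "$model"` to follow-up runs such as topology tests, which avoids repeating model selection.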
## RAxML-NG Usage
```bash
# ML search + bootstrap
raxml-ng --all --msa alignment.fasta --model GTR+G --bs-trees 100
# Thorough search with multiple starting trees
raxml-ng --msa alignment.fasta --model GTR+G --tree pars{10} --prefix ml_search
# Protein models
raxml-ng --msa protein.fasta --model LG+G8+F
# Constrained tree search
raxml-ng --msa alignment.fasta --model GTR+G --tree-constraint constraint.tre
# Check alignment before full run
raxml-ng --check --msa alignment.fasta --model GTR+G
```
## RAxML-NG Output Files
| File | Description |
|------|-------------|
| `.raxml.bestTree` | Best ML tree |
| `.raxml.support` | Tree with bootstrap support |
| `.raxml.bootstraps` | All bootstrap trees |
| `.raxml.mlTrees` | All ML trees from search |
| `.raxml.log` | Analysis log |
## Large Dataset Strategies
```bash
# IQ-TREE2 fast mode for >500 taxa (UFBoot is not accepted together with -fast, so SH-aLRT provides support)
iqtree2 -s large.fasta -m GTR+G -fast -alrt 1000 -T 4 -mem 8G
# RAxML-NG with limited starting trees
raxml-ng --msa large.fasta --model GTR+G --tree pars{5} --threads 8
```
For very large trees (>1000 taxa), consider FastTree 2 for an initial exploratory tree, then refine with RAxML-NG.
## Constrained Analysis
```bash
# Enforce monophyly constraint
iqtree2 -s alignment.fasta -m MFP -g constraint.tre -B 1000 -bnni
# Constraint file: Newick with taxa to constrain
# ((Human,Chimp),Gorilla);
```
## Reproducibility
```bash
# Always set random seed for reproducible results
iqtree2 -s alignment.fasta -m MFP -B 1000 -bnni --seed 12345 -T AUTO
raxml-ng --all --msa alignment.fasta --model GTR+G --seed 12345 --bs-trees 100
```
## Resuming Interrupted Runs
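Both tools write checkpoint files and resume from them when the original command is rerun unchanged in the same directory with the same output prefix.
```bash
# Resume: rerun the original command without any redo flag
iqtree2 -s alignment.fasta -m MFP -B 1000
raxml-ng --msa alignment.fasta --model GTR+G
```
The redo flags restart work rather than resume it: `--redo-tree` (IQ-TREE2) reuses the stored ModelFinder results but repeats the tree search, and `--redo` (both tools) ignores the checkpoint and overwrites existing output.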
## Related Skills
- bayesian-inference - Bayesian tree inference with MrBayes, BEAST2, convergence diagnostics
- species-trees - Coalescent methods (ASTRAL) when gene tree discordance is high
- divergence-dating - Molecular clock analysis and divergence time estimation
- tree-io - Read and convert output tree files
- tree-visualization - Visualize trees with support values
- distance-calculations - Compare with distance-based methods
- alignment/alignment-io - Prepare alignments for tree inference
- alignment/multiple-alignment - Alignment quality affects tree inference