bio-phylo-distance-calculations
$ npx mdskill add GPTomics/bioSkills/bio-phylo-distance-calculations
Builds phylogenetic trees using distance-based methods from sequence alignments
- Solves the task of generating exploratory phylogenies from aligned sequences
- Uses Biopython's Bio.Phylo.TreeConstruction for distance calculations and tree building
- Chooses Neighbor-Joining or UPGMA based on dataset size and user goals
- Returns phylogenetic trees with optional bootstrap support as output
---
name: bio-phylo-distance-calculations
description: Compute evolutionary distances and build phylogenetic trees using Biopython Bio.Phylo.TreeConstruction. Use when creating distance matrices from alignments, building NJ/UPGMA trees, generating bootstrap consensus, or needing quick exploratory phylogenies before running full ML analysis.
tool_type: python
primary_tool: Bio.Phylo.TreeConstruction
---
## Version Compatibility
Reference examples tested with: Biopython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Distance Calculations and Tree Building
**"Build a phylogenetic tree from my alignment"** → Compute evolutionary distance matrices from sequence alignments and construct neighbor-joining or UPGMA trees with bootstrap support.
- Python: `Bio.Phylo.TreeConstruction.DistanceCalculator()`, `DistanceTreeConstructor()`
Compute distances from alignments and construct phylogenetic trees.
## When to Use Distance Methods vs ML
| Scenario | Recommended Method |
|----------|-------------------|
| Quick exploratory tree before committing to a long ML run | NJ |
| Sanity check on data quality (unexpected groupings?) | NJ |
| Very large datasets where ML is prohibitive | NJ |
| Molecular clock data (ultrametric trees) | UPGMA (rare) |
| Publication-quality trees | **ML (IQ-TREE2/RAxML-NG)** or Bayesian |
| Formal hypothesis testing | **ML or Bayesian** |
NJ trees are fast (O(n^3)) and useful for exploration. For any analysis intended for publication, use ML methods (see the modern-tree-inference skill). ML programs also rely on distance trees internally: IQ-TREE includes a BIONJ tree among its starting trees.
**UPGMA warning:** UPGMA assumes a molecular clock (equal rates across all lineages). This assumption is almost never met for molecular data. Use NJ instead unless clocklike behavior has been verified.
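One rough way to check the clock assumption before choosing UPGMA is the three-point condition: in an ultrametric distance matrix, the two largest pairwise distances of every triple of taxa are equal. The sketch below applies that test to a lower-triangular list of lists (diagonal included). The helper name `is_ultrametric`, the tolerance, and the toy values are illustrative assumptions; real data are noisy and need a tolerance chosen to match.

```python
from itertools import combinations


def is_ultrametric(lower, tol=1e-6):
    """Three-point test on a lower-triangular matrix (diagonal included).

    Hypothetical helper: for every triple of taxa, the two largest of the
    three pairwise distances must agree within `tol`.
    """
    def dist(i, j):
        # Only entries with column <= row are stored
        return lower[i][j] if j <= i else lower[j][i]

    for i, j, k in combinations(range(len(lower)), 3):
        _, middle, largest = sorted((dist(i, j), dist(i, k), dist(j, k)))
        if largest - middle > tol:
            return False
    return True


# Clocklike toy data: A and B are sisters, C is equidistant from both
clocklike = [[0], [0.2, 0], [0.6, 0.6, 0]]
# Non-clocklike toy data: B is further from C than A is
skewed = [[0], [0.3, 0], [0.6, 0.8, 0]]

print(is_ultrametric(clocklike))
print(is_ultrametric(skewed))
```

A matrix that fails this test by a wide margin is a sign that NJ is the safer choice.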
## Evolutionary Distance Corrections
Raw identity-based distances underestimate true evolutionary distance because they do not account for multiple substitutions at the same site. For divergent sequences, corrected distances are more appropriate:
| Model | Correction | Use When |
|-------|------------|----------|
| Identity | None (raw mismatch proportion) | Closely related sequences; quick exploration |
| Jukes-Cantor | Assumes equal substitution rates | Simple correction for moderate divergence |
| Kimura 2-parameter | Distinguishes transitions from transversions | Better for DNA when Ti/Tv ratio differs from 1 |
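Biopython's `DistanceCalculator` does not implement Jukes-Cantor or Kimura 2-parameter. Its `identity` model returns the raw mismatch proportion, and its matrix-based models (`blastn`, `trans`, and the protein matrices) return 1 minus a normalized substitution-matrix score; none of them corrects for multiple substitutions at the same site. For corrected distances, apply the correction to the raw values yourself or use ML-based distances from IQ-TREE2 (`.mldist` output file).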
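As a minimal sketch of what a multiple-hit correction looks like, the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) can be applied by hand to raw mismatch proportions p. The helper names `jukes_cantor` and `correct_lower_triangle` and the toy values are assumptions for illustration, not part of Biopython; the input is assumed to be DNA p-distances in a lower-triangular layout with the diagonal included.

```python
import math


def jukes_cantor(p):
    """JC69 distance from the raw mismatch proportion p (DNA only)."""
    if p <= 0:
        return 0.0
    if p >= 0.75:
        # The correction is undefined at or beyond saturation
        return float("inf")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


def correct_lower_triangle(lower):
    """Apply the JC69 correction to every entry of a lower-triangular matrix."""
    return [[jukes_cantor(p) for p in row] for row in lower]


# Raw identity distances (toy values), lower-triangular with diagonal
raw = [[0], [0.1, 0], [0.3, 0.25, 0]]
corrected = correct_lower_triangle(raw)

for row in corrected:
    print([round(d, 4) for d in row])
```

Corrected values are always at least as large as the raw ones, and the gap widens as p grows. If every entry is finite, the corrected rows can be wrapped in a `DistanceMatrix` together with the taxon names and passed to the tree constructors.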
## Required Import
```python
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceMatrix
from Bio.Phylo.TreeConstruction import ParsimonyScorer, ParsimonyTreeConstructor, NNITreeSearcher
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, bootstrap_trees, bootstrap_consensus
```
## Distance Matrix from Alignment
```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator
alignment = AlignIO.read('alignment.fasta', 'fasta')
# Create calculator with distance model
calculator = DistanceCalculator('identity') # Simple identity-based distance
dm = calculator.get_distance(alignment)
print(dm)
# Available models for DNA
calculator = DistanceCalculator('blastn') # BLASTN-style distance
# Available models for protein
calculator = DistanceCalculator('blosum62') # BLOSUM62-based distance
```
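To try the calculator without an alignment file, a small alignment can be built in memory. The sketch below uses made-up sequences and identifiers (`seqA`, `seqB`, `seqC` are hypothetical) and shows that matrix entries can be looked up by taxon name.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Phylo.TreeConstruction import DistanceCalculator

# Toy alignment (made-up sequences); all rows must have the same length
toy_alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="seqA"),
    SeqRecord(Seq("ACGTACGA"), id="seqB"),
    SeqRecord(Seq("ACGAACGA"), id="seqC"),
])

toy_dm = DistanceCalculator("identity").get_distance(toy_alignment)

# Entries can be looked up by a pair of taxon names
print(toy_dm["seqA", "seqB"])  # 1 mismatch in 8 columns
print(toy_dm["seqA", "seqC"])  # 2 mismatches in 8 columns
print(toy_dm.names)
```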
## Available Distance Models
| Model | Type | Description |
|-------|------|-------------|
| `identity` | DNA/Protein | 1 - (identical positions / total) |
| `blastn` | DNA | BLASTN scoring distance |
| `trans` | DNA | Transition/transversion weighted |
| `blosum62` | Protein | BLOSUM62 matrix distance |
| `blosum45` | Protein | BLOSUM45 matrix distance |
| `blosum80` | Protein | BLOSUM80 matrix distance |
| `pam250` | Protein | PAM250 matrix distance |
| `pam30` | Protein | PAM30 matrix distance |
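The table lists a subset of the available models. Because the full set depends on the installed Biopython version, it may be safer to query the class attributes `dna_models`, `protein_models`, and `models` than to rely on a fixed list. A short sketch:

```python
from Bio.Phylo.TreeConstruction import DistanceCalculator

# Model names known to the installed Biopython version
print("DNA models:", DistanceCalculator.dna_models)
print("Protein models:", DistanceCalculator.protein_models)

# Check a model name before constructing a calculator with it
wanted = "blosum62"
is_available = wanted in DistanceCalculator.models
print(f"{wanted} available: {is_available}")
```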
## Building Trees with Distance Methods
### Neighbor Joining (NJ)
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)
Phylo.draw_ascii(nj_tree)
```
### UPGMA
```python
# Continues from the NJ example: reuses dm, Phylo, and DistanceTreeConstructor
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)
Phylo.draw_ascii(upgma_tree)
```
### One-Step Tree Building
```python
# Build tree directly from alignment (reuses calculator and alignment from the NJ example)
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)
# Or with UPGMA
constructor = DistanceTreeConstructor(calculator, 'upgma')
tree = constructor.build_tree(alignment)
```
## Pairwise Distances Between Taxa
```python
from Bio import Phylo
tree = Phylo.read('tree.nwk', 'newick')
# Distance between two taxa (sum of branch lengths)
taxon1 = tree.find_any(name='Human')
taxon2 = tree.find_any(name='Mouse')
dist = tree.distance(taxon1, taxon2)
print(f'Distance Human-Mouse: {dist:.4f}')
# All pairwise distances
terminals = tree.get_terminals()
for i, t1 in enumerate(terminals):
    for t2 in terminals[i+1:]:
        d = tree.distance(t1, t2)
        print(f'{t1.name}-{t2.name}: {d:.4f}')
```
## Creating Distance Matrix Manually
```python
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor
names = ['A', 'B', 'C', 'D']
# Lower triangular matrix (including diagonal)
matrix = [
    [0],
    [0.1, 0],
    [0.2, 0.15, 0],
    [0.3, 0.25, 0.2, 0]
]
dm = DistanceMatrix(names, matrix)
print(dm)
# Build tree from custom matrix
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
```
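Distances exported from other tools often arrive as a full square matrix rather than the lower-triangular layout used above. A minimal conversion sketch follows; `to_lower_triangular` is a hypothetical helper, not a Biopython function, and it assumes the input is square and symmetric.

```python
def to_lower_triangular(square, tol=1e-9):
    """Keep each row up to and including its diagonal entry."""
    n = len(square)
    for i, row in enumerate(square):
        if len(row) != n:
            raise ValueError("matrix is not square")
        for j in range(i):
            # Rows above i were already length-checked
            if abs(row[j] - square[j][i]) > tol:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return [list(row[: i + 1]) for i, row in enumerate(square)]


# Toy symmetric matrix for three taxa
square = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.15],
    [0.2, 0.15, 0.0],
]
lower = to_lower_triangular(square)
print(lower)
```

The resulting list can be passed as the second argument to `DistanceMatrix(names, lower)`.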
## Parsimony Tree Construction
Parsimony is largely superseded by ML for most molecular phylogenetics. It remains appropriate for morphological cladistics, rare genomic changes (retroelement insertions, gene order), and as a starting point for ML searches. Parsimony is statistically inconsistent in the Felsenstein zone (long branch attraction).
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher, ParsimonyTreeConstructor
alignment = AlignIO.read('alignment.fasta', 'fasta')
scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
# Parsimony needs a starting tree (NJ is standard)
constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'nj')
starting_tree = constructor.build_tree(alignment)
pars_constructor = ParsimonyTreeConstructor(searcher, starting_tree)
pars_tree = pars_constructor.build_tree(alignment)
print(f'Parsimony score: {scorer.get_score(pars_tree, alignment)}')
Phylo.draw_ascii(pars_tree)
```
## Bootstrap Analysis
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, bootstrap_consensus, majority_consensus
alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')
# Generate bootstrap trees
boot_trees = list(bootstrap_trees(alignment, 100, constructor))
print(f'Generated {len(boot_trees)} bootstrap trees')
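# Get bootstrap consensus. bootstrap_consensus draws its own 100 replicates;
# to reuse boot_trees instead, call majority_consensus(boot_trees).
consensus = bootstrap_consensus(alignment, 100, constructor, majority_consensus)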
Phylo.draw_ascii(consensus)
```
## Consensus Tree Methods
```python
from Bio import Phylo
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, adam_consensus
trees = list(Phylo.parse('bootstrap.nwk', 'newick'))
# Strict consensus (only clades in ALL trees)
strict = strict_consensus(trees)
# Majority rule consensus (clades in >50% of trees)
majority = majority_consensus(trees, cutoff=0.5)
# Adam consensus
adam = adam_consensus(trees)
Phylo.draw_ascii(majority)
```
## Tree Depths and Total Length
```python
from Bio import Phylo
tree = Phylo.read('tree.nwk', 'newick')
# Total branch length
total = tree.total_branch_length()
print(f'Total branch length: {total:.4f}')
# Depths from root to each node
depths = tree.depths()
for clade, depth in depths.items():
    if clade.is_terminal():
        print(f'{clade.name}: {depth:.4f}')
# Maximum depth (tree height)
tree_height = max(depths.values())
print(f'Tree height: {tree_height:.4f}')
```
## Comparing Tree Distances
```python
from Bio import Phylo
tree1 = Phylo.read('tree1.nwk', 'newick')
tree2 = Phylo.read('tree2.nwk', 'newick')
# Compare total branch lengths
len1 = tree1.total_branch_length()
len2 = tree2.total_branch_length()
print(f'Tree 1 total: {len1:.4f}')
print(f'Tree 2 total: {len2:.4f}')
# Compare specific pairwise distances
taxa = ['Human', 'Mouse']
t1 = [tree1.find_any(name=t) for t in taxa]
t2 = [tree2.find_any(name=t) for t in taxa]
d1 = tree1.distance(t1[0], t1[1])
d2 = tree2.distance(t2[0], t2[1])
print(f'Human-Mouse distance: Tree1={d1:.4f}, Tree2={d2:.4f}')
```
## Complete Pipeline: Alignment to Bootstrapped Tree
**Goal:** Build a phylogenetic tree from a sequence alignment with bootstrap support assessment for branch confidence.
**Approach:** Read the alignment, compute an identity-based distance matrix, construct a neighbor-joining tree, then generate a majority-rule bootstrap consensus from 100 replicates.
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus
alignment = AlignIO.read('sequences.aln', 'clustal')
print(f'Alignment: {len(alignment)} sequences, {alignment.get_alignment_length()} positions')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')
# Build simple tree
simple_tree = constructor.build_tree(alignment)
simple_tree.ladderize()
# Build bootstrap consensus (100 replicates)
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
consensus_tree.ladderize()
Phylo.write(simple_tree, 'nj_tree.nwk', 'newick')
Phylo.write(consensus_tree, 'bootstrap_consensus.nwk', 'newick')
```
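The pipeline writes a consensus tree, but the plain NJ tree carries no support values of its own. One option is `get_support` from `Bio.Phylo.Consensus`, which annotates the clades of a reference tree with the percentage of replicate trees that contain them. The sketch below runs on a made-up in-memory alignment (the taxon names and sequences are illustrative assumptions) with 50 replicates to keep it quick.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, get_support

# Toy alignment: five made-up 24-column sequences
toy = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGTACGTACGTACGTACGT"), id="human"),
    SeqRecord(Seq("ACGTACGTACGTACGTACGTACGA"), id="chimp"),
    SeqRecord(Seq("ACGTTCGTACGAACGTACCTACGA"), id="mouse"),
    SeqRecord(Seq("ACGTTCGTACGAACGTACCTTCGA"), id="rat"),
    SeqRecord(Seq("TCGTTCGAACGAACGGACCTTCCA"), id="chick"),
])

builder = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
reference_tree = builder.build_tree(toy)

# Resample columns with replacement and rebuild a tree per replicate
replicates = list(bootstrap_trees(toy, 50, builder))

# Annotate the reference tree; confidence is a percentage, or None
# for a clade that never appears among the replicates
supported = get_support(reference_tree, replicates)
for clade in supported.get_nonterminals():
    print(clade.name, clade.confidence)
```

Because the columns are resampled at random, the printed support values vary from run to run.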
## Quick Reference: Distance Models
### DNA Models
| Model | Description |
|-------|-------------|
| `identity` | Simple mismatch counting |
| `blastn` | BLASTN-style scoring |
| `trans` | Weights transitions vs transversions |
### Protein Models
| Model | Description |
|-------|-------------|
| `blosum62` | General proteins |
| `blosum45` | Divergent proteins |
| `blosum80` | Similar proteins |
| `pam250` | Distant homologs |
| `pam30` | Close homologs |
## Related Skills
- tree-io - Save constructed trees to files
- tree-visualization - Draw resulting trees
- tree-manipulation - Root and process built trees
- modern-tree-inference - ML tree inference for publication-quality results
- alignment/alignment-io - Read alignments for tree building
- alignment/msa-statistics - Alignment quality before tree building