bio-phylo-distance-calculations

Name: bio-phylo-distance-calculations
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-phylo-distance-calculations
Builds phylogenetic trees using distance-based methods from sequence alignments
Solves the task of generating exploratory phylogenies from aligned sequences
Uses Biopython's Bio.Phylo.TreeConstruction for distance calculations and tree building
Chooses Neighbor-Joining or UPGMA based on dataset size and user goals
Returns phylogenetic trees with optional bootstrap support as output
SKILL.md
.github/skills/bio-phylo-distance-calculations
---
name: bio-phylo-distance-calculations
description: Compute evolutionary distances and build phylogenetic trees using Biopython Bio.Phylo.TreeConstruction. Use when creating distance matrices from alignments, building NJ/UPGMA trees, generating bootstrap consensus, or needing quick exploratory phylogenies before running full ML analysis.
tool_type: python
primary_tool: Bio.Phylo.TreeConstruction
---

## Version Compatibility

Reference examples tested with: Biopython 1.83+, NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Distance Calculations and Tree Building

**"Build a phylogenetic tree from my alignment"** → Compute evolutionary distance matrices from sequence alignments and construct neighbor-joining or UPGMA trees with bootstrap support.
- Python: `Bio.Phylo.TreeConstruction.DistanceCalculator()`, `DistanceTreeConstructor()`

Compute distances from alignments and construct phylogenetic trees.

## When to Use Distance Methods vs ML

| Scenario | Recommended Method |
|----------|-------------------|
| Quick exploratory tree before committing to a long ML run | NJ |
| Sanity check on data quality (unexpected groupings?) | NJ |
| Very large datasets where ML is prohibitive | NJ |
| Molecular clock data (ultrametric trees) | UPGMA (rare) |
| Publication-quality trees | **ML (IQ-TREE2/RAxML-NG)** or Bayesian |
| Formal hypothesis testing | **ML or Bayesian** |

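NJ trees are fast (O(n^3) in the number of taxa) and useful for exploration. For any analysis intended for publication, use ML methods (see modern-tree-inference skill). ML programs also rely on quick starting trees: IQ-TREE includes a BIONJ tree among its initial candidate trees, while RAxML-NG starts from random and parsimony trees by default.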

**UPGMA warning:** UPGMA assumes a molecular clock (equal rates across all lineages). This assumption is almost never met for molecular data. Use NJ instead unless clocklike behavior has been verified.
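
One rough way to check for clocklike behavior before trusting UPGMA is the three-point condition: in an ultrametric distance matrix, the two largest of the three distances within every triple of taxa are equal. The sketch below is plain Python, not a Biopython API; the function name `is_ultrametric`, the dict-of-pairs input layout, and the tolerance are assumptions made for illustration.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-6):
    """Return True if every triple of taxa satisfies the three-point condition.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance (hypothetical layout)
    tol:   allowed absolute difference between the two largest distances
    """
    for a, b, c in combinations(names, 3):
        # Sort the three pairwise distances of this triple
        d = sorted([
            dist[frozenset((a, b))],
            dist[frozenset((a, c))],
            dist[frozenset((b, c))],
        ])
        # Ultrametric: the two largest distances are equal (within tol)
        if abs(d[2] - d[1]) > tol:
            return False
    return True


names = ['A', 'B', 'C']
clocklike = {frozenset('AB'): 0.2, frozenset('AC'): 0.6, frozenset('BC'): 0.6}
skewed = {frozenset('AB'): 0.2, frozenset('AC'): 0.6, frozenset('BC'): 0.9}

print(is_ultrametric(names, clocklike))
print(is_ultrametric(names, skewed))
```

Real data are never exactly ultrametric, so the tolerance has to be chosen relative to the scale of the distances; a formal clock test belongs in an ML or Bayesian framework.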

## Evolutionary Distance Corrections

Raw identity-based distances underestimate true evolutionary distance because they do not account for multiple substitutions at the same site. For divergent sequences, corrected distances are more appropriate:

| Model | Correction | Use When |
|-------|------------|----------|
| Identity | None (raw mismatch proportion) | Closely related sequences; quick exploration |
| Jukes-Cantor | Assumes equal substitution rates | Simple correction for moderate divergence |
| Kimura 2-parameter | Distinguishes transitions from transversions | Better for DNA when transitions and transversions occur at different rates |

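Biopython's `DistanceCalculator` does not implement Jukes-Cantor or Kimura 2-parameter. Its `identity` model is the raw mismatch proportion, and `blastn` and `trans` are score-matrix-based dissimilarities rather than corrections for multiple substitutions. For model-based evolutionary distances, use ML distances from IQ-TREE2 (`.mldist` output file).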
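
For a quick check of how much the corrections in the table matter, the standard closed-form Jukes-Cantor and Kimura 2-parameter formulas can be applied by hand to one pair of aligned sequences. This is a minimal sketch in plain Python; the helper names (`site_proportions`, `jc69`, `k2p`) are illustrative and not part of Biopython, and columns with gaps or ambiguity codes are simply skipped.

```python
import math

PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}


def site_proportions(seq1, seq2):
    """Return (P, Q): proportions of transition and transversion differences.

    Columns where either sequence has anything other than A, C, G, T are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in 'ACGT' or b not in 'ACGT':
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError('no comparable sites')
    return transitions / compared, transversions / compared


def jc69(p):
    """Jukes-Cantor distance from the raw mismatch proportion p (needs p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


P, Q = site_proportions('AAAAAAAAAA', 'GAAAAAAAAC')
p = P + Q
print(f'raw p = {p:.3f}, JC69 = {jc69(p):.4f}, K2P = {k2p(P, Q):.4f}')
```

Both corrected values are larger than the raw proportion, and the gap widens as sequences diverge; the formulas are undefined once the logarithm's argument reaches zero (saturation).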

## Required Import

```python
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceMatrix
from Bio.Phylo.TreeConstruction import ParsimonyScorer, ParsimonyTreeConstructor, NNITreeSearcher
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, bootstrap_trees, bootstrap_consensus
```

## Distance Matrix from Alignment

```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator

alignment = AlignIO.read('alignment.fasta', 'fasta')

# Create calculator with distance model
calculator = DistanceCalculator('identity')  # Simple identity-based distance
dm = calculator.get_distance(alignment)
print(dm)

# List every model name the installed version accepts
print(DistanceCalculator.models)

# Example DNA model
calculator = DistanceCalculator('blastn')  # BLASTN-style distance

# Example protein model
calculator = DistanceCalculator('blosum62')  # BLOSUM62-based distance
```

## Available Distance Models

| Model | Type | Description |
|-------|------|-------------|
| `identity` | DNA/Protein | 1 - (identical positions / total) |
| `blastn` | DNA | BLASTN scoring distance |
| `trans` | DNA | Transition/transversion weighted |
| `blosum62` | Protein | BLOSUM62 matrix distance |
| `blosum45` | Protein | BLOSUM45 matrix distance |
| `blosum80` | Protein | BLOSUM80 matrix distance |
| `pam250` | Protein | PAM250 matrix distance |
| `pam30` | Protein | PAM30 matrix distance |

## Building Trees with Distance Methods

### Neighbor Joining (NJ)

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)
Phylo.draw_ascii(nj_tree)
```
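
To make the joining criterion concrete, the sketch below computes the textbook NJ Q-value, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), for every pair and reports the pair that would be joined first together with its two branch lengths. It is plain Python for illustration only; `nj_pick_pair` is a hypothetical helper, and Biopython's own implementation may differ in details such as tie-breaking.

```python
def nj_pick_pair(names, d):
    """Return the pair NJ would join first, its Q-value, and its branch lengths.

    names: list of taxon labels (needs at least 3 taxa)
    d:     full symmetric matrix as a list of lists (d[i][j] == d[j][i])
    """
    n = len(names)
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from each joined taxon to the new internal node
    len_i = 0.5 * d[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return (names[i], names[j]), q, (len_i, len_j)


names = ['a', 'b', 'c', 'd', 'e']
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q, (len_i, len_j) = nj_pick_pair(names, d)
print(pair, q, len_i, len_j)
```

Note that the pair with the smallest raw distance (`d` and `e`, distance 3) is not the one chosen; the row-sum terms are what let NJ cope with unequal rates where UPGMA cannot.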

### UPGMA

```python
# Reuses dm from the NJ example
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)
Phylo.draw_ascii(upgma_tree)
```

### One-Step Tree Building

```python
# Build tree directly from alignment
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Or with UPGMA
constructor = DistanceTreeConstructor(calculator, 'upgma')
tree = constructor.build_tree(alignment)
```

## Pairwise Distances Between Taxa

```python
from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Distance between two taxa (sum of branch lengths)
taxon1 = tree.find_any(name='Human')
taxon2 = tree.find_any(name='Mouse')
dist = tree.distance(taxon1, taxon2)
print(f'Distance Human-Mouse: {dist:.4f}')

# All pairwise distances
terminals = tree.get_terminals()
for i, t1 in enumerate(terminals):
    for t2 in terminals[i+1:]:
        d = tree.distance(t1, t2)
        print(f'{t1.name}-{t2.name}: {d:.4f}')
```
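
The distance between two taxa is the sum of branch lengths along the path through their most recent common ancestor. The sketch below reproduces that arithmetic on a toy tree in plain Python, which can help when checking a value by hand; the child-to-parent dictionary and the names `path_to_root` and `patristic` are assumptions for illustration, not Biopython structures.

```python
def path_to_root(node, parent):
    """Return {ancestor: distance from node to that ancestor}, including node itself."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist


def patristic(a, b, parent):
    """Sum of branch lengths on the path between a and b (non-negative lengths)."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # Among shared ancestors, the most recent one gives the smallest total
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Toy tree ((Human:0.1,Chimp:0.12)P:0.3,Mouse:0.5)root as child -> (parent, length)
parent = {
    'Human': ('P', 0.1),
    'Chimp': ('P', 0.12),
    'P': ('root', 0.3),
    'Mouse': ('root', 0.5),
}
print(f"Human-Mouse: {patristic('Human', 'Mouse', parent):.4f}")
print(f"Human-Chimp: {patristic('Human', 'Chimp', parent):.4f}")
```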

## Creating Distance Matrix Manually

```python
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

names = ['A', 'B', 'C', 'D']
# Lower triangular matrix (including diagonal)
matrix = [
    [0],
    [0.1, 0],
    [0.2, 0.15, 0],
    [0.3, 0.25, 0.2, 0]
]
dm = DistanceMatrix(names, matrix)
print(dm)

# Build tree from custom matrix
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
```
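
Distances exported by other tools usually arrive as a full square matrix, whereas `DistanceMatrix` takes lower-triangular rows that include the diagonal. A minimal conversion sketch follows; `to_lower_triangle` is a hypothetical helper, and the symmetry check with its tolerance is an assumption about what is worth validating.

```python
def to_lower_triangle(square, tol=1e-9):
    """Convert a full symmetric matrix to lower-triangular rows including the diagonal."""
    n = len(square)
    for i in range(n):
        if len(square[i]) != n:
            raise ValueError('matrix is not square')
        for j in range(i):
            if abs(square[i][j] - square[j][i]) > tol:
                raise ValueError(f'matrix is not symmetric at ({i}, {j})')
    # Row i keeps columns 0..i, so the last entry of each row is the diagonal
    return [list(square[i][: i + 1]) for i in range(n)]


square = [
    [0, 0.1, 0.2],
    [0.1, 0, 0.15],
    [0.2, 0.15, 0],
]
lower = to_lower_triangle(square)
print(lower)
```

The resulting list of rows can be passed as the `matrix` argument together with the matching list of names.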

## Parsimony Tree Construction

Parsimony is largely superseded by ML for most molecular phylogenetics. It remains appropriate for morphological cladistics, rare genomic changes (retroelement insertions, gene order), and as a starting point for ML searches. Parsimony is statistically inconsistent in the Felsenstein zone (long branch attraction).

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher, ParsimonyTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)

# Starting tree for the NNI search (optional argument; NJ is a common choice)
constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'nj')
starting_tree = constructor.build_tree(alignment)

pars_constructor = ParsimonyTreeConstructor(searcher, starting_tree)
pars_tree = pars_constructor.build_tree(alignment)

print(f'Parsimony score: {scorer.get_score(pars_tree, alignment)}')
Phylo.draw_ascii(pars_tree)
```
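
A parsimony score counts the minimum number of state changes a tree requires. The sketch below works through the Fitch algorithm, the usual unweighted method, on a toy four-taxon alignment in plain Python. The nested-tuple tree and the names `fitch_site` and `fitch_score` are assumptions for illustration; treat any correspondence with `ParsimonyScorer` internals as unverified.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one alignment column on a binary tree.

    tree:   nested 2-tuples of leaf names, e.g. (('A', 'B'), ('C', 'D'))
    states: dict mapping leaf name -> character at this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed
        return common, left_cost + right_cost
    # Children disagree: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1


def fitch_score(tree, alignment):
    """Sum Fitch changes over all columns; alignment maps leaf name -> sequence."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {'A': 'AAG', 'B': 'AAG', 'C': 'ACG', 'D': 'TCG'}
tree1 = (('A', 'B'), ('C', 'D'))
tree2 = (('A', 'C'), ('B', 'D'))
print(fitch_score(tree1, alignment), fitch_score(tree2, alignment))
```

The first topology needs fewer changes, so a parsimony search would prefer it over the second.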

## Bootstrap Analysis

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, bootstrap_consensus, majority_consensus

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Generate bootstrap trees
boot_trees = list(bootstrap_trees(alignment, 100, constructor))
print(f'Generated {len(boot_trees)} bootstrap trees')

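# Get bootstrap consensus (bootstrap_consensus draws its own 100 replicates;
# to reuse boot_trees instead, call majority_consensus(boot_trees))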
consensus = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
Phylo.draw_ascii(consensus)
```
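
Each bootstrap replicate is a pseudo-alignment built by sampling columns of the original alignment with replacement, after which a tree is built from it in the usual way. The resampling step can be sketched in plain Python as follows; the dict-based alignment and the name `resample_columns` are assumptions for illustration, and a seeded `random.Random` is used only to make the example reproducible.

```python
import random


def resample_columns(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement.

    alignment: dict mapping name -> aligned sequence (all the same length)
    rng:       random.Random instance, so the replicate is reproducible
    """
    length = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, keeping columns intact
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: ''.join(seq[i] for i in picks) for name, seq in alignment.items()}


alignment = {'A': 'ACGTAC', 'B': 'ACGTTC', 'C': 'TCGTAA'}
replicate = resample_columns(alignment, random.Random(42))
for name, seq in replicate.items():
    print(name, seq)
```

Because whole columns are resampled, short alignments yield noisy support values no matter how many replicates are drawn.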

## Consensus Tree Methods

```python
from Bio import Phylo
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, adam_consensus

trees = list(Phylo.parse('bootstrap.nwk', 'newick'))

# Strict consensus (only clades in ALL trees)
strict = strict_consensus(trees)

# Majority rule consensus (clades in >50% of trees)
majority = majority_consensus(trees, cutoff=0.5)

# Adam consensus
adam = adam_consensus(trees)

Phylo.draw_ascii(majority)
```
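
The difference between strict and majority-rule consensus comes down to how often each clade appears across the input trees. The sketch below counts clade frequencies for three toy rooted trees in plain Python; the nested-tuple trees and the names `clades` and `clade_support` are assumptions for illustration and do not mirror how Biopython stores trees.

```python
from collections import Counter


def clades(tree):
    """Return (leaf_set, clades) for a tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def clade_support(trees):
    """Return {clade: fraction of trees containing it}, skipping the all-taxa clade."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        counts.update(c for c in found if c != all_leaves)
    return {c: n / len(trees) for c, n in counts.items()}


trees = [
    ((('A', 'B'), 'C'), 'D'),
    ((('A', 'B'), 'D'), 'C'),
    ((('A', 'C'), 'B'), 'D'),
]
support = clade_support(trees)
strict = {c for c, f in support.items() if f == 1.0}
majority = {c for c, f in support.items() if f > 0.5}
print('strict:', [sorted(c) for c in strict])
print('majority:', sorted(sorted(c) for c in majority))
```

Here no clade appears in all three trees, so the strict consensus is fully unresolved, while two clades pass the 50% threshold.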

## Tree Depths and Total Length

```python
from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Total branch length
total = tree.total_branch_length()
print(f'Total branch length: {total:.4f}')

# Depths from root to each node
depths = tree.depths()
for clade, depth in depths.items():
    if clade.is_terminal():
        print(f'{clade.name}: {depth:.4f}')

# Maximum depth (tree height)
tree_height = max(depths.values())
print(f'Tree height: {tree_height:.4f}')
```

## Comparing Tree Distances

```python
from Bio import Phylo

tree1 = Phylo.read('tree1.nwk', 'newick')
tree2 = Phylo.read('tree2.nwk', 'newick')

# Compare total branch lengths
len1 = tree1.total_branch_length()
len2 = tree2.total_branch_length()
print(f'Tree 1 total: {len1:.4f}')
print(f'Tree 2 total: {len2:.4f}')

# Compare specific pairwise distances
taxa = ['Human', 'Mouse']
t1 = [tree1.find_any(name=t) for t in taxa]
t2 = [tree2.find_any(name=t) for t in taxa]

d1 = tree1.distance(t1[0], t1[1])
d2 = tree2.distance(t2[0], t2[1])
print(f'Human-Mouse distance: Tree1={d1:.4f}, Tree2={d2:.4f}')
```
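
Branch-length totals and single pairwise distances say nothing about whether two trees group the taxa the same way. A simple topological complement is to count the clades found in only one of the two trees, a rooted variant of the Robinson-Foulds distance. The sketch below is plain Python on nested-tuple toy trees; `clade_set` and `rooted_rf` are hypothetical helpers, and unrooted comparisons would need bipartitions instead of clades.

```python
def clade_set(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def rooted_rf(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clade_set(tree_a) ^ clade_set(tree_b))


t1 = ((('Human', 'Chimp'), 'Mouse'), 'Chicken')
t2 = ((('Human', 'Mouse'), 'Chimp'), 'Chicken')
print(rooted_rf(t1, t2))
print(rooted_rf(t1, t1))
```

A value of zero means identical rooted topologies; larger values mean more conflicting groupings, provided both trees contain the same taxa.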

## Complete Pipeline: Alignment to Bootstrapped Tree

**Goal:** Build a phylogenetic tree from a sequence alignment with bootstrap support assessment for branch confidence.

**Approach:** Read the alignment, compute an identity-based distance matrix, construct a neighbor-joining tree, then generate a majority-rule bootstrap consensus from 100 replicates.

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus

alignment = AlignIO.read('sequences.aln', 'clustal')
print(f'Alignment: {len(alignment)} sequences, {alignment.get_alignment_length()} positions')

calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Build simple tree
simple_tree = constructor.build_tree(alignment)
simple_tree.ladderize()

# Build bootstrap consensus (100 replicates)
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
consensus_tree.ladderize()

Phylo.write(simple_tree, 'nj_tree.nwk', 'newick')
Phylo.write(consensus_tree, 'bootstrap_consensus.nwk', 'newick')
```

## Quick Reference: Distance Models

### DNA Models
| Model | Description |
|-------|-------------|
| `identity` | Simple mismatch counting |
| `blastn` | BLASTN-style scoring |
| `trans` | Weights transitions vs transversions |

### Protein Models
| Model | Description |
|-------|-------------|
| `blosum62` | General proteins |
| `blosum45` | Divergent proteins |
| `blosum80` | Similar proteins |
| `pam250` | Distant homologs |
| `pam30` | Close homologs |

## Related Skills

- tree-io - Save constructed trees to files
- tree-visualization - Draw resulting trees
- tree-manipulation - Root and process built trees
- modern-tree-inference - ML tree inference for publication-quality results
- alignment/alignment-io - Read alignments for tree building
- alignment/msa-statistics - Alignment quality before tree building