bio-comparative-genomics-ancestral-reconstruction

Name: bio-comparative-genomics-ancestral-reconstruction
Author: GPTomics/bioSkills
$npx mdskill add GPTomics/bioSkills/bio-comparative-genomics-ancestral-reconstruction
Reconstruct ancient protein sequences using PAML and IQ-TREE.
Enables protein resurrection and evolutionary trajectory analysis.
Depends on PAML codeml and IQ-TREE marginal likelihood methods.
Validates Python package versions before executing reconstruction.
Outputs inferred ancestral sequences at phylogenetic nodes.
SKILL.md
.github/skills/bio-comparative-genomics-ancestral-reconstructionView on GitHub ↗
---
name: bio-comparative-genomics-ancestral-reconstruction
description: Reconstruct ancestral sequences at phylogenetic nodes using PAML and IQ-TREE marginal likelihood methods. Infer ancient protein sequences and trace evolutionary trajectories through sequence history. Use when inferring ancestral states for protein resurrection or tracing evolutionary history.
tool_type: mixed
primary_tool: PAML
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, IQ-TREE 2.2+, PAML 4.10+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Ancestral Sequence Reconstruction

**"Infer ancestral protein sequences at phylogenetic nodes"** → Reconstruct ancient sequences using marginal or joint likelihood methods on a phylogeny for ancestral protein resurrection or evolutionary trajectory analysis.
- Python: PAML `codeml` with `RateAncestor = 1` for ancestral reconstruction
- CLI: `iqtree2 -s aln -m model --ancestral` for marginal reconstruction

## When ASR is Reliable vs Unreliable

ASR accuracy depends on several factors -- understanding these limits is critical before interpreting results:

| Factor | Good for ASR | Poor for ASR |
|---|---|---|
| Tree depth | Shallow-moderate divergence | Very deep divergence (pre-Cambrian) |
| Branch lengths | Short-moderate, evenly distributed | Long branches (low confidence at those nodes) |
| Taxon sampling | Dense sampling near node of interest | Sparse sampling with long gaps |
| Alignment quality | High-confidence alignment columns | Ambiguously aligned regions |
| Model fit | Model matches data well (use ModelFinder) | Severe model misspecification |
| Sequence type | Proteins (20 states = more signal) | Nucleotides at 3rd codon positions (saturated) |

**Key limitation**: ASR reconstructs the most likely ancestral sequence given the model, but the reconstructed protein may not be functional due to epistatic interactions not captured by site-independent models. Experimentally resurrected proteins should always be tested for activity.

## Joint vs Marginal Reconstruction

- **Marginal**: Integrates over uncertainty at all other nodes; provides posterior probabilities per site. Better for identifying ambiguous positions and designing alternative constructs. Default in IQ-TREE.
- **Joint**: Finds the single most likely set of ancestral states across all nodes simultaneously. No per-site probabilities. Faster but less informative.
- Use marginal reconstruction for protein resurrection studies (need per-site confidence).

## Alignment Recommendations

- Use PRANK for coding sequences -- correctly models insertions (other aligners overestimate deletions)
- Gaps in alignments represent indels; indel reconstruction is a separate problem from substitution reconstruction
- PAML treats gaps as missing data or removes gapped columns (`cleandata=0` keeps them as ambiguity; `cleandata=1` removes all columns with any gap)
- IQ-TREE handles gaps better than PAML for large datasets with extensive indel variation

## PAML Ancestral Reconstruction

**Goal:** Reconstruct ancestral sequences at internal phylogenetic nodes using maximum likelihood.

**Approach:** Create a codeml/baseml control file with RateAncestor=1, run PAML, parse the RST file for ancestral sequences and site-wise posterior probabilities.

```python
'''Ancestral sequence reconstruction with PAML codeml/baseml'''

import subprocess
import re
from Bio import SeqIO
from Bio.Seq import Seq


def create_asr_control(alignment, tree, output_dir, seq_type='protein'):
    '''Create control file for ancestral reconstruction

    RateAncestor = 1: Enable ancestral reconstruction
    Generates RST file with ancestral sequences

    IMPORTANT: Tree MUST be rooted -- ancestral states at internal nodes
    depend on the root position. Use outgroup rooting or midpoint rooting.

    For codons: Use codeml with seqtype = 1
    For amino acids: Use codeml with seqtype = 2
    For nucleotides: Use baseml

    Model selection:
    - Proteins: LG or WAG with +G (rate variation); use ModelFinder for data-driven choice
    - Codons: M0 (single omega) for ASR; site models are for selection testing, not ASR
    - Use the same model for tree inference and ASR for consistency
    '''
    if seq_type == 'protein':
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 2
        model = 3
    aaRatefile = wag.dat

 RateAncestor = 1
    cleandata = 0
        '''
    else:  # codon
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 1
    CodonFreq = 2
        model = 0
      NSsites = 0

 RateAncestor = 1
    cleandata = 0
        '''

    ctl_file = f'{output_dir}/asr.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl)

    return ctl_file


def parse_rst_file(rst_file):
    '''Parse PAML RST file for ancestral sequences

    RST contains:
    - Tree with node numbers
    - Ancestral sequences at each node
    - Posterior probabilities for each site

    Node numbering: Extant sequences first, then internal nodes
    '''
    ancestors = {}
    current_node = None
    current_seq = []

    with open(rst_file) as f:
        content = f.read()

    # Find ancestral sequence section
    if 'Ancestral reconstruction by' in content:
        sections = content.split('Ancestral reconstruction by')
        for section in sections[1:]:
            lines = section.strip().split('\n')
            for line in lines:
                if line.startswith('node #'):
                    if current_node and current_seq:
                        ancestors[current_node] = ''.join(current_seq)
                    match = re.search(r'node #(\d+)', line)
                    if match:
                        current_node = f'Node_{match.group(1)}'
                        current_seq = []
                elif current_node and line.strip() and not line.startswith(' '):
                    # Sequence line
                    seq_part = ''.join(line.split()[1:]) if len(line.split()) > 1 else ''
                    current_seq.append(seq_part)

    if current_node and current_seq:
        ancestors[current_node] = ''.join(current_seq)

    return ancestors


def extract_marginal_probabilities(rst_file):
    '''Extract site-wise posterior probabilities

    High confidence: P > 0.95 (commonly used threshold)
    Moderate confidence: P > 0.80
    Low confidence: P < 0.80 (consider alternatives)

    Report ambiguous sites for experimental validation
    '''
    site_probs = []

    with open(rst_file) as f:
        in_probs = False
        for line in f:
            if 'Prob of best state' in line:
                in_probs = True
                continue
            if in_probs and line.strip():
                parts = line.split()
                if len(parts) >= 3:
                    try:
                        site = int(parts[0])
                        state = parts[1]
                        prob = float(parts[2])
                        site_probs.append({
                            'site': site,
                            'state': state,
                            'probability': prob,
                            'confidence': 'high' if prob > 0.95 else 'moderate' if prob > 0.8 else 'low'
                        })
                    except ValueError:
                        in_probs = False

    return site_probs
```

## IQ-TREE Ancestral Reconstruction

**Goal:** Perform ancestral reconstruction using IQ-TREE's marginal likelihood method.

**Approach:** Run iqtree2 with --ancestral flag to produce a .state file containing per-node, per-site state probabilities, then parse into ancestral sequences.

```python
def run_iqtree_asr(alignment, tree=None, model='LG+G4', output_prefix='asr'):
    '''Run IQ-TREE for ancestral sequence reconstruction

    IQ-TREE provides:
    - Marginal reconstruction (default)
    - Joint reconstruction (-asr-joint)
    - State file (.state) with probabilities

    Advantages over PAML:
    - Automatic model selection
    - Better handling of gaps
    - Faster for large datasets
    '''
    cmd = f'iqtree2 -s {alignment} -m {model} --ancestral -pre {output_prefix}'

    if tree:
        cmd += f' -te {tree}'

    subprocess.run(cmd, shell=True)

    return f'{output_prefix}.state'


def parse_iqtree_state(state_file):
    '''Parse IQ-TREE .state file

    Format: Node  Site  State  Probability  [other states and probs]
    '''
    ancestors = {}

    with open(state_file) as f:
        next(f)  # Skip header
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) >= 4:
                node = parts[0]
                site = int(parts[1])
                state = parts[2]
                prob = float(parts[3])

                if node not in ancestors:
                    ancestors[node] = {'sequence': [], 'probabilities': []}
                ancestors[node]['sequence'].append(state)
                ancestors[node]['probabilities'].append(prob)

    # Convert to sequences
    for node in ancestors:
        ancestors[node]['sequence'] = ''.join(ancestors[node]['sequence'])

    return ancestors
```

## Alternative State Analysis

```python
def get_alternative_states(site_probs, threshold=0.1):
    '''Identify sites with plausible alternative ancestral states

    Alternative states with P > 0.1 should be considered
    for experimental validation (ancestral protein resurrection)

    These sites may:
    - Affect function differently
    - Represent true ancestral ambiguity
    - Be targets for directed evolution
    '''
    ambiguous_sites = []

    for site_data in site_probs:
        if 'alternatives' in site_data:
            significant_alts = [
                alt for alt in site_data['alternatives']
                if alt['probability'] > threshold
            ]
            if significant_alts:
                ambiguous_sites.append({
                    'site': site_data['site'],
                    'best_state': site_data['state'],
                    'best_prob': site_data['probability'],
                    'alternatives': significant_alts
                })

    return ambiguous_sites


def calculate_sequence_confidence(site_probs):
    '''Calculate overall confidence in ancestral sequence

    Metrics:
    - Mean posterior probability
    - Fraction of high-confidence sites (P > 0.95)
    - Number of ambiguous positions
    '''
    if not site_probs:
        return None

    probs = [s['probability'] for s in site_probs]
    high_conf = sum(1 for p in probs if p > 0.95) / len(probs)
    low_conf = sum(1 for p in probs if p < 0.8)

    return {
        'mean_probability': sum(probs) / len(probs),
        'high_confidence_fraction': high_conf,
        'low_confidence_sites': low_conf,
        'total_sites': len(probs),
        'overall_quality': 'high' if high_conf > 0.9 else 'moderate' if high_conf > 0.7 else 'low'
    }
```

## Ancestral Protein Resurrection

**Goal:** Design protein constructs for experimental resurrection of ancestral proteins.

**Approach:** Use the maximum-likelihood ancestral sequence as the primary construct, then create alternative constructs at ambiguous sites (P < 0.95) for experimental validation.

```python
def design_asr_construct(ancestral_seq, extant_reference, ambiguous_sites):
    '''Design constructs for ancestral protein resurrection

    Strategy:
    1. Use most probable (ML) state at each position as primary construct
    2. Create alternative constructs at ambiguous sites (P < 0.95)
    3. Consider codon optimization for expression host

    Epistasis warning: Site-independent models assume positions evolve
    independently, but intramolecular epistasis means certain residue
    combinations may be incompatible. The ML ancestral sequence may contain
    never-tested combinations of residues, potentially yielding non-functional
    proteins. Test multiple alternative constructs, not just the ML sequence.

    Validation:
    - Test activity of primary + alternative resurrected proteins
    - Compare activity to extant homologs as positive controls
    - Titrate ambiguous positions systematically (especially buried residues)
    '''
    constructs = [{'name': 'ASR_ML', 'sequence': ancestral_seq, 'description': 'Maximum likelihood ancestral'}]

    # Create alternative constructs for ambiguous sites
    for site in ambiguous_sites[:5]:  # Limit to top 5 ambiguous
        alt_seq = list(ancestral_seq)
        best_alt = site['alternatives'][0]
        alt_seq[site['site'] - 1] = best_alt['state']
        constructs.append({
            'name': f"ASR_alt_{site['site']}",
            'sequence': ''.join(alt_seq),
            'description': f"Alternative at position {site['site']}: {best_alt['state']}"
        })

    return constructs
```

## Related Skills

- comparative-genomics/positive-selection - Selection analysis on ancestral branches
- comparative-genomics/ortholog-inference - Identify orthologs for reconstruction
- phylogenetics/modern-tree-inference - Generate rooted trees for ASR
- alignment/multiple-alignment - PRANK alignment for ASR (correctly handles indels)
- alignment/pairwise-alignment - Prepare MSA for reconstruction