---
name: bio-structural-biology-modern-structure-prediction
description: Predict protein structures using modern ML models including AlphaFold3, ESMFold, Chai-1, and Boltz-1. Use when predicting structures for novel proteins, protein complexes, or when comparing predictions across multiple methods.
tool_type: python
primary_tool: ESMFold
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
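The introspection step above can be scripted. The following is a minimal sketch using only the standard library; the helper name `describe_callable` is hypothetical and not part of any package mentioned here.

```python
import importlib
import inspect
from importlib import metadata


def describe_callable(package, module_name, attr_path):
    '''Return (installed_version, signature) for a callable; None where unavailable.'''
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    try:
        obj = importlib.import_module(module_name)
        # Walk dotted attributes, e.g. 'Superimposer.set_atoms'
        for name in attr_path.split('.'):
            obj = getattr(obj, name)
        signature = str(inspect.signature(obj))
    except (ImportError, AttributeError, TypeError, ValueError):
        signature = None
    return version, signature


# Prints (None, None) if Biopython is not installed
print(describe_callable('biopython', 'Bio.PDB', 'Superimposer.set_atoms'))
```

A `None` signature is the cue to inspect the installed package by hand before adapting an example.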
# Modern Structure Prediction
**"Predict the structure of my protein"** → Run ML-based structure prediction using ESMFold (single-sequence, fast), AlphaFold3 (MSA-based, highest accuracy), Chai-1, or Boltz-1 and compare predictions across methods.
- Python: ESMFold API via `requests`, local ESMFold with `esm.pretrained`
Predict protein structures using state-of-the-art machine learning models. This covers cloud APIs, local installations, and interpretation of results.
## Model Comparison
| Model | Complexes | Ligands | Speed | Access |
|-------|-----------|---------|-------|--------|
| AlphaFold3 | Yes | Yes | Slow | Server, or local code (weights on request) |
| ESMFold | No | No | Fast | API or local |
| Chai-1 | Yes | Yes | Moderate | Local or API |
| Boltz-1 | Yes | Yes | Moderate | Local |
| ColabFold | No* | No | Moderate | Colab/local |
*ColabFold can predict complexes with AlphaFold-Multimer.
## ESMFold (Fastest Single-Chain)
**Goal:** Predict a protein's 3D structure from its amino acid sequence using the ESMFold language model, which requires no MSA and runs in seconds.
**Approach:** Submit the sequence to the ESMFold API (or run locally with the esm library), retrieve the predicted PDB coordinates, and assess per-residue confidence via pLDDT scores in the B-factor column.
### Via ESM Atlas API
```python
import requests

def predict_esmfold(sequence):
    '''Predict structure using ESMFold API'''
    url = 'https://api.esmatlas.com/foldSequence/v1/pdb/'
    response = requests.post(url, data=sequence, timeout=300)
    if response.status_code == 200:
        return response.text
    raise Exception(f'ESMFold failed: {response.status_code}')

sequence = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH'
pdb_text = predict_esmfold(sequence)
with open('predicted.pdb', 'w') as f:
    f.write(pdb_text)
```
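Before submitting, it can help to normalize and validate the sequence so that malformed input fails locally rather than at the API. The sketch below is an assumption-laden helper, not part of the ESM tooling: the name `clean_sequence` is hypothetical, and the 400-residue default reflects a commonly cited limit of the public endpoint that should be checked against the current service.

```python
VALID_RESIDUES = set('ACDEFGHIKLMNPQRSTVWY')


def clean_sequence(raw, max_length=400):
    '''Strip FASTA headers and whitespace, uppercase, and validate residues.'''
    # Drop header lines, then remove all remaining whitespace
    body = ''.join(line for line in raw.splitlines() if not line.startswith('>'))
    seq = ''.join(body.split()).upper()
    if not seq:
        raise ValueError('Empty sequence')
    unknown = sorted(set(seq) - VALID_RESIDUES)
    if unknown:
        raise ValueError(f'Non-standard residues: {unknown}')
    # max_length is an assumed service limit; adjust to the deployment in use
    if len(seq) > max_length:
        raise ValueError(f'Sequence has {len(seq)} residues, limit is {max_length}')
    return seq


print(clean_sequence('>example\nmvls padk\nTNV'))
```

The cleaned string can then be passed to `predict_esmfold` unchanged.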
### Local ESMFold
```python
import torch
import esm

def predict_esmfold_local(sequence, device='cuda'):
    '''Run ESMFold locally (requires ~16GB GPU memory)'''
    model = esm.pretrained.esmfold_v1()
    model = model.eval().to(device)
    with torch.no_grad():
        output = model.infer_pdb(sequence)
    return output

# Extract pLDDT from ESMFold output (stored in the B-factor column of CA atoms)
def extract_esmfold_plddt(pdb_text):
    plddt = {}
    for line in pdb_text.split('\n'):
        if line.startswith('ATOM') and line[12:16].strip() == 'CA':
            resnum = int(line[22:26])
            bfactor = float(line[60:66])
            plddt[resnum] = bfactor
    return plddt
```
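Once per-residue pLDDT values are extracted, a short summary makes it easier to decide whether a model is usable. The following sketch assumes values on a 0-100 scale and uses the conventional 70 and 90 cutoffs as default thresholds; both the helper name `summarize_plddt` and the cutoffs are assumptions to adjust, and values reported on a 0-1 scale would need rescaling first.

```python
def summarize_plddt(plddt, confident=70.0, very_confident=90.0):
    '''Summarize a {residue_number: pLDDT} mapping such as extract_esmfold_plddt returns.'''
    if not plddt:
        raise ValueError('No pLDDT values supplied')
    values = list(plddt.values())
    n = len(values)
    return {
        'mean': sum(values) / n,
        'frac_confident': sum(v >= confident for v in values) / n,
        'frac_very_confident': sum(v >= very_confident for v in values) / n,
        # Residues below the cutoff are candidates for disorder or poor modelling
        'low_confidence_residues': sorted(r for r, v in plddt.items() if v < confident),
    }


example = {1: 95.0, 2: 80.0, 3: 40.0, 4: 65.0}
print(summarize_plddt(example))
```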
## AlphaFold3 (Server)
Submit AlphaFold3 jobs through the web server at alphafoldserver.com, then download the results and analyze them locally.
### Prepare Input JSON
```python
import json

def create_af3_input(sequences, job_name='prediction'):
    '''Create AlphaFold3 server input JSON with one protein entity per sequence'''
    entities = []
    for seq in sequences:
        # The server dialect wraps each protein in a 'proteinChain' object
        entities.append({
            'proteinChain': {
                'sequence': seq,
                'count': 1
            }
        })
    job = {
        'name': job_name,
        'modelSeeds': [],  # Empty list lets the server pick the seed
        'sequences': entities,
        'dialect': 'alphafoldserver',
        'version': 1
    }
    # The server expects a list of jobs; check the current server docs if upload fails
    return json.dumps([job], indent=2)

# Single protein
input_json = create_af3_input(['MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH'])

# Protein complex
input_json = create_af3_input([
    'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH',
    'MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSS'
])
```
### Process AF3 Results
```python
import json

def analyze_af3_result(result_dir):
    '''Analyze AlphaFold3 prediction results'''
    # Load summary (downloaded files may carry a job or model prefix; adjust the name if so)
    with open(f'{result_dir}/summary_confidences.json') as f:
        summary = json.load(f)

    # Extract confidence metrics
    iptm = summary.get('iptm')  # Interface pTM (complexes)
    ptm = summary.get('ptm')    # Predicted TM-score
    ranking = summary.get('ranking_score')

    # Compare against None so a legitimate score of 0.0 is still reported
    print(f'pTM: {ptm:.3f}' if ptm is not None else 'pTM: N/A')
    print(f'ipTM: {iptm:.3f}' if iptm is not None else 'ipTM: N/A')
    print(f'Ranking score: {ranking:.3f}' if ranking is not None else 'Ranking score: N/A')
    return summary
```
### AF3 Confidence Interpretation
| Metric | Range | Interpretation |
|--------|-------|----------------|
| pTM | 0-1 | Overall structure confidence (higher is better) |
| ipTM | 0-1 | Interface prediction quality (higher is better) |
| pLDDT | 0-100 | Per-residue confidence (higher is better) |
| PAE | 0-30 Å | Expected position error between residue pairs (lower is better) |
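For complexes, the PAE values between residues on different chains are usually the most informative part of the matrix. The sketch below averages those inter-chain entries; it takes the matrix and per-residue chain labels as plain Python lists because the way each tool stores them differs, and the helper name `mean_interchain_pae` is hypothetical.

```python
def mean_interchain_pae(pae, chain_ids):
    '''Average PAE over residue pairs that sit on different chains.

    pae: square matrix as nested lists (Angstrom).
    chain_ids: one chain label per row/column of the matrix.
    Returns None when all residues belong to one chain.
    '''
    n = len(chain_ids)
    if len(pae) != n or any(len(row) != n for row in pae):
        raise ValueError('PAE matrix must be square and match chain_ids in size')
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if chain_ids[i] != chain_ids[j]:
                total += pae[i][j]
                count += 1
    return total / count if count else None


pae = [
    [0, 1, 10, 12],
    [1, 0, 8, 14],
    [9, 11, 0, 2],
    [13, 7, 2, 0],
]
print(mean_interchain_pae(pae, ['A', 'A', 'B', 'B']))
```

A low inter-chain average alongside a high ipTM suggests the relative placement of the chains is well supported.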
## Chai-1 (Local Open-Source)
### Installation
```bash
pip install chai-lab
```
### Basic Prediction
```python
from pathlib import Path

from chai_lab.chai1 import run_inference

def predict_chai1(fasta_path, output_dir='chai_output'):
    '''Run Chai-1 structure prediction'''
    # Use a fresh directory per run; chai-lab may refuse to write into a non-empty one
    Path(output_dir).mkdir(exist_ok=True)
    candidates = run_inference(
        fasta_file=Path(fasta_path),
        output_dir=Path(output_dir),
        num_trunk_recycles=3,      # 3: Standard. Use 5+ for difficult targets.
        num_diffn_timesteps=200,   # 200: Standard. 500 for higher quality.
        seed=42,
        device='cuda:0'
    )
    return candidates

# candidates.cif_paths lists the predicted structures (mmCIF files)
# candidates.ranking_data holds per-model scores for choosing the best candidate
```
### Chai-1 with Ligands
```python
# Chai-1 supports protein-ligand complexes
# Each FASTA header gives the entity type, then a name: >protein|name=... or >ligand|name=...
# Ligand records carry a SMILES string in place of a sequence
def create_chai_fasta_with_ligand(protein_seq, ligand_smiles, output_file):
    '''Create Chai-1 input with protein and ligand'''
    with open(output_file, 'w') as f:
        f.write('>protein|name=chain_A\n')
        f.write(f'{protein_seq}\n')
        f.write('>ligand|name=chain_B\n')
        f.write(f'{ligand_smiles}\n')
```
## Boltz-1 (Open-Source Complex Prediction)
### Installation
```bash
pip install boltz
```
### Basic Prediction
```python
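import subprocess
from pathlib import Path

def predict_boltz1(sequences, output_dir='boltz_output'):
    '''Run Boltz-1 structure prediction through the `boltz predict` command'''
    # The boltz package documents a CLI rather than a high-level Python class,
    # so this sketch writes a FASTA file and shells out; confirm flags with `boltz predict --help`
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    fasta_path = Path(output_dir) / 'input.fasta'
    with open(fasta_path, 'w') as f:
        for i, seq in enumerate(sequences):
            chain_id = chr(ord('A') + i)  # Headers take the form >CHAIN_ID|ENTITY_TYPE
            f.write(f'>{chain_id}|protein\n{seq}\n')
    subprocess.run([
        'boltz', 'predict', str(fasta_path),
        '--out_dir', str(output_dir),
        '--use_msa_server',         # Build MSAs remotely; drop if MSAs are supplied
        '--recycling_steps', '3',   # 3: Standard. Increase for difficult targets.
        '--sampling_steps', '200',  # 200: Standard. 500 for publication quality.
    ], check=True)
    return Path(output_dir)
```
### Boltz-1 for Complexes
```python
# Boltz-1 handles heteromeric complexes
def predict_complex_boltz(chain_sequences):
    '''Predict protein complex with Boltz-1'''
    # One FASTA record per chain; all chains are predicted together in a single run
    result_dir = predict_boltz1(
        chain_sequences,  # List of sequences for each chain
        output_dir='complex_output'
    )
    # Interface metrics should be in the confidence JSON files written next to the structures
    return result_dir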
```
## ColabFold (AlphaFold2 + MMseqs2)
### Command Line
```bash
# Install ColabFold (the alphafold extra pulls in the model dependencies)
pip install "colabfold[alphafold]"
# Run prediction
colabfold_batch input.fasta output_dir/
# With templates
colabfold_batch input.fasta output_dir/ --templates
# For complexes, join the chains with ':' in a single FASTA record:
# >complex
# SEQUENCE1:SEQUENCE2
```
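The colon-separated complex format described in the comments above is easy to get wrong by hand. A minimal sketch of a helper that builds such a record follows; the function name `colabfold_complex_record` is hypothetical and not part of ColabFold.

```python
def colabfold_complex_record(name, chain_sequences):
    '''Build one FASTA record with chains joined by ':' for complex prediction.'''
    # Remove internal whitespace and normalize case for every chain
    cleaned = [''.join(seq.split()).upper() for seq in chain_sequences]
    if not cleaned or any(not seq for seq in cleaned):
        raise ValueError('Every chain needs a non-empty sequence')
    return f'>{name}\n' + ':'.join(cleaned) + '\n'


record = colabfold_complex_record('complex', ['MVLSPADKTNV', 'MGHFTEEDKAT'])
print(record, end='')
```

Writing the returned string to `input.fasta` gives a file that can be passed to `colabfold_batch` as shown above.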
### Python API
```python
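from colabfold.batch import get_queries, run

def predict_colabfold(fasta_file, output_dir, use_templates=False):
    '''Run ColabFold prediction (model weights must already be downloaded)'''
    # get_queries parses the input and reports whether it describes a complex
    queries, is_complex = get_queries(fasta_file)
    # Keyword names follow colabfold.batch.run; confirm with help(run) for the installed version
    run(
        queries=queries,
        result_dir=output_dir,
        is_complex=is_complex,
        use_templates=use_templates,
        num_models=5,     # 5: Standard. Use 1 for quick predictions.
        num_recycles=3,   # 3: Standard. Increase for multimers.
        model_order=[1, 2, 3, 4, 5]
    )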
```
## Comparing Predictions
```python
from Bio.PDB import PDBParser, Superimposer
import numpy as np

def compare_predictions(pdb_files, labels=None):
    '''Compare multiple structure predictions by pairwise CA RMSD (labels is currently unused)'''
    # PDBParser reads PDB files only; convert mmCIF outputs (AlphaFold3, Chai-1) or use MMCIFParser
    parser = PDBParser(QUIET=True)
    structures = [parser.get_structure(f'model_{i}', f) for i, f in enumerate(pdb_files)]

    # Extract CA atoms from all chains of the first model
    def get_ca_atoms(struct):
        return [r['CA'] for r in struct[0].get_residues() if 'CA' in r]

    all_atoms = [get_ca_atoms(s) for s in structures]

    # Pairwise RMSD after superposition
    n = len(structures)
    rmsd_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i+1, n):
            # Truncation assumes both models list the same residues in the same order
            min_len = min(len(all_atoms[i]), len(all_atoms[j]))
            super_imposer = Superimposer()
            super_imposer.set_atoms(all_atoms[i][:min_len], all_atoms[j][:min_len])
            rmsd_matrix[i,j] = rmsd_matrix[j,i] = super_imposer.rms
    return rmsd_matrix

# Compare ESMFold vs AlphaFold3 vs Chai-1
rmsd = compare_predictions(['esmfold.pdb', 'af3.pdb', 'chai1.pdb'])
print('RMSD matrix:')
print(rmsd)
```
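One way to read the RMSD matrix is to find the model that sits closest to all the others. The sketch below picks that consensus (medoid) model; the helper name `consensus_model` is hypothetical, and agreement between methods is only a heuristic, not evidence that the consensus structure is correct.

```python
def consensus_model(rmsd_matrix, labels):
    '''Return (label, mean_rmsd_by_label) for the model closest on average to the rest.

    rmsd_matrix: symmetric matrix (nested lists or a NumPy array).
    labels: one name per row/column.
    '''
    n = len(labels)
    if len(rmsd_matrix) != n:
        raise ValueError('labels must match the matrix size')
    mean_rmsd = {}
    for i, label in enumerate(labels):
        # Average over off-diagonal entries only
        others = [float(rmsd_matrix[i][j]) for j in range(n) if j != i]
        mean_rmsd[label] = sum(others) / len(others) if others else 0.0
    best = min(mean_rmsd, key=mean_rmsd.get)
    return best, mean_rmsd


matrix = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
print(consensus_model(matrix, ['esmfold', 'af3', 'chai1']))
```

A model whose mean RMSD to the others is much larger than the rest is worth inspecting before it is used downstream.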
## When to Use Each Model
| Scenario | Recommended Model |
|----------|-------------------|
| Quick single-chain prediction | ESMFold (API) |
| Highest accuracy single chain | AlphaFold3 or ColabFold |
| Protein-protein complex | AlphaFold3, Chai-1, or Boltz-1 |
| Protein-ligand complex | AlphaFold3 or Chai-1 |
| No GPU available | ESMFold API or AlphaFold3 server |
| Large-scale screening | ESMFold (local) |
| Open-source requirement | Chai-1 or Boltz-1 |
## Memory Requirements
| Model | GPU Memory | Notes |
|-------|------------|-------|
| ESMFold | ~16 GB | Sequence length dependent |
| ColabFold | ~8-16 GB | Model size dependent |
| Chai-1 | ~24 GB | Complex size dependent |
| Boltz-1 | ~24 GB | Complex size dependent |
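The table above can be turned into a quick pre-flight check. This sketch hard-codes the approximate figures from the table (using the upper bound of 16 GB for ColabFold) and reports which local models plausibly fit on a given GPU; the helper name `models_that_fit` is hypothetical, and real usage grows with sequence or complex size.

```python
# Approximate requirements in GB, copied from the table above
GPU_MEMORY_GB = {
    'ESMFold': 16,
    'ColabFold': 16,  # Upper end of the ~8-16 GB range
    'Chai-1': 24,
    'Boltz-1': 24,
}


def models_that_fit(available_gb, requirements=GPU_MEMORY_GB):
    '''List models whose approximate requirement fits in available_gb.'''
    return sorted(name for name, need in requirements.items() if need <= available_gb)


print(models_that_fit(16))
print(models_that_fit(24))
```

With PyTorch installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` gives the total memory of the first GPU in GB as an input to this check.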
## Related Skills
- alphafold-predictions - Download pre-computed AlphaFold structures
- structure-io - Parse and write structure files
- geometric-analysis - RMSD, superimposition, distance calculations
- structure-navigation - Navigate predicted structure hierarchy