bio-alignment-structural
$
npx mdskill add GPTomics/bioSkills/bio-alignment-structuralSuperpose protein structures when sequence data fails.
- Enables remote homology detection below 25% identity.
- Integrates Foldseek, TM-align, US-align, DALI, and Foldmason.
- Selects optimal alignment method based on input structure type.
- Outputs backbone coordinates and fold-similarity scores.
SKILL.md
.github/skills/bio-alignment-structuralView on GitHub ↗
---
name: bio-alignment-structural
description: Align protein structures using Foldseek 3Di, TM-align, US-align, DALI, or Foldmason for structural MSA. Predict, score, and superpose backbone coordinates when sequence identity is below the twilight zone or remote-homology detection is required. Use when sequence MSA fails (<25% identity), when the dark proteome is the target, when AlphaFoldDB / ESM Atlas search is needed, or when structural superposition is the goal.
tool_type: mixed
primary_tool: Foldseek
---
## Version Compatibility
Reference examples tested with: Foldseek 8+, TM-align 20220412+, US-align 20231222+, Foldmason 1+, BioPython 1.83+, pymol-open-source 3.0+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `foldseek --version`, `TMalign`, `USalign`, `foldmason --version`
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
# Structural Alignment
**"Align two protein structures"** → Compute backbone-aware superposition and a fold-similarity score (TM-score, RMSD, or LDDT).
- CLI pairwise: `TMalign A.pdb B.pdb`, `USalign A.pdb B.pdb`
- CLI search at scale: `foldseek easy-search query/ AFDB result.m8 tmp/`
- CLI structural MSA: `foldmason easy-msa structures/*.pdb out tmp/`
- Python pairwise: `Bio.PDB.Superimposer`, or `subprocess` wrapping `TMalign` / `USalign` (see `examples/tm_align_pairwise.py`)
- GUI / scripted molecular-graphics superposition: ChimeraX `matchmaker`, PyMOL `super`/`cealign`
**"Find structural homologs of an AlphaFold model"** → Search a structure database by 3Di-encoded structural alphabet (Foldseek) or by full TM-align rotation (DALI, US-align).
## When to Use Structural Alignment
| Sequence identity | Recommended approach |
|-------------------|---------------------|
| >= 40% | Sequence DP (Bio.Align, BLASTP) is sufficient |
| 25-40% | Sensitive sequence (MMseqs2, jackhmmer, profile-profile HHsearch) |
| 15-25% | Profile-profile (HHsearch) OR Foldseek if structures available |
| < 15% (dark proteome / twilight zone) | Foldseek (3Di), TM-align, US-align, pLM aligners |
Sequence alignment below 15% identity is statistically indistinguishable from random pairings. The exact twilight-zone cutoff is length-dependent: Rost 1999 (Prot Eng) showed the curve drops to 25% at length 80, 20% at length 250 -- short alignments need higher identity for the same statistical signal, so a 15-25% rule of thumb is shorthand for "twilight zone for proteins of typical domain size (~150-300 residues)". If reasonable structural models exist (PDB, AlphaFoldDB, ESMFold), structural alignment is far more reliable in this regime.
### Twilight-Zone Threshold Exceptions
For families with strong functional or structural constraint (ribosomal proteins, class I aminoacyl-tRNA synthetases with HIGH/KMSKS motifs, histone fold, cytochrome c with CXXCH, or any long alignment with well-distributed conservation), sequence MSA may remain reliable down to ~12-15% identity. Verify with a profile-profile method (HHsearch) before committing to structural alignment.
## Pairwise Structural Alignment Tool Selection
| Tool | Reference | Score | Best for |
|------|-----------|-------|----------|
| TM-align | Zhang & Skolnick 2005 NAR | TM-score | Single-chain pairwise; standard fold-similarity benchmark |
| US-align | Zhang et al 2022 Nat Methods | TM-score | Multi-chain protein, RNA, DNA, complexes; original successor to TM-align |
| Foldseek-Multimer | Kim et al 2025 Nat Methods | TM-score | Database-scale multi-chain complex search; 10-100x faster than US-align in pairwise mode, 1000-10000x in DB search; preferred at AFDB-multimer scale |
| FATCAT | Ye & Godzik 2003 Bioinf | FATCAT score | Flexible alignment with allowed twists/breaks |
| CE | Shindyalov & Bourne 1998 Prot Eng | CE score | Combinatorial extension of fragments; PDB legacy |
| DALI | Holm 2022 NAR | Z-score | Distance-matrix alignment; superior at TM 0.3-0.5 (twilight fold); operates on AFDB at scale |
| Bio.PDB.Superimposer | Cock et al 2009 Bioinf | RMSD | Pure-Python superposition with known atom correspondence |
TM-align and US-align report two TM-scores by default: one normalised by chain 1 length and one by chain 2 length. The standard fold-similarity metric for asymmetric reporting is the larger of the two (effectively normalising by the shorter chain), since TM-score is length-asymmetric. The `-a T` flag adds a third score normalised by the average of the two chain lengths (useful for symmetric clustering). **TM > 0.5** indicates the same fold; **TM > 0.8** indicates equivalent topology / close structural relationship; **TM < 0.2** indicates random structural similarity. RMSD alone is misleading: RMSD scales with length, depends on outliers, and an "optimization RMSD" (used to fit) is not the same as an "evaluation RMSD" (used to compare). Always report TM-score alongside RMSD and the alignment length.
| Metric | Threshold | Source / Interpretation |
|--------|-----------|-------------------------|
| TM-score | > 0.5 | Same fold (Zhang & Skolnick 2004 Proteins; reported by TM-align, US-align, Foldseek, DALI) |
| TM-score | > 0.8 | Equivalent topology (homologous) |
| TM-score | < 0.2 | Random structural similarity |
| DALI Z-score | > 20 | Definitely homologous (Holm 2020 Prot Sci) |
| DALI Z-score | 8 - 19 | Probable homology |
| DALI Z-score | 2 - 8 | Candidate; verify with TM-score or biology |
| DALI Z-score | < 2 | Not significant (random) |
| GDT-TS | > 50 | Correct fold (CASP/LGA scoring; reported by LGA, MaxCluster, OpenStructure -- not by TM-align/Foldseek) |
| LDDT | > 0.6 | Conventional cutoff for "correctly modelled" residue (Mariani et al 2013 Bioinf shows the score is fold-architecture-dependent and does not define a universal hard threshold; >= 0.6 is widely used in CAMEO and Foldseek output as a working cutoff) |
| RMSD | < 2 A over >100 residues | Strong superposition; below 1.5 A is excellent |
### Foldseek vs DALI: Modern Comparison
Both target structural homolog search, but with different strengths:
| Question | Foldseek | DALI |
|----------|----------|------|
| Find any same-fold homolog quickly | YES (3Di indexed; 1000-1M seq/s) | Slow (full distance matrix) |
| Sensitivity at TM > 0.5 (same-fold) | Comparable to TM-align | Slightly higher recall |
| Sensitivity at TM 0.3-0.5 (twilight fold) | Recall drops sharply | DALI Z-score retains signal (Holm 2022) |
| Alignment quality for indel-rich pairs | Local 3Di+AA; can miss large indels | Distance-matrix optimal handles indels well |
| AFDB-scale all-vs-all | Tractable | Now tractable post-2022 (Holm DALI server update) |
Practical workflow for remote homology: (1) Foldseek `easy-search` to retrieve same-fold candidates fast, (2) for hits with TM < 0.5 or with structural-indel rich alignments, re-align with DALI for higher-quality residue equivalences. The two tools are complementary; Foldseek-only misses some twilight-fold relationships, DALI-only misses the AFDB-scale opportunity.
### TM-score Threshold Caveats
Apply the 0.5 same-fold rule only to globular domains of ~100-300 residues; for chains <60 residues TM > 0.5 occurs by chance ~3-5%, so use Xu & Zhang 2010's length-aware Gumbel p-value instead. Always report both length-normalised TM-scores (or use `-a T` for the symmetrised average) and pair RMSD with the atom count (`(RMSD, n_atoms_used)`); raw RMSD without length normalisation is misleading.
### TM-align Pairwise Run
**Goal:** Compute TM-score and RMSD between two structures and write a superposed PDB.
**Approach:** Invoke TMalign or USalign with output flags; parse the structured output for downstream filtering.
```bash
TMalign chainA.pdb chainB.pdb -o superposed.sup
TMalign chainA.pdb chainB.pdb -outfmt 2
USalign chainA.pdb chainB.pdb -mol prot -outfmt 2
USalign complex_A.pdb complex_B.pdb -mm 1 -ter 0
```
`-outfmt 2` returns tabular output (one line per pair). For multi-chain complexes, US-align with `-mm 1 -ter 0` aligns full assemblies; TM-align is single-chain only.
**US-align multi-chain score interpretation.** US-align `-mm 1 -ter 0` produces multiple TM-scores per complex: normalised by query length, normalised by template length, and (for symmetric homo-oligomers) the best-matching chain permutation. For different stoichiometries (query hetero-A2B2 vs template hetero-A4B4), use `-mm 4` and `-byresi 0`. The "best-of" TM-score is the appropriate cross-complex homology metric; per-chain TM-scores serve as a sanity check (low per-chain + high complex TM = topology match without local fold conservation, suggests promiscuous interaction not deep homology). See Zhang et al 2022 Nat Methods for the full mode taxonomy.
### Foldseek-Multimer for Database-Scale Complex Search
For multi-chain complex search at AFDB-multimer scale (~214k entries) or against PDB complexes, US-align is too slow; Foldseek-Multimer (Kim et al 2025 Nat Methods) is the modern default. Two modes share the chain-pairing prefilter:
| Mode | Algorithm | Use when |
|------|-----------|----------|
| `Foldseek-MM` (default) | 3Di+AA Gotoh per chain pair | Fast database search; default speed-sensitivity tradeoff |
| `Foldseek-MM-TM` | TM-align per chain pair after prefilter | Top-hit refinement with full TM-score |
```bash
foldseek easy-multimersearch query_complex.pdb afdb_multimer result tmp/
foldseek easy-multimersearch query_complex.pdb afdb_multimer result tmp/ --multimer-tm-threshold 0.5
foldseek easy-multimercluster *.pdb cluster_result tmp/ --multimer-tm-threshold 0.65
```
Decision guide: pairwise complex pair on a few hundred targets -> US-align with `-mm 1 -ter 0`. Database search across thousands to millions of complex entries -> Foldseek-Multimer. For AFDB-Multimer or PDB100-Multimer scale, US-align is computationally infeasible; Foldseek-Multimer matches US-align chain-pairing in >99% of cases at 10-100x speedup pairwise and 10^3-10^4x in database mode (Kim et al 2025 benchmark).
**Reporting convention:** Always quote BOTH the multimer TM-score (whole-complex) AND the worst per-chain TM-score; high multimer TM with one low per-chain score signals topology-matched but locally divergent chains (often promiscuous binders or paralog swaps), which is biologically distinct from a uniformly-conserved complex.
### Bio.PDB.Superimposer
**Goal:** Superpose a known atom correspondence and compute RMSD without running an external aligner.
**Approach:** Use when residue equivalence is already established (e.g. same sequence, different conformations). For unknown correspondence, prefer TM-align / US-align.
```python
from Bio.PDB import PDBParser, Superimposer
parser = PDBParser(QUIET=True)
mobile_atoms = list(parser.get_structure('mobile', 'mobile.pdb').get_atoms())
reference_atoms = list(parser.get_structure('ref', 'reference.pdb').get_atoms())
ca_mobile = [a for a in mobile_atoms if a.get_id() == 'CA']
ca_reference = [a for a in reference_atoms if a.get_id() == 'CA']
n = min(len(ca_mobile), len(ca_reference))
sup = Superimposer()
sup.set_atoms(ca_reference[:n], ca_mobile[:n])
sup.apply(mobile_atoms)
print(f'RMSD: {sup.rms:.3f} A over {n} CA atoms')
```
## Structural Search at Scale: Foldseek
Run Foldseek for AlphaFoldDB-scale structural search; it indexes a 20-letter 3Di alphabet for thousand- to million-fold speedup over TM-align at comparable same-fold sensitivity.
```bash
# Search query structures against AFDB (default: --alignment-type 2)
foldseek easy-search query.pdb afdb_database result.m8 tmp/
# Refine top hits with full TM-align rotation (slower but global TM-score)
foldseek easy-search query.pdb afdb_database result.m8 tmp/ --alignment-type 1
# All-versus-all clustering at TM > 0.5
foldseek easy-cluster structures/*.pdb cluster_result tmp/ --tmscore-threshold 0.5
# Custom output columns
foldseek easy-search query.pdb afdb_database result.m8 tmp/ \
--format-output query,target,evalue,alntmscore,qtmscore,ttmscore,lddt,bits
```
| Foldseek `--alignment-type` | Algorithm | Use when |
|-----------------------------|-----------|----------|
| 0 | 3Di Gotoh-Smith-Waterman (local) | Not recommended; 3Di alone, no amino-acid signal |
| 1 | TMalign (global) | Refine top hits with full TM-score; slow |
| 2 | 3Di+AA Gotoh-Smith-Waterman (local) | Default; best speed-sensitivity tradeoff |
### pLDDT-Filtering AlphaFold Structures Before Foldseek
Mask residues with pLDDT < 70 before Foldseek indexing or search; their backbone coordinates encode as random 3Di letters and contaminate hits below TM ~ 0.4. AFDB clusters from Barrio-Hernandez et al 2023 are pre-filtered; ESMFold predictions need the same step.
```bash
# Mask low-pLDDT residues during database creation (B-factor column stores pLDDT)
# The *.pdb glob requires shell expansion; for Python subprocess calls use glob.glob()
# to expand the file list before passing as argv.
foldseek createdb --mask-bfactor-threshold 70.0 *.pdb afdb_masked
```
## Structural Multiple Sequence Alignment
| Tool | Reference | Best for |
|------|-----------|----------|
| Foldmason | Gilchrist et al 2024 (preprint) | Billion-protein-scale structural MSA on AFDB; the structural counterpart to MAFFT |
| 3D-Coffee / Expresso | Notredame group | <100 chains with mixed PDB and sequence input |
| mTM-align | Dong et al 2018 Bioinf | Multiple structure alignment; output suitable for tree building. Server: `yanglab.qd.sdu.edu.cn/mTM-align/` (Yang Lab moved from Nankai to Shandong University in 2021; verify URL before use) |
| PROMALS3D | Pei, Kim & Grishin 2008 NAR | Hybrid sequence-structure profile MSA; uses PSI-BLAST + DALI/SAP templates |
| MUSTANG | Konagurthu et al 2006 Proteins | Pure structural MSA via residue-residue equivalences |
### Foldmason easy-msa
```bash
foldmason easy-msa structures/*.pdb result tmp/ \
--refine-iters 100 \
--report-mode 1
# Outputs: result_aa.fa (amino-acid MSA), result_3di.fa (3Di MSA), result.nw (guide tree), result.html (LDDT report)
```
`--refine-iters` controls iterative MSA refinement; `--report-mode 1` produces an HTML report with per-column LDDT confidence. The 3Di MSA can be used directly for evolutionary analyses where structure rather than sequence is the appropriate signal.
**Per-column LDDT extraction:** Run with `--report-mode 2` to produce machine-readable JSON output:
```bash
foldmason easy-msa structures/*.pdb result tmp/ --report-mode 2
# Produces result.json with per-column LDDT in machine-readable form
```
```python
import json
with open('result.json') as f:
report = json.load(f)
lddt_per_column = report.get('per_column_lddt')
```
Verify the exact JSON schema with `foldmason easy-msa --help` and inspect a sample `result.json`; the field name may evolve across releases.
## Hybrid Sequence-Structure Approaches
When a small dataset has known structures or templates, hybrid tools dramatically improve MSA accuracy:
- **T-Coffee Expresso** (Notredame et al) -- PSI-BLAST searches PDB for templates, runs SAP or TM-align pairwise structural library, then T-Coffee consistency over the resulting library. Reported in Notredame-group benchmarks to substantially improve correct-column rate over sequence-only T-Coffee. Requires internet access for PSI-BLAST and PDB template fetching; for offline / reproducible runs use 3D-Coffee with locally curated templates.
- **3D-Coffee** -- user-supplied PDB templates with `-template_file templates.txt`; uses SAP or TM-align as the structural method.
- **PROMALS3D** (Pei, Kim & Grishin 2008 NAR) -- PSI-BLAST profiles + secondary-structure prediction + DALI/SAP templates; used heavily in the Grishin lab's superfamily annotations. PROMALS3D server availability has been intermittent; if unreachable, T-Coffee Expresso (mode `expresso`) is a current alternative for structure-informed sequence MSA.
```bash
t_coffee input.fasta -mode expresso -output fasta_aln -outfile aligned.fasta
t_coffee input.fasta -template_file templates.txt -mode 3dcoffee
```
## AlphaFold Integration
The "predict-then-align" workflow has become standard for remote-homology problems:
1. Search sequence with MMseqs2 / jackhmmer to seed an MSA.
2. Predict structure with AlphaFold2 (ColabFold), ESMFold, or AlphaFold3 -- see `structural-biology/modern-structure-prediction` for prediction workflows.
3. Search the predicted structure against AFDB or PDB with Foldseek.
4. Use Foldmason or PROMALS3D to derive a structure-aware MSA from the hits.
5. Refine sequence MSA using the structure-derived column equivalences.
Search AFDB clusters (~2.3 M Foldseek-derived clusters from Barrio-Hernandez et al 2023) as the canonical remote-homology entry point; supplement with the ESM Atlas (~600 M ESMFold metagenomic structures) when AFDB recall is insufficient. For pLDDT semantics and curated AlphaFoldDB entry handling, see `structural-biology/alphafold-predictions`.
## pLM-Based Sequence Aligners
Run a pLM aligner (TM-Vec, vcMSA, DEDAL, pLM-BLAST) when no structure is available but identity is in the twilight zone (<15-25%). They recover signal that DP aligners miss but do NOT replace TM-align / US-align when structures exist; use them as a complementary first pass.
| Tool | Reference | Embedding source |
|------|-----------|------------------|
| vcMSA | McWhite, Armour-Garb & Singh 2023 Genome Res | ProtT5 vector clustering for MSA (alpha-stage tool per its repo README; last release Oct 2023; test on representative inputs before pipeline use) |
| DEDAL | Llinares-Lopez et al 2023 Bioinf | Differentiable end-to-end alignment with pLM features |
| TM-Vec | Hamamsy et al 2024 Nat Biotech | TM-score prediction from ProtT5 embeddings (active fork: `valentynbez/tmvec`; original `tymor22/tm-vec` is in limited maintenance) |
| pLM-BLAST | Kaminski et al 2023 Bioinf | BLAST-style hits via pLM cosine similarity |
These run on raw sequence (no structure required) and recover signal at <15% identity that DP aligners miss entirely. They do NOT replace structural alignment when structures exist; treat them as complementary.
**pLM aligner accuracy ceiling.** TM-Vec, vcMSA, and DEDAL predict alignment-equivalent quantities (TM-score, alignment columns) from sequence-only embeddings. Hamamsy et al 2024 report TM-Vec TM-score prediction RMSE ~0.07 on SCOP -- sufficient for coarse filtering ("is this hit at all structurally similar?") but NOT for fine-grained ranking (e.g. choosing among hits with TM 0.5-0.7). For final structural-similarity scoring, run TM-align or US-align on predicted structures (ColabFold + USalign) rather than relying on the pLM-predicted score.
## Visualisation and Inspection
Structural alignments are read by viewing the superposed structures, not the sequence text:
```bash
# PyMOL headless: -cq runs without GUI/quietly; -d passes commands directly
pymol -cq -d "load reference.pdb; load mobile.pdb; super mobile, reference; ray 800,600; png fig.png"
# ChimeraX (modern, replaces Chimera): MatchMaker uses Needleman-Wunsch + iterative refinement
ChimeraX --cmd "open reference.pdb mobile.pdb; matchmaker #2 to #1; save fig.png"
```
PyMOL `super` performs cycle-fitting for distantly related structures (better than `align` when sequence identity is low); `cealign` runs CE; `tmalign` (PyMOL plugin) wraps TM-align. ChimeraX MatchMaker is the modern replacement and uses Needleman-Wunsch + iterative refinement with the option to invoke external tools.
## Decision Tree by Goal
| Goal | First-line tool |
|------|-----------------|
| Two structures, known correspondence | `Bio.PDB.Superimposer` |
| Two structures, unknown correspondence | TMalign or USalign |
| Multi-chain complex, pairwise | USalign with `-mm 1 -ter 0` |
| Multi-chain complex, database search | `foldseek easy-multimersearch` (Foldseek-Multimer) |
| Twilight-fold homology (TM 0.3-0.5) | DALI (Z-score ranks low-similarity hits better than Foldseek) |
| Structural homolog search at AFDB scale (single chain) | `foldseek easy-search` |
| All-vs-all clustering of structures | `foldseek easy-cluster --tmscore-threshold 0.5` |
| Multiple structure alignment, < 100 chains | T-Coffee Expresso or mTM-align |
| Multiple structure alignment, > 1000 chains | Foldmason `easy-msa` |
| Hybrid sequence-structure MSA | PROMALS3D or T-Coffee Expresso |
| pLM aligner for dark proteome | TM-Vec, vcMSA, DEDAL |
| Distant homology with no structures | Predict with ColabFold/ESMFold first, then Foldseek |
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| TM-align "atoms not enough" | Single-residue chains or ligand-only PDB | Filter to canonical amino acids before running |
| Foldseek "no hits" | Wrong database format | Run `foldseek databases` to confirm AFDB / PDB100 download |
| Bio.PDB Superimposer "different number of atoms" | Atom selection mismatch | Filter both lists to common atom names (CA only, or N/CA/C/O) |
| TM-score normalised by length 1 | Used `-L 1` | Drop `-L` flag; default normalisation is correct |
| Foldmason output empty | Input structures not in same chain ID | Renumber and rename chains uniformly before running |
## Related Skills
- alignment/multiple-alignment - Sequence MSA when identity > 25%; complementary for hybrid tools
- alignment/pairwise-alignment - Sequence pairwise; use first to filter before structural alignment
- alignment/msa-parsing - Parse and analyze structural MSA output for downstream metrics
- alignment/msa-statistics - Apply per-column conservation to structurally-derived MSAs
- alignment/alignment-io - Read and convert structural MSA files for downstream tools
- alignment/alignment-trimming - Trim structural MSAs the same way as sequence MSAs
- structural-biology/modern-structure-prediction - Predict structures (AlphaFold2/3, ESMFold) used as input
- structural-biology/alphafold-predictions - Curated AFDB / pLDDT handling
- structural-biology/structure-navigation - Map alignment columns to PDB residues
- phylogenetics/modern-tree-inference - Trees from structural MSAs (Foldmason output works directly)
## References
- van Kempen M et al. 2024. Fast and accurate protein structure search with Foldseek. Nat Biotech 42:243-246.
- Kim W, Mirdita M, Levy Karin E, Gilchrist CLM, Schweke H, Soding J, Levy E, Steinegger M. 2025. Rapid and sensitive protein complex alignment with Foldseek-Multimer. Nat Methods (in press; Foldseek-Multimer release).
- Zhang Y, Skolnick J. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. NAR 33:2302-2309.
- Zhang C, Shine M, Pyle AM, Zhang Y. 2022. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 19:1109-1115.
- Holm L. 2020. Using Dali for protein structure comparison. Methods Mol Biol 2112:29-42 (Z-score interpretation thresholds).
- Holm L. 2022. Dali server: structural unification of protein families. NAR 50:W210-W215.
- Gilchrist CLM et al. 2024. Foldmason: multiple protein structure alignment at scale with 3Di. Preprint.
- Barrio-Hernandez I et al. 2023. Clustering predicted structures at the scale of the known protein universe. Nature 622:637-645.
- Hamamsy T et al. 2024. Protein remote homology detection and structural alignment using deep learning. Nat Biotech 42:975-985.
- Xu J, Zhang Y. 2010. How significant is a protein structure similarity with TM-score = 0.5? Bioinf 26:889-895.
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.