biopython-alignment
$
npx mdskill add aipoch/medical-research-skills/biopython-alignmentAlign sequences, convert formats, and compute statistics with Biopython.
- Executes global and local alignments for protein or nucleotide data.
- Reads and writes MSA files in FASTA, Clustal, and Stockholm formats.
- Applies substitution matrices and gap penalties for accurate scoring.
- Calculates identity, coverage, and conservation metrics for analysis.
SKILL.md
.github/skills/biopython-alignmentView on GitHub ↗
---
name: biopython-alignment
description: Sequence alignment and alignment file processing with Biopython (Bio.Align/Bio.AlignIO), triggered when you need global/local pairwise alignment, MSA read/write/format conversion, or alignment statistics/filtering.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)
# biopython-alignment
## When to Use
- You need **global alignment** between two protein (or nucleotide) sequences and want a reproducible score and aligned strings.
- You need **local alignment** to find the best matching fragment/subsequence between two DNA/RNA/protein sequences.
- You need to **read, write, or convert** multiple sequence alignment (MSA) files (e.g., FASTA/Clustal/Stockholm) using Biopython I/O.
- You want to compute **alignment statistics** (e.g., identity, coverage, conservation per column) and filter alignments by thresholds.
- You need to apply **substitution matrices** (e.g., BLOSUM62) and tune gap penalties for biologically meaningful scoring.
## Key Features
- Pairwise alignment via `Bio.Align.PairwiseAligner` (global and local modes).
- Alignment scoring with configurable match/mismatch and gap penalties.
- Protein substitution matrices via `Bio.Align.substitution_matrices` (e.g., BLOSUM/PAM).
- MSA parsing and serialization via `Bio.AlignIO` (read/write/format conversion).
- Basic alignment statistics: identity, aligned length, coverage, and MSA column conservation.
## Dependencies
- `biopython>=1.81`
- `numpy>=1.21`
## Example Usage
```python
# -*- coding: utf-8 -*-
"""
Runnable examples for:
1) Global protein alignment
2) Local DNA alignment (best fragment)
3) MSA parsing + column conservation
Requires: biopython, numpy
"""
from __future__ import annotations
from io import StringIO
import numpy as np
from Bio.Align import PairwiseAligner
from Bio.Align import substitution_matrices
from Bio import AlignIO
def global_protein_alignment(seq_a: str, seq_b: str) -> None:
matrix = substitution_matrices.load("BLOSUM62")
aligner = PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = matrix
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5
alignments = aligner.align(seq_a, seq_b)
best = alignments[0]
print("=== Global protein alignment (best) ===")
print("Score:", best.score)
print(best)
def local_dna_alignment_best_fragment(seq_a: str, seq_b: str) -> None:
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2.0
aligner.mismatch_score = -1.0
aligner.open_gap_score = -2.0
aligner.extend_gap_score = -0.5
best = aligner.align(seq_a, seq_b)[0]
# Extract the aligned fragment coordinates from the first aligned block.
# aligned is a tuple: (aligned_coords_in_seq_a, aligned_coords_in_seq_b)
a_blocks, b_blocks = best.aligned
a_start, a_end = a_blocks[0]
b_start, b_end = b_blocks[0]
print("=== Local DNA alignment (best) ===")
print("Score:", best.score)
print(best)
print("Best fragment in seq_a:", seq_a[a_start:a_end], f"(coords {a_start}:{a_end})")
print("Best fragment in seq_b:", seq_b[b_start:b_end], f"(coords {b_start}:{b_end})")
def msa_column_conservation(fasta_text: str) -> None:
handle = StringIO(fasta_text)
msa = AlignIO.read(handle, "fasta") # MultipleSeqAlignment
# Convert to a 2D array of characters: shape (n_seqs, aln_len)
arr = np.array([list(str(rec.seq)) for rec in msa], dtype="U1")
n_seqs, aln_len = arr.shape
# Conservation per column: fraction of the most common non-gap character.
# Treat '-' as gap; ignore gaps when computing the most common residue.
conservation = []
for j in range(aln_len):
col = arr[:, j]
col = col[col != "-"]
if col.size == 0:
conservation.append(0.0)
continue
values, counts = np.unique(col, return_counts=True)
conservation.append(float(counts.max() / counts.sum()))
print("=== MSA column conservation ===")
print("n_seqs:", n_seqs, "aln_len:", aln_len)
print("conservation:", [round(x, 3) for x in conservation])
def main() -> None:
# 1) Global alignment (protein)
seq_a = "MKTAYIAKQRQISFVKSHFSRQDILD"
seq_b = "MKLAYIAKQRQISFVKSHFTRQDILN"
global_protein_alignment(seq_a, seq_b)
# 2) Local alignment (DNA)
seq_a = "ATGCGTACGTTAGC"
seq_b = "GGGATGCGTACGAAAC"
local_dna_alignment_best_fragment(seq_a, seq_b)
# 3) MSA conservation (FASTA)
fasta_text = ">s1\nACGTACGT\n>s2\nACGTTCGT\n>s3\nACGTACGA\n"
msa_column_conservation(fasta_text)
if __name__ == "__main__":
main()
```
## Implementation Details
- **Pairwise alignment engine**: uses `Bio.Align.PairwiseAligner`, which performs dynamic programming alignment under the selected mode:
- `mode="global"`: aligns full-length sequences end-to-end.
- `mode="local"`: finds the highest-scoring matching region (best subsequence pair).
- **Scoring configuration**:
- For proteins, prefer `substitution_matrix` (e.g., `BLOSUM62`) plus gap penalties (`open_gap_score`, `extend_gap_score`).
- For nucleotides, a simple scheme is common: `match_score`, `mismatch_score`, and gap penalties.
- **Selecting the best alignment**: `aligner.align(a, b)` returns an iterable of alignments sorted by score; use `[0]` for the top-scoring result.
- **Local “best fragment” extraction**:
- `alignment.aligned` returns aligned coordinate blocks for each sequence.
- The first block `(start, end)` typically corresponds to the highest-scoring contiguous aligned region; slice the original sequences with these coordinates to obtain the fragment.
- **MSA I/O and statistics**:
- `Bio.AlignIO.read(handle, fmt)` parses an alignment into a `MultipleSeqAlignment`.
- Column conservation can be computed as:
`max_count(non-gap residues in column) / total_non_gap_count(column)`.
- **Operational conventions (recommended)**:
- Store runtime configuration in `config/task_config.json` and invoke scripts as `python scripts/<task_name>.py`.
- Avoid stacking many CLI `--` parameters; keep parameters in the config file.
- Always specify `encoding="utf-8"` for file I/O; for JSON output use `ensure_ascii=False`.More from aipoch/medical-research-skills
- 3d-molecule-ray-tracerGenerate photorealistic rendering scripts for PyMOL and UCSF ChimeraX.
- abstract-summarizerTransform lengthy academic papers into concise, structured 250-word abstracts.
- abstract-trimmerPrecision editing tool that reduces abstract word count through intelligent compression techniques, maintaining scientific rigor while meeting strict journal and conference requirements.
- academic-abstract-refinerRefines long medical academic texts into SCI-style unstructured Chinese and English abstracts; use when you need to condense drafts/reports/summaries into bilingual abstracts and generate Summary_Report.md.
- academic-cv-generatorGenerate structured academic CVs from free-form Chinese/English text and export to Word (.docx). Use this skill when you are asked to organize, generate, or optimize an academic CV (e.g., publications/projects/awards) into a consistent, formatted document with uniform-colored section headers and optional bilingual output.
- academic-highlight-generatorGenerates submission-ready Elsevier/SCI Highlights from manuscript text or extracted PDF/DOCX/TXT content. Use when a user needs 3-5 concise, evidence-grounded highlight bullets for a research paper, review, meta-analysis, case report, or bioinformatics manuscript.
- academic-norm-reviewDetects content similarity, verifies standardized citations and abbreviations, and flags potential academic integrity risks; use it before submission, during academic writing QA, or for compliance reviews.
- academic-poster-generatorComplete workflow for generating academic research posters from PDF literature; use when you need to extract paper content from PDFs and produce a LaTeX-based poster (beamerposter/tikzposter/baposter) with mandatory figure generation and a final rendered HTML deliverable.
- acronym-unpackerIntelligent medical abbreviation disambiguation tool that resolves ambiguous acronyms using clinical context, specialty-specific knowledge, and document-level semantic analysis.
- active-comparator-single-soc-faers-safety-comparisonGenerates complete FAERS pharmacovigilance study designs for multi-drug or class-level safety comparison inside one predefined SOC or AE family using active comparators, disproportionality analysis, subgroup characterization, and reviewer-facing evidence control.