bio-codon-usage
$
npx mdskill add GPTomics/bioSkills/bio-codon-usageAnalyze codon usage and optimize expression sequences
- Solves problems of predicting protein expression levels from DNA data.
- Depends on Biopython libraries for sequence analysis tools.
- Decides actions by calculating Codon Adaptation Index metrics automatically.
- Delivers results through codon frequency tables and optimization recommendations.
SKILL.md
.github/skills/bio-codon-usageView on GitHub ↗
---
name: bio-codon-usage
description: Analyze codon usage, calculate CAI (Codon Adaptation Index), and examine synonymous codon bias using Biopython. Use when analyzing coding sequences for expression optimization or evolutionary analysis.
tool_type: python
primary_tool: Bio.SeqUtils.CodonUsage
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Codon Usage
Analyze codon usage patterns and calculate codon adaptation metrics using Biopython.
**"Analyze codon usage"** → Count codons in a coding sequence, compute frequencies and bias metrics.
- Python: `Counter` on 3-mers + `CodonAdaptationIndex` (BioPython)
**"Optimize codons for expression"** → Replace codons with host-preferred synonymous codons using a preference table.
- Python: custom mapping dict + `Seq()` (BioPython)
## Required Imports
```python
from Bio.Seq import Seq
from Bio.SeqUtils import GC123
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
from Bio.Data import CodonTable
from collections import Counter
```
## Basic Codon Counting
**Goal:** Tabulate codon frequencies in a coding sequence.
**Approach:** Split the sequence into triplets from the reading frame start, then count with `Counter`.
### Count Codons in Sequence
```python
from collections import Counter
def count_codons(seq):
seq_str = str(seq).upper()
codons = [seq_str[i:i+3] for i in range(0, len(seq_str) - 2, 3)]
return Counter(codons)
seq = Seq('ATGCGATCGATCGATCGTAA')
codon_counts = count_codons(seq)
```
### Codon Frequencies (Relative)
```python
def codon_frequencies(seq):
counts = count_codons(seq)
total = sum(counts.values())
return {codon: count / total for codon, count in counts.items()}
```
## Codon Adaptation Index (CAI)
**Goal:** Measure how well a gene's codon usage matches highly expressed genes in a target organism.
**Approach:** Train a CAI index from a reference set of highly expressed genes, then score query sequences (0-1 scale, higher = better adapted).
### Using CodonUsage Module
```python
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
# Create CAI calculator with reference set
cai = CodonAdaptationIndex()
# Generate index from highly expressed genes
cai.generate_index('highly_expressed_genes.fasta')
# Calculate CAI for a sequence
seq = Seq('ATGCGATCGATCGATCGTAA')
cai_value = cai.cai_for_gene(str(seq))
print(f'CAI: {cai_value:.3f}') # Range 0-1, higher = better adapted
```
### CAI with Custom Codon Index
```python
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
cai = CodonAdaptationIndex()
# Set custom index (relative adaptiveness for each codon)
custom_index = {
'TTT': 0.5, 'TTC': 1.0, # Phe
'TTA': 0.1, 'TTG': 0.5, 'CTT': 0.3, 'CTC': 1.0, 'CTA': 0.1, 'CTG': 1.0, # Leu
# ... define all 64 codons
}
cai.set_cai_index(custom_index)
```
## Synonymous Codon Usage
**Goal:** Quantify bias in synonymous codon usage to detect selection or mutational pressure.
**Approach:** Calculate RSCU — the ratio of observed to expected frequency assuming equal usage of synonymous codons. RSCU = 1.0 means no bias; >1 overused, <1 underused.
### RSCU (Relative Synonymous Codon Usage)
```python
from Bio.Data import CodonTable
def calculate_rscu(seq, table_id=1):
codon_table = CodonTable.unambiguous_dna_by_id[table_id]
counts = count_codons(seq)
# Group codons by amino acid
aa_to_codons = {}
for codon in counts:
if codon in codon_table.stop_codons:
continue
try:
aa = codon_table.forward_table[codon]
aa_to_codons.setdefault(aa, []).append(codon)
except KeyError:
continue
# Calculate RSCU for each codon
rscu = {}
for aa, codons in aa_to_codons.items():
total = sum(counts.get(c, 0) for c in codons)
n_synonymous = len(codons)
expected = total / n_synonymous if n_synonymous > 0 else 0
for codon in codons:
observed = counts.get(codon, 0)
rscu[codon] = observed / expected if expected > 0 else 0
return rscu
```
### Identify Rare Codons
```python
def find_rare_codons(seq, threshold=0.1):
freq = codon_frequencies(seq)
return {codon: f for codon, f in freq.items() if f < threshold}
```
### Codon Bias by Position (GC123)
```python
from Bio.SeqUtils import GC123
seq = Seq('ATGCGATCGATCGATCGATCGATCGATCGTAA')
gc_total, gc_pos1, gc_pos2, gc_pos3 = GC123(seq)
print(f'Total GC: {gc_total:.1f}%')
print(f'1st position GC: {gc_pos1:.1f}%')
print(f'2nd position GC: {gc_pos2:.1f}%')
print(f'3rd position GC: {gc_pos3:.1f}% (wobble position)')
```
## Codon Tables
### Access Codon Tables
```python
from Bio.Data import CodonTable
# Get standard table
std_table = CodonTable.unambiguous_dna_by_id[1]
# List all available tables
for id, table in CodonTable.unambiguous_dna_by_id.items():
print(f'{id}: {table.names[0]}')
```
### Common Codon Tables
| ID | Name | Organism |
|----|------|----------|
| 1 | Standard | Most organisms |
| 2 | Vertebrate Mitochondrial | Human, mouse mito |
| 4 | Mold Mitochondrial | Fungi, protozoa mito |
| 5 | Invertebrate Mitochondrial | Insects, worms mito |
| 11 | Bacterial/Plastid | E. coli, chloroplasts |
### Codon Table Properties
```python
table = CodonTable.unambiguous_dna_by_id[1]
print(f'Start codons: {table.start_codons}')
print(f'Stop codons: {table.stop_codons}')
# Forward table: codon -> amino acid
print(table.forward_table['ATG']) # 'M'
# Back table: amino acid -> list of codons
back_table = {}
for codon, aa in table.forward_table.items():
back_table.setdefault(aa, []).append(codon)
print(f'Leucine codons: {back_table["L"]}')
```
## Code Patterns
### Full Codon Usage Report
```python
def codon_usage_report(seq, table_id=1):
from Bio.Data import CodonTable
table = CodonTable.unambiguous_dna_by_id[table_id]
counts = count_codons(seq)
total = sum(counts.values())
# Group by amino acid
aa_groups = {}
for codon, aa in table.forward_table.items():
aa_groups.setdefault(aa, []).append(codon)
report = {}
for aa, codons in sorted(aa_groups.items()):
aa_total = sum(counts.get(c, 0) for c in codons)
report[aa] = {
'total': aa_total,
'codons': {c: {'count': counts.get(c, 0),
'freq': counts.get(c, 0) / aa_total if aa_total > 0 else 0}
for c in codons}
}
return report
```
### Compare Codon Usage Between Sequences
```python
def compare_codon_usage(seq1, seq2):
freq1 = codon_frequencies(seq1)
freq2 = codon_frequencies(seq2)
all_codons = set(freq1.keys()) | set(freq2.keys())
comparison = {}
for codon in sorted(all_codons):
f1, f2 = freq1.get(codon, 0), freq2.get(codon, 0)
comparison[codon] = {'seq1': f1, 'seq2': f2, 'diff': f1 - f2}
return comparison
```
### Optimize Codons for Expression
**Goal:** Replace codons with host-preferred synonymous alternatives to maximize protein expression.
**Approach:** Map each amino acid to its most-preferred codon in the target host, then reconstruct the DNA sequence.
```python
def optimize_codons(protein_seq, preferred_codons):
'''Replace codons with preferred synonymous codons'''
optimized = []
for aa in str(protein_seq):
if aa in preferred_codons:
optimized.append(preferred_codons[aa])
else:
optimized.append('NNN') # Unknown
return Seq(''.join(optimized))
# E. coli preferred codons
ecoli_preferred = {
'A': 'GCG', 'R': 'CGT', 'N': 'AAC', 'D': 'GAT', 'C': 'TGC',
'Q': 'CAG', 'E': 'GAA', 'G': 'GGT', 'H': 'CAC', 'I': 'ATT',
'L': 'CTG', 'K': 'AAA', 'M': 'ATG', 'F': 'TTC', 'P': 'CCG',
'S': 'TCT', 'T': 'ACC', 'W': 'TGG', 'Y': 'TAC', 'V': 'GTT',
}
```
### Codon Usage from FASTA File
```python
from Bio import SeqIO
def analyze_fasta_codon_usage(filename):
all_counts = Counter()
for record in SeqIO.parse(filename, 'fasta'):
all_counts.update(count_codons(record.seq))
total = sum(all_counts.values())
return {codon: count / total for codon, count in all_counts.items()}
```
### Effective Number of Codons (Nc)
A measure of codon bias (lower = more biased, range 20-61):
```python
import math
def effective_nc(seq, table_id=1):
from Bio.Data import CodonTable
table = CodonTable.unambiguous_dna_by_id[table_id]
counts = count_codons(seq)
# Group by degeneracy class
aa_groups = {}
for codon, aa in table.forward_table.items():
aa_groups.setdefault(aa, []).append(codon)
# Calculate F for each amino acid
nc_sum = 0
for aa, codons in aa_groups.items():
n = sum(counts.get(c, 0) for c in codons)
if n <= 1:
continue
pi_sq_sum = sum((counts.get(c, 0) / n) ** 2 for c in codons)
F = (n * pi_sq_sum - 1) / (n - 1)
nc_sum += 1 / F if F > 0 else len(codons)
return nc_sum if nc_sum > 0 else 61
```
## Property Reference
| Metric | Range | Interpretation |
|--------|-------|----------------|
| CAI | 0-1 | Higher = better adapted to host |
| RSCU | 0-N | 1.0 = no bias, >1 = overused, <1 = underused |
| Nc | 20-61 | Lower = more biased |
| GC3 | 0-100% | GC at wobble position |
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `KeyError` | Non-standard codon | Handle N-containing codons |
| Wrong counts | Sequence not in frame | Ensure length is multiple of 3 |
| No index set | Called CAI without training | Call `generate_index()` first |
## Decision Tree
```
Need to analyze codon usage?
├── Count codon frequencies?
│ └── Use Counter on 3-mers
├── Calculate adaptation to host?
│ └── Use CodonAdaptationIndex (CAI)
├── Identify synonymous bias?
│ └── Calculate RSCU
├── Check wobble position bias?
│ └── Use GC123()
├── Measure overall bias?
│ └── Calculate Nc (effective number of codons)
└── Optimize for expression?
└── Replace with preferred synonymous codons
```
## Related Skills
- transcription-translation - Translate sequences and understand codon tables
- sequence-properties - GC123 for wobble position GC content
- sequence-io/read-sequences - Parse CDS sequences from GenBank files
- database-access/entrez-fetch - Fetch reference gene sets from NCBI for CAI training
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.