bio-comparative-genomics-synteny-analysis
$
npx mdskill add GPTomics/bioSkills/bio-comparative-genomics-synteny-analysisDetects conserved gene order and chromosomal rearrangements between species
- Identifies syntenic blocks to compare genome structure across different organisms.
- Relies on MCScanX, SyRI, JCVI tools for collinearity detection algorithms.
- Validates assembly quality by enforcing minimum N50 thresholds before analysis.
- Outputs visual dot plots and macro/micro synteny maps via Python APIs.
SKILL.md
.github/skills/bio-comparative-genomics-synteny-analysisView on GitHub ↗
---
name: bio-comparative-genomics-synteny-analysis
description: Analyze genome collinearity and syntenic blocks using MCScanX, SyRI, and JCVI for comparative genomics. Detect conserved gene order, chromosomal rearrangements, and whole-genome duplications. Use when comparing genome structure between species or identifying conserved genomic regions.
tool_type: mixed
primary_tool: MCScanX
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, JCVI 1.3+, PAML 4.10+, matplotlib 3.8+, minimap2 2.26+, numpy 1.26+, pandas 2.2+, scipy 1.12+, SyRI 1.6+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Synteny Analysis
**"Compare genome structure between my species"** → Detect conserved gene order (syntenic blocks), chromosomal rearrangements, and whole-genome duplications by aligning genomic collinearity.
- CLI: `MCScanX` for collinear block detection from BLAST results
- Python: `jcvi.compara.synteny` for synteny visualization (dot plots, macro/micro)
## Assembly Quality Requirements
Synteny analysis is highly sensitive to assembly fragmentation. Key thresholds:
- **Minimum N50 of 1 Mb** for robust synteny detection; below this, results are tool-dependent and unreliable
- Fragmented assemblies can **underestimate synteny by up to 40%** (different tools disagree extensively)
- Gene-dense genomes tolerate more fragmentation than gene-sparse ones
- Chromosome-level assemblies are strongly preferred for detecting inversions and translocations
- **Reference-guided scaffolding creates false synteny** -- scaffolding against a related species propagates gene order assumptions, producing circular reasoning in synteny comparisons
Always verify assembly contiguity (N50, BUSCO completeness) before interpreting synteny results. Report assembly quality alongside synteny findings.
## Macrosynteny vs Microsynteny
- **Macrosynteny**: Chromosome-level conservation of gene content (same chromosome, not necessarily same order). Detectable across hundreds of millions of years but decays via accumulation of rearrangements.
- **Microsynteny**: Local conservation of gene order at sub-chromosomal level (few to dozens of genes). Can be deeply conserved, especially for metabolic gene clusters. Used in synteny network analysis for phylogenetic inference.
MCScanX and JCVI detect both scales; SyRI focuses on structural rearrangements at the macrosynteny level.
## Tool Selection
| Tool | Best For | Key Characteristic |
|---|---|---|
| MCScanX | General synteny, WGD detection, downstream analyses (14 tools) | Dynamic programming on BLAST hits; most widely used |
| JCVI/MCScan (Python) | Visualization, publication figures, multi-genome | Superior plotting; `--cscore 0.99` for reciprocal best hits |
| i-ADHoRe 3.0 | Ancient WGD, highly diverged species | Ordered gene lists (no sequence needed); iterative detection |
| AnchorWave | Plant genomes with WGD, polyploidy | CDS/exon anchors; WGD-aware alignment for known ploidy levels |
| SyRI | Structural rearrangements between assemblies | Inversions, translocations, duplications from whole-genome alignment |
| ntSynt | Multi-genome macrosynteny at scale | Alignment-free, minimizer graphs; handles >15% divergence |
## Repeat Masking
Softmask genomes before synteny analysis. Unmasked repeats produce millions of spurious BLAST alignments from transposable elements, drowning real syntenic signal. Use RepeatMasker with a species-specific library (or build one with RepeatModeler2 for non-model organisms). Always softmask (lowercase), never hardmask (N's), to preserve information for downstream tools.
## MCScanX Workflow
**Goal:** Detect conserved gene order (syntenic blocks) between two genomes.
**Approach:** Prepare GFF and all-vs-all BLASTP input files, run MCScanX to identify collinear gene blocks, parse the collinearity output, and classify syntenic relationships by coverage ratios.
```python
'''Synteny analysis with MCScanX and visualization'''
import subprocess
import pandas as pd
from collections import defaultdict
def prepare_mcscanx_input(gff_file, fasta_file, species_prefix):
'''Prepare input files for MCScanX
MCScanX requires:
1. .gff file: gene positions (sp gene chr start end)
2. .blast file: all-vs-all BLASTP results
'''
genes = []
with open(gff_file) as f:
for line in f:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
if parts[2] == 'gene':
chrom = parts[0]
start, end = int(parts[3]), int(parts[4])
gene_id = parts[8].split('ID=')[1].split(';')[0]
genes.append(f'{species_prefix}\t{gene_id}\t{chrom}\t{start}\t{end}')
with open(f'{species_prefix}.gff', 'w') as f:
f.write('\n'.join(genes))
return f'{species_prefix}.gff'
def run_mcscanx(gff1, gff2, blast_file, output_prefix):
'''Run MCScanX for synteny detection
Key parameters:
-k 50: Match score per collinear gene pair
-g -1: Gap penalty per intervening gene
-s 5: Minimum block size (5 genes default; use 10+ for stringent analysis)
-m 25: Maximum gaps between anchor pairs (25 default)
-e 1e-5: BLAST E-value threshold (use 1e-10 for close species)
Default parameters: ~0.59 sensitivity / 0.81 precision on benchmarks.
Tandem duplications are automatically collapsed before detection.
'''
# Combine GFF files
subprocess.run(f'cat {gff1} {gff2} > {output_prefix}.gff', shell=True)
# Copy BLAST file
subprocess.run(f'cp {blast_file} {output_prefix}.blast', shell=True)
# Run MCScanX
# -s 5: Min genes per block (raise to 10 for stringent, lower to 3 for sensitive)
# -m 25: Max gaps (raise for degraded synteny, lower for recent comparisons)
cmd = f'MCScanX -s 5 -m 25 {output_prefix}'
subprocess.run(cmd, shell=True)
return f'{output_prefix}.collinearity'
def parse_collinearity(collinearity_file):
'''Parse MCScanX collinearity output
Output format:
## Alignment X: score=N e_value=X N genes
X-Y: gene1 gene2
'''
blocks = []
current_block = None
with open(collinearity_file) as f:
for line in f:
if line.startswith('## Alignment'):
if current_block:
blocks.append(current_block)
parts = line.strip().split()
score = int(parts[3].split('=')[1])
e_value = float(parts[4].split('=')[1])
n_genes = int(parts[5])
current_block = {
'score': score,
'e_value': e_value,
'n_genes': n_genes,
'gene_pairs': []
}
elif current_block and '-' in line and ':' in line:
parts = line.strip().split()
if len(parts) >= 3:
gene1, gene2 = parts[1], parts[2]
current_block['gene_pairs'].append((gene1, gene2))
if current_block:
blocks.append(current_block)
return blocks
def classify_synteny_type(blocks, species1_chroms, species2_chroms):
'''Classify syntenic relationships
Types:
- 1:1: Direct orthology (conserved)
- 1:many: Lineage-specific duplication
- many:many: Ancient WGD or complex rearrangement
'''
sp1_coverage = defaultdict(list)
sp2_coverage = defaultdict(list)
for block in blocks:
for gene1, gene2 in block['gene_pairs']:
chr1 = species1_chroms.get(gene1)
chr2 = species2_chroms.get(gene2)
if chr1 and chr2:
sp1_coverage[chr1].append(chr2)
sp2_coverage[chr2].append(chr1)
classifications = []
for chr1, partners in sp1_coverage.items():
unique_partners = len(set(partners))
if unique_partners == 1:
classifications.append(('1:1', chr1, partners[0]))
else:
classifications.append(('1:many', chr1, set(partners)))
return classifications
```
## SyRI for Structural Variants
**Goal:** Identify structural rearrangements (inversions, translocations, duplications) between two genome assemblies.
**Approach:** Align genomes with minimap2, run SyRI on the alignment to detect syntenic regions and structural variants, and parse the output into a variant table.
```python
def run_syri(ref_genome, query_genome, alignment_file, output_prefix):
'''Run SyRI for structural rearrangement identification
SyRI detects:
- Syntenic regions (SYN)
- Inversions (INV)
- Translocations (TRANS)
- Duplications (DUP)
- Insertions/Deletions (INS/DEL)
Requires whole-genome alignment (minimap2 or MUMmer)
'''
# Align genomes with minimap2
align_cmd = f'minimap2 -ax asm5 {ref_genome} {query_genome} > {output_prefix}.sam'
subprocess.run(align_cmd, shell=True)
# Run SyRI
syri_cmd = f'syri -c {output_prefix}.sam -r {ref_genome} -q {query_genome} -F S --prefix {output_prefix}'
subprocess.run(syri_cmd, shell=True)
return f'{output_prefix}syri.out'
def parse_syri_output(syri_file):
'''Parse SyRI structural variant output'''
variants = []
with open(syri_file) as f:
for line in f:
parts = line.strip().split('\t')
if len(parts) >= 10:
var_type = parts[10]
ref_chr, ref_start, ref_end = parts[0], int(parts[1]), int(parts[2])
qry_chr, qry_start, qry_end = parts[5], int(parts[6]), int(parts[7])
variants.append({
'type': var_type,
'ref_chr': ref_chr,
'ref_start': ref_start,
'ref_end': ref_end,
'qry_chr': qry_chr,
'qry_start': qry_start,
'qry_end': qry_end,
'size': ref_end - ref_start
})
return pd.DataFrame(variants)
```
## JCVI Visualization
```python
def create_synteny_plot(blocks_file, layout_file, output_file):
'''Create synteny dot plot with JCVI
JCVI (python-jcvi) provides publication-ready figures.
Key JCVI parameters for pairwise synteny:
--cscore 0.70: Default hit stringency (ratio of best to 2nd-best hit)
--cscore 0.99: Reciprocal best hits only (strictest, fewest false positives)
--minspan 20: Minimum syntenic block size in genes
--tandem_Nmax 10: Filter tandem duplicates within this gene distance
'''
from jcvi.graphics.dotplot import dotplot
from jcvi.graphics.karyotype import karyotype
# Dotplot shows collinear blocks as diagonal lines
# Good for detecting WGD (parallel diagonals)
dotplot_cmd = f'python -m jcvi.graphics.dotplot {blocks_file}'
subprocess.run(dotplot_cmd, shell=True)
# Karyotype view for chromosome-level comparison
karyotype_cmd = f'python -m jcvi.graphics.karyotype {layout_file}'
subprocess.run(karyotype_cmd, shell=True)
def detect_wgd(blocks, min_parallel_blocks=3):
'''Detect whole-genome duplication signatures
WGD indicators:
- Multiple parallel syntenic blocks in self-comparison
- 2:1 or 4:1 chromosome ratios between genomes
- Ks distribution peak at consistent value across syntenic pairs
- MCScanX "duplication depth" of 1 = one WGD round
Pitfalls:
- Very recent WGDs (low Ks) can be masked by continuous small-scale duplications
- Ks peaks from different WGDs can merge if timing is similar
- Ks > 2 is saturated and unreliable for dating ancient WGDs
- Use wgd v2 tool for mixture model fitting and phylogenetic dating
min_parallel_blocks: Minimum parallel blocks to call WGD (3 minimum)
'''
chrom_ratios = defaultdict(list)
for block in blocks:
if block['n_genes'] >= 10: # Focus on substantial blocks
pairs = block['gene_pairs']
# Extract chromosome info from gene names
# Implementation depends on naming convention
pass
return chrom_ratios
```
## Ks Analysis for Dating
**Goal:** Date gene duplication and speciation events using synonymous substitution rates (Ks).
**Approach:** Calculate Ks for syntenic gene pairs using PAML yn00/codeml, plot the Ks distribution, and identify peaks corresponding to whole-genome duplication events.
```python
def calculate_ks_for_pairs(cds_file1, cds_file2, gene_pairs):
'''Calculate synonymous substitution rate (Ks) for gene pairs
Ks interpretation:
- Ks < 0.1: Very recent divergence or gene conversion
- Ks 0.1-0.5: Within-species duplication (recent WGD)
- Ks 0.5-1.5: Between closely related species (older WGD)
- Ks 1.5-2.0: Approaching saturation, interpret with caution
- Ks > 2.0: Saturated -- synonymous sites have undergone multiple substitutions
Ks peaks indicate WGD events. Use mixture models (e.g., wgd v2) to
formally fit components rather than visually identifying peaks.
Polyploids show multiple peaks (one per WGD event).
'''
from Bio import SeqIO
from Bio.Seq import Seq
# Load CDS sequences
cds1 = SeqIO.to_dict(SeqIO.parse(cds_file1, 'fasta'))
cds2 = SeqIO.to_dict(SeqIO.parse(cds_file2, 'fasta'))
ks_values = []
for gene1, gene2 in gene_pairs:
if gene1 in cds1 and gene2 in cds2:
# Run yn00 or codeml for Ks calculation
# Simplified - actual implementation uses PAML
pass
return ks_values
def plot_ks_distribution(ks_values, output_file):
'''Plot Ks distribution to identify WGD peaks'''
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
# Filter saturated values
# Ks > 2 is typically saturated and unreliable
ks_filtered = [k for k in ks_values if 0 < k < 2]
fig, ax = plt.subplots(figsize=(10, 6))
# Histogram
ax.hist(ks_filtered, bins=50, density=True, alpha=0.7, label='Ks distribution')
# KDE for peak detection
if len(ks_filtered) > 10:
kde = gaussian_kde(ks_filtered)
x = np.linspace(0, 2, 200)
ax.plot(x, kde(x), 'r-', linewidth=2, label='KDE')
ax.set_xlabel('Ks (synonymous substitution rate)')
ax.set_ylabel('Density')
ax.legend()
plt.savefig(output_file)
return fig
```
## Polyploidy Considerations
For polyploid genomes (common in plants, ~35% are recent polyploids):
- **Autopolyploids**: Subgenome assignment is extremely challenging due to high homeolog similarity; risk of chimeric assembly
- **Allopolyploids**: Easier to resolve subgenomes (greater parental divergence); assign subgenomes before comparative analysis
- AnchorWave handles WGD-aware alignment when ploidy level is specified (`proali` mode with `-R` for ploidy)
- OrthoFinder and similar tools cannot natively handle polyploid genomes -- each subgenome appears as "extra" genes; assign subgenomes first or use multi-labeled tree methods
- Ks plots in polyploids show multiple peaks (one per WGD event); peaks from different events can merge if timing overlaps
## Related Skills
- comparative-genomics/positive-selection - dN/dS analysis on syntenic gene pairs
- comparative-genomics/ortholog-inference - Identify orthologs for synteny context
- phylogenetics/modern-tree-inference - Phylogenetic context for synteny dating
- alignment/pairwise-alignment - Sequence alignment for Ks calculation
- genome-annotation/annotation-transfer - Transfer annotations using syntenic context
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.