bio-immunoinformatics-neoantigen-prediction
$
npx mdskill add GPTomics/bioSkills/bio-immunoinformatics-neoantigen-predictionPredict tumor neoantigens from somatic mutations using pVACtools for personalized immunotherapy
- Identify vaccine targets and biomarkers from tumor sequencing data
- Relies on pVACtools, Ensembl VEP, MHCflurry, and IEDB for binding prediction
- Analyzes HLA binding affinity of mutant peptides to predict T-cell responses
- Outputs ranked neoantigen candidates for downstream immunotherapy applications
SKILL.md
.github/skills/bio-immunoinformatics-neoantigen-predictionView on GitHub ↗
---
name: bio-immunoinformatics-neoantigen-prediction
description: Identify tumor neoantigens from somatic mutations using pVACtools for personalized cancer immunotherapy. Predict mutant peptides that bind patient HLA and may elicit T-cell responses. Use when identifying vaccine targets or checkpoint inhibitor response biomarkers from tumor sequencing data.
tool_type: python
primary_tool: pVACtools
---
## Version Compatibility
Reference examples tested with: Ensembl VEP 111+, MHCflurry 2.1+, pVACtools 4.1+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Neoantigen Prediction
**"Identify neoantigens from my tumor mutations"** → Predict mutant peptides from somatic variants that bind patient HLA alleles and may elicit T-cell responses for personalized cancer immunotherapy.
- CLI: `pvacseq run` with VEP-annotated VCF and patient HLA types (pVACtools)
## pVACtools Pipeline (Ensembl VEP 111+)
**Goal:** Install pVACtools and its IEDB prediction engine dependencies.
**Approach:** Install via pip (optionally in a dedicated conda environment) and download IEDB tools for binding prediction.
```bash
# Install pVACtools
pip install pvactools
# Or use conda for dependencies
conda create -n pvactools python=3.8
conda activate pvactools
pip install pvactools
# Download IEDB tools
pvactools download_iedb_tools
```
## pVACseq Workflow (Ensembl VEP 111+)
**Goal:** Run the full pVACseq neoantigen prediction pipeline on a VEP-annotated VCF.
**Approach:** Provide annotated VCF with patient HLA alleles and select binding prediction algorithms; pVACseq generates mutant peptides and predicts MHC binding.
```bash
# Run pVACseq on annotated VCF
pvacseq run \
annotated.vcf \
sample_name \
"HLA-A*02:01,HLA-A*24:02,HLA-B*07:02,HLA-B*44:02" \
MHCflurry MHCnuggetsI \
output_dir \
-e1 8,9,10,11 \
--iedb-install-directory /path/to/iedb
# Key parameters:
# -e1: Epitope lengths for MHC-I (8-11)
# -e2: Epitope lengths for MHC-II (15)
# --binding-threshold: IC50 cutoff (default 500)
# --percentile-threshold: Alternative cutoff
```
## VCF Annotation Requirements (Ensembl VEP 111+)
**Goal:** Annotate somatic VCF with transcript consequences and amino acid changes required by pVACseq.
**Approach:** Run Ensembl VEP with Downstream and Wildtype plugins to produce a VCF containing protein-level mutation annotations.
```bash
# pVACseq requires VEP-annotated VCF
# Must include transcript and amino acid changes
# Run VEP first
vep -i somatic.vcf -o annotated.vcf \
--cache --offline \
--format vcf --vcf \
--plugin Downstream \
--plugin Wildtype \
--terms SO \
--symbol
```
## Parse pVACseq Results
**Goal:** Parse pVACseq output and calculate the differential agretopicity index (DAI) for candidate neoantigens.
**Approach:** Load TSV results, filter by binding threshold, and compute WT/MT binding ratio to identify mutations that create new epitopes.
```python
import pandas as pd
def parse_pvacseq_results(results_file):
'''Parse pVACseq output
Key columns:
- Mutation: Gene and amino acid change
- HLA Allele: Patient HLA presenting this peptide
- MT Epitope Seq: Mutant peptide sequence
- WT Epitope Seq: Wild-type peptide sequence
- Median MT Score: Binding affinity (nM)
- Median WT Score: WT binding (for agretopicity)
- Tumor DNA VAF: Variant allele frequency
- Gene Expression: If RNA-seq available
'''
df = pd.read_csv(results_file, sep='\t')
# Filter by binding threshold
strong_binders = df[df['Median MT Score'] < 500]
return strong_binders
def calculate_agretopicity(df):
'''Calculate agretopicity (DAI) score
Agretopicity = ratio of WT to MT binding
Higher agretopicity means MT binds better than WT
indicating mutation creates new epitope
DAI (Differential Agretopicity Index):
- >1: Mutant binds better (favorable)
- ~1: Similar binding (less likely immunogenic)
- <1: WT binds better (unfavorable)
'''
df = df.copy()
df['agretopicity'] = df['Median WT Score'] / df['Median MT Score']
# High agretopicity = mutation improves binding
df['dai_favorable'] = df['agretopicity'] > 1
return df
```
## Prioritize Neoantigens (Ensembl VEP 111+)
**Goal:** Rank neoantigen candidates for vaccine design by combining binding, clonality, and expression evidence.
**Approach:** Apply sequential filters (binding affinity, VAF, expression) and compute a composite priority score weighting inverse IC50, VAF, and agretopicity.
```python
def prioritize_neoantigens(df, vaf_threshold=0.1, expression_threshold=1.0):
'''Prioritize neoantigens for vaccine design
Criteria for good neoantigen candidates:
1. Strong MHC binding (IC50 < 500nM, ideally < 50nM)
2. High agretopicity (MT binds better than WT)
3. High tumor VAF (clonal, present in most tumor cells)
4. Expressed in tumor (if RNA-seq available)
5. Not in tolerogenic region (self-similarity check)
Typical pipeline returns 10-50 candidates per patient
'''
candidates = df.copy()
# Filter by binding
candidates = candidates[candidates['Median MT Score'] < 500]
# Filter by VAF (clonal mutations preferred)
if 'Tumor DNA VAF' in candidates.columns:
candidates = candidates[candidates['Tumor DNA VAF'] >= vaf_threshold]
# Filter by expression
if 'Gene Expression' in candidates.columns:
candidates = candidates[candidates['Gene Expression'] >= expression_threshold]
# Calculate priority score
# Lower binding affinity = better
# Higher VAF = better
# Higher agretopicity = better
candidates['priority_score'] = (
(1 / candidates['Median MT Score']) *
candidates.get('Tumor DNA VAF', 1) *
candidates.get('agretopicity', 1)
)
return candidates.sort_values('priority_score', ascending=False)
```
## Alternative: Manual Neoantigen Pipeline (Ensembl VEP 111+)
**Goal:** Predict neoantigens without pVACtools by directly extracting mutant peptides from an annotated VCF and predicting MHC binding.
**Approach:** Parse VEP annotations from VCF via cyvcf2, generate mutant peptides around each coding mutation, and predict binding with MHCflurry.
```python
def manual_neoantigen_pipeline(vcf_file, hla_alleles, reference_fasta):
'''Simplified neoantigen prediction without pVACtools
Steps:
1. Extract coding mutations from VCF
2. Generate mutant protein sequences
3. Extract peptides around mutation
4. Predict MHC binding
'''
from cyvcf2 import VCF
from mhcflurry import Class1PresentationPredictor
vcf = VCF(vcf_file)
predictor = Class1PresentationPredictor.load()
neoantigens = []
for variant in vcf:
# Get amino acid change from VEP annotation
if 'CSQ' not in variant.INFO:
continue
# Parse consequence and extract mutant peptides
# ... (implementation depends on annotation format)
# For each mutant peptide, predict binding
for peptide in mutant_peptides:
for allele in hla_alleles:
pred = predictor.predict(peptides=[peptide], alleles=[allele])
if pred['mhcflurry_affinity'].values[0] < 500:
neoantigens.append({
'variant': f'{variant.CHROM}:{variant.POS}',
'peptide': peptide,
'allele': allele,
'affinity': pred['mhcflurry_affinity'].values[0]
})
return neoantigens
```
## Neoantigen Quality Metrics (Ensembl VEP 111+)
**Goal:** Assess neoantigen quality across multiple dimensions and produce a composite confidence score.
**Approach:** Normalize binding affinity, agretopicity, clonality, and expression to 0-1 scales and combine with domain-informed weights.
```python
def assess_neoantigen_quality(neoantigen):
'''Assess multiple quality metrics for neoantigen
Returns composite quality score considering:
- Binding affinity
- Agretopicity
- Clonality (VAF)
- Expression
- Self-similarity
'''
scores = {}
# Binding (0-1, lower IC50 = higher score)
ic50 = neoantigen.get('Median MT Score', 500)
scores['binding'] = 1 - min(ic50 / 5000, 1)
# Agretopicity (0-1)
dai = neoantigen.get('agretopicity', 1)
scores['agretopicity'] = min(dai / 10, 1)
# Clonality (0-1)
vaf = neoantigen.get('Tumor DNA VAF', 0.5)
scores['clonality'] = vaf
# Expression (0-1, log scale)
import math
expr = neoantigen.get('Gene Expression', 1)
scores['expression'] = min(math.log10(expr + 1) / 3, 1)
# Composite score
weights = {'binding': 0.3, 'agretopicity': 0.3, 'clonality': 0.2, 'expression': 0.2}
composite = sum(scores[k] * weights[k] for k in weights)
return composite, scores
```
## Related Skills
- immunoinformatics/mhc-binding-prediction - MHC binding details
- immunoinformatics/immunogenicity-scoring - Prioritization
- variant-calling/variant-calling - Input somatic mutations
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.