bio-copy-number-cnv-annotation
$
npx mdskill add GPTomics/bioSkills/bio-copy-number-cnv-annotationAnnotate copy number variants with genes, pathways, and clinical data.
- Identifies affected genes and assesses clinical significance from CNV calls.
- Depends on bedtools, pandas, pybedtools, and pysam libraries.
- Executes overlap queries between CNV segments and gene annotation databases.
- Outputs intersected gene lists and clinical pathway associations in text.
SKILL.md
.github/skills/bio-copy-number-cnv-annotationView on GitHub ↗
---
name: bio-copy-number-cnv-annotation
description: Annotate CNVs with genes, pathways, and clinical significance. Use when interpreting CNV calls or identifying affected genes from copy number analysis.
tool_type: mixed
primary_tool: bedtools
---
## Version Compatibility
Reference examples tested with: bedtools 2.31+, pandas 2.2+, pybedtools 0.9+, pysam 0.22+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# CNV Annotation
**"Annotate my CNV calls with gene names"** → Overlap CNV segments with gene annotations, clinical databases, and pathway information to identify affected genes and assess clinical significance.
- CLI: `bedtools intersect -a cnvs.bed -b genes.bed`
- Python: `pybedtools.BedTool().intersect()`
## Annotate with Gene Names (bedtools)
```bash
# Convert CNV segments to BED
awk 'NR>1 {print $1"\t"$2"\t"$3"\t"$5"\t"$6}' sample.cns > sample.cnv.bed
# Intersect with gene annotations
bedtools intersect -a sample.cnv.bed -b genes.bed -wa -wb > cnv_genes.txt
# Get genes overlapping CNVs
bedtools intersect -a genes.bed -b sample.cnv.bed -u > affected_genes.bed
```
## CNVkit Gene Annotation
```bash
# Annotate during analysis
cnvkit.py batch tumor.bam --normal normal.bam \
--targets targets.bed \
--annotate refFlat.txt \
--fasta reference.fa \
-o results/
# Genes are included in output CNS file
```
## Python: Comprehensive Annotation
**Goal:** Annotate CNV segments with all overlapping genes using interval intersection.
**Approach:** Convert CNV segments and gene annotations to BedTool objects, intersect to find overlapping genes, and aggregate gene names per CNV segment.
```python
import pandas as pd
import pybedtools as pbt
def annotate_cnvs(cns_file, gene_bed, output=None):
'''Annotate CNV segments with overlapping genes.'''
cns = pd.read_csv(cns_file, sep='\t')
# Create BED from segments
cns_bed = pbt.BedTool.from_dataframe(
cns[['chromosome', 'start', 'end', 'log2']].rename(
columns={'chromosome': 'chrom'}))
genes = pbt.BedTool(gene_bed)
# Intersect
intersect = cns_bed.intersect(genes, wa=True, wb=True)
# Parse results
results = []
for interval in intersect:
results.append({
'chrom': interval[0],
'start': int(interval[1]),
'end': int(interval[2]),
'log2': float(interval[3]),
'gene_chrom': interval[4],
'gene_start': int(interval[5]),
'gene_end': int(interval[6]),
'gene_name': interval[7] if len(interval) > 7 else 'NA'
})
df = pd.DataFrame(results)
# Aggregate genes per CNV
cnv_genes = df.groupby(['chrom', 'start', 'end', 'log2'])['gene_name'].apply(
lambda x: ','.join(sorted(set(x)))).reset_index()
if output:
cnv_genes.to_csv(output, sep='\t', index=False)
return cnv_genes
```
## Annotate with Cancer Gene Census
**Goal:** Flag known cancer-associated genes within CNV regions.
**Approach:** Load the COSMIC Cancer Gene Census, cross-reference with genes overlapping CNVs, and tag matching genes.
```python
import pandas as pd
def annotate_cancer_genes(cnv_genes, cgc_file):
'''Flag cancer-associated genes in CNVs.'''
cgc = pd.read_csv(cgc_file, sep='\t')
cancer_genes = set(cgc['Gene Symbol'].tolist())
def check_cancer_genes(genes):
if pd.isna(genes):
return ''
gene_list = genes.split(',')
cancer = [g for g in gene_list if g in cancer_genes]
return ','.join(cancer)
cnv_genes['cancer_genes'] = cnv_genes['gene_name'].apply(check_cancer_genes)
cnv_genes['n_cancer_genes'] = cnv_genes['cancer_genes'].apply(
lambda x: len(x.split(',')) if x else 0)
return cnv_genes
```
## Annotate with ACMG/ClinVar
**Goal:** Identify pathogenic ClinVar variants within CNV regions for clinical interpretation.
**Approach:** Query the ClinVar VCF for each CNV region using pysam, collect pathogenic variants and their associated genes.
```python
def annotate_clinvar_cnvs(cnv_bed, clinvar_vcf):
'''Annotate CNVs with ClinVar variants.'''
import pysam
cnv = pd.read_csv(cnv_bed, sep='\t', header=None,
names=['chrom', 'start', 'end', 'log2'])
vcf = pysam.VariantFile(clinvar_vcf)
results = []
for _, row in cnv.iterrows():
chrom = row['chrom'].replace('chr', '')
pathogenic = []
for rec in vcf.fetch(chrom, row['start'], row['end']):
clnsig = rec.info.get('CLNSIG', [''])[0]
if 'pathogenic' in clnsig.lower():
gene = rec.info.get('GENEINFO', 'NA').split(':')[0]
pathogenic.append(gene)
results.append({
'chrom': row['chrom'],
'start': row['start'],
'end': row['end'],
'log2': row['log2'],
'clinvar_pathogenic': ','.join(set(pathogenic))
})
return pd.DataFrame(results)
```
## GISTIC2 for Recurrent CNVs
```bash
# Export segments for GISTIC
cnvkit.py export seg *.cns -o cohort.seg
# Run GISTIC2
gistic2 \
-b results/ \
-seg cohort.seg \
-refgene hg38.refgene.mat \
-genegistic 1 \
-smallmem 1 \
-broad 1 \
-brlen 0.5 \
-conf 0.90 \
-armpeel 1 \
-savegene 1
# Output: significant regions with genes
```
## AnnotSV for Comprehensive Annotation
```bash
# Export CNVs to VCF
cnvkit.py export vcf sample.cns -o sample.cnv.vcf
# Run AnnotSV
AnnotSV \
-SVinputFile sample.cnv.vcf \
-genomeBuild GRCh38 \
-outputFile sample_annotated \
-SVminSize 1000
# Output includes: genes, DGV, gnomAD-SV, ClinVar, OMIM
```
## R: Gene Enrichment of CNV Regions
**Goal:** Determine whether amplified or deleted genes are enriched for specific biological pathways.
**Approach:** Extract genes from amplified CNV regions, convert to Entrez IDs, and run GO and KEGG enrichment with clusterProfiler.
```r
library(clusterProfiler)
library(org.Hs.eg.db)
# Get genes in amplified regions
amp_genes <- unique(cnv_annotation$gene_name[cnv_annotation$log2 > 0.5])
# Convert to Entrez IDs
entrez_ids <- mapIds(org.Hs.eg.db, keys=amp_genes, keytype='SYMBOL', column='ENTREZID')
entrez_ids <- na.omit(entrez_ids)
# GO enrichment
go_results <- enrichGO(
gene=entrez_ids,
OrgDb=org.Hs.eg.db,
ont='BP',
pAdjustMethod='BH',
qvalueCutoff=0.05
)
# KEGG enrichment
kegg_results <- enrichKEGG(
gene=entrez_ids,
organism='hsa',
pAdjustMethod='BH'
)
```
## Interpret CNV States
```python
def interpret_cnv(log2, purity=1.0, ploidy=2):
'''Interpret CNV log2 ratio as copy number state.'''
# Adjusted for purity
cn = 2 * (2 ** log2)
if cn < 0.5:
return 'homozygous_deletion'
elif cn < 1.5:
return 'heterozygous_deletion'
elif cn < 2.5:
return 'diploid'
elif cn < 3.5:
return 'single_copy_gain'
else:
return 'amplification'
def summarize_cnvs(cns_annotated):
'''Summarize CNV calls.'''
cns_annotated['cnv_type'] = cns_annotated['log2'].apply(interpret_cnv)
summary = {
'total_cnvs': len(cns_annotated),
'amplifications': (cns_annotated['cnv_type'] == 'amplification').sum(),
'deletions': (cns_annotated['cnv_type'].str.contains('deletion')).sum(),
'total_genes': cns_annotated['gene_name'].str.split(',').explode().nunique(),
'cancer_genes': cns_annotated['cancer_genes'].str.split(',').explode().nunique()
}
return summary
```
## Output Report
```python
def generate_cnv_report(cns_annotated, output_prefix):
'''Generate CNV annotation report.'''
# Full annotation table
cns_annotated.to_csv(f'{output_prefix}_full.tsv', sep='\t', index=False)
# High-impact CNVs
high_impact = cns_annotated[
(cns_annotated['n_cancer_genes'] > 0) |
(abs(cns_annotated['log2']) > 1)
]
high_impact.to_csv(f'{output_prefix}_high_impact.tsv', sep='\t', index=False)
# Gene-level summary
genes = cns_annotated.explode('gene_name')
gene_summary = genes.groupby('gene_name').agg({
'log2': 'mean',
'chrom': 'first',
'start': 'min',
'end': 'max'
}).reset_index()
gene_summary.to_csv(f'{output_prefix}_genes.tsv', sep='\t', index=False)
return high_impact
```
## Related Skills
- copy-number/cnvkit-analysis - Generate CNV calls
- copy-number/cnv-visualization - Visualize annotated CNVs
- pathway-analysis/go-enrichment - Enrichment of CNV genes
- genome-intervals/bed-file-basics - BED file operations
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.