bio-expression-matrix-gene-id-mapping
$
npx mdskill add GPTomics/bioSkills/bio-expression-matrix-gene-id-mappingConvert gene identifiers between Ensembl, Entrez, HGNC, and UniProt systems
- Solve mismatched gene IDs across datasets for pathway analysis or integration
- Uses MyGene.info API and biomaRt for identifier mapping
- Queries gene annotation databases based on input IDs and target identifier type
- Returns a dictionary mapping input IDs to corresponding gene symbols or identifiers
SKILL.md
.github/skills/bio-expression-matrix-gene-id-mappingView on GitHub ↗
---
name: bio-expression-matrix-gene-id-mapping
description: Convert between gene identifier systems including Ensembl, Entrez, HGNC symbols, and UniProt. Use when mapping IDs for pathway analysis or matching different data sources.
tool_type: mixed
primary_tool: biomaRt
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Gene ID Mapping
## Python: mygene
**Goal:** Convert between gene identifier systems (Ensembl, Entrez, Symbol, UniProt) using the MyGene.info API.
**Approach:** Query mygene with source IDs, specifying scopes and target fields, to build an ID mapping dictionary.
**"Convert my Ensembl gene IDs to gene symbols"** → Query a gene annotation service to map between identifier systems, handling one-to-many mappings.
```python
import mygene
import pandas as pd
mg = mygene.MyGeneInfo()
# Ensembl to Symbol
ensembl_ids = ['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736']
results = mg.querymany(ensembl_ids, scopes='ensembl.gene', fields='symbol', species='human')
mapping = {r['query']: r.get('symbol', None) for r in results}
# {'ENSG00000141510': 'TP53', 'ENSG00000012048': 'BRCA1', 'ENSG00000141736': 'ERBB2'}
# Symbol to Entrez
symbols = ['TP53', 'BRCA1', 'ERBB2']
results = mg.querymany(symbols, scopes='symbol', fields='entrezgene', species='human')
mapping = {r['query']: r.get('entrezgene', None) for r in results}
# Ensembl to multiple fields
results = mg.querymany(ensembl_ids, scopes='ensembl.gene',
fields=['symbol', 'entrezgene', 'uniprot'], species='human')
```
## Python: pyensembl
**Goal:** Map gene identifiers using a local Ensembl database for offline, fast lookups.
**Approach:** Load a specific Ensembl release and query gene objects by ID or name.
```python
from pyensembl import EnsemblRelease
# Load Ensembl release (downloads automatically first time)
ensembl = EnsemblRelease(110, species='human') # or 'mouse'
# Gene ID to symbol
gene = ensembl.gene_by_id('ENSG00000141510')
print(gene.gene_name) # TP53
# Symbol to gene ID
gene = ensembl.genes_by_name('TP53')[0]
print(gene.gene_id) # ENSG00000141510
# Batch conversion
def ensembl_to_symbol(ensembl_ids, release=110):
ens = EnsemblRelease(release, species='human')
mapping = {}
for eid in ensembl_ids:
try:
gene = ens.gene_by_id(eid.split('.')[0]) # Remove version
mapping[eid] = gene.gene_name
except ValueError:
mapping[eid] = None
return mapping
```
## Python: gseapy
```python
import gseapy as gp
# Ensembl to Symbol using Enrichr
gene_list = ['ENSG00000141510', 'ENSG00000012048']
converted = gp.biomart.ensembl2name(gene_list, organism='hsapiens')
```
## R: biomaRt
**Goal:** Map gene identifiers using the Ensembl BioMart web service in R.
**Approach:** Connect to the Ensembl BioMart and retrieve attribute mappings for a list of gene IDs.
```r
library(biomaRt)
# Connect to Ensembl
ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')
# Ensembl to Symbol
ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736')
results <- getBM(
attributes=c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id'),
filters='ensembl_gene_id',
values=ensembl_ids,
mart=ensembl
)
# Symbol to Ensembl
symbols <- c('TP53', 'BRCA1', 'ERBB2')
results <- getBM(
attributes=c('hgnc_symbol', 'ensembl_gene_id'),
filters='hgnc_symbol',
values=symbols,
mart=ensembl
)
# All available attributes
listAttributes(ensembl)
```
## R: org.db Packages
**Goal:** Map gene identifiers using Bioconductor organism annotation packages for fast local lookups.
**Approach:** Use mapIds from AnnotationDbi with organism-specific org.db packages.
```r
library(org.Hs.eg.db) # Human
library(AnnotationDbi)
# Ensembl to Symbol
ensembl_ids <- c('ENSG00000141510', 'ENSG00000012048')
symbols <- mapIds(org.Hs.eg.db, keys=ensembl_ids, keytype='ENSEMBL', column='SYMBOL')
# Symbol to Entrez
symbols <- c('TP53', 'BRCA1')
entrez <- mapIds(org.Hs.eg.db, keys=symbols, keytype='SYMBOL', column='ENTREZID')
# Available keytypes
keytypes(org.Hs.eg.db)
# ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, SYMBOL, UNIPROT, etc.
```
## Apply Mapping to Count Matrix
**Goal:** Replace gene IDs in a count matrix index with a different identifier type.
**Approach:** Map IDs via mygene, update the DataFrame index, and aggregate duplicates by summing.
**"Convert the gene IDs in my count matrix from Ensembl to symbols"** → Map the row index to a new ID type, handling version suffixes and duplicate mappings by summation.
```python
import pandas as pd
import mygene
def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol', species='human'):
'''Map gene IDs in count matrix index.'''
mg = mygene.MyGeneInfo()
# Remove version numbers from Ensembl IDs
clean_ids = [g.split('.')[0] for g in counts.index]
# Query mygene
results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species)
# Build mapping
mapping = {}
for r in results:
if to_type in r:
mapping[r['query']] = r[to_type]
# Apply mapping
new_index = [mapping.get(g.split('.')[0], g) for g in counts.index]
counts_mapped = counts.copy()
counts_mapped.index = new_index
# Handle duplicates (sum)
counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()
return counts_mapped
# Usage
counts_symbols = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')
```
## R Equivalent
**Goal:** Replace gene IDs in an R count matrix using biomaRt with duplicate aggregation.
**Approach:** Query BioMart for the mapping, merge with the count matrix, and sum duplicate rows.
```r
library(biomaRt)
map_count_matrix_ids <- function(counts, from_type='ensembl_gene_id', to_type='hgnc_symbol') {
ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')
# Remove version numbers
clean_ids <- gsub('\\..*', '', rownames(counts))
# Get mapping
mapping <- getBM(
attributes=c(from_type, to_type),
filters=from_type,
values=clean_ids,
mart=ensembl
)
# Merge and aggregate duplicates
counts$gene_id <- clean_ids
merged <- merge(counts, mapping, by.x='gene_id', by.y=from_type, all.x=TRUE)
merged$gene_id <- NULL
# Use symbol as rowname, sum duplicates
rownames(merged) <- merged[[to_type]]
merged[[to_type]] <- NULL
counts_mapped <- aggregate(. ~ rownames(merged), data=merged, FUN=sum)
rownames(counts_mapped) <- counts_mapped[,1]
counts_mapped <- counts_mapped[,-1]
return(counts_mapped)
}
```
## Handle Unmapped IDs
**Goal:** Track and gracefully handle gene IDs that fail to map to the target identifier system.
**Approach:** Keep original IDs for unmapped genes and report mapping success rate.
```python
def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
'''Map IDs with fallback for unmapped genes.'''
import mygene
mg = mygene.MyGeneInfo()
clean_ids = [g.split('.')[0] for g in gene_ids]
results = mg.querymany(clean_ids, scopes=from_type, fields=to_type, species=species)
mapping = {}
unmapped = []
for r in results:
original = gene_ids[clean_ids.index(r['query'])]
if to_type in r:
mapping[original] = r[to_type]
else:
mapping[original] = original # Keep original if unmapped
unmapped.append(original)
print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
print(f'Unmapped: {len(unmapped)}')
return mapping, unmapped
```
## Common ID Types and Database Selection
| Type | Example | Use Case | Stability |
|------|---------|----------|-----------|
| Ensembl Gene | ENSG00000141510 | RNA-seq, GTF files | Stable across releases (versioned) |
| Ensembl Transcript | ENST00000269305 | Transcript-level analysis | Stable (versioned) |
| Entrez Gene | 7157 | NCBI databases, KEGG | Stable (never reused) |
| HGNC Symbol | TP53 | Human readable display | Changes frequently |
| UniProt | P04637 | Protein databases | Stable (versioned releases) |
| RefSeq | NM_000546 | NCBI RefSeq | Stable (versioned) |
### Database Selection Guide
| Scenario | Recommended ID | Why |
|----------|---------------|-----|
| Computational key / primary index | Ensembl Gene ID | Stable, versioned, consistent with GTF |
| Pathway analysis (KEGG, Reactome) | Entrez Gene ID | Required by most pathway databases |
| GO enrichment | Entrez or Ensembl | Both supported by clusterProfiler |
| Display labels (plots, tables) | HGNC Symbol | Human-readable |
| Cross-database integration | Ensembl | Best-connected hub across databases |
| Protein-level analysis | UniProt | Primary protein database |
Best practice: use stable IDs (Ensembl or Entrez) as computational keys. Use symbols only as display labels. Always pin mappings to a specific database release and archive the cross-reference table for reproducibility.
### Gene Symbol Instability
Gene symbols change regularly as nomenclature committees update names. NCBI updates daily; Bioconductor org.db packages update every 6 months. For the most current mappings, query mygene.info or download gene_info from NCBI FTP directly. Never use symbols as the primary key in a pipeline -- always join on stable IDs and add symbols as a display column.
## PAR Gene Complications
Pseudo-autosomal region (PAR) genes exist on both X and Y chromosomes with identical sequences. In Ensembl GTF files, PAR genes have coordinates on both chromosomes, potentially creating duplicate entries in count matrices. Reads from PAR regions cannot be unambiguously assigned to X or Y.
```python
# Check for PAR gene duplicates in a count matrix
par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']
duplicated_ids = counts.index[counts.index.duplicated()].unique()
if len(duplicated_ids) > 0:
print(f'Duplicate gene entries found: {len(duplicated_ids)}')
# Sum duplicates (standard approach for PAR genes)
counts = counts.groupby(counts.index).sum()
```
Some reference genomes mask the Y-chromosome PAR to avoid double-counting. Check whether the GTF includes PAR genes on both chromosomes before building count matrices.
## Cross-Species Ortholog Mapping
**Goal:** Map gene IDs between species for cross-species comparisons or integration.
**Approach:** Use Ensembl Compara (via biomaRt) to find orthologs, selecting the appropriate stringency level.
```r
library(biomaRt)
human <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl')
mouse <- useEnsembl(biomart='genes', dataset='mmusculus_gene_ensembl')
# Human to mouse one-to-one orthologs
orthologs <- getLDS(
attributes=c('hgnc_symbol', 'ensembl_gene_id'),
filters='ensembl_gene_id',
values=human_gene_ids,
mart=human,
attributesL=c('mgi_symbol', 'ensembl_gene_id'),
martL=mouse
)
```
| Strategy | When to use | Trade-off |
|----------|------------|-----------|
| One-to-one orthologs only | Cross-species scRNA-seq integration | Most conservative; loses genes without clear orthologs |
| Include one-to-many | Broader gene coverage needed | Must select: highest homology confidence or highest expression |
| Include many-to-many | Maximum inclusivity | Introduces ambiguity; use with caution |
For cross-species scRNA-seq integration, use only one-to-one orthologs (standard practice).
## Build tx2gene for tximport
**Goal:** Create the transcript-to-gene mapping table required by tximport for gene-level summarization from Salmon/kallisto output.
**Approach:** Extract transcript-gene relationships from a GTF file or Ensembl BioMart.
```r
# From GTF (recommended for consistency with quantification index)
library(GenomicFeatures)
txdb <- makeTxDbFromGFF('annotation.gtf')
k <- keys(txdb, keytype='TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')
# From BioMart
library(biomaRt)
mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
tx2gene <- getBM(
attributes=c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
mart=mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')
```
```python
# Python: extract tx2gene from GTF
import pandas as pd
def tx2gene_from_gtf(gtf_path):
'''Extract transcript-to-gene mapping from GTF.'''
records = []
with open(gtf_path) as f:
for line in f:
if line.startswith('#') or '\ttranscript\t' not in line:
continue
attrs = line.strip().split('\t')[8]
gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0]
tx_id = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0]
records.append({'TXNAME': tx_id, 'GENEID': gene_id})
return pd.DataFrame(records).drop_duplicates()
```
## Related Skills
- expression-matrix/counts-ingest - Load count data
- expression-matrix/metadata-joins - Add annotations
- rna-quantification/tximport-workflow - Uses tx2gene mapping
- pathway-analysis/go-enrichment - Requires Entrez IDs
- pathway-analysis/kegg-pathways - Requires Entrez IDs
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.