---
name: bio-rna-quantification-tximport-workflow
description: Import transcript-level quantifications from Salmon/kallisto into R for gene-level analysis with DESeq2/edgeR using tximport or tximeta. Use when importing transcript counts into R for DESeq2/edgeR.
tool_type: r
primary_tool: tximport
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, Salmon 1.10+, edgeR 4.0+, kallisto 0.50+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code fails with an R error such as `could not find function` or `unused argument`, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# tximport Workflow
**"Import Salmon/kallisto results into DESeq2"** → Summarize transcript-level abundance estimates to gene-level counts with proper length-offset correction for use in DESeq2 or edgeR.
- R: `tximport::tximport(files, type='salmon', tx2gene=tx2gene)`
Import transcript-level estimates from Salmon, kallisto, or other quantifiers into R for gene-level differential expression analysis.
## Basic tximport
**Goal:** Import transcript-level quantifications from Salmon or kallisto into R as gene-level counts with proper length-offset correction for DESeq2 or edgeR.
**Approach:** Create a transcript-to-gene mapping from a GTF or biomaRt, then run tximport on the quantification files to produce gene-level estimated counts together with the average transcript lengths that DESeq2 or edgeR use as offsets.
```r
library(tximport)
# Define sample files
files <- c(
  sample1 = 'sample1_quant/quant.sf',
  sample2 = 'sample2_quant/quant.sf',
  sample3 = 'sample3_quant/quant.sf',
  sample4 = 'sample4_quant/quant.sf'
)
# Load transcript-to-gene mapping
tx2gene <- read.csv('tx2gene.csv') # columns: TXNAME, GENEID
# Import at gene level
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
```
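For more than a handful of samples, the named vector can be built from the directory layout instead of typed by hand. The following is a minimal sketch, assuming each sample has its own `<sample>_quant/` directory under a hypothetical `quant_root`. The temporary directory and empty files only stand in for real Salmon output so that the snippet runs on its own.

```r
# Stand-in for a real project directory: create empty quant.sf files
# (hypothetical layout: <quant_root>/<sample>_quant/quant.sf)
quant_root <- file.path(tempdir(), 'quants')
for (s in c('sample1', 'sample2', 'sample3', 'sample4')) {
  sample_dir <- file.path(quant_root, paste0(s, '_quant'))
  dir.create(sample_dir, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(sample_dir, 'quant.sf'))
}

# Discover sample directories and derive sample names from them
quant_dirs <- sort(list.dirs(quant_root, recursive = FALSE))
files <- file.path(quant_dirs, 'quant.sf')
names(files) <- sub('_quant$', '', basename(quant_dirs))

# Fail early if any quantification file is missing
stopifnot(all(file.exists(files)))
print(names(files))
```

Naming the vector matters because tximport uses those names as the column names of the returned matrices.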
## Creating tx2gene Mapping
### From GTF (using GenomicFeatures)
```r
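library(GenomicFeatures)
# In recent Bioconductor releases makeTxDbFromGFF() lives in the txdbmaker package
txdb <- makeTxDbFromGFF('annotation.gtf')
k <- keys(txdb, keytype = 'TXNAME')
# Namespace-qualify select() in case dplyr masks it; returns columns TXNAME, GENEID
tx2gene <- AnnotationDbi::select(txdb, keys = k, columns = 'GENEID', keytype = 'TXNAME')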
```
### From Ensembl (using biomaRt)
```r
library(biomaRt)
mart <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
tx2gene <- getBM(
attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
mart = mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')
```
### From Salmon quant.sf
```r
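# Works only when transcript names carry the gene ID, as in GENCODE FASTA
# headers of the form ENST...|ENSG...|... ; stripping the version from a bare
# transcript ID does NOT yield a gene ID
quant <- read.table('sample1_quant/quant.sf', header = TRUE, sep = '\t',
                    stringsAsFactors = FALSE)
fields <- strsplit(quant$Name, '|', fixed = TRUE)
tx2gene <- data.frame(
  TXNAME = quant$Name,
  GENEID = vapply(fields, function(x) x[2], character(1))  # second '|' field
)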
```
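Whichever source the mapping comes from, it is worth checking that it covers the transcript IDs in the quantification files before importing, since transcripts absent from `tx2gene` do not contribute to any gene. A minimal sketch with made-up IDs; `quant_ids` stands in for the `Name` column of a real `quant.sf`:

```r
# Toy stand-ins (made-up IDs) for quant$Name and a tx2gene table
quant_ids <- c('tx1.1', 'tx2.1', 'tx3.2', 'tx4.1')
tx2gene <- data.frame(
  TXNAME = c('tx1.1', 'tx2.1', 'tx3.2'),
  GENEID = c('geneA', 'geneA', 'geneB')
)

# Transcripts that were quantified but have no gene assignment
missing_tx <- setdiff(quant_ids, tx2gene$TXNAME)
coverage <- mean(quant_ids %in% tx2gene$TXNAME)

message(sprintf('%.0f%% of transcripts mapped; %d missing',
                100 * coverage, length(missing_tx)))
print(missing_tx)

# Duplicate transcript IDs would indicate a malformed mapping
stopifnot(!anyDuplicated(tx2gene$TXNAME))
```

Low coverage usually points to a mismatch between the annotation release used for the index and the one used for the mapping, or to version suffixes present on one side only.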
## Import Types
### Gene-Level Summarization (Default)
```r
# Summarize transcripts to gene level
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# Returns: counts, abundance (TPM), length at gene level
```
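To make the summarization concrete, the sketch below reproduces the arithmetic on toy matrices in base R. It assumes, based on the tximport documentation, that counts and abundance are summed per gene and that gene length is the abundance-weighted mean of transcript lengths. All values and names are made up, and this is an illustration rather than tximport's actual code.

```r
# Toy transcript-level matrices (made-up values): 3 transcripts x 2 samples
tx_ids  <- c('tx1', 'tx2', 'tx3')
samples <- c('s1', 's2')
gene_of <- c(tx1 = 'geneA', tx2 = 'geneA', tx3 = 'geneB')

counts    <- matrix(c(10, 30, 5,  20, 20, 8), nrow = 3,
                    dimnames = list(tx_ids, samples))
abundance <- matrix(c(1, 3, 2,  2, 2, 4), nrow = 3,
                    dimnames = list(tx_ids, samples))
tx_length <- matrix(c(1000, 2000, 1500,  1000, 2000, 1500), nrow = 3,
                    dimnames = list(tx_ids, samples))

# Counts and abundance are summed over the transcripts of each gene
gene_counts    <- rowsum(counts, gene_of[tx_ids])
gene_abundance <- rowsum(abundance, gene_of[tx_ids])

# Gene length is the abundance-weighted mean of its transcript lengths
gene_length <- rowsum(abundance * tx_length, gene_of[tx_ids]) / gene_abundance

print(gene_counts)
print(gene_length)
```

Because the weights differ between samples, `geneA` gets a different length in `s1` and `s2` even though its transcripts have fixed lengths. That per-sample length is what the downstream offset corrects for.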
### Transcript-Level (No Summarization)
```r
# Keep transcript-level estimates
txi <- tximport(files, type = 'salmon', txOut = TRUE)
# Returns: counts, abundance, length at transcript level
```
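### Scaled TPM (counts from abundance)
```r
# Derive counts from TPM, scaled to each sample's library size. Use these
# counts directly, without a length offset; TPM itself is always in txi$abundance
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene,
                countsFromAbundance = 'scaledTPM')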
```
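The scaling behind `countsFromAbundance = 'scaledTPM'` can be sketched on toy data. This assumes, following the tximport documentation, that each sample's TPM column is scaled up to that sample's library size. The values are made up and the code illustrates the idea rather than tximport's internals.

```r
# Toy gene-level matrices (made-up values): 2 genes x 2 samples
genes   <- c('geneA', 'geneB')
samples <- c('s1', 's2')
counts    <- matrix(c(40, 60,  30, 10), nrow = 2, dimnames = list(genes, samples))
abundance <- matrix(c(1, 3,  1, 1), nrow = 2, dimnames = list(genes, samples))

# Library size = total estimated counts per sample
lib_size <- colSums(counts)

# Rescale each TPM column so that it sums to the sample's library size
scaled_tpm <- sweep(abundance, 2, lib_size / colSums(abundance), '*')

print(scaled_tpm)
```

The rescaled columns keep the library sizes of the original counts, but the proportions within each sample now follow abundance rather than read counts.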
## Source-Specific Import
### Salmon
```r
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
```
### kallisto
```r
# files must point to kallisto output: abundance.h5 (needs rhdf5) or abundance.tsv
txi <- tximport(files, type = 'kallisto', tx2gene = tx2gene)
```
### RSEM
```r
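# files must point to RSEM *.isoforms.results; for *.genes.results use
# txIn = FALSE, txOut = FALSE instead of tx2gene
txi <- tximport(files, type = 'rsem', tx2gene = tx2gene)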
```
### StringTie
```r
# files must point to StringTie t_data.ctab files
txi <- tximport(files, type = 'stringtie', tx2gene = tx2gene)
```
## Using with DESeq2
```r
library(DESeq2)
# Create sample metadata
coldata <- data.frame(
condition = factor(c('control', 'control', 'treated', 'treated')),
row.names = names(files)
)
# Create DESeqDataSet from tximport
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
# Filter low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]
# Run DESeq2
dds <- DESeq(dds)
res <- results(dds)
```
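The rows of `coldata` need to be in the same order as the columns of `txi$counts`. When the metadata comes from a separate sample sheet, it can be reordered explicitly before the dataset is built. A minimal sketch in base R; `sample_names` stands in for `colnames(txi$counts)` and the sample sheet is made up:

```r
# Stand-in for colnames(txi$counts)
sample_names <- c('sample1', 'sample2', 'sample3', 'sample4')

# Hypothetical sample sheet, deliberately in a different order
sample_sheet <- data.frame(
  sample    = c('sample3', 'sample1', 'sample4', 'sample2'),
  condition = c('treated', 'control', 'treated', 'control')
)

# Every quantified sample must appear in the sheet
idx <- match(sample_names, sample_sheet$sample)
stopifnot(!anyNA(idx))

# Reorder the sheet to match the count matrix columns
coldata <- sample_sheet[idx, ]
rownames(coldata) <- coldata$sample

# Put the reference level first so fold changes read as treated vs control
coldata$condition <- factor(coldata$condition, levels = c('control', 'treated'))

stopifnot(identical(rownames(coldata), sample_names))
print(coldata)
```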
## Using with edgeR
```r
library(edgeR)
# Create DGEList with offset
cts <- txi$counts
normMat <- txi$length
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts / normMat)) + log(colSums(cts / normMat))
y <- DGEList(cts)
y$offset <- t(t(log(normMat)) + o)
# Continue with edgeR analysis (coldata as defined in the DESeq2 section)
design <- model.matrix(~ condition, data = coldata)
y <- estimateDisp(y, design)
```
## tximeta: Metadata-Aware Import
tximeta automatically attaches transcript and gene information from the original annotation.
```r
library(tximeta)
# First time: link transcriptome to annotation
makeLinkedTxome(
indexDir = 'salmon_index',
source = 'Ensembl',
organism = 'Homo sapiens',
release = '110',
genome = 'GRCh38',
fasta = 'transcripts.fa',
gtf = 'annotation.gtf'
)
# Import with full metadata
coldata <- data.frame(
files = files,
names = names(files),
condition = c('control', 'control', 'treated', 'treated')
)
se <- tximeta(coldata)
# Summarize to gene level
gse <- summarizeToGene(se)
# Convert to DESeqDataSet
dds <- DESeqDataSet(gse, design = ~ condition)
```
## tximport Output Structure
```r
names(txi)
# [1] "abundance" "counts" "length"
# [4] "countsFromAbundance"
# abundance: TPM values (genes x samples)
# counts: estimated counts (genes x samples)
# length: abundance-weighted average transcript length per gene (genes x samples)
```
## Handling Version Numbers
```r
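# If tx2gene carries versions but the quantification IDs do not, strip them from tx2gene
tx2gene$TXNAME <- gsub('\\.\\d+$', '', tx2gene$TXNAME)
# If the quantification IDs carry versions or GENCODE '|' suffixes, strip them
# at import (tx2gene must then hold the bare transcript IDs)
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene,
                ignoreTxVersion = TRUE, ignoreAfterBar = TRUE)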
```
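The effect of these clean-up steps on individual IDs can be checked in base R. The sketch below uses made-up IDs and approximates what `ignoreAfterBar` and `ignoreTxVersion` are documented to do. The exact patterns tximport uses internally are an assumption here.

```r
# Made-up transcript IDs: versioned, GENCODE-style with '|' fields, and bare
ids <- c('ENST0001.2', 'ENST0002.11|ENSG0001.3|extra', 'ENST0003')

# Drop everything from the first '|' onward (cf. ignoreAfterBar)
no_bar <- sub('\\|.*', '', ids)

# Drop a trailing numeric version suffix (cf. ignoreTxVersion)
no_version <- sub('\\.\\d+$', '', no_bar)

print(no_bar)
print(no_version)
```

Running the same two substitutions over `tx2gene$TXNAME` and comparing with `%in%` is a quick way to confirm that both sides will match before calling tximport.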
## Related Skills
- rna-quantification/alignment-free-quant - Upstream Salmon/kallisto
- differential-expression/deseq2-basics - DESeq2 analysis
- differential-expression/edger-basics - edgeR analysis
- genome-intervals/gtf-gff-handling - GTF annotation parsing