bio-ecological-genomics-landscape-genomics
$
npx mdskill add GPTomics/bioSkills/bio-ecological-genomics-landscape-genomicsDetect adaptive loci linking genetics to environmental gradients.
- Identifies genes under local adaptation across climate zones.
- Integrates LFMM2, pcadapt, OutFLANK, and gradientForest tools.
- Controls population structure to isolate true environmental signals.
- Outputs ranked loci with effect sizes and p-values.
SKILL.md
.github/skills/bio-ecological-genomics-landscape-genomicsView on GitHub ↗
---
name: bio-ecological-genomics-landscape-genomics
description: Tests genotype-environment associations and identifies loci under local adaptation using LFMM2 (LEA), pcadapt outlier detection, OutFLANK Fst-based selection scans, and redundancy analysis. Detects adaptive genetic variation correlated with environmental variables while controlling for population structure. Use when identifying adaptive loci across environmental gradients, testing for signatures of local adaptation, or predicting genetic vulnerability to climate change with gradientForest.
tool_type: r
primary_tool: LEA
---
## Version Compatibility
Reference examples tested with: vegan 2.6+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Landscape Genomics
**"Find loci associated with environmental adaptation in my populations"** → Test genotype-environment associations using LFMM2 latent factor mixed models or pcadapt outlier detection while controlling for population structure, and predict climate vulnerability with gradientForest.
- R: `LEA::lfmm2()` for genotype-environment association testing
- R: `pcadapt::pcadapt()` for selection scan without environmental data
Identifies loci under local adaptation by testing genotype-environment associations while controlling for population structure.
## Population Structure Estimation with LEA
**Goal:** Determine the number of ancestral populations (K) as a prerequisite for genotype-environment association testing.
**Approach:** Run sNMF on genotype data across K=1-10 and select the K with minimum cross-entropy.
Determine K (number of ancestral populations) before running GEA:
```r
library(LEA)
# Convert VCF to lfmm/geno format
vcf2lfmm('variants.vcf', 'genotypes.lfmm')
vcf2geno('variants.vcf', 'genotypes.geno')
# sNMF for K estimation (faster than STRUCTURE)
snmf_result <- snmf('genotypes.geno', K = 1:10, repetitions = 5,
entropy = TRUE, project = 'new')
# Select K with minimum cross-entropy across K values
# cross.entropy(obj, K) returns per-run values for that K; take min per K
ce_values <- sapply(1:10, function(k) min(cross.entropy(snmf_result, K = k)))
plot(1:10, ce_values, xlab = 'K', ylab = 'Cross-entropy', pch = 19, col = 'blue')
best_K <- which.min(ce_values)
```
## LFMM2 Genotype-Environment Association
**Goal:** Identify loci significantly associated with environmental variables while controlling for population structure.
**Approach:** Fit LFMM2 with K latent factors, extract p-values with genomic inflation calibration, and apply Storey q-value FDR correction.
Latent factor mixed model tests association between each locus and environmental variables while correcting for population structure via latent factors:
```r
library(LEA)
genotypes <- read.lfmm('genotypes.lfmm')
env_vars <- read.env('environment.env')
# K from sNMM cross-entropy (latent factors = population structure proxy)
# K=3: determined from sNMF analysis above
lfmm_result <- lfmm2(input = genotypes, env = env_vars, K = 3)
# Extract p-values for each environmental variable
# Genomic inflation factor (GIF) calibration corrects for residual confounding
# Target lambda ~ 1.0; >1.5 suggests insufficient structure correction
pvalues <- lfmm2.test(lfmm_result, input = genotypes, env = env_vars,
full = TRUE, genomic.control = TRUE)
gif <- median(qchisq(1 - pvalues$pvalues[, 1], df = 1)) / qchisq(0.5, df = 1)
cat('Genomic inflation factor (lambda):', gif, '\n')
# lambda should be ~1.0; if >1.5, increase K; if <0.5, decrease K
# Adjust p-values for multiple testing
# q-value < 0.05: Storey FDR control, more powerful than BH for genomic data
library(qvalue)
qvals <- qvalue(pvalues$pvalues[, 1])$qvalues
# Candidate loci under selection
# q < 0.05: 5% false discovery rate
candidates <- which(qvals < 0.05)
cat('Candidate adaptive loci:', length(candidates), '\n')
```
### Multiple Environmental Variables
```r
# env_vars matrix: samples x variables (e.g., temperature, precipitation, altitude)
# LFMM2 tests each variable separately but fits latent factors jointly
for (i in 1:ncol(env_vars)) {
pvals <- lfmm2.test(lfmm_result, input = genotypes, env = env_vars,
full = TRUE, genomic.control = TRUE)
qvals_i <- qvalue(pvals$pvalues[, i])$qvalues
cat('Variable', i, '- candidates (q<0.05):', sum(qvals_i < 0.05), '\n')
}
```
## pcadapt Outlier Detection
**Goal:** Detect loci under selection from genotype data alone, without requiring environmental variables.
**Approach:** Compute Mahalanobis distances from K principal components with pcadapt, then identify outlier loci via q-value FDR control.
Detects loci under selection based on Mahalanobis distances from principal components, without requiring environmental data:
```r
library(pcadapt)
# Read genotypes (bed/bim/fam or lfmm format)
geno <- read.pcadapt('genotypes.bed', type = 'bed')
# Select K from screeplot (number of significant PCs)
# Look for elbow where eigenvalues flatten
x_scree <- pcadapt(geno, K = 20)
plot(x_scree, option = 'screeplot')
# Run pcadapt with selected K
# K=3: number of principal components capturing population structure
x <- pcadapt(geno, K = 3)
# QQ-plot to check calibration (should follow diagonal except tail)
plot(x, option = 'qqplot')
# Manhattan-style plot of -log10(p-values)
plot(x, option = 'manhattan')
# Identify outliers using q-values
# q < 0.05: 5% FDR threshold
library(qvalue)
qvals <- qvalue(x$pvalues)$qvalues
outliers <- which(qvals < 0.05)
cat('pcadapt outlier loci:', length(outliers), '\n')
```
## OutFLANK Fst-Based Selection Scan
**Goal:** Identify loci with elevated Fst consistent with diversifying selection across populations.
**Approach:** Fit a chi-squared distribution to the trimmed neutral Fst distribution and flag outlier loci exceeding the expected neutral envelope.
Identifies Fst outliers by fitting a chi-squared distribution to the trimmed Fst distribution:
```r
library(OutFLANK)
# fst_data: matrix with columns FSTNoCorr, T1, T2, indexOrder, LocusName, He
# From hierfstat or custom calculation
fst_mat <- MakeDiploidFSTMat(SNPmat = genotype_matrix,
locusNames = snp_names,
popNames = pop_assignments)
# Trim extremes to estimate neutral Fst distribution
# LeftTrimFraction=0.05, RightTrimFraction=0.05: standard 5% trim on each tail
# Hmin=0.1: exclude low-heterozygosity loci (unreliable Fst estimates)
outflank_result <- OutFLANK(fst_mat, LeftTrimFraction = 0.05,
RightTrimFraction = 0.05, Hmin = 0.1,
NumberOfSamples = length(unique(pop_assignments)),
qthreshold = 0.05)
# Visualize Fst distribution and outliers
OutFLANKResultsPlotter(outflank_result, withOutliers = TRUE,
NoCorr = TRUE, Hmin = 0.1)
# Extract outlier loci
outlier_idx <- which(outflank_result$results$OutlierFlag == TRUE)
cat('OutFLANK Fst outliers:', length(outlier_idx), '\n')
```
## RDA-Based GEA
**Goal:** Detect multilocus signatures of adaptation correlated with environmental gradients using constrained ordination.
**Approach:** Run partial RDA on allele frequencies conditioned on population structure, then identify outlier loci from RDA loading z-scores.
Redundancy analysis detects multilocus adaptive signatures correlated with environment:
```r
library(vegan)
# allele_freq: matrix of allele frequencies (individuals x loci)
# Impute missing values with population means
allele_freq[is.na(allele_freq)] <- apply(allele_freq, 2,
function(x) mean(x, na.rm = TRUE))[col(allele_freq)[is.na(allele_freq)]]
# Condition on population structure (Q-matrix from sNMF)
# ~ env_vars + Condition(structure): partial RDA controlling for structure
rda_result <- rda(allele_freq ~ temperature + precipitation + altitude +
Condition(as.matrix(q_matrix)), data = env_data)
# Permutation test
anova(rda_result, permutations = 999)
# Identify outlier loci from RDA loadings
# z-score > 3: conservative threshold for candidate adaptive loci
loadings <- scores(rda_result, choices = 1:3, display = 'species')
zscores <- apply(loadings, 2, function(x) (x - mean(x)) / sd(x))
rda_candidates <- which(apply(abs(zscores), 1, max) > 3)
cat('RDA candidate loci:', length(rda_candidates), '\n')
```
## gradientForest: Allele Turnover Prediction
**Goal:** Predict how allele frequencies change along environmental gradients and estimate genetic offset under climate change scenarios.
**Approach:** Fit gradient forests on candidate loci, rank environmental predictors by importance, and compute Euclidean distance in transformed space between current and future climate projections.
Predicts how allele frequencies shift along environmental gradients using random forests:
```r
library(gradientForest)
library(terra)
# allele_data: allele frequencies at candidate loci
# env_predictors: environmental variables at sampling sites
gf <- gradientForest(cbind(env_predictors, allele_data),
predictor.vars = colnames(env_predictors),
response.vars = colnames(allele_data),
ntree = 500, trace = FALSE)
# Variable importance: which environment drives most allele turnover
plot(gf, plot.type = 'Overall.Importance')
# Cumulative importance curves for each predictor
plot(gf, plot.type = 'Cumulative.Importance')
# Predict genetic offset under climate change
# current_rasters: current bioclimatic layers
# future_rasters: projected climate layers
current <- rast('bioclim_current.tif')
future <- rast('bioclim_2070_ssp585.tif')
current_vals <- extract(current, sampling_coords)
future_vals <- extract(future, sampling_coords)
# Genetic offset = Euclidean distance in transformed environmental space
current_transformed <- predict(gf, current_vals)
future_transformed <- predict(gf, future_vals)
genetic_offset <- sqrt(rowSums((current_transformed - future_transformed)^2))
```
## Environmental Data Extraction with terra
**Goal:** Extract bioclimatic variable values at sampling coordinates for genotype-environment association analysis.
**Approach:** Load WorldClim raster layers with terra and extract values at georeferenced sampling points.
```r
library(terra)
# WorldClim bioclimatic variables (1 km resolution)
bioclim <- rast('wc2.1_30s_bio.tif')
# Extract values at sampling coordinates
# coords: data.frame with longitude and latitude columns
coords_vect <- vect(coords, geom = c('longitude', 'latitude'), crs = 'EPSG:4326')
env_values <- extract(bioclim, coords_vect)
# Common bioclimatic variables for GEA:
# BIO1: mean annual temperature
# BIO4: temperature seasonality
# BIO12: annual precipitation
# BIO15: precipitation seasonality
```
## Related Skills
- conservation-genetics - Population genetic health assessment
- community-ecology - Environmental gradient analysis for species
- population-genetics/selection-statistics - Selection scans in human data
- population-genetics/population-structure - Population structure inference
- variant-calling/vcf-basics - VCF input preparation
More from GPTomics/bioSkills
- bio-admet-predictionPredicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
- bio-alignment-amplicon-clippingTrim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
- bio-alignment-filteringFilter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
- bio-alignment-indexingCreate and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
- bio-alignment-ioRead, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
- bio-alignment-msa-parsingParse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
- bio-alignment-msa-statisticsCalculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
- bio-alignment-multiplePerform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
- bio-alignment-pairwisePerform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
- bio-alignment-sortingSort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.