Population Genetics
Part of Statistical Methods — MORIE’s statistical-methods reference.
MORIE provides population genetics functions for analyzing genetic variation, population structure, and genotype-phenotype associations. All functions are dataset-agnostic: column names are supplied as keyword parameters.
Sequence-Level Metrics
gc: GC content. Proportion of guanine + cytosine in a DNA sequence.
maf: minor allele frequency. Frequency of the less common allele at a locus.
hw: Hardy-Weinberg equilibrium. Chi-squared test for HWE departure (observed vs expected genotype counts).
from morie.fn import gc, maf, hw
gc_ratio = gc("ATGCGCTATGCGC")
print(f"GC content: {gc_ratio:.3f}") # 0.615
freq = maf(genotypes=[0, 0, 1, 1, 2, 0, 1])
print(f"MAF: {freq:.3f}")
hwe = hw(observed=[45, 40, 15]) # AA, Aa, aa counts
print(f"HWE chi2={hwe.statistic:.2f}, p={hwe.p_value:.4f}")
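To make the three definitions above concrete, here is a minimal standard-library sketch of what each metric computes. It is not MORIE's implementation: the helper names (`gc_content`, `minor_allele_frequency`, `hwe_chi2`) are hypothetical, genotypes are assumed to be coded as 0, 1 or 2 copies of the alternate allele, and the HWE p-value assumes a chi-squared distribution with one degree of freedom.

```python
from math import erfc, sqrt


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return sum(base in "GC" for base in seq) / len(seq)


def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as 0, 1 or 2 alternate-allele copies."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def hwe_chi2(hom_ref, het, hom_alt):
    """Chi-squared test (1 df) of observed genotype counts against HWE."""
    n = hom_ref + het + hom_alt
    p = (2 * hom_ref + het) / (2 * n)  # frequency of the reference allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (hom_ref, het, hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = erfc(sqrt(chi2 / 2.0))  # chi-squared survival function for 1 df
    return chi2, p_value


print(f"{gc_content('ATGCGCTATGCGC'):.3f}")                    # 0.615
print(f"{minor_allele_frequency([0, 0, 1, 1, 2, 0, 1]):.3f}")  # 0.357
chi2, p_value = hwe_chi2(45, 40, 15)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

With the counts 45/40/15 the reference allele frequency is 0.65, so the expected counts are 42.25, 45.5 and 12.25, and the statistic comes out at about 1.46.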
Population Differentiation
fst: fixation index. Weir-Cockerham \(F_{ST}\) estimator for population differentiation.
tajd: Tajima’s D. Neutrality test comparing mean pairwise diversity with the number of segregating sites.
from morie.fn import fst, tajd
# allele_freqs_pop1 / allele_freqs_pop2: placeholders for each population's allele frequencies
result = fst(allele_freqs_pop1, allele_freqs_pop2,
             n_pop1=100, n_pop2=120)
print(f"Fst = {result.statistic:.4f}")
taj = tajd(segregating_sites=42, n_sequences=50,
           pairwise_diffs=18.5)
print(f"Tajima's D = {taj.statistic:.3f}, p = {taj.p_value:.4f}")
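The statistics behind these two calls can be sketched in plain Python. This is an illustration under stated assumptions, not MORIE's code. `tajimas_d` follows the constants defined in Tajima (1989). `fst_heterozygosity` is deliberately a simpler heterozygosity-based \(F_{ST}\), \((H_T - H_S)/H_T\), and not the Weir-Cockerham estimator, which adds sample-size corrections through variance components. Both function names are hypothetical.

```python
from math import sqrt


def tajimas_d(segregating_sites, n_sequences, pairwise_diffs):
    """Tajima's D from S, sample size n and mean pairwise differences."""
    s, n = segregating_sites, n_sequences
    if s < 1 or n < 4:
        raise ValueError("need at least 1 segregating site and 4 sequences")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Numerator: pairwise-diversity estimate minus Watterson-style estimate S/a1.
    return (pairwise_diffs - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def fst_heterozygosity(freqs_pop1, freqs_pop2, n_pop1, n_pop2):
    """Ratio-of-sums (H_T - H_S) / H_T over biallelic loci, weighted by sample size."""
    w1 = n_pop1 / (n_pop1 + n_pop2)
    w2 = 1.0 - w1
    h_total = h_within = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = w1 * p1 + w2 * p2
        h_total += 2.0 * p_bar * (1.0 - p_bar)
        h_within += w1 * 2.0 * p1 * (1.0 - p1) + w2 * 2.0 * p2 * (1.0 - p2)
    if h_total == 0:
        return float("nan")  # no variation at any locus
    return (h_total - h_within) / h_total


print(f"Tajima's D = {tajimas_d(42, 50, 18.5):.3f}")
print(f"Fst = {fst_heterozygosity([0.2, 0.5], [0.4, 0.7], 100, 120):.4f}")
```

A positive D, as in the first call, means pairwise diversity exceeds what the number of segregating sites predicts under neutrality; D is zero when the two estimates agree.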
Linkage Disequilibrium
ld: linkage disequilibrium. D, D’, and r-squared between two loci.
ldmat: LD matrix. Pairwise r-squared matrix for a set of SNPs.
from morie.fn import ld
# genotypes_snp1 / genotypes_snp2: placeholders for the per-individual genotypes at each locus
result = ld(genotypes_snp1, genotypes_snp2)
print(f"D' = {result.d_prime:.3f}, r2 = {result.r_squared:.3f}")
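The three LD measures have short closed forms when phased haplotypes are available. The sketch below assumes exactly that: each locus is a list of 0/1 alleles, one entry per haplotype. This is a simplification, since estimating LD from unphased genotypes first requires haplotype frequency estimation, which is not shown. The function names are hypothetical and this is not MORIE's implementation.

```python
def ld_from_haplotypes(hap_a, hap_b):
    """D, |D'| and r^2 for two biallelic loci from phased 0/1 haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    # The largest |D| attainable given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, d_prime, r_squared


def ld_matrix(haplotypes_by_locus):
    """Pairwise r^2 matrix; haplotypes_by_locus[i] is the 0/1 list for locus i."""
    m = len(haplotypes_by_locus)
    return [
        [ld_from_haplotypes(haplotypes_by_locus[i], haplotypes_by_locus[j])[2]
         for j in range(m)]
        for i in range(m)
    ]


loci = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(ld_from_haplotypes(loci[0], loci[1]))  # (0.25, 1.0, 1.0): perfect LD
print(ld_matrix(loci)[0][2])                 # 0.0: loci 0 and 2 are independent
```

Monomorphic loci have no defined D' or r-squared, so the sketch returns NaN for them rather than raising.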
Genome-Wide Association Studies
gwas: GWAS scan. Per-SNP association test (linear or logistic) with multiple-testing correction.
prs: polygenic risk score. Weighted sum of risk-allele counts, with weights taken from GWAS summary statistics.
from morie.fn import gwas, prs
# genotype_matrix, phenotype, genotypes, effect_sizes, risk_alleles: placeholders for user data
results = gwas(genotype_matrix, phenotype, covariates=None,
               model="linear", correction="bonferroni")
# Keep SNPs whose adjusted p-value is below 0.05
sig = results[results["p_adj"] < 0.05]
print(f"Significant SNPs: {len(sig)}")
scores = prs(genotypes, effect_sizes, risk_alleles)
print(f"Mean PRS: {scores.mean():.3f}")
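As a rough picture of what a linear per-SNP test and a polygenic score involve, the sketch below regresses a quantitative phenotype on one SNP's allele dosages and sums effect-weighted dosages per individual. It rests on several assumptions: no covariates, dosages coded as 0/1/2 copies of the risk allele, and a normal approximation to the t statistic that is only reasonable for large samples. The names `snp_linear_test` and `polygenic_score` are hypothetical; this is not how MORIE's `gwas` or `prs` are necessarily implemented.

```python
from math import sqrt
from statistics import NormalDist, fmean


def snp_linear_test(dosages, phenotype):
    """Simple linear regression of phenotype on one SNP; returns (slope, p_value)."""
    n = len(dosages)
    x_bar, y_bar = fmean(dosages), fmean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    if n < 3 or sxx == 0:
        return float("nan"), float("nan")  # too few samples or monomorphic SNP
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, phenotype))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosages, phenotype))
    se = sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    if se == 0:
        return slope, 0.0  # perfect fit
    z = slope / se
    # Two-sided p-value, normal approximation to the t distribution.
    return slope, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def polygenic_score(dosages_by_person, effect_sizes):
    """Per-individual sum of effect size times risk-allele dosage."""
    return [
        sum(beta * g for beta, g in zip(effect_sizes, person))
        for person in dosages_by_person
    ]


slope, p = snp_linear_test(
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 1],
    [0.1, 2.0, 3.9, -0.1, 2.1, 4.1, 0.0, 1.9, 4.0, 2.0],
)
print(f"slope={slope:.2f}, p={p:.3g}")
print(polygenic_score([[0, 1, 2], [2, 0, 1]], [0.5, -0.2, 0.1]))
```

A real scan repeats the per-SNP test over every column of the genotype matrix and then adjusts the resulting p-values for multiple testing.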
GWAS pipeline:
Quality control: MAF filter, HWE filter, missingness filter
Association testing: per-SNP regression (gwas)
Multiple testing correction: Bonferroni or Benjamini-Hochberg
Visualization: Manhattan plot, QQ plot (via holo_* functions)
Risk prediction: polygenic risk scores (prs)
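Two of the pipeline steps listed above, quality control and multiple-testing correction, are easy to sketch in plain Python. The code below is illustrative only. The function names are hypothetical, the QC thresholds (`min_maf=0.01`, `max_missing=0.05`) are placeholder values rather than MORIE defaults, missing calls are assumed to be encoded as `None`, and the HWE filter is omitted for brevity.

```python
def passes_qc(genotypes, min_maf=0.01, max_missing=0.05):
    """Missingness and MAF filter for one SNP (0/1/2 dosages, None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = 1.0 - len(called) / len(genotypes)
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return missing_rate <= max_missing and maf >= min_maf


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """BH adjusted p-values (step-up), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more SNPs at the same threshold.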
Epidemiological Applications
Population genetics functions integrate with MORIE’s epidemiological toolkit for public health genomics:
Pharmacogenomics: MAF and HWE checks for drug-metabolizing enzyme variants across population subgroups
Disease surveillance: Fst for tracking pathogen population structure across geographic regions
Health equity: PRS calibration across ancestry groups to avoid differential prediction accuracy
All functions return standardized result objects (GenomicsResult dataclass from _containers.py) with statistic, p_value, and method-specific fields.
References
Weir, B.S. & Cockerham, C.C. (1984). Estimating F-Statistics for the Analysis of Population Structure. Evolution, 38(6), 1358-1370.
Tajima, F. (1989). Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism. Genetics, 123(3), 585-595.
Purcell, S. et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American Journal of Human Genetics, 81(3), 559-575.