Survey Sampling ================ Part of :doc:`index` — MORIE's statistical-methods reference. MORIE provides a complete probabilistic sampling toolkit for epidemiological surveys. All methods are implemented in :mod:`morie.sampling`. Simple Random Sampling ----------------------- Without replacement (SRS WOR), every unit has the same inclusion probability :math:`\pi_i = n / N`. The Horvitz-Thompson estimator of the mean is the unweighted sample mean :math:`\bar{y}`. With replacement (SRS WR), finite-population correction :math:`\text{fpc} = 1 - n/N` applies to the variance estimate. **Python**: :func:`morie.sampling.simple_random_sample` Stratified Random Sampling ---------------------------- Partition the population into :math:`H` strata. Within stratum :math:`h`, draw :math:`n_h` units by SRS: .. math:: \bar{y}_{\text{str}} = \sum_{h=1}^{H} W_h \bar{y}_h, \quad W_h = N_h / N **Proportional allocation**: :math:`n_h \propto N_h` — minimises total variance for equal within-stratum variances. **Optimal (Neyman) allocation**: :math:`n_h \propto N_h S_h` — minimises variance given a fixed :math:`n`, where :math:`S_h` is the stratum standard deviation. **Python**: :func:`morie.sampling.stratified_sample` Cluster Sampling ----------------- When a frame of *individuals* is unavailable, select :math:`m` clusters (e.g. households, classrooms, census tracts) by SRS, then enumerate all or a random sub-sample of elements within selected clusters: .. math:: \hat{\tau}_{\text{cluster}} = \frac{N}{m} \sum_{i \in \text{selected}} y_i Cluster sampling introduces intra-cluster correlation (ICC), which inflates variance relative to SRS. The design effect (DEFF) measures this inflation: .. math:: \text{DEFF} = 1 + (\bar{m} - 1) \cdot \rho_{\text{ICC}} where :math:`\bar{m}` is the mean cluster size and :math:`\rho_{\text{ICC}}` is the intra-class correlation. **Python**: :func:`morie.sampling.cluster_sample` Probability Proportional to Size (PPS) ---------------------------------------- PPS sampling selects units with probability proportional to a size measure :math:`x_i` (e.g. enrolment count): .. math:: \pi_i = n \cdot \frac{x_i}{\sum_j x_j} PPS is more efficient than SRS when the outcome is correlated with size. **Python**: :func:`morie.sampling.pps_sample` Horvitz-Thompson and Hájek Estimators --------------------------------------- For any probability sample with known inclusion probabilities :math:`\pi_i`: **Horvitz-Thompson** (unbiased for population total): .. math:: \hat{\tau}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} **Hájek** (ratio estimator for mean, more stable than HT): .. math:: \bar{y}_H = \frac{\sum_{i \in s} y_i / \pi_i}{\sum_{i \in s} 1 / \pi_i} **Python**: :func:`morie.sampling.horvitz_thompson_total`, :func:`morie.survey.hajek_mean` Bootstrap and Jackknife Variance Estimation --------------------------------------------- For complex statistics (medians, quantiles, non-linear estimators) where analytic variances are unavailable: **Bootstrap** (Efron 1979): .. math:: \widehat{\text{Var}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \bigl(\hat{\theta}^{*(b)} - \bar{\theta}^*\bigr)^2 **Delete-1 Jackknife**: .. math:: \widehat{\text{Var}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^{n} \bigl(\hat{\theta}_{(-i)} - \bar{\theta}_{(.)} \bigr)^2 **Python**: :func:`morie.sampling.bootstrap_sample`, :func:`morie.sampling.jackknife_estimate` Effective Sample Size ---------------------- Survey weights create unequal effective sample contributions. The Kish effective sample size (ESS) quantifies the equivalent SRS size: .. math:: \text{ESS} = \frac{\bigl(\sum_i w_i\bigr)^2}{\sum_i w_i^2} The design effect :math:`\text{DEFF} = n / \text{ESS}` measures variance inflation relative to a simple random sample of the same size. **Python**: :func:`morie.sampling.effective_sample_size`, :func:`morie.sampling.design_effect` Calibration / Raking --------------------- Post-stratification and raking calibrate sample weights so that **weighted marginal distributions match known population totals**: .. math:: \min_w \sum_i d\bigl(w_i, w_i^{(0)}\bigr) \quad \text{subject to} \quad \sum_i w_i x_{ij} = T_j \; \forall j where :math:`d(\cdot)` is a distance function (chi-squared → linear calibration; multiplicative → raking). MORIE uses iterative proportional fitting (IPF). **Python**: :func:`morie.survey.calibration_weights` References ---------- - Kish L (1965). *Survey Sampling*. Wiley. - Cochran WG (1977). *Sampling Techniques* (3rd ed.). Wiley. - Lumley T (2010). *Complex Surveys: A Guide to Analysis Using R*. Wiley. - Valliant R, Dever JA, Kreuter F (2013). *Practical Tools for Designing and Weighting Survey Samples*. Springer.