Survey Sampling

Part of Statistical Methods — MORIE’s statistical-methods reference.

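MORIE provides a probabilistic sampling toolkit for epidemiological surveys. Sample selection, design-based estimation, and variance estimation are implemented in morie.sampling; the Hájek estimator and weight calibration are in morie.survey.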

Simple Random Sampling

Without replacement (SRS WOR), every unit has the same inclusion probability \(\pi_i = n / N\). The Horvitz-Thompson estimator of the mean is the unweighted sample mean \(\bar{y}\).

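Under SRS WOR the finite-population correction \(\text{fpc} = 1 - n/N\) applies to the variance estimate: \(\widehat{\text{Var}}(\bar{y}) = (1 - n/N)\, s^2 / n\), where \(s^2\) is the sample variance. With replacement (SRS WR), draws are independent and no correction applies.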

Python: morie.sampling.simple_random_sample()
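
As a minimal sketch of the SRS WOR estimator and its fpc-adjusted variance, the following uses only the Python standard library. The function name srs_mean_estimate and the toy population are illustrative assumptions, not part of the morie API.

```python
import random
import statistics


def srs_mean_estimate(population, n, seed=0):
    """Draw an SRS without replacement; return (mean estimate, variance estimate)."""
    rng = random.Random(seed)
    N = len(population)
    sample = rng.sample(population, n)   # sampling without replacement
    y_bar = statistics.fmean(sample)     # unweighted sample mean
    s2 = statistics.variance(sample)     # sample variance, divisor n - 1
    fpc = 1 - n / N                      # finite-population correction
    return y_bar, fpc * s2 / n


# Hypothetical toy population of N = 100 values.
population = [float(v) for v in range(1, 101)]
y_bar, var_y_bar = srs_mean_estimate(population, n=20)
print(y_bar, var_y_bar)
```

When n equals N the sample is a census, the fpc is zero, and the estimated variance vanishes, which is a quick sanity check on any implementation.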

Stratified Random Sampling

Partition the population into \(H\) strata. Within stratum \(h\), draw \(n_h\) units by SRS:

\[\bar{y}_{\text{str}} = \sum_{h=1}^{H} W_h \bar{y}_h, \quad W_h = N_h / N\]

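Proportional allocation: \(n_h \propto N_h\). This coincides with the optimal allocation when all strata have the same within-stratum variance.

Optimal (Neyman) allocation: \(n_h \propto N_h S_h\), where \(S_h\) is the stratum standard deviation. This minimises the variance of \(\bar{y}_{\text{str}}\) for a fixed total sample size \(n\), assuming equal per-unit sampling costs across strata.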

Python: morie.sampling.stratified_sample()
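
The two allocation rules and the stratified mean can be sketched in a few lines. The function names and the stratum figures below are hypothetical and chosen only for illustration; they do not describe how morie.sampling.stratified_sample() is implemented.

```python
def proportional_allocation(N_h, n):
    """n_h proportional to the stratum size N_h."""
    total = sum(N_h)
    return [n * Nh / total for Nh in N_h]


def neyman_allocation(N_h, S_h, n):
    """n_h proportional to N_h * S_h (stratum size times stratum SD)."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]


def stratified_mean(N_h, ybar_h):
    """Weighted sum of stratum means with W_h = N_h / N."""
    N = sum(N_h)
    return sum(Nh / N * yb for Nh, yb in zip(N_h, ybar_h))


# Hypothetical strata: sizes, standard deviations, and sample means.
N_h = [500, 300, 200]
S_h = [2.0, 4.0, 10.0]
ybar_h = [10.0, 20.0, 30.0]

print(proportional_allocation(N_h, 100))
print(neyman_allocation(N_h, S_h, 100))
print(stratified_mean(N_h, ybar_h))
```

Both allocations return fractional sizes; in practice they are rounded, usually with a minimum of two units per stratum so that a within-stratum variance can be estimated.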

Cluster Sampling

When a frame of individuals is unavailable, select \(m\) clusters (e.g. households, classrooms, census tracts) by SRS, then enumerate all or a random sub-sample of elements within selected clusters:

\[\hat{\tau}_{\text{cluster}} = \frac{N}{m} \sum_{i \in \text{selected}} y_i\]

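where \(N\) here denotes the number of clusters in the population and \(y_i\) is the total for cluster \(i\) (or its estimate when only a sub-sample of the cluster is enumerated).

Elements within the same cluster tend to resemble one another. This intra-cluster correlation (ICC) inflates the variance relative to an SRS of the same number of elements. The design effect (DEFF) measures this inflation: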

\[\text{DEFF} = 1 + (\bar{m} - 1) \cdot \rho_{\text{ICC}}\]

where \(\bar{m}\) is the mean cluster size and \(\rho_{\text{ICC}}\) is the intra-cluster correlation. The formula is an approximation that works best when cluster sizes are roughly equal.

Python: morie.sampling.cluster_sample()
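
The cluster total estimator and the DEFF approximation above are simple enough to check by hand. The sketch below uses hypothetical function names and made-up numbers, and assumes one-stage sampling in which every element of a selected cluster is enumerated.

```python
def cluster_total_estimate(cluster_totals, n_clusters_in_population):
    """Expand the sampled cluster totals by N / m."""
    m = len(cluster_totals)
    return n_clusters_in_population / m * sum(cluster_totals)


def cluster_design_effect(mean_cluster_size, icc):
    """DEFF = 1 + (m_bar - 1) * rho."""
    return 1 + (mean_cluster_size - 1) * icc


# Hypothetical survey: 4 of 50 clusters sampled, totals observed per cluster.
tau_hat = cluster_total_estimate([12, 15, 9, 14], 50)

# Hypothetical design: 20 elements per cluster, ICC of 0.05.
deff = cluster_design_effect(20, 0.05)
n_elements = 400
n_effective = n_elements / deff   # equivalent SRS size

print(tau_hat, deff, n_effective)
```

Even a small ICC matters when clusters are large: with 20 elements per cluster and an ICC of 0.05, the 400 sampled elements carry roughly the information of 205 independent ones.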

Probability Proportional to Size (PPS)

PPS sampling selects units with probability proportional to a size measure \(x_i\) (e.g. enrolment count):

\[\pi_i = n \cdot \frac{x_i}{\sum_j x_j}\]

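The formula requires \(n x_i \le \sum_j x_j\) for every unit, so that \(\pi_i \le 1\); larger units are usually taken with certainty and the remainder sampled by PPS. PPS is more efficient than SRS when the outcome is roughly proportional to the size measure, because the ratios \(y_i / \pi_i\) then vary little from unit to unit.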

Python: morie.sampling.pps_sample()
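
One common way to obtain exactly these inclusion probabilities is systematic PPS selection: lay the size measures end to end, pick a random start, and step through at a fixed interval. The sketch below illustrates that scheme with hypothetical function names; it is one of several PPS algorithms and is not claimed to be the one used by morie.sampling.pps_sample().

```python
import random


def pps_inclusion_probabilities(sizes, n):
    """pi_i = n * x_i / sum(x); rejects units that would exceed probability 1."""
    total = sum(sizes)
    pi = [n * x / total for x in sizes]
    if max(pi) > 1:
        raise ValueError("a unit has n * x_i > sum(x); treat it as a certainty unit")
    return pi


def pps_systematic_sample(sizes, n, seed=0):
    """Select n unit indices by systematic PPS with a random start."""
    pps_inclusion_probabilities(sizes, n)      # validates pi_i <= 1
    rng = random.Random(seed)
    step = sum(sizes) / n                      # sampling interval
    start = rng.random() * step                # uniform on [0, step)
    selected, cumulative, idx = [], 0.0, 0
    for k in range(n):
        point = start + k * step
        # Advance to the unit whose cumulative size interval contains the point.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= point:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


# Hypothetical enrolment counts for four schools.
sizes = [10, 20, 30, 40]
print(pps_inclusion_probabilities(sizes, 2))
print(pps_systematic_sample(sizes, 2, seed=1))
```

Because every size measure is no larger than the sampling interval, no unit can be hit twice, so the result is a without-replacement sample.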

Horvitz-Thompson and Hájek Estimators

For any probability sample with known inclusion probabilities \(\pi_i\):

Horvitz-Thompson (unbiased for population total):

\[\hat{\tau}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}\]

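Hájek (ratio estimator of the population mean; slightly biased, but usually more stable than the HT mean \(\hat{\tau}_{HT} / N\) and usable when \(N\) is unknown):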

\[\bar{y}_H = \frac{\sum_{i \in s} y_i / \pi_i}{\sum_{i \in s} 1 / \pi_i}\]

Python: morie.sampling.horvitz_thompson_total(), morie.survey.hajek_mean()
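
Both estimators reduce to a pair of weighted sums. The following sketch uses hypothetical function names and data to make the arithmetic concrete; it does not reproduce the signatures of the morie functions listed above.

```python
def ht_total(y, pi):
    """Horvitz-Thompson total: sum of y_i / pi_i over the sample."""
    return sum(yi / p for yi, p in zip(y, pi))


def hajek_mean_sketch(y, pi):
    """Hajek mean: HT total divided by the HT estimate of the population size."""
    n_hat = sum(1 / p for p in pi)
    return ht_total(y, pi) / n_hat


# Hypothetical sample of three units with unequal inclusion probabilities.
y = [4.0, 6.0, 10.0]
pi = [0.2, 0.5, 0.5]

print(ht_total(y, pi))            # 4/0.2 + 6/0.5 + 10/0.5
print(hajek_mean_sketch(y, pi))   # divided by 1/0.2 + 1/0.5 + 1/0.5
```

With equal inclusion probabilities the weights cancel in the Hájek ratio and the estimator reduces to the ordinary sample mean.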

Bootstrap and Jackknife Variance Estimation

For complex statistics (medians, quantiles, non-linear estimators) where analytic variances are unavailable:

Bootstrap (Efron 1979):

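\[\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \bigl(\hat{\theta}^{*(b)} - \bar{\theta}^*\bigr)^2\]

where \(\hat{\theta}^{*(b)}\) is the estimate from the \(b\)-th resample and \(\bar{\theta}^*\) is the mean of the \(B\) replicate estimates.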

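Delete-1 jackknife (for smooth statistics only; it is inconsistent for medians and other quantiles):

\[\widehat{\text{Var}}(\hat{\theta}) = \frac{n-1}{n} \sum_{i=1}^{n} \bigl(\hat{\theta}_{(-i)} - \bar{\theta}_{(.)} \bigr)^2\]

where \(\hat{\theta}_{(-i)}\) is the estimate computed with unit \(i\) removed and \(\bar{\theta}_{(.)}\) is the mean of the \(n\) leave-one-out estimates.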

Python: morie.sampling.bootstrap_sample(), morie.sampling.jackknife_estimate()
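
The two variance formulas can be sketched as follows for an unweighted i.i.d. sample. The function names and data are hypothetical, and the sketch deliberately ignores strata, clusters, and weights; a design-based version would resample or delete primary sampling units within strata instead.

```python
import random
import statistics


def bootstrap_variance(sample, statistic, B=1000, seed=0):
    """Variance of B bootstrap replicates (divisor B - 1)."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(B)]
    return statistics.variance(replicates)


def jackknife_variance(sample, statistic):
    """Delete-1 jackknife variance: (n - 1) / n times the sum of squared deviations."""
    n = len(sample)
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    centre = statistics.fmean(leave_one_out)
    return (n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out)


# Hypothetical data.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

print(bootstrap_variance(data, statistics.median))   # non-smooth statistic
print(jackknife_variance(data, statistics.fmean))    # smooth statistic
```

For the sample mean the jackknife reproduces the textbook estimate s^2 / n exactly, which makes a convenient check on the implementation.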

Effective Sample Size

Survey weights create unequal effective sample contributions. The Kish effective sample size (ESS) quantifies the equivalent SRS size:

\[\text{ESS} = \frac{\bigl(\sum_i w_i\bigr)^2}{\sum_i w_i^2}\]

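The corresponding design effect due to unequal weighting, \(\text{DEFF} = n / \text{ESS}\), measures variance inflation relative to an equally weighted simple random sample of the same size. It is distinct from the clustering design effect defined under Cluster Sampling.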

Python: morie.sampling.effective_sample_size(), morie.sampling.design_effect()
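
A short numerical sketch of the Kish formula follows. The function names and weights are hypothetical and are not the morie signatures.

```python
def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def weighting_design_effect(weights):
    """Design effect due to unequal weighting: n / ESS."""
    return len(weights) / kish_ess(weights)


equal = [1.0, 1.0, 1.0, 1.0]
unequal = [1.0, 1.0, 2.0, 4.0]

print(kish_ess(equal), weighting_design_effect(equal))
print(kish_ess(unequal), weighting_design_effect(unequal))
```

Equal weights give ESS = n and a design effect of 1. For the unequal weights the sums are 8 and 22, so ESS = 64 / 22 and the design effect is 4 * 22 / 64 = 1.375.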

Calibration / Raking

Post-stratification and raking calibrate sample weights so that weighted marginal distributions match known population totals:

\[\min_w \sum_i d\bigl(w_i, w_i^{(0)}\bigr) \quad \text{subject to} \quad \sum_i w_i x_{ij} = T_j \; \forall j\]

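where \(w_i^{(0)}\) are the initial (design) weights, \(x_{ij}\) is the value of auxiliary variable \(j\) for unit \(i\) (an indicator of category membership in post-stratification and raking), \(T_j\) is the known population total, and \(d(\cdot)\) is a distance function. A chi-squared distance yields linear calibration; a multiplicative distance yields raking. For raking, MORIE uses iterative proportional fitting (IPF).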

Python: morie.survey.calibration_weights()
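
IPF alternates over the margins, rescaling the weights in each category so that the weighted total matches its target. The sketch below rakes on two categorical variables; the function name, arguments, and data are hypothetical and do not reflect the interface of morie.survey.calibration_weights().

```python
def rake_two_margins(weights, var_a, var_b, targets_a, targets_b,
                     max_iter=100, tol=1e-10):
    """Rake weights to two sets of known margin totals by IPF."""
    w = list(weights)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for labels, targets in ((var_a, targets_a), (var_b, targets_b)):
            # Current weighted total in each category of this margin.
            totals = {}
            for wi, label in zip(w, labels):
                totals[label] = totals.get(label, 0.0) + wi
            # Rescale each unit by target / current total for its category.
            for i, label in enumerate(labels):
                factor = targets[label] / totals[label]
                largest_adjustment = max(largest_adjustment, abs(factor - 1))
                w[i] *= factor
        if largest_adjustment < tol:
            break   # both margins already match their targets
    return w


# Hypothetical sample of four respondents with unit starting weights.
sex = ["F", "F", "M", "M"]
age = ["young", "old", "young", "old"]
raked = rake_two_margins([1.0, 1.0, 1.0, 1.0], sex, age,
                         {"F": 60.0, "M": 40.0},
                         {"young": 55.0, "old": 45.0})
print(raked)
```

The sketch assumes that every category in the sample has a target, that every target category appears in the sample, and that the two sets of targets sum to the same population total; otherwise IPF cannot satisfy both margins at once.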

References

  • Kish L (1965). Survey Sampling. Wiley.

  • Cochran WG (1977). Sampling Techniques (3rd ed.). Wiley.

  • Lumley T (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.

  • Valliant R, Dever JA, Kreuter F (2013). Practical Tools for Designing and Weighting Survey Samples. Springer.