Survey-Weighted Statistics

Part of Statistical Methods — MORIE’s statistical-methods reference.

CPADS is a complex survey with design weights (wtpumf). All prevalence, mean, and proportion estimates must account for the survey design to produce nationally representative results.

Design weights

Survey weights \(w_i\) correct for unequal probability of selection. The Horvitz–Thompson estimator for a population total is

\[\hat{T}_y = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in s} w_i y_i\]

where \(s\) is the set of sampled units, \(\pi_i\) is the inclusion probability for unit \(i\), and \(w_i = 1/\pi_i\) is its design weight.
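
As a minimal sketch of the estimator above, the following Python snippet computes a Horvitz–Thompson total from sampled values and their inclusion probabilities. The function name `horvitz_thompson_total` and the toy numbers are hypothetical and are not part of MORIE or the CPADS file; in practice the PUMF supplies the weight directly rather than \(\pi_i\).

```python
import numpy as np

def horvitz_thompson_total(y, inclusion_prob):
    """Estimate a population total as sum(y_i / pi_i) over the sample."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(inclusion_prob, dtype=float)
    if np.any((pi <= 0) | (pi > 1)):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weights = 1.0 / pi  # design weight w_i = 1 / pi_i
    return float(np.sum(weights * y))

# Toy sample of three units (made-up values, for illustration only)
y = [10.0, 20.0, 30.0]
pi = [0.5, 0.25, 0.1]
total = horvitz_thompson_total(y, pi)  # 10/0.5 + 20/0.25 + 30/0.1
```

Each sampled unit stands in for \(1/\pi_i\) population units, so rarely sampled units contribute more to the total.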

Weighted mean and proportion

\[\bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}\]

For a binary outcome (prevalence):

\[\hat{p}_w = \frac{\sum_i w_i \cdot \mathbb{1}[y_i = 1]}{\sum_i w_i}\]
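
The two formulas above share one computation, since a prevalence is the weighted mean of a 0/1 indicator. The sketch below illustrates this with NumPy; the helper name `weighted_mean` and the toy data are assumptions made for illustration. Units with a missing outcome are excluded from both the numerator and the denominator, which is one reasonable convention and not necessarily the only one.

```python
import numpy as np

def weighted_mean(y, w):
    """Weighted mean sum(w*y)/sum(w), ignoring units with missing y or w."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    keep = ~np.isnan(y) & ~np.isnan(w)  # complete cases only
    return float(np.sum(w[keep] * y[keep]) / np.sum(w[keep]))

# Toy data (made up): 0/1 outcome with one missing value
heavy = [1, 0, 1, 0, np.nan]
wts = [100.0, 300.0, 50.0, 50.0, 200.0]
prev = weighted_mean(heavy, wts)  # (100 + 50) / (100 + 300 + 50 + 50)
```

Leaving the missing unit's weight in the denominator would bias the prevalence downward, which is why the mask is applied to both sums.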

Linearization variance

MORIE uses the Taylor linearization (delta method) approach for variance estimation via the R survey package.
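
To make the idea concrete, here is a hedged Python sketch of a linearization standard error for the weighted mean. It assumes the simplest possible design: every respondent is its own sampling unit, with no strata, no clusters, and no finite population correction. Under that assumption the linearized variable is \(z_i = w_i (y_i - \bar{y}_w) / \sum_j w_j\) and the variance estimate is \(\frac{n}{n-1} \sum_i z_i^2\). The function name `linearized_se_mean` is hypothetical, and the sketch is not MORIE's implementation and has not been checked against the survey package's output.

```python
import numpy as np

def linearized_se_mean(y, w):
    """Weighted mean and its Taylor-linearization standard error.

    Assumes a single-stage, with-replacement design with no strata or
    clusters (each observation is its own sampling unit).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = y.size
    if n < 2:
        raise ValueError("need at least two observations")
    total_w = np.sum(w)
    mean = np.sum(w * y) / total_w
    z = w * (y - mean) / total_w            # linearized variable; sums to zero
    variance = n / (n - 1) * np.sum(z ** 2)
    return float(mean), float(np.sqrt(variance))

# Toy data (made up): equal weights reduce to the usual s / sqrt(n)
y = np.array([1.0, 2.0, 3.0, 4.0])
mean_eq, se_eq = linearized_se_mean(y, np.ones(4))
```

With equal weights the result coincides with the textbook standard error of a sample mean, which is a useful sanity check before applying unequal weights.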

R usage

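library(survey)
# ids = ~1 declares no cluster identifiers; wtpumf supplies the design weights
svy <- svydesign(ids = ~1, weights = ~wtpumf, data = df)

# Weighted prevalence
svymean(~heavy_drinking_30d, svy, na.rm = TRUE)

# Weighted logistic regression
# quasibinomial() avoids the non-integer-count warnings that binomial() raises with weights
svyglm(heavy_drinking_30d ~ cannabis_any_use + age_group + gender,
       design = svy, family = quasibinomial())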

Python usage

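Survey-weighted point estimates are computed directly from the weight column. Rows with a missing outcome are dropped from both the numerator and the denominator, mirroring na.rm = TRUE in the R call. This yields the point estimate only; it does not produce a design-based standard error.

valid = df.dropna(subset=["heavy_drinking_30d"])
weighted_prev = (valid["heavy_drinking_30d"] * valid["wtpumf"]).sum() / valid["wtpumf"].sum()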

References

  • Lumley T (2004). Analysis of complex survey samples. Journal of Statistical Software, 9(8):1–19. https://doi.org/10.18637/jss.v009.i08

  • Statistics Canada (2023). CPADS 2021-2022 PUMF User Guide.