Survey-Weighted Statistics
===========================

Part of :doc:`index` — MORIE's statistical-methods reference.

CPADS is a complex survey with design weights (``wtpumf``). All prevalence,
mean, and proportion estimates must account for the survey design to produce
nationally representative results.

Design weights
--------------

Survey weights :math:`w_i` correct for unequal probability of selection.
The Horvitz–Thompson estimator for a population total is

.. math::

   \hat{T}_y = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in s} w_i y_i

where :math:`\pi_i` is the inclusion probability for unit :math:`i` in the
sample :math:`s`, and :math:`w_i = 1/\pi_i` is the corresponding design weight.
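
As a minimal sketch of this estimator, assuming the weights are exactly the
inverse inclusion probabilities, the total reduces to a single weighted sum.
The function name ``ht_total`` and the toy values are illustrative only and
are not part of MORIE:

.. code-block:: python

   import numpy as np

   def ht_total(y, w):
       """Horvitz-Thompson total: the sum of w_i * y_i over the sample."""
       y = np.asarray(y, dtype=float)
       w = np.asarray(w, dtype=float)
       return float(np.sum(w * y))

   # Toy sample: three units with made-up inclusion probabilities
   pi = np.array([0.1, 0.05, 0.2])
   y = np.array([1.0, 0.0, 1.0])
   total = ht_total(y, 1.0 / pi)  # each unit counts for 1 / pi_i population units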

Weighted mean and proportion
-----------------------------

Dividing the weighted total by the sum of the weights gives the weighted mean:

.. math::

   \bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i}

For a binary outcome (prevalence):

.. math::

   \hat{p}_w = \frac{\sum_i w_i \cdot \mathbb{1}[y_i = 1]}{\sum_i w_i}
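
As a small worked illustration with made-up numbers, take four respondents
with weights 10, 20, 30, 40 and outcomes 1, 0, 1, 0:

.. math::

   \hat{p}_w = \frac{10 \cdot 1 + 20 \cdot 0 + 30 \cdot 1 + 40 \cdot 0}{10 + 20 + 30 + 40}
             = \frac{40}{100} = 0.40

The unweighted proportion for the same four respondents is 2/4 = 0.50, so
ignoring the weights would overstate prevalence in this toy case.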

Linearization variance
----------------------

MORIE estimates variances by Taylor linearization (the delta method), as
implemented in the R ``survey`` package.
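
For the weighted mean, linearization replaces the ratio by per-unit
contributions :math:`z_i` whose sum approximates the estimation error:

.. math::

   z_i = \frac{w_i (y_i - \bar{y}_w)}{\sum_j w_j}, \qquad
   \widehat{\operatorname{Var}}(\bar{y}_w) \approx \frac{n}{n-1} \sum_i z_i^2

This form assumes a single stratum, no clustering, with-replacement sampling,
and complete data. The Python sketch below illustrates the formula under
those assumptions. It is not MORIE's implementation, which delegates to the R
``survey`` package, and ``linearized_se_mean`` is a hypothetical helper name:

.. code-block:: python

   import numpy as np

   def linearized_se_mean(y, w):
       """Weighted mean and its Taylor-linearized standard error.

       Assumes one stratum, no clusters, no finite population correction,
       and no missing values in y or w.
       """
       y = np.asarray(y, dtype=float)
       w = np.asarray(w, dtype=float)
       n = y.size
       mean_w = np.sum(w * y) / np.sum(w)
       # Linearized contributions; they sum to zero by construction
       z = w * (y - mean_w) / np.sum(w)
       var = n / (n - 1) * np.sum(z ** 2)
       return float(mean_w), float(np.sqrt(var))

With equal weights the result reduces to the familiar :math:`s / \sqrt{n}`,
which gives a quick sanity check on the sketch.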

R usage
-------

.. code-block:: r

   library(survey)

   # df is the CPADS PUMF data frame; ids = ~1 declares no cluster variable
   svy <- svydesign(ids = ~1, weights = ~wtpumf, data = df)

   # Weighted prevalence
   svymean(~heavy_drinking_30d, svy, na.rm = TRUE)

   # Weighted logistic regression
   svyglm(heavy_drinking_30d ~ cannabis_any_use + age_group + gender,
          design = svy, family = quasibinomial())

Python usage
------------

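Survey-weighted point estimates are computed directly from the ``weight``
column. Rows with a missing outcome are dropped first so that their weights
do not inflate the denominator:

.. code-block:: python

   d = df.dropna(subset=["heavy_drinking_30d"])
   weighted_prev = (d["heavy_drinking_30d"] * d["weight"]).sum() / d["weight"].sum()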
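
The same ratio can be taken within subgroups. The sketch below is one
possible way to do this with pandas. The helper ``weighted_prevalence_by``
and the toy data frame (including its group labels) are hypothetical, and the
result is a point estimate only, with no variance:

.. code-block:: python

   import pandas as pd

   def weighted_prevalence_by(df, outcome, group, weight="weight"):
       """Weighted prevalence of a 0/1 outcome within each level of `group`."""
       # Drop rows missing any input so they affect neither sum
       d = df.dropna(subset=[outcome, group, weight])
       num = (d[outcome] * d[weight]).groupby(d[group]).sum()
       den = d[weight].groupby(d[group]).sum()
       return num / den

   # Toy data with one missing outcome
   toy = pd.DataFrame({
       "heavy_drinking_30d": [1, 0, 1, 0, None],
       "age_group": ["younger", "younger", "older", "older", "older"],
       "weight": [10.0, 30.0, 20.0, 20.0, 50.0],
   })
   by_age = weighted_prevalence_by(toy, "heavy_drinking_30d", "age_group")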

References
----------

- Lumley T (2004). Analysis of complex survey samples.
  *Journal of Statistical Software*, 9(8):1–19.
  https://doi.org/10.18637/jss.v009.i08
- Statistics Canada (2023). *CPADS 2021-2022 PUMF User Guide*.