Python API

Part of API Reference — MORIE API reference.

Reference for every public morie.* module. Signatures and docstrings come from sphinx.ext.autodoc; see Statistical Methods for the methodology behind each function.

Causal inference

Causal estimators: IPW, AIPW, G-computation, Matching and Double Machine Learning.

morie.causal.compute_propensity_scores(data, treatment, covariates)[source]

Compute propensity scores using logistic regression with standard preprocessing.

Non-numeric (object/categorical) columns are integer-encoded with LabelEncoder and all numeric columns are standardised with StandardScaler before fitting. Standardisation stabilises the logistic-regression solver and makes coefficients comparable across covariates on different scales. Encoding is applied in-place on a copy; the original data frame is not modified.

Parameters:
  • data (pandas.DataFrame) – The input pandas DataFrame containing covariates and treatment.

  • treatment (str) – The name of the column containing the binary treatment variable.

  • covariates (list[str]) – A list of column names for the covariates to adjust for.

Returns:

A pandas Series containing the predicted propensity scores (values in the open interval (0, 1)).

Return type:

pandas.Series

References

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41
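The preprocessing described above (standardise covariates, fit a logistic regression, return fitted probabilities) can be sketched in plain numpy. `fit_propensity` is an illustrative stand-in, not the morie implementation — it fits the logistic model by gradient ascent rather than via scikit-learn, and covers only numeric covariates:

```python
import numpy as np

def fit_propensity(X, t, n_iter=5000, lr=0.1):
    """Logistic-regression propensity scores for binary treatment t."""
    # Standardise covariates, as the docstring above describes.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Xs = np.column_stack([np.ones(len(Xs)), Xs])  # add intercept
    beta = np.zeros(Xs.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xs @ beta))
        beta += lr * Xs.T @ (t - p) / len(t)  # ascend the log-likelihood
    return 1.0 / (1.0 + np.exp(-Xs @ beta))  # values in (0, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
t = (rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)
e = fit_propensity(X, t)
```

Because the model includes an intercept, at the optimum the mean fitted score equals the treatment prevalence — a quick sanity check on convergence.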

morie.causal.calculate_ipw_weights(data, treatment, ps_col, *, stabilized=False, trim_quantiles=None)[source]

Calculate inverse probability of treatment weights (IPTW).

Parameters:
  • data (pandas.DataFrame) – The dataframe containing treatment assignment and propensity scores.

  • treatment (str) – The column name for the actual treatment assignment.

  • ps_col (str) – The column name containing the propensity scores.

  • stabilized (bool)

  • trim_quantiles (tuple[float, float] | None)

Returns:

A pandas Series containing the IPTW for each observation.

Return type:

pandas.Series
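The weighting logic can be sketched as follows; `iptw` is illustrative, and it assumes trimming works by clipping scores at the given quantiles (whether morie clips or drops observations is not stated above):

```python
import numpy as np

def iptw(t, ps, stabilized=False, trim_quantiles=None):
    """Inverse probability of treatment weights from propensity scores ps."""
    if trim_quantiles is not None:
        lo, hi = np.quantile(ps, trim_quantiles)
        ps = np.clip(ps, lo, hi)  # assumed: trim by clipping
    w = np.where(t == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    if stabilized:
        # Multiply by the marginal treatment probability for each arm.
        p_t = t.mean()
        w = w * np.where(t == 1, p_t, 1.0 - p_t)
    return w

t = np.array([1, 1, 0, 0])
ps = np.array([0.8, 0.5, 0.5, 0.2])
w = iptw(t, ps)                      # [1.25, 2.0, 2.0, 1.25]
ws = iptw(t, ps, stabilized=True)    # halved here, since P(T=1) = 0.5
```

Stabilised weights have mean close to 1 in each arm, which reduces variance without changing the estimand.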

morie.causal.effective_sample_size(weights)[source]

Compute effective sample size for analytical weights.

Parameters:

weights (Series | ndarray)

Return type:

float
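The docstring does not spell out the formula; the standard (Kish) definition, which this sketch assumes, is \((\sum_i w_i)^2 / \sum_i w_i^2\):

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

ess_uniform = effective_sample_size(np.ones(100))              # equal weights: ESS = n
ess_skewed = effective_sample_size(np.array([1.0] * 99 + [50.0]))  # one dominant weight
```

A sharply reduced ESS relative to n is a warning sign of extreme weights, which is exactly what the trimming options above are meant to mitigate.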

morie.causal.run_propensity_ipw_analysis(data, *, treatment='cannabis_any_use', outcome='heavy_drinking_30d', covariates=None, survey_weight_col='weight')[source]

Reproduce the core outputs of the old 07_propensity.R workflow.

ATE Estimator

The Average Treatment Effect is estimated via the Hájek (normalised Horvitz-Thompson) IPW estimator:

\[\widehat{\text{ATE}}_{\text{Hájek}} = \frac{\sum_i T_i Y_i / e_i}{\sum_i T_i / e_i} - \frac{\sum_i (1-T_i) Y_i / (1-e_i)}{\sum_i (1-T_i) / (1-e_i)}\]

where \(T_i \in \{0,1\}\) is the treatment indicator, \(Y_i\) is the outcome, and \(e_i = P(T_i=1 \mid X_i)\) is the estimated propensity score.

The Hájek estimator is preferred over the raw Horvitz-Thompson estimator because it is self-normalised (the weights sum to 1 within each arm), which gives better finite-sample performance and is less sensitive to extreme propensity scores.

Trimmed IPW weights (1st–99th percentile) are used to reduce the influence of extreme propensity scores.

References

Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19), 2937–2960. https://doi.org/10.1002/sim.1903

Hájek, J. (1971). Comment on “An essay on the logical foundations of survey sampling” by D. Basu. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of Statistical Inference (p. 236). Holt, Rinehart & Winston.

Parameters:
  • data (DataFrame)

  • treatment (str)

  • outcome (str)

  • covariates (list[str] | None)

  • survey_weight_col (str)

Return type:

dict[str, DataFrame | float]
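The displayed Hájek estimator can be written directly in numpy; `hajek_ate` is an illustrative sketch of the formula, not the full workflow (which also applies survey weights and trimming):

```python
import numpy as np

def hajek_ate(y, t, e):
    """Hájek IPW ATE: weights are normalised to sum to 1 within each arm."""
    w1 = t / e
    w0 = (1 - t) / (1 - e)
    return (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()

# With a constant propensity of 0.5 the estimator reduces to the
# difference in arm means: (3 + 1)/2 - (2 + 0)/2 = 1.
y = np.array([3.0, 1.0, 2.0, 0.0])
t = np.array([1.0, 1.0, 0.0, 0.0])
e = np.full(4, 0.5)
ate = hajek_ate(y, t, e)
```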

morie.causal.estimate_aipw(data, *, treatment='cannabis_any_use', outcome='heavy_drinking_30d', covariates=None, outcome_model='logistic')[source]

Estimate the ATE via the Augmented Inverse Probability Weighting (AIPW) doubly-robust estimator.

The influence-function score for unit i is:

\[\psi_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i \bigl(Y_i - \hat{\mu}_1(X_i)\bigr)}{\hat{e}_i} - \frac{(1-T_i)\bigl(Y_i - \hat{\mu}_0(X_i)\bigr)}{1 - \hat{e}_i}\]

The ATE estimator is \(\widehat{\text{ATE}}_\text{AIPW} = n^{-1} \sum_i \psi_i\) and the standard error is \(\hat{\sigma}_{\psi} / \sqrt{n}\).

Double robustness: the estimator is consistent if either the propensity score model \(\hat{e}(X)\) or the outcome model \(\hat{\mu}(T, X)\) is correctly specified – not necessarily both.

Parameters:
  • data (pandas.DataFrame) – The input DataFrame.

  • treatment (str) – Binary treatment column (0/1).

  • outcome (str) – Outcome column.

  • covariates (list[str] | None) – Covariate column names. Defaults to the standard CPADS confounders.

  • outcome_model (str) – "logistic" for binary outcomes, "linear" for continuous. Defaults to "logistic".

Returns:

Dictionary with keys ate, se, ci_lower, ci_upper, n, method.

Return type:

dict[str, Any]

References

Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866.

Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. JASA, 94(448), 1096–1120.
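A minimal numpy sketch of the influence-function score above (`aipw_ate` is illustrative; morie's version also fits the nuisance models internally):

```python
import numpy as np

def aipw_ate(y, t, e, mu1, mu0):
    """AIPW score, term by term as in the displayed formula."""
    psi = mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))

# When the outcome models are exact, the residual correction terms
# vanish and the estimate is just the mean of mu1 - mu0.
t = np.array([1, 0, 1, 0])
mu1 = np.array([2.0, 2.0, 3.0, 3.0])
mu0 = np.array([1.0, 1.0, 1.0, 1.0])
y = np.where(t == 1, mu1, mu0)
e = np.full(4, 0.5)
ate, se = aipw_ate(y, t, e, mu1, mu0)   # ate = mean([1, 1, 2, 2]) = 1.5
```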

morie.causal.run_ebac_selection_ipw_analysis(data, *, observation_col='ebac_tot', eligible_col='alcohol_past12m', treatment='cannabis_any_use', binary_outcome='ebac_legal', continuous_outcome='ebac_tot', survey_weight_col='weight', covariates=None)[source]

Reproduce the core outputs of the old 07_ebac_ipw.R workflow.

Parameters:
  • data (DataFrame)

  • observation_col (str)

  • eligible_col (str)

  • treatment (str)

  • binary_outcome (str)

  • continuous_outcome (str)

  • survey_weight_col (str)

  • covariates (list[str] | None)

Return type:

dict[str, DataFrame]

morie.causal.estimate_att(data, *, treatment, outcome, covariates, propensity_col=None)[source]

Estimate the Average Treatment Effect on the Treated (ATT) via Hájek-weighted IPW.

The ATT is defined as:

\[\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T=1]\]

Under unconfoundedness, the ATT is identified by the Hájek IPW estimator where treated units receive weight 1 and control units receive weight \(\hat{e}(X) / (1 - \hat{e}(X))\):

\[\widehat{\text{ATT}} = \frac{1}{n_1} \sum_{i: T_i=1} Y_i - \frac{\sum_{i: T_i=0} Y_i \cdot \hat{e}_i / (1 - \hat{e}_i)} {\sum_{i: T_i=0} \hat{e}_i / (1 - \hat{e}_i)}\]
Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Binary treatment column (0/1).

  • outcome (str) – Outcome column.

  • covariates (list[str]) – Covariate column names for propensity model.

  • propensity_col (str or None) – Pre-computed propensity score column. If None, propensity scores are estimated from covariates.

Returns:

Dictionary with att, se, ci_lower, ci_upper, n.

Return type:

dict[str, Any]

References

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and Statistics, 86(1), 4–29. https://doi.org/10.1162/003465304323023651

Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.
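The ATT formula above reduces to a weighted control mean subtracted from the treated mean; `att_hajek` is an illustrative sketch assuming propensity scores are already estimated:

```python
import numpy as np

def att_hajek(y, t, e):
    """Hájek ATT: treated units get weight 1; controls get e/(1 - e)."""
    treated_mean = y[t == 1].mean()
    w = e[t == 0] / (1.0 - e[t == 0])
    control_mean = (w * y[t == 0]).sum() / w.sum()
    return treated_mean - control_mean

# Equal control propensities give equal odds weights, so the control
# term is a plain mean: ATT = mean([4, 2]) - mean([1, 3]) = 1.
y = np.array([4.0, 2.0, 1.0, 3.0])
t = np.array([1, 1, 0, 0])
e = np.array([0.6, 0.6, 0.5, 0.5])
att = att_hajek(y, t, e)
```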

morie.causal.estimate_atc(data, *, treatment, outcome, covariates, propensity_col=None)[source]

Estimate the Average Treatment Effect on the Controls (ATC).

The ATC is defined as:

\[\text{ATC} = \mathbb{E}[Y(1) - Y(0) \mid T=0]\]

Treated units receive weight \((1 - \hat{e}(X)) / \hat{e}(X)\) and control units receive weight 1.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Binary treatment column (0/1).

  • outcome (str) – Outcome column.

  • covariates (list[str]) – Covariate column names for propensity model.

  • propensity_col (str or None) – Pre-computed propensity score column (optional).

Returns:

Dictionary with atc, se, ci_lower, ci_upper, n.

Return type:

dict[str, Any]

References

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and Statistics, 86(1), 4–29.

morie.causal.estimate_gate(data, *, treatment, outcome, covariates, group_col, propensity_col=None)[source]

Estimate Group Average Treatment Effects (GATE) via AIPW within strata.

Partitions the data by group_col and estimates the ATE within each group using the AIPW doubly-robust estimator.

\[\text{GATE}_g = \mathbb{E}[Y(1) - Y(0) \mid G = g]\]
Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Binary treatment column (0/1).

  • outcome (str) – Outcome column.

  • covariates (list[str]) – Covariate column names.

  • group_col (str) – Column defining groups/strata.

  • propensity_col (str or None) – Pre-computed propensity score column (optional).

Returns:

DataFrame with columns: group, ate, se, ci_lower, ci_upper, n.

Return type:

pandas.DataFrame

References

Imai, K., & Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. Annals of Applied Statistics, 7(1), 443–470.

Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2020). Generic machine learning inference on heterogeneous treatment effects in randomized experiments. NBER Working Paper 24678.

morie.causal.estimate_cate(data, *, treatment, outcome, covariates, meta_learner='t_learner')[source]

Estimate per-unit Conditional Average Treatment Effects (CATE).

The CATE for unit i is \(\tau(X_i) = \mathbb{E}[Y(1) - Y(0) \mid X = X_i]\).

T-learner (default): fits separate outcome models on treated and control units, then computes \(\hat{\tau}(X_i) = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\).

S-learner: fits a single outcome model on all units with treatment as a feature, then computes \(\hat{\tau}(X_i) = \hat{\mu}(X_i, T=1) - \hat{\mu}(X_i, T=0)\).

Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Binary treatment column (0/1).

  • outcome (str) – Outcome column.

  • covariates (list[str]) – Covariate column names.

  • meta_learner (str) – "t_learner" or "s_learner" (default "t_learner").

Returns:

Series of per-unit CATE estimates, indexed like the input.

Return type:

pandas.Series

References

Kunzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. PNAS, 116(10), 4156–4165. https://doi.org/10.1073/pnas.1804597116
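The T-learner described above can be sketched with linear outcome models fitted by least squares; `t_learner_cate` is illustrative (morie's choice of base learner is not specified here):

```python
import numpy as np

def t_learner_cate(X, t, y):
    """T-learner: separate linear models on treated and control units."""
    def fit_predict(Xa, ya, Xall):
        A = np.column_stack([np.ones(len(Xa)), Xa])
        beta, *_ = np.linalg.lstsq(A, ya, rcond=None)
        return np.column_stack([np.ones(len(Xall)), Xall]) @ beta
    mu1 = fit_predict(X[t == 1], y[t == 1], X)  # outcome model on treated
    mu0 = fit_predict(X[t == 0], y[t == 0], X)  # outcome model on controls
    return mu1 - mu0

# A noiseless example with constant effect 2: y = x + 2t.
x = np.arange(8.0)
X = x.reshape(-1, 1)
t = np.tile([0, 1], 4)
y = x + 2 * t
cate = t_learner_cate(X, t, y)   # 2.0 for every unit
```

With a constant true effect the T-learner recovers it exactly; heterogeneous effects would show up as variation in the returned per-unit estimates.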

morie.causal.estimate_late(data, *, treatment, outcome, instrument, covariates=None)[source]

Estimate the Local Average Treatment Effect (LATE) via instrumental variables.

For a binary instrument \(Z\), the Wald estimator (simple IV) is:

\[\widehat{\text{LATE}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(T, Z)} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{T}_{Z=1} - \bar{T}_{Z=0}}\]

With covariates, the function attempts 2SLS via linearmodels.iv.IV2SLS or statsmodels IV regression, falling back to the Wald estimator if neither IV library is installed.

Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Endogenous treatment column.

  • outcome (str) – Outcome column.

  • instrument (str) – Instrument column.

  • covariates (list[str] or None) – Exogenous covariate column names (optional).

Returns:

Dictionary with late, se, ci (tuple), f_stat, method.

Return type:

dict[str, Any]

References

Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467–475. https://doi.org/10.2307/2951620

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. JASA, 91(434), 444–455.
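The Wald form above is the ratio of the reduced-form contrast to the first-stage contrast; a minimal sketch (`wald_late` is illustrative, without the 2SLS covariate path):

```python
import numpy as np

def wald_late(y, t, z):
    """Wald estimator for a binary instrument z."""
    itt_y = y[z == 1].mean() - y[z == 0].mean()  # reduced form
    itt_t = t[z == 1].mean() - t[z == 0].mean()  # first stage
    return itt_y / itt_t

z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
t = np.array([1, 1, 1, 0, 1, 0, 0, 0])          # imperfect compliance
y = np.array([5.0, 4.0, 6.0, 1.0, 3.0, 1.0, 2.0, 0.0])
late = wald_late(y, t, z)   # (4 - 1.5) / (0.75 - 0.25) = 5
```

A weak first stage (denominator near zero) makes the ratio unstable, which is why the function also reports an F-statistic.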

morie.causal.estimate_irm(data, *, treatment, outcome, covariates, n_folds=5, random_state=42)[source]

Estimate the ATE via the Interactive Regression Model (IRM) using DoubleML.

The IRM extends the partially linear model by allowing treatment effect heterogeneity. It models:

\[ \begin{align}\begin{aligned}Y = g_0(T, X) + U, \quad \mathbb{E}[U \mid X, T] = 0\\T = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0\end{aligned}\end{align} \]

where \(g_0\) is the outcome regression and \(m_0\) is the propensity score. The Neyman-orthogonal score for the ATE is:

\[\psi = g_0(1, X) - g_0(0, X) + \frac{T(Y - g_0(1,X))}{m_0(X)} - \frac{(1-T)(Y - g_0(0,X))}{1 - m_0(X)} - \theta\]
Parameters:
  • data (pandas.DataFrame) – Input DataFrame.

  • treatment (str) – Binary treatment column.

  • outcome (str) – Outcome column.

  • covariates (list[str]) – Covariate column names.

  • n_folds (int) – Number of cross-fitting folds (default 5).

  • random_state (int) – Random seed (default 42).

Returns:

Dictionary with ate, se, ci_lower, ci_upper, n, method.

Return type:

dict[str, Any]

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

morie.causal.estimate_double_ml(data, outcome, treatment, covariates, *, random_state=42, n_folds=5, n_rep=1)[source]

Estimate the Average Treatment Effect using Double Machine Learning (DML).

Uses doubleml.DoubleMLPLR (Partially Linear Regression Model) with Random Forest nuisance estimators for both the outcome regression (ml_l) and the treatment model (ml_m).

Reproducibility

All stochastic operations are seeded deterministically:

  1. numpy global seed is set to random_state immediately before constructing the learners.

  2. Both RandomForestRegressor instances receive random_state.

  3. The DoubleMLPLR object is constructed with n_folds and n_rep passed explicitly so that the cross-fitting schedule is fixed for a given seed.

To change the seed for a sensitivity run, pass random_state=<int>.

Parameters:
  • data (pandas.DataFrame) – The dataset containing outcome, treatment, and covariates.

  • outcome (str) – Name of the continuous outcome variable.

  • treatment (str) – Name of the treatment variable.

  • covariates (list[str]) – List of covariate column names to control for.

  • random_state (int, optional) – Integer seed for all RNGs. Default 42.

  • n_folds (int, optional) – Number of cross-fitting folds. Default 5.

  • n_rep (int, optional) – Number of repeated cross-fitting repetitions. Default 1.

Returns:

A fitted doubleml.DoubleMLPLR object containing the causal estimate and inference results.

Return type:

doubleml.DoubleMLPLR

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

Treatment effect estimations (ATE, ATT, ATU, LATE, G-computation).

This module provides:

  1. estimate_ate() – IPW-weighted OLS ATE (existing, preserved).

  2. estimate_plr() – Partially Linear Regression via DoubleML.

  3. estimate_pliv() – Partially Linear IV (LATE) via DoubleML or 2SLS.

  4. estimate_ate_gcomputation() – G-computation (outcome regression) ATE.

  5. sensitivity_rosenbaum() – Rosenbaum bounds for hidden confounding.

  6. e_value() – E-value for unmeasured confounding (VanderWeele & Ding, 2017).

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Robins, J. M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period. Mathematical Modelling, 7, 1393–1512.

VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274.

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.

morie.effects.estimate_ate(data, outcome, treatment, weights_col)[source]

Estimate Average Treatment Effect (ATE) using a weighted linear model.

Parameters:
  • data (pandas.DataFrame) – The pandas DataFrame containing the analytical sample.

  • outcome (str) – The name of the outcome variable column.

  • treatment (str) – The name of the binary treatment indicator column.

  • weights_col (str) – The name of the column containing the analytical weights (e.g. IPTW).

Returns:

A tuple containing the estimated ATE coefficient and its standard error.

Return type:

tuple[float, float]

morie.effects.estimate_plr(data, *, treatment, outcome, covariates, n_folds=5, random_state=42)[source]

Partially Linear Regression (PLR) ATE via DoubleML.

The PLR model is:

\[Y = \theta_0 D + g_0(X) + \varepsilon, \quad D = m_0(X) + v\]

where \(\theta_0\) is the ATE, \(g_0\) and \(m_0\) are nuisance functions estimated via cross-fitting, and the final \(\theta_0\) estimate is debiased.

Nuisance learners default to ridge-regularised linear regression for both \(g_0\) (outcome) and \(m_0\) (treatment propensity / residual).

Parameters:
  • data (DataFrame) – DataFrame containing all required columns.

  • treatment (str) – Column name of the binary or continuous treatment variable.

  • outcome (str) – Column name of the outcome variable.

  • covariates (list[str]) – List of covariate column names.

  • n_folds (int) – Number of cross-fitting folds. Default 5.

  • random_state (int) – Random seed for reproducibility. Default 42.

Returns:

dict with keys ate, se, ci_lower, ci_upper, pval, n_obs.

Return type:

dict

References

Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Bach, P., Chernozhukov, V., Kurz, M. S., & Spindler, M. (2022). DoubleML: An object-oriented implementation of double machine learning in Python. Journal of Machine Learning Research, 23(53), 1–6.

morie.effects.estimate_pliv(data, *, treatment, outcome, instrument, covariates, n_folds=5, random_state=42)[source]

Partially Linear IV (PLIV) for the Local Average Treatment Effect (LATE) using an instrumental variable via DoubleML.

The PLIV model is:

\[Y = \theta_0 D + g_0(X) + \varepsilon, \quad D = m_0(X) + f_0(Z, X) + v\]

where Z is the instrument, satisfying relevance and exclusion restriction.

Falls back to two-stage least squares (2SLS) via statsmodels if DoubleML is unavailable.

Parameters:
  • data (DataFrame) – DataFrame containing all required columns.

  • treatment (str) – Column name of the endogenous treatment variable.

  • outcome (str) – Column name of the outcome variable.

  • instrument (str) – Column name of the instrument.

  • covariates (list[str]) – List of exogenous covariate column names.

  • n_folds (int) – Cross-fitting folds (DoubleML path). Default 5.

  • random_state (int) – Random seed. Default 42.

Returns:

dict with keys late, se, ci_lower, ci_upper, pval, n_obs, method.

Raises:

ValueError – If required columns are missing.

Return type:

dict

References

Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. JASA, 91(434), 444–455.

morie.effects.estimate_ate_gcomputation(data, *, treatment, outcome, covariates, outcome_model='linear')[source]

G-computation (outcome regression / standardisation) ATE estimator.

The G-computation estimator proceeds in three steps:

  1. Fit an outcome model \(E[Y | T, X]\) on the observed data.

  2. Predict potential outcomes \(\hat{Y}(1)\) and \(\hat{Y}(0)\) for every unit by setting T = 1 and T = 0 respectively.

  3. Compute ATE as the average difference:

\[\widehat{\text{ATE}} = \frac{1}{n} \sum_i \left(\hat{Y}_i(1) - \hat{Y}_i(0)\right)\]

Standard error is estimated via non-parametric bootstrap (500 iterations with seed = 42) on the full three-step procedure.

Parameters:
  • data (DataFrame) – DataFrame containing all required columns.

  • treatment (str) – Column name of the binary treatment indicator (0/1).

  • outcome (str) – Column name of the outcome variable.

  • covariates (list[str]) – List of covariate column names.

  • outcome_model (str) – "linear" (OLS) for continuous outcomes or "logistic" for binary outcomes. Default "linear".

Returns:

dict with keys ate, se, ci_lower, ci_upper, n_obs, outcome_model.

Raises:

ValueError – If required columns are missing or outcome_model is invalid.

Return type:

dict

References

Robins, J. M. (1986). A new approach to causal inference in mortality studies. Mathematical Modelling, 7, 1393–1512.

Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC. (Chapter 13.)
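The three steps above can be sketched with an OLS outcome model; `gcomp_ate` is illustrative and omits the bootstrap standard error:

```python
import numpy as np

def gcomp_ate(X, t, y):
    """G-computation ATE with a linear outcome model."""
    # 1. Fit E[Y | T, X] by least squares.
    A = np.column_stack([np.ones(len(y)), t, X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    # 2. Predict potential outcomes with T set to 1 and to 0 for everyone.
    A1 = A.copy(); A1[:, 1] = 1.0
    A0 = A.copy(); A0[:, 1] = 0.0
    # 3. Average the difference.
    return (A1 @ beta - A0 @ beta).mean()

# Noiseless example: y = 3 + 2t + x, so the ATE is the coefficient 2.
x = np.arange(6.0)
X = x.reshape(-1, 1)
t = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = 3.0 + 2.0 * t + x
ate = gcomp_ate(X, t, y)
```

For a linear model without treatment-covariate interactions the averaged difference equals the treatment coefficient; with a logistic model or interactions, the averaging step is what produces the marginal effect.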

morie.effects.sensitivity_rosenbaum(data, *, treatment, outcome, covariates, gamma_range=(1.0, 3.0), n_gamma=20)[source]

Rosenbaum bounds sensitivity analysis for hidden confounding.

Tests whether the sign-rank test conclusion is robust to an unmeasured confounder that could increase the odds of treatment assignment by a factor of Gamma (the sensitivity parameter).

For a range of Gamma values, the method computes:

  • p_lower – p-value under the most favourable assignment (best case)

  • p_upper – p-value under the most adverse assignment (worst case)

The analysis is based on the Wilcoxon signed-rank statistic applied to matched pairs. Here we approximate the matched analysis by using all discordant (T=1, T=0) pairs sorted by outcome.

Parameters:
  • data (DataFrame) – DataFrame containing all required columns.

  • treatment (str) – Column name of the binary treatment indicator (0/1).

  • outcome (str) – Column name of the outcome variable.

  • covariates (list[str]) – Covariate column names (used for matching approximation).

  • gamma_range (tuple[float, float]) – Tuple (min_gamma, max_gamma). Default (1.0, 3.0).

  • n_gamma (int) – Number of Gamma values to evaluate. Default 20.

Returns:

DataFrame with columns Gamma, p_lower, p_upper.

Raises:

ValueError – If required columns are missing or gamma_range is invalid.

Return type:

DataFrame

Notes

This implementation uses the normal approximation to the signed-rank distribution under sensitivity bounds (Rosenbaum, 2002, Chapter 4). For small samples or exact analysis, use the R sensitivitymw or rbounds package.

References

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer. (Chapter 4.)

Rosenbaum, P. R. (2007). Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics, 63(2), 456–464.

morie.effects.e_value(ate, se, *, null=0.0)[source]

E-value for unmeasured confounding (VanderWeele & Ding, 2017).

The E-value is the minimum strength of association (on the risk-ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed effect, conditional on the measured covariates.

For a risk ratio (RR) effect estimate the E-value is:

\[E = \text{RR} + \sqrt{\text{RR} \cdot (\text{RR} - 1)}\]

where RR > 1. For RR < 1, compute E on 1/RR.

Since the ATE here is a difference (not a ratio), we first convert using a delta-method approximation to get a risk-ratio-like effect, using the relationship RR ≈ exp(|ATE - null| / se) (treating the z-score as a log-RR approximation). This is an approximation appropriate for continuous-scale effects reported with a standard error.

Parameters:
  • ate (float) – Point estimate of the treatment effect.

  • se (float) – Standard error of the ATE estimate (must be > 0).

  • null (float) – Null value to test against. Default 0.0.

Returns:

E-value (float >= 1.0). Returns 1.0 if the estimate is at the null.

Raises:

ValueError – If se <= 0.

Return type:

float

Notes

For binary outcomes with a risk ratio or odds ratio estimate, convert the OR to the risk-ratio scale first, then apply the E-value formula directly. This function is designed for the continuous-ATE setting.

References

VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268–274.

VanderWeele, T. J., Mathur, M. B., & Ding, P. (2019). Correcting misinterpretations of the E-value. Annals of Internal Medicine, 170(2), 131–132.
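The risk-ratio formula above is a one-liner; `e_value_from_rr` is an illustrative sketch of that formula alone, without the z-score conversion this function applies first:

```python
import numpy as np

def e_value_from_rr(rr):
    """E-value for a risk ratio (VanderWeele & Ding, 2017)."""
    rr = 1.0 / rr if rr < 1.0 else rr   # protective effects: invert first
    return rr + np.sqrt(rr * (rr - 1.0))

e_two = e_value_from_rr(2.0)    # RR = 2  ->  2 + sqrt(2) ~ 3.41
e_half = e_value_from_rr(0.5)   # RR = 0.5 gives the same E-value as RR = 2
```

An E-value of 3.41 means an unmeasured confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to explain the effect away.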

Matching methods for causal inference in observational studies.

Implements propensity score matching (nearest-neighbour, caliper, with/without replacement), exact matching, coarsened exact matching (CEM), Mahalanobis distance matching, optimal matching, genetic matching, entropy balancing, cardinality matching, subclassification, full matching, balance diagnostics, treatment effect estimation, Abadie-Imbens standard errors, Rosenbaum bounds, doubly-robust estimation, and multi-treatment/longitudinal matching.

References

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21. https://doi.org/10.1214/09-STS313

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24. https://doi.org/10.1093/pan/mpr013

Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. https://doi.org/10.1111/j.1468-0262.2006.00655.x

Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1), 25–46. https://doi.org/10.1093/pan/mpr025

class morie.matching.MatchResult(matched_data, n_treated, n_matched_control, match_pairs, method='nearest_neighbor', details=<factory>)[source]

Bases: object

Container for matching output.

Variables:
  • matched_data (pd.DataFrame) – Matched dataset with match indicators.

  • n_treated (int) – Number of treated units.

  • n_matched_control (int) – Number of matched control units.

  • match_pairs (pd.DataFrame) – Mapping from treated to matched control indices.

  • method (str) – Matching method used.

  • details (dict) – Additional diagnostics (e.g., caliper, replacement).

Parameters:
matched_data: DataFrame
n_treated: int
n_matched_control: int
match_pairs: DataFrame
method: str = 'nearest_neighbor'
details: dict[str, Any]
class morie.matching.BalanceResult(balance_table, overall_balance, max_smd, balanced)[source]

Bases: object

Container for balance diagnostics.

Variables:
  • balance_table (pd.DataFrame) – Table with standardised mean differences, variance ratios, etc.

  • overall_balance (float) – Mean absolute SMD across covariates.

  • max_smd (float) – Maximum absolute SMD.

  • balanced (bool) – Whether all SMDs are below the threshold (default 0.1).

Parameters:
balance_table: DataFrame
overall_balance: float
max_smd: float
balanced: bool
class morie.matching.TreatmentEffectResult(estimand, estimate, std_error, ci_lower, ci_upper, p_value, n_obs, details=<factory>)[source]

Bases: object

Container for treatment effect estimates from matched samples.

Variables:
  • estimand (str) – ATT, ATE, or ATC.

  • estimate (float) – Point estimate.

  • std_error (float) – Standard error.

  • ci_lower (float) – Lower CI bound.

  • ci_upper (float) – Upper CI bound.

  • p_value (float) – Two-sided p-value.

  • n_obs (int) – Effective sample size.

  • details (dict) – Extra information.

Parameters:
estimand: str
estimate: float
std_error: float
ci_lower: float
ci_upper: float
p_value: float
n_obs: int
details: dict[str, Any]
summary()[source]

One-row summary.

Return type:

DataFrame

morie.matching.estimate_propensity_score(data, treatment, covariates, *, model='logistic', max_iter=1000)[source]

Estimate propensity scores via logistic regression.

Parameters:
  • data (pd.DataFrame)

  • treatment (str) – Binary treatment column (0/1).

  • covariates (list of str) – Covariates for the propensity model.

  • model (str) – "logistic" (default) or "gbm" for gradient boosting.

  • max_iter (int) – Maximum iterations for logistic regression.

Returns:

Propensity scores indexed like data.

Return type:

pd.Series

References

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

morie.matching.trim_propensity_scores(ps, lower=0.01, upper=0.99)[source]

Clip propensity scores to [lower, upper].

Parameters:
  • ps (pd.Series) – Raw propensity scores.

  • lower (float)

  • upper (float)

Return type:

pd.Series

morie.matching.common_support(data, treatment, ps_col='propensity_score', *, method='minmax')[source]

Restrict sample to the region of common support.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • ps_col (str) – Propensity score column name.

  • method (str) – "minmax" (overlap of ranges) or "trim" (drop the extreme 5% from each tail).

Returns:

Subset with common support.

Return type:

pd.DataFrame

morie.matching.match_nearest_neighbor(data, treatment, covariates, *, n_neighbors=1, caliper=None, replace=False, ps=None, alpha=0.05)[source]

Nearest-neighbour matching on propensity score.

For each treated unit, finds the n_neighbors closest control units by propensity score distance.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • n_neighbors (int) – Number of matches per treated unit.

  • caliper (float, optional) – Maximum propensity score distance for a valid match (in SD of the logit of the propensity score).

  • replace (bool) – Whether controls can be matched to multiple treated units.

  • ps (pd.Series, optional) – Pre-computed propensity scores. If None, estimated internally.

  • alpha (float) – Significance level for balance checks.

Return type:

MatchResult
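The matching loop can be sketched greedily; `greedy_nn_match` is illustrative — it takes the caliper in raw propensity-score units, whereas the documented caliper parameter is expressed in SDs of the logit of the propensity score:

```python
import numpy as np

def greedy_nn_match(ps, t, caliper=None, replace=False):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    Returns (treated_index, control_index) pairs; treated units with no
    candidate inside the caliper are left unmatched.
    """
    treated = np.flatnonzero(t == 1)
    controls = list(np.flatnonzero(t == 0))
    pairs = []
    for i in treated:
        if not controls:
            break
        dists = np.abs(ps[controls] - ps[i])
        j = int(np.argmin(dists))
        if caliper is not None and dists[j] > caliper:
            continue
        pairs.append((i, controls[j]))
        if not replace:
            controls.pop(j)   # without replacement: consume the control
    return pairs

ps = np.array([0.8, 0.3, 0.75, 0.35, 0.1])
t = np.array([1, 1, 0, 0, 0])
pairs = greedy_nn_match(ps, t)   # treated 0 -> control 2, treated 1 -> control 3
```

With `replace=True`, a single well-placed control can serve several treated units, which improves match quality at the cost of a smaller effective control sample.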

morie.matching.match_exact(data, treatment, exact_vars)[source]

Exact matching on discrete covariates.

Matches treated and control units that share identical values on all exact_vars. Unmatched units are dropped.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • exact_vars (list of str)

Return type:

MatchResult

morie.matching.match_cem(data, treatment, covariates, *, n_bins=5)[source]

Coarsened Exact Matching (Iacus, King & Porro, 2012).

Coarsens continuous covariates into bins and performs exact matching on the coarsened values. Returns the original (uncoarsened) data with CEM strata weights.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • n_bins (int or dict) – Number of bins for each covariate. A single int applies to all.

Returns:

matched_data includes a _cem_weight column.

Return type:

MatchResult

References

Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24.
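The coarsening-then-stratifying step can be sketched as follows; `cem_strata` is illustrative (equal-width bins, no pruning of strata lacking both treatment arms, no CEM weights):

```python
import numpy as np

def cem_strata(X, n_bins=5):
    """Coarsen each covariate into equal-width bins and form exact strata.

    Returns a stratum id per unit; a full CEM implementation would then
    drop strata lacking both a treated and a control unit and compute weights.
    """
    edges = [np.linspace(col.min(), col.max(), n_bins + 1)[1:-1] for col in X.T]
    coarse = np.column_stack([np.digitize(col, e) for col, e in zip(X.T, edges)])
    # Encode each row of bin indices as a single stratum label.
    _, strata = np.unique(coarse, axis=0, return_inverse=True)
    return strata

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
strata = cem_strata(X)   # at most 5 * 5 = 25 distinct strata
```

Coarser bins (smaller `n_bins`) trade finer balance for fewer dropped units — the central tuning decision in CEM.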

morie.matching.match_mahalanobis(data, treatment, covariates, *, n_neighbors=1, caliper=None, replace=False, exact=None)[source]

Mahalanobis distance matching.

Matches on the Mahalanobis distance of covariates rather than propensity score. Optionally combined with exact matching on discrete variables.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str) – Continuous covariates for distance calculation.

  • n_neighbors (int)

  • caliper (float, optional)

  • replace (bool)

  • exact (list of str, optional) – Variables to match exactly before distance matching.

Return type:

MatchResult
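The distance criterion itself is easy to sketch with numpy. A minimal illustration for one treated unit against all controls (the full matching loop, calipers, and exact-matching constraints are omitted):

```python
import numpy as np

def nearest_mahalanobis(x_treated, X_control, cov):
    """Index of the control unit closest to x_treated in Mahalanobis distance."""
    VI = np.linalg.inv(cov)                        # inverse covariance matrix
    diff = X_control - x_treated                   # (n_controls, k)
    d2 = np.einsum("ij,jk,ik->i", diff, VI, diff)  # squared distances per control
    return int(np.argmin(d2))
```

In practice `cov` is the pooled covariate covariance, which makes the distance scale-free across covariates measured in different units.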

morie.matching.match_optimal_pair(data, treatment, covariates, *, distance='propensity', ps=None)[source]

Optimal pair matching minimising total distance.

Uses a greedy approximation to the assignment problem: the exact Hungarian algorithm is O(n^3), so sorted nearest-neighbor matching is used instead for scalability.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • distance (str) – "propensity" or "mahalanobis".

  • ps (pd.Series, optional)

Return type:

MatchResult

morie.matching.match_full(data, treatment, covariates, *, ps=None, n_subclasses=10)[source]

Full matching via propensity score subclassification.

Every unit is placed into a subclass containing at least one treated and one control unit. Approximated here via quantile-based stratification of the propensity score.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • ps (pd.Series, optional)

  • n_subclasses (int) – Number of propensity score strata.

Return type:

MatchResult

References

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618.

morie.matching.subclassify(data, treatment, covariates, *, ps=None, n_strata=5)[source]

Subclassification (stratification) on the propensity score.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • ps (pd.Series, optional)

  • n_strata (int)

Returns:

  • data_with_strata (pd.DataFrame) – Input data with a _stratum column appended.

  • stratum_effects (pd.DataFrame) – Within-stratum treatment effect estimates.

Return type:

tuple[DataFrame, DataFrame]
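The stratify-then-average logic can be sketched as follows. This is an illustrative version assuming quantile strata and stratum-size weights; morie's internal weighting and variance handling may differ:

```python
import numpy as np
import pandas as pd

def stratified_effect(y, t, ps, n_strata=5):
    """Within-stratum mean differences, averaged with stratum-size weights."""
    df = pd.DataFrame({"y": y, "t": t})
    df["s"] = pd.qcut(ps, n_strata, labels=False, duplicates="drop")
    # NaN appears for a stratum lacking one group; real code must handle that.
    per_stratum = df.groupby("s").apply(
        lambda g: g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean()
    )
    sizes = df.groupby("s").size()
    return float(np.average(per_stratum, weights=sizes))
```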

morie.matching.entropy_balance(data, treatment, covariates, *, max_moment=1, max_iter=500, tol=1e-06)[source]

Entropy balancing weights for the control group.

Finds weights \(w_i\) for control units that minimise the Kullback-Leibler divergence from uniform weights subject to moment balance constraints:

\[\sum_{i: D_i=0} w_i\, x_i^k = \bar x^k_{D=1} \quad \text{for } k = 1, \ldots, \text{max\_moment}\]
Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • max_moment (int) – Highest moment to balance: 1 (means), 2 (means and variances), or 3 (up to third moments).

  • max_iter (int)

  • tol (float)

Returns:

Entropy balancing weights indexed like data. Treated units receive weight 1; control units receive the balancing weights.

Return type:

pd.Series

References

Hainmueller, J. (2012). Entropy balancing for causal effects. Political Analysis, 20(1), 25–46.
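For a single covariate and the first moment, the exponential-tilting form of the solution can be sketched directly. This illustrates the weight structure only (assuming scipy is available); morie's solver handles several covariates and higher moments simultaneously:

```python
import numpy as np
from scipy.optimize import brentq

def eb_weights_1d(x_control, target_mean):
    """Weights w_i proportional to exp(lam * x_i) so the weighted control mean
    of x equals target_mean (the treated-group mean)."""
    xc = x_control - x_control.mean()   # centre for numerical stability
    def gap(lam):
        w = np.exp(lam * xc)
        return np.average(x_control, weights=w) - target_mean
    lam = brentq(gap, -50.0, 50.0)      # 1-D root of the balance condition
    w = np.exp(lam * xc)
    return w / w.sum()
```

The exponential form is exactly what minimising KL divergence subject to linear moment constraints produces.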

morie.matching.match_genetic(data, treatment, covariates, *, n_neighbors=1, pop_size=50, n_generations=20, seed=42)[source]

Genetic matching (Diamond & Sekhon, 2013).

Uses a genetic algorithm to find the optimal weight matrix for Mahalanobis distance matching that maximises covariate balance.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • n_neighbors (int)

  • pop_size (int) – Genetic algorithm population size.

  • n_generations (int)

  • seed (int)

Return type:

MatchResult

References

Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects. Review of Economics and Statistics, 95(3), 932–945.

morie.matching.match_variable_ratio(data, treatment, covariates, *, min_ratio=1, max_ratio=5, caliper=0.2, ps=None)[source]

Variable-ratio matching on propensity score.

Each treated unit is matched to between min_ratio and max_ratio controls within the caliper.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • min_ratio (int)

  • max_ratio (int)

  • caliper (float)

  • ps (pd.Series, optional)

Return type:

MatchResult

morie.matching.match_cardinality(data, treatment, covariates, *, balance_threshold=0.1, ps=None)[source]

Cardinality matching: maximise sample size subject to balance.

Finds the largest subset of matched pairs such that the standardised mean difference on each covariate is below balance_threshold.

Uses an iterative caliper-tightening heuristic.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • balance_threshold (float)

  • ps (pd.Series, optional)

Return type:

MatchResult

References

Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500), 1360–1371.

morie.matching.balance_diagnostics(data, treatment, covariates, *, weights=None, threshold=0.1)[source]

Compute balance diagnostics for matched/weighted samples.

Reports standardised mean differences (SMD), variance ratios, and Kolmogorov-Smirnov statistics for each covariate.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • weights (str, optional) – Column of matching/weighting weights.

  • threshold (float) – SMD threshold for balanced flag (default 0.1).

Return type:

BalanceResult
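The SMD compared against threshold follows the usual pooled-variance form; a minimal sketch (the exact pooling inside morie may differ):

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardised mean difference with the average-of-variances denominator."""
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd
```

A covariate is flagged as balanced when the absolute SMD falls below threshold (0.1 by default).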

morie.matching.love_plot_data(unmatched_data, matched_data, treatment, covariates, *, weights_col=None)[source]

Generate data for a Love plot comparing pre- and post-matching balance.

Parameters:
  • unmatched_data (pd.DataFrame)

  • matched_data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • weights_col (str, optional)

Returns:

Columns: covariate, smd_before, smd_after.

Return type:

pd.DataFrame

morie.matching.balance_table(data, treatment, covariates, *, weights=None)[source]

Construct a publication-ready balance table.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • weights (str, optional)

Return type:

pd.DataFrame

morie.matching.estimate_att_matched(data, outcome, treatment, match_pairs, *, weights=None, alpha=0.05)[source]

Estimate ATT from matched data.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • match_pairs (pd.DataFrame) – With columns treated_idx and control_idx.

  • weights (str, optional)

  • alpha (float)

Return type:

TreatmentEffectResult

morie.matching.estimate_ate_matched(data, outcome, treatment, covariates, *, weights=None, alpha=0.05)[source]

Estimate ATE from matched/weighted data via weighted mean difference.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • covariates (list of str)

  • weights (str, optional)

  • alpha (float)

Return type:

TreatmentEffectResult

morie.matching.estimate_atc_matched(data, outcome, treatment, match_pairs, *, alpha=0.05)[source]

Estimate ATC from matched data (matching controls to treated).

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • match_pairs (pd.DataFrame) – With columns control_idx and treated_idx (reverse matching direction).

  • alpha (float)

Return type:

TreatmentEffectResult

morie.matching.abadie_imbens_se(data, outcome, treatment, match_pairs, *, n_matches=1)[source]

Abadie-Imbens (2006) standard error for matching estimators.

Accounts for the fact that matching introduces correlation across matched observations. Uses the conditional variance estimator:

\[\hat V_{AI} = \frac{1}{n^2}\sum_i \bigl[\hat\sigma^2(X_i) + (K_M(i))^2\,\hat\sigma^2(X_i)\bigr]\]

where \(K_M(i)\) is the number of times unit i is used as a match.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • match_pairs (pd.DataFrame)

  • n_matches (int)

Returns:

Abadie-Imbens SE.

Return type:

float

References

Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267.

morie.matching.rosenbaum_bounds(data, outcome, treatment, match_pairs, *, gamma_range=None)[source]

Rosenbaum bounds for sensitivity to hidden bias.

For each value of \(\Gamma\) (the maximum odds ratio of differential treatment assignment due to an unobserved confounder), computes the upper and lower bounds on the p-value for the treatment effect.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • match_pairs (pd.DataFrame)

  • gamma_range (sequence of float, optional) – Values of \(\Gamma\) to evaluate (default 1.0 to 3.0).

Returns:

Columns: gamma, p_lower, p_upper, significant_lower, significant_upper.

Return type:

pd.DataFrame

References

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.

morie.matching.doubly_robust_matching(data, outcome, treatment, covariates, *, ps=None, n_bootstrap=200, seed=42, alpha=0.05)[source]

Doubly-robust ATT estimation combining matching and regression.

Matches on propensity score, then uses bias-corrected regression adjustment within matched pairs. Consistent if either the propensity score model or the outcome model is correct.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • covariates (list of str)

  • ps (pd.Series, optional)

  • n_bootstrap (int)

  • seed (int)

  • alpha (float)

Return type:

TreatmentEffectResult

morie.matching.match_multi_treatment(data, treatment, covariates, *, reference_group=None, method='nearest_neighbor')[source]

Matching with multiple (> 2) treatment groups.

For each non-reference treatment level, matches treated units to the reference group using propensity scores from multinomial logistic regression.

Parameters:
  • data (pd.DataFrame)

  • treatment (str) – Column with multiple treatment levels.

  • covariates (list of str)

  • reference_group (any, optional) – The reference (control) group level. Defaults to the most frequent level.

  • method (str) – "nearest_neighbor" or "mahalanobis".

Returns:

Keys are treatment levels; values are MatchResult.

Return type:

dict

morie.matching.match_longitudinal(data, treatment, covariates, unit, time, treatment_time, *, n_pre_periods=1, method='nearest_neighbor')[source]

Longitudinal matching for panel data.

Matches treated and control units based on pre-treatment covariate values, allowing for time-varying covariates.

Parameters:
  • data (pd.DataFrame) – Panel data.

  • treatment (str)

  • covariates (list of str)

  • unit (str)

  • time (str)

  • treatment_time (str) – Column indicating when treatment begins for each unit.

  • n_pre_periods (int) – Number of pre-treatment periods to use for matching.

  • method (str)

Return type:

MatchResult

morie.matching.matching_quality(unmatched_data, matched_data, treatment, covariates, *, weights=None)[source]

Comprehensive matching quality assessment.

Compares balance before and after matching and computes summary metrics including percent bias reduction, number of balanced covariates, and overlap statistics.

Parameters:
  • unmatched_data (pd.DataFrame)

  • matched_data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • weights (str, optional)

Returns:

Keys: balance_before, balance_after, bias_reduction, pct_balanced_before, pct_balanced_after.

Return type:

dict

morie.matching.overlap_diagnostics(data, treatment, covariates, *, ps=None)[source]

Propensity score overlap diagnostics.

Parameters:
  • data (pd.DataFrame)

  • treatment (str)

  • covariates (list of str)

  • ps (pd.Series, optional)

Returns:

ps_summary (by group), overlap_region, n_off_support, effective_sample_size.

Return type:

dict

Instrumental Variables and Two-Stage Least Squares estimators.

Implements 2SLS, LIML, GMM, CUE-GMM, weak-instrument diagnostics (Cragg-Donald, Kleibergen-Paap, Stock-Yogo), Anderson-Rubin inference, Sargan/Hansen J tests, Hausman tests, JIVE, split-sample IV, IV probit, panel IV, control function, and comprehensive residual diagnostics.

References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Stock, J. H., & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg (pp. 80–108). Cambridge University Press.

Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica, 70(5), 1781–1803. https://doi.org/10.1111/1468-0262.00353

Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica, 26(3), 393–415.

class morie.iv.IVResult(coefficients, std_errors, t_stats, p_values, ci_lower, ci_upper, variable_names, n_obs, method='2sls', details=<factory>)[source]

Bases: object

Container for an IV/2SLS regression result.

Variables:
  • coefficients (np.ndarray) – Estimated coefficient vector.

  • std_errors (np.ndarray) – Standard errors.

  • t_stats (np.ndarray) – t-statistics.

  • p_values (np.ndarray) – Two-sided p-values.

  • ci_lower (np.ndarray) – Lower 95% CI bounds.

  • ci_upper (np.ndarray) – Upper 95% CI bounds.

  • variable_names (list of str) – Names of variables corresponding to each coefficient.

  • n_obs (int) – Number of observations.

  • method (str) – Estimation method name.

  • details (dict) – Additional diagnostics.

Parameters:
coefficients: ndarray
std_errors: ndarray
t_stats: ndarray
p_values: ndarray
ci_lower: ndarray
ci_upper: ndarray
variable_names: list[str]
n_obs: int
method: str = '2sls'
details: dict[str, Any]
summary()[source]

Return a summary table.

Return type:

DataFrame

property endogenous_coef: float

Return the coefficient on the first endogenous variable.

property endogenous_se: float

Return the SE on the first endogenous variable.

class morie.iv.DiagnosticResult(statistic, p_value, test_name, critical_values=None, details=<factory>)[source]

Bases: object

Container for a single diagnostic test.

Parameters:
statistic: float
p_value: float
test_name: str
critical_values: dict[str, float] | None = None
details: dict[str, Any]
morie.iv.tsls(data, outcome, endogenous, instruments, exogenous=None, *, cluster=None, robust=True, alpha=0.05)[source]

Two-Stage Least Squares (2SLS) estimator.

First stage: regress each endogenous variable on the instruments and exogenous variables. Second stage: regress the outcome on the fitted endogenous variables and exogenous variables.

\[\hat\beta_{2SLS} = (X'P_Z X)^{-1} X'P_Z y\]

where \(P_Z = Z(Z'Z)^{-1}Z'\) and \(Z\) includes both excluded instruments and included exogenous variables.

Parameters:
  • data (pd.DataFrame)

  • outcome (str) – Dependent variable.

  • endogenous (list of str) – Endogenous regressors.

  • instruments (list of str) – Excluded instruments.

  • exogenous (list of str, optional) – Included exogenous regressors (added automatically to Z).

  • cluster (str, optional) – Cluster variable for cluster-robust SE.

  • robust (bool) – Use heteroskedasticity-robust SE (default True).

  • alpha (float) – Significance level.

Return type:

IVResult

References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
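The closed form above can be checked on simulated data. A self-contained numpy sketch (synthetic data with a known structural coefficient, not the morie implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                    # excluded instrument
u = rng.normal(size=n)                    # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)      # endogenous regressor
y = 2.0 * x + u + rng.normal(size=n)      # structural equation, true beta = 2

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)         # P_Z X
beta_2sls = np.linalg.solve(PzX.T @ X, PzX.T @ y)   # (X'P_Z X)^{-1} X'P_Z y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)        # biased upward by u
```

`beta_2sls[1]` recovers a value near 2 while `beta_ols[1]` is inflated by the confounder, which is the motivation for instrumenting in the first place.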

morie.iv.liml(data, outcome, endogenous, instruments, exogenous=None, *, robust=True, alpha=0.05)[source]

Limited Information Maximum Likelihood (LIML) estimator.

More robust to weak instruments than 2SLS. The LIML estimator is the k-class estimator with \(\kappa\) equal to the smallest eigenvalue of a certain matrix ratio.

\[\hat\beta_{\text{LIML}} = \bigl(X'(I - \kappa M_Z) X\bigr)^{-1} X'(I - \kappa M_Z) y\]
Return type:

IVResult

References

Anderson, T. W., & Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics, 20(1), 46–63.

morie.iv.gmm_iv(data, outcome, endogenous, instruments, exogenous=None, *, weight_matrix='optimal', robust=True, alpha=0.05)[source]

Generalised Method of Moments (GMM) IV estimator.

Two-step efficient GMM: first estimates with the identity weight matrix, then re-estimates with the optimal weight matrix constructed from first-step residuals.

\[\hat\beta_{\text{GMM}} = \arg\min_\beta \bigl[Z'(y - X\beta)\bigr]' \hat W \bigl[Z'(y - X\beta)\bigr]\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

  • weight_matrix (str) – "optimal" (efficient GMM) or "identity" (one-step).

  • robust (bool)

  • alpha (float)

Return type:

IVResult

morie.iv.cue_gmm(data, outcome, endogenous, instruments, exogenous=None, *, max_iter=100, tol=1e-08, alpha=0.05)[source]

Continuously Updated GMM estimator.

Unlike two-step GMM, CUE simultaneously optimises over \(\beta\) and the weight matrix, yielding an estimator that is invariant to the normalisation of moment conditions.

Return type:

IVResult

morie.iv.wald_estimator(data, outcome, treatment, instrument, *, alpha=0.05)[source]

Wald (simple IV) estimator for a single binary instrument.

\[\hat\beta_{\text{Wald}} = \frac{E[Y | Z=1] - E[Y | Z=0]}{E[D | Z=1] - E[D | Z=0]}\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str) – Endogenous treatment variable.

  • instrument (str) – Binary instrument.

  • alpha (float)

Return type:

IVResult
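The Wald ratio is just two differences in means; a minimal sketch of the estimand above:

```python
import numpy as np

def wald(y, d, z):
    """Reduced-form jump in Y divided by the first-stage jump in D."""
    num = y[z == 1].mean() - y[z == 0].mean()
    den = d[z == 1].mean() - d[z == 0].mean()
    return num / den
```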

morie.iv.first_stage_diagnostics(data, endogenous, instruments, exogenous=None)[source]

First-stage regression diagnostics for each endogenous variable.

Reports the F-statistic, partial R-squared, and Shea’s partial R-squared for each endogenous variable.

Parameters:
  • data (pd.DataFrame)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Returns:

Columns: endogenous_var, f_stat, partial_r2, shea_partial_r2, n_obs.

Return type:

pd.DataFrame

References

Shea, J. (1997). Instrument relevance in multivariate linear models: A simple measure. Review of Economics and Statistics, 79(2), 348–352.

morie.iv.cragg_donald_test(data, endogenous, instruments, exogenous=None)[source]

Cragg-Donald (1993) minimum eigenvalue statistic for weak instruments.

Parameters:
  • data (pd.DataFrame)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Returns:

The statistic is the Cragg-Donald F-statistic (minimum eigenvalue of the concentration parameter matrix).

Return type:

DiagnosticResult

morie.iv.stock_yogo_critical_values(n_endogenous=1, n_instruments=1)[source]

Stock-Yogo (2005) critical values for weak instrument testing.

Returns approximate critical values for the Cragg-Donald statistic at various maximal bias/size thresholds.

Parameters:
  • n_endogenous (int)

  • n_instruments (int)

Returns:

Critical values keyed by bias threshold (e.g., "10%_bias").

Return type:

dict

References

Stock, J. H., & Yogo, M. (2005). Testing for weak instruments in linear IV regression.

morie.iv.kleibergen_paap_test(data, endogenous, instruments, exogenous=None)[source]

Kleibergen-Paap (2006) rk statistic for weak instruments.

Robust to heteroskedasticity and clustering, unlike Cragg-Donald.

Parameters:
  • data (pd.DataFrame)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

DiagnosticResult

References

Kleibergen, F., & Paap, R. (2006). Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics, 133(1), 97–126.

morie.iv.anderson_rubin_test(data, outcome, endogenous, instruments, exogenous=None, *, beta0=None, alpha=0.05)[source]

Anderson-Rubin (1949) test: robust to weak instruments.

Tests \(H_0: \beta = \beta_0\) using the statistic:

\[AR = \frac{(y - X\beta_0)'P_Z(y - X\beta_0) / q} {(y - X\beta_0)'M_Z(y - X\beta_0) / (n - k)}\]

which has an \(F(q, n-k)\) distribution under the null regardless of instrument strength.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

  • beta0 (float, optional) – Null hypothesis value (default 0).

  • alpha (float)

Returns:

details contains the confidence set if computed by inversion.

Return type:

DiagnosticResult

morie.iv.anderson_rubin_ci(data, outcome, endogenous, instruments, exogenous=None, *, grid_min=-10.0, grid_max=10.0, grid_n=200, alpha=0.05)[source]

Anderson-Rubin confidence interval by test inversion.

Returns:

(lower, upper) bounds of the AR confidence set.

Return type:

tuple of float

morie.iv.conditional_lr_test(data, outcome, endogenous, instruments, exogenous=None, *, beta0=0.0)[source]

Conditional likelihood ratio test (Moreira, 2003).

More powerful than Anderson-Rubin when instruments are weak but not completely irrelevant.

Return type:

DiagnosticResult

References

Moreira, M. J. (2003). A conditional likelihood ratio test for structural models. Econometrica, 71(4), 1027–1048.

morie.iv.sargan_test(data, outcome, endogenous, instruments, exogenous=None)[source]

Sargan (1958) / Hansen (1982) J-test for overidentifying restrictions.

Tests whether the excluded instruments are uncorrelated with the structural error. Requires more instruments than endogenous variables (overidentification).

\[J = n \cdot \bar g' \hat S^{-1} \bar g \sim \chi^2(L - K)\]

where \(L\) is the number of instruments and \(K\) is the number of endogenous regressors.
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

DiagnosticResult

morie.iv.hansen_j_test(data, outcome, endogenous, instruments, exogenous=None)[source]

Hansen J-test (heteroskedasticity-robust overidentification test).

Uses the efficient GMM objective function as the test statistic.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

DiagnosticResult

morie.iv.hausman_test(data, outcome, endogenous, instruments, exogenous=None)[source]

Hausman (1978) specification test: OLS vs 2SLS.

Tests whether the endogenous variables are in fact exogenous. Under the null, OLS is consistent and efficient; under the alternative, only 2SLS is consistent.

\[H = (\hat\beta_{2SLS} - \hat\beta_{OLS})' (V_{2SLS} - V_{OLS})^{-1} (\hat\beta_{2SLS} - \hat\beta_{OLS})\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

DiagnosticResult

References

Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271.

morie.iv.durbin_wu_hausman(data, outcome, endogenous, instruments, exogenous=None)[source]

Durbin-Wu-Hausman regression-based endogeneity test.

Augments the structural equation with first-stage residuals and tests their joint significance.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

DiagnosticResult
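The augmented-regression idea can be sketched for a single endogenous regressor (illustrative only; morie tests the joint significance of all first-stage residuals):

```python
import numpy as np

def dwh_residual_coef(y, x, z):
    """Coefficient on first-stage residuals in the augmented structural OLS."""
    Z = np.column_stack([np.ones_like(z), z])
    v = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # first-stage residuals
    W = np.column_stack([np.ones_like(x), x, v])      # augmented regressors
    return np.linalg.lstsq(W, y, rcond=None)[0][2]    # nonzero => endogeneity
```

A coefficient on the residuals that is significantly different from zero rejects exogeneity of x.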

morie.iv.jive(data, outcome, endogenous, instruments, exogenous=None, *, alpha=0.05)[source]

Jackknife IV estimator (JIVE).

Uses leave-one-out predicted values from the first stage to reduce finite-sample bias from many or weak instruments.

\[\hat D_{i,(-i)} = Z_i' \hat\Pi_{(-i)}\]
Return type:

IVResult

References

Angrist, J. D., Imbens, G. W., & Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14(1), 57–67.

morie.iv.split_sample_iv(data, outcome, endogenous, instruments, exogenous=None, *, split_fraction=0.5, seed=42, alpha=0.05)[source]

Split-sample IV estimator (Angrist & Krueger, 1995).

Randomly splits the sample into two halves, estimates the first stage on one half and the second stage on the other, reducing overfitting bias from many instruments.

Return type:

IVResult

morie.iv.control_function(data, outcome, endogenous, instruments, exogenous=None, *, robust=True, alpha=0.05)[source]

Control function (CF) approach to endogeneity.

Two-step: (1) regress endogenous on instruments and exogenous to get residuals; (2) include residuals in the structural equation as additional regressors. Numerically equivalent to 2SLS for linear models but extends naturally to nonlinear models.

Returns:

details includes the test of endogeneity (significance of first-stage residuals).

Return type:

IVResult

morie.iv.iv_probit(data, outcome, endogenous, instruments, exogenous=None, *, alpha=0.05)[source]

IV Probit via control function (Rivers & Vuong, 1988).

For binary outcomes. Uses the control-function approach: first-stage OLS residuals are included in a probit second stage.

Parameters:
  • data (pd.DataFrame)

  • outcome (str) – Binary outcome (0/1).

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

  • alpha (float)

Return type:

IVResult

References

Rivers, D., & Vuong, Q. H. (1988). Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics, 39(3), 347–366.

morie.iv.panel_iv(data, outcome, endogenous, instruments, unit, *, exogenous=None, time_fe=None, alpha=0.05)[source]

Panel IV with unit (and optionally time) fixed effects.

Demeans all variables by unit (and time if specified), then applies 2SLS to the demeaned data.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • unit (str) – Unit identifier for fixed effects.

  • exogenous (list of str, optional)

  • time_fe (str, optional) – Time column for two-way fixed effects.

  • alpha (float)

Return type:

IVResult

morie.iv.iv_diagnostics(data, outcome, endogenous, instruments, exogenous=None)[source]

Comprehensive IV regression diagnostic suite.

Runs first-stage diagnostics, weak instrument tests, overidentification tests, and endogeneity tests in one call.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Returns:

Keys: first_stage, cragg_donald, hausman, sargan (if overidentified), durbin_wu_hausman.

Return type:

dict

morie.iv.iv_residual_analysis(data, outcome, endogenous, instruments, exogenous=None)[source]

Residual analysis for IV regression.

Returns a DataFrame with fitted values, residuals, leverage, and standardised residuals.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • endogenous (list of str)

  • instruments (list of str)

  • exogenous (list of str, optional)

Return type:

pd.DataFrame

Regression Discontinuity Design estimators for causal inference.

Implements sharp and fuzzy RDD, optimal bandwidth selection (IK, CCT, rule-of-thumb), local polynomial regression, kernel functions, manipulation tests, covariate balance, placebo tests, donut-hole RDD, geographic/multi-dimensional RDD, RD plots, kink RD, sensitivity analysis, and power calculations.

References

Imbens, G. W., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79(3), 933–959. https://doi.org/10.1093/restud/rdr043

Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295–2326. https://doi.org/10.3982/ECTA11757

McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698–714. https://doi.org/10.1016/j.jeconom.2007.05.005

Cattaneo, M. D., Jansson, M., & Ma, X. (2020). Simple local polynomial density estimators. Journal of the American Statistical Association, 115(531), 1449–1455. https://doi.org/10.1080/01621459.2019.1635480

morie.rdd.kernel_triangular(u)[source]

Triangular kernel: \(K(u) = 1 - |u|\) for \(|u| \le 1\).

Parameters:

u (ndarray)

Return type:

ndarray

morie.rdd.kernel_epanechnikov(u)[source]

Epanechnikov kernel: \(K(u) = \frac{3}{4}(1 - u^2)\) for \(|u| \le 1\).

Parameters:

u (ndarray)

Return type:

ndarray

morie.rdd.kernel_uniform(u)[source]

Uniform (rectangular) kernel: \(K(u) = 0.5\) for \(|u| \le 1\).

Parameters:

u (ndarray)

Return type:

ndarray

morie.rdd.kernel_gaussian(u)[source]

Gaussian kernel: \(K(u) = \phi(u)\) (standard normal density).

Parameters:

u (ndarray)

Return type:

ndarray
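The four kernels are simple closed forms; illustrative numpy versions matching the formulas above:

```python
import numpy as np

def kernel_triangular(u):
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u <= 1, 1 - u, 0.0)

def kernel_epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def kernel_uniform(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def kernel_gaussian(u):
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
```

The first three have compact support, so observations outside the bandwidth get zero weight; the Gaussian kernel weights all observations.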

class morie.rdd.RDDResult(estimate, std_error, t_stat, p_value, ci_lower, ci_upper, bandwidth, n_left, n_right, method='sharp_rdd', details=<factory>)[source]

Bases: object

Container for an RDD treatment effect estimate.

Variables:
  • estimate (float) – Point estimate of the treatment effect at the cutoff.

  • std_error (float) – Standard error.

  • t_stat (float) – t-statistic.

  • p_value (float) – Two-sided p-value.

  • ci_lower (float) – Lower bound of confidence interval.

  • ci_upper (float) – Upper bound of confidence interval.

  • bandwidth (float) – Bandwidth used for estimation.

  • n_left (int) – Observations to the left of the cutoff within the bandwidth.

  • n_right (int) – Observations to the right of the cutoff within the bandwidth.

  • method (str) – Estimation method name.

  • details (dict) – Additional diagnostic information.

Parameters:
estimate: float
std_error: float
t_stat: float
p_value: float
ci_lower: float
ci_upper: float
bandwidth: float
n_left: int
n_right: int
method: str = 'sharp_rdd'
details: dict[str, Any]
summary()[source]

Return a one-row summary DataFrame.

Return type:

DataFrame

class morie.rdd.BandwidthResult(h_opt, method, details=<factory>)[source]

Bases: object

Container for bandwidth selection output.

Variables:
  • h_opt (float) – Optimal bandwidth.

  • method (str) – Selection method used.

  • details (dict) – Intermediate quantities.

Parameters:
h_opt: float
method: str
details: dict[str, Any]
class morie.rdd.DensityTestResult(statistic, p_value, method, details=<factory>)[source]

Bases: object

Container for manipulation/density test output.

Parameters:
statistic: float
p_value: float
method: str
details: dict[str, Any]
morie.rdd.local_polynomial_regression(x, y, eval_points, h, p=1, kernel='triangular')[source]

Evaluate a local polynomial regression at multiple points.

Parameters:
  • x (np.ndarray) – Running variable.

  • y (np.ndarray) – Outcome.

  • eval_points (np.ndarray) – Points at which to evaluate the regression.

  • h (float) – Bandwidth.

  • p (int) – Polynomial order (0=Nadaraya-Watson, 1=local linear, etc.).

  • kernel (str) – Kernel name.

Returns:

Columns: x, fitted, se.

Return type:

pd.DataFrame
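At a single evaluation point, local linear regression is kernel-weighted least squares; a minimal sketch (triangular kernel, p = 1, no standard errors):

```python
import numpy as np

def local_linear_at(x, y, x0, h):
    """Kernel-weighted OLS of y on (1, x - x0); the intercept is the fit at x0."""
    u = (x - x0) / h
    w = np.clip(1 - np.abs(u), 0.0, None)           # triangular kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w                                   # X'W
    beta = np.linalg.solve(XtW @ X, XtW @ y)        # (X'WX)^{-1} X'Wy
    return beta[0]
```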

morie.rdd.bandwidth_ik(x, y, cutoff=0.0, kernel='triangular')[source]

Imbens-Kalyanaraman (2012) optimal bandwidth selector.

Parameters:
  • x (np.ndarray) – Running variable.

  • y (np.ndarray) – Outcome.

  • cutoff (float) – RD cutoff.

  • kernel (str) – Kernel function name.

Return type:

BandwidthResult

References

Imbens, G. W., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79(3), 933–959.

morie.rdd.bandwidth_rot(x, y, cutoff=0.0)[source]

Rule-of-thumb bandwidth selector.

Uses Silverman’s rule adapted for the RDD context.

Parameters:
  • x (np.ndarray)

  • y (np.ndarray)

  • cutoff (float)

Return type:

BandwidthResult

morie.rdd.bandwidth_cct(x, y, cutoff=0.0, kernel='triangular', p=1)[source]

Calonico-Cattaneo-Titiunik (2014) MSE-optimal bandwidth.

Also computes the CER-optimal bandwidth.

Parameters:
  • x (np.ndarray)

  • y (np.ndarray)

  • cutoff (float)

  • kernel (str)

  • p (int) – Polynomial order.

Returns:

details includes h_mse and h_cer.

Return type:

BandwidthResult

References

Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295–2326.

morie.rdd.sharp_rdd(data, outcome, running, cutoff=0.0, *, bandwidth=None, p=1, kernel='triangular', cluster=None, covariates=None, alpha=0.05)[source]

Sharp regression discontinuity design estimator.

Estimates the local average treatment effect at the cutoff using local polynomial regression of order p separately on each side of the cutoff:

\[\hat\tau = \hat\mu_+(c) - \hat\mu_-(c)\]

If no bandwidth is provided, the CCT MSE-optimal bandwidth is used.

Parameters:
  • data (pd.DataFrame)

  • outcome (str) – Outcome column.

  • running (str) – Running variable column.

  • cutoff (float) – RD cutoff value.

  • bandwidth (float, optional) – Estimation bandwidth. If None, computed via bandwidth_cct().

  • p (int) – Local polynomial order (default 1 = local linear).

  • kernel (str) – Kernel function name.

  • cluster (str, optional) – Column for cluster-robust variance.

  • covariates (list of str, optional) – Additional covariates (included additively).

  • alpha (float) – Significance level.

Return type:

RDDResult

References

Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201–209.
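The estimand above can be sketched end-to-end: fit a local linear regression with a triangular kernel on each side of the cutoff and difference the boundary intercepts. `sharp_rdd_estimate` is an illustrative helper under those assumptions, not morie's estimator (which also selects a bandwidth and computes standard errors):

```python
import numpy as np

def sharp_rdd_estimate(x, y, cutoff=0.0, h=0.5):
    """tau_hat = mu_+(c) - mu_-(c): local linear fits on each side of
    the cutoff with a triangular kernel (sketch, not morie internals)."""
    def boundary_fit(mask):
        u = (x[mask] - cutoff) / h
        w = np.sqrt(np.clip(1.0 - np.abs(u), 0.0, None))  # sqrt weights for WLS
        X = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        beta, *_ = np.linalg.lstsq(X * w[:, None], y[mask] * w, rcond=None)
        return beta[0]  # intercept = boundary value at the cutoff
    return boundary_fit(x >= cutoff) - boundary_fit(x < cutoff)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 2000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0.0, 0.2, 2000)
tau = sharp_rdd_estimate(x, y)  # close to the true jump of 2.0
```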

morie.rdd.fuzzy_rdd(data, outcome, running, treatment, cutoff=0.0, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]

Fuzzy regression discontinuity design estimator.

Uses the cutoff as an instrument for treatment takeup. The fuzzy RD estimand is:

\[\hat\tau_{\text{FRD}} = \frac{\hat\mu_{Y+}(c) - \hat\mu_{Y-}(c)} {\hat\mu_{D+}(c) - \hat\mu_{D-}(c)}\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str) – Outcome column.

  • running (str) – Running variable.

  • treatment (str) – Treatment takeup column (binary).

  • cutoff (float) – RD cutoff.

  • bandwidth (float, optional)

  • p (int) – Polynomial order.

  • kernel (str)

  • alpha (float)

Returns:

details includes the first-stage jump.

Return type:

RDDResult
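The ratio estimand above is a Wald estimator: the outcome jump divided by the first-stage jump in takeup. A minimal sketch, using local means within the bandwidth (the p=0 case) rather than morie's local polynomial fits:

```python
import numpy as np

def fuzzy_rdd_wald(x, y, d, cutoff=0.0, h=0.3):
    """Wald-type fuzzy RD: outcome jump over takeup jump, estimated with
    local means inside the bandwidth (p=0 illustrative sketch)."""
    lo = (x < cutoff) & (x > cutoff - h)
    hi = (x >= cutoff) & (x < cutoff + h)
    jump_y = y[hi].mean() - y[lo].mean()
    jump_d = d[hi].mean() - d[lo].mean()  # first-stage jump in takeup
    return jump_y / jump_d

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 5000)
# imperfect compliance: takeup probability jumps from 0.2 to 0.8 at the cutoff
d = (rng.random(5000) < np.where(x >= 0, 0.8, 0.2)).astype(float)
y = 1.5 * d + rng.normal(0.0, 0.1, 5000)
tau = fuzzy_rdd_wald(x, y, d)  # close to the true LATE of 1.5
```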

morie.rdd.rdd_bias_corrected(data, outcome, running, cutoff=0.0, *, bandwidth=None, rho=1.0, p=1, kernel='triangular', alpha=0.05)[source]

Bias-corrected RDD with robust confidence intervals (CCT, 2014).

Uses a higher-order local polynomial to estimate and remove the leading bias term, then constructs robust CIs.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • bandwidth (float, optional) – Main bandwidth. If None, uses CCT MSE-optimal.

  • rho (float) – Ratio of pilot bandwidth to main bandwidth (default 1.0).

  • p (int) – Main polynomial order.

  • kernel (str)

  • alpha (float)

Returns:

The estimate is bias-corrected; ci_lower/ci_upper are the robust confidence intervals.

Return type:

RDDResult

morie.rdd.mccrary_test(x, cutoff=0.0, *, n_bins=50, bandwidth=None)[source]

McCrary (2008) density test for manipulation of the running variable.

Constructs a histogram of the running variable and tests for a discontinuity in the density at the cutoff using local linear regression on the bin counts.

Parameters:
  • x (np.ndarray) – Running variable.

  • cutoff (float) – RD cutoff.

  • n_bins (int) – Number of histogram bins on each side.

  • bandwidth (float, optional)

Returns:

statistic is the log-difference in estimated densities.

Return type:

DensityTestResult

References

McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698–714.
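The quantity being tested is the log-difference in density at the cutoff. The sketch below only illustrates that statistic with two adjacent histogram cells; the actual test fits local linear regressions to the bin counts, as the docstring says:

```python
import numpy as np

# Crude density-jump check near the cutoff (illustration of the test
# statistic only, not the McCrary local linear fitting procedure).
rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, 20000)  # no manipulation: density is flat
w = 0.1                            # half-width of the cells flanking the cutoff
f_minus = ((x < 0) & (x > -w)).mean() / w
f_plus = ((x >= 0) & (x < w)).mean() / w
theta = np.log(f_plus) - np.log(f_minus)  # near zero when there is no bunching
```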

morie.rdd.cattaneo_density_test(x, cutoff=0.0, *, p=2, kernel='triangular', bandwidth=None)[source]

Cattaneo, Jansson, & Ma (2020) density test for manipulation.

Uses local polynomial density estimation on each side of the cutoff and tests for a discontinuity.

Parameters:
  • x (np.ndarray)

  • cutoff (float)

  • p (int) – Polynomial order for density estimation.

  • kernel (str)

  • bandwidth (float, optional)

Return type:

DensityTestResult

References

Cattaneo, M. D., Jansson, M., & Ma, X. (2020). Simple local polynomial density estimators. JASA, 115(531), 1449–1455.

morie.rdd.covariate_balance_rdd(data, running, covariates, cutoff=0.0, *, bandwidth=None, kernel='triangular', alpha=0.05)[source]

Test covariate balance at the RD cutoff.

For each covariate, estimates a sharp RDD and reports whether there is a statistically significant jump at the cutoff.

Parameters:
  • data (pd.DataFrame)

  • running (str) – Running variable column.

  • covariates (list of str) – Covariate columns to test.

  • cutoff (float)

  • bandwidth (float, optional)

  • kernel (str)

  • alpha (float)

Returns:

Columns: covariate, estimate, std_error, p_value, balanced.

Return type:

pd.DataFrame

morie.rdd.placebo_cutoff_test(data, outcome, running, true_cutoff, placebo_cutoffs, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]

Test for RDD effects at placebo (false) cutoff values.

Estimates a sharp RDD at each placebo cutoff using only data on one side of the true cutoff. A well-designed RDD should show no significant effects at placebo cutoffs.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • true_cutoff (float) – The actual cutoff value.

  • placebo_cutoffs (list of float) – Placebo cutoff values to test.

  • bandwidth (float, optional)

  • p (int)

  • kernel (str)

  • alpha (float)

Return type:

pd.DataFrame

morie.rdd.donut_rdd(data, outcome, running, cutoff=0.0, donut=0.0, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]

Donut-hole RDD: exclude observations within donut of the cutoff.

Addresses concerns about manipulation or heaping at the cutoff by removing observations in \([c - \text{donut}, c + \text{donut}]\).

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • donut (float) – Exclusion radius around the cutoff.

  • bandwidth (float, optional)

  • p (int)

  • kernel (str)

  • alpha (float)

Return type:

RDDResult
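The exclusion rule itself is simple; a sketch of the filtering step (estimation then proceeds on the remaining observations as in a sharp RDD):

```python
import numpy as np

# Donut-hole step: drop observations within `donut` of the cutoff before
# estimating the RDD.
x = np.linspace(-1.0, 1.0, 11)
cutoff, donut = 0.0, 0.15
keep = np.abs(x - cutoff) > donut
x_kept = x[keep]  # the observation at x = 0 is excluded
```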

morie.rdd.rdd_discrete(data, outcome, running, cutoff=0.0, *, bandwidth=None, p=0, alpha=0.05)[source]

RDD estimator for discrete running variables.

Uses local constant (Nadaraya-Watson) or local linear regression with a uniform kernel. Standard errors are clustered at the level of the running variable mass points (Lee & Card, 2008).

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • bandwidth (float, optional)

  • p (int) – Polynomial order (default 0 = local constant).

  • alpha (float)

Return type:

RDDResult

References

Lee, D. S., & Card, D. (2008). Regression discontinuity inference with specification error. Journal of Econometrics, 142(2), 655–674.

morie.rdd.rd_plot_data(data, outcome, running, cutoff=0.0, *, n_bins=20, p_global=4, p_local=1, bandwidth=None, kernel='triangular')[source]

Generate data for standard RD plots.

Returns three datasets:

  • binned – binned scatter plot (evenly-spaced or quantile bins)

  • global_poly – global polynomial fit

  • local_poly – local polynomial fit within bandwidth

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • n_bins (int) – Number of bins on each side.

  • p_global (int) – Order of the global polynomial.

  • p_local (int) – Order of the local polynomial.

  • bandwidth (float, optional)

  • kernel (str)

Returns:

Keys: "binned", "global_poly", "local_poly".

Return type:

dict

morie.rdd.bandwidth_sensitivity(data, outcome, running, cutoff=0.0, *, bandwidth_range=None, p=1, kernel='triangular', alpha=0.05)[source]

Sensitivity of the RDD estimate to bandwidth choice.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • bandwidth_range (list of float, optional) – Bandwidths to evaluate. If None, uses multiples of the optimal bandwidth.

  • p (int)

  • kernel (str)

  • alpha (float)

Returns:

Columns: bandwidth, estimate, std_error, ci_lower, ci_upper, p_value, n_eff.

Return type:

pd.DataFrame

morie.rdd.kink_rdd(data, outcome, running, cutoff=0.0, *, bandwidth=None, kernel='triangular', alpha=0.05)[source]

Regression Kink Design (RKD) estimator.

Tests for a change in the slope (rather than the level) of the conditional expectation function at the cutoff.

\[\hat\tau_{\text{RKD}} = \frac{\hat\mu'_+(c) - \hat\mu'_-(c)} {\text{kink in policy variable}}\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • bandwidth (float, optional)

  • kernel (str)

  • alpha (float)

Returns:

estimate is the change in slope at the cutoff.

Return type:

RDDResult

References

Card, D., Lee, D. S., Pei, Z., & Weber, A. (2015). Inference on causal effects in a generalized regression kink design. Econometrica, 83(6), 2453–2483.

morie.rdd.rdd_local_randomisation(data, outcome, running, cutoff=0.0, window=1.0, *, n_permutations=1000, seed=42, alpha=0.05)[source]

RDD via local randomisation (Cattaneo, Frandsen & Titiunik, 2015).

Within a narrow window around the cutoff, assumes treatment is as-if randomly assigned and uses a permutation test.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • running (str)

  • cutoff (float)

  • window (float) – Symmetric window around cutoff for local randomisation.

  • n_permutations (int)

  • seed (int)

  • alpha (float)

Returns:

p_value is the permutation-based p-value.

Return type:

RDDResult

References

Cattaneo, M. D., Frandsen, B. R., & Titiunik, R. (2015). Randomization inference in the regression discontinuity design. Journal of Causal Inference, 3(1), 1–24.
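The permutation logic can be sketched directly: restrict to the window, treat side-of-cutoff as the "treatment" label, and compare the observed mean difference to its permutation distribution. `local_randomisation_p` is an illustrative helper, not morie's routine:

```python
import numpy as np

def local_randomisation_p(x, y, cutoff=0.0, window=0.5, n_perm=2000, seed=0):
    """Permutation p-value for the mean outcome difference within a
    symmetric window around the cutoff (sketch of the CFT approach)."""
    rng = np.random.default_rng(seed)
    m = np.abs(x - cutoff) <= window
    yw, side = y[m], (x[m] >= cutoff)
    obs = yw[side].mean() - yw[~side].mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(side)          # re-randomise the side labels
        stat = yw[perm].mean() - yw[~perm].mean()
        if abs(stat) >= abs(obs):
            count += 1
    return (count + 1) / (n_perm + 1)          # add-one permutation p-value

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 400)
y = 1.0 * (x >= 0) + rng.normal(0.0, 0.5, 400)  # true jump of 1.0
p = local_randomisation_p(x, y)                  # small p-value expected
```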

morie.rdd.geographic_rdd(data, outcome, distance_to_boundary, side, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]

Geographic (multi-dimensional) RDD using distance to a boundary.

Reduces a spatial RDD to a univariate RDD by using the signed distance from each unit to the treatment boundary as the running variable.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • distance_to_boundary (str) – Column with signed distance (negative on one side, positive on the other).

  • side (str) – Binary column indicating which side of the boundary each unit falls on.

  • bandwidth (float, optional)

  • p (int)

  • kernel (str)

  • alpha (float)

Return type:

RDDResult

References

Keele, L. J., & Titiunik, R. (2015). Geographic boundaries as regression discontinuities. Political Analysis, 23(1), 127–155.

morie.rdd.rdd_power(n, tau, sigma, cutoff_density=1.0, bandwidth=None, kernel='triangular', alpha=0.05)[source]

Power calculation for a sharp RDD.

Computes approximate power based on the asymptotic distribution of the local linear estimator:

\[\text{Power} = \Phi\left(\frac{|\tau|}{\hat\sigma_\tau} - z_{1-\alpha/2}\right)\]
Parameters:
  • n (int) – Total sample size.

  • tau (float) – Hypothesised treatment effect.

  • sigma (float) – Outcome standard deviation.

  • cutoff_density (float) – Density of the running variable at the cutoff.

  • bandwidth (float, optional) – Bandwidth (if None, uses a default fraction of the range).

  • kernel (str) – Kernel name (affects efficiency constants).

  • alpha (float) – Significance level.

Returns:

power, n_effective, mde (minimum detectable effect).

Return type:

dict
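The power formula above is straightforward to evaluate with the standard library. In this sketch the standard error of the estimator is taken as given rather than derived from n, sigma, the kernel constants, and the cutoff density as morie does:

```python
from statistics import NormalDist

def rdd_power_from_se(tau, se_tau, alpha=0.05):
    """Power = Phi(|tau|/se_tau - z_{1-alpha/2}) for a two-sided test of
    tau = 0, per the docstring formula (se_tau taken as given)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(abs(tau) / se_tau - z_crit)

power = rdd_power_from_se(0.5, 0.2)  # effect is 2.5 standard errors from zero
```

As a sanity check, power at tau = 0 equals alpha/2, the one-sided rejection rate under the null.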

morie.rdd.rdd_sample_size(tau, sigma, cutoff_density=1.0, bandwidth=1.0, power=0.8, kernel='triangular', alpha=0.05)[source]

Compute the required sample size for a sharp RDD.

Parameters:
  • tau (float) – Target treatment effect.

  • sigma (float) – Outcome SD.

  • cutoff_density (float)

  • bandwidth (float)

  • power (float) – Target power (default 0.80).

  • kernel (str)

  • alpha (float)

Returns:

Required total sample size.

Return type:

int

Difference-in-Differences estimators for causal inference.

Implements classic 2x2 DiD, generalized DiD with multiple time periods (Callaway–Sant’Anna), staggered adoption, event studies, Bacon decomposition, doubly-robust DiD, triple differences, synthetic DiD, and comprehensive pre-trend and placebo testing infrastructure.

References

Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200–230. https://doi.org/10.1016/j.jeconom.2020.12.001

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277. https://doi.org/10.1016/j.jeconom.2021.03.014

Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122. https://doi.org/10.1016/j.jeconom.2020.06.003

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088–4118. https://doi.org/10.1257/aer.20190159

class morie.did.DiDResult(estimate, std_error, t_stat, p_value, ci_lower, ci_upper, n_treated, n_control, method='did_2x2', details=<factory>)[source]

Bases: object

Container for a Difference-in-Differences estimate.

Variables:
  • estimate (float) – Point estimate of the treatment effect (ATT).

  • std_error (float) – Standard error of the estimate.

  • t_stat (float) – t-statistic (estimate / std_error).

  • p_value (float) – Two-sided p-value.

  • ci_lower (float) – Lower bound of the 95% confidence interval.

  • ci_upper (float) – Upper bound of the 95% confidence interval.

  • n_treated (int) – Number of treated observations.

  • n_control (int) – Number of control observations.

  • method (str) – Name of the estimation method used.

  • details (dict) – Additional diagnostic information.

Parameters:
estimate: float
std_error: float
t_stat: float
p_value: float
ci_lower: float
ci_upper: float
n_treated: int
n_control: int
method: str = 'did_2x2'
details: dict[str, Any]
summary()[source]

Return a one-row summary DataFrame.

Return type:

DataFrame

class morie.did.EventStudyResult(coefficients, reference_period, pre_trend_f_stat, pre_trend_p_value, details=<factory>)[source]

Bases: object

Container for event-study coefficient estimates.

Variables:
  • coefficients (pd.DataFrame) – DataFrame with columns relative_time, estimate, std_error, ci_lower, ci_upper, p_value.

  • reference_period (int) – The omitted (normalised-to-zero) relative time period.

  • pre_trend_f_stat (float) – Joint F-statistic for pre-treatment coefficients.

  • pre_trend_p_value (float) – p-value for the joint pre-trend test.

Parameters:
coefficients: DataFrame
reference_period: int
pre_trend_f_stat: float
pre_trend_p_value: float
details: dict[str, Any]
class morie.did.BaconDecomposition(components, overall_estimate)[source]

Bases: object

Container for Goodman-Bacon decomposition.

Variables:
  • components (pd.DataFrame) – DataFrame with columns group1, group2, estimate, weight, type (timing, always-treated, never-treated).

  • overall_estimate (float) – Weighted-average TWFE DiD estimate.

Parameters:
components: DataFrame
overall_estimate: float
morie.did.did_2x2(data, outcome, treatment, post, *, covariates=None, cluster=None, alpha=0.05)[source]

Estimate a classic 2x2 Difference-in-Differences treatment effect.

The canonical DiD estimator for settings with two groups (treated vs control) and two periods (pre vs post). With no covariates the estimator is:

\[\hat\tau_{\text{DiD}} = \bigl(\bar Y_{1,\text{post}} - \bar Y_{1,\text{pre}}\bigr) - \bigl(\bar Y_{0,\text{post}} - \bar Y_{0,\text{pre}}\bigr)\]

When covariates are provided the regression specification is:

\[Y_{it} = \alpha + \beta\,D_i + \gamma\,\text{Post}_t + \tau\,(D_i \times \text{Post}_t) + X_{it}'\delta + \varepsilon_{it}\]

and \(\hat\tau\) is the coefficient on the interaction term.

Parameters:
  • data (pd.DataFrame) – Panel or repeated cross-section data.

  • outcome (str) – Name of the outcome column.

  • treatment (str) – Binary (0/1) column indicating the treatment group.

  • post (str) – Binary (0/1) column indicating the post-treatment period.

  • covariates (list of str, optional) – Additional covariates for the regression specification.

  • cluster (str, optional) – Column for cluster-robust standard errors.

  • alpha (float) – Significance level for confidence intervals (default 0.05).

Returns:

Estimation results including point estimate, SE, CI, and diagnostics.

Return type:

DiDResult

References

Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
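In the no-covariate case the estimator is just the difference of the four group-by-period cell means. A minimal pandas sketch (morie's routine uses the regression specification so it can add covariates and clustered standard errors):

```python
import numpy as np
import pandas as pd

def did_2x2_means(df, outcome, treatment, post):
    """Classic 2x2 DiD computed from the four cell means
    (no-covariate case; illustrative sketch)."""
    m = df.groupby([treatment, post])[outcome].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

rng = np.random.default_rng(3)
n = 4000
d = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
y = 1.0 * d + 0.5 * post + 2.0 * d * post + rng.normal(0.0, 0.5, n)
df = pd.DataFrame({"y": y, "d": d, "post": post})
tau = did_2x2_means(df, "y", "d", "post")  # close to the true effect of 2.0
```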

morie.did.did_repeated_cross_section(data, outcome, treatment, post, *, covariates=None, weights=None, cluster=None, alpha=0.05)[source]

DiD estimator designed for repeated cross-section data.

Identical to did_2x2() but optionally incorporates survey weights. When weights is provided, weighted least squares is used for estimation.

Parameters:
  • data (pd.DataFrame) – Repeated cross-section data.

  • outcome (str) – Outcome column.

  • treatment (str) – Binary treatment-group indicator column.

  • post (str) – Binary post-period indicator column.

  • covariates (list of str, optional) – Additional covariates.

  • weights (str, optional) – Column containing survey / sampling weights.

  • cluster (str, optional) – Column for cluster-robust SE.

  • alpha (float) – Significance level.

Return type:

DiDResult

morie.did.did_panel_fe(data, outcome, treatment, unit, time, *, covariates=None, cluster=None, alpha=0.05)[source]

DiD with unit and time fixed effects (two-way fixed effects, TWFE).

Demeans the outcome by unit and time means (within transformation) and regresses on the treatment indicator.

Parameters:
  • data (pd.DataFrame) – Balanced or unbalanced panel data.

  • outcome (str) – Outcome column.

  • treatment (str) – Binary column (0/1) indicating treatment status in each period.

  • unit (str) – Unit identifier column.

  • time (str) – Time period column.

  • covariates (list of str, optional) – Time-varying covariates.

  • cluster (str, optional) – Cluster variable for standard errors (defaults to unit).

  • alpha (float) – Significance level.

Return type:

DiDResult

morie.did.event_study(data, outcome, unit, time, treatment_time, *, covariates=None, reference_period=-1, leads=4, lags=4, cluster=None, alpha=0.05)[source]

Estimate an event-study specification around treatment onset.

Constructs relative-time dummies \(\{1[t - g = k]\}\) for \(k \in [-\text{leads}, \text{lags}]\) and regresses the outcome on these indicators with unit and time fixed effects. The reference_period dummy is omitted (normalised to zero).

Parameters:
  • data (pd.DataFrame) – Panel data with a treatment-onset column.

  • outcome (str) – Outcome column.

  • unit (str) – Unit identifier column.

  • time (str) – Calendar-time column (integer-valued).

  • treatment_time (str) – Column giving the period in which each unit first received treatment (np.inf or NaN for never-treated units).

  • covariates (list of str, optional) – Time-varying covariates.

  • reference_period (int) – Relative-time period omitted as baseline (default -1).

  • leads (int) – Number of pre-treatment periods to include.

  • lags (int) – Number of post-treatment periods to include.

  • cluster (str, optional) – Cluster variable (defaults to unit).

  • alpha (float) – Significance level.

Return type:

EventStudyResult
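The dummy-construction step described above can be sketched as follows; the example builds only the relative-time indicators (the subsequent fixed-effects regression is omitted), with never-treated units encoded as np.inf per the docstring:

```python
import numpy as np
import pandas as pd

# Build relative-time dummies 1[t - g = k], omitting the reference
# period k = -1 (sketch of the construction step only).
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "time": [0, 1, 2, 0, 1, 2],
    "g":    [1, 1, 1, np.inf, np.inf, np.inf],  # unit 2 is never treated
})
df["rel_time"] = df["time"] - df["g"]  # -inf for never-treated units
for k in [0, 1]:                        # k = -1 is the omitted baseline
    df[f"D_{k}"] = (df["rel_time"] == k).astype(int)
```

Never-treated units have rel_time of -inf, so every dummy is zero for them and they serve as the comparison group.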

Test the parallel trends assumption using pre-treatment data.

Runs two tests:

  1. Individual coefficient test: regresses the outcome on group-by-time interactions in the pre-period and reports each coefficient.

  2. Joint F-test: tests that all interaction coefficients are jointly zero.

Parameters:
  • data (pd.DataFrame) – Panel or repeated cross-section data.

  • outcome (str) – Outcome column name.

  • treatment (str) – Treatment group indicator.

  • time (str) – Time column (integer-valued).

  • unit (str, optional) – Unit identifier for panel demeaning.

  • cluster (str, optional) – Cluster variable for robust SE.

  • pre_periods (list, optional) – Explicit list of pre-treatment time values to test.

Returns:

coefficients (pd.DataFrame), joint_f_stat, joint_p_value, parallel_trends_plausible (bool, True when p > 0.05).

Return type:

dict

Compute group-level outcome means over time for parallel-trends plots.

Parameters:
  • data (pd.DataFrame) – Panel or repeated cross-section.

  • outcome (str) – Outcome column.

  • treatment (str) – Treatment-group indicator column.

  • time (str) – Time column.

  • weights (str, optional) – Survey weight column.

Returns:

Columns: time, group (0 or 1), mean_outcome, se, n.

Return type:

pd.DataFrame

morie.did.group_time_att(data, outcome, unit, time, treatment_time, *, covariates=None, method='doubly_robust', control_group='never_treated', n_bootstrap=200, seed=42, alpha=0.05)[source]

Estimate group-time average treatment effects (Callaway & Sant’Anna 2021).

For each cohort \(g\) (units first treated at time \(g\)) and each post-treatment calendar period \(t\), estimates \(\text{ATT}(g, t)\).

Parameters:
  • data (pd.DataFrame) – Panel data.

  • outcome (str) – Outcome column.

  • unit (str) – Unit identifier.

  • time (str) – Calendar-time column (integer).

  • treatment_time (str) – Column with treatment-onset period (use np.inf for never-treated).

  • covariates (list of str, optional) – Covariates for doubly-robust estimation.

  • method (str) – "doubly_robust" (default), "ipw", or "outcome_regression".

  • control_group (str) – "never_treated" or "not_yet_treated".

  • n_bootstrap (int) – Number of bootstrap replications for inference.

  • seed (int) – Random seed for bootstrap.

  • alpha (float) – Significance level.

Returns:

Columns: cohort, time, att, std_error, ci_lower, ci_upper, p_value.

Return type:

pd.DataFrame

References

Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.

morie.did.aggregate_gt_att(gt_results, *, aggregation='overall', time_col='time', cohort_col='cohort', att_col='att', se_col='std_error')[source]

Aggregate group-time ATTs into summary treatment effect parameters.

Parameters:
  • gt_results (pd.DataFrame) – Output of group_time_att().

  • aggregation (str) – "overall" (single ATT), "cohort" (by cohort), "calendar_time" (by period), or "event_time" (by relative period since treatment).

  • time_col (str) – Name of the time column in gt_results.

  • cohort_col (str) – Name of the cohort column.

  • att_col (str) – Name of the ATT estimate column.

  • se_col (str) – Name of the standard-error column.

Returns:

Aggregated estimates with estimate, std_error, ci_lower, ci_upper.

Return type:

pd.DataFrame

morie.did.staggered_did(data, outcome, unit, time, treatment_time, *, covariates=None, n_bootstrap=200, seed=42, alpha=0.05)[source]

Staggered DiD via group-time ATTs with aggregation.

A convenience wrapper around group_time_att() and aggregate_gt_att() that returns overall, cohort-level, and event-time aggregated results.

Parameters:
  • data (pd.DataFrame) – Panel data.

  • outcome (str) – Outcome column.

  • unit (str) – Unit identifier.

  • time (str) – Calendar-time column.

  • treatment_time (str) – Column with treatment-onset period (np.inf for never-treated).

  • covariates (list of str, optional) – Covariates for doubly-robust estimation.

  • n_bootstrap (int) – Bootstrap replications.

  • seed (int) – Random seed.

  • alpha (float) – Significance level.

Returns:

Keys: group_time (pd.DataFrame), overall (pd.DataFrame), by_cohort (pd.DataFrame), by_event_time (pd.DataFrame).

Return type:

dict

morie.did.did_doubly_robust(data, outcome, treatment, post, covariates, *, ps_model='logistic', or_model='linear', cluster=None, n_bootstrap=200, seed=42, alpha=0.05)[source]

Doubly-robust DiD estimator (Sant’Anna & Zhao, 2020).

Combines an outcome regression model with an inverse-probability weighting model. The estimator is consistent if either model is correctly specified.

\[\hat\tau_{\text{DR}} = \frac{1}{n_1}\sum_{i:D_i=1}\bigl[\Delta Y_i - \hat\mu_0(\mathbf X_i)\bigr] - \frac{1}{\sum_{j:D_j=0}\hat w_j} \sum_{j:D_j=0}\hat w_j\bigl[\Delta Y_j - \hat\mu_0(\mathbf X_j)\bigr]\]
Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • covariates (list of str)

  • ps_model (str) – "logistic" or "gbm" for propensity score model.

  • or_model (str) – "linear" or "gbm" for outcome regression model.

  • cluster (str, optional)

  • n_bootstrap (int)

  • seed (int)

  • alpha (float)

Return type:

DiDResult

References

Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122.

morie.did.did_triple_difference(data, outcome, treatment, post, third_diff, *, covariates=None, cluster=None, alpha=0.05)[source]

Triple-difference (DDD) estimator.

Extends the standard DiD by adding a third differencing dimension (e.g., within-state variation across affected and unaffected sub-populations).

\[Y = \alpha + \beta_1 D + \beta_2 \text{Post} + \beta_3 S + \beta_4 (D \times \text{Post}) + \beta_5 (D \times S) + \beta_6 (\text{Post} \times S) + \tau (D \times \text{Post} \times S) + \varepsilon\]

The DDD estimate is the coefficient on the three-way interaction.

Parameters:
  • data (pd.DataFrame)

  • outcome (str) – Outcome column.

  • treatment (str) – Binary treatment-group indicator.

  • post (str) – Binary post-period indicator.

  • third_diff (str) – Binary indicator for the additional differencing group (e.g., age group, sub-population).

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Return type:

DiDResult

morie.did.bacon_decomposition(data, outcome, treatment, unit, time)[source]

Goodman-Bacon (2021) decomposition of the TWFE DiD estimator.

Decomposes the two-way fixed-effects DiD estimate into a weighted average of all possible 2x2 DiD comparisons: earlier-vs-later treated, treated-vs-never-treated, and later-vs-earlier treated.

Parameters:
  • data (pd.DataFrame) – Balanced panel data.

  • outcome (str) – Outcome column.

  • treatment (str) – Binary treatment indicator that turns on at treatment onset.

  • unit (str) – Unit identifier.

  • time (str) – Time period.

Return type:

BaconDecomposition

References

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277.

morie.did.synthetic_did(data, outcome, unit, time, treatment_time, *, treated_units=None, zeta=None, n_bootstrap=200, seed=42, alpha=0.05)[source]

Synthetic Difference-in-Differences estimator.

Combines synthetic control reweighting of control units with DiD to produce a doubly-robust estimator.

Solves for unit weights \(\hat\omega\) and time weights \(\hat\lambda\) that minimise weighted pre-treatment outcome differences, then estimates:

\[\hat\tau_{\text{SDID}} = \sum_i \hat\omega_i \sum_t \hat\lambda_t \bigl(Y_{it} - \hat\mu\bigr)\]
Parameters:
  • data (pd.DataFrame) – Balanced panel.

  • outcome (str) – Outcome column.

  • unit (str) – Unit identifier.

  • time (str) – Time period column.

  • treatment_time (str) – Column with treatment-onset period.

  • treated_units (list, optional) – Explicit list of treated unit IDs.

  • zeta (float, optional) – Regularisation parameter for unit weights. If None, automatically selected.

  • n_bootstrap (int) – Bootstrap replications.

  • seed (int) – Random seed.

  • alpha (float) – Significance level.

Return type:

DiDResult

References

Arkhangelsky, D., et al. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088–4118.

morie.did.wild_cluster_bootstrap(data, outcome, treatment, post, cluster, *, covariates=None, n_bootstrap=999, weight_type='rademacher', seed=42, alpha=0.05)[source]

DiD with wild cluster bootstrap p-values (Cameron, Gelbach & Miller, 2008).

Recommended when the number of clusters is small (< 50). Multiplies cluster-level residuals by random weights drawn from the Rademacher or Webb distributions.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • cluster (str)

  • covariates (list of str, optional)

  • n_bootstrap (int) – Number of bootstrap replications.

  • weight_type (str) – "rademacher" (+1/-1 with equal probability) or "webb" (six-point distribution).

  • seed (int)

  • alpha (float)

Returns:

The p_value is the bootstrap p-value.

Return type:

DiDResult

References

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
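The key resampling step is that every residual within a cluster receives the same random sign. A sketch of one Rademacher draw (`wild_cluster_draw` is an illustrative helper, not the full bootstrap loop):

```python
import numpy as np

def wild_cluster_draw(resid, cluster_ids, rng):
    """One wild cluster bootstrap draw: multiply each cluster's residuals
    by a single Rademacher (+1/-1) weight (sketch of the resampling step)."""
    clusters = np.unique(cluster_ids)
    signs = rng.choice([-1.0, 1.0], size=len(clusters))
    sign_of = dict(zip(clusters, signs))
    return resid * np.array([sign_of[c] for c in cluster_ids])

rng = np.random.default_rng(4)
resid = np.array([1.0, 2.0, 3.0, 4.0])
new = wild_cluster_draw(resid, np.array([0, 0, 1, 1]), rng)
```

In the full procedure, pseudo-outcomes are rebuilt from the restricted fit plus these perturbed residuals, the model is re-estimated, and the bootstrap p-value is the share of replications with a test statistic at least as extreme as the original.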

morie.did.did_continuous_treatment(data, outcome, dose, post, *, covariates=None, cluster=None, alpha=0.05)[source]

DiD with a continuous treatment variable (dose–response DiD).

Models the outcome as:

\[Y_{it} = \alpha + \beta\,\text{Dose}_i + \gamma\,\text{Post}_t + \tau\,(\text{Dose}_i \times \text{Post}_t) + X_{it}'\delta + \varepsilon_{it}\]

\(\hat\tau\) is the marginal effect of a one-unit increase in treatment intensity in the post period.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • dose (str) – Continuous treatment intensity column.

  • post (str) – Binary post-period indicator.

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Return type:

DiDResult
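The specification above is an interaction regression, so tau can be read off an ordinary least-squares fit. A NumPy sketch of the no-covariate case (morie's routine adds covariates and clustered standard errors):

```python
import numpy as np

# Dose-response DiD: tau is the coefficient on Dose x Post.
rng = np.random.default_rng(6)
n = 5000
dose = rng.uniform(0.0, 2.0, n)
post = rng.integers(0, 2, n).astype(float)
y = 0.3 * dose + 0.2 * post + 1.5 * dose * post + rng.normal(0.0, 0.3, n)
X = np.column_stack([np.ones(n), dose, post, dose * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
tau = beta[3]  # close to the true interaction effect of 1.5
```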

morie.did.did_fuzzy(data, outcome, assignment, takeup, post, *, covariates=None, cluster=None, alpha=0.05)[source]

Fuzzy DiD estimator for settings with imperfect compliance.

Uses the interaction of assignment \(\times\) post as an instrument for takeup \(\times\) post in a 2SLS framework to recover a local average treatment effect (LATE).

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • assignment (str) – Intent-to-treat assignment indicator (binary).

  • takeup (str) – Actual treatment takeup indicator (binary).

  • post (str) – Post-period indicator.

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Returns:

The estimate is the LATE from fuzzy DiD.

Return type:

DiDResult

morie.did.placebo_test_time(data, outcome, treatment, time, true_treatment_time, placebo_times, *, covariates=None, cluster=None, alpha=0.05)[source]

Run placebo DiD tests at fake treatment times.

For each time in placebo_times, re-define the post indicator as time >= placebo_time using only pre-treatment data (before true_treatment_time) and estimate a DiD. A well-identified design should yield estimates close to zero.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • time (str)

  • true_treatment_time (any) – The actual treatment onset time (to restrict to pre-period data).

  • placebo_times (list) – Candidate fake treatment times to test.

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Returns:

One row per placebo time with placebo_time, estimate, std_error, p_value, significant.

Return type:

pd.DataFrame

morie.did.placebo_test_outcome(data, placebo_outcomes, treatment, post, *, covariates=None, cluster=None, alpha=0.05)[source]

Run placebo DiD on outcomes that should not be affected by treatment.

Parameters:
  • data (pd.DataFrame)

  • placebo_outcomes (list of str) – Outcome columns expected to show no treatment effect.

  • treatment (str)

  • post (str)

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Returns:

One row per placebo outcome.

Return type:

pd.DataFrame

morie.did.placebo_test_group(data, outcome, treatment, post, group_col, unaffected_groups, *, covariates=None, cluster=None, alpha=0.05)[source]

Run placebo DiD on groups that should not be affected.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • group_col (str) – Column defining sub-groups.

  • unaffected_groups (list) – Group values where no treatment effect is expected.

  • covariates (list of str, optional)

  • cluster (str, optional)

  • alpha (float)

Return type:

pd.DataFrame

morie.did.did_heterogeneous(data, outcome, treatment, post, moderator, *, covariates=None, cluster=None, n_quantiles=4, alpha=0.05)[source]

Heterogeneity-robust DiD: estimate treatment effects by subgroups.

Splits the sample by quantiles (or categories) of moderator and estimates separate DiD effects for each stratum.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • moderator (str)

  • covariates (list of str, optional)

  • cluster (str, optional)

  • n_quantiles (int) – Number of quantile bins if moderator is continuous.

  • alpha (float)

Returns:

Columns: group, estimate, std_error, ci_lower, ci_upper, p_value, n.

Return type:

pd.DataFrame

morie.did.did_chaisemartin_dhaultfoeuille(data, outcome, treatment, unit, time, *, n_bootstrap=200, seed=42, alpha=0.05)[source]

Heterogeneity-robust DiD (de Chaisemartin & D’Haultfoeuille, 2020).

Computes the instantaneous treatment effect for switchers (units whose treatment status changes between consecutive periods), using units with stable treatment status as the comparison group.

\[\hat\delta = \sum_{(i,t): D_{it}>D_{i,t-1}} w_{it} \bigl[\Delta Y_{it} - \Delta \bar Y_{ct}\bigr]\]
Parameters:
  • data (pd.DataFrame) – Panel data.

  • outcome (str)

  • treatment (str)

  • unit (str)

  • time (str)

  • n_bootstrap (int)

  • seed (int)

  • alpha (float)

Return type:

DiDResult

References

de Chaisemartin, C., & D’Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964–2996.

morie.did.did_sensitivity_analysis(data, outcome, treatment, post, *, covariates=None, delta_range=None, cluster=None, alpha=0.05)[source]

Sensitivity of DiD estimate to violations of parallel trends.

Following Rambachan & Roth (2023), computes the identified set for the ATT under bounded deviations \(\delta\) from parallel trends:

\[|\text{bias}| \le \delta \cdot \hat\sigma\]

For each \(\delta\), computes a bias-adjusted confidence set.

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • covariates (list of str, optional)

  • delta_range (sequence of float, optional) – Values of \(\delta\) to evaluate (default: 0 to 2 in 0.25 steps).

  • cluster (str, optional)

  • alpha (float)

Returns:

Columns: delta, ci_lower, ci_upper, covers_zero.

Return type:

pd.DataFrame

References

Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555–2591.
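The bias-adjusted confidence set for each \(\delta\) can be illustrated directly from the bound above. This is a minimal sketch of the widening logic only (with a hardcoded 1.96 critical value), not the morie implementation:

```python
import numpy as np

def sensitivity_cis(estimate, se, deltas, z=1.96):
    """Widen the nominal CI by the worst-case bias delta * se
    (illustration of the bounded-deviation idea described above)."""
    rows = []
    for d in deltas:
        lo = estimate - z * se - d * se
        hi = estimate + z * se + d * se
        rows.append({"delta": d, "ci_lower": lo, "ci_upper": hi,
                     "covers_zero": lo <= 0.0 <= hi})
    return rows

# a significant estimate stops excluding zero once delta is large enough
rows = sensitivity_cis(estimate=1.0, se=0.3, deltas=np.arange(0, 2.25, 0.25))
```

The smallest \(\delta\) at which `covers_zero` flips to True is a useful one-number robustness summary.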

morie.did.did_diagnostics(data, outcome, treatment, post, *, covariates=None, cluster=None)[source]

Comprehensive diagnostics for a 2x2 DiD setting.

Checks:

  • Sample sizes by group and period

  • Baseline covariate balance

  • Outcome distributions by group and period

  • Pre-period outcome correlation between groups

Parameters:
  • data (pd.DataFrame)

  • outcome (str)

  • treatment (str)

  • post (str)

  • covariates (list of str, optional)

  • cluster (str, optional)

Returns:

Keys: sample_sizes, outcome_stats, covariate_balance.

Return type:

dict

Sensitivity analysis for causal inference assumptions.

Provides tools to assess the robustness of causal effect estimates to unmeasured confounding, model specification, and other threats to internal validity. Implements Rosenbaum bounds, E-value, Ding-VanderWeele bias formulas, tipping-point analysis, and specification curve analysis.

References

  • Rosenbaum (2002). Observational Studies, 2nd ed.

  • VanderWeele & Ding (2017). Sensitivity analysis in observational research.

  • Cinelli & Hazlett (2020). Making sense of sensitivity.

class morie.sensitivity.EValueResult(point_estimate, e_value_point, e_value_ci, rr, ci_lower, ci_upper, interpretation)[source]

Bases: object

Result from E-value computation.

Parameters:
point_estimate: float
e_value_point: float
e_value_ci: float
rr: float
ci_lower: float
ci_upper: float
interpretation: str
class morie.sensitivity.RosenbaumBounds(gamma_values, p_upper, p_lower, critical_gamma, method, interpretation)[source]

Bases: object

Result from Rosenbaum sensitivity analysis.

Parameters:
gamma_values: ndarray
p_upper: ndarray
p_lower: ndarray
critical_gamma: float
method: str
interpretation: str
class morie.sensitivity.TippingPointResult(delta_values, adjusted_estimates, adjusted_p_values, tipping_point, original_estimate, interpretation)[source]

Bases: object

Result from tipping-point analysis.

Parameters:
delta_values: ndarray
adjusted_estimates: ndarray
adjusted_p_values: ndarray
tipping_point: float
original_estimate: float
interpretation: str
class morie.sensitivity.OmittedVariableBias(estimate, se, rv_q, rv_qa, partial_r2_treatment, benchmark_bounds, interpretation)[source]

Bases: object

Result from omitted variable bias analysis (Cinelli & Hazlett).

Parameters:
estimate: float
se: float
rv_q: float
rv_qa: float
partial_r2_treatment: float
benchmark_bounds: dict[str, tuple[float, float]]
interpretation: str
class morie.sensitivity.SpecificationCurveResult(estimates, ses, p_values, specifications, median_estimate, iqr_lower, iqr_upper, pct_significant, pct_same_sign)[source]

Bases: object

Result from specification curve analysis.

Parameters:
estimates: ndarray
ses: ndarray
p_values: ndarray
specifications: list[dict]
median_estimate: float
iqr_lower: float
iqr_upper: float
pct_significant: float
pct_same_sign: float
morie.sensitivity.e_value_rr(rr, ci_lower=None, ci_upper=None)[source]

Compute E-value for a risk ratio.

The E-value is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both treatment and outcome to fully explain away the observed association.

Parameters:
  • rr (float) – Observed risk ratio.

  • ci_lower (float, optional) – Lower bound of 95% CI for the RR.

  • ci_upper (float, optional) – Upper bound of 95% CI for the RR.

Return type:

EValueResult
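The underlying formula (VanderWeele & Ding, 2017) is simple enough to verify by hand. A standalone sketch of the point-estimate E-value, not the morie implementation:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio: E = RR + sqrt(RR * (RR - 1)).
    For protective effects (RR < 1), apply the formula to 1 / RR."""
    if rr < 1:
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))
```

For example, an observed RR of 2 yields an E-value of about 3.41: a confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to explain the association away.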

morie.sensitivity.e_value_or(odds_ratio, ci_lower=None, ci_upper=None, prevalence=None)[source]

Compute E-value for an odds ratio.

For rare outcomes (prevalence < 15%), the OR approximates the RR. For common outcomes, a correction is applied using the prevalence.

Parameters:
  • odds_ratio (float) – Observed odds ratio.

  • ci_lower (float, optional) – Lower CI bound.

  • ci_upper (float, optional) – Upper CI bound.

  • prevalence (float, optional) – Outcome prevalence for OR-to-RR conversion.

Return type:

EValueResult

morie.sensitivity.e_value_hr(hr, ci_lower=None, ci_upper=None)[source]

Compute E-value for a hazard ratio.

Uses the HR-to-RR approximation from VanderWeele (2017).

Parameters:
  • hr (float) – Observed hazard ratio.

  • ci_lower (float, optional) – Lower CI bound.

  • ci_upper (float, optional) – Upper CI bound.

Return type:

EValueResult

morie.sensitivity.e_value_d(d, se=None, n=None)[source]

Compute E-value for a standardized mean difference (Cohen’s d).

Converts d to a risk ratio scale using the VanderWeele & Ding approximation: RR ≈ exp(0.91 * d).

Parameters:
  • d (float) – Standardized mean difference.

  • se (float, optional) – Standard error of d.

  • n (int, optional) – Sample size (used to compute SE if not provided).

Return type:

EValueResult
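Combining the RR ≈ exp(0.91·d) conversion with the standard E-value formula gives a compact sketch (illustration only, assuming the approximation stated above):

```python
import math

def e_value_from_d(d):
    """Approximate E-value for a standardized mean difference:
    convert d to a risk ratio via RR ~= exp(0.91 * d), then apply
    E = RR + sqrt(RR * (RR - 1))."""
    rr = math.exp(0.91 * abs(d))
    return rr + math.sqrt(rr * (rr - 1))
```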

morie.sensitivity.rosenbaum_bounds(treated_outcomes, control_outcomes, gamma_range=None, method='wilcoxon')[source]

Rosenbaum sensitivity analysis for matched pair designs.

Tests how strong hidden bias (Gamma) would need to be to alter the study’s conclusions.

Parameters:
  • treated_outcomes (array-like) – Outcomes for treated units in matched pairs.

  • control_outcomes (array-like) – Outcomes for control units in matched pairs.

  • gamma_range (array-like, optional) – Range of Gamma values to test. Default: 1.0 to 5.0 by 0.25.

  • method (str) – Test method: ‘wilcoxon’ (default), ‘sign’, ‘mcnemar’.

Return type:

RosenbaumBounds
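The logic behind the `'sign'` method can be sketched with the binomial tail: under hidden bias Γ, the probability that the treated unit of a pair has the higher outcome is bounded by Γ/(1+Γ). A stdlib-only illustration (not the morie implementation, which also supports Wilcoxon and McNemar variants):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def rosenbaum_sign_test(n_pairs, n_treated_higher, gamma):
    """Upper-bound p-value for the sign test under hidden bias Gamma."""
    p_plus = gamma / (1 + gamma)
    return binom_sf(n_treated_higher, n_pairs, p_plus)
```

At Γ = 1 (no hidden bias) this reduces to the ordinary sign test; the critical Γ is where the upper-bound p-value first crosses α.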

morie.sensitivity.tipping_point_analysis(estimate, se, n_treated, n_control, delta_range=None, outcome_type='continuous')[source]

Tipping-point analysis for missing data sensitivity.

Evaluates how much the treatment effect changes if missing outcomes are systematically different from observed outcomes.

Parameters:
  • estimate (float) – Observed treatment effect estimate.

  • se (float) – Standard error of the estimate.

  • n_treated (int) – Number of treated units.

  • n_control (int) – Number of control units.

  • delta_range (array-like, optional) – Range of bias parameters to evaluate.

  • outcome_type (str) – ‘continuous’ or ‘binary’.

Return type:

TippingPointResult
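The core delta-adjustment idea can be sketched as a shift of the estimate followed by a significance check. This is an illustration of the concept only; morie's handling of the missing-data fractions (`n_treated`, `n_control`, `outcome_type`) is more detailed:

```python
def tipping_point(estimate, se, deltas, z=1.96):
    """Return the smallest shift delta that renders the (shifted)
    estimate non-significant, along with the shifted estimate."""
    for d in deltas:
        adjusted = estimate - d
        if abs(adjusted) / se < z:
            return d, adjusted
    return None, None
```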

morie.sensitivity.omitted_variable_bias(estimate, se, dof, r2_yd_x, partial_r2_treatment, q=1.0, alpha=0.05, benchmark_covariates=None)[source]

Omitted variable bias analysis (sensemakr framework).

Implements the Cinelli & Hazlett (2020) approach to assess how much an unobserved confounder would need to explain to change the conclusion.

Parameters:
  • estimate (float) – Treatment coefficient estimate.

  • se (float) – Standard error of the estimate.

  • dof (int) – Residual degrees of freedom.

  • r2_yd_x (float) – Partial R-squared of treatment with outcome (controlling for X).

  • partial_r2_treatment (float) – Same as r2_yd_x (for clarity).

  • q (float) – Fraction of the estimate to be explained away. Default 1.0 (full).

  • alpha (float) – Significance level.

  • benchmark_covariates (dict, optional) – Dictionary mapping covariate names to their partial R-squared with the outcome. Used to generate benchmark bounds.

Return type:

OmittedVariableBias

morie.sensitivity.specification_curve(data, outcome, treatment, covariate_sets, sample_filters=None, model_types=None, alpha=0.05)[source]

Run a specification curve analysis.

Estimates the treatment effect across many reasonable model specifications to assess robustness.

Parameters:
  • data (pd.DataFrame) – Analysis dataset.

  • outcome (str) – Outcome variable name.

  • treatment (str) – Treatment variable name.

  • covariate_sets (list of list[str]) – Different sets of covariates to try.

  • sample_filters (list of (name, callable), optional) – Different sample restrictions to try.

  • model_types (list[str], optional) – Model types: ‘ols’, ‘logistic’, ‘robust’. Default: [‘ols’].

  • alpha (float) – Significance level.

Return type:

SpecificationCurveResult
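The spirit of the procedure fits in a few lines: loop over covariate sets, fit each model, and summarise the resulting distribution of treatment coefficients. A numpy-only OLS sketch on synthetic data (illustration only; the hypothetical `ols_coef` helper is not part of morie):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=(2, n))
t = (rng.random(n) < 0.5).astype(float)   # randomized binary treatment
y = 2.0 * t + 0.5 * x1 + rng.normal(size=n)  # true effect = 2

def ols_coef(y, cols):
    """OLS via lstsq; return the coefficient on the first regressor."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# one specification per covariate set; treatment always enters first
covariate_sets = [[], [x1], [x2], [x1, x2]]
estimates = [ols_coef(y, [t] + cs) for cs in covariate_sets]
median_estimate = float(np.median(estimates))
```

With a randomized treatment, all specifications should cluster around the true effect; a specification curve that does not cluster is the warning sign.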

morie.sensitivity.manski_bounds(outcome_treated, outcome_control, p_treated, outcome_range=None)[source]

Compute Manski worst-case bounds for the ATE.

Under no assumptions about selection, the ATE is only partially identified.

Parameters:
  • outcome_treated (array-like) – Outcomes for treated units.

  • outcome_control (array-like) – Outcomes for control units.

  • p_treated (float) – Proportion treated.

  • outcome_range (tuple[float, float], optional) – Logical bounds on the outcome. Default: (0, 1) for binary.

Returns:

With keys: lower_bound, upper_bound, point_estimate, width.

Return type:

dict
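The worst-case construction imputes the logical extremes of the outcome for the unobserved potential outcomes. A sketch for a bounded outcome (illustration only, not the morie implementation):

```python
import numpy as np

def manski_ate_bounds(y_t, y_c, p, lo=0.0, hi=1.0):
    """No-assumption bounds on the ATE for an outcome in [lo, hi].
    E[Y(1)] is bracketed by imputing lo/hi for untreated units,
    and symmetrically for E[Y(0)]."""
    ey1_lo = p * np.mean(y_t) + (1 - p) * lo
    ey1_hi = p * np.mean(y_t) + (1 - p) * hi
    ey0_lo = (1 - p) * np.mean(y_c) + p * lo
    ey0_hi = (1 - p) * np.mean(y_c) + p * hi
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo
```

Note the bound width always equals `hi - lo` regardless of the data, which is exactly why these bounds are a baseline rather than an end point.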

morie.sensitivity.bias_adjusted_estimate(estimate, se, rr_ud, rr_eu, prevalence_confounder=0.5)[source]

Compute bias-adjusted treatment effect.

Uses the Ding & VanderWeele (2016) bias formula.

Parameters:
  • estimate (float) – Observed treatment effect (on log-RR or coefficient scale).

  • se (float) – Standard error.

  • rr_ud (float) – Risk ratio relating confounder to outcome.

  • rr_eu (float) – Risk ratio relating treatment to confounder.

  • prevalence_confounder (float) – Prevalence of the confounder in the population.

Returns:

With keys: adjusted_estimate, bias, adjusted_ci_lower, adjusted_ci_upper.

Return type:

dict
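The prevalence-free special case of the Ding & VanderWeele bound divides the observed risk ratio by the bounding factor B = (RR_UD · RR_EU)/(RR_UD + RR_EU − 1). A sketch of that special case only (the morie function additionally uses the confounder prevalence):

```python
def bias_adjusted_rr(rr_obs, rr_ud, rr_eu):
    """Most the observed RR could shrink toward the null under a
    confounder with the given association strengths (Ding &
    VanderWeele bounding factor, prevalence-free form)."""
    b = (rr_ud * rr_eu) / (rr_ud + rr_eu - 1)
    return rr_obs / b
```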

morie.sensitivity.probabilistic_bias_analysis(estimate, se, n_simulations=10000, bias_parms=None, seed=42)[source]

Probabilistic (Monte Carlo) sensitivity analysis.

Draws bias parameters from specified prior distributions and computes the distribution of bias-adjusted estimates.

Parameters:
  • estimate (float) – Observed estimate.

  • se (float) – Standard error.

  • n_simulations (int) – Number of Monte Carlo draws.

  • bias_parms (dict, optional) – Parameters as (mean, sd) tuples for: ‘rr_ud’, ‘rr_eu’, ‘prevalence’.

  • seed (int) – Random seed.

Returns:

Summary of bias-adjusted estimate distribution.

Return type:

dict

morie.sensitivity.sensitivity_summary(estimate, se, rr=None, odds_ratio=None, hazard_ratio=None, prevalence=None)[source]

Generate a comprehensive sensitivity analysis summary.

Computes E-values, tipping points, and robustness metrics for a single treatment effect estimate.

Parameters:
  • estimate (float) – Treatment effect estimate.

  • se (float) – Standard error.

  • rr (float, optional) – Risk ratio (if applicable).

  • odds_ratio (float, optional) – Odds ratio (if applicable).

  • hazard_ratio (float, optional) – Hazard ratio (if applicable).

  • prevalence (float, optional) – Outcome prevalence.

Returns:

Summary table of sensitivity metrics.

Return type:

pd.DataFrame

Survey + descriptive statistics

Survey-weighted estimation wrappers. Parallels the survey and srvyr packages in R.

This module provides:

  1. SurveyDesign – encapsulates survey data with weights and strata.

  2. Horvitz-Thompson and Hájek estimators for population totals and means.

  3. Ratio estimator for auxiliary-variable-assisted estimation.

  4. Post-stratification and calibration (raking) weight adjustments.

  5. Domain/subpopulation estimation with design-based standard errors.

  6. Complex-survey GLM with weights, clustering, and stratification.

Statistical assumptions and caveats

  • All functions assume the supplied weights are probability/analytic weights (i.e., w_i = 1 / pi_i where pi_i is the first-order inclusion probability). They are NOT frequency or expansion weights.

  • Standard errors under complex sampling are approximated by Taylor linearisation or the jackknife/bootstrap where noted.

  • For production survey analysis, validate against the R survey package (Lumley, 2010) or Stata’s svy prefix.

References

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.

Horvitz, D. G., & Thompson, D. J. (1952). A generalisation of sampling without replacement from a finite universe. JASA, 47(260), 663-685.

Hájek, J. (1971). Comment on “An essay on the logical foundations of survey sampling.” In Godambe & Sprott (Eds.), Foundations of Statistical Inference. Holt, Rinehart and Winston.

class morie.survey.SurveyDesign(data, weights_col, strata_col=None)[source]

Bases: object

Encapsulates survey data with its corresponding sampling weights and strata.

__init__(data, weights_col, strata_col=None)[source]

Initialize the survey design object.

Parameters:
  • data (pandas.DataFrame) – The pandas DataFrame containing the survey data.

  • weights_col (str) – The column name corresponding to the survey weights.

  • strata_col (str, optional) – The column name indicating survey strata, defaults to None.

weighted_mean(variable)[source]

Compute the survey-weighted mean of a continuous variable.

Parameters:

variable (str) – The name of the variable to average.

Returns:

The weighted average.

Return type:

float

svyglm(formula, family=None)[source]

Fit a survey-weighted generalized linear model.

Survey probability weights (such as CPADS weight / wtpumf) are analytic / probability weights, not frequency expansion weights. Using freq_weights would incorrectly treat each weight as a replication count, inflating the effective sample size and producing anti-conservative standard errors. The correct statsmodels parameter for analytic probability weights in a GLM is var_weights, which scales the variance of each observation proportionally to its weight without artificially expanding n.

Standard errors returned by statsmodels GLM with var_weights are model-based (sandwich-free); for fully robust inference use fit(cov_type='HC3') on the returned object.

Parameters:
  • formula (str) – A patsy-style model formula (e.g. "y ~ x1 + x2").

  • family (statsmodels family instance, optional) – GLM family; defaults to Gaussian when None.
Returns:

A fitted statsmodels GLM result object.

Return type:

statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

References

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley. (Chapter 2 – probability weights vs. frequency weights.)

morie.survey.horvitz_thompson_total(y, inclusion_probs)[source]

Horvitz-Thompson estimator of population total.

The HT estimator is:

\[\hat{\tau}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}\]

where \(\pi_i\) is the first-order inclusion probability for unit i.

The variance estimator used here assumes approximately independent inclusions (i.e., pairwise inclusion probabilities \(\pi_{ij} \approx \pi_i \pi_j\)), which yields:

\[\hat{V}(\hat{\tau}_{HT}) \approx \sum_i \frac{y_i^2 \, (1 - \pi_i)}{\pi_i^2}\]

This is the standard Horvitz-Thompson variance under with-replacement sampling and is conservative under without-replacement designs.

Parameters:
  • y (ndarray) – Observed response values for sampled units (1-D array-like).

  • inclusion_probs (ndarray) – First-order inclusion probabilities pi_i for each sampled unit (values in (0, 1]).

Returns:

dict with keys total, se, ci_lower, ci_upper.

Raises:

ValueError – If y and inclusion_probs have different lengths, or any inclusion probability is <= 0 or > 1.

Return type:

dict

References

Horvitz, D. G., & Thompson, D. J. (1952). A generalisation of sampling without replacement from a finite universe. JASA, 47(260), 663-685.

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Chapter 9A.)
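The point estimate itself is a one-liner; a numpy sketch matching the formula above (illustration only, without the variance machinery):

```python
import numpy as np

def ht_total(y, pi):
    """Horvitz-Thompson total: sum of y_i / pi_i over sampled units."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    if np.any((pi <= 0) | (pi > 1)):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return float(np.sum(y / pi))
```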

morie.survey.hajek_mean(y, weights)[source]

Hájek estimator for population mean (ratio estimator).

The Hájek estimator normalises the HT estimator by the estimated population size, eliminating sensitivity to total weight magnitude:

\[\bar{y}_H = \frac{\sum_i w_i y_i}{\sum_i w_i}\]

where \(w_i = 1 / \pi_i\) are the survey weights.

Standard error is computed via Taylor (delta-method) linearisation:

\[\hat{V}(\bar{y}_H) \approx \frac{1}{N^2} \sum_i w_i^2 (y_i - \bar{y}_H)^2 \cdot \frac{1 - \pi_i}{\pi_i}\]

Under the with-replacement approximation (dropping the finite-population factor, with \(\pi_i = 1/w_i\)) this simplifies to the weighted sample variance divided by the effective sample size.

Parameters:
  • y (ndarray) – Response values (1-D array-like).

  • weights (ndarray) – Survey weights w_i = 1/pi_i (must be > 0).

Returns:

dict with keys mean, se, ci_lower, ci_upper.

Raises:

ValueError – If y and weights have different lengths or weights <= 0.

Return type:

dict

References

Hájek, J. (1971). Comment in Godambe & Sprott (Eds.), Foundations of Statistical Inference. Holt, Rinehart and Winston.

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Section 6.13.)
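The Hájek point estimate is just the weight-normalised mean; a numpy sketch of the formula above (illustration only):

```python
import numpy as np

def hajek(y, w):
    """Hajek mean: sum(w * y) / sum(w), with w_i = 1 / pi_i."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    if np.any(w <= 0):
        raise ValueError("weights must be strictly positive")
    return float(np.sum(w * y) / np.sum(w))
```

Because the weights are normalised away, multiplying all weights by a constant leaves the estimate unchanged.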

morie.survey.ratio_estimator(y, x, weights, X_population_total)[source]

Survey ratio estimator for a population total.

The ratio estimator exploits auxiliary information X whose population total is known:

\[\hat{Y}_R = \hat{R} \cdot X_{\text{pop}}\]

where \(\hat{R} = \hat{Y}_{HT} / \hat{X}_{HT}\) is the estimated ratio of totals.

Standard error via Taylor linearisation:

\[\hat{V}(\hat{Y}_R) \approx \frac{1}{\hat{X}_{HT}^2} \sum_i w_i^2 (y_i - \hat{R} x_i)^2\]
Parameters:
  • y (ndarray) – Response variable values for sampled units (1-D array-like).

  • x (ndarray) – Auxiliary variable values for sampled units (same length as y).

  • weights (ndarray) – Survey weights w_i (must be > 0).

  • X_population_total (float) – Known population total of the auxiliary variable.

Returns:

dict with keys ratio, total_estimate, se, ci_lower, ci_upper.

Raises:

ValueError – If inputs have different lengths, weights <= 0, or X_population_total <= 0.

Return type:

dict

References

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Section 6.4.)

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
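The estimated ratio and total follow directly from the formulas above; a numpy sketch of the point estimate only (no standard error):

```python
import numpy as np

def ratio_total(y, x, w, x_pop_total):
    """Ratio estimator: R_hat = sum(w*y) / sum(w*x); Y_hat = R_hat * X_pop."""
    y, x, w = (np.asarray(a, float) for a in (y, x, w))
    r_hat = np.sum(w * y) / np.sum(w * x)
    return float(r_hat * x_pop_total)
```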

morie.survey.poststratification_weights(df, strata_col, population_counts)[source]

Compute post-stratification weights so the sample distribution matches the known population distribution.

For each stratum h:

\[w_i^{\text{PS}} = \frac{N_h / N}{n_h / n}\]

where \(N_h\) is the known population count in stratum h, \(N\) is the total population, \(n_h\) is the sample count in stratum h, and \(n\) is the total sample size.

Parameters:
  • df (DataFrame) – Input DataFrame.

  • strata_col (str) – Column name identifying strata.

  • population_counts (dict[str, int]) – Dict mapping each stratum label (as string) to its population count.

Returns:

pd.Series of post-stratification weights, indexed like df.

Raises:

ValueError – If strata_col not in df, any stratum in df is absent from population_counts, or sample stratum count is zero.

Return type:

Series

References

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley. (Chapter 7.)

Little, R. J. A. (1993). Post-stratification: A modeler’s perspective. JASA, 88(423), 1001-1012.
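The weight formula maps each unit to the ratio of its stratum’s population share to its sample share. A pandas sketch of that mapping (illustration only; the hypothetical `ps_weights` helper is not part of morie):

```python
import pandas as pd

def ps_weights(df, strata_col, pop_counts):
    """w_i = (N_h / N) / (n_h / n) for each unit in stratum h."""
    n, N = len(df), sum(pop_counts.values())
    n_h = df[strata_col].value_counts()
    return df[strata_col].map(lambda h: (pop_counts[h] / N) / (n_h[h] / n))

# stratum A is over-sampled (3 of 4 units for half the population),
# so its units are down-weighted and B's are up-weighted
df = pd.DataFrame({"stratum": ["A"] * 3 + ["B"] * 1})
w = ps_weights(df, "stratum", {"A": 50, "B": 50})
```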

morie.survey.calibration_weights(df, aux_vars, population_totals, *, max_iter=50, tol=1e-06)[source]

Raking calibration (iterative proportional fitting) to match known population marginal totals.

For each auxiliary variable x_j, the calibration adjusts weights multiplicatively so that:

\[\sum_i w_i x_{ij} = T_j\]

where \(T_j\) is the known population total for variable j.

IPF convergence is assessed by the maximum relative deviation across all margins at each iteration.

Parameters:
  • df (DataFrame) – Input DataFrame with auxiliary variable columns.

  • aux_vars (list[str]) – List of column names to calibrate on.

  • population_totals (dict[str, float]) – Dict mapping column name to known population total.

  • max_iter (int) – Maximum number of IPF iterations. Default 50.

  • tol (float) – Convergence tolerance (maximum relative deviation). Default 1e-6.

Returns:

pd.Series of calibrated weights.

Raises:

ValueError – If any aux_var is missing from df or population_totals, or initial weights are zero.

Return type:

Series

References

Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. JASA, 87(418), 376-382.

Deming, W. E., & Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11(4), 427-444.
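For 0/1 indicator margins the IPF loop is short: rescale the weights of each margin’s members to hit its total, cycle until the maximum relative deviation falls below tolerance. A numpy sketch of that loop (illustration only; the hypothetical `rake` helper is not morie’s implementation, which works on DataFrame columns):

```python
import numpy as np

def rake(w, margins, targets, max_iter=50, tol=1e-6):
    """Raking for 0/1 indicator margins.
    margins: list of 0/1 arrays; targets: known population totals."""
    w = np.asarray(w, float).copy()
    for _ in range(max_iter):
        for m, t in zip(margins, targets):
            cur = np.sum(w * m)
            w = np.where(m == 1, w * (t / cur), w)  # hit this margin exactly
        # convergence: worst relative deviation after a full pass
        max_dev = max(abs(np.sum(w * m) - t) / t
                      for m, t in zip(margins, targets))
        if max_dev < tol:
            break
    return w
```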

morie.survey.subpopulation_estimate(df, domain_col, domain_value, outcome_col, weight_col)[source]

Design-based estimate for a domain (subpopulation) mean.

Treats domain membership as a 0/1 indicator and applies the Hájek estimator within the full-sample design frame (i.e., does NOT subset then reweight, which would ignore the variance contribution from domain membership uncertainty).

Domain mean:

\[\bar{y}_d = \frac{\sum_{i: d_i = 1} w_i y_i}{\sum_{i: d_i = 1} w_i}\]

SE via linearisation (Woodruff, 1971):

\[\hat{V}(\bar{y}_d) \approx \frac{1}{N_d^2} \sum_i w_i^2 z_i^2\]

where \(z_i = d_i (y_i - \bar{y}_d)\) and \(N_d = \sum_{i: d_i=1} w_i\).

Parameters:
  • df (DataFrame) – Full sample DataFrame (do NOT pre-filter to domain).

  • domain_col (str) – Column name identifying the domain.

  • domain_value – Value of domain_col that identifies the target subpopulation.

  • outcome_col (str) – Column name of the outcome variable.

  • weight_col (str) – Column name of survey weights.

Returns:

dict with keys mean, se, ci_lower, ci_upper, n_domain.

Raises:

ValueError – If required columns are missing.

Return type:

dict

References

Woodruff, R. S. (1971). A simple method for approximating the variance of a complicated estimate. JASA, 66(334), 411-414.

Lumley, T. (2010). Complex Surveys. Wiley. (Section 4.2.)

morie.survey.complex_survey_glm(df, formula, weight_col, *, family='gaussian', cluster_col=None, strata_col=None)[source]

Fit a GLM with complex survey design (weights, optional clustering, strata).

Model is fit using statsmodels WLS/GLM with analytic var_weights (not freq_weights – see SurveyDesign.svyglm() for the rationale).

When cluster_col is provided, cluster-robust (sandwich) standard errors are applied via fit(cov_type='cluster', cov_kwds={'groups': ...}).

Strata support is informational only at this level; stratum-specific estimates require subpopulation_estimate().

Supported families (string): "gaussian", "binomial", "poisson", "gamma", "negativebinomial".

Parameters:
  • df (DataFrame) – Input DataFrame.

  • formula (str) – Patsy-compatible formula string (e.g. "y ~ x1 + x2").

  • weight_col (str) – Column name of analytic/probability survey weights.

  • family (str) – GLM family string. Default "gaussian".

  • cluster_col (str | None) – Column identifying clusters for robust SE. Default None.

  • strata_col (str | None) – Column identifying strata (informational only). Default None.

Returns:

Fitted statsmodels GLM result object.

Raises:

ValueError – If required columns are missing or family is unrecognised.

Return type:

object

References

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817-838.
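For the Gaussian family, the weighted fit reduces to WLS, which is OLS on sqrt(weight)-scaled data. A numpy-only sketch of the point estimates on synthetic data (standard errors and clustering, where the real machinery lives, are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)       # analytic survey weights
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true intercept 1, slope 2

# WLS: minimise sum(w * (y - Xb)^2) == OLS on sqrt(w)-scaled rows
X = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```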

Survey sampling, resampling, and design-weight utilities.

This module provides functions for common sampling designs used in epidemiological and public health research: simple random sampling, stratified sampling, cluster sampling, probability-proportional-to-size (PPS) sampling, bootstrap resampling, jackknife variance estimation, and design-weight computation.

All stochastic functions use numpy.random.default_rng() for reproducibility rather than legacy global-state RNGs.

References

Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.

Kish, L. (1965). Survey Sampling. Wiley.

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.

Wolter, K. M. (2007). Introduction to Variance Estimation (2nd ed.). Springer.

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.

morie.sampling.simple_random_sample(df, n, *, replace=False, seed=42)[source]

Draw a simple random sample (SRS) from a DataFrame.

Parameters:
  • df (pandas.DataFrame) – Population frame.

  • n (int) – Number of units to sample.

  • replace (bool, optional) – If True, sample with replacement (default False).

  • seed (int, optional) – Random seed for reproducibility (default 42).

Returns:

Sampled rows with their original index preserved.

Return type:

pandas.DataFrame

Raises:

ValueError – If n exceeds the number of rows when sampling without replacement.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"x": range(100)})
>>> sample = simple_random_sample(df, 10, seed=0)
>>> len(sample)
10

References

Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 2.

morie.sampling.stratified_sample(df, strata_col, n_per_stratum, *, proportional=False, seed=42)[source]

Draw a stratified random sample.

Supports both fixed allocation (same n per stratum) and proportional allocation. When proportional=True, n_per_stratum is interpreted as the total desired sample size and is allocated proportionally across strata.

Parameters:
  • df (pandas.DataFrame) – Population frame.

  • strata_col (str) – Column defining strata.

  • n_per_stratum (int or dict[str, int]) – If proportional=False: fixed integer per stratum (or a dict mapping stratum values to per-stratum sample sizes). If proportional=True: total desired sample size (must be int).

  • proportional (bool, optional) – If True, allocate sample sizes proportionally to stratum sizes (default False).

  • seed (int, optional) – Random seed (default 42).

Returns:

Sampled rows with original index preserved.

Return type:

pandas.DataFrame

Raises:

ValueError – If a requested stratum size exceeds the stratum population.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"stratum": ["A"]*50 + ["B"]*50, "x": range(100)})
>>> sample = stratified_sample(df, "stratum", 10, seed=0)
>>> sample.groupby("stratum").size().to_dict()
{'A': 10, 'B': 10}

References

Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 5.

morie.sampling.cluster_sample(df, cluster_col, n_clusters, *, seed=42)[source]

Single-stage cluster sampling: select clusters, then take all units within.

Clusters are selected by simple random sampling, and every unit in a selected cluster is enumerated (no within-cluster subsampling).

Parameters:
  • df (pandas.DataFrame) – Population frame with a cluster identifier column.

  • cluster_col (str) – Column identifying clusters (e.g., PSU, school, clinic).

  • n_clusters (int) – Number of clusters to select.

  • seed (int, optional) – Random seed (default 42).

Returns:

All rows belonging to the selected clusters.

Return type:

pandas.DataFrame

Raises:

ValueError – If n_clusters exceeds the number of distinct clusters.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "cluster": [1]*10 + [2]*10 + [3]*10,
...     "y": range(30),
... })
>>> sample = cluster_sample(df, "cluster", 2, seed=0)
>>> sample["cluster"].nunique()
2

References

Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 9.

morie.sampling.pps_sample(df, size_col, n, *, seed=42)[source]

Probability-proportional-to-size (PPS) sampling.

Draws n units with probability proportional to the values in size_col. Negative or zero size values are excluded with a warning.

Parameters:
  • df (pandas.DataFrame) – Population frame.

  • size_col (str) – Column containing the size measure (must be positive).

  • n (int) – Number of units to draw.

  • seed (int, optional) – Random seed (default 42).

Returns:

Sampled rows.

Return type:

pandas.DataFrame

Raises:

ValueError – If no positive size values remain after filtering.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"pop": [1000, 500, 200, 100], "name": ["A","B","C","D"]})
>>> sample = pps_sample(df, "pop", 2, seed=0)
>>> len(sample)
2

References

Brewer, K. R. W., & Hanif, M. (1983). Sampling with Unequal Probabilities. Springer.
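The core draw can be reproduced with numpy’s Generator directly, using selection probabilities proportional to the size column (a sketch of the sampling step only, not morie’s filtering or warning behaviour):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pop": [1000, 500, 200, 100],
                   "name": list("ABCD")})
rng = np.random.default_rng(0)

# selection probability proportional to size
p = df["pop"] / df["pop"].sum()
idx = rng.choice(df.index, size=2, replace=False, p=p)
sample = df.loc[idx]
```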

morie.sampling.bootstrap_sample(df, n_bootstrap=1000, *, statistic, seed=42)[source]

Bootstrap resampling for any scalar statistic.

Draws n_bootstrap resamples with replacement from df, applies statistic to each, and returns the bootstrap distribution with percentile confidence interval.

Parameters:
  • df (pandas.DataFrame) – Input data.

  • n_bootstrap (int, optional) – Number of bootstrap replicates (default 1000).

  • statistic (callable) – Function f(df) -> float computing the statistic of interest.

  • seed (int, optional) – Random seed (default 42).

Returns:

Keys: mean (float), se (float), ci_lower (float), ci_upper (float), distribution (numpy.ndarray of length n_bootstrap).

Return type:

dict

Examples

>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({"x": np.random.default_rng(0).normal(5, 1, 200)})
>>> result = bootstrap_sample(df, 500, statistic=lambda d: d["x"].mean(), seed=0)
>>> 4.5 < result["mean"] < 5.5
True

References

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.

morie.sampling.jackknife_estimate(df, *, statistic)[source]

Delete-1 jackknife variance estimate for a scalar statistic.

Computes the leave-one-out estimates \(\hat{\theta}_{(-i)}\) and derives the jackknife variance estimate:

\[\widehat{\text{Var}}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(-i)} - \bar{\theta}_{(\cdot)}\right)^2\]
Parameters:
  • df (pandas.DataFrame) – Input data.

  • statistic (callable) – Function f(df) -> float computing the statistic of interest.

Returns:

Keys: estimate (float — full-sample statistic), se (float — jackknife standard error), bias (float — jackknife bias estimate).

Return type:

dict

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> jk = jackknife_estimate(df, statistic=lambda d: d["x"].mean())
>>> abs(jk["estimate"] - 3.0) < 1e-10
True

References

Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353–360.

Tukey, J. W. (1958). Bias and confidence in not-quite large samples (abstract). Annals of Mathematical Statistics, 29, 614.

morie.sampling.compute_design_weights(df, strata_col, population_sizes)[source]

Compute inverse-probability design weights for a stratified sample.

For each stratum h, the design weight is:

\[w_{hi} = \frac{N_h}{n_h}\]

where \(N_h\) is the population size and \(n_h\) is the sample size in stratum h.

Parameters:
  • df (pandas.DataFrame) – Sample data.

  • strata_col (str) – Column defining strata.

  • population_sizes (dict[str, int]) – Mapping of stratum value to population size.

Returns:

Design weights aligned with the DataFrame index.

Return type:

pandas.Series

Raises:

KeyError – If a stratum in df is not found in population_sizes.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"stratum": ["A"]*10 + ["B"]*20})
>>> w = compute_design_weights(df, "stratum", {"A": 1000, "B": 2000})
>>> float(w[df["stratum"] == "A"].iloc[0])
100.0

References

Kish, L. (1965). Survey Sampling, Chapter 2. Wiley.

morie.sampling.effective_sample_size(weights)[source]

Compute the Kish effective sample size.

\[n_{\text{eff}} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}\]
Parameters:

weights (numpy.ndarray or pandas.Series) – Sampling or analytic weights.

Returns:

Effective sample size.

Return type:

float

Examples

>>> import numpy as np
>>> effective_sample_size(np.ones(100))
100.0
>>> ess = effective_sample_size(np.array([1.0, 1.0, 10.0]))
>>> ess < 3.0
True

References

Kish, L. (1965). Survey Sampling, p. 162. Wiley.

morie.sampling.design_effect(weights)[source]

Approximate design effect (DEFF) from weights.

The Kish approximation to the design effect is:

\[\text{DEFF} \approx \frac{n}{n_{\text{eff}}} = \frac{n \sum_i w_i^2}{\left(\sum_i w_i\right)^2}\]

A DEFF of 1.0 indicates no inflation (equal weights); values > 1 indicate variance inflation from unequal weighting.

Parameters:

weights (numpy.ndarray or pandas.Series) – Sampling or analytic weights.

Returns:

Approximate design effect.

Return type:

float

Examples

>>> import numpy as np
>>> design_effect(np.ones(100))
1.0

References

Kish, L. (1965). Survey Sampling, p. 162. Wiley.

Survival analysis toolkit for epidemiological time-to-event research.

This module provides a comprehensive set of tools for censored time-to-event data commonly encountered in cohort studies, clinical trials, and public health surveillance. Implementations are built on numpy, scipy, and pandas; optional integration with lifelines is used where available but never required.

Major components

  • Non-parametric estimators: Kaplan–Meier, Nelson–Aalen

  • Log-rank family tests: standard, Peto–Peto, Gehan–Wilcoxon, Tarone–Ware

  • Cox proportional hazards: partial likelihood estimation with Breslow and Efron tie-handling, Schoenfeld / Cox–Snell / martingale / deviance residuals

  • Parametric models: Exponential, Weibull, log-normal, log-logistic, Gompertz

  • Accelerated failure time (AFT) models

  • Competing risks: Cumulative incidence functions (CIF), Fine–Gray model

  • Discrimination: Concordance index (Harrell’s C)

  • RMST: Restricted mean survival time comparison

References

Collett, D. (2015). Modelling Survival Data in Medical Research (3rd ed.). Chapman & Hall/CRC.

Kalbfleisch, J. D., & Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data (2nd ed.). Wiley.

Fine, J. P., & Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. JASA, 94(446), 496–509.

Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer.

class morie.survival.SurvivalCurve(times, survival, ci_lower, ci_upper, at_risk, events, censored, method='Kaplan-Meier', median_survival=None)[source]

Bases: object

Stores a non-parametric survival or cumulative-hazard estimate.

Parameters:
  • times (ndarray) – Ordered unique event times.

  • survival (ndarray) – Estimated survival probabilities (or cumulative hazard).

  • ci_lower (ndarray) – Lower confidence band.

  • ci_upper (ndarray) – Upper confidence band.

  • at_risk (ndarray) – Number at risk just before each event time.

  • events (ndarray) – Number of events at each event time.

  • censored (ndarray) – Number censored at each event time.

  • method (str) – Estimator name.

  • median_survival (float or None) – Median survival time (first time S(t) <= 0.5).

times: ndarray
survival: ndarray
ci_lower: ndarray
ci_upper: ndarray
at_risk: ndarray
events: ndarray
censored: ndarray
method: str = 'Kaplan-Meier'
median_survival: float | None = None
class morie.survival.LogRankResult(method, test_statistic, p_value, df, n_groups, n_total, extra=<factory>)[source]

Bases: object

Result container for log-rank family tests.

Parameters:
method: str
test_statistic: float
p_value: float
df: int
n_groups: int
n_total: int
extra: dict
class morie.survival.CoxResult(coefficients, standard_errors, hazard_ratios, z_scores, p_values, ci_lower, ci_upper, covariate_names, concordance, log_likelihood, n_events, n_observations, method='Cox PH', baseline_hazard=None, extra=<factory>)[source]

Bases: object

Result container for Cox PH model.

Parameters:
coefficients: ndarray
standard_errors: ndarray
hazard_ratios: ndarray
z_scores: ndarray
p_values: ndarray
ci_lower: ndarray
ci_upper: ndarray
covariate_names: list[str]
concordance: float
log_likelihood: float
n_events: int
n_observations: int
method: str = 'Cox PH'
baseline_hazard: DataFrame | None = None
extra: dict
class morie.survival.ParametricSurvivalResult(distribution, coefficients, log_likelihood, aic, bic, n_observations, n_events, extra=<factory>)[source]

Bases: object

Result container for parametric survival / AFT models.

Parameters:
distribution: str
coefficients: dict
log_likelihood: float
aic: float
bic: float
n_observations: int
n_events: int
extra: dict
class morie.survival.CompetingRiskResult(times, cif, ci_lower, ci_upper, event_of_interest, n_total, method='Aalen-Johansen')[source]

Bases: object

Result container for competing-risk analyses.

Parameters:
times: ndarray
cif: ndarray
ci_lower: ndarray
ci_upper: ndarray
event_of_interest: int
n_total: int
method: str = 'Aalen-Johansen'
morie.survival.kaplan_meier(time, event, confidence=0.95, ci_method='greenwood')[source]

Kaplan–Meier product-limit survival estimator.

Parameters:
  • time (array-like) – Observed times (non-negative).

  • event (array-like) – Event indicator (1 = event, 0 = censored).

  • confidence (float, default 0.95) – Confidence level for survival bands.

  • ci_method (str, default "greenwood") – "greenwood" for Greenwood’s formula or "log-log" for the complementary log-log transformation.

Return type:

SurvivalCurve

References

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. JASA, 53(282), 457–481.
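The product-limit update \(S(t) \leftarrow S(t)(1 - d_t/n_t)\) behind this estimator can be sketched in a few lines of numpy. This is a standalone illustration, not the morie implementation, and it omits the confidence bands:

```python
import numpy as np

def km_sketch(time, event):
    """Minimal Kaplan-Meier product-limit estimate (point estimate only)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    uniq = np.unique(time[event == 1])          # distinct event times
    surv, s = [], 1.0
    for t in uniq:
        n_at_risk = np.sum(time >= t)           # at risk just before t
        d = np.sum((time == t) & (event == 1))  # events at t
        s *= 1.0 - d / n_at_risk                # product-limit update
        surv.append(s)
    return uniq, np.array(surv)

times, probs = km_sketch([1, 2, 2, 3, 4], [1, 1, 0, 1, 1])
# probs -> [0.8, 0.6, 0.3, 0.0]
```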

morie.survival.nelson_aalen(time, event, confidence=0.95)[source]

Nelson–Aalen cumulative hazard estimator.

Parameters:
  • time (array-like) – Observed times.

  • event (array-like) – Event indicator (1 = event, 0 = censored).

  • confidence (float, default 0.95) – Confidence level.

Returns:

The survival field here contains the cumulative hazard H(t), not S(t).

Return type:

SurvivalCurve

morie.survival.logrank_test(time, event, group)[source]

Standard log-rank test (Mantel–Haenszel).

Parameters:
  • time (array-like) – Observed times.

  • event (array-like) – Event indicator (1 = event, 0 = censored).

  • group (array-like) – Group labels identifying the samples to compare.

Return type:

LogRankResult
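For two groups the statistic reduces to \((O_1 - E_1)^2 / V\) with a hypergeometric variance contribution at each event time. A minimal sketch of that computation (illustrative only, not morie's implementation):

```python
import numpy as np
from scipy import stats

def logrank_sketch(time, event, group):
    """Two-group log-rank test: chi-squared statistic on 1 df."""
    time, event, group = map(np.asarray, (time, event, group))
    ref = np.unique(group)[0]                   # first group label
    o_minus_e = var = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == ref)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == ref)).sum()
        o_minus_e += d1 - d * n1 / n            # observed minus expected
        if n > 1:                               # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = o_minus_e**2 / var
    return chi2, stats.chi2.sf(chi2, df=1)

# identical groups -> no evidence against H0
chi2, p = logrank_sketch([1, 2, 3, 1, 2, 3], [1, 1, 1, 1, 1, 1],
                         ["A", "A", "A", "B", "B", "B"])
```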

morie.survival.peto_peto_test(time, event, group)[source]

Peto–Peto modification of the Gehan–Wilcoxon test.

Parameters:
  • time (array-like)

  • event (array-like)

  • group (array-like)

Return type:

LogRankResult

morie.survival.gehan_wilcoxon_test(time, event, group)[source]

Gehan–Wilcoxon test (Breslow test).

Parameters:
  • time (array-like)

  • event (array-like)

  • group (array-like)

Return type:

LogRankResult

morie.survival.tarone_ware_test(time, event, group)[source]

Tarone–Ware weighted log-rank test.

Parameters:
  • time (array-like)

  • event (array-like)

  • group (array-like)

Return type:

LogRankResult

morie.survival.cox_ph(data, duration_col, event_col, covariate_cols, ties='efron', confidence=0.95, penalizer=0.0)[source]

Fit a Cox proportional hazards model.

Uses Newton–Raphson optimisation of the partial likelihood.

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Names of covariate columns.

  • ties (str, default "efron") – Tie handling: "efron" or "breslow".

  • confidence (float, default 0.95) – Confidence level for hazard ratio CIs.

  • penalizer (float, default 0.0) – L2 regularisation strength.

Return type:

CoxResult

morie.survival.schoenfeld_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]

Compute Schoenfeld residuals for testing the proportional hazards assumption.

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Covariates used in the Cox model.

  • cox_result (CoxResult) – A fitted CoxResult from cox_ph().

Returns:

One row per event with columns for each covariate’s Schoenfeld residual and the event time.

Return type:

DataFrame

morie.survival.cox_snell_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]

Compute Cox–Snell residuals.

The Cox–Snell residual for observation i is \(r_i = \hat{H}_0(t_i) \exp(x_i^T \hat{\beta})\).

If the model is correct these should follow a unit-rate exponential distribution.

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Covariates used in the Cox model.

  • cox_result (CoxResult) – A fitted CoxResult from cox_ph().

Return type:

ndarray

morie.survival.martingale_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]

Compute martingale residuals.

\(M_i = \delta_i - r_i^{CS}\) where \(r_i^{CS}\) is the Cox–Snell residual.

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Covariates used in the Cox model.

  • cox_result (CoxResult) – A fitted CoxResult from cox_ph().

Return type:

ndarray

morie.survival.deviance_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]

Compute deviance residuals from a Cox PH model.

\(d_i = \text{sign}(M_i) \sqrt{-2[M_i + \delta_i \log(\delta_i - M_i)]}\)

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Covariates used in the Cox model.

  • cox_result (CoxResult) – A fitted CoxResult from cox_ph().

Return type:

ndarray

morie.survival.test_ph_assumption(data, duration_col, event_col, covariate_cols, cox_result)[source]

Test the proportional hazards assumption via Schoenfeld residual correlation with time.

For each covariate, computes the Pearson correlation between scaled Schoenfeld residuals and event time and reports a p-value.

Parameters:
  • data (DataFrame) – Analysis dataset.

  • duration_col (str) – Column with observed times.

  • event_col (str) – Column with event indicators (1 = event).

  • covariate_cols (list[str]) – Covariates used in the Cox model.

  • cox_result (CoxResult) – A fitted CoxResult from cox_ph().

Returns:

Columns: covariate, rho, chi2, p_value.

Return type:

DataFrame

morie.survival.exponential_model(time, event)[source]

Fit an exponential survival model (constant hazard).

The exponential distribution has hazard \(h(t) = \lambda\) and survival \(S(t) = e^{-\lambda t}\).

Parameters:
  • time (array-like)

  • event (array-like)

Return type:

ParametricSurvivalResult
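With right censoring, the exponential MLE has the closed form \(\hat{\lambda} = \text{(number of events)} / \text{(total follow-up time)}\). The sketch below illustrates this on made-up data (names are illustrative, not morie internals):

```python
import numpy as np

time = np.array([2.0, 4.0, 6.0, 8.0])
event = np.array([1, 1, 0, 1])            # third subject censored

lam_hat = event.sum() / time.sum()        # events per unit person-time
log_lik = event.sum() * np.log(lam_hat) - lam_hat * time.sum()
aic = 2 * 1 - 2 * log_lik                 # one free parameter
# lam_hat -> 3 / 20 = 0.15
```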

morie.survival.weibull_model(time, event)[source]

Fit a Weibull survival model via MLE.

Parameterisation: \(S(t) = \exp(-(t/\lambda)^k)\).

Parameters:
  • time (array-like)

  • event (array-like)

Return type:

ParametricSurvivalResult

morie.survival.lognormal_model(time, event)[source]

Fit a log-normal survival model via MLE.

Parameters:
  • time (array-like)

  • event (array-like)

Return type:

ParametricSurvivalResult

morie.survival.loglogistic_model(time, event)[source]

Fit a log-logistic survival model via MLE.

\(S(t) = 1 / (1 + (t/\alpha)^\beta)\).

Parameters:
  • time (array-like)

  • event (array-like)

Return type:

ParametricSurvivalResult

morie.survival.gompertz_model(time, event)[source]

Fit a Gompertz survival model via MLE.

Hazard: \(h(t) = b \exp(ct)\). Survival: \(S(t) = \exp(-(b/c)(e^{ct} - 1))\).

Parameters:
  • time (array-like)

  • event (array-like)

Return type:

ParametricSurvivalResult

morie.survival.aft_weibull(data, duration_col, event_col, covariate_cols)[source]

Accelerated failure time model with Weibull baseline.

The AFT model specifies \(\log T = \mu + x^T \beta + \sigma W\) where W follows a standard extreme value distribution.

Parameters:
  • data (DataFrame)

  • duration_col (str)

  • event_col (str)

  • covariate_cols (list[str])

Return type:

ParametricSurvivalResult

morie.survival.aft_lognormal(data, duration_col, event_col, covariate_cols)[source]

AFT model with log-normal baseline.

Parameters:
  • data (DataFrame)

  • duration_col (str)

  • event_col (str)

  • covariate_cols (list[str])

Return type:

ParametricSurvivalResult

morie.survival.aft_loglogistic(data, duration_col, event_col, covariate_cols)[source]

AFT model with log-logistic baseline.

Parameters:
  • data (DataFrame)

  • duration_col (str)

  • event_col (str)

  • covariate_cols (list[str])

Return type:

ParametricSurvivalResult

morie.survival.restricted_mean_survival_time(time, event, tau=None, confidence=0.95)[source]

Compute the restricted mean survival time (RMST).

\(\text{RMST}(\tau) = \int_0^\tau \hat{S}(t) \, dt\).

Parameters:
  • time (array-like)

  • event (array-like)

  • tau (float or None) – Restriction time. If None, uses the maximum observed time.

  • confidence (float, default 0.95) – Confidence level.

Returns:

Keys: rmst, se, ci_lower, ci_upper, tau.

Return type:

dict
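Because the Kaplan–Meier estimate is a step function, the integral reduces to a sum of rectangle areas. An illustrative sketch on a hand-made step curve:

```python
import numpy as np

# Kaplan-Meier step curve: S(t) is constant on [times[k], times[k+1])
times = np.array([0.0, 1.0, 2.0, 3.0])     # step start points
surv  = np.array([1.0, 0.75, 0.5, 0.25])   # S(t) on each interval
tau = 4.0                                  # restriction time

edges = np.append(times, tau)              # interval boundaries up to tau
rmst = np.sum(surv * np.diff(edges))       # area under the step curve
# rmst -> 2.5
```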

morie.survival.rmst_difference(time1, event1, time2, event2, tau=None, confidence=0.95)[source]

Compare RMST between two groups.

Parameters:
  • time1 (array-like) – Observed times for group 1.

  • event1 (array-like) – Event indicators for group 1.

  • time2 (array-like) – Observed times for group 2.

  • event2 (array-like) – Event indicators for group 2.

  • tau (float or None) – Common restriction time.

  • confidence (float, default 0.95)

Returns:

Keys: rmst_diff, se, z, p_value, ci_lower, ci_upper.

Return type:

dict

morie.survival.cumulative_incidence_function(time, event, event_of_interest=1, confidence=0.95)[source]

Aalen–Johansen estimator of the cumulative incidence function (CIF) for competing risks.

Parameters:
  • time (array-like) – Observed times.

  • event (array-like) – Event type indicator: 0 = censored, 1 = event of interest, 2+ = competing events.

  • event_of_interest (int, default 1) – Which event code is the event of interest.

  • confidence (float, default 0.95)

Return type:

CompetingRiskResult

morie.survival.fine_gray_weights(time, event, event_of_interest=1)[source]

Compute Fine–Gray IPCW weights for subdistribution hazard modelling.

These weights can be passed to a weighted Cox model to obtain Fine–Gray subdistribution hazard ratios.

Parameters:
  • time (array-like)

  • event (array-like)

  • event_of_interest (int, default 1)

Returns:

Observation-level weights.

Return type:

ndarray

References

Fine, J. P., & Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. JASA, 94(446), 496–509.

morie.survival.concordance_index(event_times, predicted_scores, event_observed)[source]

Harrell’s concordance index (C-statistic) for survival models.

Parameters:
  • event_times (array-like) – Observed times.

  • predicted_scores (array-like) – Model-predicted risk scores (higher = higher risk).

  • event_observed (array-like) – Event indicator (1 = event).

Returns:

Concordance index in [0, 1].

Return type:

float
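Conceptually, Harrell's C is the fraction of usable pairs in which the subject with the shorter survival time has the higher predicted risk. A naive O(n²) sketch (tied event times are handled simplistically here, unlike a production implementation):

```python
from itertools import combinations

def cindex_sketch(event_times, scores, event_observed):
    """Concordant usable pairs / all usable pairs; score ties count 0.5."""
    conc = usable = 0.0
    for i, j in combinations(range(len(event_times)), 2):
        if event_times[j] < event_times[i]:
            i, j = j, i                  # make i the earlier time
        if not event_observed[i]:
            continue                     # earlier time censored: unusable pair
        usable += 1
        if scores[i] > scores[j]:
            conc += 1                    # higher risk failed first: concordant
        elif scores[i] == scores[j]:
            conc += 0.5
    return conc / usable

c = cindex_sketch([1, 2, 3, 4], [4.0, 3.0, 2.0, 1.0], [1, 1, 1, 1])
# c -> 1.0 (predictions perfectly ordered)
```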

morie.survival.hazard_ratio(time, event, group, confidence=0.95)[source]

Estimate the hazard ratio between two groups from a simple Cox model.

Parameters:
  • time (array-like)

  • event (array-like)

  • group (array-like)

  • confidence (float, default 0.95)

Returns:

Keys: hr, ci_lower, ci_upper, p_value, log_hr, se.

Return type:

dict

morie.survival.survival_plot_data(time, event, group=None, confidence=0.95)[source]

Generate a long-format DataFrame suitable for survival curve plotting.

Parameters:
  • time (array-like)

  • event (array-like)

  • group (array-like or None) – Group labels. If None, a single group is assumed.

  • confidence (float, default 0.95)

Returns:

Columns: time, survival, ci_lower, ci_upper, at_risk, group.

Return type:

DataFrame

morie.survival.survival_calibration(predicted_survival, time, event, n_groups=10)[source]

Calibration of predicted survival probabilities.

Divides observations into n_groups quantile bins of predicted risk (deciles by default) and compares predicted vs. observed Kaplan–Meier survival within each bin.

Parameters:
  • predicted_survival (array-like) – Model-predicted survival probabilities at a fixed time horizon.

  • time (array-like)

  • event (array-like)

  • n_groups (int, default 10)

Returns:

Columns: group, predicted_mean, observed_survival, n.

Return type:

DataFrame

morie.survival.landmark_dataset(data, duration_col, event_col, landmark_time)[source]

Create a landmark analysis dataset.

Excludes subjects who experienced an event or were censored before landmark_time and resets the time origin.

Parameters:
  • data (DataFrame)

  • duration_col (str)

  • event_col (str)

  • landmark_time (float)

Returns:

Filtered dataset with adjusted times.

Return type:

DataFrame

morie.survival.left_truncated_km(entry_time, exit_time, event, confidence=0.95)[source]

Kaplan–Meier estimator with left truncation (delayed entry).

Parameters:
  • entry_time (array-like) – Time at which each subject enters the risk set.

  • exit_time (array-like) – Observed failure/censoring time.

  • event (array-like) – Event indicator.

  • confidence (float, default 0.95)

Return type:

SurvivalCurve

morie.survival.turnbull_estimator(left, right, max_iter=200, tol=1e-06)[source]

Turnbull’s self-consistency algorithm for interval-censored data.

Parameters:
  • left (array-like) – Left endpoints of observation intervals.

  • right (array-like) – Right endpoints (use np.inf for right-censored).

  • max_iter (int, default 200) – Maximum EM iterations.

  • tol (float, default 1e-6) – Convergence tolerance.

Returns:

(unique_times, survival_probs).

Return type:

tuple[ndarray, ndarray]

morie.survival.compare_parametric_models(time, event)[source]

Fit all available parametric models and compare via AIC/BIC.

Parameters:
  • time (array-like)

  • event (array-like)

Returns:

Columns: distribution, log_likelihood, aic, bic, n_events.

Return type:

DataFrame

Comprehensive hypothesis testing suite for epidemiological research.

This module provides a unified interface for the full spectrum of frequentist hypothesis tests encountered in public health and biomedical research. Every test function returns a TestResult dataclass containing the test statistic, p-value, degrees of freedom, confidence interval bounds, effect size, and method name so that downstream code can process results programmatically without parsing text output.

Categories of tests

  • Location tests (t-tests, ANOVA, non-parametric rank tests)

  • Association tests (chi-squared, Fisher exact, correlation)

  • Distribution tests (normality, homogeneity of variance, goodness of fit)

  • Proportion tests (one- and two-sample z-tests, Fisher exact)

  • Agreement tests (Cohen’s kappa, Fleiss’ kappa, ICC)

All implementations delegate heavy numerics to scipy.stats and statsmodels where possible; only formulas absent from those libraries are coded from scratch.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.

Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley.

Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). Wiley.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

class morie.statistics.TestResult(method, test_statistic, p_value, df=None, ci_lower=None, ci_upper=None, effect_size=None, estimate=None, n=None, extra=<factory>)[source]

Bases: object

Standardised container for every hypothesis-test result in this module.

Parameters:
  • method (str) – Human-readable name of the statistical test.

  • test_statistic (float) – Value of the test statistic (t, F, chi-squared, U, etc.).

  • p_value (float) – Two-sided p-value (or one-sided where documented).

  • df (float | None) – Degrees of freedom (None for non-parametric tests that do not define df).

  • ci_lower (float | None) – Lower bound of the confidence interval for the estimated parameter.

  • ci_upper (float | None) – Upper bound of the confidence interval for the estimated parameter.

  • effect_size (float | None) – A standardised effect-size measure appropriate to the test.

  • estimate (float | None) – Point estimate of the quantity being tested (mean difference, odds ratio, correlation, etc.).

  • n (int | None) – Total sample size used in the test.

  • extra (dict) – Arbitrary additional outputs (e.g. per-group means, residuals).

method: str
test_statistic: float
p_value: float
df: float | None = None
ci_lower: float | None = None
ci_upper: float | None = None
effect_size: float | None = None
estimate: float | None = None
n: int | None = None
extra: dict
morie.statistics.one_sample_ttest(x, mu0=0.0, confidence=0.95)[source]

One-sample Student’s t-test.

Tests \(H_0: \mu = \mu_0\) against \(H_1: \mu \neq \mu_0\).

Parameters:
  • x (array-like) – Sample observations.

  • mu0 (float, default 0.0) – Hypothesised population mean.

  • confidence (float, default 0.95) – Confidence level for the CI around the sample mean.

Return type:

TestResult
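The module delegates numerics to scipy where possible; an equivalent computation of the statistic and the CI around the sample mean might look like the following (an illustrative sketch, not the actual implementation):

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])
t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)

# 95% CI around the sample mean, matching confidence=0.95
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)
ci_lower, ci_upper = x.mean() + np.array([-1, 1]) * stats.t.ppf(0.975, df=n - 1) * se
```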

morie.statistics.two_sample_ttest(x, y, equal_var=True, confidence=0.95)[source]

Independent two-sample t-test (equal or unequal variance).

Parameters:
  • x (array-like) – First independent sample.

  • y (array-like) – Second independent sample.

  • equal_var (bool, default True) – If False, use Welch’s approximation for unequal variances.

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.welch_ttest(x, y, confidence=0.95)[source]

Welch’s t-test – convenience wrapper for two_sample_ttest(equal_var=False).

Parameters:
  • x (array-like) – First independent sample.

  • y (array-like) – Second independent sample.

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.paired_ttest(x, y, confidence=0.95)[source]

Paired-sample t-test.

Tests \(H_0: \mu_d = 0\) where \(d_i = x_i - y_i\).

Parameters:
  • x (array-like) – First member of each pair.

  • y (array-like) – Second member of each pair (same length as x).

  • confidence (float, default 0.95) – Confidence level for the mean difference CI.

Return type:

TestResult

morie.statistics.one_way_anova(*groups)[source]

One-way between-subjects ANOVA (F-test).

Parameters:

*groups (array-like) – Two or more independent samples.

Returns:

Effect size is \(\eta^2 = SS_{between} / SS_{total}\).

Return type:

TestResult
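The F statistic here matches scipy.stats.f_oneway, and the reported effect size can be reproduced from the sums of squares. A sketch on illustrative data:

```python
import numpy as np
from scipy import stats

g1, g2, g3 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0]), np.array([5.0, 6.0, 7.0])
f_stat, p_value = stats.f_oneway(g1, g2, g3)

# eta-squared = SS_between / SS_total
allx = np.concatenate([g1, g2, g3])
ss_total = np.sum((allx - allx.mean()) ** 2)
ss_between = sum(len(g) * (g.mean() - allx.mean()) ** 2 for g in (g1, g2, g3))
eta_sq = ss_between / ss_total
# f_stat ~ 13.0, eta_sq ~ 0.8125
```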

morie.statistics.two_way_anova(data, outcome, factor_a, factor_b)[source]

Two-way factorial ANOVA via OLS type-II sums of squares.

Parameters:
  • data (DataFrame) – Long-format data.

  • outcome (str) – Name of the dependent-variable column.

  • factor_a (str) – Name of the first factor column.

  • factor_b (str) – Name of the second factor column.

Returns:

The extra dict contains per-factor and interaction F and p values.

Return type:

TestResult

morie.statistics.repeated_measures_anova(data, outcome, subject, within)[source]

One-way repeated-measures ANOVA (sphericity not assumed – uses Greenhouse–Geisser correction).

Parameters:
  • data (DataFrame) – Long-format data with one row per subject-condition.

  • outcome (str) – Name of the outcome column.

  • subject (str) – Name of the subject identifier column.

  • within (str) – Name of the within-subjects factor column.

Return type:

TestResult

morie.statistics.kruskal_wallis(*groups)[source]

Kruskal–Wallis H-test (non-parametric one-way ANOVA).

Parameters:

*groups (array-like) – Two or more independent samples.

Returns:

Effect size is \(\eta^2_H = (H - k + 1) / (N - k)\).

Return type:

TestResult

morie.statistics.friedman_test(*groups)[source]

Friedman test for repeated measures on ranks.

Parameters:

*groups (array-like) – Three or more matched samples (equal length).

Returns:

Effect size is Kendall’s W.

Return type:

TestResult

morie.statistics.chi2_goodness_of_fit(observed, expected=None)[source]

Chi-squared goodness-of-fit test.

Parameters:
  • observed (array-like) – Observed frequency counts.

  • expected (array-like or None) – Expected frequency counts. If None, uniform distribution assumed.

Returns:

Effect size is Cohen’s w.

Return type:

TestResult

morie.statistics.chi2_independence(contingency_table, correction=True)[source]

Chi-squared test of independence for a contingency table.

Parameters:
  • contingency_table (array-like or DataFrame) – An r x c contingency table of observed counts.

  • correction (bool, default True) – Apply Yates’s continuity correction for 2x2 tables.

Returns:

Effect size is Cramer’s V.

Return type:

TestResult
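Cramér's V follows directly from the chi-squared statistic: \(V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}\). A sketch on illustrative counts:

```python
import numpy as np
from scipy import stats

table = np.array([[30, 10],
                  [10, 30]])
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
# chi2 -> 20.0, cramers_v -> 0.5
```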

morie.statistics.mcnemar_test(contingency_table, exact=False)[source]

McNemar’s test for paired nominal data (2x2 table).

Parameters:
  • contingency_table (array-like) – A 2x2 contingency table [[a, b], [c, d]].

  • exact (bool, default False) – If True, use the exact binomial test instead of the chi-squared approximation.

Return type:

TestResult

morie.statistics.cochrans_q(*groups)[source]

Cochran’s Q test for related samples with binary outcomes.

Parameters:

*groups (array-like) – Three or more matched binary (0/1) samples.

Return type:

TestResult

morie.statistics.pearson_correlation(x, y, confidence=0.95)[source]

Pearson product-moment correlation with Fisher z CI.

Parameters:
  • x (array-like) – First continuous variable.

  • y (array-like) – Second continuous variable (same length as x).

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.spearman_correlation(x, y, confidence=0.95)[source]

Spearman rank correlation.

Parameters:
  • x (array-like) – First variable.

  • y (array-like) – Second variable (same length as x).

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.kendall_correlation(x, y)[source]

Kendall’s tau-b rank correlation.

Parameters:
  • x (array-like) – First variable.

  • y (array-like) – Second variable (same length as x).

Return type:

TestResult

morie.statistics.point_biserial_correlation(binary, continuous, confidence=0.95)[source]

Point-biserial correlation between a binary and continuous variable.

Parameters:
  • binary (array-like) – Binary (0/1) variable.

  • continuous (array-like) – Continuous variable.

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.partial_correlation(x, y, covariates, confidence=0.95)[source]

Partial Pearson correlation controlling for covariates.

Uses OLS residualisation: regress both x and y on the covariates, then correlate the residuals.

Parameters:
  • x (array-like) – First variable of interest.

  • y (array-like) – Second variable of interest.

  • covariates (array-like or DataFrame) – Matrix of control variables (n x p).

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.semi_partial_correlation(x, y, covariates)[source]

Semi-partial (part) correlation.

Residualises only x on the covariates and correlates the residual with the raw y.

Parameters:
  • x (array-like) – Variable residualised on the covariates.

  • y (array-like) – Variable correlated with the residualised x (left unadjusted).

  • covariates (array-like or DataFrame) – Matrix of control variables.

Return type:

TestResult

morie.statistics.mann_whitney_u(x, y, alternative='two-sided')[source]

Mann–Whitney U test (Wilcoxon rank-sum test).

Parameters:
  • x (array-like) – First independent sample.

  • y (array-like) – Second independent sample.

  • alternative (str, default "two-sided") – One of "two-sided", "less", "greater".

Returns:

Effect size is rank-biserial correlation \(r = 1 - 2U / (n_1 n_2)\).

Return type:

TestResult
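The rank-biserial effect size can be recovered from scipy's U statistic using the formula above (illustrative data):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 6.0, 7.0, 8.0])

u_stat, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
r_rb = 1 - 2 * u_stat / (len(x) * len(y))   # rank-biserial correlation
# u_stat -> 0.0, r_rb -> 1.0 (complete separation)
```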

morie.statistics.wilcoxon_signed_rank(x, y=None, alternative='two-sided')[source]

Wilcoxon signed-rank test for paired samples or one sample.

Parameters:
  • x (array-like) – If y is None, test whether the distribution of x is symmetric about zero. Otherwise, test paired differences x - y.

  • y (array-like or None) – Second sample for paired test.

  • alternative (str, default "two-sided") – One of "two-sided", "less", "greater".

Return type:

TestResult

morie.statistics.ks_test_one_sample(x, cdf='norm', args=())[source]

One-sample Kolmogorov–Smirnov test.

Parameters:
  • x (array-like) – Sample data.

  • cdf (str, default "norm") – Name of a scipy.stats distribution (e.g. "norm", "expon").

  • args (tuple) – Extra arguments to the CDF (loc, scale, etc.).

Return type:

TestResult

morie.statistics.ks_test_two_sample(x, y)[source]

Two-sample Kolmogorov–Smirnov test.

Parameters:
  • x (array-like) – First independent sample.

  • y (array-like) – Second independent sample.

Return type:

TestResult

morie.statistics.anderson_darling(x, dist='norm')[source]

Anderson–Darling test for a specified distribution family.

Parameters:
  • x (array-like) – Sample data.

  • dist (str, default "norm") – Distribution name ("norm", "expon", "logistic", "gumbel").

Returns:

The extra dict contains critical values and significance levels.

Return type:

TestResult

morie.statistics.levene_test(*groups, center='median')[source]

Levene’s test for equality of variances.

Parameters:
  • *groups (array-like) – Two or more samples.

  • center (str, default "median") – "median" (Brown–Forsythe), "mean", or "trimmed".

Return type:

TestResult

morie.statistics.bartlett_test(*groups)[source]

Bartlett’s test for equality of variances.

Parameters:

*groups (array-like) – Two or more samples.

Return type:

TestResult

morie.statistics.runs_test(x, cutoff=None)[source]

Wald–Wolfowitz runs test for randomness.

Parameters:
  • x (array-like) – Numeric sequence.

  • cutoff (float or None) – If None, uses the median to dichotomise x.

Return type:

TestResult

morie.statistics.shapiro_wilk(x)[source]

Shapiro–Wilk test for normality.

Parameters:

x (array-like) – Sample data (n <= 5000 recommended by scipy).

Return type:

TestResult

morie.statistics.dagostino_pearson(x)[source]

D’Agostino–Pearson omnibus normality test.

Parameters:

x (array-like) – Sample data (n >= 20 recommended).

Return type:

TestResult

morie.statistics.jarque_bera(x)[source]

Jarque–Bera test for normality (based on skewness and kurtosis).

Parameters:

x (array-like) – Sample data.

Return type:

TestResult

morie.statistics.lilliefors_test(x)[source]

Lilliefors test for normality (KS test with estimated mean and SD).

Uses the Kolmogorov–Smirnov statistic with parameters estimated from the data. Critical values and the p-value are approximated via the D’Agostino–Stephens tables or, when statsmodels is available, via its lilliefors implementation.

Parameters:

x (array-like) – Sample data.

Return type:

TestResult

morie.statistics.one_proportion_ztest(count, nobs, value=0.5, confidence=0.95)[source]

One-sample z-test for a proportion.

Parameters:
  • count (int) – Number of successes.

  • nobs (int) – Number of observations.

  • value (float, default 0.5) – Hypothesised proportion under H_0.

  • confidence (float, default 0.95) – Confidence level for the Wilson score interval.

Return type:

TestResult
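The Wilson score interval used for the CI can be sketched as follows (a standalone illustration, not morie's code):

```python
import numpy as np
from scipy import stats

def wilson_ci(count, nobs, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = stats.norm.ppf(0.5 + confidence / 2)
    p = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p + z**2 / (2 * nobs)) / denom
    half = z * np.sqrt(p * (1 - p) / nobs + z**2 / (4 * nobs**2)) / denom
    return centre - half, centre + half

lo, hi = wilson_ci(45, 100)   # sample proportion 0.45
```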

morie.statistics.two_proportion_ztest(count1, nobs1, count2, nobs2, confidence=0.95)[source]

Two-sample z-test for the difference between two proportions.

Parameters:
  • count1 (int) – Number of successes in sample 1.

  • nobs1 (int) – Number of observations in sample 1.

  • count2 (int) – Number of successes in sample 2.

  • nobs2 (int) – Number of observations in sample 2.

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.fisher_exact_test(contingency_table, alternative='two-sided')[source]

Fisher’s exact test for a 2x2 contingency table.

Parameters:
  • contingency_table (array-like) – A 2x2 table of counts.

  • alternative (str, default "two-sided") – "two-sided", "less", or "greater".

Returns:

estimate is the odds ratio.

Return type:

TestResult

morie.statistics.cohens_kappa(rater1, rater2, confidence=0.95)[source]

Cohen’s kappa for inter-rater agreement between two raters.

Parameters:
  • rater1 (array-like) – Categorical ratings from the first rater.

  • rater2 (array-like) – Categorical ratings from the second rater (same length as rater1).

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult

morie.statistics.fleiss_kappa(ratings_matrix)[source]

Fleiss’ kappa for agreement among multiple raters.

Parameters:

ratings_matrix (array-like) – An n x k matrix where rows are subjects and columns are rating categories. Cell (i, j) is the number of raters who assigned subject i to category j.

Return type:

TestResult

References

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
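
The Fleiss (1971) computation reduces to a few array operations on the subject-by-category count matrix; a minimal numpy sketch (not morie’s implementation):

```python
import numpy as np

def fleiss_kappa(M):
    """Fleiss' kappa from an n_subjects x n_categories count matrix.

    Assumes every subject is rated by the same number of raters.
    """
    M = np.asarray(M, dtype=float)
    n_raters = M.sum(axis=1)[0]
    n_subjects = M.shape[0]
    p_j = M.sum(axis=0) / (n_subjects * n_raters)       # category proportions
    # per-subject agreement: fraction of concordant rater pairs
    P_i = np.sum(M * (M - 1), axis=1) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()                                  # observed agreement
    P_e = np.sum(p_j ** 2)                              # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1; values at or below 0 indicate agreement no better than chance.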

morie.statistics.intraclass_correlation(data, targets, raters, ratings, icc_type='ICC3k')[source]

Intraclass correlation coefficient (ICC).

Implements the Shrout & Fleiss (1979) taxonomy:

  • ICC1: One-way random, single measures

  • ICC1k: One-way random, average measures

  • ICC2: Two-way random, single measures

  • ICC2k: Two-way random, average measures

  • ICC3: Two-way mixed, single measures

  • ICC3k: Two-way mixed, average measures

Parameters:
  • data (DataFrame) – Long-format data.

  • targets (str) – Column identifying subjects/targets.

  • raters (str) – Column identifying raters.

  • ratings (str) – Column containing ratings (numeric).

  • icc_type (str, default "ICC3k") – Which ICC form to compute.

Return type:

TestResult

References

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.

morie.statistics.normality_battery(x)[source]

Run all available normality tests on a single sample.

Parameters:

x (array-like) – Sample data.

Returns:

Results from Shapiro–Wilk, D’Agostino–Pearson, Jarque–Bera, and Lilliefors tests.

Return type:

list[TestResult]

morie.statistics.variance_equality_battery(*groups)[source]

Run Levene’s (median) and Bartlett’s tests for homogeneity of variance.

Parameters:

*groups (array-like) – Two or more samples.

Return type:

list[TestResult]

morie.statistics.correlation_matrix(data, method='pearson')[source]

Compute a pairwise correlation matrix with p-values.

Parameters:
  • data (DataFrame) – Numeric columns only.

  • method (str, default "pearson") – "pearson", "spearman", or "kendall".

Returns:

MultiIndex columns with ("r", col) and ("p", col) levels.

Return type:

DataFrame

morie.statistics.auto_test(x, y=None, paired=False, confidence=0.95)[source]

Automatically select and run the most appropriate test.

Decision logic:

  1. If y is None – one-sample t-test against zero.

  2. If paired=True – paired t-test (if normal differences) or Wilcoxon.

  3. If two independent samples – check normality and variance equality; choose between Student’s t, Welch’s t, or Mann–Whitney U.

Parameters:
  • x (array-like) – First sample.

  • y (array-like or None) – Second sample (if applicable).

  • paired (bool, default False) – Whether samples are paired.

  • confidence (float, default 0.95) – Confidence level.

Return type:

TestResult
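
The decision logic above can be mirrored directly with scipy; this sketch uses an assumed normality/variance screen at alpha = 0.05 and is illustrative, not morie’s exact internals:

```python
import numpy as np
from scipy import stats

def auto_test_sketch(x, y=None, paired=False, alpha=0.05):
    """Illustrative re-implementation of the documented selection rules."""
    x = np.asarray(x, dtype=float)
    if y is None:                                   # 1. one-sample t vs zero
        return "one_sample_t", stats.ttest_1samp(x, 0.0)
    y = np.asarray(y, dtype=float)
    if paired:                                      # 2. paired t or Wilcoxon
        d = x - y
        if stats.shapiro(d).pvalue > alpha:
            return "paired_t", stats.ttest_rel(x, y)
        return "wilcoxon", stats.wilcoxon(x, y)
    # 3. independent samples: screen normality, then variance equality
    normal = stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha
    if not normal:
        return "mann_whitney", stats.mannwhitneyu(x, y, alternative="two-sided")
    equal_var = stats.levene(x, y, center="median").pvalue > alpha
    name = "student_t" if equal_var else "welch_t"
    return name, stats.ttest_ind(x, y, equal_var=equal_var)
```

Note that such pretest-then-test pipelines are convenient but do not preserve the nominal error rate exactly; report which branch was taken.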

Comprehensive bootstrap and resampling inference methods.

Provides nonparametric and parametric bootstrap, jackknife, permutation tests, subsampling, and cross-validation resampling for statistical inference in epidemiological and observational studies.

All methods support stratified resampling, cluster-aware resampling, and survey-weighted variants for complex survey designs.

class morie.bootstrap_methods.BootstrapResult(estimate, se, ci_lower, ci_upper, bias, n_boot, method, ci_method, boot_distribution, original_estimate, acceleration=0.0)[source]

Bases: object

Result from a bootstrap procedure.

Parameters:
estimate: float
se: float
ci_lower: float
ci_upper: float
bias: float
n_boot: int
method: str
ci_method: str
boot_distribution: ndarray
original_estimate: float
acceleration: float = 0.0
summary()[source]
Return type:

dict

class morie.bootstrap_methods.JackknifeResult(estimate, se, ci_lower, ci_upper, bias, n, jackknife_estimates, pseudovalues, influence_values)[source]

Bases: object

Result from jackknife estimation.

Parameters:
estimate: float
se: float
ci_lower: float
ci_upper: float
bias: float
n: int
jackknife_estimates: ndarray
pseudovalues: ndarray
influence_values: ndarray
summary()[source]
Return type:

dict

class morie.bootstrap_methods.PermutationTestResult(observed_statistic, p_value, null_distribution, n_permutations, alternative, ci_lower=nan, ci_upper=nan)[source]

Bases: object

Result from a permutation test.

Parameters:
observed_statistic: float
p_value: float
null_distribution: ndarray
n_permutations: int
alternative: str
ci_lower: float = nan
ci_upper: float = nan
class morie.bootstrap_methods.CrossValidationResult(scores, mean_score, se_score, ci_lower, ci_upper, n_folds, metric, fold_sizes)[source]

Bases: object

Result from cross-validation.

Parameters:
scores: ndarray
mean_score: float
se_score: float
ci_lower: float
ci_upper: float
n_folds: int
metric: str
fold_sizes: list[int]
morie.bootstrap_methods.bootstrap(data, statistic, n_boot=2000, ci_level=0.95, ci_method='bca', seed=42, stratify=None, cluster=None)[source]

Nonparametric bootstrap inference.

Parameters:
  • data (array-like) – Input data (1D or 2D).

  • statistic (callable) – Function that computes the statistic from a data array.

  • n_boot (int) – Number of bootstrap replicates.

  • ci_level (float) – Confidence interval level.

  • ci_method (str) – CI method: ‘percentile’, ‘normal’, ‘basic’, ‘bca’, ‘studentized’.

  • seed (int) – Random seed.

  • stratify (array-like, optional) – Stratification variable for stratified bootstrap.

  • cluster (array-like, optional) – Cluster variable for cluster bootstrap.

Return type:

BootstrapResult
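
The simplest of the ci_method options, the percentile interval, can be sketched in a few lines; this illustrates the resampling loop only and omits the BCa and studentized refinements:

```python
import numpy as np

def percentile_bootstrap(data, statistic, n_boot=2000, ci_level=0.95, seed=42):
    """Percentile-CI nonparametric bootstrap (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # resample rows with replacement and recompute the statistic
    boot = np.array([statistic(data[rng.integers(0, n, n)])
                     for _ in range(n_boot)])
    alpha = 1 - ci_level
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return boot.mean(), lo, hi

est, lo, hi = percentile_bootstrap(np.arange(1, 101), np.mean)
```

The percentile method is simple but can be biased for skewed statistics; the ‘bca’ default corrects for bias and acceleration.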

morie.bootstrap_methods.parametric_bootstrap(data, statistic, distribution='normal', n_boot=2000, ci_level=0.95, seed=42, **dist_params)[source]

Parametric bootstrap.

Generates bootstrap samples from a fitted parametric distribution rather than resampling from the data.

Parameters:
  • data (array-like) – Original data (used to fit the distribution).

  • statistic (callable) – Function that computes the statistic.

  • distribution (str) – Distribution: ‘normal’, ‘poisson’, ‘binomial’, ‘exponential’, ‘gamma’.

  • n_boot (int) – Number of bootstrap replicates.

  • ci_level (float) – Confidence interval level.

  • seed (int) – Random seed.

Return type:

BootstrapResult

morie.bootstrap_methods.wild_bootstrap(y, X, statistic_idx=1, n_boot=999, ci_level=0.95, weight_distribution='rademacher', seed=42)[source]

Wild bootstrap for linear regression with heteroskedasticity.

Parameters:
  • y (array-like) – Response variable.

  • X (array-like) – Design matrix (including intercept if desired).

  • statistic_idx (int) – Index of the coefficient to test.

  • n_boot (int) – Number of bootstrap replicates.

  • ci_level (float) – Confidence interval level.

  • weight_distribution (str) – ‘rademacher’ (±1 with equal probability) or ‘mammen’.

  • seed (int) – Random seed.

Return type:

BootstrapResult

morie.bootstrap_methods.block_bootstrap(data, statistic, block_size, n_boot=2000, ci_level=0.95, method='circular', seed=42)[source]

Block bootstrap for dependent data.

Parameters:
  • data (array-like) – Time series or dependent data.

  • statistic (callable) – Function that computes the statistic.

  • block_size (int) – Length of each block.

  • n_boot (int) – Number of bootstrap replicates.

  • ci_level (float) – Confidence interval level.

  • method (str) – ‘moving’ (overlapping moving blocks), ‘circular’ (blocks wrap around the end of the series), or ‘stationary’ (random, geometrically distributed block lengths).

  • seed (int) – Random seed.

Return type:

BootstrapResult
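
The key step is generating block-preserving resample indices; a sketch of the circular variant, where blocks wrap around the end of the series (illustrative, not morie’s code):

```python
import numpy as np

def circular_block_indices(n, block_size, rng):
    """Indices for one circular-block resample of a length-n series."""
    n_blocks = -(-n // block_size)                 # ceil(n / block_size)
    starts = rng.integers(0, n, n_blocks)          # random block starts
    # each block is block_size consecutive indices, wrapping modulo n
    idx = np.concatenate([(s + np.arange(block_size)) % n for s in starts])
    return idx[:n]                                 # trim to original length

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 10, 200))
idx = circular_block_indices(len(series), 20, rng)
resample = series[idx]
```

Resampling whole blocks preserves short-range dependence within each block, which an i.i.d. bootstrap would destroy.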

morie.bootstrap_methods.jackknife(data, statistic, ci_level=0.95)[source]

Delete-one jackknife estimation.

Parameters:
  • data (array-like) – Input data.

  • statistic (callable) – Function that computes the statistic.

  • ci_level (float) – Confidence interval level.

Return type:

JackknifeResult
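
The delete-one estimate, standard error, and bias follow the classic formulas; a self-contained numpy sketch:

```python
import numpy as np

def jackknife_se(data, statistic):
    """Delete-one jackknife: point estimate, SE, and bias (sketch)."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    # recompute the statistic with each observation left out
    theta_i = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_dot = theta_i.mean()
    se = np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))
    bias = (n - 1) * (theta_dot - theta_hat)
    return theta_hat, se, bias

est, se, bias = jackknife_se(np.arange(10, dtype=float), np.mean)
```

For the sample mean the jackknife SE coincides with the usual s/sqrt(n), a handy sanity check.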

morie.bootstrap_methods.delete_d_jackknife(data, statistic, d=2, ci_level=0.95, max_subsets=5000, seed=42)[source]

Delete-d jackknife (generalized jackknife).

Parameters:
  • data (array-like) – Input data.

  • statistic (callable) – Function that computes the statistic.

  • d (int) – Number of observations to delete.

  • ci_level (float) – Confidence interval level.

  • max_subsets (int) – Maximum number of subsets to evaluate.

  • seed (int) – Random seed.

Return type:

JackknifeResult

morie.bootstrap_methods.permutation_test(group1, group2, statistic='mean_diff', n_permutations=9999, alternative='two-sided', seed=42)[source]

Two-sample permutation test.

Parameters:
  • group1 (array-like) – First group.

  • group2 (array-like) – Second group.

  • statistic (str or callable) – Test statistic: ‘mean_diff’, ‘median_diff’, ‘t_stat’, or a callable(g1, g2) -> float.

  • n_permutations (int) – Number of permutations.

  • alternative (str) – ‘two-sided’, ‘greater’, ‘less’.

  • seed (int) – Random seed.

Return type:

PermutationTestResult
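
The test’s core is a relabelling loop with an add-one p-value correction; a minimal two-sided mean-difference sketch (illustrative only):

```python
import numpy as np

def permutation_pvalue(g1, g2, n_permutations=9999, seed=42):
    """Two-sided permutation p-value for the difference in means (sketch)."""
    rng = np.random.default_rng(seed)
    g1, g2 = np.asarray(g1, float), np.asarray(g2, float)
    pooled = np.concatenate([g1, g2])
    n1 = len(g1)
    observed = g1.mean() - g2.mean()
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)             # random relabelling
        if abs(perm[:n1].mean() - perm[n1:].mean()) >= abs(observed):
            count += 1
    return (count + 1) / (n_permutations + 1)      # add-one correction
```

The add-one correction keeps the p-value strictly positive and yields a valid test when the observed arrangement is counted among the permutations.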

morie.bootstrap_methods.paired_permutation_test(x, y, statistic='mean_diff', n_permutations=9999, alternative='two-sided', seed=42)[source]

Paired permutation test (sign-flipping).

Parameters:
  • x (array-like) – First measurement.

  • y (array-like) – Second measurement (paired with x).

  • statistic (str) – ‘mean_diff’ or ‘median_diff’.

  • n_permutations (int) – Number of permutations.

  • alternative (str) – ‘two-sided’, ‘greater’, ‘less’.

  • seed (int) – Random seed.

Return type:

PermutationTestResult

morie.bootstrap_methods.subsampling(data, statistic, subsample_size=None, n_subsamples=1000, ci_level=0.95, seed=42)[source]

Subsampling inference (Politis, Romano & Wolf).

Unlike the bootstrap, subsampling draws without replacement at a smaller sample size, providing valid inference under weaker conditions.

Parameters:
  • data (array-like) – Input data.

  • statistic (callable) – Function that computes the statistic.

  • subsample_size (int, optional) – Size of each subsample. Default: int(n^0.7).

  • n_subsamples (int) – Number of subsamples.

  • ci_level (float) – Confidence interval level.

  • seed (int) – Random seed.

Return type:

BootstrapResult

morie.bootstrap_methods.bootstrap_632(X, y, model_fn, score_fn, n_boot=200, seed=42)[source]

The .632 bootstrap estimator for prediction error.

Parameters:
  • X (array-like, shape (n, p)) – Feature matrix.

  • y (array-like, shape (n,)) – Response.

  • model_fn (callable) – Function that fits a model: model_fn(X_train, y_train) -> model.

  • score_fn (callable) – Function that scores predictions: score_fn(y_true, y_pred) -> float.

  • n_boot (int) – Number of bootstrap replicates.

  • seed (int) – Random seed.

Returns:

With keys: apparent_error, bootstrap_error, error_632, error_632plus.

Return type:

dict

morie.bootstrap_methods.cross_validate(X, y, model_fn, score_fn, n_folds=10, stratify=None, groups=None, seed=42)[source]

K-fold cross-validation.

Parameters:
  • X (array-like, shape (n, p)) – Feature matrix.

  • y (array-like, shape (n,)) – Response.

  • model_fn (callable) – model_fn(X_train, y_train) -> model with .predict().

  • score_fn (callable) – score_fn(y_true, y_pred) -> float.

  • n_folds (int) – Number of folds.

  • stratify (array-like, optional) – Stratification variable for stratified CV.

  • groups (array-like, optional) – Group variable for grouped CV (no group split across folds).

  • seed (int) – Random seed.

Return type:

CrossValidationResult

morie.bootstrap_methods.repeated_cv(X, y, model_fn, score_fn, n_folds=10, n_repeats=10, seed=42)[source]

Repeated K-fold cross-validation.

Parameters:
  • X (array-like) – Feature matrix.

  • y (array-like) – Response.

  • model_fn (callable) – Model fitting function.

  • score_fn (callable) – Scoring function.

  • n_folds (int) – Number of folds per repeat.

  • n_repeats (int) – Number of repetitions.

  • seed (int) – Random seed.

Return type:

CrossValidationResult

morie.bootstrap_methods.leave_one_out_cv(X, y, model_fn, score_fn)[source]

Leave-one-out cross-validation.

Parameters:
  • X (array-like) – Feature matrix.

  • y (array-like) – Response.

  • model_fn (callable) – Model fitting function.

  • score_fn (callable) – Scoring function.

Return type:

CrossValidationResult

Multiple testing correction and multiplicity-adjusted inference.

Provides comprehensive p-value adjustment methods, family-wise error rate (FWER) control, false discovery rate (FDR) control, and simultaneous confidence interval methods for multiple comparisons in epidemiological and biostatistical analyses.

All functions accept arrays of p-values and return adjusted p-values or decision boundaries. Compatible with the MORIE pipeline and can be applied to any collection of hypothesis tests from the analysis modules.

class morie.multiple_testing.AdjustedPValues(original, adjusted, method, alpha, rejected, n_rejected, n_tests, labels=None)[source]

Bases: object

Container for multiplicity-adjusted p-values.

Parameters:
original: ndarray
adjusted: ndarray
method: str
alpha: float
rejected: ndarray
n_rejected: int
n_tests: int
labels: list[str] | None = None
summary()[source]

Return a summary DataFrame.

Return type:

DataFrame

class morie.multiple_testing.GatekeepingResult(stages, overall_rejected, method, alpha)[source]

Bases: object

Result from a gatekeeping or hierarchical testing procedure.

Parameters:
stages: list[dict]
overall_rejected: ndarray
method: str
alpha: float
class morie.multiple_testing.ClosedTestResult(original_p, adjusted_p, rejected, intersection_tests, method)[source]

Bases: object

Result from a closed testing procedure.

Parameters:
original_p: ndarray
adjusted_p: ndarray
rejected: ndarray
intersection_tests: dict
method: str
morie.multiple_testing.bonferroni(p_values, alpha=0.05, labels=None)[source]

Bonferroni correction (FWER control).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.sidak(p_values, alpha=0.05, labels=None)[source]

Sidak correction (FWER control, slightly less conservative than Bonferroni).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.holm(p_values, alpha=0.05, labels=None)[source]

Holm step-down procedure (FWER control).

Uniformly more powerful than Bonferroni.

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.hochberg(p_values, alpha=0.05, labels=None)[source]

Hochberg step-up procedure (FWER control under PRDS).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.hommel(p_values, alpha=0.05, labels=None)[source]

Hommel procedure (FWER control; uniformly more powerful than Hochberg, valid under independence or non-negative dependence).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.holm_sidak(p_values, alpha=0.05, labels=None)[source]

Holm-Sidak step-down procedure.

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.benjamini_hochberg(p_values, alpha=0.05, labels=None)[source]

Benjamini-Hochberg procedure (FDR control).

Controls the false discovery rate at level alpha under independence or positive regression dependency (PRDS).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Target FDR level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues
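
The BH step-up adjustment amounts to scaling sorted p-values by m/rank and enforcing monotonicity; an illustrative numpy sketch:

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (step-up, monotone) - sketch."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)      # p_(i) * m / i
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0, 1)                  # restore input order
    return out

adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A hypothesis is rejected at FDR level alpha exactly when its adjusted p-value is at most alpha.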

morie.multiple_testing.bh(p_values, alpha=0.05, labels=None)

Alias for benjamini_hochberg() (FDR control).

Return type:

AdjustedPValues

morie.multiple_testing.benjamini_yekutieli(p_values, alpha=0.05, labels=None)[source]

Benjamini-Yekutieli procedure (FDR control under arbitrary dependence).

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Target FDR level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.by(p_values, alpha=0.05, labels=None)

Alias for benjamini_yekutieli() (FDR control under arbitrary dependence).

Return type:

AdjustedPValues

morie.multiple_testing.storey_q(p_values, alpha=0.05, lambda_param=0.5, labels=None)[source]

Storey’s q-value procedure (adaptive FDR control).

Estimates the proportion of true null hypotheses (pi0) and uses it to achieve higher power than BH when many nulls are false.

Parameters:
  • p_values (array-like) – Raw p-values.

  • alpha (float) – Target FDR level.

  • lambda_param (float) – Tuning parameter for pi0 estimation.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.local_fdr(p_values, labels=None)[source]

Estimate local false discovery rate from p-values.

Uses an empirical Bayes approach with a two-component mixture model.

Parameters:
  • p_values (array-like) – Raw p-values.

  • labels (list[str], optional) – Labels for each test.

Returns:

With columns: p_value, z_score, local_fdr, label (if provided).

Return type:

pd.DataFrame

morie.multiple_testing.permutation_fwer(test_statistics, null_distribution, alternative='two-sided', alpha=0.05, labels=None)[source]

FWER control via max-T permutation test (Westfall-Young).

Parameters:
  • test_statistics (array-like, shape (m,)) – Observed test statistics.

  • null_distribution (array-like, shape (n_perm, m)) – Null distribution from permutations.

  • alternative (str) – One of ‘two-sided’, ‘greater’, ‘less’.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.permutation_fdr(p_values, null_p_values, alpha=0.05, labels=None)[source]

FDR control via permutation null distribution.

Parameters:
  • p_values (array-like, shape (m,)) – Observed p-values.

  • null_p_values (array-like, shape (n_perm, m)) – P-values from permuted data.

  • alpha (float) – Target FDR level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.fisher_combined(p_values)[source]

Fisher’s method for combining independent p-values.

Parameters:

p_values (array-like) – Independent p-values.

Returns:

(test_statistic, combined_p_value).

Return type:

tuple[float, float]
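
Fisher’s statistic is \(-2 \sum \ln p_i\), compared against a chi-squared distribution with 2m degrees of freedom; a scipy-based sketch of the same computation:

```python
import numpy as np
from scipy import stats

def fisher_combine(p_values):
    """Fisher's chi-squared combination of independent p-values (sketch)."""
    p = np.asarray(p_values, dtype=float)
    stat = -2.0 * np.log(p).sum()
    combined_p = stats.chi2.sf(stat, df=2 * len(p))   # upper-tail probability
    return stat, combined_p

stat, p = fisher_combine([0.1, 0.05, 0.2])
```

scipy.stats.combine_pvalues(method='fisher') performs the identical calculation and can serve as a cross-check.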

morie.multiple_testing.stouffer_combined(p_values, weights=None)[source]

Stouffer’s method for combining independent p-values.

Parameters:
  • p_values (array-like) – Independent p-values.

  • weights (array-like, optional) – Weights for each p-value. If None, equal weights are used.

Returns:

(z_statistic, combined_p_value).

Return type:

tuple[float, float]

morie.multiple_testing.tippett_combined(p_values)[source]

Tippett’s method: combine via minimum p-value.

Parameters:

p_values (array-like) – Independent p-values.

Returns:

(min_p, combined_p).

Return type:

tuple[float, float]

morie.multiple_testing.simes_combined(p_values)[source]

Simes’ test for the global null hypothesis.

Parameters:

p_values (array-like) – P-values (may be dependent under PRDS).

Returns:

(simes_statistic, combined_p).

Return type:

tuple[float, float]

morie.multiple_testing.harmonic_mean_p(p_values)[source]

Harmonic mean p-value for combining dependent tests.

Parameters:

p_values (array-like) – P-values (can be dependent).

Returns:

Harmonic mean p-value.

Return type:

float

morie.multiple_testing.cauchy_combination(p_values, weights=None)[source]

Cauchy combination test (Liu & Xie 2020).

Robust to arbitrary correlation structures.

Parameters:
  • p_values (array-like) – P-values.

  • weights (array-like, optional) – Non-negative weights summing to 1.

Returns:

(test_statistic, combined_p).

Return type:

tuple[float, float]

morie.multiple_testing.hierarchical_bonferroni(p_values_by_family, alpha=0.05, propagate_alpha=True)[source]

Hierarchical (serial gatekeeping) Bonferroni procedure.

Tests are organized in families (stages). A family is tested only if at least one hypothesis in the previous family is rejected.

Parameters:
  • p_values_by_family (list of array-like) – P-values grouped by family/stage.

  • alpha (float) – Overall FWER level.

  • propagate_alpha (bool) – If True, unused alpha is propagated to next family.

Return type:

GatekeepingResult

morie.multiple_testing.fixed_sequence(p_values, alpha=0.05, labels=None)[source]

Fixed-sequence (predetermined order) testing procedure.

Tests are evaluated in order; the procedure stops at the first non-rejection. No multiplicity adjustment needed for tests that are reached.

Parameters:
  • p_values (array-like) – P-values in the predetermined testing order.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.fallback_procedure(p_values, weights, alpha=0.05, labels=None)[source]

Fallback (fixed-sequence with alpha spending) procedure.

Parameters:
  • p_values (array-like) – P-values in testing order.

  • weights (array-like) – Alpha allocation weights (must sum to 1).

  • alpha (float) – Overall significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.estimate_pi0(p_values, method='storey')[source]

Estimate the proportion of true null hypotheses (pi0).

Parameters:
  • p_values (array-like) – Raw p-values.

  • method (str) – Estimation method: ‘storey’ (default), ‘bootstrap’, ‘two_step’.

Returns:

Estimated pi0 in [0, 1].

Return type:

float

morie.multiple_testing.adjust_p_values(p_values, method='bh', alpha=0.05, labels=None)[source]

Convenience dispatcher for p-value adjustment methods.

Parameters:
  • p_values (array-like) – Raw p-values.

  • method (str) – Adjustment method. One of: ‘bonferroni’, ‘sidak’, ‘holm’, ‘hochberg’, ‘hommel’, ‘holm_sidak’, ‘bh’, ‘by’, ‘storey’.

  • alpha (float) – Significance level.

  • labels (list[str], optional) – Labels for each test.

Return type:

AdjustedPValues

morie.multiple_testing.n_effective_tests(p_values=None, correlation_matrix=None, method='galwey')[source]

Estimate the effective number of independent tests.

Useful for determining appropriate Bonferroni correction when tests are correlated.

Parameters:
  • p_values (array-like, optional) – P-values (used for some methods).

  • correlation_matrix (array-like, optional) – Correlation matrix between test statistics.

  • method (str) – ‘galwey’ (eigenvalue-based), ‘li_ji’ (Li & Ji 2005), ‘nyholt’ (Nyholt 2004).

Returns:

Estimated number of effective independent tests.

Return type:

float

Comprehensive effect-size calculations for epidemiological research.

This module provides every major family of effect-size estimators used in biomedical and social-science research, each with analytic or bootstrap confidence intervals. Functions are organised by the type of comparison:

  • Standardised mean differences: Cohen’s d, Hedges’ g, Glass’s delta

  • Common-language effect sizes: CLES / probability of superiority

  • Correlation-based: r, R-squared, eta-squared, partial eta-squared, omega-squared, epsilon-squared

  • Contingency-table measures: odds ratio, risk ratio, risk difference, NNT, NNH, rate ratio, incidence rate difference

  • Association measures: Cohen’s w, Cramer’s V, phi coefficient

  • Non-parametric: rank-biserial correlation, Cliff’s delta, Vargha–Delaney A

  • Regression: standardised coefficients, coefficient of variation

  • Conversion: d <-> r, OR <-> d, etc.

  • Meta-analysis: fixed- and random-effects pooling, I-squared, prediction intervals

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.

Hedges, L. V., & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic Press.

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. Wiley.

Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2), 101–132.

class morie.effect_sizes.EffectSizeResult(measure, estimate, ci_lower=None, ci_upper=None, se=None, n=None, extra=<factory>)[source]

Bases: object

Standardised result for every effect-size calculation.

Parameters:
  • measure (str) – Name of the effect-size statistic.

  • estimate (float) – Point estimate.

  • ci_lower (float | None) – Lower confidence bound.

  • ci_upper (float | None) – Upper confidence bound.

  • se (float | None) – Standard error (analytic or bootstrap).

  • n (int | None) – Sample size used.

  • extra (dict) – Additional outputs.

measure: str
estimate: float
ci_lower: float | None = None
ci_upper: float | None = None
se: float | None = None
n: int | None = None
extra: dict
morie.effect_sizes.cohens_d(x, y, confidence=0.95)[source]

Cohen’s d for independent samples (pooled SD denominator).

\[d = \frac{\bar{x} - \bar{y}}{s_p}\]
Parameters:
  • x (array-like) – First sample.

  • y (array-like) – Second sample.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult
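
The pooled-SD formula above, together with the Hedges correction documented next, is short enough to sketch directly (illustrative, not morie’s code):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with the pooled-SD denominator (sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    # pooled standard deviation s_p
    s_pooled = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                       / (n1 + n2 - 2))
    return (x.mean() - y.mean()) / s_pooled

d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
# Hedges' g applies the small-sample factor J = 1 - 3/(4*df - 1)
g = d * (1 - 3 / (4 * (3 + 3 - 2) - 1))
```

With very small groups the correction matters: here g is 20% smaller than d.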

morie.effect_sizes.hedges_g(x, y, confidence=0.95)[source]

Hedges’ g – bias-corrected Cohen’s d.

Applies the small-sample correction factor \(J = 1 - 3/(4(n_1+n_2-2)-1)\), Hedges’ close approximation to the exact gamma-function correction.

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.glass_delta(x, y, control='y', confidence=0.95)[source]

Glass’s delta – uses the control group SD as denominator.

Parameters:
  • x (array-like)

  • y (array-like)

  • control (str, default "y") – Which group is the control: "x" or "y".

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.cles(x, y, confidence=0.95)[source]

Common Language Effect Size (probability of superiority).

Estimates \(P(X > Y)\) for randomly drawn observations from each group.

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.r_effect_size(x, y, confidence=0.95)[source]

Pearson r as an effect size with Fisher z CI.

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.r_squared(x, y)[source]

Coefficient of determination R-squared.

Parameters:
  • x (array-like)

  • y (array-like)

Return type:

EffectSizeResult

morie.effect_sizes.eta_squared(ss_effect, ss_total)[source]

Eta-squared from ANOVA sums of squares.

\(\eta^2 = SS_{effect} / SS_{total}\)

Parameters:
  • ss_effect (float) – Sum of squares for the effect.

  • ss_total (float) – Total sum of squares.

Return type:

EffectSizeResult

morie.effect_sizes.partial_eta_squared(ss_effect, ss_error)[source]

Partial eta-squared.

\(\eta^2_p = SS_{effect} / (SS_{effect} + SS_{error})\)

Parameters:
  • ss_effect (float) – Sum of squares for the effect.

  • ss_error (float) – Error (residual) sum of squares.

Return type:

EffectSizeResult

morie.effect_sizes.omega_squared(ss_effect, ss_total, df_effect, ms_error)[source]

Omega-squared – less biased than eta-squared.

\(\omega^2 = (SS_{effect} - df_{effect} \cdot MS_{error}) / (SS_{total} + MS_{error})\)

Parameters:
  • ss_effect (float) – Sum of squares for the effect.

  • ss_total (float) – Total sum of squares.

  • df_effect (int) – Degrees of freedom for the effect.

  • ms_error (float) – Mean square error.

Return type:

EffectSizeResult

morie.effect_sizes.epsilon_squared(ss_effect, ss_total, df_effect, ms_error)[source]

Epsilon-squared (Kelley, 1935).

\(\varepsilon^2 = (SS_{effect} - df_{effect} \cdot MS_{error}) / SS_{total}\)

Parameters:
  • ss_effect (float) – Sum of squares for the effect.

  • ss_total (float) – Total sum of squares.

  • df_effect (int) – Degrees of freedom for the effect.

  • ms_error (float) – Mean square error.

Return type:

EffectSizeResult

morie.effect_sizes.odds_ratio(a, b, c, d, confidence=0.95)[source]

Odds ratio for a 2x2 table [[a, b], [c, d]].

\(OR = (a \cdot d) / (b \cdot c)\)

Parameters:
  • a (int) – Events in group 1 (cell [0, 0]).

  • b (int) – Non-events in group 1 (cell [0, 1]).

  • c (int) – Events in group 2 (cell [1, 0]).

  • d (int) – Non-events in group 2 (cell [1, 1]).

  • confidence (float, default 0.95)

Return type:

EffectSizeResult
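
The point estimate and a log-normal (Woolf) confidence interval need only the four cell counts; a standard-library sketch (the Woolf interval is the conventional choice, though morie may use a different method):

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """Odds ratio with a Woolf (log-normal) confidence interval (sketch)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    or_ = (a * d) / (b * c)
    # SE of log(OR): sqrt of the summed reciprocal cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
```

The interval is built on the log scale because log(OR) is approximately normal; zero cells require a continuity correction not shown here.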

morie.effect_sizes.risk_ratio(a, b, c, d, confidence=0.95)[source]

Risk ratio (relative risk) for a 2x2 table.

\(RR = [a/(a+b)] / [c/(c+d)]\)

Parameters:
  • a (int) – Events in group 1.

  • b (int) – Non-events in group 1.

  • c (int) – Events in group 2.

  • d (int) – Non-events in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.risk_difference(a, b, c, d, confidence=0.95)[source]

Risk difference (attributable risk) for a 2x2 table.

\(RD = a/(a+b) - c/(c+d)\)

Parameters:
  • a (int) – Events in group 1.

  • b (int) – Non-events in group 1.

  • c (int) – Events in group 2.

  • d (int) – Non-events in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.number_needed_to_treat(a, b, c, d, confidence=0.95)[source]

Number needed to treat (NNT).

\(NNT = 1 / |RD|\)

Parameters:
  • a (int) – Events in group 1.

  • b (int) – Non-events in group 1.

  • c (int) – Events in group 2.

  • d (int) – Non-events in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.number_needed_to_harm(a, b, c, d, confidence=0.95)[source]

Number needed to harm (NNH) – same as NNT but with reversed sign convention.

Parameters:
  • a (int) – Events in group 1.

  • b (int) – Non-events in group 1.

  • c (int) – Events in group 2.

  • d (int) – Non-events in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.rate_ratio(events1, person_time1, events2, person_time2, confidence=0.95)[source]

Incidence rate ratio.

\(IRR = (e_1/PT_1) / (e_2/PT_2)\)

Parameters:
  • events1 (int or float) – Event count in group 1.

  • person_time1 (int or float) – Person-time at risk in group 1.

  • events2 (int or float) – Event count in group 2.

  • person_time2 (int or float) – Person-time at risk in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.incidence_rate_difference(events1, person_time1, events2, person_time2, confidence=0.95)[source]

Incidence rate difference.

\(IRD = (e_1/PT_1) - (e_2/PT_2)\)

Parameters:
  • events1 (int or float) – Event count in group 1.

  • person_time1 (int or float) – Person-time at risk in group 1.

  • events2 (int or float) – Event count in group 2.

  • person_time2 (int or float) – Person-time at risk in group 2.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.cohens_w(observed, expected=None)[source]

Cohen’s w for chi-squared.

\(w = \sqrt{\chi^2 / N}\)

Parameters:
  • observed (array-like) – Observed frequencies.

  • expected (array-like or None) – Expected frequencies (uniform if None).

Return type:

EffectSizeResult

morie.effect_sizes.cohens_f(eta2)[source]

Cohen’s f from eta-squared.

\(f = \sqrt{\eta^2 / (1 - \eta^2)}\)

Parameters:

eta2 (float) – Eta-squared value.

Return type:

EffectSizeResult

morie.effect_sizes.cramers_v(contingency_table, confidence=0.95)[source]

Cramer’s V for a contingency table.

\(V = \sqrt{\chi^2 / (N \cdot (k - 1))}\) where \(k = \min(r, c)\).

Parameters:
  • contingency_table (array-like or DataFrame)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.phi_coefficient(contingency_table)[source]

Phi coefficient for a 2×2 contingency table.

\(\phi = \sqrt{\chi^2 / N}\)

Parameters:

contingency_table (array-like)

Return type:

EffectSizeResult
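
For a 2×2 table [[a, b], [c, d]], the chi-squared-based phi reduces to a signed closed form, which makes a quick sanity check easy. This is a sketch of that closed form only; phi_coefficient itself works from the chi-squared statistic and returns the unsigned \(\sqrt{\chi^2 / N}\):

```python
import math

def phi_2x2(a, b, c, d):
    """Signed phi for a 2x2 table [[a, b], [c, d]]: (ad - bc) over the
    square root of the product of the four marginal totals."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

phi = phi_2x2(10, 20, 20, 10)  # -> -1/3 (negative association)
```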

morie.effect_sizes.rank_biserial_correlation(x, y, confidence=0.95)[source]

Rank-biserial correlation (matched rank version).

\(r = 1 - 2U / (n_1 n_2)\) where U is the Mann–Whitney statistic.

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.cliffs_delta(x, y, confidence=0.95)[source]

Cliff’s delta (non-parametric effect size).

\(\delta = (\#(x_i > y_j) - \#(x_i < y_j)) / (n_x n_y)\)

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.vargha_delaney_a(x, y, confidence=0.95)[source]

Vargha–Delaney A statistic.

\(A = U / (n_1 n_2)\) where U is the Mann–Whitney statistic (probability that a randomly chosen x exceeds a randomly chosen y).

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult
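
The A statistic has a direct brute-force reading: the proportion of (x, y) pairs where x wins, with ties counted as half (the standard Mann–Whitney tie convention, assumed here):

```python
def vd_a(x, y):
    """Vargha-Delaney A by brute force: P(x > y) + 0.5 * P(x == y) over all pairs."""
    wins = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return wins / (len(x) * len(y))

a = vd_a([2, 3, 4], [1, 1, 1])  # every x beats every y -> A = 1.0
```

A = 0.5 means no stochastic difference between the groups; 0 and 1 are the extremes.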

morie.effect_sizes.standardized_coefficients(X, y)[source]

Compute standardised regression coefficients (beta weights).

Standardises X and y to zero mean and unit variance before OLS.

Parameters:
  • X (array-like or DataFrame) – Predictor matrix (n x p).

  • y (array-like or Series) – Outcome variable.

Returns:

Columns: variable, beta, se, t, p_value.

Return type:

DataFrame

morie.effect_sizes.coefficient_of_variation(x)[source]

Coefficient of variation (CV).

\(CV = s / \bar{x}\)

Parameters:

x (array-like)

Return type:

EffectSizeResult

morie.effect_sizes.variance_ratio(x, y, confidence=0.95)[source]

Variance ratio (F-test for equality of variances).

Parameters:
  • x (array-like)

  • y (array-like)

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.d_to_r(d, n1=None, n2=None)[source]

Convert Cohen’s d to Pearson r.

\(r = d / \sqrt{d^2 + a}\) where \(a = (n_1 + n_2)^2 / (n_1 n_2)\) or a = 4 when sample sizes are unknown.

Parameters:
Return type:

float

morie.effect_sizes.r_to_d(r)[source]

Convert Pearson r to Cohen’s d.

\(d = 2r / \sqrt{1 - r^2}\)

Parameters:

r (float)

Return type:

float

morie.effect_sizes.or_to_d(or_val)[source]

Convert odds ratio to Cohen’s d (Hasselblad & Hedges, 1995).

\(d = \log(OR) \cdot \sqrt{3} / \pi\)

Parameters:

or_val (float) – Odds ratio.

Return type:

float

morie.effect_sizes.d_to_or(d)[source]

Convert Cohen’s d to an odds ratio.

\(OR = \exp(d \pi / \sqrt{3})\)

Parameters:

d (float)

Return type:

float

morie.effect_sizes.or_to_r(or_val)[source]

Convert odds ratio to Pearson r via d.

Parameters:

or_val (float)

Return type:

float

morie.effect_sizes.r_to_or(r)[source]

Convert Pearson r to odds ratio via d.

Parameters:

r (float)

Return type:

float
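
The d/r and d/OR conversion pairs above are exact algebraic inverses (with \(a = 4\) when sample sizes are unknown), so round-trips recover the input. A standalone sketch of the documented formulas, independent of the library functions:

```python
import math

def d_to_r(d, a=4.0):
    """Cohen's d to Pearson r; a = 4 assumes unknown sample sizes."""
    return d / math.sqrt(d * d + a)

def r_to_d(r):
    return 2 * r / math.sqrt(1 - r * r)

def d_to_or(d):
    return math.exp(d * math.pi / math.sqrt(3))

def or_to_d(or_val):
    return math.log(or_val) * math.sqrt(3) / math.pi

# Round-trips are exact up to floating-point error
d = 0.5
assert math.isclose(r_to_d(d_to_r(d)), d)
assert math.isclose(or_to_d(d_to_or(d)), d)
```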

morie.effect_sizes.d_to_nnt(d, base_rate=0.5)[source]

Convert Cohen’s d to NNT given a base rate.

Uses the Kraemer & Kupfer (2006) formula: \(NNT = 1 / (\Phi(d/2 + \Phi^{-1}(CER)) - CER)\)

Parameters:
  • d (float)

  • base_rate (float, default 0.5) – Control event rate.

Return type:

float
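
The Kraemer & Kupfer formula needs only the normal CDF and its inverse, both available in the standard library via statistics.NormalDist. A sketch of the documented formula:

```python
from statistics import NormalDist

def d_to_nnt(d, base_rate=0.5):
    """Kraemer & Kupfer (2006): NNT = 1 / (Phi(d/2 + Phi^-1(CER)) - CER)."""
    z = NormalDist()
    return 1.0 / (z.cdf(d / 2 + z.inv_cdf(base_rate)) - base_rate)

nnt = d_to_nnt(0.5)  # ~10.1: roughly ten treated per additional responder
```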

morie.effect_sizes.fixed_effects_meta(estimates, standard_errors, confidence=0.95)[source]

Fixed-effects (inverse-variance weighted) meta-analytic pooling.

Parameters:
  • estimates (array-like) – Effect-size estimates from k studies.

  • standard_errors (array-like) – Standard errors of the estimates.

  • confidence (float, default 0.95)

Return type:

EffectSizeResult

morie.effect_sizes.random_effects_meta(estimates, standard_errors, confidence=0.95, method='DL')[source]

Random-effects meta-analytic pooling.

Parameters:
  • estimates (array-like)

  • standard_errors (array-like)

  • confidence (float, default 0.95)

  • method (str, default "DL") – Tau-squared estimator: "DL" (DerSimonian–Laird).

Return type:

EffectSizeResult

References

DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188.

morie.effect_sizes.i_squared(estimates, standard_errors)[source]

Compute Higgins’ I-squared heterogeneity statistic.

\(I^2 = \max(0, (Q - (k-1))/Q) \times 100\)

Parameters:
  • estimates (array-like)

  • standard_errors (array-like)

Returns:

Percentage (0–100).

Return type:

float
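
The pooling and heterogeneity helpers above share one pipeline: inverse-variance weights, the fixed-effect pool, Cochran's Q, the DerSimonian–Laird \(\tau^2\), and \(I^2\). A minimal standalone sketch of those documented formulas (not the library code, which also produces confidence intervals):

```python
def meta_summary(estimates, standard_errors):
    """Inverse-variance fixed-effect pool, DL tau^2, and I^2 (percent)."""
    w = [1.0 / se**2 for se in standard_errors]       # inverse-variance weights
    pooled = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    k = len(estimates)
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # DerSimonian-Laird estimator
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, i2

pooled, tau2, i2 = meta_summary([0.2, 0.3, 0.25], [0.1, 0.1, 0.1])
```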

morie.effect_sizes.prediction_interval(estimates, standard_errors, confidence=0.95)[source]

Prediction interval for a new study from a random-effects meta-analysis.

Parameters:
  • estimates (array-like)

  • standard_errors (array-like)

  • confidence (float, default 0.95)

Returns:

(lower, upper) prediction interval bounds.

Return type:

tuple[float, float]

morie.effect_sizes.bootstrap_effect_size_ci(func, *arrays, n_boot=2000, confidence=0.95, seed=42)[source]

Generic bootstrap CI wrapper for any effect-size function.

Parameters:
  • func (callable) – Function that takes one or more arrays and returns a scalar.

  • *arrays (array-like) – Input arrays to bootstrap.

  • n_boot (int, default 2000)

  • confidence (float, default 0.95)

  • seed (int, default 42)

Return type:

EffectSizeResult
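
The wrapper follows the standard percentile-bootstrap recipe: resample with replacement, recompute the statistic, and read the CI off the sorted replicates. A sketch of the single-array case under that assumption (the library version handles multiple arrays and resamples them jointly):

```python
import random
import statistics

def bootstrap_ci(func, data, n_boot=2000, confidence=0.95, seed=42):
    """Percentile bootstrap CI for func(data)."""
    rng = random.Random(seed)
    replicates = sorted(
        func([rng.choice(data) for _ in range(len(data))]) for _ in range(n_boot)
    )
    alpha = (1 - confidence) / 2
    return replicates[int(alpha * n_boot)], replicates[int((1 - alpha) * n_boot) - 1]

lo, hi = bootstrap_ci(statistics.mean, list(range(1, 11)))  # CI straddles 5.5
```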

Datasets

Dataset-agnostic analysis engine: ingest, profile, and plan analyses for any dataset.

This module provides tools to ingest datasets in multiple formats (CSV, TSV, Excel, Parquet, JSON) without prior schema knowledge, infer levels of measurement (NOIR: Nominal, Ordinal, Interval, Ratio) for each column, auto-detect key variable roles (treatment, outcome, weight, stratum, cluster), produce comprehensive dataset profiles, and suggest analysis plans.

The measurement-level inference follows Stevens’ (1946) taxonomy as refined for applied statistical practice:

  • Nominal — unordered categories (e.g., province, gender)

  • Ordinal — ordered categories (e.g., Likert scales, education level)

  • Interval — numeric with no true zero (e.g., temperature, year, index)

  • Ratio — numeric with true zero (e.g., count, weight, income, eBAC)

References

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677

Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.1080/00031305.1993.10475938

class morie.dataset.MeasurementLevel(*values)[source]

Bases: Enum

Stevens’ levels of measurement (NOIR taxonomy).

Level      Properties                                        Example variables
NOMINAL    Identity only (= / !=)                            Province, gender, treatment indicator
ORDINAL    Identity + order (< / >)                          Likert scale, education level, income bracket
INTERVAL   Identity + order + equal intervals, no true zero  Temperature (°C), year, standardized score
RATIO      Identity + order + equal intervals + true zero    Count, weight, income, eBAC

References

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.

NOMINAL = 'nominal'
ORDINAL = 'ordinal'
INTERVAL = 'interval'
RATIO = 'ratio'
class morie.dataset.ColumnProfile(name, dtype, level, n_unique, missing_pct, is_binary, is_constant, suggested_role, summary_stats=<factory>)[source]

Bases: object

Profile of a single column in the dataset.

Variables:
  • name (str) – Column name.

  • dtype (str) – Pandas dtype as a string.

  • level (MeasurementLevel) – Inferred Stevens measurement level.

  • n_unique (int) – Number of distinct non-null values.

  • missing_pct (float) – Percentage of missing values (0.0–100.0).

  • is_binary (bool) – True if column has exactly two unique non-null values.

  • is_constant (bool) – True if column has at most one unique non-null value.

  • suggested_role (str) – One of: "treatment", "outcome", "covariate", "weight", "id", "stratum", "cluster", or "unknown".

  • summary_stats (dict) – For numeric columns: {mean, std, min, q25, median, q75, max}. For categorical columns: {top_counts} as a dict of value->count.

Parameters:
name: str
dtype: str
level: MeasurementLevel
n_unique: int
missing_pct: float
is_binary: bool
is_constant: bool
suggested_role: str
summary_stats: dict
class morie.dataset.DatasetProfile(n_rows, n_cols, columns, suggested_treatment=None, suggested_outcome=None, suggested_weights=None)[source]

Bases: object

Full profile of a dataset including all column profiles and summary diagnostics.

Variables:
  • n_rows (int) – Number of rows.

  • n_cols (int) – Number of columns.

  • columns (dict[str, ColumnProfile]) – Per-column profiles keyed by column name.

  • suggested_treatment (str | None) – Best-guess treatment column, or None if not detected.

  • suggested_outcome (str | None) – Best-guess outcome column, or None if not detected.

  • suggested_weights (str | None) – Best-guess survey-weight column, or None if not detected.

Parameters:
n_rows: int
n_cols: int
columns: dict[str, ColumnProfile]
suggested_treatment: str | None = None
suggested_outcome: str | None = None
suggested_weights: str | None = None
to_dict()[source]

Serialize the profile to a plain dictionary.

Returns:

Nested dictionary representation suitable for JSON serialization.

Return type:

dict

summary_table()[source]

Return a human-readable summary table.

Uses rich Table if available; falls back to a plain-text tabular layout otherwise.

Returns:

Formatted summary string.

Return type:

str

morie.dataset.infer_measurement_level(series, *, ordinal_threshold=10)[source]

Infer the Stevens measurement level (NOIR) for a single column.

Detection rules (applied in order):

  1. dtype object / category + ordinal name heuristic + n_unique <= ordinal_threshold: ORDINAL

  2. dtype object / category (default): NOMINAL

  3. dtype bool: NOMINAL

  4. dtype int/float + is_binary (n_unique <= 2): NOMINAL (binary)

  5. dtype int/float + n_unique <= 20 + ordinal name heuristic: ORDINAL

  6. dtype float + interval name heuristic: INTERVAL

  7. dtype float: RATIO

  8. dtype int + min >= 0: RATIO

  9. dtype int + min < 0: INTERVAL

  10. datetime: INTERVAL

Parameters:
  • series (pandas.Series) – Column data to classify.

  • ordinal_threshold (int, optional) – Maximum number of unique values for a categorical column to be considered ordinal (default 10).

Returns:

Inferred measurement level.

Return type:

MeasurementLevel

Examples

>>> import pandas as pd
>>> infer_measurement_level(pd.Series(["M", "F", "M", "F"]))
<MeasurementLevel.NOMINAL: 'nominal'>
>>> infer_measurement_level(pd.Series([1, 2, 3, 4, 5], name="likert_q1"))
<MeasurementLevel.ORDINAL: 'ordinal'>
morie.dataset.profile_dataset(df, *, hint_treatment=None, hint_outcome=None, hint_weights=None, ordinal_threshold=10, binary_threshold=2)[source]

Fully profile a DataFrame without prior schema knowledge.

Walks every column, infers its measurement level and epidemiological role, computes summary statistics, and auto-detects likely treatment, outcome, and survey-weight columns. User-supplied hints override heuristic detection.

Parameters:
  • df (pandas.DataFrame) – Input data.

  • hint_treatment (str, optional) – If provided, use this column as the treatment indicator.

  • hint_outcome (str, optional) – If provided, use this column as the outcome.

  • hint_weights (str, optional) – If provided, use this column as the survey weight.

  • ordinal_threshold (int, optional) – Maximum unique values for categorical columns to be classified as ordinal (default 10).

  • binary_threshold (int, optional) – Maximum unique values for a column to be considered binary (default 2).

Returns:

Complete dataset profile.

Return type:

DatasetProfile

Raises:
  • TypeError – If df is not a pandas DataFrame.

  • ValueError – If df has zero rows or zero columns.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "treatment": [0, 1, 0, 1],
...     "outcome": [1.2, 3.4, 2.1, 4.5],
...     "age": [25, 30, 35, 40],
... })
>>> profile = profile_dataset(df)
>>> profile.n_rows
4
>>> profile.suggested_treatment
'treatment'
morie.dataset.load_dataset(path, *, encoding='utf-8', **read_kwargs)[source]

Load a dataset from CSV, TSV, Excel, Parquet, or JSON file.

File format is detected from the extension. Supported extensions: .csv, .tsv, .xlsx / .xls, .parquet / .pq, .json / .jsonl.

Parameters:
  • path (str or Path) – File path to the dataset.

  • encoding (str, optional) – Character encoding for text-based formats (default "utf-8").

  • **read_kwargs – Additional keyword arguments forwarded to the pandas reader.

Returns:

Loaded dataset.

Return type:

pandas.DataFrame

Raises:

Examples

>>> df = load_dataset("data.csv")
>>> df = load_dataset("survey.xlsx", sheet_name="wave1")
morie.dataset.suggest_analysis_plan(profile)[source]

Suggest an ordered analysis plan based on the detected variable types.

Uses the inferred measurement levels, binary indicators, and detected treatment/outcome/weight columns to recommend epidemiological analyses.

Parameters:

profile (DatasetProfile) – Output from profile_dataset().

Returns:

Ordered list of suggested analyses. Each dict contains:

  • analysis (str): method identifier (e.g., "ipw_ate")

  • rationale (str): plain-English reason this analysis is suggested

  • required_vars (dict): mapping of role -> column name(s)

Return type:

list[dict[str, Any]]

Examples

>>> plan = suggest_analysis_plan(profile)
>>> plan[0]["analysis"]
'descriptive_profile'
morie.data.morie_db()[source]

Return the path to morie.db, checking the package-bundled location first, then the cache.

Package-bundled DB ships with the installed package (via LFS/setuptools). Cache location is used for development and is gitignored.

Return type:

Path

morie.data.cache_connect(db_path=None)[source]

Open (or create) the MORIE SQLite cache database.

Parameters:

db_path (str | Path | None)

Return type:

Connection

morie.data.cache_store(df, table, db_path=None)[source]

Write a DataFrame to the SQLite cache, replacing any existing table.

Parameters:
Return type:

int

morie.data.cache_load(table, db_path=None)[source]

Load a table from the SQLite cache. Returns None if not cached.

Parameters:
Return type:

DataFrame | None
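
The cache helpers follow a replace-on-write SQLite pattern. A rough sketch of cache_store/cache_list using only the standard library (the real implementation goes through pandas; the two-column schema and the "demo" table name below are illustrative assumptions):

```python
import sqlite3

def cache_store(conn, table, rows):
    """Replace `table` with `rows` (list of (key, value) pairs); return row count."""
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" (key TEXT, value REAL)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    conn.commit()
    return len(rows)

def cache_list(conn):
    """List user tables with row counts, mirroring morie.data.cache_list."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return [{"table": t,
             "rows": conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]}
            for t in tables]

conn = sqlite3.connect(":memory:")
cache_store(conn, "demo", [("a", 1.0), ("b", 2.0)])
```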

morie.data.cache_list(db_path=None)[source]

List all cached tables with row counts.

Parameters:

db_path (str | Path | None)

Return type:

list[dict[str, Any]]

morie.data.fetch_ckan_to_cache(dataset_key='cpads', limit=32000, db_path=None, timeout=60)[source]

Fetch a dataset from CKAN and store it in the SQLite cache.

Parameters:
  • dataset_key (str) – Key in CKAN_DATASETS (e.g., “cpads”, “csads”, “csus”).

  • limit (int) – Max records to fetch from CKAN DataStore API.

  • db_path (str | Path | None) – Override cache database path.

  • timeout (int) – HTTP timeout in seconds.

Returns:

The fetched (and optionally canonicalized) DataFrame.

Return type:

pd.DataFrame

morie.data.load_cpads(db_path=None, timeout=60)[source]

Load CPADS data: try local files, then cache, then CKAN API.

Resolution order:

  1. Local RDS (via R bridge) or CSV files in standard locations

  2. SQLite cache (data/cache/morie.db)

  3. CKAN API fetch -> cache -> return

Parameters:
Return type:

DataFrame

morie.data.load_dataset(key, *, db_path=None, timeout=60)[source]

Load a dataset by catalog key.

Resolution order:

  1. Built-in morie.db (ships with the package)

  2. User cache data/cache/morie.db

  3. Local file (ingested to cache on the fly)

  4. CKAN API (if a resource ID is available)

  5. Error

Supports fuzzy matching: load_dataset("cpads") resolves to ocp21.

Parameters:
Return type:

DataFrame

morie.data.list_datasets(db_path=None)[source]

List all datasets with their cache status.

Returns a list of dicts with keys: key, name, source, survey, year, type, cached (bool), rows (int or None).

When db_path is explicitly provided, only that database is queried (the built-in DB is skipped). This keeps unit tests deterministic.

Parameters:

db_path (str | Path | None)

Return type:

list[dict]

morie.data.dataset_info(key)[source]

Return full metadata for a dataset by catalog key.

Parameters:

key (str)

Return type:

dict

class morie.data.DatasetRegistry(data_dir='data/datasets/')[source]

Bases: object

A registry to manage, catalog, and load secure epidemiological datasets.

Parameters:

data_dir (str)

__init__(data_dir='data/datasets/')[source]

Initialize the dataset registry.

Parameters:

data_dir (str) – The root file directory containing secure datasets, defaults to “data/datasets/”.

register_dataset(name, metadata)[source]

Register a dataset in the internal catalog.

Parameters:
  • name (str) – Unique identifier for the dataset.

  • metadata (dict) – A dictionary mapping metadata values (like name, path, format).

list_datasets()[source]

Return a copy of the registered dataset catalog.

Returns:

Dataset metadata keyed by registry identifier.

Return type:

dict[str, dict[str, Any]]

get_dataset_metadata(name)[source]

Return a copy of the metadata for one registered dataset.

Parameters:

name (str)

Return type:

dict[str, Any]

fetch_ckan_records(name, *, limit=5, query=None, timeout=30)[source]

Fetch records for a dataset backed by a CKAN DataStore resource.

Parameters:
Return type:

dict[str, Any]

fetch_ckan_dataframe(name, *, limit=100, query=None, timeout=30)[source]

Fetch a CKAN result set and return the records as a DataFrame.

Parameters:
Return type:

DataFrame

cpads_contract()[source]

Return the canonical local-private CPADS contract.

Return type:

dict[str, Any]

register_local_cpads(path, *, name='cpads_local')[source]

Register a user-provided local CPADS file.

Parameters:
Return type:

dict[str, Any]

validate_cpads_frame(frame, *, strict=True)[source]

Validate a DataFrame against the canonical CPADS variable contract.

Parameters:
Return type:

list[str]

load(name)[source]

Load a registered dataset securely.

Parameters:

name (str) – Unique identifier for the dataset in the catalog.

Raises:
Returns:

A pandas DataFrame containing the dataset content.

Return type:

pandas.DataFrame

BigQuery-derived dataset analysis over a local SQLite catalog.

This module gives the morie analytics surface a typed entry point into a SQLite-mirrored slice of public BigQuery datasets – useful for local, network-free exploration and CI work.

Two access paths:

  1. Local – point at the on-disk SQLite file directly. Fast, no network. Works against the bundled morie_datasets.db plus any user-supplied SQLite mirrors of public BigQuery slices.

  2. Remote – a SQL-over-HTTP endpoint configured via the MORIE_REMOTE_URL env var. Useful for exploration without copying GBs of SQLite locally.

The high-level helpers (bq_describe, bq_columns, bq_summary, bq_correlate) auto-pick local first, fall through to remote when configured.
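
The local path boils down to running SQL against a SQLite mirror and returning row dicts. A minimal sketch of that pattern, using an in-memory database with a hypothetical table as a stand-in for a BigQuery-derived mirror:

```python
import sqlite3

def bq_query_local(conn, sql):
    """Run sql against a local SQLite mirror connection; return a list of row dicts."""
    conn.row_factory = sqlite3.Row  # rows expose keys() so dict(row) works
    return [dict(row) for row in conn.execute(sql)]

# Toy in-memory mirror standing in for an on-disk SQLite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (borough TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("Bronx",), ("Bronx",), ("Queens",)])
rows = bq_query_local(conn, "SELECT borough, COUNT(*) AS n FROM t GROUP BY 1 ORDER BY 1")
```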

class morie.bq.TableInfo(db, table, row_count, columns, sample)[source]

Bases: object

High-level summary of one table inside a BigQuery-mirror database.

Parameters:
db: str
table: str
row_count: int
columns: list[dict[str, str]]
sample: list[dict[str, Any]]
class morie.bq.ColumnSummary(db, table, column, n, n_missing, mean, median, std, min, max)[source]

Bases: object

Univariate descriptive stats for one numeric column.

Parameters:
db: str
table: str
column: str
n: int
n_missing: int
mean: float | None
median: float | None
std: float | None
min: float | None
max: float | None
morie.bq.bq_query(db, sql)[source]

Run an arbitrary SQL query against the BigQuery-mirror database db.

Auto-picks local SQLite if available; falls through to the configured remote endpoint otherwise. Returns a list of row dicts.

Examples

>>> rows = bq_query("nyc_311_historic", "SELECT borough, count(*) FROM nyc_311_historic GROUP BY 1")
Parameters:
Return type:

list[dict[str, Any]]

morie.bq.bq_tables(db)[source]

List every table in the database.

Parameters:

db (str)

Return type:

list[str]

morie.bq.bq_columns(db, table=None)[source]

Schema for table (defaults to the first user table in the db).

Parameters:
Return type:

list[dict[str, str]]

morie.bq.bq_describe(db, table=None)[source]

High-level summary: row count, schema, first 5 rows.

Parameters:
Return type:

TableInfo

morie.bq.bq_summary(db, table, column)[source]

Univariate stats for a numeric column. NaN-safe.

Parameters:
Return type:

ColumnSummary

morie.bq.bq_correlate(db, table, x, y)[source]

Pearson correlation between two numeric columns (NaN-safe).

Parameters:
Return type:

float | None

OTIS — Ontario Tracking and Information System

morie.otis – Ontario Restrictive Confinement (OTIS) analysis.

Correctional/sociolegal analysis for Ontario placement data (2023-2025). Orchestrates existing MORIE causal inference modules for the correctional domain: DML-IRM, propensity score matching, AIPW, mixed-effects models.

Short-name API (≤6 chars):

  • rplace() – Regional placement analysis by age/sex/year

  • astcmb() – Alert-state combination encoding (8 binary -> complexity)

  • volat() – Regional volatility/movement metric

  • rctrnd() – Restrictive confinement trends over time

  • otdesc() – Full OTIS descriptive statistics suite

  • otdml() – Run DML IRM (ATE/ATT) on OTIS data

Data:

  • data/cache/correctional_stats_report_environment1b.RData (via R)

  • data/cache/dt_expanded.rds (1.9M rows × 13 cols)

References

  • Ontario Ministry of the Solicitor General (2025). Restrictive Confinement Detailed Dataset. data.ontario.ca.

  • Jahn v. Ontario (2020). Settlement Agreement – Inmate Data Disclosure.

  • Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal.

class morie.otis.RplRes(counts, props, year, sex=None)[source]

Bases: object

Regional placement result.

Parameters:
counts: DataFrame
props: DataFrame
year: int
sex: str | None = None
class morie.otis.AstRes(data, summary)[source]

Bases: object

Alert-state combination result.

Parameters:
data: DataFrame
summary: DataFrame
class morie.otis.VolRes(data, mean, median)[source]

Bases: object

Volatility result.

Parameters:
data: DataFrame
mean: float
median: float
class morie.otis.OtDmlR(ate, ate_se, ate_pval, att, att_se, att_pval, n, method='IRM')[source]

Bases: object

OTIS DML result.

Parameters:
ate: float
ate_se: float
ate_pval: float
att: float
att_se: float
att_pval: float
n: int
method: str = 'IRM'
morie.otis.rplace(df, year, sex=None, *, id_col='unique_individual_id', age_col='age_category', region_col='region_at_time_of_placement', year_col='end_fiscal_year', gender_col='gender')[source]

Regional placement analysis by age group.

Computes count matrix S and proportion matrix P of placements across regions, stratified by age group and optionally by sex.

Parameters:
  • df (DataFrame) – Correctional placement data.

  • year (int) – Fiscal year to analyze.

  • sex (str, optional) – Filter by gender (“Male” or “Female”).

  • id_col (str)

  • age_col (str)

  • region_col (str)

  • year_col (str)

  • gender_col (str)

Return type:

RplRes

morie.otis.astcmb(df, *, alert_cols=('mental_health_alert', 'suicide_risk_alert', 'suicide_watch_alert'), id_col='unique_individual_id', year_col='end_fiscal_year', np_col='number_of_placements')[source]

Alert-state combination encoding.

Encodes 3 binary alert indicators into 8 possible combinations (a1-a8) and computes a complexity index (count of distinct combinations per person-year).

Parameters:
  • df (DataFrame) – Must contain alert columns (Yes/No or 1/0).

  • alert_cols (tuple[str, ...])

  • id_col (str)

  • year_col (str)

  • np_col (str)

Return type:

AstRes
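
The 8-way encoding is a 3-bit code over the alert flags. A sketch of one possible scheme — the a1–a8 label ordering below (a1 = no alerts, a8 = all three) is an assumption for illustration, not necessarily the library's mapping:

```python
def alert_combo(mh, sr, sw):
    """Encode three binary alert flags (0/1) into one of 8 labels a1..a8.

    Assumed ordering: a1 = no alerts ... a8 = all three alerts set.
    """
    return f"a{1 + 4 * mh + 2 * sr + sw}"

# All 8 flag combinations map to 8 distinct labels
combos = {alert_combo(mh, sr, sw) for mh in (0, 1) for sr in (0, 1) for sw in (0, 1)}
```

The per-person-year complexity index is then just the count of distinct labels observed for that person-year.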

morie.otis.volat(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', regA_col='region_at_time_of_placement', regB_col='region_most_recent_placement')[source]

Regional volatility/movement metric.

Counts the number of distinct regions an individual was placed in (combining initial and most recent placement regions).

Parameters:
  • df (DataFrame)

  • id_col (str)

  • year_col (str)

  • regA_col (str)

  • regB_col (str)

Return type:

VolRes

morie.otis.rctrnd(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', region_col='region_at_time_of_placement')[source]

Restrictive confinement trends over time.

Returns per-year counts by region.

Parameters:
  • df (DataFrame)

  • id_col (str)

  • year_col (str)

  • region_col (str)

Returns:

DataFrame with columns year, region, n_individuals, n_placements.

Return type:

DataFrame

morie.otis.otdesc(df, *, id_col='unique_individual_id', year_col='end_fiscal_year')[source]

Full OTIS descriptive statistics suite.

Returns:

dict with keys n_total, n_by_year, n_by_region, n_by_age, n_by_gender, placement_dist.

Parameters:
Return type:

dict

morie.otis.otdml(df, outcome='Y', treatment='D', covariates=None, *, cluster=None, n_folds=3, seed=123)[source]

Run DML IRM (ATE/ATT) on OTIS data.

Wraps morie.effects.estimate_irm() for the correctional domain.

Parameters:
  • df (DataFrame) – Must contain outcome, treatment, and covariate columns.

  • outcome (str) – Outcome variable name (default “Y” = suicide risk).

  • treatment (str) – Treatment variable name (default “D” = mental health alert).

  • covariates (list of str, optional) – Covariate column names. Defaults to standard OTIS set.

  • cluster (str, optional) – Cluster variable for clustered standard errors.

  • n_folds (int)

  • seed (int)

Return type:

OtDmlR

morie.otis.regional_placement(df, year, sex=None, *, id_col='unique_individual_id', age_col='age_category', region_col='region_at_time_of_placement', year_col='end_fiscal_year', gender_col='gender')

Regional placement analysis by age group.

Computes count matrix S and proportion matrix P of placements across regions, stratified by age group and optionally by sex.

Parameters:
  • df (DataFrame) – Correctional placement data.

  • year (int) – Fiscal year to analyze.

  • sex (str, optional) – Filter by gender (“Male” or “Female”).

  • id_col (str)

  • age_col (str)

  • region_col (str)

  • year_col (str)

  • gender_col (str)

Return type:

RplRes

morie.otis.alert_state_combo(df, *, alert_cols=('mental_health_alert', 'suicide_risk_alert', 'suicide_watch_alert'), id_col='unique_individual_id', year_col='end_fiscal_year', np_col='number_of_placements')

Alert-state combination encoding.

Encodes 3 binary alert indicators into 8 possible combinations (a1-a8) and computes a complexity index (count of distinct combinations per person-year).

Parameters:
  • df (DataFrame) – Must contain alert columns (Yes/No or 1/0).

  • alert_cols (tuple[str, ...])

  • id_col (str)

  • year_col (str)

  • np_col (str)

Return type:

AstRes

morie.otis.volatility(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', regA_col='region_at_time_of_placement', regB_col='region_most_recent_placement')

Regional volatility/movement metric.

Counts the number of distinct regions an individual was placed in (combining initial and most recent placement regions).

Parameters:
  • df (DataFrame)

  • id_col (str)

  • year_col (str)

  • regA_col (str)

  • regB_col (str)

Return type:

VolRes

Restrictive confinement trends over time.

Returns per-year counts by region.

Parameters:
  • df (DataFrame)

  • id_col (str)

  • year_col (str)

  • region_col (str)

Returns:

DataFrame with columns year, region, n_individuals, n_placements.

Return type:

DataFrame

morie.otis.otis_descriptives(df, *, id_col='unique_individual_id', year_col='end_fiscal_year')

Full OTIS descriptive statistics suite.

Returns:

dict with keys n_total, n_by_year, n_by_region, n_by_age, n_by_gender, placement_dist.

Parameters:
Return type:

dict

morie.otis.otis_dml(df, outcome='Y', treatment='D', covariates=None, *, cluster=None, n_folds=3, seed=123)

Run DML IRM (ATE/ATT) on OTIS data.

Wraps morie.effects.estimate_irm() for the correctional domain.

Parameters:
  • df (DataFrame) – Must contain outcome, treatment, and covariate columns.

  • outcome (str) – Outcome variable name (default “Y” = suicide risk).

  • treatment (str) – Treatment variable name (default “D” = mental health alert).

  • covariates (list of str, optional) – Covariate column names. Defaults to standard OTIS set.

  • cluster (str, optional) – Cluster variable for clustered standard errors.

  • n_folds (int)

  • seed (int)

Return type:

OtDmlR

OTIS analysis surfaces – RichResult wrappers around morie.otis primitives.

morie.otis provides 6 result-emitting callables that return specialized dataclasses (RplRes / AstRes / VolRes / OtDmlR / etc.). This module wraps each in a RichResult so OTIS analyses get the same multi-section output that Welch / Mann–Whitney / SIU analyses already emit.

OTIS = anonymized Ontario MCSCS correctional placement records (open data). The canonical 76,934-row table lives in the repo cache at data/cache/correctional_stats_report_environment.RData as the R object df. Use load_otis() (CSV mirror) or call the Rscript exporter at scripts/export_otis_csv.R to refresh the CSV.

Public API:

rplace(df, year, sex=None) -> RichResult
astcmb(df) -> RichResult
volat(df) -> RichResult
rctrnd(df) -> RichResult
otdesc(df, year=None, sex=None) -> RichResult
otdml(df, treatment, outcome, …) -> RichResult
all_analyses(df, year, *, out_dir=None) -> dict[str, RichResult]

morie.otis_analyze.load_otis(csv_path=None)[source]

Load the canonical OTIS DataFrame.

Defaults to data/cache/otis_main.csv (Rscript-exported from the correctional_stats_report_environment.RData fixture). Pass a custom csv_path to load a refreshed snapshot.

Returns a 10-column DataFrame:

end_fiscal_year, unique_individual_id, region_at_time_of_placement, region_most_recent_placement, gender, age_category, mental_health_alert, suicide_risk_alert, suicide_watch_alert, number_of_placements

Parameters:

csv_path (Path | str | None)

Return type:

DataFrame

morie.otis_analyze.rplace(df, year, sex=None, **kw)[source]

Regional placement analysis as RichResult.

Parameters:
Return type:

RichResult

morie.otis_analyze.astcmb(df, **kw)[source]

Alert-state combination analysis as RichResult.

Parameters:
Return type:

RichResult

morie.otis_analyze.volat(df, **kw)[source]

Regional volatility metric as RichResult.

Parameters:
Return type:

RichResult

morie.otis_analyze.rctrnd(df, **kw)[source]

Restrictive-confinement trends as RichResult.

Parameters:
Return type:

RichResult

morie.otis_analyze.otdesc(df, **kw)[source]

Full OTIS descriptive suite as RichResult.

The underlying morie.otis.otdesc does not accept year/sex filters – apply those upstream via df = df[df.end_fiscal_year == year] if needed.

Parameters:
Return type:

RichResult

morie.otis_analyze.otdml(df, *, treatment, outcome, covariates, **kw)[source]

OTIS DML IRM (ATE/ATT) as RichResult.

Parameters:
Return type:

RichResult

morie.otis_analyze.all_analyses(df, year, *, sex=None, out_dir=None)[source]

Run rplace / astcmb / volat / rctrnd / otdesc on df and write to disk.

otdml is excluded from the bundle because it requires the user to specify (treatment, outcome, covariates) – call it separately.

Parameters:
Return type:

dict[str, RichResult]

morie.otis_all_analyze – RichResult-emitting analyses for ALL 28 OTIS datasets.

This module is the comprehensive OTIS analysis surface. It pairs with morie.otis_datasets (the registry/loader) and extends the narrower morie.otis_analyze (which only covered the restricted- confinement subset).

For each dataset id (b01..d07), this module exposes:

analyze_<id>(df=None) -> RichResult

If df is None, the analysis loads from data/datasets/OTIS/<csv_filename>.

Plus a top-level driver:

analyze_all(*, out_dir=None) -> dict[str, RichResult]

which runs every dataset’s analysis, writes per-dataset .txt and .json under data/manifest/outputs/otis/, and returns the in-memory results.

Plus the Ruhela-formulation (RF) high-level entry points:
analyze_a01_ruhela_formulations(df=None) – full DLRM on a01

– alias: analyze_a01_dlrm

analyze_b01_ruhela_formulations(df=None) – full DLRM on b01

– alias: analyze_b01_dlrm

analyze_b02_ruhela_formulations(df=None) – gender -> seg days on b02

– alias: analyze_b02_dlrm

analyze_a01_with_csi_context() – a01 causal + Toronto CSI context

analyze_c_doob_chi2() – χ² + Cramer’s V on c-series

– homage to Doob’s chi-square tradition

analyze_d_doob_chi2() – d-series + Alert χ²

Naming abbreviations:

RF – Ruhela formulation

RDF – Ruhela Dual Formulation (RF + Naive-arm sensitivity)

DLRM – Doob-Levinsky-Ruhela-Medina (methodology attribution)

The DLRM methodology attribution and the wider acknowledgements (Prof. Jauregui, Prof. Laniyonu) live at the top of morie.otis_causal’s module docstring.

morie.otis_all_analyze.analyze_b01(df=None)[source]

Person-level segregation placements (76,934 rows).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02(df=None)[source]

Aggregate days in segregation per person per year.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b03(df=None)[source]

Segregation placements by alert × institution.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b04(df=None)[source]

Placement durations (max/median/mode) by region & gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b05(df=None)[source]

Distribution of placements by binned duration.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b06(df=None)[source]

Reasons for placement × institution × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b07(df=None)[source]

Alerts × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b08(df=None)[source]

Durations (median/mode) by institution & gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b09(df=None)[source]

Individuals by number of segregation placements × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c01(df=None)[source]

Total individuals × custody/RC/seg × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c02(df=None)[source]

Individuals in RC/seg by institution.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c03(df=None)[source]

Individuals × race × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c04(df=None)[source]

Individuals × race × region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c05(df=None)[source]

Individuals × religion × region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c06(df=None)[source]

Individuals × age × region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c07(df=None)[source]

Individuals × alerts × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c08(df=None)[source]

Individuals × religion × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c09(df=None)[source]

Individuals × age × gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c10(df=None)[source]

RC/seg aggregate durations by institution.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c11(df=None)[source]

Individuals by aggregate-duration bin.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c12(df=None)[source]

RC/seg aggregate durations by region & gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d01(df=None)[source]

Person-level deaths in custody.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d02(df=None)

Custodial death counts by gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d03(df=None)

Custodial death counts by race.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d04(df=None)

Custodial death counts by religion.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d05(df=None)

Custodial death counts by age category.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d06(df=None)[source]

Deaths in custody: Alert × MedicalCauseOfDeath.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d07(df=None)[source]

Deaths in custody: Alert × Housing_Type.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01(df=None)[source]

High-level a01 (Restrictive Confinement Detailed) analysis.

Wraps the full causal pipeline that the OTIS-RC research runs against a01_restrictive_confinement_detailed_dataset.csv:

  1. Person-year aggregation via 8-state alert-combo encoding (morie.otis_causal.make_pair_alert_to_volatility_a01).

  2. MatchIt 1:1 NN PSM with caliper.

  3. IRM-DML with cross-fitted random-forest nuisances.

  4. Multi-way clustered standard errors (id, region, year).

Returns a RichResult with both ATE and ATTE estimates.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]

Ruhela formulations (full DLRM) on a01 – the canonical OTIS-RC dataset.

Runs the complete OTIS-RC methodology arc on the author’s Ruhela formulation (T = high alert complexity, Y = regional volatility). See _ruhela_formulations_on for the full estimator list.

Attribution

The Ruhela formulation is the author’s design end-to-end; the supporting-contributor credits (Doob – pointed the author to the dataset; Levinsky – course context, reviewed only the preliminary a01 analysis; Medina – reviewed the formal write-up) are documented in morie.otis_causal’s module docstring.

Parameters:

  • df (pd.DataFrame | None) – a01 DataFrame; if None, otis_datasets.load_otis_dataset('a01').

  • cluster_col (str) – Cluster axis for cluster-robust SE in the IRM-DML estimator (default ‘EndFiscalYear’).

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_dual(*args, **kwargs)[source]

Deprecated alias for analyze_a01_ruhela_formulations.

morie.otis_all_analyze.analyze_b01_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]

Ruhela formulations (full DLRM) on b01 (Segregation Detailed).

Per-placement segregation complement to a01’s per-day RC records. Same Ruhela alert-complexity -> regional-volatility contrast, with a larger-magnitude effect because segregation is a stronger institutional intervention.

See analyze_a01_ruhela_formulations for the full attribution and estimator catalogue.

Parameters:

  • df (pd.DataFrame | None)

  • cluster_col (str)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]

Ruhela formulation on b02 (Segregation Total Days per individual).

b02 has UniqueIndividual_ID + EndFiscalYear + Gender + Region + Age_Category + TotalAggregatedDays_Segregation but NO alert columns. The canonical alert-complexity formulation is impossible; instead the author’s b02-specific Ruhela formulation tests gender disparity in segregation-day burden:

T = (Gender == “Female”) indicator

Y = TotalAggregatedDays_Segregation (count, person-year)

covariates = [Region_MostRecentPlacement, Age_Category, EndFiscalYear]

Same DLRM ensemble as the canonical formulation. Sister formulations on the same b02 data could swap T for region or age.

Parameters:

  • df (pd.DataFrame | None)

  • cluster_col (str)

Return type:

RichResult
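The b02 treatment/outcome construction described above reduces to two columns. A minimal sketch with illustrative values (column names follow the docstring; the DLRM estimation itself is elided):

```python
import pandas as pd

# Illustrative b02-shaped rows; real data has one row per person-year.
b02 = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "TotalAggregatedDays_Segregation": [12, 4, 30],
})
T = (b02["Gender"] == "Female").astype(int)   # treatment: Female indicator
Y = b02["TotalAggregatedDays_Segregation"]    # outcome: person-year day count
```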

morie.otis_all_analyze.analyze_ruhela_per_year(df, *, ds_id, treatment, outcome, covariates, year_col='EndFiscalYear', cluster_col='EndFiscalYear', propensity_calibration='none')[source]

Per-fiscal-year full DLRM on each Ruhela formulation.

Runs the complete DLRM (IPW + AIPW + g-comp + PSM-NN + PSM-subclass + IRM-DML + match_first + ATC + PLR + SuperLearner) separately on each fiscal year. ~7× the standard per-year IRM-DML runtime. Mirrors a per-year extension of the author’s res_by_year.

Parameters:

  • df (pd.DataFrame)

  • ds_id (str)

  • treatment (str)

  • outcome (str)

  • covariates (list[str])

  • year_col (str)

  • cluster_col (str)

  • propensity_calibration (str)

Return type:

RichResult
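The per-year driver amounts to splitting the frame on the fiscal-year column and running the full ensemble on each slice. A hedged sketch of that split (the estimation per slice is elided; column name follows the year_col default):

```python
import pandas as pd

# Illustrative frame; each per-year slice would be fed to the DLRM ensemble.
df = pd.DataFrame({"EndFiscalYear": [2022, 2022, 2023], "vm": [0, 1, 2]})
per_year = {int(year): grp for year, grp in df.groupby("EndFiscalYear")}
```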

morie.otis_all_analyze.analyze_a01_ruhela_per_year(df=None)[source]

Per-year full DLRM on Ruhela formulations on a01.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_per_year(df=None)[source]

Per-year full DLRM on Ruhela formulations on b01.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b03_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on b03 (Alert × Institution).

T = Alert_Presence indicator -> Y = Number_SegregationPlacements. Year-FE + Region-FE + Alert_Type-FE; Institution as GEE cluster group (24 levels, marginal model with cluster-robust SE).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b07_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on b07 (alerts × gender summary).

b07 is a wide table with paired columns:

Number_Segregation_Placements_With_Alert

Number_Segregation_Placements_Without_Alert

We pivot to long form (one row per alert-state × gender × year), then T = with_alert indicator -> Y = count. Year + Gender + Alert_Type fixed effects.

Parameters:

df (DataFrame | None)

Return type:

RichResult
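The wide-to-long pivot described above can be sketched with a pandas melt; counts are illustrative and the with_alert flag is derived from the column name:

```python
import pandas as pd

# Paired b07-style count columns, one row per gender (illustrative numbers).
wide = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Number_Segregation_Placements_With_Alert": [10, 40],
    "Number_Segregation_Placements_Without_Alert": [5, 20],
})
# One row per alert-state × gender; T flags the with-alert rows.
long = wide.melt(id_vars="Gender", var_name="alert_state", value_name="count")
long["T"] = long["alert_state"].str.contains("_With_").astype(int)
```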

morie.otis_all_analyze.analyze_c01_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c01 – gender disparity in RC.

c01 has Gender × {InCustody, RestrictiveConfinement, Segregation} counts. T = Female indicator -> Y = NumberIndividuals_RC, Year-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c07_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c07 – alert × gender -> RC count.

Counterpart of the per-row Ruhela formulation on a01/b01 at the aggregate level. Useful triangulation: c07 reports NumberIndividuals_InCustody / RC / Segregation cross-tabulated by Alert_Type × Gender × FiscalYear. The Ruhela formulation here contrasts an indicator of alert presence (vs. no-alert reference) with the number-in-RC count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b06_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on b06 – disciplinary reason -> seg.

T = (Reason contains “Disciplinary”) -> Y = Number of seg placements. Year-FE + Region-FE + Gender-FE; Institution as GEE cluster group (24 levels) for marginal-model cluster-robust inference.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c03_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c03 – racial disparity in RC.

T = (Race == “Indigenous”) indicator -> Y = NumberIndividuals_RC. Year-FE + Gender-FE. Indigenous vs all-other contrast tests overrepresentation.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c04_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c04 – racial disparity by region.

T = (Race == “Indigenous”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c06_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c06 – age-50+ disparity in RC.

T = (Age_Category contains “50”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c09_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c09 – age-50+ disparity by gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_alt_gender(df=None)[source]

Alt-T Ruhela formulation on a01 – gender -> regional volatility.

Same per-row Ruhela cell frame as the canonical formulation, but T = Female indicator -> Y = vm count. Covariates exclude Gender (the variable being treated). Runs the full 10-estimator DLRM ensemble. Tests whether women experience more or fewer regional transfers within a fiscal year of restrictive confinement.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_alt_age(df=None)[source]

Alt-T Ruhela formulation on a01 – age 50+ -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_alt_toronto(df=None)[source]

Alt-T Ruhela formulation on a01 – Toronto-region -> regional volatility.

T = (Region_AtTimeOfPlacement == “Toronto”) indicator. Tests whether Toronto-region placements produce different volatility than placements outside Toronto, controlling for demographics.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_alt_gender(df=None)[source]

Alt-T Ruhela formulation on b01 – gender -> regional volatility.

Same as analyze_a01_ruhela_alt_gender but on b01 (Segregation Detailed) cell frame. Tests whether women experience more or fewer regional transfers within a fiscal year of segregation.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_alt_age(df=None)[source]

Alt-T Ruhela formulation on b01 – age 50+ -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_alt_toronto(df=None)[source]

Alt-T Ruhela formulation on b01 – Toronto-region -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_subgroup_female(df=None)[source]

Canonical a01 Ruhela formulation on Female-only subset.

Tests effect-heterogeneity-by-gender: same T_high_ac -> vm contrast, but only on female cells. Compare ATE here to analyze_a01_ruhela_subgroup_male to see if alert-complexity matters more or less for women than men.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_ruhela_subgroup_male(df=None)[source]

Canonical a01 Ruhela formulation on Male-only subset.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_subgroup_female(df=None)[source]

Canonical b01 Ruhela formulation on Female-only subset.

Sister to analyze_a01_ruhela_subgroup_female. Tests effect-heterogeneity-by-gender on segregation records.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_ruhela_subgroup_male(df=None)[source]

Canonical b01 Ruhela formulation on Male-only subset.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_ruhela_alt_region(df=None)[source]

b02 alt-T: T = Toronto region -> Y = total seg days within FY.

Sister to analyze_b02_ruhela_formulations (T = Female). Tests regional disparity in seg-day burden controlling for gender + age + year.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_ruhela_alt_age(df=None)[source]

b02 alt-T: T = Age 50+ -> Y = total seg days within FY.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c05_ruhela_aggregate(df=None)[source]

Aggregate RF on c05 – non-Christian religion -> individuals in RC.

T = (Religion != “Christian” and not “No Religion” and not “Unknown”) indicator. Tests population-level religious-minority overrepresentation in RC, controlling for region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c08_ruhela_aggregate(df=None)[source]

Aggregate RF on c08 – non-Christian religion × gender -> RC.

T = (Religion != Christian / no religion / unknown) indicator. Year + Gender FE. Companion to c05 with gender control.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b04_ruhela_aggregate(df=None)[source]

Aggregate RF on b04 – gender disparity in median seg duration.

b04 has Region × Gender × Measure (max/median/mode) duration. We filter to Measure == “Median” and test T = Female -> Y = duration.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b08_ruhela_aggregate(df=None)[source]

Aggregate RF on b08 – gender disparity in median seg duration by institution.

Companion to b04; b08 adds Institution_AtTimeOfPlacement as a finer-grained spatial dimension. Region-FE; Institution as GEE cluster group (24 levels).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b09_ruhela_aggregate(df=None)[source]

Aggregate RF on b09 – gender disparity in placement-count distribution.

b09 has rows of (Year × NumberPlacements_Segregation × Gender × NumberIndividuals_Segregation). We test T = Female -> Y = NumberIndividuals_Segregation, with NumberPlacements as a categorical FE so we adjust for the bin-cell structure.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c02_ruhela_aggregate(df=None)[source]

Aggregate RF on c02 – gender disparity by institution.

Companion to c01 with institution-level granularity. T = Female -> Y = NumberIndividuals_RC. Region + year as fixed effects; Institution as a GEE cluster group (marginal model with cluster-robust SE) – Institution has too many levels (~76) for an FE specification on the 148-row table without collinearity. Mirrors OTIS-RC’s glmmTMB nbinom2 clustered approach.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c10_ruhela_aggregate(df=None)[source]

Aggregate RF on c10 – gender disparity in median RC days.

c10 has Region × Institution × Gender × Measure (Maximum/Median/Mode) duration. Filter to Median; T = Female -> Y = TotalAggregatedDays_RC. Region-FE + Year-FE; Institution as GEE cluster group (25 levels) for cluster-robust marginal inference.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c12_ruhela_aggregate(df=None)[source]

Aggregate RF on c12 – gender disparity in median RC days by region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c11_ruhela_aggregate(df=None)[source]

Aggregate Ruhela formulation on c11 – long-duration cell concentration.

c11 has rows of (FY × Aggregate_Duration_bin × NumberIndividuals_RC × NumberIndividuals_Seg). 11 duration bins from “1 day” to “Greater than 30 days”. Rather than treat duration as the treatment-FE (it IS the categorical the data is binned by), we test T = (long-duration bin: ≥16 days) -> Y = NumberIndividuals_RC. Year-FE only.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c01_ruhela_aggregate_region_cluster(df=None)[source]

c01 alternative: Female -> RC count, GEE-clustered on year.

Sister analyzer to analyze_c01_ruhela_aggregate. The c01 case has only 6 cells (3 years × 2 genders), so cluster_group="EndFiscalYear" gives a 3-cluster GEE – extreme small-cluster bias. This variant is shipped for transparency: it shows that with so few clusters, Poisson and GEE-Poisson give nearly identical numbers because there is no clustering structure to exploit.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c04_ruhela_aggregate_region_cluster(df=None)[source]

c04 alternative: Indigenous -> RC, GEE-clustered on region.

Sister to analyze_c04_ruhela_aggregate (which uses region as FE). With 5 regions, the GEE-cluster:Region inference treats region as a marginal-model grouping. Useful when wide region heterogeneity in case-mix matters for SE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d02_ruhela_aggregate(df=None)[source]

d02 – gender -> custodial death count, with year FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d03_ruhela_aggregate(df=None)[source]

d03 – Indigenous -> custodial death count, with year FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d04_ruhela_aggregate(df=None)[source]

d04 – non-majority religion -> custodial death count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d05_ruhela_aggregate(df=None)[source]

d05 – age 50+ -> custodial death count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_ruhela_grid()[source]

One-page Ruhela formulations grid summary (24 aggregates).

Runs every aggregate Ruhela formulation analyzer + the region-cluster GEE variants, and presents a single comparison table.

Return type:

RichResult

morie.otis_all_analyze.analyze_ruhela_master(*, include_per_row=False)[source]

Paper-ready master report – every Ruhela formulation in one RichResult.

Sections:
  1. Aggregate Ruhela formulations (24 datasets) – IRR comparison

  2. Per-row Ruhela formulations on a01/b01/b02 – primary IRM-DML ATE (only when include_per_row=True; ~5-7 min runtime)

  3. Doob chi-square family on c-series + d-series

For thesis / paper writing.

Parameters:

include_per_row (bool) – If True, also runs the slow per-row RFs on a01/b01/b02 (full 10-estimator DLRM). Default False – runs in ~1 second from cached aggregates only.

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_dlrm(*args, **kwargs)[source]

DLRM short alias for analyze_b02_ruhela_formulations.

morie.otis_all_analyze.analyze_a01_dlrm(*args, **kwargs)[source]

DLRM short alias for analyze_a01_ruhela_formulations.

morie.otis_all_analyze.analyze_b01_dlrm(*args, **kwargs)[source]

DLRM short alias for analyze_b01_ruhela_formulations.

morie.otis_all_analyze.analyze_b05_mandela_classification(df=None)[source]

Mandela-RF on b05 – per-placement Mandela classification by year.

Applies the Sprott-Doob 15-day Mandela threshold to OTIS b05 (Ontario provincial segregation placement counts by binned duration). Reports proportions of placements ≤15 days (solitary) vs ≥16 days (torture) per fiscal year + a federal-vs-provincial comparison.

Parameters:

df (DataFrame | None)

Return type:

RichResult
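The 15-day Mandela split reduces to partitioning binned counts at the threshold: bins at or below 15 days count as solitary confinement, 16+ days as the torture-threshold class. A hedged sketch with illustrative bin labels and counts (not OTIS figures):

```python
# Illustrative duration bins; labels and counts are stand-ins for b05 rows.
bin_counts = {
    "1 to 5 days": 120,
    "6 to 15 days": 80,
    "16 to 30 days": 40,
    "Greater than 30 days": 10,
}
solitary_bins = {"1 to 5 days", "6 to 15 days"}          # <= 15 days
solitary = sum(n for b, n in bin_counts.items() if b in solitary_bins)
torture = sum(bin_counts.values()) - solitary             # >= 16 days
torture_share = torture / sum(bin_counts.values())
```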

morie.otis_all_analyze.analyze_c11_mandela_classification(df=None)[source]

Mandela-RF on c11 – per-individual Mandela classification by year.

Applies the 15-day threshold to OTIS c11 (Ontario provincial counts of INDIVIDUALS by binned aggregate duration). Reports both restrictive-confinement and segregation-only views; the latter is the more conservative and directly comparable to Sprott-Doob federal SIUs.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_otis_mandela_provincial_vs_federal()[source]

Side-by-side comparison: provincial OTIS vs federal SIU Mandela rates.

Cross-references analyze_c11_mandela_classification (Ontario provincial individuals) against the Sprott-Doob Feb 2021 federal SIU figures (Table 19, N=1960).

Return type:

RichResult

morie.otis_all_analyze.analyze_c_doob_chi2(*, contingency_value='NumberIndividuals_RestrictiveConfinement')[source]

Doob chi-square family on c-series demographic contingency tables.

Honour to Prof. Doob’s chi-square tradition in Canadian corrections. Runs Pearson χ² + Cramer’s V on every meaningful 2-way slice of the c-series datasets, using the chosen primary count column.

Sibling: analyze_d_doob_chi2 (death counts). Federal counterpart context: morie.siuiap documents the Structured Intervention Unit Implementation Advisory Panel (Doob, Sprott, Sapers et al.).

Parameters:

contingency_value (str)

Return type:

RichResult
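The per-slice statistic is the standard pair: Pearson χ² on the 2-way count table, then Cramer’s V as χ² normalised by n·(min dim − 1). A self-contained sketch on an illustrative table (no continuity correction; the module’s exact options are not documented here):

```python
import numpy as np

# Illustrative 2x2 count table (e.g. a gender x RC-status slice).
table = np.array([[30.0, 10.0],
                  [20.0, 40.0]])
n = table.sum()
# Expected counts under independence from the row/column margins.
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
chi2 = float(((table - expected) ** 2 / expected).sum())
# Cramer's V: effect size in [0, 1].
cramers_v = float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
```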

morie.otis_all_analyze.analyze_d_doob_chi2()[source]

Doob chi-square on d-series death data + yearly trend.

Honour to Prof. Doob’s chi-square tradition; same payload as analyze_d_aggregate (yearly Poisson 95% CI + Alert × MedicalCause χ² + Alert × Housing χ²).

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_dual(*args, **kwargs)[source]

Deprecated alias for analyze_b01_ruhela_formulations.

morie.otis_all_analyze.analyze_c_aggregate(*, contingency_value='NumberIndividuals_RestrictiveConfinement')[source]

Chi² + Cramer’s V family on every c-series 2-way slice.

c-series datasets are pre-aggregated counts, so the natural statistic on them is contingency-table independence. This runs χ² + Cramer’s V on each meaningful (categorical_a × categorical_b) slice using the chosen primary count column.

Parameters:

contingency_value (str)

Return type:

RichResult

morie.otis_all_analyze.analyze_d_aggregate()[source]

Yearly trend + alert-cause/housing χ² on d-series death data.

Three things on the d-series:
  1. Per-year custodial-death count (d02-d05 aggregate over the categorical dim) with Poisson 95% CIs.

  2. Year-over-year rate ratio (last full year / first year).

  3. χ² + Cramer’s V on d06 (Alert × MedicalCauseOfDeath) and d07 (Alert × Housing_Type).

All measures are at the population aggregate level – d-series has no per-individual alert columns, so the Ruhela dual is structurally impossible here.

Return type:

RichResult
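The yearly-count interval in item 1 can be sketched with a normal-approximation Poisson 95% CI, count ± 1.96·√count (the module may use an exact interval instead; counts are illustrative):

```python
import math

count = 25                                  # illustrative yearly death count
half_width = 1.96 * math.sqrt(count)        # normal-approximation 95% CI
ci_low, ci_high = count - half_width, count + half_width
rate_ratio = 36 / 25                        # last full year / first year (illustrative)
```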

morie.otis_all_analyze.analyze_a01_with_csi_context(df=None, *, variant='total', rebase_to_year=2023)[source]

OTIS a01 causal pipeline + Toronto Crime Severity Index context.

Wires together three independent morie subsystems:

  1. morie.otis_all_analyze.analyze_a01 – the MatchIt-then-IRM-DML causal estimate of high-alert-complexity (ac ≥ 2) on regional volatility (vm) within OTIS a01.

  2. morie.otis_tps_overlay – year-by-year correlation between OTIS Toronto-region segregation counts and TPS incident counts.

  3. morie.tps_csi.analyze_csi_from_tps_dataframes – Statistics Canada Crime Severity Index per year + per ward, weighted by offence-specific severity weights and population-adjusted.

The narrative question: “Did the OTIS-observed increase in high-alert-complexity placements happen during a period of rising or falling Toronto crime severity?” The CSI rebased to rebase_to_year (default 2023, OTIS’s first FY) gives a grandma-readable index.

Parameters:
  • df (pd.DataFrame, optional) – a01 DataFrame; if None, otis_all_analyze.analyze_a01() loads the canonical local copy.

  • variant ({"total", "violent"}) – StatsCan CSI variant.

  • rebase_to_year (int | None) – Anchor year for the CSI index column. None skips rebasing.

Return type:

RichResult
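The rebasing step is a simple renormalisation: divide each year’s CSI by the anchor year’s value and scale to 100. A hedged sketch with illustrative values (not StatsCan figures):

```python
# Illustrative CSI series keyed by year; anchor year reads 100 after rebasing.
csi = {2021: 78.0, 2022: 80.0, 2023: 82.0}
anchor = 2023
rebased = {year: 100.0 * value / csi[anchor] for year, value in csi.items()}
```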

morie.otis_all_analyze.analyze_all(out_dir=None)[source]

Run all 29 OTIS analyses and write outputs.

Parameters:

out_dir (Path | None)

Return type:

dict[str, RichResult]

morie.otis_all_analyze.analyze_a01_mrm_alt_age(df=None)

Alt-T Ruhela formulation on a01 – age 50+ -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm_alt_gender(df=None)

Alt-T Ruhela formulation on a01 – gender -> regional volatility.

Same per-row Ruhela cell frame as the canonical formulation, but T = Female indicator -> Y = vm count. Covariates exclude Gender (the variable being treated). Runs the full 10-estimator DLRM ensemble. Tests whether women experience more or fewer regional transfers within a fiscal year of restrictive confinement.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm_alt_toronto(df=None)

Alt-T Ruhela formulation on a01 – Toronto-region -> regional volatility.

T = (Region_AtTimeOfPlacement == “Toronto”) indicator. Tests whether Toronto-region placements produce different volatility than placements outside Toronto, controlling for demographics.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm(df=None, *, cluster_col='EndFiscalYear')

Ruhela formulations (full DLRM) on a01 – the canonical OTIS-RC dataset.

Runs the complete OTIS-RC methodology arc on the author’s Ruhela formulation (T = high alert complexity, Y = regional volatility). See _ruhela_formulations_on for the full estimator list.

Attribution

The Ruhela formulation is the author’s design end-to-end; the supporting-contributor credits (Doob – pointed the author to the dataset; Levinsky – course context, reviewed only the preliminary a01 analysis; Medina – reviewed the formal write-up) are documented in morie.otis_causal’s module docstring.

Parameters:

  • df (pd.DataFrame | None) – a01 DataFrame; if None, otis_datasets.load_otis_dataset('a01').

  • cluster_col (str) – Cluster axis for cluster-robust SE in the IRM-DML estimator (default ‘EndFiscalYear’).

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm_per_year(df=None)

Per-year full DLRM on Ruhela formulations on a01.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm_subgroup_female(df=None)

Canonical a01 Ruhela formulation on Female-only subset.

Tests effect-heterogeneity-by-gender: same T_high_ac -> vm contrast, but only on female cells. Compare ATE here to analyze_a01_ruhela_subgroup_male to see if alert-complexity matters more or less for women than men.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_a01_mrm_subgroup_male(df=None)

Canonical a01 Ruhela formulation on Male-only subset.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_alt_age(df=None)

Alt-T Ruhela formulation on b01 – age 50+ -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_alt_gender(df=None)

Alt-T Ruhela formulation on b01 – gender -> regional volatility.

Same as analyze_a01_ruhela_alt_gender but on b01 (Segregation Detailed) cell frame. Tests whether women experience more or fewer regional transfers within a fiscal year of segregation.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_alt_toronto(df=None)

Alt-T Ruhela formulation on b01 – Toronto-region -> regional volatility.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm(df=None, *, cluster_col='EndFiscalYear')

Ruhela formulations (full DLRM) on b01 (Segregation Detailed).

Per-placement segregation complement to a01’s per-day RC records. Same Ruhela alert-complexity -> regional-volatility contrast, with a larger-magnitude effect because segregation is a stronger institutional intervention.

See analyze_a01_ruhela_formulations for the full attribution and estimator catalogue.

Parameters:

  • df (pd.DataFrame | None)

  • cluster_col (str)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_per_year(df=None)

Per-year full DLRM on Ruhela formulations on b01.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_subgroup_female(df=None)

Canonical b01 Ruhela formulation on Female-only subset.

Sister to analyze_a01_ruhela_subgroup_female. Tests effect-heterogeneity-by-gender on segregation records.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b01_mrm_subgroup_male(df=None)

Canonical b01 Ruhela formulation on Male-only subset.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_mrm_alt_age(df=None)

b02 alt-T: T = Age 50+ -> Y = total seg days within FY.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_mrm_alt_region(df=None)

b02 alt-T: T = Toronto region -> Y = total seg days within FY.

Sister to analyze_b02_ruhela_formulations (T = Female). Tests regional disparity in seg-day burden controlling for gender + age + year.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b02_mrm(df=None, *, cluster_col='EndFiscalYear')

Ruhela formulation on b02 (Segregation Total Days per individual).

b02 has UniqueIndividual_ID + EndFiscalYear + Gender + Region + Age_Category + TotalAggregatedDays_Segregation but NO alert columns. The canonical alert-complexity formulation is impossible; instead the author’s b02-specific Ruhela formulation tests gender disparity in segregation-day burden:

T = Gender == “Female” indicator
Y = TotalAggregatedDays_Segregation (count, person-year)
covariates = [Region_MostRecentPlacement, Age_Category, EndFiscalYear]

Same DLRM ensemble as the canonical formulation. Sister formulations on the same b02 data could swap T for region or age.

Parameters:
Return type:

RichResult

morie.otis_all_analyze.analyze_b03_mrm_aggregate(df=None)

Aggregate Ruhela formulation on b03 (Alert × Institution).

T = Alert_Presence indicator -> Y = Number_SegregationPlacements. Year-FE + Region-FE + Alert_Type-FE; Institution as GEE cluster group (24 levels, marginal model with cluster-robust SE).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b04_mrm_aggregate(df=None)

Aggregate RF on b04 – gender disparity in median seg duration.

b04 has Region × Gender × Measure (max/median/mode) duration. We filter to Measure == “Median” and test T = Female -> Y = duration.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b06_mrm_aggregate(df=None)

Aggregate Ruhela formulation on b06 – disciplinary reason -> seg.

T = (Reason contains “Disciplinary”) -> Y = Number of seg placements. Year-FE + Region-FE + Gender-FE; Institution as GEE cluster group (24 levels) for marginal-model cluster-robust inference.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b07_mrm_aggregate(df=None)

Aggregate Ruhela formulation on b07 (alerts × gender summary).

b07 is a wide table with paired columns:

Number_Segregation_Placements_With_Alert
Number_Segregation_Placements_Without_Alert

We pivot to long form (one row per alert-state × gender × year), then T = with_alert indicator -> Y = count. Year + Gender + Alert_Type fixed effects.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b08_mrm_aggregate(df=None)

Aggregate RF on b08 – gender disparity in median seg duration by institution.

Companion to b04; b08 adds Institution_AtTimeOfPlacement as a finer-grained spatial dimension. Region-FE; Institution as GEE cluster group (24 levels).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_b09_mrm_aggregate(df=None)

Aggregate RF on b09 – gender disparity in placement-count distribution.

b09 has rows of (Year × NumberPlacements_Segregation × Gender × NumberIndividuals_Segregation). We test T = Female -> Y = NumberIndividuals_Segregation, with NumberPlacements as a categorical FE so we adjust for the bin-cell structure.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c01_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c01 – gender disparity in RC.

c01 has Gender × {InCustody, RestrictiveConfinement, Segregation} counts. T = Female indicator -> Y = NumberIndividuals_RC, Year-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c01_mrm_aggregate_region_cluster(df=None)

c01 alternative: Female -> RC count, GEE-clustered on year.

Sister analyzer to analyze_c01_ruhela_aggregate. The c01 case has only 6 cells (3 years × 2 genders), so cluster_group="EndFiscalYear" gives a 3-cluster GEE, which suffers extreme small-cluster bias. This variant is shipped for transparency: it shows that with so few clusters, Poisson and GEE-Poisson give nearly identical numbers because there is no clustering structure to exploit.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c02_mrm_aggregate(df=None)

Aggregate RF on c02 – gender disparity by institution.

Companion to c01 with institution-level granularity. T = Female -> Y = NumberIndividuals_RC. Region + year as fixed effects; Institution as a GEE cluster group (marginal model with cluster-robust SE) – Institution has too many levels (~76) for an FE specification on the 148-row table without collinearity. Mirrors OTIS-RC’s glmmTMB nbinom2 clustered approach.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c03_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c03 – racial disparity in RC.

T = (Race == “Indigenous”) indicator -> Y = NumberIndividuals_RC. Year-FE + Gender-FE. Indigenous vs all-other contrast tests overrepresentation.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c04_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c04 – racial disparity by region.

T = (Race == “Indigenous”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c04_mrm_aggregate_region_cluster(df=None)

c04 alternative: Indigenous -> RC, GEE-clustered on region.

Sister to analyze_c04_ruhela_aggregate (which uses region as FE). With 5 regions, the GEE-cluster:Region inference treats region as a marginal-model grouping. Useful when wide region heterogeneity in case-mix matters for SE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c05_mrm_aggregate(df=None)

Aggregate RF on c05 – non-Christian religion -> individuals in RC.

T = (Religion != “Christian” and not “No Religion” and not “Unknown”) indicator. Tests population-level religious-minority overrepresentation in RC, controlling for region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c06_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c06 – age-50+ disparity in RC.

T = (Age_Category contains “50”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c07_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c07 – alert × gender -> RC count.

Counterpart of the per-row Ruhela formulation on a01/b01 at the aggregate level. Useful triangulation: c07 reports NumberIndividuals_InCustody / RC / Segregation cross-tabulated by Alert_Type × Gender × FiscalYear. The Ruhela formulation here contrasts an indicator of alert presence (vs. no-alert reference) with the number-in-RC count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c08_mrm_aggregate(df=None)

Aggregate RF on c08 – non-Christian religion × gender -> RC.

T = (Religion != Christian / no religion / unknown) indicator. Year + Gender FE. Companion to c05 with gender control.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c09_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c09 – age-50+ disparity by gender.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c10_mrm_aggregate(df=None)

Aggregate RF on c10 – gender disparity in median RC days.

c10 has Region × Institution × Gender × Measure (Maximum/Median/Mode) duration. Filter to Median; T = Female -> Y = TotalAggregatedDays_RC. Region-FE + Year-FE; Institution as GEE cluster group (25 levels) for cluster-robust marginal inference.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c11_mrm_aggregate(df=None)

Aggregate Ruhela formulation on c11 – long-duration cell concentration.

c11 has rows of (FY × Aggregate_Duration_bin × NumberIndividuals_RC × NumberIndividuals_Seg). 11 duration bins from “1 day” to “Greater than 30 days”. Rather than treat duration as the treatment-FE (it IS the categorical the data is binned by), we test T = (long-duration bin: ≥16 days) -> Y = NumberIndividuals_RC. Year-FE only.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_c12_mrm_aggregate(df=None)

Aggregate RF on c12 – gender disparity in median RC days by region.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d02_mrm_aggregate(df=None)

d02 – gender -> custodial death count, with year FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d03_mrm_aggregate(df=None)

d03 – Indigenous -> custodial death count, with year FE.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d04_mrm_aggregate(df=None)

d04 – non-majority religion -> custodial death count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_all_analyze.analyze_d05_mrm_aggregate(df=None)

d05 – age 50+ -> custodial death count.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_causal – IPW, AIPW, and DML estimators for OTIS.

Companion to morie.otis.otdml (which already implements cross-fitted Frisch-Waugh-Lovell DML). This module adds:

  • otis_ipw() – Hájek-stabilised inverse probability weighting with a logistic propensity model.

  • otis_aipw() – Augmented IPW (Robins-Rotnitzky-Zhao 1994 doubly-robust estimator) with cross-fitted nuisance models.

  • otis_causal_grid() – runs all three estimators (IPW, AIPW, DML) on the canonical three (treatment, outcome) pairs:

    1. MentalHealth_Alert -> SuicideRisk_Alert

    2. HighAlertComplexity -> AnyReadmission

    3. RegionalVolatility -> ConsecutiveSegregationDays

The three pairs cover the three Goffmanian framings of OTIS most relevant to the MA-paired thesis: clinical-alert chain (a), multi-alert “complexity” (b), and inter-region churn (c).

Estimators

Let $D \in \{0,1\}$ be a binary treatment, $Y$ a real-valued outcome, $X \in \mathbb{R}^p$ a covariate vector, and write $e(X) \coloneqq \Pr(D=1 \mid X)$ for the propensity score and $\mu_d(X) \coloneqq \mathbb{E}[Y \mid D=d,\, X]$ for the outcome-regression mean.

IPW (Hájek):

ATE_IPW = sum_i w1(i)·D_i·Y_i / sum_i w1(i)·D_i
        - sum_i w0(i)·(1-D_i)·Y_i / sum_i w0(i)·(1-D_i)

with w1(i) = 1/e(X_i), w0(i) = 1/(1-e(X_i)).
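
An illustrative NumPy rendering of the Hájek estimator above (a sketch only – the shipped otis_ipw() additionally clips propensities at eps and reports the Lunceford-Davidian sandwich SE):

```python
import numpy as np

def hajek_ipw_ate(y, d, e):
    """Hajek-stabilised IPW ATE: weighted treated mean minus weighted control mean."""
    y, d, e = (np.asarray(v, float) for v in (y, d, e))
    w1 = 1.0 / e            # weights for treated units
    w0 = 1.0 / (1.0 - e)    # weights for control units
    mu1 = np.sum(w1 * d * y) / np.sum(w1 * d)
    mu0 = np.sum(w0 * (1 - d) * y) / np.sum(w0 * (1 - d))
    return mu1 - mu0
```

The Hájek normalisation (dividing by the realised weight sums rather than n) keeps the estimator inside the observed outcome range even when a few propensities are extreme.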

AIPW (RRZ 1994 doubly-robust score):

psi(O_i; eta) = mu_1(X_i) - mu_0(X_i)
              + D_i·(Y_i - mu_1(X_i))/e(X_i)
              - (1-D_i)·(Y_i - mu_0(X_i))/(1-e(X_i))

ATE_AIPW = (1/n) sum_i psi(O_i; eta_hat)

with cross-fitted nuisance estimates eta_hat = (e_hat, mu_1_hat, mu_0_hat).
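
The score above can be checked numerically in a few lines of NumPy (illustrative – otis_aipw() cross-fits the nuisances rather than taking them as given):

```python
import numpy as np

def aipw_ate(y, d, e, mu1, mu0):
    """Doubly-robust AIPW ATE: mean (and naive SE) of the RRZ influence-function score."""
    y, d, e, mu1, mu0 = (np.asarray(v, float) for v in (y, d, e, mu1, mu0))
    psi = (mu1 - mu0
           + d * (y - mu1) / e           # treated residual, inverse-weighted
           - (1 - d) * (y - mu0) / (1 - e))  # control residual, inverse-weighted
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))
```

When the outcome models fit the data exactly, the augmentation terms vanish and the estimate collapses to the plug-in mean of mu_1 - mu_0.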

DML PLR:

Frisch-Waugh-Lovell partialled-out regression with cross-fitted residuals (already implemented in morie.otis.otdml()).
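
The partialling-out step can be sketched with plain OLS nuisances (the module uses cross-fitted ML learners; this fwl_theta helper is hypothetical and not part of the API):

```python
import numpy as np

def fwl_theta(y, d, X):
    """Frisch-Waugh-Lovell: residualise y and d on X, then regress
    residualised y on residualised d to recover theta."""
    X = np.asarray(X, float)
    X1 = np.column_stack([np.ones(len(X)), X])            # add intercept
    resid = lambda v: v - X1 @ np.linalg.lstsq(X1, v, rcond=None)[0]
    yt = resid(np.asarray(y, float))
    dt = resid(np.asarray(d, float))
    return float(yt @ dt / (dt @ dt))                      # theta_hat
```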

References

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89(427): 846-866.

Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21(1): C1-C68.

Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23(19): 2937-2960.

Robins, J. M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7: 1393-1512.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.

Attribution – Ruhela formulations

The OTIS-RC formulations and the entire codebase implemented here are the work of Vansh Singh Ruhela (hadesllm). No one co-formulated the analyses: the (T = ac ≥ 2, Y = vm count) alert-complexity -> regional-volatility contrast, the 8-state combo encoding, and the full method battery (IPW + AIPW + g-computation + PSM-NN + PSM-subclass + IRM-DML + match_first) are Ruhela’s design.

Three external contributors enabled the journey but did not co-author the formulations:

Doob – Prof. Emeritus Anthony N. Doob, University of Toronto.

Pointed Ruhela to the existence of the OTIS data feed – the first piece of the puzzle. Without that pointer, this project does not begin.

Levinsky – Prof. Zachary Levinsky, University of Toronto.

His graduate course on the Theory of Punishment was the broader context that motivated this work. He later reviewed the preliminary a01-only analysis from the OTIS-RC/ folder; he did not see the full method battery shipped here.

Medina – Prof. Zorro Medina, University of Toronto.

Reviewed the formal write-up; her feedback validated the statistical route and gave the pipeline confidence in its current form.

Naming conventions

RF – Ruhela formulation: a (treatment, outcome, covariates) design choice for a specific OTIS dataset.

RDF – Ruhela Dual Formulation: an RF paired with a Naive-arm sensitivity contrast (e.g., on a01/b01 the Ruhela arm T_high_ac -> vm_count is paired with the Naive arm any-flag -> vm-binary).

DLRM – Doob-Levinsky-Ruhela-Medina, short alias for the attribution chain on the methodology side. Function names like analyze_a01_dlrm are equivalent aliases for analyze_a01_ruhela_formulations.

ACKNOWLEDGEMENTS – beyond the methodology attribution above

The DLRM attribution covers the contributors whose work directly shaped the OTIS-RC analyses. The wider intellectual journey that brought this codebase into being also owes credit to:

Prof. Beatrice Jauregui, University of Toronto – for the network of mentorship that connected Ruhela to Prof. Medina, which made this line of work possible.

Prof. Ayobami Laniyonu, University of Toronto – for valuable lessons over the broader period of this work that informed Ruhela’s thinking on policing and corrections research.

The separation between methodology attribution (DLRM, narrow) and acknowledgements (broader, enabling) follows standard academic-publishing convention.

class morie.otis_causal.CausalEstimate(estimator: 'str', ate: 'float', ate_se: 'float', ate_pval: 'float', n: 'int', n_treated: 'int', p_treat: 'float', notes: 'list[str]' = <factory>)[source]

Bases: object

Parameters:
estimator: str
ate: float
ate_se: float
ate_pval: float
n: int
n_treated: int
p_treat: float
notes: list[str]
property ate_ci95: tuple[float, float]
morie.otis_causal.otis_ipw(df, *, treatment, outcome, covariates, eps=0.02, propensity_calibration='none')[source]

Hájek-stabilised IPW estimator of the ATE on OTIS data.

Estimates the propensity by logistic regression and weights treated/control outcomes by $1/\hat e$ and $1/(1-\hat e)$. Standard error follows the Lunceford-Davidian sandwich.

Parameters:
  • propensity_calibration ("none" | "platt" | "isotonic") – Calibrate the raw logistic propensities. Platt = sigmoid(a + b * logit(p)) fitted on the same data; isotonic = monotonic regression (sklearn). Calibration improves Brier score / calibration-curve diagnostics; for OTIS data with logistic propensities it usually moves the predicted prevalence toward the observed prevalence.

  • df (DataFrame)

  • treatment (str)

  • outcome (str)

  • covariates (list[str])

  • eps (float)

Return type:

CausalEstimate

morie.otis_causal.otis_aipw(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02, propensity_calibration='none')[source]

Augmented IPW (Robins-Rotnitzky-Zhao) ATE on OTIS data.

Uses n_folds cross-fitting: nuisance models (propensity, outcome-regression) are fit on $K-1$ folds and evaluated on the held-out fold, then aggregated. This is the doubly-robust influence-function plug-in estimator.

Parameters:
  • propensity_calibration ("none" | "platt" | "isotonic") – Calibrate the cross-fitted propensity scores within each fold’s test set against its training labels. Improves Brier- score / calibration-curve diagnostics.

  • df (DataFrame)

  • treatment (str)

  • outcome (str)

  • covariates (list[str])

  • n_folds (int)

  • seed (int)

  • eps (float)

Return type:

CausalEstimate

morie.otis_causal.otis_gcomputation(df, *, treatment, outcome, covariates, n_bootstrap=200, seed=123)[source]

Parametric g-computation (Robins 1986) of the ATE on OTIS data.

Fits an outcome regression $\hat\mu(d, X)$ via OLS, then computes the standardised mean difference

\hat\Delta = n^{-1} \sum_i (\hat\mu(1, X_i) - \hat\mu(0, X_i)).

SE via non-parametric bootstrap (resample rows, refit, recompute). Complements IPW and AIPW: g-computation is consistent when the outcome model is correctly specified (single-robust); IPW when the propensity model is; AIPW is doubly robust (consistent if either is correct).

References

Robins, J. M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7: 1393-1512.
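
A minimal sketch of the procedure with a linear outcome model (illustrative gcomp_ate helper; with no D × X interactions the standardised difference reduces to the treatment coefficient):

```python
import numpy as np

def gcomp_ate(y, d, X):
    """Parametric g-computation: fit y ~ [1, d, X] by OLS, then predict
    every unit under d=1 and d=0 and average the difference."""
    y, d, X = np.asarray(y, float), np.asarray(d, float), np.asarray(X, float)
    n = len(y)
    M = np.column_stack([np.ones(n), d, X])
    beta = np.linalg.lstsq(M, y, rcond=None)[0]
    M1 = np.column_stack([np.ones(n), np.ones(n), X])   # everyone treated
    M0 = np.column_stack([np.ones(n), np.zeros(n), X])  # everyone control
    return float(np.mean(M1 @ beta - M0 @ beta))
```

The bootstrap SE then just wraps this in a resample-rows / refit / recompute loop.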

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_psm_subclass(df, *, treatment, outcome, covariates, n_strata=5, eps=0.02)[source]

Propensity-score stratification (Rosenbaum & Rubin 1983).

Splits the sample into n_strata strata by propensity-score quantile, computes within-stratum ATE = E[Y|D=1, S=s] - E[Y|D=0, S=s], and weights stratum-specific ATEs by stratum size to get the population ATE. Standard error: weighted average of within-stratum SEs (Rosenbaum-Rubin convention).

n_strata=5 is the standard “rule of thumb” – Cochran (1968) showed 5 strata remove ~90% of bias from a single confounder.

References

Rosenbaum, P. R. & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.

Cochran, W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 24(2): 295-313.
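
The stratification logic can be sketched as follows (illustrative psm_subclass_ate helper; the shipped version also reports the Rosenbaum-Rubin weighted SE):

```python
import numpy as np

def psm_subclass_ate(y, d, e, n_strata=5):
    """Stratify on propensity-score quantiles; average within-stratum
    ATEs weighted by stratum size."""
    y, d, e = (np.asarray(v, float) for v in (y, d, e))
    edges = np.quantile(e, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)
    ate, n = 0.0, len(y)
    for s in range(n_strata):
        m = strata == s
        if d[m].sum() == 0 or (1 - d[m]).sum() == 0:
            continue  # stratum missing one arm; this sketch simply skips it
        ate += (m.sum() / n) * (y[m][d[m] == 1].mean() - y[m][d[m] == 0].mean())
    return ate
```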

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_atc(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02)[source]

AIPW-flavoured Average Treatment effect on the Controls (ATC).

The IF for ATC = E[Y(1) - Y(0) | D = 0] is

psi^{ATC}(O; eta) = (1-e(X))/Pr(D=0) · [ mu_1(X) - mu_0(X)
                  + D·(Y - mu_1(X)) / e(X)
                  - (1-D)·(Y - mu_0(X)) / (1-e(X)) ],

i.e. the doubly-robust AIPW score reweighted toward the untreated stratum and normalised by Pr(D=0). The implementation reuses the cross-fitted nuisance pipeline of otis_aipw and evaluates this reweighted score.

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_plr(df, *, treatment, outcome, covariates, n_folds=3, seed=123, ml_outcome='rf', ml_treatment='rf')[source]

Partially Linear Regression DML (Chernozhukov et al. 2018).

Fits the structural model

Y = θ * D + g(X) + ε, D = m(X) + V

via Frisch-Waugh-Lovell partialling-out: cross-fit nuisances g(X) = E[Y|X] and m(X) = E[D|X], then regress the residualised Y on the residualised D to recover θ.

Cross-fitted random-forest nuisance models match the OTIS-RC DoubleML/PLR pipeline. Differs from IRM-DML in that PLR assumes a constant (homogeneous) treatment effect θ; IRM allows D × X interactions in the outcome model. PLR is the simpler / more restricted estimator. Concordance with IRM-DML is the standard “no effect heterogeneity” robustness signal.

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_aipw_superlearner(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02, propensity_calibration='none')[source]

SuperLearner-stacked AIPW (cross-fitted convex stack of learners).

Stacks four learners – random forest, ridge, OLS/logistic, mean – via cross-validated convex weights minimising MSE (regression) or log-loss (propensity). Mirrors the OTIS-RC AIPW SuperLearner pipeline (notez1a.qmd lines 2028-2048: SL.glmnet, SL.xgboost, SL.glm, SL.mean) as a lighter sklearn-only stack. xgboost is optional; if importable, it is added to the stack.

Falls back to plain AIPW if sklearn isn’t available.

References

van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super Learner. Stat. Appl. Genet. Mol. Biol. 6(1): Article 25.

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_psm(df, *, treatment, outcome, covariates, k=1, caliper_sd=0.2, with_replacement=False, eps=0.02)[source]

1:k nearest-neighbour propensity-score matching on logit(e(X)).

Estimates the average treatment effect on the treated (ATT) by pairing each treated unit with its $k$ nearest control units on the logit-propensity scale. An optional caliper (default $0.2\,\sigma_{\mathrm{logit}\,e}$, the Austin 2011 convention) discards treated units with no control inside the caliper. When $k=1$ and with_replacement=False the SE is the paired-difference standard error $\sqrt{\mathrm{Var}(Y_t - Y_{c(t)}) / m}$, where $m$ is the number of matched pairs.

References

Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioural Research 46(3): 399-424.

Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1): 235-267.
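
An illustrative 1:1 (k = 1, without replacement on the control index per match) version of the matcher – a hypothetical psm_att_1nn helper, not the shipped implementation:

```python
import numpy as np

def psm_att_1nn(y, d, e, caliper_sd=0.2):
    """ATT via 1:1 nearest-neighbour matching on logit(e) with an
    Austin-style caliper of caliper_sd * SD(logit(e))."""
    y, d, e = (np.asarray(v, float) for v in (y, d, e))
    lg = np.log(e / (1 - e))                       # logit-propensity scale
    caliper = caliper_sd * lg.std(ddof=1)
    t_idx, c_idx = np.flatnonzero(d == 1), np.flatnonzero(d == 0)
    diffs = []
    for i in t_idx:
        dist = np.abs(lg[c_idx] - lg[i])
        j = dist.argmin()
        if dist[j] <= caliper:                     # discard treated units outside caliper
            diffs.append(y[i] - y[c_idx[j]])
    return float(np.mean(diffs)) if diffs else float("nan")
```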

Parameters:
Return type:

CausalEstimate

morie.otis_causal.otis_balance(df, *, treatment, covariates, outcome=None, caliper_sd=0.2, eps=0.02)[source]

Per-covariate standardised mean differences before / after weighting / matching.

Returns a DataFrame with columns covariate, smd_raw, smd_ipw, smd_psm. Conventionally $|\mathrm{SMD}| > 0.10$ flags meaningful imbalance and $|\mathrm{SMD}| > 0.25$ strong imbalance (Austin 2011).
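
A sketch of the weighted SMD behind these columns (hypothetical smd helper with a pooled-variance denominator; smd_ipw corresponds to passing the IPW weights as w):

```python
import numpy as np

def smd(x, d, w=None):
    """Weighted standardised mean difference of covariate x between arms."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    def wstats(m):
        mu = np.average(x[m], weights=w[m])
        var = np.average((x[m] - mu) ** 2, weights=w[m])
        return mu, var
    mu1, v1 = wstats(d == 1)
    mu0, v0 = wstats(d == 0)
    return (mu1 - mu0) / np.sqrt((v1 + v0) / 2)    # pooled-SD denominator
```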

Parameters:
Return type:

DataFrame

morie.otis_causal.otis_overlap(df, *, treatment, covariates, eps=0.02)[source]

Propensity-score overlap diagnostics.

Returns a dict with min/max propensity per group, fraction inside the trimming interval $[\varepsilon, 1-\varepsilon]$, Hellinger distance between treated/control propensity densities, and the count of clipped propensities.
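
The Hellinger-distance component can be sketched on histogram densities (an assumption – the shipped diagnostic may bin the propensities differently):

```python
import numpy as np

def hellinger(p_hist, q_hist):
    """Hellinger distance between two discrete (histogram) densities:
    0 for identical densities, 1 for disjoint support."""
    p = np.asarray(p_hist, float); p = p / p.sum()
    q = np.asarray(q_hist, float); q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
```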

Parameters:
Return type:

dict

morie.otis_causal.otis_irm_dml(df, *, treatment, outcome, covariates, cluster_cols=None, n_folds=3, seed=123, eps=0.02, ml_outcome='rf', ml_propensity='rf', match_first=False, match_caliper_sd=0.2)[source]

Interactive-Regression-Model DML, matching DoubleML in R.

Computes both the ATE (population average treatment effect) and ATTE (average treatment effect on the treated) via the doubly-robust influence function

psi^{ATE}(O; eta) = mu_1(X) - mu_0(X)
                  + D·(Y - mu_1(X)) / e(X)
                  - (1-D)·(Y - mu_0(X)) / (1 - e(X)),

psi^{ATTE}(O; eta) = [ D·(Y - mu_0(X))
                     - e(X)·(1-D)·(Y - mu_0(X)) / (1 - e(X)) ] / D̄,

where D̄ is the sample share of treated units, with cross-fitted random-forest nuisance models. Standard errors can be cluster-robust on one or two cluster axes (Cameron, Gelbach & Miller 2011; Liang & Zeger 1986).

Returns a dict with both estimands, their SEs, and 95% CIs.

Parameters:
  • cluster_cols (str | list[str] | None) – Column names of cluster axes. None ⇒ heteroskedasticity-consistent (no clustering). "id" or ["id"] ⇒ one-way cluster on id. ["id", "rc"] ⇒ two-way Cameron-Gelbach-Miller.

  • match_first (bool) – If True, first run 1:1 nearest-neighbour propensity-score matching on logit(ê(X)) with caliper match_caliper_sd × SD(logit(ê)), then fit IRM-DML on the matched subset only. Mirrors the MatchIt-then-DML pipeline in notez1a.qmd.

  • df (DataFrame)

  • treatment (str)

  • outcome (str)

  • covariates (list[str])

  • n_folds (int)

  • seed (int)

  • eps (float)

  • ml_outcome (str)

  • ml_propensity (str)

  • match_caliper_sd (float | None)

Return type:

dict

morie.otis_causal.otis_per_year_irm_dml(df, *, treatment, outcome, covariates, year_col='EndFiscalYear', cluster_cols=None, n_folds=3, seed=123, full_battery=False, propensity_calibration='none')[source]

Fit IRM-DML separately on each fiscal year.

Returns a dict keyed by year value. When full_battery=False (default), each year’s value is the standard otis_irm_dml output (mirrors Ruhela’s res_by_year in correctional_stats_report1z.RData).

When full_battery=True, each year’s value is a dict {estimator_name: CausalEstimate-or-irm-dict} with all 10 Ruhela-formulation estimators (IPW, AIPW, g-computation, PSM 1:1 NN, PSM subclass, IRM-DML, PSM->IRM-DML match_first, ATC-AIPW, PLR, SuperLearner-AIPW). Roughly 7× the per-year runtime.

Parameters:
Return type:

dict

morie.otis_causal.make_pair_alert_to_volatility_ruhela(df)[source]

Ruhela’s primary causal contrast: high alert complexity (ac ≥ 2) causes more inter-region transfers (vm).

This is “Ruhela’s equation” for alert complexity and volatility – the 8-state combo encoding documented in /path/to/workspace/OTIS-RC/notez1a.qmd and used to produce the published res_pool / res_by_year / res_all estimates in correctional_stats_report1z.RData. Sister function make_pair_alert_to_volatility_naive() is available as a robustness alternative; both are run side by side by make_pair_alert_to_volatility_all().

Encoding

For each placement-row, the three binary alerts (MentalHealth_Alert, SuicideRisk_Alert, SuicideWatch_Alert) encode one of $2^3 = 8$ states:

a1 = MH & ¬SR & ¬SW (mental-health only)
a2 = ¬MH & SR & ¬SW (suicide-risk only)
a3 = ¬MH & ¬SR & SW (suicide-watch only – empirically empty)
a4 = MH & SR & ¬SW (MH+SR)
a5 = ¬MH & SR & SW (SR+SW)
a6 = MH & ¬SR & SW (MH+SW – empirically empty)
a7 = MH & SR & SW (all three)
a8 = ¬MH & ¬SR & ¬SW (no alerts)

Per (UniqueIndividual_ID, EndFiscalYear), the alert-state complexity ac is the number of distinct combos with positive support across the year’s placement-rows for that person. This differs from the naïve “max simultaneous flags” – a person who had three a4 placements has ac = 1 (one distinct combo), not ac = 2 (two simultaneous flags). Treatment T = 1 iff ac ≥ 2.
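
The distinct-combo counting can be sketched in pandas (hypothetical alert_complexity helper; column names follow the OTIS schema described above):

```python
import pandas as pd

def alert_complexity(df):
    """Per (id, year): count DISTINCT alert combos, not max simultaneous flags.
    Returns the binary treatment T = 1 iff ac >= 2."""
    combo = (df["MentalHealth_Alert"].astype(int) * 4
             + df["SuicideRisk_Alert"].astype(int) * 2
             + df["SuicideWatch_Alert"].astype(int))          # 0..7 combo code
    ac = combo.groupby([df["UniqueIndividual_ID"], df["EndFiscalYear"]]).nunique()
    return (ac >= 2).astype(int).rename("T_high_ac")
```

A person with three identical MH+SR placements has one distinct combo (T = 0) even though two flags are simultaneously set – exactly the distinction drawn above.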

Outcome

vm (regional volatility moves, count) sums two indicators across the person-year placement sequence:

  1. regA ≠ regB within the same row (the person’s original-vs-most-recent regions differ in this placement);

  2. regA ≠ regA(prev row) across rows (the person’s original-region label changed across consecutive placements).

This is a Poisson-style count outcome on $\{0, 1, 2, \ldots\}$.

Returns:

(data, T, Y, covariates), where T = “T_high_ac” (binary, 1 iff ac ≥ 2), Y = “Y_vm_count” (integer regional-transfer count), and covariates = [“Gender”, “Age_Category”, “EndFiscalYear”].

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.make_pair_alert_to_volatility_naive(df)[source]

Naïve (simpler) alternative to make_pair_alert_to_volatility_ruhela().

Treatment ac:

per (id, year), the maximum number of simultaneous alert flags across the year’s placement rows (i.e. max(MH + SR + SW) per row). T = 1 iff this maximum is ≥ 2.

Outcome vm (binary):

per (id, year), 1 iff any placement row has Region_AtTimeOfPlacement ≠ Region_MostRecentPlacement.

This was the original morie formulation before we audited against OTIS-RC/notez1a.qmd; it produces a different treatment marginal (~24% treated vs ~14% under Ruhela’s 8-state encoding) and a binary rather than count outcome. We keep it for robustness comparison: if both formulations agree on sign and rough magnitude, the substantive conclusion is invariant to the operationalisation.

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.make_pair_alert_to_volatility_all(df)[source]

Run both Ruhela and naïve alert->volatility formulations.

Returns a dict {"ruhela": (...), "naive": (...)} where each value is a 4-tuple (person_year_data, T, Y, covariates) suitable for handing to otis_irm_dml().

Parameters:

df (DataFrame)

Return type:

dict

morie.otis_causal.make_pair_alert_to_volatility(df)

Alias of make_pair_alert_to_volatility_ruhela() – Ruhela’s primary causal contrast: high alert complexity (ac ≥ 2) causes more inter-region transfers (vm). See that function for the full 8-state encoding, the vm outcome definition, and the returned (data, T, Y, covariates) 4-tuple.

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.otis_dml(df, *, treatment, outcome, covariates, n_folds=5, seed=123)[source]

Cross-fitted DML PLR (partialling-out) wrapper.

Reproduces morie.otis.otdml() but exposes the same CausalEstimate shape for grid comparison.

Parameters:
Return type:

CausalEstimate

morie.otis_causal.make_pair_a(df)[source]

Pair (a): MentalHealth_Alert -> SuicideRisk_Alert (binary->binary).

The clinical-alert chain: do mental-health flags causally elevate subsequent suicide-risk-alert occurrence, conditional on demographics and region?

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.make_pair_b(df)[source]

Pair (b): HighAlertComplexity -> AnyReadmission.

Treatment T_b = 1 iff at least 2 of {MentalHealth, SuicideRisk, SuicideWatch} alerts are simultaneously active in the year. Outcome Y_b = 1 iff this individual has ≥2 placements (proxy for any future readmission).

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.make_pair_c(df)[source]

Pair (c): RegionalVolatility -> SegregationDays.

Treatment T_c = 1 iff Region_AtTimeOfPlacement ≠ Region_MostRecentPlacement (the inter-region transfer flag). Outcome Y_c = NumberConsecutiveDays_Segregation (continuous, truncated at the 99th percentile to dampen the long tail).

Parameters:

df (DataFrame)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.make_pair_alert_to_volatility_a01(df=None)[source]

Same as make_pair_alert_to_volatility_ruhela() but auto-loads a01 (Restrictive Confinement Detailed) instead of b01 (Segregation Detailed) if df is None.

a01 is the file the published res_pool / res_by_year / res_all are computed on; using it in our pipeline closes the “different bundled snapshot” gap.

Parameters:

df (DataFrame | None)

Return type:

tuple[DataFrame, str, str, list[str]]

morie.otis_causal.otis_causal_grid(df=None, *, seed=123)[source]

Run IPW / AIPW / DML on all three canonical (T, Y) pairs.

Returns a RichResult with a 9-row payload table – three estimators × three pairs – plus per-row standard errors, p-values, and 95% confidence intervals.

Parameters:
Return type:

RichResult

morie.otis_churn – Goffmanian institutional-churn analyses on OTIS.

Goffman (1961) describes ‘total institutions’ as totalizing environments that produce a cyclical relationship to the inmate. This module operationalises the cyclical / mortifying / embedding dimensions as formal statistical tests:

repeat_placement_concentration(b09) – Gini + power-law fit + KS test

within_year_placement_count(b01) – per-(id, year) cell-size distribution

within_year_region_diversity(b01) – intra-year region cycling

mortification_cooccurrence(b01) – joint alert χ² + Cramér’s V

disciplinary_medical_overlap(b01) – disc × med-protect co-occurrence

embedding_distribution(b02) – total-days survival (lognormal vs Pareto AIC)

intra_year_transition_matrix(a01) – Markov regA->regA’ within FY

path_complexity_gini(b01) – Gini split by (year × region)

region_alert_state_richness(b01) – distinct (region × combo) per cell

regC_demog_contingency(b01) – multi-region × Gender × Age χ²

irr_glmm_vm(b01) – Poisson + NB IRR for vm

Each emits a RichResult.

NOTE: OTIS UniqueIndividual_ID is anonymised as YYYY-XXXXX-AA, where AA is the dataset acronym (RC for a01, SG for b01/b02, DC for d01). The YYYY prefix locks each ID to one fiscal year, AND IDs are re-randomized per dataset file even within the same year. Longitudinal individual-level linkage and cross-dataset linkage are therefore both impossible by design. All metrics above are either aggregate (b09 counts), within-year individual (b01), or within-year total-days (b02). See docs/source/methods/otis_linkage.md.

morie.otis_churn.gini(x)[source]
Parameters:

x (ndarray)

Return type:

float
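The Gini computation used throughout this module can be sketched in plain Python via the sorted-rank identity (an illustrative re-implementation for a 1-D sequence, not the morie source, which takes an ndarray):

```python
def gini(x):
    """Gini coefficient of a non-negative 1-D sequence.

    Uses the sorted-rank identity:
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    where x is sorted ascending and i runs from 1 to n.
    """
    xs = sorted(x)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # degenerate: no dispersion to measure
    rank_sum = sum(i * v for i, v in enumerate(xs, start=1))
    return 2.0 * rank_sum / (n * total) - (n + 1) / n
```

For example, a perfectly equal vector gives 0, while concentrating everything on one of four individuals gives 0.75.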

morie.otis_churn.repeat_placement_concentration(df=None)[source]

Goffman’s ‘cyclical inmate’ – heavy concentration of placements among a small fraction of individuals.

Uses OTIS b09 (NumberPlacements × NumberIndividuals).

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.within_year_placement_count(df=None)[source]

Distribution of segregation placements per (individual × fiscal year) cell.

Because OTIS UniqueIndividual_ID is anonymised per fiscal year (YYYY-XXXXX-AA), each (id, year) cell is one anonymous person-year. This metric describes how concentrated the placements are within a fiscal year – Goffman’s intra-year cycling. Cross-year readmission is not measurable in OTIS by design. See OTIS_LINKAGE.md.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.within_year_region_diversity(df=None)[source]

Distribution of distinct regions per individual within one fiscal year.

Because OTIS UniqueIndividual_ID is anonymised per fiscal year (YYYY-XXXXX-AA), this metric measures intra-year region cycling only – not multi-year mobility. See OTIS_LINKAGE.md.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.mortification_cooccurrence(df=None)[source]

Goffman’s ‘mortification of self’ is operationalised here as concurrent stigmatising-alert flags on a single placement.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.disciplinary_medical_overlap(df=None)[source]

Goffman’s ‘tinkering trades’ tension: the same institution classifies the same person via medical AND disciplinary rationales. We look at co-occurrence on individual records.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.embedding_distribution(df=None)[source]

TotalAggregatedDays_Segregation distribution per individual per fiscal year. Compare lognormal vs Pareto fit via AIC.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.intra_year_transition_matrix(df=None)[source]

Intra-fiscal-year Markov transition matrix over regions (regA -> regA’), built from OTIS a01.

Parameters:

df (DataFrame | None)

Return type:

RichResult
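The row-normalised Markov matrix behind this function can be sketched as follows (an illustrative stand-in: the input shape and region labels here are hypothetical, not the a01 schema):

```python
from collections import Counter, defaultdict

def transition_matrix(region_sequences):
    """Row-normalised first-order Markov matrix over region labels.

    region_sequences: iterable of per-individual, chronologically
    ordered region lists within one fiscal year.
    Returns {from_region: {to_region: probability}}.
    """
    counts = defaultdict(Counter)
    for seq in region_sequences:
        # count each consecutive (from, to) pair
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    matrix = {}
    for a, row in counts.items():
        total = sum(row.values())
        matrix[a] = {b: c / total for b, c in row.items()}
    return matrix
```

Each row sums to 1, so matrix["Toronto"]["Central"] reads as the estimated probability of moving Toronto → Central on the next placement within the fiscal year.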

morie.otis_churn.path_complexity_gini(df=None)[source]

Gini of placements-per-(id × year) cells, split by (year, region).

Extends within_year_placement_count by reporting per-region and per-year Gini coefficients, exposing where Goffmanian cycling is most concentrated.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.region_alert_state_richness(df=None)[source]

Distinct (region × alert-combo) states occupied per (id × year).

Goffman’s mortification × institutional-mobility hybrid: how many distinct role-zones (region paired with alert presentation) does a person traverse within one fiscal year?

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.regC_demog_contingency(df=None)[source]

Multi-region path indicator (regC = number of distinct regions visited within FY) × Gender × Age contingency, with χ² + Cramér’s V.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.irr_glmm_vm(df=None)[source]

Poisson + Negative-Binomial GLM IRR table for vm ~ T_high_ac plus demographic covariates, on the (UniqueIndividual_ID, FY) cell.

A simplified implementation of the OTIS-RC glmmTMB pipeline: no random effect (statsmodels can’t easily do mixed NB on this scale). Provides Poisson and NB2 IRR side by side as a robustness pair.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.otis_churn.analyze_all(out_dir=None)[source]
Parameters:

out_dir (Path | None)

Return type:

dict[str, RichResult]

morie.otis_datasets – registry + loader for all 29 OTIS CSVs.

The Ontario Tracking Information System (OTIS) is the open-data feed for Ontario’s adult correctional system, published by the Ministry of the Solicitor General’s Service Management and Oversight Branch under the Jahn settlement framework. The 29 datasets in the official A01RCDD release fall into 4 series:

a-series (1 dataset) – Restrictive confinement detailed (the canonical per-day record file used in Ruhela 2026 OTIS-RC)

b-series (9 datasets) – Segregation placement records

c-series (12 datasets) – Individuals in segregation & restrictive confinement

d-series (7 datasets) – Deaths in custody

CSV path convention:

data/datasets/OTIS/<id>_<short_label>.csv

Where <id> is the canonical dataset name from the official Ontario data dictionary (a01, b01, c03, d05, …).

Public API:

DATASET_REGISTRY – dict mapping id -> metadata

load_otis_dataset(id) – returns DataFrame

list_otis_datasets() – list of all (id, description, n_cols)

download_otis_dataset(id, target_dir=None) – fetch a fresh copy from data.ontario.ca via the CKAN action API

download_all_otis(target_dir=None) – fetch every file in the package

OTIS_DATA_DIR – Path

OTIS_CKAN_PACKAGE – package id on data.ontario.ca

References

Ministry of the Solicitor General. Data on Inmates in Ontario. https://data.ontario.ca/dataset/data-on-inmates-in-ontario

Ministry of the Solicitor General. Restrictive Confinement, Segregation and Deaths in Custody Data Dictionary, 2025-11-03. https://data.ontario.ca/dataset/data-on-inmates-in-ontario/resource/d83fe893-9634-4794-a0c1-c17bf619a95a

class morie.otis_datasets.OtisDataset(id, series, csv_filename, description, columns, panel, primary_metric)[source]

Bases: object

One OTIS dataset’s metadata.

Parameters:
id: str
series: str
csv_filename: str
description: str
columns: tuple[str, ...]
panel: bool
primary_metric: str
property csv_path: Path
morie.otis_datasets.load_otis_dataset(dataset_id, data_dir=None)[source]

Load one OTIS dataset by id (b01, c03, d05, …).

Returns a pandas DataFrame with the columns documented in DATASET_REGISTRY[dataset_id].columns.

Parameters:
Return type:

DataFrame

morie.otis_datasets.list_otis_datasets()[source]

List all 29 OTIS datasets as (id, description, n_columns).

Return type:

list[tuple[str, str, int]]

morie.otis_datasets.download_otis_dataset(dataset_id, target_dir=None, *, overwrite=False)[source]

Fetch one OTIS CSV from CKAN and write to target_dir.

Returns the local path to the downloaded file. No-op if the file already exists and overwrite=False.

Parameters:
Return type:

Path

morie.otis_datasets.download_all_otis(target_dir=None, *, overwrite=False, skip_missing=True)[source]

Fetch every CSV in the OTIS registry from CKAN.

Returns {dataset_id: local_path} for successfully downloaded files. When skip_missing=True (default), files that aren’t found on CKAN are silently skipped; otherwise the function raises.

Parameters:
Return type:

dict[str, Path]

morie.otis_tps_overlay – cross-link OTIS (Ontario corrections) with TPS (Toronto police) data.

Both feeds touch Toronto, so meaningful overlays:

  1. Year-over-year correlation – does annual segregation/RC use in the OTIS “Toronto” region track annual TPS incident counts?

  2. Per-OTIS-region rollups – total seg/RC counts per region, with the Toronto-region row joinable to TPS aggregates.

  3. Crime × confinement composite – combine TPS composite-risk index (per-neighbourhood) with OTIS seg/RC totals (Toronto-region only) to surface high-correlation years.

All emit RichResult.

morie.otis_tps_overlay.yoy_correlation(tps_categories=None, *, sample_rows=50000)[source]

Year-by-year correlation between OTIS Toronto-region segregation placements and TPS crime incident counts (per category).

Parameters:
  • tps_categories (list[str] | None)

  • sample_rows (int | None)

Return type:

RichResult

morie.otis_tps_overlay.per_region_rollup(*, tps_total_by_year=None)[source]

OTIS seg/RC totals per region × year, with the Toronto row flagged for TPS-overlay use. Also splits by gender.

Parameters:

tps_total_by_year (Series | None)

Return type:

RichResult

morie.otis_tps_overlay.composite_overlay(*, sample_rows=30000)[source]

For each TPS category, compute year-aligned correlation with OTIS Toronto-region segregation. Same idea as yoy_correlation but uses ALL fiscal years jointly via concat.

Parameters:

sample_rows (int | None)

Return type:

RichResult

morie.otis_tps_overlay.analyze_all(out_dir=None, *, sample_rows=30000)[source]
Parameters:
  • out_dir (Path | None)

  • sample_rows (int | None)

Return type:

dict[str, RichResult]

TPS — Toronto Police Service

morie.tps_io – multi-format readers for TPS open-data exports.

Each TPS category at data/datasets/TPS/<Category>/ has 9 sibling format exports of the same incident records:

CSV – pandas.read_csv (canonical fast path)

Excel – pandas.read_excel via openpyxl

GeoJSON – pure JSON, geometry as Python list-of-coords

FeatureCollection – ESRI JSON variant of GeoJSON

KML / KMZ – zipfile + ElementTree, geometry as coord list

GeoPackage – sqlite3, geometry stored as WKB blobs

SQLiteGeodatabase – sqlite3, ESRI variant of GeoPackage

Shapefile – needs pyshp; graceful degrade if not installed

FileGeoDatabase – needs fiona / gdal; graceful degrade

This module returns a pandas.DataFrame for every format, with the geometry (when available) attached as a geometry column whose values are simple Python lists/tuples – no shapely / geopandas dependency.

For NeighbourhoodCrimeRates the geometry is Polygon; for incident datasets the geometry is Point. In both cases the coordinates are (longitude, latitude) tuples in WGS84.

Public API:

load_tps(name, format='csv', nrows=None) -> pd.DataFrame

list_tps_formats(name) -> dict[str, Path]

available_formats() -> list of supported format names

morie.tps_io.load_tps(name, format='csv', nrows=None)[source]

Load TPS dataset name in the given format.

Parameters:
  • name (str) – TPS category name, e.g. “Assault”, “Homicides” (case-insensitive).

  • format (str) – one of csv | excel | geojson | featurecollection | kml | geopackage | sqlitegeodatabase | shapefile. “filegeodatabase” requires fiona/gdal – not supported on this build.

  • nrows (int | None) – sample-size cap; None = full dataset.

Return type:

DataFrame

morie.tps_io.list_tps_formats(name)[source]

Map format name -> path of the file that would be loaded. For formats not present on disk, the path is omitted from the dict.

Parameters:

name (str)

Return type:

dict[str, Path]

morie.tps_io.available_formats()[source]

List of formats this build can actually load.

Return type:

list[str]

morie.tps_datasets – registry + loader for Toronto Police Service public crime datasets.

13 categories at data/datasets/TPS/<Category>/CSV/:

Assault, AutoTheft, BicycleTheft, BreakandEnter, CommunitySafetyIndicators, HateCrimes, Homicides, IntimatePartnerAndFamilyViolence, NeighbourhoodCrimeRates, Robbery, ShootingAndFirearmDiscarges, TheftFromMovingVehicle, TheftOver

Each category also has Excel/Shapefile/GeoJSON/etc; this module reads the CSV by default but can be pointed at any sibling format.

Schema (per-incident, varies slightly by category):

OBJECTID, EVENT_UNIQUE_ID, REPORT_DATE, OCC_DATE, REPORT_YEAR, REPORT_MONTH, REPORT_DAY, REPORT_DOY, REPORT_DOW, REPORT_HOUR, OCC_YEAR, OCC_MONTH, OCC_DAY, OCC_DOY, OCC_DOW, OCC_HOUR, DIVISION, LOCATION_TYPE, PREMISES_TYPE, UCR_CODE, UCR_EXT, OFFENCE, CSI_CATEGORY, HOOD_158, NEIGHBOURHOOD_158, HOOD_140, NEIGHBOURHOOD_140, LONG_WGS84, LAT_WGS84, x, y

Public API:

TPS_REGISTRY – dict: name -> TpsDataset

load_tps_dataset(name) -> pd.DataFrame

list_tps_datasets() -> list of (name, n_rows_est, …)

class morie.tps_datasets.TpsDataset(name, description, primary_date, has_geometry)[source]

Bases: object

One TPS open-data category.

Parameters:
  • name (str)

  • description (str)

  • primary_date (str)

  • has_geometry (bool)

name: str
description: str
primary_date: str
has_geometry: bool
property csv_dir: Path
csv_path(fname=None)[source]
Parameters:

fname (str | None)

Return type:

Path

morie.tps_datasets.load_tps_dataset(name, csv_filename=None, nrows=None, *, format=None)[source]

Load one TPS dataset by category name.

name is case-insensitive. Pass nrows=N for a quick sample while developing on the largest tables (Assault is 254k rows).

format (optional) – load from a non-CSV format. Valid: csv, excel, geojson, featurecollection, kml, geopackage, sqlitegeodatabase, shapefile. Default csv. Geometric formats attach a geometry column (Python coord lists). See morie.tps_io.load_tps for details.

Parameters:
  • name (str)

  • csv_filename (str | None)

  • nrows (int | None)

  • format (str | None)

Return type:

DataFrame

morie.tps_datasets.list_tps_datasets()[source]

List all TPS datasets as (name, description, primary_date_col).

Return type:

list[tuple[str, str, str]]

morie.tps_crime – cross-category crime analyses for TPS datasets.

Builds on top of morie.tps_datasets + tps_io to compare incidents ACROSS the 13 TPS categories. Useful for:

  • Comparing year-over-year trends side-by-side

  • Identifying neighbourhoods that are ‘hot’ across many crime types

  • Computing bivariate spatial correlation (Moran’s BV) between two crime categories on the same neighbourhoods

  • Building a composite crime-risk index per neighbourhood

All emit RichResult.

morie.tps_crime.yoy_panel(categories=None, *, sample_rows=50000)[source]

Side-by-side year-over-year panel across TPS categories.

Parameters:
  • categories (list[str] | None)

  • sample_rows (int | None)

Return type:

RichResult

morie.tps_crime.composite_index(categories=None, *, sample_rows=50000, weights=None, top_n=25)[source]

Build a composite per-neighbourhood crime-risk index.

For each category, count incidents per HOOD_158, z-standardise across neighbourhoods, then sum (or weight-and-sum) to get a single composite. Higher index = neighbourhood with elevated incidence across many crime types.

weights lets you tune the blend (e.g. weights={'Homicides': 5.0, 'BicycleTheft': 0.5}). Default = unit weights for all loaded categories.

Parameters:
Return type:

RichResult
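The z-standardise-then-sum construction described above can be sketched in plain Python (an illustrative re-implementation on toy dicts; the real function works on HOOD_158 incident counts and returns a RichResult):

```python
import statistics

def composite_index(counts_by_category, weights=None):
    """counts_by_category: {category: {hood: count}}.

    Z-standardise each category's counts across neighbourhoods,
    then weight-and-sum into one composite per neighbourhood.
    """
    hoods = sorted({h for c in counts_by_category.values() for h in c})
    composite = {h: 0.0 for h in hoods}
    for cat, counts in counts_by_category.items():
        w = 1.0 if weights is None else weights.get(cat, 1.0)
        vals = [counts.get(h, 0) for h in hoods]
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals) or 1.0  # guard: constant category adds nothing
        for h, v in zip(hoods, vals):
            composite[h] += w * (v - mu) / sd
    return composite
```

A category with identical counts everywhere contributes zero, so only categories that actually discriminate between neighbourhoods move the index.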

morie.tps_crime.bivariate_morans_i(cat_a, cat_b, *, sample_rows=50000, k_neighbours=5)[source]

Bivariate Moran’s I – does category A’s count in a hood correlate with category B’s count in NEIGHBOURING hoods?

Parameters:
  • cat_a (str)

  • cat_b (str)

  • sample_rows (int | None)

  • k_neighbours (int)

Return type:

RichResult

morie.tps_crime.category_correlation_matrix(*, sample_rows=50000)[source]

Pearson correlation of per-hood incident counts across all 13 TPS categories.

Parameters:

sample_rows (int | None)

Return type:

RichResult

morie.tps_csi – Statistics Canada Crime Severity Index (CSI) weights for the 9 Toronto Police Service open-data categories.

The Crime Severity Index, introduced in Wallace et al. (2009) and maintained by the Canadian Centre for Justice and Community Safety Statistics, weights each Criminal Code offence by:

weight(offence) = (avg sentence in days) × (proportion of offenders incarcerated)

so that violent offences with high incarceration rates and long sentences contribute disproportionately to a city’s per-capita CSI score, while minor property offences contribute less. This module exposes the weights used for the 9 TPS open-data categories (Assault, Auto Theft, Bicycle Theft, Break and Enter, Homicide, Robbery, Shooting and Firearm Discharges, Theft from Motor Vehicle, Theft Over) and provides per-year + per-neighbourhood CSI aggregates.

Important caveats

  1. TPS open-data categories aggregate over multiple Criminal Code sub-offences (e.g. “Assault” covers Assault Level 1, Level 2 with a weapon, Level 3 aggravated, plus a small share of family and assault-of-peace-officer offences). The weights here are representative blends reflecting the typical distribution of sub-offences within each TPS category for FY2023; for an exact reproduction of Statistics Canada’s CSI for the City of Toronto, one must work directly from the CCJS UCR microdata (which is not in TPS open data).

  2. The weights below are pinned to the last published Statistics Canada methodology update (“Reweighting the Crime Severity Index”, Catalogue 85-004-X) and the Toronto-specific override tables in CCJS Annual Statistics 2023. Newer weight revisions (StatsCan revises every 5 years) may shift values by 5–15% without changing relative ordering. Override via the weights keyword argument when calling functions in this module.

  3. Statistics Canada itself reports two CSI variants: “Total CSI” and “Violent CSI”. Functions here default to the Total CSI weights but accept a variant='violent' argument to use violent-only weights, where non-violent categories (B&E, theft) are zeroed.

References

Wallace, M., Turner, J., Babyak, C., & Matarazzo, A. (2009).

Measuring Crime in Canada: Introducing the Crime Severity Index and Improvements to the Uniform Crime Reporting Survey. Statistics Canada Catalogue 85-004-X.

Statistics Canada (2024). Crime Severity Index, Census

Metropolitan Areas, 2023. Catalogue 35-10-0190-01.

morie.tps_csi.csi_weight(category, *, variant='total', weights=None)[source]

Return the CSI weight for a TPS open-data category.

Parameters:
  • category (str) – TPS category name (e.g. “Assault”, “Homicides”).

  • variant ({"total", "violent"}) – Which CSI variant to use.

  • weights (dict[str, float] | None) – Optional override. When provided, takes precedence.

Return type:

float

morie.tps_csi.csi_per_year(counts_per_year, *, variant='total', weights=None, population=None, per_capita_unit=100000, rebase_to_year=None, rebase_to_value=100.0)[source]

Compute Toronto’s CSI per fiscal year from per-category counts.

Returns a DataFrame indexed by year with columns:

raw_weighted_sum – Σ w_c · n_{c,year}

total_count – Σ n_{c,year}

population – Toronto population that year

csi_per_capita – raw_weighted_sum / population × per_capita_unit

simple_count_rate – total_count / population × per_capita_unit (for comparison)

counts_per_year may be either a dict-of-dicts or a long-format DataFrame with columns year, category, count.

Parameters:
Return type:

DataFrame
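The per-capita arithmetic behind csi_per_capita can be sketched for a single year (an illustrative helper, not the morie API; the weight and population values below are hypothetical stand-ins, not official StatsCan figures):

```python
def csi_per_capita(counts, weights, population, per_capita_unit=100_000):
    """Single-year CSI: (Σ w_c · n_c) / population × per_capita_unit.

    counts / weights: {category: value}. The weights passed in here
    are caller-supplied stand-ins, not official StatsCan values.
    """
    raw_weighted_sum = sum(weights.get(c, 0.0) * n for c, n in counts.items())
    return raw_weighted_sum / population * per_capita_unit

# Hypothetical illustration: 1,000 assaults (weight 50) and
# 10 homicides (weight 7,000) in a city of 3 million.
value = csi_per_capita(
    {"Assault": 1000, "Homicides": 10},
    {"Assault": 50.0, "Homicides": 7000.0},
    population=3_000_000,
)
```

The heavy homicide weight means 10 homicides contribute more to the index than 1,000 assaults, which is exactly the severity-weighting rationale described above.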

morie.tps_csi.csi_per_neighbourhood(counts_per_hood, *, variant='total', weights=None)[source]

CSI per neighbourhood (HOOD_158).

Mirrors csi_per_year() but groups by neighbourhood rather than fiscal year. Population is not divided here because TPS open data does not ship a per-neighbourhood population table; users are expected to merge in the City of Toronto Open Data NeighbourhoodCrimeRates per-neighbourhood population for per-capita rates. Returns the un-normalised weighted sum + total count per neighbourhood.

Parameters:
Return type:

DataFrame

morie.tps_csi.analyze_csi_from_tps_dataframes(dfs, *, year_col='OCC_YEAR', hood_col='HOOD_158', variant='total')[source]

High-level orchestration: take {category: tps_df} of the 9 TPS open-data feeds and return per-year + per-ward CSI.

Parameters:
  • dfs (dict[str, DataFrame]) – Keys must be a subset of CSI_CATEGORIES. Values are the TPS open-data CSV-loaded DataFrames (one row per incident).

  • year_col (str)

  • hood_col (str)

  • variant (Literal['total', 'violent'])

Return type:

RichResult

morie.tps_temporal – temporal analyses for TPS crime data.

  • year_over_year_trend: linear regression on yearly counts

  • seasonal_pattern: monthly / DOW / hour-of-day cyclic stats

  • arima_forecast: simple ARIMA(1,1,1) on monthly counts

  • changepoint_detection: PELT-style changepoint on yearly counts

All emit RichResult.

morie.tps_temporal.year_over_year_trend(df, *, year_col='OCC_YEAR', ds_name='?')[source]

Linear regression of incident counts vs year.

Parameters:
Return type:

RichResult

morie.tps_temporal.seasonal_pattern(df, *, ds_name='?')[source]

Month / day-of-week / hour-of-day cyclic patterns + chi-square test of uniformity.

Parameters:
Return type:

RichResult
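The uniformity test behind these cyclic checks reduces to a chi-square statistic against equal expected counts per bin; a minimal sketch (illustrative only — the real function also reports the month/DOW/hour tables):

```python
def chi2_uniformity(observed):
    """Chi-square statistic of observed bin counts against a uniform
    expectation (sum / k per bin). Compare against the chi-square
    distribution with k - 1 degrees of freedom for a p-value.
    """
    k = len(observed)
    expected = sum(observed) / k
    return sum((o - expected) ** 2 / expected for o in observed)
```

A perfectly flat histogram gives 0; larger values indicate stronger seasonality.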

morie.tps_temporal.changepoint_detection(df, *, year_col='OCC_YEAR', ds_name='?')[source]

Simple Pettitt-style change-point on yearly counts.

Parameters:
Return type:

RichResult
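The Pettitt-style statistic can be sketched directly from its definition, U_t = Σ_{i≤t} Σ_{j>t} sign(x_j − x_i), taking the split that maximises |U_t| (an O(n²) illustrative version; the morie implementation may differ):

```python
def pettitt_changepoint(x):
    """Return (t, U_t) for the split maximising |U_t|, where
    U_t = sum over i <= t < j of sign(x[j] - x[i]).
    t is the size of the pre-change segment."""
    def sign(v):
        return (v > 0) - (v < 0)
    n = len(x)
    best_t, best_u = 0, 0
    for t in range(1, n):
        u = sum(sign(x[j] - x[i]) for i in range(t) for j in range(t, n))
        if abs(u) > abs(best_u):
            best_t, best_u = t, u
    return best_t, best_u
```

For a series that jumps from a low level to a high level, the maximising split falls exactly at the jump.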

morie.tps_temporal.arima_forecast(df, *, h=12, ds_name='?')[source]

ARIMA(1,1,1) on monthly counts; forecast h periods ahead.

Parameters:
Return type:

RichResult

morie.tps_spatial – spatial analyses for TPS crime data.

Wires the existing morie.fn spatial primitives to TPS data:

  • Moran’s I (global) on neighbourhood-level counts

  • LISA (Local Moran’s I) for hot/cold spots

  • Ripley’s K for point-pattern clustering

  • Kernel density estimation

  • Gi*/Getis-Ord hotspot detection

All emit RichResult.

morie.tps_spatial.morans_i_neighbourhood(df, *, hood_col='HOOD_158', ds_name='?', k_neighbours=5)[source]

Compute global Moran’s I on neighbourhood-level incident counts.

Builds a k-NN spatial weights matrix from neighbourhood centroids (mean LAT/LONG of incidents in each hood as a proxy centroid), then runs Moran’s I on the count vector.

Parameters:
Return type:

RichResult

morie.tps_spatial.local_morans_i(df, *, hood_col='HOOD_158', ds_name='?', k_neighbours=5, top_n=20)[source]

LISA – local Moran’s Ii per neighbourhood, with hot/cold spot classification.

Parameters:
Return type:

RichResult

morie.tps_spatial.kde_density(df, *, bandwidth=0.005, ds_name='?')[source]

Kernel density estimate of incident lat/long.

Returns summary stats; not the full grid (too big for a RichResult).

Parameters:
Return type:

RichResult

morie.tps_spatial_advanced – heavyweight spatial statistics for TPS.

Builds on morie.tps_spatial (Moran’s I global, LISA, KDE) with:

  • ripley_k point-pattern clustering at multiple radii

  • getis_ord_g_star local hot/cold-spot z-scores (Gi*)

  • dbscan_clusters density-based clusters on lat/lon

  • kernel_intensity_at_centroids smoothed counts at neighbourhood centroids

  • polygon_morans_i queen-contiguity Moran’s I from polygons

Polygon contiguity uses the actual GeoJSON polygons from morie.tps_io.load_tps('NeighbourhoodCrimeRates', format='geojson'), not the centroid-only k-NN approximation in tps_spatial.

morie.tps_spatial_advanced.ripley_k(df, *, ds_name='?', radii_km=None, max_n=5000)[source]

Ripley’s K function – for each radius r, expected number of other points within r of a typical point, normalised by intensity.

For a homogeneous Poisson process K(r) = π r²; values above this indicate clustering, below indicate regularity.

Parameters:
Return type:

RichResult
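The naive (edge-correction-free) K estimate is K̂(r) = (A / n²) · Σ_{i≠j} 1[d_ij ≤ r]; a minimal sketch on planar coordinates (illustrative only — the real function works on projected lat/long and multiple radii in km):

```python
import math

def ripley_k(points, radii, area):
    """Naive Ripley's K: K(r) = (area / n**2) * count of ordered
    pairs (i, j), i != j, with distance <= r. No edge correction.
    Compare against pi * r**2 (homogeneous Poisson baseline):
    above = clustering, below = regularity."""
    n = len(points)
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j
    ]
    return [area / n ** 2 * sum(d <= r for d in dists) for r in radii]
```

With two points 0.5 apart in a unit-area window, K is 0 below that separation and 0.5 once the radius covers it.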

morie.tps_spatial_advanced.getis_ord_g_star(df, *, ds_name='?', hood_col='HOOD_158', k_neighbours=5, top_n=20)[source]

Local Getis-Ord Gi* statistic per neighbourhood.

z-score interpretation:

Gi* > 1.96 – significant hotspot at α=0.05 (high values surrounded by other high values)

Gi* < −1.96 – significant cold spot

Parameters:
Return type:

RichResult

morie.tps_spatial_advanced.dbscan_clusters(df, *, ds_name='?', eps_km=0.25, min_samples=30, max_n=30000)[source]

DBSCAN on point coordinates. Returns cluster summary.

Parameters:
Return type:

RichResult

morie.tps_spatial_advanced.polygon_morans_i(*, ds_name='NeighbourhoodCrimeRates', value_col_prefix='ASSAULT_RATE', year=2024, k_neighbours=5)[source]

Polygon-aware Moran’s I using actual neighbourhood polygons from the GeoJSON export of NeighbourhoodCrimeRates, with a per-year crime-rate value column.

Parameters:
  • ds_name (str)

  • value_col_prefix (str)

  • year (int)

  • k_neighbours (int)

Return type:

RichResult

morie.tps_spatial_advanced.bivariate_moran(*, ds_name='NeighbourhoodCrimeRates', x_col_prefix='ASSAULT_RATE', y_col_prefix='HOMICIDE_RATE', year=2024, k_neighbours=5)[source]

Bivariate Moran’s I between two crime-rate columns.

Generalises polygon_morans_i to two attributes: measures the cross-correlation between attribute X at location i and attribute Y at neighbouring locations j. Polygon centroids + k-nearest-neighbour weights, same construction as polygon_morans_i.

\[I_{xy} = \frac{n}{S_0}\, \frac{\sum_i \sum_j w_{ij}\, z^x_i\, z^y_j} {\sqrt{\sum_i (z^x_i)^2 \cdot \sum_i (z^y_i)^2}}\]

where \(z^x\) and \(z^y\) are z-scored attributes.

Parameters:
  • ds_name (str)

  • x_col_prefix (str)

  • y_col_prefix (str)

  • year (int)

  • k_neighbours (int)

Return type:

RichResult
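The formula above can be evaluated directly on small vectors (an illustrative re-implementation with a plain list-of-lists weight matrix, not the polygon/k-NN pipeline the real function builds):

```python
import math

def bivariate_morans_i(x, y, w):
    """I_xy = (n / S0) * sum_ij w_ij z^x_i z^y_j
                       / sqrt(sum_i (z^x_i)^2 * sum_i (z^y_i)^2),
    matching the docstring formula. w: n x n weights, zero diagonal."""
    def zscore(v):
        mu = sum(v) / len(v)
        sd = math.sqrt(sum((u - mu) ** 2 for u in v) / len(v))
        return [(u - mu) / sd for u in v]
    zx, zy = zscore(x), zscore(y)
    n = len(x)
    s0 = sum(map(sum, w))
    num = sum(w[i][j] * zx[i] * zy[j] for i in range(n) for j in range(n))
    den = math.sqrt(sum(v * v for v in zx) * sum(v * v for v in zy))
    return n / s0 * num / den
```

With two mutually-neighbouring locations, identical attributes give −1 (each site's neighbour has the opposite z-score) and reversed attributes give +1, which is a handy sanity check on the sign convention.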

morie.tps_spatial_advanced.moran_sweep_heatmap(*, ds_name='NeighbourhoodCrimeRates', category_prefixes=None, years=None, k_neighbours=5)[source]

Run univariate Moran’s I across (category x year) and return the resulting matrix as a RichResult payload.

Parameters:
  • category_prefixes (tuple of str, optional) – Crime-rate column prefixes to sweep. Defaults to the 9 published TPS categories.

  • years (tuple of int, optional) – Years to sweep. Defaults to (2014, 2015, …, 2024).

  • ds_name (str)

  • k_neighbours (int)

Return type:

RichResult

morie.tps_stochastic – stochastic-physics-of-crime analyses on TPS.

Modelled on Frenkel-style crime-physics (Hawkes self-exciting + Fokker-Planck density evolution + Langevin SDE), with practical forecasting baselines (SARIMA, XGBoost-when-available) for hold-out comparison.

Functions:

  • hawkes_temporal_fit(df) – fit μ, κ, ω of self-exciting kernel, compute branching ratio + AIC/BIC

  • sarima_forecast(df, h=12) – seasonal ARIMA, train/test MAPE

  • prophet_forecast(df, h=12) – Prophet trend+seasonality if installed

  • langevin_simulate(…) – Euler-Maruyama Ornstein-Uhlenbeck path

  • fokker_planck_grid(df) – finite-difference density evolution

  • model_compare(df) – AIC across baselines + SDE

All emit RichResult; figure_path payload entry written when matplotlib is available (saved as PNG under data/manifest/outputs/figures/tps_stochastic/).

References (cited in description blocks):

  • Mohler et al. 2011, “Self-exciting point process modeling of crime.”

  • Pitcher 2010, “Adding integrated nested Laplace approximations to the Bayesian forecasting toolkit.”

  • Stochastic physics of crime literature (Short, D’Orsogna, Bertozzi 2010).

morie.tps_stochastic.hawkes_temporal_fit(df, *, ds_name='?', max_n=5000)[source]

Fit a temporal-only exponential Hawkes process to incident times.

Returns μ (background rate), κ (branching ratio), ω (decay), AIC/BIC.

Parameters:
Return type:

RichResult

morie.tps_stochastic.sarima_forecast(df, *, ds_name='?', h=12, order=(1, 1, 1), seasonal=(0, 1, 1, 12))[source]

Seasonal ARIMA forecast on monthly incident counts.

Parameters:
Return type:

RichResult

morie.tps_stochastic.langevin_simulate(df, *, ds_name='?', n_paths=100, T_days=365, dt=1.0, seed=42)[source]

Euler-Maruyama simulation of an Ornstein-Uhlenbeck process fitted to incident counts:

dX_t = θ(μ − X_t) dt + σ dW_t

Fits θ, μ, σ via OLS on first-differences, then generates n_paths forward simulations of length T_days.

Parameters:
Return type:

RichResult
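The Euler-Maruyama discretisation of the quoted SDE is X_{k+1} = X_k + θ(μ − X_k)Δt + σ√Δt·Z_k with Z_k ~ N(0, 1); a single-path sketch (illustrative only — the real function first fits θ, μ, σ by OLS and simulates many paths):

```python
import math
import random

def ou_path(theta, mu, sigma, x0, n_steps, dt=1.0, seed=42):
    """One Euler-Maruyama path of dX_t = theta*(mu - X_t) dt + sigma dW_t."""
    rng = random.Random(seed)
    x = [x0]
    for _ in range(n_steps):
        drift = theta * (mu - x[-1]) * dt
        shock = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x.append(x[-1] + drift + shock)
    return x
```

Setting sigma=0 exposes the deterministic mean-reversion: each step closes a fraction θ·Δt of the gap to μ.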

morie.tps_stochastic.fokker_planck_grid(df, *, ds_name='?', n_grid=64, n_steps=200)[source]

1-D Fokker-Planck density evolution of daily incident counts with the OU drift+diffusion fitted in langevin_simulate.

Uses Crank-Nicolson finite-difference scheme with reflective boundaries on a grid spanning [0, max(observed)*1.5].

Parameters:
Return type:

RichResult

morie.tps_stochastic.analyze(name='Assault', *, sample_rows=50000)[source]

Run the four stochastic-physics analyses on one TPS category.

Parameters:
  • name (str)

  • sample_rows (int | None)

Return type:

dict[str, RichResult]

morie.tps_hawkes_advanced – non-stationary Hawkes with non-exponential kernels.

Implements the Kwan-Chen-Dunsmuir (2024, arXiv:2408.09710v1) methodology for Hawkes process likelihood inference when the baseline intensity is time-varying and the excitation kernel is non-exponential (so the intensity process is non-Markovian).

Companion to morie.tps_stochastic – that module contains the classical exponential-kernel constant-baseline (Markovian / Mohler 2011) fit; this module adds:

  • Gamma, Weibull, and power-law (Lomax) excitation kernels.

  • Sinusoidal-with-trend time-varying baseline ν(t).

  • Time-rescaling residuals (Brown et al. 2002) for goodness-of-fit via Kolmogorov-Smirnov + Q-Q plots.

  • Eight-way model comparison via AIC and an explicit Markovian-vs-non-Markovian likelihood-ratio surface.

The complete intensity function is

λ(t) = ν(t) + ∫_0^{t-} g(t - s) dN_s

with kernel decomposition g(u) = η · g̃(u) where η ∈ (0, 1) is the branching ratio (mean offspring per event) and g̃ is a probability density on [0, ∞). Stationarity requires η < 1.
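For the exponential-kernel special case the intensity can be evaluated directly (an illustrative sketch following the decomposition above — constant baseline ν, branching ratio η, density g̃(u) = ω·exp(−ωu); this helper is not part of the morie API):

```python
import math

def hawkes_intensity(t, events, nu, eta, omega):
    """lambda(t) = nu + eta * sum over t_i < t of omega*exp(-omega*(t - t_i)).

    events: past event times; only strictly-earlier events excite.
    Stationarity requires eta < 1 (mean offspring per event)."""
    return nu + eta * sum(
        omega * math.exp(-omega * (t - ti)) for ti in events if ti < t
    )
```

With no past events the intensity is just the baseline ν; each event adds an exponentially decaying bump of total mass η.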

morie.tps_hawkes_advanced.fit_hawkes_general(t, T, kernel_kind='exponential', baseline_kind='constant')[source]

MLE of a non-stationary Hawkes process.

Returns a dict with theta, nll, aic, bic, branching_ratio, baseline_params, kernel_params, and the time-rescaling KS statistic.

Parameters:
  • t (ndarray)

  • T (float)

  • kernel_kind (Literal['exponential', 'gamma', 'weibull', 'lomax'])

  • baseline_kind (Literal['constant', 'sinusoidal'])

Return type:

dict

morie.tps_hawkes_advanced.hawkes_advanced_fit(df, *, kernel='gamma', baseline='sinusoidal', ds_name='?', max_n=5000)[source]

Fit a single (kernel, baseline) combination with figures.

A sibling to morie.tps_stochastic.hawkes_temporal_fit which is locked to the (exponential, constant) Markovian special case.

Parameters:
  • df (DataFrame)

  • kernel (Literal['exponential', 'gamma', 'weibull', 'lomax'])

  • baseline (Literal['constant', 'sinusoidal'])

  • ds_name (str)

  • max_n (int)

Return type:

RichResult

morie.tps_hawkes_advanced.compare_hawkes_kernels(df, *, ds_name='?', max_n=4000, baselines=('constant', 'sinusoidal'), kernels=('exponential', 'gamma', 'weibull', 'lomax'))[source]

Fit every (kernel, baseline) combination and rank by AIC.

Mirrors Section 5 (numerical examples) of Kwan-Chen-Dunsmuir 2024: the Markovian classical Hawkes is the (exponential, constant) row; the non-Markovian non-stationary models are everything else.

Parameters:
Return type:

RichResult

morie.tps_hawkes_advanced.hawkes_markovian_vs_nonmarkovian(df, *, ds_name='?', max_n=4000)[source]

Focused 2-way comparison: classical exp/const vs gamma/sinusoidal.

The two endpoints of the Kwan-Chen-Dunsmuir framework – quickest to run on the dashboard.

Parameters:
Return type:

RichResult

morie.tps_statphysics – Statistical physics of crime applied to TPS.

Implements the 4 canonical methods reviewed by D’Orsogna & Perc, “Statistical physics of crime: A review”, Phys. Life Rev. 12 (2015) 1-21 (arxiv:1411.1743). Each function consumes one TPS category and returns a RichResult with a paragraph-level interpretation.

  1. sdb_reaction_diffusion(category, …) – Short-D’Orsogna-Brantingham 2008 hot-spot PDE on a Toronto grid:

∂A/∂t = η ∇²A − ω A + θ ρ

∂ρ/∂t = ∇·(D ∇ρ − 2 ρ ∇log A) − ρ A + γ

A is the attractiveness field; ρ is the criminal density. Crime incidents are absorbed into A, and ρ flows up the gradient of A. Stable spots emerge when (η, ω, θ, D, γ) place the system in the localised-spike regime. We fit the steady-state spike count to the observed cluster count from DBSCAN.

2. ``levy_flight_alpha(category)`` -- Brockmann-Hufnagel-Geisel 2006

Lévy-flight diagnostic. We compute step lengths between chronologically-consecutive incidents and fit a power-law tail p(ℓ) ∝ ℓ^{−α} via Hill-MLE on the upper-tail.

3. ``urban_scaling_beta(category)`` -- Bettencourt et al. 2007 / 2010

urban scaling: log(crime_i) = log(Y₀) + β · log(pop_i) + ε across the 158 Toronto neighbourhoods. β > 1 is super-linear (crime grows faster than population), β = 1 linear, β < 1 sub-linear.

4. ``lotka_volterra_police_crime(category)`` -- Lotka-Volterra

predator-prey on a coarse-grained yearly time series:

dx/dt = α x − β x y   (crime: prey)

dy/dt = δ x y − γ y   (police effort: predator)

Fit (α, β, γ, δ) by least-squares against TPS counts (proxy x) and normalised mass-stop / arrest counts (proxy y if available, else one-quarter-of-incidents as a placeholder).

References

  • D’Orsogna & Perc (2015) §2.1, §3.2, §4.1

  • Short, D’Orsogna, Brantingham et al. (2008) “Statistical model of criminal behavior”, M3AS 18.

  • Brockmann, Hufnagel, Geisel (2006) “Scaling laws of human travel”, Nature 439.

  • Bettencourt, Lobo, Helbing, Kühnert, West (2007) “Growth, innovation, scaling, and the pace of life in cities”, PNAS 104.

morie.tps_statphysics.sdb_reaction_diffusion(category='Assault', *, sample_rows=30000, eta=0.05, omega=0.3, theta=1.5, D=0.1, gamma=0.05, n_steps=800, dt=0.04, nx=90, ny=60, save_fig=True)[source]

Solve the Short-D’Orsogna-Brantingham 2008 reaction-diffusion crime hot-spot PDE on a Toronto grid, seeded by observed incidents.

Returns a RichResult with the steady-state hot-spot count, mean attractiveness, mean criminal density, and a comparison to DBSCAN.

Parameters:
Return type:

RichResult

morie.tps_statphysics.levy_flight_alpha(category='Assault', *, sample_rows=30000, lmin_km=0.5, save_fig=True)[source]

Hill-MLE Lévy exponent on step lengths between consecutive incidents in chronological order (proxy for offender mobility).

Parameters:
Return type:

RichResult
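The Hill step of the diagnostic can be sketched as a standalone function; `hill_alpha` below is illustrative only (not the packaged API), assuming step lengths in km and the same lower cutoff role as `lmin_km`:

```python
import numpy as np

def hill_alpha(steps, lmin=0.5):
    """Hill MLE for the tail exponent of p(l) ~ l^{-alpha}.

    Uses only step lengths strictly above the cutoff lmin; the
    continuous-Pareto MLE is alpha_hat = 1 + m / sum(log(l_i / lmin)).
    """
    tail = np.asarray([s for s in steps if s > lmin], dtype=float)
    if tail.size == 0:
        raise ValueError("no steps above the cutoff")
    return 1.0 + tail.size / np.sum(np.log(tail / lmin))
```

The estimate is sensitive to `lmin`: too low and non-power-law body contaminates the fit, too high and few tail points remain.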

morie.tps_statphysics.urban_scaling_beta(category='Assault', *, year=2024, save_fig=True)[source]

Bettencourt 2007 super-linear urban scaling on the 158 neighbourhoods.

Regresses log(crime per neighbourhood) on log(population per neighbourhood) -> β.

Parameters:
Return type:

RichResult
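The regression itself reduces to an OLS slope on log-log axes. A minimal sketch (the function name and return shape are illustrative, not morie's API):

```python
import numpy as np

def scaling_beta(pop, crime):
    """OLS fit of log(crime) = log(Y0) + beta*log(pop); returns (beta, Y0)."""
    x = np.log(np.asarray(pop, dtype=float))
    y = np.log(np.asarray(crime, dtype=float))
    beta, log_y0 = np.polyfit(x, y, 1)
    return float(beta), float(np.exp(log_y0))
```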

morie.tps_statphysics.lotka_volterra_police_crime(category='Assault', save_fig=True)[source]

Yearly Lotka-Volterra fit: crime as prey, mass-stop / arrest as predator.

With only one observed series we use the empirical x(t) and back out (α, β) from the LV equilibrium x* = γ/δ, y* = α/β under stationarity. For the predator series we use the cumulative incident count as a crude effort-density proxy; the absolute scale is unidentified, but the (α, γ) ratio and the cycle period are.

Parameters:
Return type:

RichResult
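The dynamics and the equilibrium used above can be checked with a one-step Euler integrator; `lv_step` is a sketch, not the fitting code:

```python
def lv_step(x, y, alpha, beta, gamma, delta, dt=0.01):
    """One forward-Euler step of dx/dt = a*x - b*x*y, dy/dt = d*x*y - g*y."""
    dx = alpha * x - beta * x * y
    dy = delta * x * y - gamma * y
    return x + dt * dx, y + dt * dy
```

At the interior equilibrium x* = γ/δ, y* = α/β both derivatives vanish, so the step returns the state unchanged; that fixed point is what the stationarity back-out relies on.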

morie.tps_statphysics.sdb_turing_demo(*, eta=0.2, omega=0.033, theta=0.56, D=30.0, gamma=0.019, n_steps=6000, dt=0.005, n=80, save_fig=True)[source]

Canonical SDB Turing-instability demo on a clean periodic grid.

Reproduces the localized hot-spot lattice from Short, D’Orsogna & Brantingham (2008) Fig. 4 / D’Orsogna & Perc (2015) Fig. 5 – which is what the author’s reference image #3 / #4 shows. Periodic boundaries, homogeneous initial state + Gaussian noise, runs to steady state. Output: 1×3 panel (early / mid / late).

Parameters:
Return type:

RichResult

morie.tps_statphysics.inspection_game_phase(n_temptations=20, n_costs=20, n_steps=600, save_fig=True)[source]

Helbing-Szolnoki-Perc (2010) cooperator-defector-inspector replicator dynamics; produces the (Temptation T, Inspection cost γ) crime-rate phase diagram from D’Orsogna & Perc 2015 §5 / Image #5.

Three strategies (pure populations only):

C – cooperator

P – defector / ‘predator’

O – punisher / inspector

Replicator dynamics with payoff matrix that includes punishment of P by O at cost γ. Steady-state defector fraction -> ‘crime rate’.

Parameters:
  • n_temptations (int)

  • n_costs (int)

  • n_steps (int)

  • save_fig (bool)

Return type:

RichResult
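The replicator-dynamics core can be sketched generically; `replicator_steady` and the payoff matrix in the test are illustrative (the actual Helbing-Szolnoki-Perc payoff structure with punishment cost γ is built inside the module):

```python
import numpy as np

def replicator_steady(payoff, x0=None, n_steps=600, dt=0.1):
    """Discrete replicator dynamics x_i <- x_i + dt*x_i*(f_i - fbar),
    with fitness f = A @ x and mean fitness fbar = x . f.
    Returns the strategy-frequency vector after n_steps."""
    A = np.asarray(payoff, dtype=float)
    k = A.shape[0]
    x = np.full(k, 1.0 / k) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        f = A @ x
        x = x + dt * x * (f - x @ f)
        x = np.clip(x, 0.0, None)  # guard tiny negative drift
        x = x / x.sum()
    return x
```

The steady-state defector frequency from such an iteration is what the phase diagram plots over the (T, γ) grid.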

morie.tps_statphysics.criminal_network_graph(category='Assault', *, sample_rows=30000, top_n_premises=20, save_fig=True)[source]

Co-occurrence network: nodes = top-N premise types, edges = sharing a common neighbourhood, weighted by joint incident count.

Replaces the criminal-role graph in D’Orsogna & Perc Fig. 9 / Diviák et al. (2019) – for our public TPS data we don’t have co-offender records, so we build the closest-meaning network from PREMISES_TYPE × HOOD_158 co-incidence.

Parameters:
  • category (str)

  • sample_rows (int | None)

  • top_n_premises (int)

  • save_fig (bool)

Return type:

RichResult

morie.tps_statphysics.analyze_all(categories=None, save_fig=True)[source]

Run all 4 stat-phys analyses on each category.

Parameters:
Return type:

dict[str, dict[str, RichResult]]

morie.tps_render – clean Toronto map rendering.

Two design rules per the author, 2026-05-07:

  1. NO floating neighbourhood text labels on the map (no “Downtown Yonge-East”, “Yonge-Bay Corridor”, “Mimico Queensway”, …). Hot-spot identification is delivered via the RichResult tables, not via on-canvas text.

  2. Map is rotated ~17° CLOCKWISE in projected space so Lake Ontario’s shoreline sits level horizontally – matching the Sigar Li 2022 “Hotspot Policing for the City of Toronto” poster aesthetic and the Hohl 2024 ALMI homicide-cluster map.

Rotation works in cos-latitude-corrected metres (not raw degrees), so shapes stay metric-true regardless of where in Toronto’s bbox they sit.

Public API

project_xy(lat, lon) – degrees -> rotated planar metres (km)

render_choropleth(rate_col, year, ...) – polygon choropleth

render_dbscan(category, ...) – point-pattern + clusters

render_yearly_grid(prefix, years, ...) – small-multiples

morie.tps_render.pretty_label(s)[source]

ASSAULT_RATE_2024 -> Assault rate · 2024. Strips underscores and normalises casing so titles & legend labels read like prose.

Parameters:

s (str)

Return type:

str

morie.tps_render.project_xy(lat, lon, *, rot_deg_cw=17.5, lat_c=43.7, lon_c=-79.4)[source]

Project (lat, lon) -> (x_km, y_km) with a clockwise rotation.

Equirectangular projection centred at (lat_c, lon_c), then rotated rot_deg_cw degrees clockwise. Returns kilometres east-of-centre (after rotation) and kilometres north-of-centre (after rotation).

Clockwise convention: positive rot_deg_cw rotates the map so a line that previously sloped up-right slopes less (or down-right).

Parameters:
Return type:

tuple[ndarray, ndarray]
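The projection described above can be sketched as follows; `project_xy_sketch` is an illustrative re-derivation (equirectangular with cos-latitude correction, then a clockwise rotation), not the packaged implementation:

```python
import numpy as np

R_EARTH_KM = 6371.0  # mean Earth radius; the module's constant may differ

def project_xy_sketch(lat, lon, rot_deg_cw=17.5, lat_c=43.7, lon_c=-79.4):
    """(lat, lon) degrees -> (x_km, y_km) east/north of centre, rotated."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # cos-latitude correction keeps east-west distances metric-true
    x = np.radians(lon - lon_c) * np.cos(np.radians(lat_c)) * R_EARTH_KM
    y = np.radians(lat - lat_c) * R_EARTH_KM
    th = np.radians(rot_deg_cw)
    # clockwise rotation in the planar x-y frame
    xr = x * np.cos(th) + y * np.sin(th)
    yr = -x * np.sin(th) + y * np.cos(th)
    return xr, yr
```

Because the rotation is applied in metres rather than degrees, distances between projected points are preserved anywhere in the bbox.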

morie.tps_render.render_choropleth(*, rate_col='ASSAULT_RATE_2024', title=None, cmap=None, outfile=None, fig_h=7.0, fig_w=12.0, show_ids=True, border_color='#1a1a1a', border_lw=0.7)[source]

Polygon choropleth using NeighbourhoodCrimeRates GeoJSON polys.

Style matches the Hohl 2024 ALMI / Sigar Li 2022 Toronto poster:

  • thick black polygon borders (border_color/border_lw)

  • sequential cmap chosen per category (default Reds family)

  • small numeric polygon-ID labels (show_ids=True) – IDs only, never neighbourhood names per the author 2026-05-07

  • title and colour-bar legend with proper spaces (no underscores)

Parameters:
Return type:

Path

morie.tps_render.render_quad(category='Homicides', *, year=2024, outfile=None, fig_h=9.0, fig_w=13.0, sample_rows=30000, border_color='#1a1a1a', border_lw=0.6)[source]

Hohl 2024 ALMI-style 4-panel quad for any TPS category.

Panels (a-d):
  1. Kernel density (incidents per sq km) – sequential

  2. Per-100k rate from NeighbourhoodCrimeRates – sequential

  3. Spatial clusters / outliers (LISA HH/LL/HL/LH) – diverging

  4. Significant Gi* hot/cold spots – diverging

Parameters:
Return type:

Path

morie.tps_render.render_dbscan(category='Assault', *, eps_km=0.3, min_samples=20, sample_rows=30000, outfile=None, show_top=12, fig_h=7.5, fig_w=12.0)[source]

Point-pattern map: incidents + DBSCAN cluster IDs.

No floating hood labels; clusters are shown by colour with a compact legend. Aspect/rotation matches the rest of the suite.

Parameters:
Return type:

Path

morie.tps_render.render_district_proportional(metric='ASSAULT_2024', *, title=None, outfile=None, fig_h=7.0, fig_w=12.5, border_color='#1a1a1a', border_lw=0.55)[source]

District-coloured map with proportional-symbol black circles.

Inspired by the author’s reference (Toronto news-coverage proportional-symbol map). Toronto’s 158 neighbourhoods are coloured by former-borough district (Etobicoke / North York / Scarborough / York / Old Toronto / East York), with a black circle drawn at each neighbourhood centroid sized proportional to metric (any column from NeighbourhoodCrimeRates).

Includes a “Map Divisions” legend with district swatches and a “Number of <metric>” graduated-symbol legend with 5 size tiers.

Parameters:
Return type:

Path

morie.tps_render.render_satscan_panel(category='Homicides', *, sample_rows=30000, n_clusters=2, n_mc=199, year_window_yrs=4, outfile=None, fig_h=7.5, fig_w=11.5, border_color='#1a1a1a', border_lw=0.55)[source]

Hohl 2024 ALMI “panel d” – Kulldorff-style space-time scan statistic with significant clusters as pink filled circles, red dots inside for sig. locations, yellow polygons for sig. wards, and an info-box (n_cases / expected / RR / p-value) on the right.

Implementation is a simplified Kulldorff Poisson scan over a grid of candidate (x, y, r, t-window) cells: for each cell compute a log-likelihood ratio assuming Poisson(λ_in) inside vs Poisson(λ_out) outside; null distribution by Monte-Carlo (n_mc permutations of incident timestamps within the bbox).

Parameters:
Return type:

Path

morie.tps_render.render_yearly_grid(*, prefix='ASSAULT_RATE', years=range(2014, 2025), cmap='Reds', outfile=None, ncols=4)[source]

Small-multiples panel of per-year choropleths, no hood labels.

Parameters:
Return type:

Path

morie.tps_all_analyze – RichResult-emitting analyses for the 13 TPS crime datasets.

Surfaces:

analyze_<name>(df=None) -> RichResult       # universal per-dataset

spatial_summary(df) -> RichResult           # neighbourhood + lat/lon

temporal_summary(df) -> RichResult          # year/month/dow/hour

crime_compare(*dfs, names=…) -> RichResult  # cross-category

analyze_all(out_dir=None) -> dict

Output goes to data/manifest/outputs/tps/.

Each TPS CSV has the same standard schema (Toronto Open Data convention), so a single set of helpers handles the full 13-dataset matrix.

morie.tps_all_analyze.temporal_summary(df, *, ds_name='?')[source]

Year/month/dow/hour rollups.

Parameters:
Return type:

RichResult

morie.tps_all_analyze.spatial_summary(df, *, ds_name='?')[source]

Neighbourhood + division + premises rollups, plus lat/long bounding-box summary.

Parameters:
Return type:

RichResult

morie.tps_all_analyze.offence_summary(df, *, ds_name='?')[source]

OFFENCE / UCR / CSI_CATEGORY rollups.

Parameters:
Return type:

RichResult

morie.tps_all_analyze.gini_concentration(values)[source]

Gini coefficient of crime concentration across spatial units. G=0 perfectly even, G=1 perfectly concentrated.

Parameters:

values (ndarray)

Return type:

float
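The coefficient can be sketched with the standard sorted-rank formula; this is an illustrative standalone version, not necessarily the module's exact implementation:

```python
import numpy as np

def gini(values):
    """Gini coefficient: 0 = perfectly even, -> 1 = perfectly concentrated.

    Uses the sorted-rank identity G = 2*sum(i*v_i)/(n*sum(v)) - (n+1)/n.
    """
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    if n == 0 or v.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * v) / (n * v.sum()) - (n + 1) / n)
```

Note that with n spatial units the maximum attainable value is (n-1)/n, so for small n a "perfectly concentrated" pattern does not quite reach 1.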

morie.tps_all_analyze.neighbourhood_concentration(df, *, ds_name='?')[source]

How concentrated is crime across neighbourhoods? Uses HOOD_158 (the 158-neighbourhood scheme) by default.

Parameters:
Return type:

RichResult

morie.tps_all_analyze.crime_compare(dfs)[source]

Compare counts and trends across multiple TPS categories. Returns one table per axis (year/division/neighbourhood).

Parameters:

dfs (dict[str, DataFrame])

Return type:

RichResult

morie.tps_all_analyze.analyze_one(name, df=None)[source]

Run the standard TPS analysis bundle on one dataset.

Parameters:
Return type:

RichResult

morie.tps_all_analyze.analyze_assault(df=None)

Analyze TPS Assault.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_autotheft(df=None)

Analyze TPS AutoTheft.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_bicycletheft(df=None)

Analyze TPS BicycleTheft.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_breakandenter(df=None)

Analyze TPS BreakandEnter.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_communitysafetyindicators(df=None)

Analyze TPS CommunitySafetyIndicators.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_hatecrimes(df=None)

Analyze TPS HateCrimes.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_homicides(df=None)

Analyze TPS Homicides.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_intimatepartnerandfamilyviolence(df=None)

Analyze TPS IntimatePartnerAndFamilyViolence.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_neighbourhoodcrimerates(df=None)

Analyze TPS NeighbourhoodCrimeRates.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_robbery(df=None)

Analyze TPS Robbery.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_shootingandfirearmdiscarges(df=None)

Analyze TPS ShootingAndFirearmDiscarges.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_theftfrommovingvehicle(df=None)

Analyze TPS TheftFromMovingVehicle.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_theftover(df=None)

Analyze TPS TheftOver.

Parameters:

df (DataFrame | None)

Return type:

RichResult

morie.tps_all_analyze.analyze_all(out_dir=None, *, sample_rows=50000)[source]

Run full bundle on each of the 13 TPS datasets + cross-compare.

sample_rows caps each load at N rows for fast development; pass None to load everything (Assault alone is 254k rows so the full 13-way load is several hundred MB).

Parameters:
  • out_dir (Path | None)

  • sample_rows (int | None)

Return type:

dict[str, RichResult]

Federal SIU — Sprott / Doob / Iftene replication

morie.sprott_doob – Sprott & Doob (CRIMSL UToronto) SIU analyses.

Replicates the analytical contribution of the four CRIMSL UToronto research reports authored by Prof. Jane B. Sprott (Toronto Metropolitan University, formerly Ryerson) and Prof. Anthony N. Doob (University of Toronto), with Prof. Adelina Iftene (Dalhousie) co-authoring the May 2021 paper on Independent External Decision Makers.

The four reports

  1. Sprott & Doob (Oct 2020), Understanding the Operation of Correctional Service Canada’s Structured Intervention Units: Some Preliminary Findings. First systematic outside analysis of CSC’s SIU data; demonstrated that SIUs were not operating as the legislative framework (Bill C-83) required.

  2. Sprott & Doob (Nov 2020), Is there Clear Evidence that COVID-19 Was the Cause of Problems with the Operation of CSC’s Structured Intervention Units? Tests CSC’s COVID-attribution defense; finds the data did not support it (problems pre-existed COVID).

  3. Sprott & Doob (Feb 2021), Solitary Confinement, Torture, and Canada’s Structured Intervention Units. The most data-intensive of the four; introduces a Mandela-Rules classifier for SIU stays (solitary confinement = ≤15 days at ≤2 hrs out-of-cell; torture = ≥16 days under the same conditions). Contains Tables 13, 19, 23, reproduced below.

  4. Sprott, Doob & Iftene (May 2021), Do Independent External Decision Makers Ensure that Solitary Confinement is no Longer Used in Canada’s Federal Penitentiaries? Evaluates the IEDM review mechanism added by Bill C-83.

Tables hardcoded (from Feb 2021 report)

TABLE13_REGIONAL_RATES_PER_1000

SIU person-stays per 1000 regional penitentiary prisoners, split into short (≤15 days) and long (≥16 days) stays, by CSC region.

TABLE19_MANDELA_CLASSIFICATION

Sprott-Doob Mandela-Rules classification of N=1960 SIU stays: solitary confinement (28.4%, N=556), torture (9.9%, N=195), all-other (61.7%, N=1209).

TABLE23_REGIONAL_TORTURE_RATES

Solitary and torture rates per 1000 prisoners by region. Headline finding: Pacific torture rate (39.1) is 22.6× higher than Ontario’s (1.73).

Functions exposed

analyze_table13_regional_rates() -> RichResult

analyze_table19_mandela_classification() -> RichResult

analyze_table23_regional_torture_rates() -> RichResult

classify_mandela(days_in_siu, hrs_out_of_cell_avg, missed_full_4hrs) -> dict

Apply the Sprott-Doob Mandela-Rules classifier to a single SIU person-stay and return the category.

analyze_full_sprott_doob_feb2021() -> RichResult

Comprehensive replication of the Feb 2021 paper’s main tables.

Citation

Sprott, J. B., & Doob, A. N. (2021, February). Solitary Confinement, Torture, and Canada’s Structured Intervention Units. Centre for Criminology & Sociolegal Studies, U. of Toronto. URL: https://www.crimsl.utoronto.ca/sites/www.crimsl.utoronto.ca/files/TortureSolitarySIUsSprottDoob23Feb2021_0.pdf

Note on author order

Sprott is FIRST author on the public CRIMSL papers; Doob is co-author. This is documented and indexed in morie.siuiap.CRIMSL_REPORTS. The October 2020 report’s Drive filename (“DoobSprott”) had Doob first but the published title page may differ – when in doubt, follow the published front-matter order.

morie.sprott_doob.classify_mandela(days_in_siu, hrs_out_of_cell_avg, missed_full_4hrs_pct_of_days)[source]

Apply the Sprott-Doob Mandela-Rules classifier.

Parameters:
  • days_in_siu (int) – Length of the SIU person-stay in days.

  • hrs_out_of_cell_avg (float) – Average hours out of cell per day during the stay.

  • missed_full_4hrs_pct_of_days (float) – Percent of days during the stay where the inmate did not receive the legislatively-required 4 hours out of cell. (Range: 0-100.)

Returns:

dict with keys 'category' (str), 'rule' (str), 'reason' (str).

Categories

  • “Solitary Confinement” – Mandela Rule 44: ≤2 hrs out of cell, all days missed the full 4 hrs, stay length ≤ 15 days.

  • “Torture” – Mandela Rules 43+44: same conditions but stay length ≥ 16 days.

  • “All other” – none of the above thresholds met.

Examples

>>> r = classify_mandela(20, 1.5, 100)
>>> r["category"]
'Torture'
>>> r = classify_mandela(8, 1.5, 100)
>>> r["category"]
'Solitary Confinement'
>>> r = classify_mandela(20, 5, 50)
>>> r["category"]
'All other'

Return type:

dict
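The threshold logic reads directly off the rules above. The sketch below is illustrative: the packaged classifier also fills a 'reason' string and may treat boundary cases differently.

```python
def classify_mandela_sketch(days_in_siu, hrs_out_of_cell_avg,
                            missed_full_4hrs_pct_of_days):
    """Sprott-Doob Mandela-Rules thresholds (sketch).

    Solitary-confinement conditions: averaged <= 2 hrs out of cell and
    missed the full 4 hrs on every day (pct = 100)."""
    conditions = (hrs_out_of_cell_avg <= 2
                  and missed_full_4hrs_pct_of_days >= 100)
    if conditions and days_in_siu <= 15:
        return {"category": "Solitary Confinement", "rule": "Mandela Rule 44"}
    if conditions and days_in_siu >= 16:
        return {"category": "Torture", "rule": "Mandela Rules 43+44"}
    return {"category": "All other", "rule": ""}
```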

morie.sprott_doob.analyze_table13_regional_rates()[source]

Sprott-Doob Table 13: SIU person-stay rates per 1000 prisoners.

Return type:

RichResult

morie.sprott_doob.analyze_table19_mandela_classification()[source]

Sprott-Doob Table 19: Mandela-Rules classification of N=1960 SIU stays.

Return type:

RichResult

morie.sprott_doob.analyze_table23_regional_torture_rates()[source]

Sprott-Doob Table 23: regional torture/solitary rates per 1000.

Return type:

RichResult

morie.sprott_doob.analyze_iedm_table1_population()[source]

Sprott, Doob & Iftene (May 2021) Table 1: IEDM-reviewed population.

Return type:

RichResult

morie.sprott_doob.analyze_iedm_review_outcomes()[source]

Sprott, Doob & Iftene May 2021: IEDM review outcomes & disparities.

Return type:

RichResult

morie.sprott_doob.analyze_table4_length_of_stay()[source]

Sprott-Doob (Feb 2021) Table 4: SIU length-of-stay distribution.

Return type:

RichResult

morie.sprott_doob.analyze_table11_region_x_stay_length()[source]

Sprott-Doob (Feb 2021) Table 11: Region × stay length.

Return type:

RichResult

morie.sprott_doob.analyze_table12_regional_overrepresentation()[source]

Sprott-Doob (Feb 2021) Table 12: Regional over-/under-rep in SIU.

Return type:

RichResult

morie.sprott_doob.analyze_table15_region_x_mental_health()[source]

Sprott-Doob (Feb 2021) Table 15: Region × MH-flag.

Return type:

RichResult

morie.sprott_doob.analyze_table22_region_x_mandela()[source]

Sprott-Doob (Feb 2021) Table 22: Region × Mandela groups.

Return type:

RichResult

morie.sprott_doob.analyze_table9_iedm_decisions()[source]

Sprott-Doob-Iftene (May 2021) Table 9: IEDM review outcomes.

Return type:

RichResult

morie.sprott_doob.analyze_table10_per_iedm_variance()[source]

Sprott-Doob-Iftene (May 2021) Table 10: per-IEDM variance.

Return type:

RichResult

morie.sprott_doob.analyze_table15_long_stay_no_iedm()[source]

Sprott-Doob-Iftene (May 2021) Table 15: long-stay no-IEDM cases.

Return type:

RichResult

morie.sprott_doob.verify_chi2(observed)[source]

Recompute Pearson χ² from a 2D contingency table.

Pure-function rebuild of scipy.stats.chi2_contingency without Yates correction. Intended for quick self-checks of the transcribed cell counts against published χ² values.

Parameters:

observed (2D list of int) – Contingency table (rows × cols).

Returns:

dict with keys ‘chi2’, ‘df’, ‘p_value’, ‘expected’, ‘n’.

>>> # 2×2 sanity check (independence)
>>> verify_chi2([[10, 10], [10, 10]])["chi2"]
0.0

Return type:

dict
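The rebuild amounts to a few lines of NumPy. The sketch below mirrors the description (no Yates correction) but omits the p-value, which needs a χ² survival function; names are illustrative:

```python
import numpy as np

def pearson_chi2(observed):
    """Pearson chi-square from a 2D contingency table (no Yates correction)."""
    O = np.asarray(observed, dtype=float)
    row = O.sum(axis=1, keepdims=True)   # (r, 1) row totals
    col = O.sum(axis=0, keepdims=True)   # (1, c) column totals
    n = O.sum()
    E = row @ col / n                    # expected counts under independence
    chi2 = float(((O - E) ** 2 / E).sum())
    df = (O.shape[0] - 1) * (O.shape[1] - 1)
    return {"chi2": chi2, "df": df, "expected": E, "n": int(n)}
```

Comparing `chi2` against the published value catches transcription errors in the cell counts before any interpretation is attempted.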

morie.sprott_doob.verify_published_chi_squares()[source]

Re-derive every published χ² from its transcribed cell counts.

Cross-checks Sprott-Doob Tables 11, 15, 22 (Feb 2021) and Sprott-Doob-Iftene Tables 5, 7, 8, 10 (May 2021). For each: rebuilds the observed contingency table from the per-row dicts, runs verify_chi2(), and reports recomputed vs. published values.

Return type:

RichResult

morie.sprott_doob.analyze_full_sprott_doob_feb2021()[source]

Comprehensive replication of Sprott & Doob (Feb 2021) main tables.

Return type:

RichResult

morie.siuiap – Structured Intervention Unit Implementation Advisory Panel (federal).

NOTE: this is the FEDERAL Structured Intervention Unit Implementation Advisory Panel module. It is distinct from morie.siu, which is the ONTARIO Special Investigations Unit (police-oversight) automation package. Both share the acronym SIU but refer to entirely different agencies – the rename siu_iap -> siuiap was made to avoid the prefix collision with morie.siu.

The federal counterpart to Ontario’s OTIS provincial restrictive- confinement data is Canada’s federal Structured Intervention Unit (SIU) system, which replaced administrative segregation in federal penitentiaries in 2019. From April 2021 to December 2024, the Structured Intervention Unit Implementation Advisory Panel (SIU IAP) monitored federal SIU implementation, chaired by Howard Sapers with members including Prof. Emeritus Anthony N. Doob and Prof. Jane B. Sprott.

In addition to SIU IAP panel reports, this module also indexes:

  • CRIMSL UToronto research reports (Sprott / Doob / Iftene 2020-2021): four independent academic reports on CSC’s SIU operation, the role of independent external decision makers, the COVID-19 attribution claim, and solitary-confinement-as-torture framing.

  • Doob 2020 Federal Court affidavit (T-539-20): expert evidence on CCRSO 2018 prisoner-flow + age + release statistics, used as the source for the analyses replicated in morie.doob_trends.

This module exposes:

PANEL_MEMBERS – list of SIU IAP members

REPORTS – dict of SIU IAP panel reports (5)

CRIMSL_REPORTS – dict of UToronto CRIMSL research reports (4)

AFFIDAVITS – dict of Federal Court / inquest affidavits

PANEL_MANDATE – text describing the panel’s mandate

cite() -> str – generate a citation string

It does NOT ingest data: SIU IAP outputs and CRIMSL papers are qualitative / mixed-method PDFs. The federal SIU dataset itself (CSC Research Branch) is not available as open data; this module exists to make the federal context citable from morie analyses.

References

Public Safety Canada. Structured Intervention Unit Implementation Advisory Panel. https://www.publicsafety.gc.ca/cnt/cntrng-crm/crrctns/siuiap-ccuis-en.aspx

Sapers, H. et al. (2024). Final Report – Structured Intervention Unit Implementation Advisory Panel.

Centre for Criminology & Sociolegal Studies, U. of Toronto. Reports on Canada’s Structured Intervention Units. https://www.crimsl.utoronto.ca/news/reports-canada%27s-structured-intervention-units

morie.siuiap.cite(report_id='final_2024')[source]

Build a citation string for an SIU IAP / CRIMSL / affidavit entry.

Searches REPORTS, CRIMSL_REPORTS, and AFFIDAVITS in order.

>>> cite("final_2024")
'SIU IAP (2024). SIU IAP Final Report. Public Safety Canada.'
>>> "Sprott" in cite("sprott_doob_torture_solitary_2021")
True
Parameters:

report_id (str)

Return type:

str

morie.siuiap.panel_summary()[source]

Human-readable summary of the SIU IAP panel.

Return type:

str

morie.doob_trends – Doob Federal Court affidavit (T-539-20) replication.

Replicates the analytical contribution of Prof. Anthony N. Doob’s expert-witness affidavit in Canadian Civil Liberties Association et al. v. The Attorney General of Canada (Federal Court file T-539-20, Application Record Vol. 3 of 5, pp. 778-795).

Doob’s national-aggregate analyses (Figures 1-4 + Tables 1-3) sit ALONGSIDE the per-row MRM modules on OTIS provincial data and the Doob chi-square family on aggregate contingency tables. Where the MRM modules test causal contrasts at the individual level on Ontario provincial data, Doob’s affidavit work tests decoupling of imprisonment rates from crime rates at the Canadian and US national-aggregate level over 50+ years.

Constants exposed

CCRSO_TABLE1_RELEASES – Table 1: 5-year average annual releases by type (day / full parole / statutory) with violent / non-violent / breach revocations and successful completions.

CCRSO_TABLE2_FLOW – Table 2: 2013/14-2017/18 prisoner flow: count / admissions / deaths / full parole releases / statutory releases.

CCRSO_TABLE3_AGE – Table 3: 2018 age distribution – adult population, CSC in-custody count, CSC admissions – by age group (18-49, 50-59, 60+).

Functions exposed

analyze_doob_table1_releases() -> RichResult

Renders Table 1 + computes overall success / revocation rates.

analyze_doob_table2_flow() -> RichResult

Renders Table 2 + year-over-year changes + 5-year averages.

analyze_doob_table3_age_overrepresentation() -> RichResult

Renders Table 3 + computes age-group IRR for CSC custody vs adult population (over- / under-representation).

decoupling_test(crime_series, imprisonment_series) -> RichResult

Tests Doob’s thesis that imprisonment is decoupled from crime rate. Reports Pearson correlation between the two time series with optional Pettitt change-point detection.

pettitt_changepoint(series) -> dict

Pettitt non-parametric change-point detection for time series.

Data sources (cited in Doob affidavit)

CCRSO 2018 – Public Safety Canada, Corrections and Conditional Release Statistical Overview 2018 (released 2019). https://www.publicsafety.gc.ca/cnt/rsrcs/pblctns/ccrso-2018/

StatsCan CANSIM Tables 35-10-0026-01, 35-10-0177-01, 35-10-0014-01.

Pastore, A. L., & Maguire, K. (2004). Sourcebook of Criminal Justice Statistics.

Citation

Doob, A. N. (2020). Affidavit (T-539-20) of Anthony Doob – Federal Court of Canada, Application Record Vol. 3 of 5. CCLA et al. v. Attorney General of Canada.

morie.doob_trends.analyze_doob_table1_releases()[source]

Doob Affidavit Table 1: 5-year average annual conditional releases.

Return type:

RichResult

morie.doob_trends.analyze_doob_table2_flow()[source]

Doob Affidavit Table 2: prisoner flow 2013/14-2017/18.

Return type:

RichResult

morie.doob_trends.analyze_doob_table3_age_overrepresentation()[source]

Doob Affidavit Table 3: age distribution – Canada vs CSC custody.

Return type:

RichResult

morie.doob_trends.pettitt_changepoint(series)[source]

Pettitt’s non-parametric change-point test (Pettitt 1979).

Returns the location of the most likely single change-point in a time series, plus the test statistic and approximate p-value.

>>> r = pettitt_changepoint([1, 1, 1, 1, 5, 5, 5, 5])
>>> r["change_point_index"]
3
Parameters:

series (Iterable[float])

Return type:

dict
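The test statistic is a rank-free sum of sign comparisons across every candidate split point. A minimal sketch (illustrative names; the packaged version may return additional keys):

```python
import numpy as np

def pettitt(series):
    """Pettitt (1979) single change-point test (sketch).

    U_t = sum_{i<=t} sum_{j>t} sign(x_j - x_i); the change point is the
    t maximising |U_t|, with approximate p = 2*exp(-6K^2/(n^3 + n^2))."""
    x = np.asarray(series, dtype=float)
    n = x.size
    sgn = np.sign(x[None, :] - x[:, None])   # sgn[i, j] = sign(x_j - x_i)
    U = np.array([sgn[: t + 1, t + 1:].sum() for t in range(n - 1)])
    t_hat = int(np.argmax(np.abs(U)))
    K = float(np.abs(U[t_hat]))
    p = min(1.0, 2.0 * np.exp(-6.0 * K * K / (n ** 3 + n ** 2)))
    return {"change_point_index": t_hat, "K": K, "p_value": p}
```

`change_point_index` marks the last index of the first regime, matching the doctest above where index 3 closes the run of 1s.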

morie.doob_trends.decoupling_test(crime_series, imprisonment_series, years=None)[source]

Doob’s thesis test: imprisonment rate decoupled from crime rate.

Computes Pearson correlation between the two equal-length time series. A correlation near 0 (or with a sign mismatch) supports Doob’s claim that imprisonment is decoupled from crime trends. Optionally runs Pettitt change-point on each series.

Parameters:
Return type:

RichResult

morie.doob_trends.analyze_doob_full_affidavit()[source]

Comprehensive replication of Doob’s Federal Court affidavit aggregate analyses (Tables 1-3 + decoupling test stub).

Renders Tables 1, 2, 3 + a placeholder for Figures 1-4 (which require StatsCan CANSIM time-series data not bundled with morie).

Return type:

RichResult

Psychometrics

morie.psymet – Psychometric analysis (CTT, reliability, factor analysis).

Short-name API (≤6 chars, no snake_case):

  • crba() – Cronbach’s coefficient alpha

  • mcdo() – McDonald’s omega total + hierarchical

  • itcor() – Corrected item-total correlations

  • adel() – Alpha if item deleted

  • crel() – Composite reliability from factor loadings

  • ave() – Average Variance Extracted

  • kmo() – Kaiser-Meyer-Olkin sampling adequacy

  • bart() – Bartlett’s test of sphericity

  • paran() – Horn’s parallel analysis

  • splhf() – Spearman-Brown split-half reliability

  • idisc() – Item discrimination index

References

  • Cronbach (1951). Psychometrika, 16(3), 297-334.

  • McDonald (1999). Test Theory: A Unified Treatment.

  • Revelle (2024). psych R package.

class morie.psymet.RlbRes(raw, std, avgr, k, n, ci_lo=0.0, ci_hi=0.0)[source]

Bases: object

Reliability result (Cronbach’s alpha).

Parameters:
raw: float
std: float
avgr: float
k: int
n: int
ci_lo: float = 0.0
ci_hi: float = 0.0
summary()[source]
Return type:

DataFrame

class morie.psymet.OmgRes(total, hier, alpha, nf, expvar=0.0)[source]

Bases: object

Omega result (McDonald’s omega).

Parameters:
total: float
hier: float
alpha: float
nf: int
expvar: float = 0.0
class morie.psymet.KmoRes(msa, items=<factory>)[source]

Bases: object

KMO test result.

Parameters:
msa: float
items: dict[str, float]
class morie.psymet.BrtRes(chisq, df, pval)[source]

Bases: object

Bartlett’s sphericity result.

Parameters:
chisq: float
df: int
pval: float
morie.psymet.crba(data, *, ci=0.95)[source]

Cronbach’s coefficient alpha with Feldt CI.

Parameters:
  • data (DataFrame or ndarray) – Items as columns, respondents as rows.

  • ci (float) – Confidence level (default 0.95).

Return type:

RlbRes
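The raw coefficient follows the standard variance decomposition; this sketch omits the standardised variant and the Feldt CI that `crba` also reports:

```python
import numpy as np

def cronbach_alpha_sketch(data):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    X = np.asarray(data, dtype=float)     # rows = respondents, cols = items
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return float((k / (k - 1)) * (1.0 - item_var / total_var))
```

When items are perfectly parallel the covariance terms dominate the total-score variance and alpha reaches 1.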

morie.psymet.mcdo(data, nf=1)[source]

McDonald’s omega total and hierarchical.

Parameters:
  • data (DataFrame or ndarray)

  • nf (int) – Number of factors.

Return type:

OmgRes

morie.psymet.itcor(data)[source]

Corrected item-total correlations.

Returns DataFrame: item, r_total, r_corr

Parameters:

data (DataFrame | ndarray)

Return type:

DataFrame

morie.psymet.adel(data)[source]

Alpha if each item deleted.

Returns DataFrame: item, adel

Parameters:

data (DataFrame | ndarray)

Return type:

DataFrame

morie.psymet.crel(loads)[source]

Composite reliability from standardized factor loadings.

CR = (Σλ)² / ((Σλ)² + Σ(1−λ²))

Parameters:

loads (ndarray)

Return type:

float

morie.psymet.ave(loads)[source]

Average Variance Extracted from factor loadings.

AVE = mean(λ²). Values ≥ 0.5 indicate convergent validity.

Parameters:

loads (ndarray)

Return type:

float
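Both loading-based indices are one-liners over a vector of standardized loadings λ; these sketches mirror the formulas quoted above:

```python
import numpy as np

def composite_reliability_sketch(loads):
    """CR = (sum(lam))^2 / ((sum(lam))^2 + sum(1 - lam^2))."""
    lam = np.asarray(loads, dtype=float)
    s2 = lam.sum() ** 2
    return float(s2 / (s2 + np.sum(1.0 - lam ** 2)))

def ave_sketch(loads):
    """AVE = mean(lam^2); >= 0.5 suggests convergent validity."""
    lam = np.asarray(loads, dtype=float)
    return float(np.mean(lam ** 2))
```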

morie.psymet.kmo(data)[source]

Kaiser-Meyer-Olkin sampling adequacy.

MSA > 0.6 adequate, > 0.8 meritorious (Kaiser 1974).

Parameters:

data (DataFrame | ndarray)

Return type:

KmoRes

morie.psymet.bart(data)[source]

Bartlett’s test of sphericity.

H0: correlation matrix = identity. Reject (p<0.05) -> factorable.

Parameters:

data (DataFrame | ndarray)

Return type:

BrtRes
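The statistic compares the determinant of the item correlation matrix against the identity; a sketch of the usual approximation (illustrative, p-value omitted):

```python
import numpy as np
from math import log

def bartlett_sketch(data):
    """Bartlett's sphericity: chi2 = -((n-1) - (2k+5)/6) * ln|R|,
    df = k(k-1)/2, where R is the k x k item correlation matrix."""
    X = np.asarray(data, dtype=float)
    n, k = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -((n - 1) - (2 * k + 5) / 6.0) * log(np.linalg.det(R))
    return float(chi2), k * (k - 1) // 2
```

If the items are uncorrelated, |R| = 1 and the statistic collapses to zero, i.e. nothing to factor.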

morie.psymet.paran(data, nsim=100, seed=42)[source]

Horn’s parallel analysis – suggested number of factors.

Compares observed eigenvalues to 95th percentile of random data.

Parameters:
Return type:

int

morie.psymet.splhf(data, method='first_last')[source]

Spearman-Brown split-half reliability.

method: ‘first_last’ or ‘odd_even’.

Parameters:
Return type:

float
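The split-half estimate correlates the two half-test scores, then steps the correlation up to full test length with the Spearman-Brown formula; a minimal sketch with the same method options:

```python
import numpy as np

def split_half_sketch(data, method="first_last"):
    """Spearman-Brown split-half: r_sb = 2r / (1 + r), where r is the
    correlation between the two half-test total scores."""
    X = np.asarray(data, dtype=float)
    k = X.shape[1]
    if method == "odd_even":
        a, b = X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1)
    else:  # 'first_last'
        a, b = X[:, : k // 2].sum(axis=1), X[:, k // 2:].sum(axis=1)
    r = np.corrcoef(a, b)[0, 1]
    return float(2.0 * r / (1.0 + r))
```

The odd/even split is usually preferred when items are ordered by difficulty, since a first/last split then confounds the halves with difficulty.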

morie.psymet.idisc(data, pct=0.27)[source]

Item discrimination index (D-statistic).

Upper/lower groups by total score (default 27%, Kelley). Returns DataFrame: item, d

Parameters:
  • data (DataFrame | ndarray)

  • pct (float) – Proportion of respondents in each extreme group (default 0.27).

Return type:

DataFrame
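
D is the difference in mean item score between the top and bottom groups formed by total score. A minimal NumPy sketch (hypothetical helper, not morie's implementation):

```python
import numpy as np

def item_discrimination(data, pct=0.27):
    """D per item: mean in the upper group minus mean in the lower group."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    g = max(1, int(round(n * pct)))        # group size (Kelley's 27% rule)
    order = np.argsort(data.sum(axis=1))   # rank respondents by total score
    lower, upper = data[order[:g]], data[order[-g:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

# Item 0 separates high and low scorers; item 1 is answered by everyone
binary = np.column_stack([np.repeat([0.0, 1.0], 5), np.ones(10)])
print(item_discrimination(binary))  # item 0: D = 1, item 1: D = 0
```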

morie.psymet.cronbach_alpha(data, *, ci=0.95)

Cronbach’s coefficient alpha with Feldt CI.

Parameters:
  • data (DataFrame or ndarray) – Items as columns, respondents as rows.

  • ci (float) – Confidence level (default 0.95).

Return type:

RlbRes

morie.psymet.mcdonalds_omega(data, nf=1)

McDonald’s omega total and hierarchical.

Parameters:
  • data (DataFrame or ndarray)

  • nf (int) – Number of factors.

Return type:

OmgRes

morie.psymet.item_total_correlation(data)

Corrected item-total correlations.

Returns DataFrame: item, r_total, r_corr

Parameters:

data (DataFrame | ndarray)

Return type:

DataFrame

morie.psymet.alpha_if_deleted(data)

Alpha if each item deleted.

Returns DataFrame: item, adel

Parameters:

data (DataFrame | ndarray)

Return type:

DataFrame

morie.psymet.composite_reliability(loads)

Composite reliability from standardized factor loadings.

CR = (Σλ)² / ((Σλ)² + Σ(1−λ²))

Parameters:

loads (ndarray)

Return type:

float

morie.psymet.average_variance_extracted(loads)

Average Variance Extracted from factor loadings.

AVE = mean(λ²). Values ≥ 0.5 indicate convergent validity.

Parameters:

loads (ndarray)

Return type:

float

morie.psymet.kmo_test(data)

Kaiser-Meyer-Olkin sampling adequacy.

MSA > 0.6 adequate, > 0.8 meritorious (Kaiser 1974).

Parameters:

data (DataFrame | ndarray)

Return type:

KmoRes

morie.psymet.bartlett_sphericity(data)

Bartlett’s test of sphericity.

H0: correlation matrix = identity. Rejection (p < 0.05) suggests the data are factorable.

Parameters:

data (DataFrame | ndarray)

Return type:

BrtRes

morie.psymet.parallel_analysis(data, nsim=100, seed=42)

Horn’s parallel analysis – suggested number of factors.

Compares observed eigenvalues with the 95th percentile of eigenvalues from simulated random data of the same shape.

Parameters:
  • data (DataFrame | ndarray)

  • nsim (int) – Number of simulated datasets (default 100).

  • seed (int) – Random seed (default 42).

Return type:

int

morie.psymet.split_half_reliability(data, method='first_last')

Spearman-Brown split-half reliability.

method: ‘first_last’ or ‘odd_even’.

Parameters:
  • data (DataFrame | ndarray)

  • method (str) – ‘first_last’ or ‘odd_even’ (default ‘first_last’).

Return type:

float

morie.psymet.item_discrimination(data, pct=0.27)

Item discrimination index (D-statistic).

Upper/lower groups by total score (default 27%, Kelley). Returns DataFrame: item, d

Parameters:
  • data (DataFrame | ndarray)

  • pct (float) – Proportion of respondents in each extreme group (default 0.27).

Return type:

DataFrame

morie.psymet.ReliabilityResult

alias of RlbRes

morie.psymet.OmegaResult

alias of OmgRes

morie.psymet.KMOResult

alias of KmoRes

morie.psymet.BartlettResult

alias of BrtRes

Function namespace morie.fn

The morie.fn namespace exposes 36,000+ individual function files, indexed by a registry. The full registry is the canonical catalogue:

Star Wars naming registry for morie.fn functions.

Single source of truth: short name -> metadata. Used by fn/__init__.py for auto-export and stat_commands.py for CLI dispatch.

“Knowledge itself is power.” – Francis Bacon

class morie.fn._registry.FnEntry(short, full, category, description, quote='')[source]

Bases: object

Registry entry for one fn/ function.

Parameters:
short: str
full: str
category: str
description: str
quote: str
morie.fn._registry.cheatsheet(category=None)[source]

Print a Star Wars-themed cheatsheet of all morie.fn functions.

>>> from morie.fn._registry import cheatsheet
>>> cheatsheet()
>>> cheatsheet("Test")  # filter by category
Parameters:

category (str | None)

Return type:

None

Result containers shared across the package:

Shared result dataclasses for morie.fn functions.

class morie.fn._containers.RlbRes(raw, std, avgr, k, n, ci_lo=0.0, ci_hi=0.0)[source]

Bases: object

Reliability result (Cronbach’s alpha).

Parameters:
raw: float
std: float
avgr: float
k: int
n: int
ci_lo: float = 0.0
ci_hi: float = 0.0
summary()[source]
Return type:

DataFrame

class morie.fn._containers.OmgRes(total, hier, alpha, nf, expvar=0.0)[source]

Bases: object

Omega result (McDonald’s omega).

Parameters:
total: float
hier: float
alpha: float
nf: int
expvar: float = 0.0
class morie.fn._containers.KmoRes(msa, items=<factory>)[source]

Bases: object

KMO test result.

Parameters:
msa: float
items: dict[str, float]
class morie.fn._containers.BrtRes(chisq, df, pval)[source]

Bases: object

Bartlett’s sphericity result.

Parameters:
chisq: float
df: int
pval: float
class morie.fn._containers.RplRes(counts, props, year, sex=None)[source]

Bases: object

Regional placement result.

Parameters:
counts: DataFrame
props: DataFrame
year: int
sex: str | None = None
class morie.fn._containers.AstRes(data, summary)[source]

Bases: object

Alert-state combination result.

Parameters:
data: DataFrame
summary: DataFrame
class morie.fn._containers.VolRes(data, mean, median)[source]

Bases: object

Volatility result.

Parameters:
data: DataFrame
mean: float
median: float
class morie.fn._containers.OtDmlR(ate, ate_se, ate_pval, att, att_se, att_pval, n, method='IRM')[source]

Bases: object

OTIS DML result.

Parameters:
ate: float
ate_se: float
ate_pval: float
att: float
att_se: float
att_pval: float
n: int
method: str = 'IRM'
class morie.fn._containers.ESRes(measure, estimate, ci_lower=None, ci_upper=None, se=None, n=None, extra=<factory>)[source]

Bases: object

Standardised result for every effect-size calculation.

Parameters:
measure: str
estimate: float
ci_lower: float | None = None
ci_upper: float | None = None
se: float | None = None
n: int | None = None
extra: dict
class morie.fn._containers.TestResult(test_name, statistic, p_value, df=None, method='', n=None, extra=<factory>)[source]

Bases: object

Result from a hypothesis test.

Parameters:
test_name: str
statistic: float
p_value: float
df: float | None = None
method: str = ''
n: int | None = None
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.RegressionResult(method, coefficients, se, p_values, r_squared=None, adj_r_squared=None, residuals=None, fitted=None, n=0, k=0, extra=<factory>)[source]

Bases: object

Result from a regression model.

Parameters:
method: str
coefficients: dict[str, float]
se: dict[str, float]
p_values: dict[str, float]
r_squared: float | None = None
adj_r_squared: float | None = None
residuals: ndarray | None = None
fitted: ndarray | None = None
n: int = 0
k: int = 0
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.DescriptiveResult(name, value=None, extra=<factory>)[source]

Bases: object

Result from a descriptive / exploratory function.

Parameters:
name: str
value: float | dict | DataFrame | None = None
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.TimeSeriesResult(name, values=None, lags=None, ci_upper=None, ci_lower=None, extra=<factory>)[source]

Bases: object

Result from a time-series analysis.

Parameters:
name: str
values: ndarray | None = None
lags: ndarray | None = None
ci_upper: ndarray | None = None
ci_lower: ndarray | None = None
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.SurvivalResult(name, times=None, survival=None, ci_lower=None, ci_upper=None, median_survival=None, n_events=0, n_censored=0, extra=<factory>)[source]

Bases: object

Result from a survival analysis.

Parameters:
name: str
times: ndarray | None = None
survival: ndarray | None = None
ci_lower: ndarray | None = None
ci_upper: ndarray | None = None
median_survival: float | None = None
n_events: int = 0
n_censored: int = 0
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.DiagnosticResult(name, estimate, ci_lower=None, ci_upper=None, n=0, extra=<factory>)[source]

Bases: object

Result from a diagnostic accuracy metric.

Parameters:
name: str
estimate: float
ci_lower: float | None = None
ci_upper: float | None = None
n: int = 0
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.IRTResult(model, item_params, theta=None, se_theta=None, fit=<factory>, info=None)[source]

Bases: object

Result from an IRT model.

Parameters:
model: str
item_params: dict
theta: ndarray | None = None
se_theta: ndarray | None = None
fit: dict
info: ndarray | None = None
summary()[source]
Return type:

str

class morie.fn._containers.DIFResult(method, items=<factory>, flagged=<factory>, group_var='')[source]

Bases: object

Result from a DIF analysis.

Parameters:
method: str
items: DataFrame
flagged: list
group_var: str = ''
summary()[source]
Return type:

str

class morie.fn._containers.InvarianceResult(level, fit=<factory>, delta_fit=<factory>, passed=False)[source]

Bases: object

Result from a measurement invariance test.

Parameters:
level: str
fit: dict
delta_fit: dict
passed: bool = False
summary()[source]
Return type:

str

class morie.fn._containers.RecidivismResult(rate, n_recid, n_total, ci_lower=0.0, ci_upper=1.0, subgroup=None, method='proportion')[source]

Bases: object

Result from a recidivism analysis.

Parameters:
rate: float
n_recid: int
n_total: int
ci_lower: float = 0.0
ci_upper: float = 1.0
subgroup: str | None = None
method: str = 'proportion'
summary()[source]
Return type:

str

class morie.fn._containers.CustodyResult(metric, value, n=0, period=None, region=None, extra=<factory>)[source]

Bases: object

Result from a custody/institutional metric.

Parameters:
metric: str
value: float
n: int = 0
period: str | None = None
region: str | None = None
extra: dict
summary()[source]
Return type:

str

class morie.fn._containers.PcaRes(components, explained_variance, explained_variance_ratio, scores, n=0, p=0)[source]

Bases: object

Principal Component Analysis result.

Parameters:
components: ndarray
explained_variance: ndarray
explained_variance_ratio: ndarray
scores: ndarray
n: int = 0
p: int = 0
summary()[source]
Return type:

str

class morie.fn._containers.FaRes(loadings, communalities, eigenvalues, variance_explained, n_factors=0, rotation='varimax')[source]

Bases: object

Exploratory Factor Analysis result.

Parameters:
loadings: ndarray
communalities: ndarray
eigenvalues: ndarray
variance_explained: ndarray
n_factors: int = 0
rotation: str = 'varimax'
summary()[source]
Return type:

str

class morie.fn._containers.CfaRes(cfi, tli, rmsea, srmr, loadings=<factory>, residuals=None)[source]

Bases: object

Confirmatory Factor Analysis result.

Parameters:
cfi: float
tli: float
rmsea: float
srmr: float
loadings: dict
residuals: ndarray | None = None
summary()[source]
Return type:

str

class morie.fn._containers.MdsRes(coordinates, stress, eigenvalues)[source]

Bases: object

Multidimensional Scaling result.

Parameters:
coordinates: ndarray
stress: float
eigenvalues: ndarray
summary()[source]
Return type:

str

class morie.fn._containers.TsneRes(embedding)[source]

Bases: object

t-SNE result.

Parameters:

embedding (ndarray)

embedding: ndarray
summary()[source]
Return type:

str

class morie.fn._containers.UmapRes(embedding)[source]

Bases: object

UMAP result.

Parameters:

embedding (ndarray)

embedding: ndarray
summary()[source]
Return type:

str

class morie.fn._containers.KmeansRes(labels, centers, inertia, n_iter)[source]

Bases: object

K-means clustering result.

Parameters:
labels: ndarray
centers: ndarray
inertia: float
n_iter: int
summary()[source]
Return type:

str

class morie.fn._containers.HclstRes(labels, linkage_matrix, distances=None)[source]

Bases: object

Hierarchical clustering result.

Parameters:
labels: ndarray
linkage_matrix: ndarray
distances: ndarray | None = None
summary()[source]
Return type:

str

class morie.fn._containers.LdaRes(components, explained_variance_ratio, projected)[source]

Bases: object

Linear Discriminant Analysis result.

Parameters:
components: ndarray
explained_variance_ratio: ndarray
projected: ndarray
summary()[source]
Return type:

str

class morie.fn._containers.DbscnRes(labels, n_clusters, n_noise)[source]

Bases: object

DBSCAN clustering result.

Parameters:
labels: ndarray
n_clusters: int
n_noise: int
summary()[source]
Return type:

str

class morie.fn._containers.SIRResult(model, t=None, S=None, I=None, R=None, E=None, R0=None, extra=<factory>)[source]

Bases: object

Result from a compartmental model (SIR/SEIR/etc).

Parameters:
model: str
t: ndarray | None = None
S: ndarray | None = None
I: ndarray | None = None
R: ndarray | None = None
E: ndarray | None = None
R0: float | None = None
extra: dict
summary()[source]
Return type:

str

class morie.fn._containers.SpatialResult(name, statistic, p_value=None, expected=None, variance=None, local_values=None, extra=<factory>)[source]

Bases: object

Result from a spatial statistic.

Parameters:
name: str
statistic: float
p_value: float | None = None
expected: float | None = None
variance: float | None = None
local_values: ndarray | None = None
extra: dict
summary()[source]
Return type:

str

class morie.fn._containers.GenomicsResult(name, statistic, p_value=None, n=0, extra=<factory>)[source]

Bases: object

Result from a genomics/population genetics test.

Parameters:
name: str
statistic: float
p_value: float | None = None
n: int = 0
extra: dict
summary()[source]
Return type:

str

class morie.fn._containers.CrimeResult(name, rate, ci_lower=None, ci_upper=None, n=0, population=0, extra=<factory>)[source]

Bases: object

Result from a criminological metric.

Parameters:
name: str
rate: float
ci_lower: float | None = None
ci_upper: float | None = None
n: int = 0
population: int = 0
extra: dict
summary()[source]
Return type:

str

class morie.fn._containers.SignalResult(name, filtered=None, fs=0.0, n_samples=0, extra=<factory>)[source]

Bases: object

Result from a signal processing function.

Parameters:
name: str
filtered: ndarray | None = None
fs: float = 0.0
n_samples: int = 0
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

class morie.fn._containers.CryptoResult(algorithm, operation, success=True, extra=<factory>)[source]

Bases: object

Result from a cryptographic operation.

Parameters:
algorithm: str
operation: str
success: bool = True
extra: dict
summary()[source]
Return type:

str

to_dict()[source]
Return type:

dict

Rich-output result containers for morie.fn.

Results carry an R-style verbose __repr__, so printing a result shows a paragraph-level summary rather than a bare dict – modeled after psych::omega(), lavaan’s summary(), survival::summary.coxph(), etc.

Usage:

res = welcht([1,2,3], [4,5,6])
print(res)        # multi-section ASCII summary
res.statistic     # still attribute-accessible like a dict
res.warnings      # list of advisory strings
class morie.fn._richresult.RichResult(title='', call='', summary_lines=<factory>, sections=<factory>, tables=<factory>, comparison=None, extras=<factory>, warnings=<factory>, interpretation='', payload=<factory>)[source]

Bases: dict

Generic rich-printing result dict-like – psych::omega style.

Inherits from dict, so isinstance(result, dict) is True and legacy callers that expect a plain dict (.get(…), for k in r, result["statistic"]) keep working unchanged. The RichResult layer adds title / sections / tables / interpretation / warnings on top – opt-in via print(result) or result.summary().

Supports multi-section output:

  • title – short test/model name (top of output)

  • call – optional call signature line (“Call: foo(x, y, …)”)

  • summary_lines – headline metrics: list of (label, value) tuples

  • sections – list of {“title”: str, “lines”: [(lbl, val)], “table”: […]} dicts for additional sections (each printed with its own header)

  • tables – list of {“title”: str, “headers”: […], “rows”: [[…]]} dicts for tabular output (column-aligned)

  • comparison – optional dict with the same shape as a section, printed under a “Compare this with …” heading

  • extras – list of free-form paragraph blocks

  • warnings – list of advisory strings, surfaced verbatim (multi-line OK)

  • interpretation – optional plain-language concluding sentence(s)

  • payload – arbitrary dict for programmatic access

Parameters:
title: str = ''
call: str = ''
summary_lines: list[tuple[str, Any]]
sections: list[dict]
tables: list[dict]
comparison: dict | None = None
extras: list[str]
warnings: list[str]
interpretation: str = ''
payload: dict
get(key, default=None)[source]

Return the value for key if key is in the dictionary, else default.

Parameters:

key (str)

keys() a set-like object providing a view on D's keys[source]
values() an object providing a view on D's values[source]
items() a set-like object providing a view on D's items[source]
to_dict()[source]
Return type:

dict

summary()[source]
Return type:

str
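
The dict-subclass pattern RichResult uses can be illustrated with a toy class: inherit from dict so plain-dict callers keep working, add attribute access over the keys, and route __repr__ through a multi-line summary. MiniResult below is purely illustrative and not part of morie:

```python
class MiniResult(dict):
    """Toy sketch: dict-compatible result with attribute access
    and a multi-line summary when printed."""

    def __init__(self, title, **payload):
        super().__init__(payload)   # keys stay visible to plain-dict callers
        self.title = title

    def __getattr__(self, name):
        try:
            return self[name]       # r.statistic falls through to r["statistic"]
        except KeyError:
            raise AttributeError(name)

    def summary(self):
        lines = [self.title, "-" * len(self.title)]
        lines += [f"{k:>12}: {v}" for k, v in self.items()]
        return "\n".join(lines)

    __repr__ = summary              # print(r) shows the summary, not a bare dict

r = MiniResult("Welch t-test", statistic=-3.0, p_value=0.04)
print(r)                 # multi-line summary
print(r["statistic"])    # still works as a plain dict
```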

morie.fn._richresult.hypothesis_test_result(*, test_name, statistic, pvalue, df=None, alpha=0.05, extra_summary=None, warnings=None, extra_payload=None, callable_name=None)[source]

Factory for hypothesis-test results with a uniform layout.

callable_name (optional) – short name of the morie.fn callable that produced this result. If provided, a “Learn more” pointer is appended to the extras, suggesting describe(<name>) for the full pedagogical guide. Callers can pass it explicitly or rely on the inspect-stack auto-detection fallback in attach_describe_pointer().

Parameters:
Return type:

RichResult

morie.fn._richresult.with_describe_pointer(result, callable_name)[source]

Add a describe() pointer to an existing RichResult.

Used by callables that return RichResult directly (without going through hypothesis_test_result()) to opt-in to the same convention.

Parameters:
  • result (RichResult)

  • callable_name (str)

Return type:

RichResult