Python API¶
Part of the MORIE API Reference.
Reference for every public morie.* module.
Signatures and docstrings come from sphinx.ext.autodoc; see
Statistical Methods for the methodology behind each function.
Causal inference¶
Causal estimators: IPW, AIPW, G-computation, Matching and Double Machine Learning.
- morie.causal.compute_propensity_scores(data, treatment, covariates)[source]¶
Compute propensity scores using logistic regression with standard preprocessing.
Non-numeric (object/categorical) columns are integer-encoded with
LabelEncoder and all numeric columns are standardised with StandardScaler before fitting. Standardisation stabilises the logistic-regression solver and makes coefficients comparable across covariates on different scales. Encoding is applied in place on a copy; the original DataFrame is not modified.
- Parameters:
data (pandas.DataFrame) – The input pandas DataFrame containing covariates and treatment.
treatment (str) – The name of the column containing the binary treatment variable.
covariates (list[str]) – A list of column names for the covariates to adjust for.
- Returns:
A pandas Series containing the predicted propensity scores (values in the open interval (0, 1)).
- Return type:
References
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41
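The preprocessing described above can be sketched standalone with scikit-learn. This is an illustrative re-implementation, not the morie source; the example column names are made up:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

def propensity_scores(data, treatment, covariates):
    """Sketch of the documented preprocessing: integer-encode object/categorical
    columns, standardise everything, then fit a logistic regression."""
    X = data[covariates].copy()              # work on a copy; input unchanged
    for col in X.columns:
        if X[col].dtype == object or str(X[col].dtype) == "category":
            X[col] = LabelEncoder().fit_transform(X[col].astype(str))
    X[:] = StandardScaler().fit_transform(X)
    model = LogisticRegression(max_iter=1000).fit(X, data[treatment])
    return pd.Series(model.predict_proba(X)[:, 1], index=data.index)

df = pd.DataFrame({
    "treated": [0, 1, 0, 1, 0, 1, 0, 1],
    "age": [23, 35, 31, 40, 22, 38, 29, 41],
    "sex": ["m", "f", "f", "m", "m", "f", "f", "m"],
})
ps = propensity_scores(df, "treated", ["age", "sex"])
assert ((ps > 0) & (ps < 1)).all()   # scores lie in the open interval (0, 1)
```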
- morie.causal.calculate_ipw_weights(data, treatment, ps_col, *, stabilized=False, trim_quantiles=None)[source]¶
Calculate inverse probability of treatment weights (IPTW).
- Parameters:
- Returns:
A pandas Series containing the IPTW for each observation.
- Return type:
- morie.causal.effective_sample_size(weights)[source]¶
Compute effective sample size for analytical weights.
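The docstring does not state the formula; a common definition for analytical weights is Kish's effective sample size, sketched here as an assumption about what the function computes:

```python
import numpy as np

def effective_sample_size(weights):
    # Kish's ESS: (sum w)^2 / sum w^2. Equals n for uniform weights and
    # shrinks toward 1 as the weights become more concentrated.
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

assert effective_sample_size([1, 1, 1, 1]) == 4.0  # uniform weights: ESS = n
assert effective_sample_size([1, 0, 0, 0]) == 1.0  # one dominant weight
```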
- morie.causal.run_propensity_ipw_analysis(data, *, treatment='cannabis_any_use', outcome='heavy_drinking_30d', covariates=None, survey_weight_col='weight')[source]¶
Reproduce the core outputs of the old 07_propensity.R workflow.
ATE Estimator¶
The Average Treatment Effect is estimated via the Hájek (normalised Horvitz-Thompson) IPW estimator:
\[\widehat{\text{ATE}}_{\text{Hájek}} = \frac{\sum_i T_i Y_i / e_i}{\sum_i T_i / e_i} - \frac{\sum_i (1-T_i) Y_i / (1-e_i)}{\sum_i (1-T_i) / (1-e_i)}\]
where \(T_i \in \{0,1\}\) is the treatment indicator, \(Y_i\) is the outcome, and \(e_i = P(T_i=1 \mid X_i)\) is the estimated propensity score.
The Hájek estimator is preferred over the raw Horvitz-Thompson estimator because it is self-normalised (the weights sum to 1 within each arm), which gives better finite-sample performance and is less sensitive to extreme propensity scores.
Trimmed IPW weights (1st–99th percentile) are used to reduce the influence of extreme propensity scores.
References
Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19), 2937–2960. https://doi.org/10.1002/sim.1903
Hájek, J. (1971). Comment on “An essay on the logical foundations of survey sampling” by D. Basu. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of Statistical Inference (pp. 236–236). Holt, Rinehart & Winston.
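The Hájek formula above can be computed directly with NumPy. A standalone illustration, not the morie implementation:

```python
import numpy as np

def hajek_ate(y, t, ps):
    """Self-normalised (Hájek) IPW ATE, mirroring the formula above."""
    y, t, ps = map(np.asarray, (y, t, ps))
    treated = np.sum(t * y / ps) / np.sum(t / ps)
    control = np.sum((1 - t) * y / (1 - ps)) / np.sum((1 - t) / (1 - ps))
    return treated - control

# With constant propensity scores, Hájek reduces to a difference in means.
y = np.array([3.0, 5.0, 2.0, 4.0])
t = np.array([1, 1, 0, 0])
assert hajek_ate(y, t, np.full(4, 0.5)) == 1.0
```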
- morie.causal.estimate_aipw(data, *, treatment='cannabis_any_use', outcome='heavy_drinking_30d', covariates=None, outcome_model='logistic')[source]¶
Estimate the ATE via the Augmented Inverse Probability Weighting (AIPW) doubly-robust estimator.
The influence-function score for unit i is:
\[\psi_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i \bigl(Y_i - \hat{\mu}_1(X_i)\bigr)}{\hat{e}_i} - \frac{(1-T_i)\bigl(Y_i - \hat{\mu}_0(X_i)\bigr)}{1 - \hat{e}_i}\]
The ATE estimator is \(\widehat{\text{ATE}}_\text{AIPW} = n^{-1} \sum_i \psi_i\) and the standard error is \(\hat{\sigma}_{\psi} / \sqrt{n}\).
Double robustness: the estimator is consistent if either the propensity score model \(\hat{e}(X)\) or the outcome model \(\hat{\mu}(T, X)\) is correctly specified – not necessarily both.
- Parameters:
data (pandas.DataFrame) – The input DataFrame.
treatment (str) – Binary treatment column (0/1).
outcome (str) – Outcome column.
covariates (list[str] | None) – Covariate column names. Defaults to the standard CPADS confounders.
outcome_model (str) – "logistic" for binary outcomes, "linear" for continuous. Defaults to "logistic".
- Returns:
Dictionary with keys
ate, se, ci_lower, ci_upper, n, method.
- Return type:
References
Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866.
Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. JASA, 94(448), 1096–1120.
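Given fitted nuisances, the influence-function score above is a one-liner. A standalone sketch (the nuisance estimates are supplied directly rather than fitted, so only the score arithmetic is shown):

```python
import numpy as np

def aipw_ate(y, t, ps, mu1, mu0):
    """AIPW: mean of the influence-function scores; SE = std(psi)/sqrt(n)."""
    y, t, ps, mu1, mu0 = map(np.asarray, (y, t, ps, mu1, mu0))
    psi = (mu1 - mu0
           + t * (y - mu1) / ps
           - (1 - t) * (y - mu0) / (1 - ps))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))

# If the outcome models fit the data exactly, the correction terms vanish
# and the ATE is just the mean of mu1 - mu0.
y   = np.array([1.0, 2.0, 3.0, 4.0])
t   = np.array([1, 0, 1, 0])
mu1 = np.array([1.0, 3.0, 3.0, 5.0])
mu0 = np.array([0.0, 2.0, 2.0, 4.0])
ate, se = aipw_ate(y, t, np.full(4, 0.5), mu1, mu0)
assert ate == 1.0
```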
- morie.causal.run_ebac_selection_ipw_analysis(data, *, observation_col='ebac_tot', eligible_col='alcohol_past12m', treatment='cannabis_any_use', binary_outcome='ebac_legal', continuous_outcome='ebac_tot', survey_weight_col='weight', covariates=None)[source]¶
Reproduce the core outputs of the old 07_ebac_ipw.R workflow.
- morie.causal.estimate_att(data, *, treatment, outcome, covariates, propensity_col=None)[source]¶
Estimate the Average Treatment Effect on the Treated (ATT) via Hájek-weighted IPW.
The ATT is defined as:
\[\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T=1]\]
Under unconfoundedness, the ATT is identified by the Hájek IPW estimator, where treated units receive weight 1 and control units receive weight \(\hat{e}(X) / (1 - \hat{e}(X))\):
\[\widehat{\text{ATT}} = \frac{1}{n_1} \sum_{i: T_i=1} Y_i - \frac{\sum_{i: T_i=0} Y_i \cdot \hat{e}_i / (1 - \hat{e}_i)} {\sum_{i: T_i=0} \hat{e}_i / (1 - \hat{e}_i)}\]
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
treatment (str) – Binary treatment column (0/1).
outcome (str) – Outcome column.
covariates (list[str]) – Covariate column names for propensity model.
propensity_col (str or None) – Pre-computed propensity score column. If None, propensity scores are estimated from covariates.
- Returns:
Dictionary with
att, se, ci_lower, ci_upper, n.
- Return type:
References
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and Statistics, 86(1), 4–29. https://doi.org/10.1162/003465304323023651
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.
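The ATT weighting scheme above translates directly into code. An illustrative standalone sketch, not the morie source:

```python
import numpy as np

def hajek_att(y, t, ps):
    """ATT: treated units get weight 1; controls get odds weight e/(1-e)."""
    y, t, ps = map(np.asarray, (y, t, ps))
    w = ps / (1 - ps)
    treated_mean = y[t == 1].mean()
    control_mean = np.sum(y[t == 0] * w[t == 0]) / np.sum(w[t == 0])
    return treated_mean - control_mean

# Constant propensity scores: the odds weights are equal, so the ATT
# reduces to a simple difference in means (6 - 3 = 3).
y = np.array([5.0, 7.0, 2.0, 4.0])
t = np.array([1, 1, 0, 0])
assert hajek_att(y, t, np.full(4, 0.5)) == 3.0
```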
- morie.causal.estimate_atc(data, *, treatment, outcome, covariates, propensity_col=None)[source]¶
Estimate the Average Treatment Effect on the Controls (ATC).
The ATC is defined as:
\[\text{ATC} = \mathbb{E}[Y(1) - Y(0) \mid T=0]\]
Treated units receive weight \((1 - \hat{e}(X)) / \hat{e}(X)\) and control units receive weight 1.
- Parameters:
- Returns:
Dictionary with
atc, se, ci_lower, ci_upper, n.
- Return type:
References
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: a review. Review of Economics and Statistics, 86(1), 4–29.
- morie.causal.estimate_gate(data, *, treatment, outcome, covariates, group_col, propensity_col=None)[source]¶
Estimate Group Average Treatment Effects (GATE) via AIPW within strata.
Partitions the data by group_col and estimates the ATE within each group using the AIPW doubly-robust estimator.
\[\text{GATE}_g = \mathbb{E}[Y(1) - Y(0) \mid G = g]\]
- Parameters:
data (pandas.DataFrame) – Input DataFrame.
treatment (str) – Binary treatment column (0/1).
outcome (str) – Outcome column.
group_col (str) – Column defining groups/strata.
propensity_col (str or None) – Pre-computed propensity score column (optional).
- Returns:
DataFrame with columns:
group, ate, se, ci_lower, ci_upper, n.
- Return type:
References
Imai, K., & Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. Annals of Applied Statistics, 7(1), 443–470.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernandez-Val, I. (2020). Generic machine learning inference on heterogeneous treatment effects in randomized experiments. NBER Working Paper 24678.
- morie.causal.estimate_cate(data, *, treatment, outcome, covariates, meta_learner='t_learner')[source]¶
Estimate per-unit Conditional Average Treatment Effects (CATE).
The CATE for unit i is \(\tau(X_i) = \mathbb{E}[Y(1) - Y(0) \mid X = X_i]\).
T-learner (default): fits separate outcome models on treated and control units, then computes \(\hat{\tau}(X_i) = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\).
S-learner: fits a single outcome model on all units with treatment as a feature, then computes \(\hat{\tau}(X_i) = \hat{\mu}(X_i, T=1) - \hat{\mu}(X_i, T=0)\).
- Parameters:
- Returns:
Series of per-unit CATE estimates, indexed like the input.
- Return type:
References
Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. PNAS, 116(10), 4156–4165. https://doi.org/10.1073/pnas.1804597116
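The T-learner described above can be sketched with any pair of regressors; here plain linear models stand in for whatever learners morie actually uses (an assumption, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def t_learner_cate(X, y, t):
    """T-learner: fit separate outcome models per arm, then
    tau(x) = mu1(x) - mu0(x) for every unit."""
    m1 = LinearRegression().fit(X[t == 1], y[t == 1])
    m0 = LinearRegression().fit(X[t == 0], y[t == 0])
    return m1.predict(X) - m0.predict(X)

# Simulated data with a constant true effect of 2.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
t = rng.integers(0, 2, size=200)
y = 2.0 * t + X[:, 0] + rng.normal(scale=0.1, size=200)
tau = t_learner_cate(X, y, t)
assert abs(tau.mean() - 2.0) < 0.2
```

With a constant true effect, both meta-learners should recover roughly the same CATE surface; the T-learner is the documented default.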
- morie.causal.estimate_late(data, *, treatment, outcome, instrument, covariates=None)[source]¶
Estimate the Local Average Treatment Effect (LATE) via instrumental variables.
For a binary instrument \(Z\), the Wald estimator (simple IV) is:
\[\widehat{\text{LATE}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(T, Z)} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{T}_{Z=1} - \bar{T}_{Z=0}}\]
With covariates, the function attempts 2SLS via linearmodels.iv.IV2SLS or statsmodels IV regression, falling back to the Wald estimator if neither IV library is installed.
- Parameters:
- Returns:
Dictionary with
late, se, ci (tuple), f_stat, method.
- Return type:
References
Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467–475. https://doi.org/10.2307/2951620
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. JASA, 91(434), 444–455.
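A minimal sketch of the Wald estimator above, illustrative rather than the morie implementation:

```python
import numpy as np

def wald_late(y, t, z):
    """Wald estimator: ITT on the outcome divided by ITT on take-up."""
    y, t, z = map(np.asarray, (y, t, z))
    return ((y[z == 1].mean() - y[z == 0].mean())
            / (t[z == 1].mean() - t[z == 0].mean()))

z = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # binary instrument
t = np.array([1, 1, 1, 0, 1, 0, 0, 0])   # take-up: 0.75 vs 0.25
y = np.array([4.0, 6.0, 5.0, 1.0, 5.0, 1.0, 2.0, 0.0])
# ITT on y is 4.0 - 2.0 = 2.0; first stage is 0.5; LATE = 4.0.
assert wald_late(y, t, z) == 4.0
```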
- morie.causal.estimate_irm(data, *, treatment, outcome, covariates, n_folds=5, random_state=42)[source]¶
Estimate the ATE via the Interactive Regression Model (IRM) using DoubleML.
The IRM extends the partially linear model by allowing treatment effect heterogeneity. It models:
\[ \begin{align}\begin{aligned}Y = g_0(T, X) + U, \quad \mathbb{E}[U \mid X, T] = 0\\T = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0\end{aligned}\end{align} \]
where \(g_0\) is the outcome regression and \(m_0\) is the propensity score. The Neyman-orthogonal score for the ATE is:
\[\psi = g_0(1, X) - g_0(0, X) + \frac{T(Y - g_0(1,X))}{m_0(X)} - \frac{(1-T)(Y - g_0(0,X))}{1 - m_0(X)} - \theta\]
- Parameters:
- Returns:
Dictionary with
ate, se, ci_lower, ci_upper, n, method.
- Return type:
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
- morie.causal.estimate_double_ml(data, outcome, treatment, covariates, *, random_state=42, n_folds=5, n_rep=1)[source]¶
Estimate the Average Treatment Effect using Double Machine Learning (DML).
Uses doubleml.DoubleMLPLR (Partially Linear Regression model) with Random Forest nuisance estimators for both the outcome regression (ml_l) and the treatment model (ml_m).
Reproducibility¶
All stochastic operations are seeded deterministically:
The numpy global seed is set to random_state immediately before constructing the learners.
Both RandomForestRegressor instances receive random_state.
The DoubleMLPLR object is constructed with n_folds and n_rep passed explicitly so that the cross-fitting schedule is fixed for a given seed.
To change the seed for a sensitivity run, pass random_state=<int>.
- param data:
The dataset containing outcome, treatment, and covariates.
- type data:
pandas.DataFrame
- param outcome:
Name of the continuous outcome variable.
- type outcome:
str
- param treatment:
Name of the treatment variable.
- type treatment:
str
- param covariates:
List of covariate column names to control for.
- type covariates:
list[str]
- param random_state:
Integer seed for all RNGs. Default 42.
- type random_state:
int, optional
- param n_folds:
Number of cross-fitting folds. Default 5.
- type n_folds:
int, optional
- param n_rep:
Number of repeated cross-fitting repetitions. Default 1.
- type n_rep:
int, optional
- return:
A fitted doubleml.DoubleMLPLR object containing the causal estimate and inference results.
- rtype:
doubleml.DoubleMLPLR
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Treatment effect estimations (ATE, ATT, ATU, LATE, G-computation).
This module provides:
estimate_ate() – IPW-weighted OLS ATE (existing, preserved).
estimate_plr() – Partially Linear Regression via DoubleML.
estimate_pliv() – Partially Linear IV (LATE) via DoubleML or 2SLS.
estimate_ate_gcomputation() – G-computation (outcome regression) ATE.
sensitivity_rosenbaum() – Rosenbaum bounds for hidden confounding.
e_value() – E-value for unmeasured confounding (VanderWeele & Ding, 2017).
References
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey,
W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
- Robins, J. M. (1986). A new approach to causal inference in mortality studies
with a sustained exposure period. Mathematical Modelling, 7, 1393-1512.
- VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational
research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.
- Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.
- morie.effects.estimate_ate(data, outcome, treatment, weights_col)[source]¶
Estimate Average Treatment Effect (ATE) using a weighted linear model.
- Parameters:
data (pandas.DataFrame) – The pandas DataFrame containing the analytical sample.
outcome (str) – The name of the outcome variable column.
treatment (str) – The name of the binary treatment indicator column.
weights_col (str) – The name of the column containing the analytical weights (e.g. IPTW).
- Returns:
A tuple containing the estimated ATE coefficient and its standard error.
- Return type:
- morie.effects.estimate_plr(data, *, treatment, outcome, covariates, n_folds=5, random_state=42)[source]¶
Partially Linear Regression (PLR) ATE via DoubleML.
The PLR model is:
\[Y = \theta_0 D + g_0(X) + \varepsilon, \quad D = m_0(X) + v\]
where \(\theta_0\) is the ATE, \(g_0\) and \(m_0\) are nuisance functions estimated via cross-fitting, and the final \(\theta_0\) estimate is debiased.
Nuisance learners default to ridge-regularised linear regression for both \(g_0\) (outcome) and \(m_0\) (treatment propensity / residual).
- Parameters:
data (DataFrame) – DataFrame containing all required columns.
treatment (str) – Column name of the binary or continuous treatment variable.
outcome (str) – Column name of the outcome variable.
n_folds (int) – Number of cross-fitting folds. Default 5.
random_state (int) – Random seed for reproducibility. Default 42.
- Returns:
dict with keys
ate, se, ci_lower, ci_upper, pval, n_obs.
- Raises:
ImportError – If doubleml is not installed.
ValueError – If required columns are missing.
- Return type:
References
- Chernozhukov et al. (2018). Double/debiased machine learning for treatment
and structural parameters. Econometrics Journal, 21(1), C1-C68.
- Bach, P., Chernozhukov, V., Kurz, M. S., & Spindler, M. (2022). DoubleML:
An object-oriented implementation of double machine learning in Python. Journal of Machine Learning Research, 23(53), 1-6.
- morie.effects.estimate_pliv(data, *, treatment, outcome, instrument, covariates, n_folds=5, random_state=42)[source]¶
Partially Linear IV (PLIV) for the Local Average Treatment Effect (LATE) using an instrumental variable via DoubleML.
The PLIV model is:
\[Y = \theta_0 D + g_0(X) + \varepsilon, \quad D = m_0(X) + f_0(Z, X) + v\]
where \(Z\) is the instrument, satisfying the relevance and exclusion restrictions.
Falls back to two-stage least squares (2SLS) via statsmodels if DoubleML is unavailable.
- Parameters:
data (DataFrame) – DataFrame containing all required columns.
treatment (str) – Column name of the endogenous treatment variable.
outcome (str) – Column name of the outcome variable.
instrument (str) – Column name of the instrument.
covariates (list[str]) – List of exogenous covariate column names.
n_folds (int) – Cross-fitting folds (DoubleML path). Default 5.
random_state (int) – Random seed. Default 42.
- Returns:
dict with keys
late, se, ci_lower, ci_upper, pval, n_obs, method.
- Raises:
ValueError – If required columns are missing.
- Return type:
References
- Chernozhukov et al. (2018). Double/debiased machine learning.
Econometrics Journal, 21(1), C1-C68.
- Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of
causal effects using instrumental variables. JASA, 91(434), 444-455.
- morie.effects.estimate_ate_gcomputation(data, *, treatment, outcome, covariates, outcome_model='linear')[source]¶
G-computation (outcome regression / standardisation) ATE estimator.
The G-computation estimator proceeds in three steps:
1. Fit an outcome model E[Y | T, X] on the observed data.
2. Predict potential outcomes \(\hat{Y}(1)\) and \(\hat{Y}(0)\) for every unit by setting T = 1 and T = 0 respectively.
3. Compute the ATE as the average difference:
\[\widehat{\text{ATE}} = \frac{1}{n} \sum_i \left(\hat{Y}_i(1) - \hat{Y}_i(0)\right)\]
The standard error is estimated via non-parametric bootstrap (500 iterations with seed = 42) over the full three-step procedure.
- Parameters:
- Returns:
dict with keys
ate, se, ci_lower, ci_upper, n_obs, outcome_model.
- Raises:
ValueError – If required columns are missing or outcome_model is invalid.
- Return type:
References
- Robins, J. M. (1986). A new approach to causal inference in mortality
studies. Mathematical Modelling, 7, 1393-1512.
- Hernan, M. A., & Robins, J. M. (2020). Causal Inference: What If.
Chapman & Hall/CRC. (Chapter 13.)
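The three steps above can be sketched end to end with a linear outcome model (one of the documented options); this is an illustrative standalone version without the bootstrap SE:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def gcomp_ate(data, treatment, outcome, covariates):
    """G-computation: fit E[Y | T, X], predict under T=1 and T=0,
    and average the per-unit differences."""
    X = data[[treatment] + covariates]
    model = LinearRegression().fit(X, data[outcome])
    X1 = X.copy(); X1[treatment] = 1      # everyone treated
    X0 = X.copy(); X0[treatment] = 0      # everyone untreated
    return (model.predict(X1) - model.predict(X0)).mean()

# Simulated confounded data with a true effect of 1.5.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(int)   # treatment depends on x
y = 1.5 * t + 2.0 * x + rng.normal(scale=0.1, size=n)
df = pd.DataFrame({"t": t, "x": x, "y": y})
assert abs(gcomp_ate(df, "t", "y", ["x"]) - 1.5) < 0.1
```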
- morie.effects.sensitivity_rosenbaum(data, *, treatment, outcome, covariates, gamma_range=(1.0, 3.0), n_gamma=20)[source]¶
Rosenbaum bounds sensitivity analysis for hidden confounding.
Tests whether the sign-rank test conclusion is robust to an unmeasured confounder that could increase the odds of treatment assignment by a factor of Gamma (the sensitivity parameter).
For a range of Gamma values, the method computes:
- p_lower: p-value under the most favourable assignment (best case)
- p_upper: p-value under the most adverse assignment (worst case)
The analysis is based on the Wilcoxon signed-rank statistic applied to matched pairs. Here we approximate the matched analysis by using all discordant (T=1, T=0) pairs sorted by outcome.
- Parameters:
data (DataFrame) – DataFrame containing all required columns.
treatment (str) – Column name of the binary treatment indicator (0/1).
outcome (str) – Column name of the outcome variable.
covariates (list[str]) – Covariate column names (used for matching approximation).
gamma_range (tuple[float, float]) – Tuple (min_gamma, max_gamma). Default (1.0, 3.0).
n_gamma (int) – Number of Gamma values to evaluate. Default 20.
- Returns:
DataFrame with columns
Gamma, p_lower, p_upper.
- Raises:
ValueError – If required columns are missing or gamma_range is invalid.
- Return type:
Notes
This implementation uses the normal approximation to the signed-rank distribution under sensitivity bounds (Rosenbaum, 2002, Chapter 4). For small samples or exact analysis, use the R sensitivitymw or rbounds package.
References
Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer. (Chapter 4.)
Rosenbaum, P. R. (2007). Sensitivity analysis for m-estimates, tests, and confidence intervals in matched observational studies. Biometrics, 63(2), 456–464.
- morie.effects.e_value(ate, se, *, null=0.0)[source]¶
E-value for unmeasured confounding (VanderWeele & Ding, 2017).
The E-value is the minimum strength of association (on the risk-ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed effect, conditional on the measured covariates.
For a risk ratio (RR) effect estimate the E-value is:
\[E = \text{RR} + \sqrt{\text{RR} \cdot (\text{RR} - 1)}\]
where RR > 1. For RR < 1, compute E on 1/RR.
Since the ATE here is a difference (not a ratio), we first convert using a delta-method approximation to get a risk-ratio-like effect, using the relationship RR ≈ exp(|ATE - null| / se) (treating the z-score as a log-RR approximation). This is an approximation appropriate for continuous-scale effects reported with a standard error.
- Parameters:
- Returns:
E-value (float >= 1.0). Returns 1.0 if the estimate is at the null.
- Raises:
ValueError – If se <= 0.
- Return type:
Notes
For binary outcomes with a risk ratio or odds ratio estimate, convert the OR to the risk-ratio scale first, then apply the E-value formula directly. This function is designed for the continuous-ATE setting.
References
- VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational
research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.
- VanderWeele, T. J., Mathur, M. B., & Ding, P. (2019). Correcting
misinterpretations of the E-value. Annals of Internal Medicine, 170(2), 131-132.
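The documented approximation is short enough to sketch directly. An independent re-implementation of the formula above, not the morie source:

```python
import math

def e_value(ate, se, null=0.0):
    """E-value via the documented log-RR approximation
    RR = exp(|ATE - null| / se), then E = RR + sqrt(RR * (RR - 1))."""
    if se <= 0:
        raise ValueError("se must be positive")
    rr = math.exp(abs(ate - null) / se)
    if rr == 1.0:
        return 1.0                       # estimate exactly at the null
    return rr + math.sqrt(rr * (rr - 1.0))

assert e_value(0.0, 1.0) == 1.0          # null effect: E-value is 1
assert e_value(1.0, 1.0) > 1.0           # any non-null effect gives E > 1
```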
Matching methods for causal inference in observational studies.
Implements propensity score matching (nearest-neighbour, caliper, with/without replacement), exact matching, coarsened exact matching (CEM), Mahalanobis distance matching, optimal matching, genetic matching, entropy balancing, cardinality matching, subclassification, full matching, balance diagnostics, treatment effect estimation, Abadie-Imbens standard errors, Rosenbaum bounds, doubly-robust estimation, and multi-treatment/longitudinal matching.
References
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41
Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1–21. https://doi.org/10.1214/09-STS313
Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24. https://doi.org/10.1093/pan/mpr013
Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267. https://doi.org/10.1111/j.1468-0262.2006.00655.x
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1), 25–46. https://doi.org/10.1093/pan/mpr025
- class morie.matching.MatchResult(matched_data, n_treated, n_matched_control, match_pairs, method='nearest_neighbor', details=<factory>)[source]¶
Bases:
object
Container for matching output.
- Variables:
matched_data (pd.DataFrame) – Matched dataset with match indicators.
n_treated (int) – Number of treated units.
n_matched_control (int) – Number of matched control units.
match_pairs (pd.DataFrame) – Mapping from treated to matched control indices.
method (str) – Matching method used.
details (dict) – Additional diagnostics (e.g., caliper, replacement).
- Parameters:
- class morie.matching.BalanceResult(balance_table, overall_balance, max_smd, balanced)[source]¶
Bases:
object
Container for balance diagnostics.
- Variables:
- Parameters:
- class morie.matching.TreatmentEffectResult(estimand, estimate, std_error, ci_lower, ci_upper, p_value, n_obs, details=<factory>)[source]¶
Bases:
object
Container for treatment effect estimates from matched samples.
- Variables:
- Parameters:
- morie.matching.estimate_propensity_score(data, treatment, covariates, *, model='logistic', max_iter=1000)[source]¶
Estimate propensity scores via logistic regression.
- Parameters:
- Returns:
Propensity scores indexed like data.
- Return type:
pd.Series
References
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
- morie.matching.trim_propensity_scores(ps, lower=0.01, upper=0.99)[source]¶
Clip propensity scores to [lower, upper].
- morie.matching.common_support(data, treatment, ps_col='propensity_score', *, method='minmax')[source]¶
Restrict sample to the region of common support.
- morie.matching.match_nearest_neighbor(data, treatment, covariates, *, n_neighbors=1, caliper=None, replace=False, ps=None, alpha=0.05)[source]¶
Nearest-neighbour matching on propensity score.
For each treated unit, finds the n_neighbors closest control units by propensity score distance.
- Parameters:
data (pd.DataFrame)
treatment (str)
n_neighbors (int) – Number of matches per treated unit.
caliper (float, optional) – Maximum propensity score distance for a valid match (in SD of the logit of the propensity score).
replace (bool) – Whether controls can be matched to multiple treated units.
ps (pd.Series, optional) – Pre-computed propensity scores. If None, estimated internally.
alpha (float) – Significance level for balance checks.
- Return type:
- morie.matching.match_exact(data, treatment, exact_vars)[source]¶
Exact matching on discrete covariates.
Matches treated and control units that share identical values on all exact_vars. Unmatched units are dropped.
- Parameters:
- Return type:
- morie.matching.match_cem(data, treatment, covariates, *, n_bins=5)[source]¶
Coarsened Exact Matching (Iacus, King & Porro, 2012).
Coarsens continuous covariates into bins and performs exact matching on the coarsened values. Returns the original (uncoarsened) data with CEM strata weights.
- Parameters:
- Returns:
matched_data includes a _cem_weight column.
- Return type:
References
Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1), 1–24.
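An illustrative CEM sketch under assumed details (equal-width bins via pandas.cut, and the usual Iacus-King-Porro stratum weights for controls); not the morie source:

```python
import pandas as pd

def cem_weights(data, treatment, covariates, n_bins=5):
    """Coarsen covariates into bins, keep strata containing both arms,
    and re-weight controls to the treated distribution across strata."""
    strata = data[covariates].apply(lambda c: pd.cut(c, n_bins, labels=False))
    key = strata.astype(str).agg("-".join, axis=1)
    t = data[treatment]
    w = pd.Series(0.0, index=data.index)      # unmatched units keep weight 0
    n_t, n_c = int((t == 1).sum()), int((t == 0).sum())
    for _, idx in data.groupby(key).groups.items():
        ti = t.loc[idx]
        if ti.nunique() == 2:                 # stratum has both arms
            w.loc[idx[(ti == 1).values]] = 1.0
            share_t = (ti == 1).sum() / n_t   # treated share in this stratum
            share_c = (ti == 0).sum() / n_c   # control share in this stratum
            w.loc[idx[(ti == 0).values]] = share_t / share_c
    return w

# Two strata, each with one treated and one control: all weights are 1.
df = pd.DataFrame({"t": [1, 0, 1, 0], "x": [0.0, 0.1, 1.0, 1.1]})
assert (cem_weights(df, "t", ["x"], n_bins=2) == 1.0).all()
```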
- morie.matching.match_mahalanobis(data, treatment, covariates, *, n_neighbors=1, caliper=None, replace=False, exact=None)[source]¶
Mahalanobis distance matching.
Matches on the Mahalanobis distance of covariates rather than propensity score. Optionally combined with exact matching on discrete variables.
- morie.matching.match_optimal_pair(data, treatment, covariates, *, distance='propensity', ps=None)[source]¶
Optimal pair matching minimising total distance.
Uses a greedy approximation to the assignment problem (Hungarian algorithm is O(n^3); we use sorted nearest-neighbor for scalability).
- Parameters:
- Return type:
- morie.matching.match_full(data, treatment, covariates, *, ps=None, n_subclasses=10)[source]¶
Full matching via propensity score subclassification.
Every unit is placed into a subclass containing at least one treated and one control unit. Approximated here via quantile-based stratification of the propensity score.
- Parameters:
- Return type:
References
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467), 609–618.
- morie.matching.subclassify(data, treatment, covariates, *, ps=None, n_strata=5)[source]¶
Subclassification (stratification) on the propensity score.
- Parameters:
- Returns:
data_with_strata (pd.DataFrame) – Input data with a _stratum column appended.
stratum_effects (pd.DataFrame) – Within-stratum treatment effect estimates.
- Return type:
- morie.matching.entropy_balance(data, treatment, covariates, *, max_moment=1, max_iter=500, tol=1e-06)[source]¶
Entropy balancing weights for the control group.
Finds weights \(w_i\) for control units that minimise the Kullback-Leibler divergence from uniform weights subject to moment balance constraints:
\[\sum_{i: D_i=0} w_i\, x_i^k = \bar x^k_{D=1} \quad \text{for } k = 1, \ldots, \text{max\_moment}\]
- Parameters:
- Returns:
Entropy balancing weights indexed like data. Treated units receive weight 1; control units receive the balancing weights.
- Return type:
pd.Series
References
Hainmueller, J. (2012). Entropy balancing for causal effects. Political Analysis, 20(1), 25–46.
- morie.matching.match_genetic(data, treatment, covariates, *, n_neighbors=1, pop_size=50, n_generations=20, seed=42)[source]¶
Genetic matching (Diamond & Sekhon, 2013).
Uses a genetic algorithm to find the optimal weight matrix for Mahalanobis distance matching that maximises covariate balance.
- Parameters:
- Return type:
References
Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects. Review of Economics and Statistics, 95(3), 932–945.
- morie.matching.match_variable_ratio(data, treatment, covariates, *, min_ratio=1, max_ratio=5, caliper=0.2, ps=None)[source]¶
Variable-ratio matching on propensity score.
Each treated unit is matched to between min_ratio and max_ratio controls within the caliper.
- morie.matching.match_cardinality(data, treatment, covariates, *, balance_threshold=0.1, ps=None)[source]¶
Cardinality matching: maximise sample size subject to balance.
Finds the largest subset of matched pairs such that the standardised mean difference on each covariate is below balance_threshold.
Uses an iterative caliper-tightening heuristic.
- Parameters:
- Return type:
References
Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500), 1360–1371.
- morie.matching.balance_diagnostics(data, treatment, covariates, *, weights=None, threshold=0.1)[source]¶
Compute balance diagnostics for matched/weighted samples.
Reports standardised mean differences (SMD), variance ratios, and Kolmogorov-Smirnov statistics for each covariate.
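The SMD at the core of these diagnostics can be sketched standalone; the pooled-standard-deviation denominator is an assumption about the convention used here:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardised mean difference with a pooled-variance denominator:
    (mean_1 - mean_0) / sqrt((var_1 + var_0) / 2)."""
    x1 = np.asarray(x_treated, dtype=float)
    x0 = np.asarray(x_control, dtype=float)
    pooled = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return (x1.mean() - x0.mean()) / pooled

assert smd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0  # identical groups
assert smd([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]) == 1.0  # shift of one pooled SD
```

A covariate is conventionally called balanced when |SMD| falls below a threshold such as 0.1, matching the default threshold parameter above.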
- morie.matching.love_plot_data(unmatched_data, matched_data, treatment, covariates, *, weights_col=None)[source]¶
Generate data for a Love plot comparing pre- and post-matching balance.
- morie.matching.balance_table(data, treatment, covariates, *, weights=None)[source]¶
Construct a publication-ready balance table.
- morie.matching.estimate_att_matched(data, outcome, treatment, match_pairs, *, weights=None, alpha=0.05)[source]¶
Estimate ATT from matched data.
- Parameters:
- Return type:
- morie.matching.estimate_ate_matched(data, outcome, treatment, covariates, *, weights=None, alpha=0.05)[source]¶
Estimate ATE from matched/weighted data via weighted mean difference.
- morie.matching.estimate_atc_matched(data, outcome, treatment, match_pairs, *, alpha=0.05)[source]¶
Estimate ATC from matched data (matching controls to treated).
- Parameters:
- Return type:
- morie.matching.abadie_imbens_se(data, outcome, treatment, match_pairs, *, n_matches=1)[source]¶
Abadie-Imbens (2006) standard error for matching estimators.
Accounts for the fact that matching introduces correlation across matched observations. Uses the conditional variance estimator:
\[\hat V_{AI} = \frac{1}{n^2}\sum_i \bigl[\hat\sigma^2(X_i) + (K_M(i))^2\,\hat\sigma^2(X_i)\bigr]\]
where \(K_M(i)\) is the number of times unit i is used as a match.
- Parameters:
- Returns:
Abadie-Imbens SE.
- Return type:
References
Abadie, A., & Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1), 235–267.
- morie.matching.rosenbaum_bounds(data, outcome, treatment, match_pairs, *, gamma_range=None)[source]¶
Rosenbaum bounds for sensitivity to hidden bias.
For each value of \(\Gamma\) (the maximum odds ratio of differential treatment assignment due to an unobserved confounder), computes the upper and lower bounds on the p-value for the treatment effect.
- Parameters:
- Returns:
Columns:
gamma, p_lower, p_upper, significant_lower, significant_upper.
- Return type:
pd.DataFrame
References
Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.
- morie.matching.doubly_robust_matching(data, outcome, treatment, covariates, *, ps=None, n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Doubly-robust ATT estimation combining matching and regression.
Matches on propensity score, then uses bias-corrected regression adjustment within matched pairs. Consistent if either the propensity score model or the outcome model is correct.
- morie.matching.match_multi_treatment(data, treatment, covariates, *, reference_group=None, method='nearest_neighbor')[source]¶
Matching with multiple (> 2) treatment groups.
For each non-reference treatment level, matches treated units to the reference group using propensity scores from multinomial logistic regression.
- Parameters:
- Returns:
Keys are treatment levels; values are
MatchResult.
- Return type:
- morie.matching.match_longitudinal(data, treatment, covariates, unit, time, treatment_time, *, n_pre_periods=1, method='nearest_neighbor')[source]¶
Longitudinal matching for panel data.
Matches treated and control units based on pre-treatment covariate values, allowing for time-varying covariates.
- morie.matching.matching_quality(unmatched_data, matched_data, treatment, covariates, *, weights=None)[source]¶
Comprehensive matching quality assessment.
Compares balance before and after matching and computes summary metrics including percent bias reduction, number of balanced covariates, and overlap statistics.
- morie.matching.overlap_diagnostics(data, treatment, covariates, *, ps=None)[source]¶
Propensity score overlap diagnostics.
Instrumental Variables and Two-Stage Least Squares estimators.
Implements 2SLS, LIML, GMM, CUE-GMM, weak-instrument diagnostics (Cragg-Donald, Kleibergen-Paap, Stock-Yogo), Anderson-Rubin inference, Sargan/Hansen J tests, Hausman tests, JIVE, split-sample IV, IV probit, panel IV, control function, and comprehensive residual diagnostics.
References
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Stock, J. H., & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg (pp. 80–108). Cambridge University Press.
Kleibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental variables regression. Econometrica, 70(5), 1781–1803. https://doi.org/10.1111/1468-0262.00353
Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica, 26(3), 393–415.
- class morie.iv.IVResult(coefficients, std_errors, t_stats, p_values, ci_lower, ci_upper, variable_names, n_obs, method='2sls', details=<factory>)[source]¶
Bases:
object
Container for an IV/2SLS regression result.
- Variables:
coefficients (np.ndarray) – Estimated coefficient vector.
std_errors (np.ndarray) – Standard errors.
t_stats (np.ndarray) – t-statistics.
p_values (np.ndarray) – Two-sided p-values.
ci_lower (np.ndarray) – Lower 95 % CI bounds.
ci_upper (np.ndarray) – Upper 95 % CI bounds.
variable_names (list of str) – Names of variables corresponding to each coefficient.
n_obs (int) – Number of observations.
method (str) – Estimation method name.
details (dict) – Additional diagnostics.
- Parameters:
- class morie.iv.DiagnosticResult(statistic, p_value, test_name, critical_values=None, details=<factory>)[source]¶
Bases:
object
Container for a single diagnostic test.
- Parameters:
- morie.iv.tsls(data, outcome, endogenous, instruments, exogenous=None, *, cluster=None, robust=True, alpha=0.05)[source]¶
Two-Stage Least Squares (2SLS) estimator.
First stage: regress each endogenous variable on the instruments and exogenous variables. Second stage: regress the outcome on the fitted endogenous variables and exogenous variables.
\[\hat\beta_{2SLS} = (X'P_Z X)^{-1} X'P_Z y\]
where \(P_Z = Z(Z'Z)^{-1}Z'\) and \(Z\) includes both excluded instruments and included exogenous variables.
- Parameters:
data (pd.DataFrame)
outcome (str) – Dependent variable.
exogenous (list of str, optional) – Included exogenous regressors (added automatically to Z).
cluster (str, optional) – Cluster variable for cluster-robust SE.
robust (bool) – Use heteroskedasticity-robust SE (default True).
alpha (float) – Significance level.
- Return type:
References
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
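The closed-form expression above is straightforward to reproduce with numpy. The sketch below (hypothetical helper name, simulated data) shows the projection formula recovering the structural coefficient when OLS would be biased by a confounder.

```python
import numpy as np

def tsls_beta(y, X, Z):
    """2SLS: beta = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'.
    Computes P_Z X via a linear solve instead of forming the n x n projection."""
    PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(PzX.T @ X, PzX.T @ y)

# Just-identified example: u confounds d and y, z is a valid instrument.
rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)
z = rng.normal(size=n)
d = z + u + 0.5 * rng.normal(size=n)
y = 2.0 * d + u + 0.5 * rng.normal(size=n)
X = np.column_stack([np.ones(n), d])   # constant + endogenous regressor
Z = np.column_stack([np.ones(n), z])   # constant + instrument
beta = tsls_beta(y, X, Z)              # beta[1] is close to the true effect 2.0
```

`morie.iv.tsls` additionally computes robust or cluster-robust standard errors and returns an `IVResult`.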
- morie.iv.liml(data, outcome, endogenous, instruments, exogenous=None, *, robust=True, alpha=0.05)[source]¶
Limited Information Maximum Likelihood (LIML) estimator.
More robust to weak instruments than 2SLS. The LIML estimator is the k-class estimator with \(\kappa\) equal to the smallest eigenvalue of a certain matrix ratio.
\[\hat\beta_{\text{LIML}} = \bigl(X'(I - \kappa M_Z) X\bigr)^{-1} X'(I - \kappa M_Z) y\]
- Parameters:
- Return type:
References
Anderson, T. W., & Rubin, H. (1949). Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics, 20(1), 46–63.
- morie.iv.gmm_iv(data, outcome, endogenous, instruments, exogenous=None, *, weight_matrix='optimal', robust=True, alpha=0.05)[source]¶
Generalised Method of Moments (GMM) IV estimator.
Two-step efficient GMM: first estimates with the identity weight matrix, then re-estimates with the optimal weight matrix constructed from first-step residuals.
\[\hat\beta_{\text{GMM}} = \arg\min_\beta \bigl[Z'(y - X\beta)\bigr]' \hat W \bigl[Z'(y - X\beta)\bigr]\]
- morie.iv.cue_gmm(data, outcome, endogenous, instruments, exogenous=None, *, max_iter=100, tol=1e-08, alpha=0.05)[source]¶
Continuously Updated GMM estimator.
Unlike two-step GMM, CUE simultaneously optimises over \(\beta\) and the weight matrix, yielding an estimator that is invariant to the normalisation of moment conditions.
- morie.iv.wald_estimator(data, outcome, treatment, instrument, *, alpha=0.05)[source]¶
Wald (simple IV) estimator for a single binary instrument.
\[\hat\beta_{\text{Wald}} = \frac{E[Y | Z=1] - E[Y | Z=0]}{E[D | Z=1] - E[D | Z=0]}\]
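The ratio of mean differences above can be computed directly; the standalone sketch below (hypothetical helper, toy arrays) mirrors the estimand without the CI machinery of the library function.

```python
import numpy as np

def wald_estimate(y, d, z):
    """Wald IV estimator for a binary instrument z:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    y, d, z = map(np.asarray, (y, d, z))
    num = y[z == 1].mean() - y[z == 0].mean()   # reduced-form effect
    den = d[z == 1].mean() - d[z == 0].mean()   # first-stage effect
    return num / den

# toy data: reduced form = 3, first stage = 0.5, so Wald = 6
tau = wald_estimate(y=[1, 3, 4, 6], d=[0, 1, 1, 1], z=[0, 0, 1, 1])
```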
- morie.iv.first_stage_diagnostics(data, endogenous, instruments, exogenous=None)[source]¶
First-stage regression diagnostics for each endogenous variable.
Reports the F-statistic, partial R-squared, and Shea’s partial R-squared for each endogenous variable.
- Parameters:
- Returns:
Columns:
endogenous_var, f_stat, partial_r2, shea_partial_r2, n_obs.
- Return type:
pd.DataFrame
References
Shea, J. (1997). Instrument relevance in multivariate linear models: A simple measure. Review of Economics and Statistics, 79(2), 348–352.
- morie.iv.cragg_donald_test(data, endogenous, instruments, exogenous=None)[source]¶
Cragg-Donald (1993) minimum eigenvalue statistic for weak instruments.
- morie.iv.stock_yogo_critical_values(n_endogenous=1, n_instruments=1)[source]¶
Stock-Yogo (2005) critical values for weak instrument testing.
Returns approximate critical values for the Cragg-Donald statistic at various maximal bias/size thresholds.
- Parameters:
- Returns:
Critical values keyed by bias threshold (e.g.,
"10%_bias").
- Return type:
References
Stock, J. H., & Yogo, M. (2005). Testing for weak instruments in linear IV regression.
- morie.iv.kleibergen_paap_test(data, endogenous, instruments, exogenous=None)[source]¶
Kleibergen-Paap (2006) rk statistic for weak instruments.
Robust to heteroskedasticity and clustering, unlike Cragg-Donald.
- Parameters:
- Return type:
References
Kleibergen, F., & Paap, R. (2006). Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics, 133(1), 97–126.
- morie.iv.anderson_rubin_test(data, outcome, endogenous, instruments, exogenous=None, *, beta0=None, alpha=0.05)[source]¶
Anderson-Rubin (1949) test: robust to weak instruments.
Tests \(H_0: \beta = \beta_0\) using the statistic:
\[AR = \frac{(y - X\beta_0)'P_Z(y - X\beta_0) / q} {(y - X\beta_0)'M_Z(y - X\beta_0) / (n - k)}\]
which has an \(F(q, n-k)\) distribution under the null regardless of instrument strength.
- Parameters:
- Returns:
details contains the confidence set if computed by inversion.
- Return type:
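The AR statistic is a ratio of projected to residual sums of squares under the hypothesised coefficient. A minimal sketch (hypothetical helper; here \(q = k\) = number of columns of \(Z\)):

```python
import numpy as np
from scipy import stats

def anderson_rubin(y, X, Z, beta0):
    """AR = [e' P_Z e / q] / [e' M_Z e / (n - q)] with e = y - X beta0;
    distributed F(q, n - q) under H0 regardless of instrument strength."""
    e = y - X @ beta0
    Pe = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)   # P_Z e
    num = e @ Pe                                 # e' P_Z e
    den = e @ e - num                            # e' M_Z e
    n, q = Z.shape
    ar = (num / q) / (den / (n - q))
    return ar, stats.f.sf(ar, q, n - q)

# under the true beta the statistic is an ordinary F(1, n-1) draw
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
ar, p = anderson_rubin(y, x[:, None], z[:, None], np.array([1.5]))
```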
- morie.iv.anderson_rubin_ci(data, outcome, endogenous, instruments, exogenous=None, *, grid_min=-10.0, grid_max=10.0, grid_n=200, alpha=0.05)[source]¶
Anderson-Rubin confidence interval by test inversion.
- morie.iv.conditional_lr_test(data, outcome, endogenous, instruments, exogenous=None, *, beta0=0.0)[source]¶
Conditional likelihood ratio test (Moreira, 2003).
More powerful than Anderson-Rubin when instruments are weak but not completely irrelevant.
- Parameters:
- Return type:
References
Moreira, M. J. (2003). A conditional likelihood ratio test for structural models. Econometrica, 71(4), 1027–1048.
- morie.iv.sargan_test(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Sargan (1958) / Hansen (1982) J-test for overidentifying restrictions.
Tests whether the excluded instruments are uncorrelated with the structural error. Requires more instruments than endogenous variables (overidentification).
\[J = n \cdot \bar g' \hat S^{-1} \bar g \sim \chi^2(L - K)\]
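Equivalently, the (homoskedastic) Sargan statistic is \(n\) times the uncentred \(R^2\) from projecting the 2SLS residuals on the instrument matrix. A standalone sketch with a hypothetical helper name:

```python
import numpy as np
from scipy import stats

def sargan(y, X, Z):
    """Sargan J = n * (e' P_Z e) / (e' e) for 2SLS residuals e;
    chi-square with L - K df (L = cols of Z, K = cols of X)."""
    PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta = np.linalg.solve(PzX.T @ X, PzX.T @ y)   # 2SLS coefficients
    e = y - X @ beta
    Pe = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)
    n, L = Z.shape
    K = X.shape[1]
    J = n * (e @ Pe) / (e @ e)
    return J, stats.chi2.sf(J, L - K)

# Overidentified example: two valid instruments for one endogenous regressor.
rng = np.random.default_rng(2)
n = 2000
u = rng.normal(size=n)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
d = z1 + 0.5 * z2 + u + rng.normal(size=n)
y = d + u + rng.normal(size=n)
X = np.column_stack([np.ones(n), d])
Z = np.column_stack([np.ones(n), z1, z2])
J, p = sargan(y, X, Z)   # with valid instruments, J ~ chi2(1) under H0
```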
- morie.iv.hansen_j_test(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Hansen J-test (heteroskedasticity-robust overidentification test).
Uses the efficient GMM objective function as the test statistic.
- morie.iv.hausman_test(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Hausman (1978) specification test: OLS vs 2SLS.
Tests whether the endogenous variables are in fact exogenous. Under the null, OLS is consistent and efficient; under the alternative, only 2SLS is consistent.
\[H = (\hat\beta_{2SLS} - \hat\beta_{OLS})' (V_{2SLS} - V_{OLS})^{-1} (\hat\beta_{2SLS} - \hat\beta_{OLS})\]
- Parameters:
- Return type:
References
Hausman, J. A. (1978). Specification tests in econometrics. Econometrica, 46(6), 1251–1271.
- morie.iv.durbin_wu_hausman(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Durbin-Wu-Hausman regression-based endogeneity test.
Augments the structural equation with first-stage residuals and tests their joint significance.
- morie.iv.jive(data, outcome, endogenous, instruments, exogenous=None, *, alpha=0.05)[source]¶
Jackknife IV estimator (JIVE).
Uses leave-one-out predicted values from the first stage to reduce finite-sample bias from many or weak instruments.
\[\hat D_{i,(-i)} = Z_i' \hat\Pi_{(-i)}\]
- Parameters:
- Return type:
References
Angrist, J. D., Imbens, G. W., & Krueger, A. B. (1999). Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14(1), 57–67.
- morie.iv.split_sample_iv(data, outcome, endogenous, instruments, exogenous=None, *, split_fraction=0.5, seed=42, alpha=0.05)[source]¶
Split-sample IV estimator (Angrist & Krueger, 1995).
Randomly splits the sample into two halves, estimating the first stage on one half and the second stage on the other, which reduces overfitting bias from many instruments.
- morie.iv.control_function(data, outcome, endogenous, instruments, exogenous=None, *, robust=True, alpha=0.05)[source]¶
Control function (CF) approach to endogeneity.
Two-step: (1) regress endogenous on instruments and exogenous to get residuals; (2) include residuals in the structural equation as additional regressors. Numerically equivalent to 2SLS for linear models but extends naturally to nonlinear models.
- morie.iv.iv_probit(data, outcome, endogenous, instruments, exogenous=None, *, alpha=0.05)[source]¶
IV Probit via control function (Rivers & Vuong, 1988).
For binary outcomes. Uses the control-function approach: first-stage OLS residuals are included in a probit second stage.
- Parameters:
- Return type:
References
Rivers, D., & Vuong, Q. H. (1988). Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics, 39(3), 347–366.
- morie.iv.panel_iv(data, outcome, endogenous, instruments, unit, *, exogenous=None, time_fe=None, alpha=0.05)[source]¶
Panel IV with unit (and optionally time) fixed effects.
Demeans all variables by unit (and time if specified), then applies 2SLS to the demeaned data.
- morie.iv.iv_diagnostics(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Comprehensive IV regression diagnostic suite.
Runs first-stage diagnostics, weak instrument tests, overidentification tests, and endogeneity tests in one call.
- morie.iv.iv_residual_analysis(data, outcome, endogenous, instruments, exogenous=None)[source]¶
Residual analysis for IV regression.
Returns a DataFrame with fitted values, residuals, leverage, and standardised residuals.
Regression Discontinuity Design estimators for causal inference.
Implements sharp and fuzzy RDD, optimal bandwidth selection (IK, CCT, rule-of-thumb), local polynomial regression, kernel functions, manipulation tests, covariate balance, placebo tests, donut-hole RDD, geographic/multi-dimensional RDD, RD plots, kink RD, sensitivity analysis, and power calculations.
References
Imbens, G. W., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79(3), 933–959. https://doi.org/10.1093/restud/rdr043
Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295–2326. https://doi.org/10.3982/ECTA11757
McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698–714. https://doi.org/10.1016/j.jeconom.2007.05.005
Cattaneo, M. D., Jansson, M., & Ma, X. (2020). Simple local polynomial density estimators. Journal of the American Statistical Association, 115(531), 1449–1455. https://doi.org/10.1080/01621459.2019.1635480
- morie.rdd.kernel_epanechnikov(u)[source]¶
Epanechnikov kernel: \(K(u) = \frac{3}{4}(1 - u^2)\) for \(|u| \le 1\).
- morie.rdd.kernel_uniform(u)[source]¶
Uniform (rectangular) kernel: \(K(u) = 0.5\) for \(|u| \le 1\).
- morie.rdd.kernel_gaussian(u)[source]¶
Gaussian kernel: \(K(u) = \phi(u)\) (standard normal density).
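The three kernels above are simple weight functions of the scaled distance \(u = (x - c)/h\); each integrates to 1 over its support. An illustrative sketch (function names mirror the module's, but these are standalone numpy versions):

```python
import numpy as np

def kernel_epanechnikov(u):
    """K(u) = 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kernel_uniform(u):
    """K(u) = 0.5 on |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def kernel_gaussian(u):
    """Standard normal density phi(u)."""
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
```

In the RDD estimators these weights downweight observations far from the cutoff within the bandwidth.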
- class morie.rdd.RDDResult(estimate, std_error, t_stat, p_value, ci_lower, ci_upper, bandwidth, n_left, n_right, method='sharp_rdd', details=<factory>)[source]¶
Bases:
object
Container for an RDD treatment effect estimate.
- Variables:
estimate (float) – Point estimate of the treatment effect at the cutoff.
std_error (float) – Standard error.
t_stat (float) – t-statistic.
p_value (float) – Two-sided p-value.
ci_lower (float) – Lower bound of confidence interval.
ci_upper (float) – Upper bound of confidence interval.
bandwidth (float) – Bandwidth used for estimation.
n_left (int) – Observations to the left of the cutoff within the bandwidth.
n_right (int) – Observations to the right of the cutoff within the bandwidth.
method (str) – Estimation method name.
details (dict) – Additional diagnostic information.
- Parameters:
- class morie.rdd.BandwidthResult(h_opt, method, details=<factory>)[source]¶
Bases:
object
Container for bandwidth selection output.
- Variables:
- Parameters:
- class morie.rdd.DensityTestResult(statistic, p_value, method, details=<factory>)[source]¶
Bases:
object
Container for manipulation/density test output.
- morie.rdd.local_polynomial_regression(x, y, eval_points, h, p=1, kernel='triangular')[source]¶
Evaluate a local polynomial regression at multiple points.
- Parameters:
- Returns:
Columns:
x, fitted, se.
- Return type:
pd.DataFrame
- morie.rdd.bandwidth_ik(x, y, cutoff=0.0, kernel='triangular')[source]¶
Imbens-Kalyanaraman (2012) optimal bandwidth selector.
- Parameters:
- Return type:
References
Imbens, G. W., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. Review of Economic Studies, 79(3), 933–959.
- morie.rdd.bandwidth_rot(x, y, cutoff=0.0)[source]¶
Rule-of-thumb bandwidth selector.
Uses Silverman’s rule adapted for the RDD context.
- Parameters:
x (np.ndarray)
y (np.ndarray)
cutoff (float)
- Return type:
- morie.rdd.bandwidth_cct(x, y, cutoff=0.0, kernel='triangular', p=1)[source]¶
Calonico-Cattaneo-Titiunik (2014) MSE-optimal bandwidth.
Also computes the CER-optimal bandwidth.
- Parameters:
- Returns:
details includes h_mse and h_cer.
- Return type:
References
Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295–2326.
- morie.rdd.sharp_rdd(data, outcome, running, cutoff=0.0, *, bandwidth=None, p=1, kernel='triangular', cluster=None, covariates=None, alpha=0.05)[source]¶
Sharp regression discontinuity design estimator.
Estimates the local average treatment effect at the cutoff using local polynomial regression of order p separately on each side of the cutoff:
\[\hat\tau = \hat\mu_+(c) - \hat\mu_-(c)\]
If no bandwidth is provided, the CCT MSE-optimal bandwidth is used.
- Parameters:
data (pd.DataFrame)
outcome (str) – Outcome column.
running (str) – Running variable column.
cutoff (float) – RD cutoff value.
bandwidth (float, optional) – Estimation bandwidth. If
None, computed via bandwidth_cct().
p (int) – Local polynomial order (default 1 = local linear).
kernel (str) – Kernel function name.
cluster (str, optional) – Column for cluster-robust variance.
covariates (list of str, optional) – Additional covariates (included additively).
alpha (float) – Significance level.
- Return type:
References
Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201–209.
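The estimand above is the gap between two boundary intercepts. A standalone local-linear sketch with a triangular kernel (hypothetical helper; the library also selects bandwidths and computes robust inference):

```python
import numpy as np

def sharp_rdd_tau(x, y, cutoff=0.0, h=1.0):
    """tau = mu_+(c) - mu_-(c): triangular-kernel weighted local linear fit
    on each side of the cutoff, evaluated at the cutoff (the intercept)."""
    def boundary_fit(mask):
        xs = x[mask] - cutoff
        w = 1.0 - np.abs(xs) / h               # triangular weights on the mask
        Xd = np.column_stack([np.ones(xs.size), xs])
        A = Xd.T @ (Xd * w[:, None])           # X' W X
        b = Xd.T @ (w * y[mask])               # X' W y
        return np.linalg.solve(A, b)[0]        # intercept = fitted value at c
    inside = np.abs(x - cutoff) <= h
    return boundary_fit(inside & (x >= cutoff)) - boundary_fit(inside & (x < cutoff))

# simulated design with a jump of 2.0 at the cutoff
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=4000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + 0.1 * rng.normal(size=4000)
tau = sharp_rdd_tau(x, y, cutoff=0.0, h=1.0)   # close to 2.0
```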
- morie.rdd.fuzzy_rdd(data, outcome, running, treatment, cutoff=0.0, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]¶
Fuzzy regression discontinuity design estimator.
Uses the cutoff as an instrument for treatment takeup. The fuzzy RD estimand is:
\[\hat\tau_{\text{FRD}} = \frac{\hat\mu_{Y+}(c) - \hat\mu_{Y-}(c)} {\hat\mu_{D+}(c) - \hat\mu_{D-}(c)}\]
- Parameters:
- Returns:
detailsincludes the first-stage jump.- Return type:
- morie.rdd.rdd_bias_corrected(data, outcome, running, cutoff=0.0, *, bandwidth=None, rho=1.0, p=1, kernel='triangular', alpha=0.05)[source]¶
Bias-corrected RDD with robust confidence intervals (CCT, 2014).
Uses a higher-order local polynomial to estimate and remove the leading bias term, then constructs robust CIs.
- Parameters:
- Returns:
The estimate is bias-corrected; ci_lower / ci_upper are the robust confidence intervals.
- Return type:
- morie.rdd.mccrary_test(x, cutoff=0.0, *, n_bins=50, bandwidth=None)[source]¶
McCrary (2008) density test for manipulation of the running variable.
Constructs a histogram of the running variable and tests for a discontinuity in the density at the cutoff using local linear regression on the bin counts.
- Parameters:
- Returns:
statistic is the log-difference in estimated densities.
- Return type:
References
McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698–714.
- morie.rdd.cattaneo_density_test(x, cutoff=0.0, *, p=2, kernel='triangular', bandwidth=None)[source]¶
Cattaneo, Jansson, & Ma (2020) density test for manipulation.
Uses local polynomial density estimation on each side of the cutoff and tests for a discontinuity.
- Parameters:
- Return type:
References
Cattaneo, M. D., Jansson, M., & Ma, X. (2020). Simple local polynomial density estimators. JASA, 115(531), 1449–1455.
- morie.rdd.covariate_balance_rdd(data, running, covariates, cutoff=0.0, *, bandwidth=None, kernel='triangular', alpha=0.05)[source]¶
Test covariate balance at the RD cutoff.
For each covariate, estimates a sharp RDD and reports whether there is a statistically significant jump at the cutoff.
- morie.rdd.placebo_cutoff_test(data, outcome, running, true_cutoff, placebo_cutoffs, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]¶
Test for RDD effects at placebo (false) cutoff values.
Estimates a sharp RDD at each placebo cutoff using only data on one side of the true cutoff. A well-designed RDD should show no significant effects at placebo cutoffs.
- morie.rdd.donut_rdd(data, outcome, running, cutoff=0.0, donut=0.0, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]¶
Donut-hole RDD: exclude observations within donut of the cutoff.
Addresses concerns about manipulation or heaping at the cutoff by removing observations in \([c - \text{donut}, c + \text{donut}]\).
- morie.rdd.rdd_discrete(data, outcome, running, cutoff=0.0, *, bandwidth=None, p=0, alpha=0.05)[source]¶
RDD estimator for discrete running variables.
Uses local constant (Nadaraya-Watson) or local linear regression with a uniform kernel. Standard errors are clustered at the level of the running variable mass points (Lee & Card, 2008).
- Parameters:
- Return type:
References
Lee, D. S., & Card, D. (2008). Regression discontinuity inference with specification error. Journal of Econometrics, 142(2), 655–674.
- morie.rdd.rd_plot_data(data, outcome, running, cutoff=0.0, *, n_bins=20, p_global=4, p_local=1, bandwidth=None, kernel='triangular')[source]¶
Generate data for standard RD plots.
Returns three datasets:
- binned: binned scatter plot (evenly-spaced or quantile bins)
- global_poly: global polynomial fit
- local_poly: local polynomial fit within bandwidth
- Parameters:
- Returns:
Keys:
"binned", "global_poly", "local_poly".
- Return type:
- morie.rdd.bandwidth_sensitivity(data, outcome, running, cutoff=0.0, *, bandwidth_range=None, p=1, kernel='triangular', alpha=0.05)[source]¶
Sensitivity of the RDD estimate to bandwidth choice.
- Parameters:
- Returns:
Columns:
bandwidth, estimate, std_error, ci_lower, ci_upper, p_value, n_eff.
- Return type:
pd.DataFrame
- morie.rdd.kink_rdd(data, outcome, running, cutoff=0.0, *, bandwidth=None, kernel='triangular', alpha=0.05)[source]¶
Regression Kink Design (RKD) estimator.
Tests for a change in the slope (rather than the level) of the conditional expectation function at the cutoff.
\[\hat\tau_{\text{RKD}} = \frac{\hat\mu'_+(c) - \hat\mu'_-(c)} {\text{kink in policy variable}}\]
- Parameters:
- Returns:
estimate is the change in slope at the cutoff.
- Return type:
References
Card, D., Lee, D. S., Pei, Z., & Weber, A. (2015). Inference on causal effects in a generalized regression kink design. Econometrica, 83(6), 2453–2483.
- morie.rdd.rdd_local_randomisation(data, outcome, running, cutoff=0.0, window=1.0, *, n_permutations=1000, seed=42, alpha=0.05)[source]¶
RDD via local randomisation (Cattaneo, Frandsen & Titiunik, 2015).
Within a narrow window around the cutoff, assumes treatment is as-if randomly assigned and uses a permutation test.
- Parameters:
- Returns:
p_value is the permutation-based p-value.
- Return type:
References
Cattaneo, M. D., Frandsen, B. R., & Titiunik, R. (2015). Randomization inference in the regression discontinuity design. Journal of Causal Inference, 3(1), 1–24.
- morie.rdd.geographic_rdd(data, outcome, distance_to_boundary, side, *, bandwidth=None, p=1, kernel='triangular', alpha=0.05)[source]¶
Geographic (multi-dimensional) RDD using distance to a boundary.
Reduces a spatial RDD to a univariate RDD by using the signed distance from each unit to the treatment boundary as the running variable.
- Parameters:
- Return type:
References
Keele, L. J., & Titiunik, R. (2015). Geographic boundaries as regression discontinuities. Political Analysis, 23(1), 127–155.
- morie.rdd.rdd_power(n, tau, sigma, cutoff_density=1.0, bandwidth=None, kernel='triangular', alpha=0.05)[source]¶
Power calculation for a sharp RDD.
Computes approximate power based on the asymptotic distribution of the local linear estimator:
\[\text{Power} = \Phi\left(\frac{|\tau|}{\hat\sigma_\tau} - z_{1-\alpha/2}\right)\]
- Parameters:
n (int) – Total sample size.
tau (float) – Hypothesised treatment effect.
sigma (float) – Outcome standard deviation.
cutoff_density (float) – Density of the running variable at the cutoff.
bandwidth (float, optional) – Bandwidth (if
None, uses a default fraction of the range).
kernel (str) – Kernel name (affects efficiency constants).
alpha (float) – Significance level.
- Returns:
power, n_effective, mde (minimum detectable effect).
- Return type:
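Given the treatment effect and its standard error, the power formula above is a one-liner; a hedged sketch (hypothetical helper — the library function derives \(\hat\sigma_\tau\) from n, sigma, the bandwidth, and kernel constants):

```python
from scipy import stats

def rdd_power_approx(tau, se_tau, alpha=0.05):
    """Approximate two-sided power: Phi(|tau| / se_tau - z_{1 - alpha/2})."""
    z = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.cdf(abs(tau) / se_tau - z)
```

When \(|\tau|/\hat\sigma_\tau\) equals the critical value \(z_{1-\alpha/2}\), power is exactly 0.5.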
- morie.rdd.rdd_sample_size(tau, sigma, cutoff_density=1.0, bandwidth=1.0, power=0.8, kernel='triangular', alpha=0.05)[source]¶
Compute the required sample size for a sharp RDD.
Difference-in-Differences estimators for causal inference.
Implements classic 2x2 DiD, generalized DiD with multiple time periods (Callaway–Sant’Anna), staggered adoption, event studies, Bacon decomposition, doubly-robust DiD, triple differences, synthetic DiD, and comprehensive pre-trend and placebo testing infrastructure.
References
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200–230. https://doi.org/10.1016/j.jeconom.2020.12.001
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277. https://doi.org/10.1016/j.jeconom.2021.03.014
Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122. https://doi.org/10.1016/j.jeconom.2020.06.003
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088–4118. https://doi.org/10.1257/aer.20190159
- class morie.did.DiDResult(estimate, std_error, t_stat, p_value, ci_lower, ci_upper, n_treated, n_control, method='did_2x2', details=<factory>)[source]¶
Bases:
object
Container for a Difference-in-Differences estimate.
- Variables:
estimate (float) – Point estimate of the treatment effect (ATT).
std_error (float) – Standard error of the estimate.
t_stat (float) – t-statistic (estimate / std_error).
p_value (float) – Two-sided p-value.
ci_lower (float) – Lower bound of the 95 % confidence interval.
ci_upper (float) – Upper bound of the 95 % confidence interval.
n_treated (int) – Number of treated observations.
n_control (int) – Number of control observations.
method (str) – Name of the estimation method used.
details (dict) – Additional diagnostic information.
- Parameters:
- class morie.did.EventStudyResult(coefficients, reference_period, pre_trend_f_stat, pre_trend_p_value, details=<factory>)[source]¶
Bases:
object
Container for event-study coefficient estimates.
- Variables:
coefficients (pd.DataFrame) – DataFrame with columns
relative_time, estimate, std_error, ci_lower, ci_upper, p_value.
reference_period (int) – The omitted (normalised-to-zero) relative time period.
pre_trend_f_stat (float) – Joint F-statistic for pre-treatment coefficients.
pre_trend_p_value (float) – p-value for the joint pre-trend test.
- Parameters:
- class morie.did.BaconDecomposition(components, overall_estimate)[source]¶
Bases:
object
Container for Goodman-Bacon decomposition.
- Variables:
components (pd.DataFrame) – DataFrame with columns
group1, group2, estimate, weight, type (timing, always-treated, never-treated).
overall_estimate (float) – Weighted-average TWFE DiD estimate.
- Parameters:
- morie.did.did_2x2(data, outcome, treatment, post, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Estimate a classic 2x2 Difference-in-Differences treatment effect.
The canonical DiD estimator for settings with two groups (treated vs control) and two periods (pre vs post). With no covariates the estimator is:
\[\hat\tau_{\text{DiD}} = \bigl(\bar Y_{1,\text{post}} - \bar Y_{1,\text{pre}}\bigr) - \bigl(\bar Y_{0,\text{post}} - \bar Y_{0,\text{pre}}\bigr)\]
When covariates are provided the regression specification is:
\[Y_{it} = \alpha + \beta\,D_i + \gamma\,\text{Post}_t + \tau\,(D_i \times \text{Post}_t) + X_{it}'\delta + \varepsilon_{it}\]
and \(\hat\tau\) is the coefficient on the interaction term.
- Parameters:
data (pd.DataFrame) – Panel or repeated cross-section data.
outcome (str) – Name of the outcome column.
treatment (str) – Binary (0/1) column indicating the treatment group.
post (str) – Binary (0/1) column indicating the post-treatment period.
covariates (list of str, optional) – Additional covariates for the regression specification.
cluster (str, optional) – Column for cluster-robust standard errors.
alpha (float) – Significance level for confidence intervals (default 0.05).
- Returns:
Estimation results including point estimate, SE, CI, and diagnostics.
- Return type:
References
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
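Without covariates, the estimator is just the four-means difference given above. A standalone pandas sketch (hypothetical helper name; the library version adds regression adjustment, clustering, and CIs):

```python
import pandas as pd

def did_2x2_means(df, outcome, treatment, post):
    """Four-means DiD: (Y1,post - Y1,pre) - (Y0,post - Y0,pre)."""
    m = df.groupby([treatment, post])[outcome].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

# toy data: control rises 1 -> 2, treated rises 3 -> 6, so DiD = 3 - 1 = 2
toy = pd.DataFrame({
    "y":    [1, 2, 3, 6],
    "d":    [0, 0, 1, 1],
    "post": [0, 1, 0, 1],
})
att = did_2x2_means(toy, "y", "d", "post")   # 2.0
```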
- morie.did.did_repeated_cross_section(data, outcome, treatment, post, *, covariates=None, weights=None, cluster=None, alpha=0.05)[source]¶
DiD estimator designed for repeated cross-section data.
Identical to
did_2x2() but optionally incorporates survey weights. When weights is provided, weighted least squares is used for estimation.
- Parameters:
data (pd.DataFrame) – Repeated cross-section data.
outcome (str) – Outcome column.
treatment (str) – Binary (0/1) treatment-group indicator column.
post (str) – Binary (0/1) post-period indicator column.
weights (str, optional) – Column containing survey / sampling weights.
cluster (str, optional) – Column for cluster-robust SE.
alpha (float) – Significance level.
- Return type:
- morie.did.did_panel_fe(data, outcome, treatment, unit, time, *, covariates=None, cluster=None, alpha=0.05)[source]¶
DiD with unit and time fixed effects (two-way fixed effects, TWFE).
Demeans the outcome by unit and time means (within transformation) and regresses on the treatment indicator.
- Parameters:
data (pd.DataFrame) – Balanced or unbalanced panel data.
outcome (str) – Outcome column.
treatment (str) – Binary column (0/1) indicating treatment status in each period.
unit (str) – Unit identifier column.
time (str) – Time period column.
covariates (list of str, optional) – Time-varying covariates.
cluster (str, optional) – Cluster variable for standard errors (defaults to unit).
alpha (float) – Significance level.
- Return type:
- morie.did.event_study(data, outcome, unit, time, treatment_time, *, covariates=None, reference_period=-1, leads=4, lags=4, cluster=None, alpha=0.05)[source]¶
Estimate an event-study specification around treatment onset.
Constructs relative-time dummies \(\{1[t - g = k]\}\) for \(k \in [-\text{leads}, \text{lags}]\) and regresses the outcome on these indicators with unit and time fixed effects. The reference_period dummy is omitted (normalised to zero).
- Parameters:
data (pd.DataFrame) – Panel data with a treatment-onset column.
outcome (str) – Outcome column.
unit (str) – Unit identifier column.
time (str) – Calendar-time column (integer-valued).
treatment_time (str) – Column giving the period in which each unit first received treatment (
np.inf or NaN for never-treated units).
covariates (list of str, optional) – Time-varying covariates.
reference_period (int) – Relative-time period omitted as baseline (default -1).
leads (int) – Number of pre-treatment periods to include.
lags (int) – Number of post-treatment periods to include.
cluster (str, optional) – Cluster variable (defaults to unit).
alpha (float) – Significance level.
- Return type:
- morie.did.test_parallel_trends(data, outcome, treatment, time, *, unit=None, cluster=None, pre_periods=None)[source]¶
Test the parallel trends assumption using pre-treatment data.
Runs two tests:
Individual coefficient test: regresses the outcome on group-by-time interactions in the pre-period and reports each coefficient.
Joint F-test: tests that all interaction coefficients are jointly zero.
- Parameters:
data (pd.DataFrame) – Panel or repeated cross-section data.
outcome (str) – Outcome column name.
treatment (str) – Treatment group indicator.
time (str) – Time column (integer-valued).
unit (str, optional) – Unit identifier for panel demeaning.
cluster (str, optional) – Cluster variable for robust SE.
pre_periods (list, optional) – Explicit list of pre-treatment time values to test.
- Returns:
coefficients (pd.DataFrame), joint_f_stat, joint_p_value, parallel_trends_plausible (bool, True when p > 0.05).
- Return type:
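The joint pre-trend F-test can be sketched in plain statsmodels. A simulated example under the assumption that pre-period data with columns y, t, d are already prepared (this is not morie's implementation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
pre = pd.DataFrame({"t": np.repeat([0, 1, 2], 200), "d": np.tile([0, 1], 300)})
pre["y"] = pre["t"] + 0.5 * pre["d"] + rng.normal(size=len(pre))  # parallel trends hold

m = smf.ols("y ~ C(t) * d", data=pre).fit()
inter = [p for p in m.params.index if ":" in p]    # group-by-time interaction terms
R = np.zeros((len(inter), len(m.params)))
for i, name in enumerate(inter):
    R[i, list(m.params.index).index(name)] = 1.0
joint = m.f_test(R)                                 # H0: all interactions jointly zero
```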
- morie.did.parallel_trends_data(data, outcome, treatment, time, *, weights=None)[source]¶
Compute group-level outcome means over time for parallel-trends plots.
- morie.did.group_time_att(data, outcome, unit, time, treatment_time, *, covariates=None, method='doubly_robust', control_group='never_treated', n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Estimate group-time average treatment effects (Callaway & Sant’Anna 2021).
For each cohort \(g\) (units first treated at time \(g\)) and each post-treatment calendar period \(t\), estimates \(\text{ATT}(g, t)\).
- Parameters:
data (pd.DataFrame) – Panel data.
outcome (str) – Outcome column.
unit (str) – Unit identifier.
time (str) – Calendar-time column (integer).
treatment_time (str) – Column with treatment-onset period (use np.inf for never-treated).
covariates (list of str, optional) – Covariates for doubly-robust estimation.
method (str) – "doubly_robust" (default), "ipw", or "outcome_regression".
control_group (str) – "never_treated" or "not_yet_treated".
n_bootstrap (int) – Number of bootstrap replications for inference.
seed (int) – Random seed for bootstrap.
alpha (float) – Significance level.
- Returns:
Columns: cohort, time, att, std_error, ci_lower, ci_upper, p_value.
- Return type:
pd.DataFrame
References
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.
- morie.did.aggregate_gt_att(gt_results, *, aggregation='overall', time_col='time', cohort_col='cohort', att_col='att', se_col='std_error')[source]¶
Aggregate group-time ATTs into summary treatment effect parameters.
- Parameters:
gt_results (pd.DataFrame) – Output of group_time_att().
aggregation (str) – "overall" (single ATT), "cohort" (by cohort), "calendar_time" (by period), or "event_time" (by relative period since treatment).
time_col (str) – Override for the time column name in gt_results.
cohort_col (str) – Override for the cohort column name.
att_col (str) – Override for the ATT column name.
se_col (str) – Override for the standard-error column name.
- Returns:
Aggregated estimates with estimate, std_error, ci_lower, ci_upper.
- Return type:
pd.DataFrame
- morie.did.staggered_did(data, outcome, unit, time, treatment_time, *, covariates=None, n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Staggered DiD via group-time ATTs with aggregation.
A convenience wrapper around group_time_att() and aggregate_gt_att() that returns overall, cohort-level, and event-time aggregated results.
- Parameters:
data (pd.DataFrame) – Panel data.
outcome (str) – Outcome column.
unit (str) – Unit identifier column.
time (str) – Calendar-time column.
treatment_time (str) – Treatment-onset column.
covariates (list of str, optional) – Covariates for doubly-robust estimation.
n_bootstrap (int) – Bootstrap replications.
seed (int) – Random seed.
alpha (float) – Significance level.
- Returns:
Keys: group_time (pd.DataFrame), overall (pd.DataFrame), by_cohort (pd.DataFrame), by_event_time (pd.DataFrame).
- Return type:
- morie.did.did_doubly_robust(data, outcome, treatment, post, covariates, *, ps_model='logistic', or_model='linear', cluster=None, n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Doubly-robust DiD estimator (Sant’Anna & Zhao, 2020).
Combines an outcome regression model with an inverse-probability weighting model. The estimator is consistent if either model is correctly specified.
\[\hat\tau_{\text{DR}} = \frac{1}{n_1}\sum_{i:D_i=1}\bigl[\Delta Y_i - \hat\mu_0(\mathbf X_i)\bigr] - \frac{1}{\sum_{j:D_j=0}\hat w_j} \sum_{j:D_j=0}\hat w_j\bigl[\Delta Y_j - \hat\mu_0(\mathbf X_j)\bigr]\]
- Parameters:
- Return type:
References
Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101–122.
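The estimator in the display above can be sketched with scikit-learn nuisance models. This computes a point estimate only, assuming a pre-computed change score ΔY; bootstrap inference and morie's actual implementation details are omitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_did(delta_y, d, X):
    """Doubly-robust DiD point estimate following the formula above (sketch)."""
    ps = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]
    mu0 = LinearRegression().fit(X[d == 0], delta_y[d == 0]).predict(X)
    w = ps / (1 - ps)                                # odds weights for controls
    treated_term = (delta_y - mu0)[d == 1].mean()
    control_term = np.average((delta_y - mu0)[d == 0], weights=w[d == 0])
    return treated_term - control_term

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
d = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
delta_y = 2.0 * d + X[:, 1] + 0.5 * rng.normal(size=n)   # true ATT = 2
att = dr_did(delta_y, d, X)
```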
- morie.did.did_triple_difference(data, outcome, treatment, post, third_diff, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Triple-difference (DDD) estimator.
Extends the standard DiD by adding a third differencing dimension (e.g., within-state variation across affected and unaffected sub-populations).
\[Y = \alpha + \beta_1 D + \beta_2 \text{Post} + \beta_3 S + \beta_4 (D \times \text{Post}) + \beta_5 (D \times S) + \beta_6 (\text{Post} \times S) + \tau (D \times \text{Post} \times S) + \varepsilon\]
The DDD estimate is the coefficient on the three-way interaction.
- Parameters:
data (pd.DataFrame)
outcome (str) – Outcome column.
treatment (str) – Treatment-group indicator column.
post (str) – Post-period indicator column.
third_diff (str) – Binary indicator for the additional differencing group (e.g., age group, sub-population).
cluster (str, optional)
alpha (float)
- Return type:
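The DDD regression reduces to a single three-way interaction in statsmodels' formula interface. A simulated sketch (column names hypothetical, not morie's wrapper):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 4000
df = pd.DataFrame({"d": rng.integers(0, 2, n),       # treated state
                   "post": rng.integers(0, 2, n),    # post period
                   "s": rng.integers(0, 2, n)})      # affected sub-population
df["y"] = (0.5 * df["d"] + 0.3 * df["post"] + 0.2 * df["s"]
           + 1.5 * df["d"] * df["post"] * df["s"]    # true DDD effect = 1.5
           + 0.5 * rng.normal(size=n))

m = smf.ols("y ~ d * post * s", data=df).fit()       # expands all interactions
tau = m.params["d:post:s"]                           # triple-difference estimate
```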
- morie.did.bacon_decomposition(data, outcome, treatment, unit, time)[source]¶
Goodman-Bacon (2021) decomposition of the TWFE DiD estimator.
Decomposes the two-way fixed-effects DiD estimate into a weighted average of all possible 2x2 DiD comparisons: earlier-vs-later treated, treated-vs-never-treated, and later-vs-earlier treated.
- Parameters:
- Return type:
References
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254–277.
- morie.did.synthetic_did(data, outcome, unit, time, treatment_time, *, treated_units=None, zeta=None, n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Synthetic Difference-in-Differences estimator.
Combines synthetic control reweighting of control units with DiD to produce a doubly-robust estimator.
Solves for unit weights \(\hat\omega\) and time weights \(\hat\lambda\) that minimise weighted pre-treatment outcome differences, then estimates:
\[\hat\tau_{\text{SDID}} = \sum_i \hat\omega_i \sum_t \hat\lambda_t \bigl(Y_{it} - \hat\mu\bigr)\]
- Parameters:
data (pd.DataFrame) – Balanced panel.
outcome (str) – Outcome column.
unit (str) – Unit identifier column.
time (str) – Time period column.
treatment_time (str) – Treatment-onset column.
treated_units (list, optional) – Explicit list of treated unit IDs.
zeta (float, optional) – Regularisation parameter for unit weights. If None, automatically selected.
n_bootstrap (int) – Bootstrap replications.
seed (int) – Random seed.
alpha (float) – Significance level.
- Return type:
References
Arkhangelsky, D., et al. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088–4118.
- morie.did.wild_cluster_bootstrap(data, outcome, treatment, post, cluster, *, covariates=None, n_bootstrap=999, weight_type='rademacher', seed=42, alpha=0.05)[source]¶
DiD with wild cluster bootstrap p-values (Cameron, Gelbach & Miller, 2008).
Recommended when the number of clusters is small (< 50). Multiplies cluster-level residuals by random weights drawn from the Rademacher or Webb distributions.
- Parameters:
- Returns:
The p_value is the bootstrap p-value.
- Return type:
References
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414–427.
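The core resampling step can be sketched for a simple cluster-mean statistic. This illustrates the Rademacher-weight mechanics under the imposed null only, not the full regression-based procedure morie implements:

```python
import numpy as np

def wild_cluster_boot_p(y, g, n_boot=999, seed=0):
    """Wild cluster bootstrap p-value for H0: E[y] = 0 (sketch)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(g)

    def tstat(v):
        cm = np.array([v[g == c].mean() for c in labels])    # cluster means
        return cm.mean() / (cm.std(ddof=1) / np.sqrt(len(cm)))

    t_obs = tstat(y)
    idx = np.searchsorted(labels, g)                         # cluster index per row
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=len(labels))        # Rademacher draws
        t_boot[b] = tstat(y * w[idx])                        # flip whole clusters
    return float(np.mean(np.abs(t_boot) >= np.abs(t_obs)))

rng = np.random.default_rng(3)
g = np.repeat(np.arange(10), 30)                             # 10 small clusters
p = wild_cluster_boot_p(rng.normal(size=300), g)
```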
- morie.did.did_continuous_treatment(data, outcome, dose, post, *, covariates=None, cluster=None, alpha=0.05)[source]¶
DiD with a continuous treatment variable (dose–response DiD).
Models the outcome as:
\[Y_{it} = \alpha + \beta\,\text{Dose}_i + \gamma\,\text{Post}_t + \tau\,(\text{Dose}_i \times \text{Post}_t) + X_{it}'\delta + \varepsilon_{it}\]
\(\hat\tau\) is the marginal effect of a one-unit increase in treatment intensity in the post period.
- morie.did.did_fuzzy(data, outcome, assignment, takeup, post, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Fuzzy DiD estimator for settings with imperfect compliance.
Uses the interaction of assignment \(\times\) post as an instrument for takeup \(\times\) post in a 2SLS framework to recover a local average treatment effect (LATE).
- Parameters:
- Returns:
The estimate is the LATE from fuzzy DiD.
- Return type:
- morie.did.placebo_test_time(data, outcome, treatment, time, true_treatment_time, placebo_times, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Run placebo DiD tests at fake treatment times.
For each time in placebo_times, re-define the post indicator as time >= placebo_time using only pre-treatment data (before true_treatment_time) and estimate a DiD. A well-identified design should yield estimates close to zero.
- Parameters:
- Returns:
One row per placebo time with placebo_time, estimate, std_error, p_value, significant.
- Return type:
pd.DataFrame
- morie.did.placebo_test_outcome(data, placebo_outcomes, treatment, post, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Run placebo DiD on outcomes that should not be affected by treatment.
- morie.did.placebo_test_group(data, outcome, treatment, post, group_col, unaffected_groups, *, covariates=None, cluster=None, alpha=0.05)[source]¶
Run placebo DiD on groups that should not be affected.
- morie.did.did_heterogeneous(data, outcome, treatment, post, moderator, *, covariates=None, cluster=None, n_quantiles=4, alpha=0.05)[source]¶
Heterogeneity-robust DiD: estimate treatment effects by subgroups.
Splits the sample by quantiles (or categories) of moderator and estimates separate DiD effects for each stratum.
- Parameters:
- Returns:
Columns: group, estimate, std_error, ci_lower, ci_upper, p_value, n.
- Return type:
pd.DataFrame
- morie.did.did_chaisemartin_dhaultfoeuille(data, outcome, treatment, unit, time, *, n_bootstrap=200, seed=42, alpha=0.05)[source]¶
Heterogeneity-robust DiD (de Chaisemartin & D’Haultfoeuille, 2020).
Computes the instantaneous treatment effect for switchers (units whose treatment status changes) using appropriate comparisons.
\[\hat\delta = \sum_{(i,t): D_{it}>D_{i,t-1}} w_{it} \bigl[\Delta Y_{it} - \Delta \bar Y_{ct}\bigr]\]
- Parameters:
- Return type:
References
de Chaisemartin, C., & D’Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964–2996.
- morie.did.did_sensitivity_analysis(data, outcome, treatment, post, *, covariates=None, delta_range=None, cluster=None, alpha=0.05)[source]¶
Sensitivity of DiD estimate to violations of parallel trends.
Following Rambachan & Roth (2023), computes the identified set for the ATT under bounded deviations \(\delta\) from parallel trends:
\[|\text{bias}| \le \delta \cdot \hat\sigma\]
For each \(\delta\), computes a bias-adjusted confidence set.
- Parameters:
- Returns:
Columns: delta, ci_lower, ci_upper, covers_zero.
- Return type:
pd.DataFrame
References
Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555–2591.
- morie.did.did_diagnostics(data, outcome, treatment, post, *, covariates=None, cluster=None)[source]¶
Comprehensive diagnostics for a 2x2 DiD setting.
Checks:
- Sample sizes by group and period
- Baseline covariate balance
- Outcome distributions by group and period
- Pre-period outcome correlation between groups
Sensitivity analysis for causal inference assumptions.
Provides tools to assess the robustness of causal effect estimates to unmeasured confounding, model specification, and other threats to internal validity. Implements Rosenbaum bounds, E-value, Ding-VanderWeele bias formulas, tipping-point analysis, and specification curve analysis.
References
Rosenbaum (2002). Observational Studies, 2nd ed.
VanderWeele & Ding (2017). Sensitivity analysis in observational research.
Cinelli & Hazlett (2020). Making sense of sensitivity.
- class morie.sensitivity.EValueResult(point_estimate, e_value_point, e_value_ci, rr, ci_lower, ci_upper, interpretation)[source]¶
Bases: object
Result from E-value computation.
- Parameters:
- class morie.sensitivity.RosenbaumBounds(gamma_values, p_upper, p_lower, critical_gamma, method, interpretation)[source]¶
Bases: object
Result from Rosenbaum sensitivity analysis.
- Parameters:
- class morie.sensitivity.TippingPointResult(delta_values, adjusted_estimates, adjusted_p_values, tipping_point, original_estimate, interpretation)[source]¶
Bases: object
Result from tipping-point analysis.
- Parameters:
- class morie.sensitivity.OmittedVariableBias(estimate, se, rv_q, rv_qa, partial_r2_treatment, benchmark_bounds, interpretation)[source]¶
Bases: object
Result from omitted variable bias analysis (Cinelli & Hazlett).
- Parameters:
- class morie.sensitivity.SpecificationCurveResult(estimates, ses, p_values, specifications, median_estimate, iqr_lower, iqr_upper, pct_significant, pct_same_sign)[source]¶
Bases: object
Result from specification curve analysis.
- Parameters:
- morie.sensitivity.e_value_rr(rr, ci_lower=None, ci_upper=None)[source]¶
Compute E-value for a risk ratio.
The E-value is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both treatment and outcome to fully explain away the observed association.
- Parameters:
- Return type:
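The closed form behind this computation is E = RR + sqrt(RR(RR − 1)), applied to the reciprocal for protective effects. A minimal sketch (the helper name is hypothetical, not morie's function):

```python
import math

def e_value(rr):
    """E-value for a risk ratio (VanderWeele & Ding, 2017)."""
    r = rr if rr >= 1 else 1.0 / rr      # protective effects: use the reciprocal
    return r + math.sqrt(r * (r - 1))

e = e_value(2.0)   # an observed RR of 2 needs a confounder RR of about 3.41
```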
- morie.sensitivity.e_value_or(odds_ratio, ci_lower=None, ci_upper=None, prevalence=None)[source]¶
Compute E-value for an odds ratio.
For rare outcomes (prevalence < 15%), the OR approximates the RR. For common outcomes, a correction is applied using the prevalence.
- Parameters:
- Return type:
- morie.sensitivity.e_value_hr(hr, ci_lower=None, ci_upper=None)[source]¶
Compute E-value for a hazard ratio.
Uses the HR-to-RR approximation from VanderWeele (2017).
- Parameters:
- Return type:
- morie.sensitivity.e_value_d(d, se=None, n=None)[source]¶
Compute E-value for a standardized mean difference (Cohen’s d).
Converts d to a risk ratio scale using the VanderWeele & Ding approximation: RR ≈ exp(0.91 * d).
- Parameters:
- Return type:
- morie.sensitivity.rosenbaum_bounds(treated_outcomes, control_outcomes, gamma_range=None, method='wilcoxon')[source]¶
Rosenbaum sensitivity analysis for matched pair designs.
Tests how strong hidden bias (Gamma) would need to be to alter the study’s conclusions.
- Parameters:
treated_outcomes (array-like) – Outcomes for treated units in matched pairs.
control_outcomes (array-like) – Outcomes for control units in matched pairs.
gamma_range (array-like, optional) – Range of Gamma values to test. Default: 1.0 to 5.0 by 0.25.
method (str) – Test method: ‘wilcoxon’ (default), ‘sign’, ‘mcnemar’.
- Return type:
- morie.sensitivity.tipping_point_analysis(estimate, se, n_treated, n_control, delta_range=None, outcome_type='continuous')[source]¶
Tipping-point analysis for missing data sensitivity.
Evaluates how much the treatment effect changes if missing outcomes are systematically different from observed outcomes.
- Parameters:
estimate (float) – Observed treatment effect estimate.
se (float) – Standard error of the estimate.
n_treated (int) – Number of treated units.
n_control (int) – Number of control units.
delta_range (array-like, optional) – Range of bias parameters to evaluate.
outcome_type (str) – ‘continuous’ or ‘binary’.
- Return type:
- morie.sensitivity.omitted_variable_bias(estimate, se, dof, r2_yd_x, partial_r2_treatment, q=1.0, alpha=0.05, benchmark_covariates=None)[source]¶
Omitted variable bias analysis (sensemakr framework).
Implements the Cinelli & Hazlett (2020) approach to assess how much an unobserved confounder would need to explain to change the conclusion.
- Parameters:
estimate (float) – Treatment coefficient estimate.
se (float) – Standard error of the estimate.
dof (int) – Residual degrees of freedom.
r2_yd_x (float) – Partial R-squared of treatment with outcome (controlling for X).
partial_r2_treatment (float) – Same as r2_yd_x (for clarity).
q (float) – Fraction of the estimate to be explained away. Default 1.0 (full).
alpha (float) – Significance level.
benchmark_covariates (dict, optional) – Dictionary mapping covariate names to their partial R-squared with the outcome. Used to generate benchmark bounds.
- Return type:
- morie.sensitivity.specification_curve(data, outcome, treatment, covariate_sets, sample_filters=None, model_types=None, alpha=0.05)[source]¶
Run a specification curve analysis.
Estimates the treatment effect across many reasonable model specifications to assess robustness.
- Parameters:
data (pd.DataFrame) – Analysis dataset.
outcome (str) – Outcome variable name.
treatment (str) – Treatment variable name.
covariate_sets (list of list[str]) – Different sets of covariates to try.
sample_filters (list of (name, callable), optional) – Different sample restrictions to try.
model_types (list[str], optional) – Model types: ‘ols’, ‘logistic’, ‘robust’. Default: [‘ols’].
alpha (float) – Significance level.
- Return type:
- morie.sensitivity.manski_bounds(outcome_treated, outcome_control, p_treated, outcome_range=None)[source]¶
Compute Manski worst-case bounds for the ATE.
Under no assumptions about selection, the ATE is only partially identified.
- Parameters:
- Returns:
With keys: lower_bound, upper_bound, point_estimate, width.
- Return type:
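The worst-case bounds replace each unit's unobserved potential outcome with the extremes of the outcome range, so the bound width always equals that range. A minimal sketch (argument names hypothetical):

```python
def manski_bounds(y1_mean, y0_mean, p_treated, y_min, y_max):
    """No-assumption bounds on the ATE; width is always y_max - y_min."""
    p = p_treated
    lower = p * y1_mean + (1 - p) * y_min - (p * y_max + (1 - p) * y0_mean)
    upper = p * y1_mean + (1 - p) * y_max - (p * y_min + (1 - p) * y0_mean)
    return lower, upper

lo, hi = manski_bounds(y1_mean=0.6, y0_mean=0.4, p_treated=0.5,
                       y_min=0.0, y_max=1.0)
```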
- morie.sensitivity.bias_adjusted_estimate(estimate, se, rr_ud, rr_eu, prevalence_confounder=0.5)[source]¶
Compute bias-adjusted treatment effect.
Uses the Ding & VanderWeele (2016) bias formula.
- Parameters:
estimate (float) – Observed treatment effect (on log-RR or coefficient scale).
se (float) – Standard error.
rr_ud (float) – Risk ratio relating confounder to outcome.
rr_eu (float) – Risk ratio relating treatment to confounder.
prevalence_confounder (float) – Prevalence of the confounder in the population.
- Returns:
With keys: adjusted_estimate, bias, adjusted_ci_lower, adjusted_ci_upper.
- Return type:
- morie.sensitivity.probabilistic_bias_analysis(estimate, se, n_simulations=10000, bias_parms=None, seed=42)[source]¶
Probabilistic (Monte Carlo) sensitivity analysis.
Draws bias parameters from specified prior distributions and computes the distribution of bias-adjusted estimates.
- Parameters:
- Returns:
Summary of bias-adjusted estimate distribution.
- Return type:
- morie.sensitivity.sensitivity_summary(estimate, se, rr=None, odds_ratio=None, hazard_ratio=None, prevalence=None)[source]¶
Generate a comprehensive sensitivity analysis summary.
Computes E-values, tipping points, and robustness metrics for a single treatment effect estimate.
- Parameters:
- Returns:
Summary table of sensitivity metrics.
- Return type:
pd.DataFrame
Survey + descriptive statistics¶
Survey-weighted estimation wrappers.
Parallels the survey and srvyr packages in R.
This module provides:
SurveyDesign – encapsulates survey data with weights and strata.
Horvitz-Thompson and Hájek estimators for population totals and means.
Ratio estimator for auxiliary-variable-assisted estimation.
Post-stratification and calibration (raking) weight adjustments.
Domain/subpopulation estimation with design-based standard errors.
Complex-survey GLM with weights, clustering, and stratification.
Statistical assumptions and caveats¶
All functions assume the supplied weights are probability/analytic weights (i.e., w_i = 1 / pi_i where pi_i is the first-order inclusion probability). They are NOT frequency or expansion weights.
Standard errors under complex sampling are approximated by Taylor linearisation or the jackknife/bootstrap where noted.
For production survey analysis, validate against the R survey package (Lumley, 2010) or Stata's svy prefix.
References
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.
Horvitz, D. G., & Thompson, D. J. (1952). A generalisation of sampling without replacement from a finite universe. JASA, 47(260), 663–685.
Hájek, J. (1971). Comment on “An essay on the logical foundations of survey sampling.” In Godambe & Sprott (Eds.), Foundations of Statistical Inference. Holt, Rinehart and Winston.
- class morie.survey.SurveyDesign(data, weights_col, strata_col=None)[source]¶
Bases: object
Encapsulates survey data with its corresponding sampling weights and strata.
- __init__(data, weights_col, strata_col=None)[source]¶
Initialize the survey design object.
- Parameters:
data (pandas.DataFrame) – The pandas DataFrame containing the survey data.
weights_col (str) – The column name corresponding to the survey weights.
strata_col (str, optional) – The column name indicating survey strata, defaults to None.
- svyglm(formula, family=None)[source]¶
Fit a survey-weighted generalized linear model.
Survey probability weights (such as CPADS weight/wtpumf) are analytic/probability weights, not frequency expansion weights. Using freq_weights would incorrectly treat each weight as a replication count, inflating the effective sample size and producing anti-conservative standard errors. The correct statsmodels parameter for analytic probability weights in a GLM is var_weights, which scales the variance of each observation proportionally to its weight without artificially expanding n.
Standard errors returned by statsmodels GLM with var_weights are model-based (sandwich-free); for fully robust inference use fit(cov_type='HC3') on the returned object.
- Parameters:
formula (str) – A strictly patsy-compatible formula string.
family (statsmodels.genmod.families.family.Family, optional) – A statsmodels family object, defaults to Binomial().
- Returns:
A fitted statsmodels GLM result object.
- Return type:
statsmodels.genmod.generalized_linear_model.GLMResultsWrapper
References
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley. (Chapter 2 – probability weights vs. frequency weights.)
- morie.survey.horvitz_thompson_total(y, inclusion_probs)[source]¶
Horvitz-Thompson estimator of population total.
The HT estimator is:
\[\hat{\tau}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}\]where \(\pi_i\) is the first-order inclusion probability for unit i.
The variance estimator used here is the Sen-Yates-Grundy (SYG) approximation assuming simple random sampling (i.e., pairwise inclusion probabilities \(\pi_{ij} \approx \pi_i \pi_j\)), which yields:
\[\hat{V}(\hat{\tau}_{HT}) \approx \sum_i \frac{y_i^2}{\pi_i^2}\,(1 - \pi_i)\]
This is the standard Horvitz-Thompson variance under with-replacement sampling (Hansen-Hurwitz estimator) and is conservative under without-replacement designs.
- Parameters:
- Returns:
dict with keys total, se, ci_lower, ci_upper.
- Raises:
ValueError – If y and inclusion_probs have different lengths, or any inclusion probability is <= 0 or > 1.
- Return type:
References
Horvitz, D. G., & Thompson, D. J. (1952). A generalisation of sampling without replacement from a finite universe. JASA, 47(260), 663–685.
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Chapter 9A.)
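Numerically, the estimator and an approximate variance are a few lines. A sketch under the independent-selection approximation (helper name hypothetical):

```python
import math
import numpy as np

def ht_total(y, pi):
    """Horvitz-Thompson total with a pi_ij ≈ pi_i * pi_j variance approximation."""
    y, pi = np.asarray(y, float), np.asarray(pi, float)
    total = float(np.sum(y / pi))                  # sum of y_i / pi_i
    var = float(np.sum(y**2 / pi**2 * (1 - pi)))   # conservative variance
    return total, math.sqrt(var)

# equal inclusion probability 0.1: each unit represents 10 population units
total, se = ht_total([2.0, 4.0, 6.0], [0.1, 0.1, 0.1])
```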
- morie.survey.hajek_mean(y, weights)[source]¶
Hájek estimator for population mean (ratio estimator).
The Hájek estimator normalises the HT estimator by the estimated population size, eliminating sensitivity to total weight magnitude:
\[\bar{y}_H = \frac{\sum_i w_i y_i}{\sum_i w_i}\]where \(w_i = 1 / \pi_i\) are the survey weights.
Standard error is computed via Taylor (delta-method) linearisation:
\[\hat{V}(\bar{y}_H) \approx \frac{1}{N^2} \sum_i w_i^2 (y_i - \bar{y}_H)^2 \cdot \frac{1 - \pi_i}{\pi_i}\]
For the with-replacement approximation (\(\pi_i = w_i / \hat{N}\)) this simplifies to the weighted sample variance divided by the effective sample size.
- Parameters:
- Returns:
dict with keys mean, se, ci_lower, ci_upper.
- Raises:
ValueError – If y and weights have different lengths or weights <= 0.
- Return type:
References
Hájek, J. (1971). Comment in Godambe & Sprott (Eds.), Foundations of Statistical Inference. Holt, Rinehart and Winston.
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Section 6.13.)
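The Hájek point estimate with a simple with-replacement-style linearised SE can be sketched as follows. The SE here is a simplification of the formula above, not morie's exact variance; the helper name is hypothetical:

```python
import numpy as np

def hajek_mean(y, w):
    """Weighted (Hajek) mean with a crude linearised standard error."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    n_hat = w.sum()                                    # estimated population size
    mean = float(np.sum(w * y) / n_hat)
    se = float(np.sqrt(np.sum(w**2 * (y - mean) ** 2)) / n_hat)
    return mean, se

mean, se = hajek_mean([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
```

Because the estimator normalises by the weight total, rescaling all weights leaves the mean unchanged.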
- morie.survey.ratio_estimator(y, x, weights, X_population_total)[source]¶
Survey ratio estimator for a population total.
The ratio estimator exploits auxiliary information X whose population total is known:
\[\hat{Y}_R = \hat{R} \cdot X_{\text{pop}}\]where \(\hat{R} = \hat{Y}_{HT} / \hat{X}_{HT}\) is the estimated ratio of totals.
Standard error via Taylor linearisation:
\[\hat{V}(\hat{Y}_R) \approx \frac{1}{\hat{X}_{HT}^2} \sum_i w_i^2 (y_i - \hat{R} x_i)^2\]
- Parameters:
- Returns:
dict with keys ratio, total_estimate, se, ci_lower, ci_upper.
- Raises:
ValueError – If inputs have different lengths, weights <= 0, or X_population_total <= 0.
- Return type:
References
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley. (Section 6.4.) Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
- morie.survey.poststratification_weights(df, strata_col, population_counts)[source]¶
Compute post-stratification weights so the sample distribution matches the known population distribution.
For each stratum h:
\[w_i^{\text{PS}} = \frac{N_h / N}{n_h / n}\]
where \(N_h\) is the known population count in stratum h, \(N\) is the total population, \(n_h\) is the sample count in stratum h, and \(n\) is the total sample size.
- Parameters:
- Returns:
pd.Series of post-stratification weights, indexed like df.
- Raises:
ValueError – If strata_col not in df, any stratum in df is absent from population_counts, or sample stratum count is zero.
- Return type:
References
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley. (Chapter 7.)
Little, R. J. A. (1993). Post-stratification: A modeler’s perspective. JASA, 88(423), 1001–1012.
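The per-stratum adjustment is a short computation. A sketch assuming an unweighted starting sample (helper name hypothetical):

```python
import pandas as pd

def poststrat_weights(df, strata_col, population_counts):
    """w_i = (N_h / N) / (n_h / n) for the stratum of each unit (sketch)."""
    N = sum(population_counts.values())
    n = len(df)
    n_h = df[strata_col].value_counts()
    return df[strata_col].map(lambda h: (population_counts[h] / N) / (n_h[h] / n))

# sample over-represents stratum A (60%) relative to the population (50%)
sample = pd.DataFrame({"stratum": ["A"] * 60 + ["B"] * 40})
w = poststrat_weights(sample, "stratum", {"A": 500, "B": 500})
```

A units are down-weighted (5/6) and B units up-weighted (1.25), and the weights still sum to the sample size.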
- morie.survey.calibration_weights(df, aux_vars, population_totals, *, max_iter=50, tol=1e-06)[source]¶
Raking calibration (iterative proportional fitting) to match known population marginal totals.
For each auxiliary variable x_j, the calibration adjusts weights multiplicatively so that:
\[\sum_i w_i x_{ij} = T_j\]
where \(T_j\) is the known population total for variable j.
IPF convergence is assessed by the maximum relative deviation across all margins at each iteration.
- Parameters:
df (DataFrame) – Input DataFrame with auxiliary variable columns.
aux_vars (list[str]) – List of column names to calibrate on.
population_totals (dict[str, float]) – Dict mapping column name to known population total.
max_iter (int) – Maximum number of IPF iterations. Default 50.
tol (float) – Convergence tolerance (maximum relative deviation). Default 1e-6.
- Returns:
pd.Series of calibrated weights.
- Raises:
ValueError – If any aux_var is missing from df or population_totals, or initial weights are zero.
- Return type:
References
Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. JASA, 87(418), 376–382.
Deming, W. E., & Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11(4), 427–444.
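For 0/1 auxiliary indicators, the IPF loop can be sketched as below: each margin's member weights are scaled multiplicatively until all margins match. The helper name and uniform starting weights are assumptions, not morie's implementation:

```python
import numpy as np
import pandas as pd

def rake(df, aux_vars, population_totals, max_iter=50, tol=1e-6):
    """Raking to known margins via iterative proportional fitting (sketch)."""
    w = np.ones(len(df))
    for _ in range(max_iter):
        max_dev = 0.0
        for col in aux_vars:
            x = df[col].to_numpy(float)                    # 0/1 indicator column
            cur = float(np.sum(w * x))
            w = w * (population_totals[col] / cur) ** x    # scale member weights only
            max_dev = max(max_dev,
                          abs(cur - population_totals[col]) / population_totals[col])
        if max_dev < tol:
            break
    return pd.Series(w, index=df.index)

df = pd.DataFrame({"male": [1, 0, 1, 0], "young": [1, 1, 0, 0]})
w = rake(df, ["male", "young"], {"male": 50.0, "young": 40.0})
```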
- morie.survey.subpopulation_estimate(df, domain_col, domain_value, outcome_col, weight_col)[source]¶
Design-based estimate for a domain (subpopulation) mean.
Treats domain membership as a 0/1 indicator and applies the Hájek estimator within the full-sample design frame (i.e., does NOT subset then reweight, which would ignore the variance contribution from domain membership uncertainty).
Domain mean:
\[\bar{y}_d = \frac{\sum_{i: d_i = 1} w_i y_i}{\sum_{i: d_i = 1} w_i}\]
SE via linearisation (Woodruff, 1971):
\[\hat{V}(\bar{y}_d) \approx \frac{1}{N_d^2} \sum_i w_i^2 z_i^2\]
where \(z_i = d_i (y_i - \bar{y}_d)\) and \(N_d = \sum_{i: d_i=1} w_i\).
- Parameters:
df (DataFrame) – Full sample DataFrame (do NOT pre-filter to domain).
domain_col (str) – Column name identifying the domain.
domain_value – Value of domain_col that identifies the target subpopulation.
outcome_col (str) – Column name of the outcome variable.
weight_col (str) – Column name of survey weights.
- Returns:
dict with keys mean, se, ci_lower, ci_upper, n_domain.
- Raises:
ValueError – If required columns are missing.
- Return type:
References
Woodruff, R. S. (1971). A simple method for approximating the variance of a complicated estimate. JASA, 66(334), 411–414.
Lumley, T. (2010). Complex Surveys. Wiley. (Section 4.2.)
- morie.survey.complex_survey_glm(df, formula, weight_col, *, family='gaussian', cluster_col=None, strata_col=None)[source]¶
Fit a GLM with complex survey design (weights, optional clustering, strata).
The model is fit using statsmodels WLS/GLM with analytic var_weights (not freq_weights; see SurveyDesign.svyglm() for the rationale).
When cluster_col is provided, cluster-robust (sandwich) standard errors are applied via fit(cov_type='cluster', cov_kwds={'groups': ...}).
Strata support is informational only at this level; stratum-specific estimates require subpopulation_estimate().
Supported families (string): "gaussian", "binomial", "poisson", "gamma", "negativebinomial".
- Parameters:
df (DataFrame) – Input DataFrame.
formula (str) – Patsy-compatible formula string (e.g. "y ~ x1 + x2").
weight_col (str) – Column name of analytic/probability survey weights.
family (str) – GLM family string. Default "gaussian".
cluster_col (str | None) – Column identifying clusters for robust SE. Default None.
strata_col (str | None) – Column identifying strata (informational only). Default None.
- Returns:
Fitted statsmodels GLM result object.
- Raises:
ValueError – If required columns are missing or family is unrecognised.
- Return type:
References
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817–838.
Survey sampling, resampling, and design-weight utilities.
This module provides functions for common sampling designs used in epidemiological and public health research: simple random sampling, stratified sampling, cluster sampling, probability-proportional-to-size (PPS) sampling, bootstrap resampling, jackknife variance estimation, and design-weight computation.
All stochastic functions use numpy.random.default_rng() for reproducibility rather than legacy global-state RNGs.
References
Cochran, W. G. (1977). Sampling Techniques (3rd ed.). Wiley.
Kish, L. (1965). Survey Sampling. Wiley.
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
Wolter, K. M. (2007). Introduction to Variance Estimation (2nd ed.). Springer.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
- morie.sampling.simple_random_sample(df, n, *, replace=False, seed=42)[source]¶
Draw a simple random sample (SRS) from a DataFrame.
- Parameters:
df (pandas.DataFrame) – Population frame.
n (int) – Number of units to sample.
replace (bool, optional) – If True, sample with replacement (default False).
seed (int, optional) – Random seed for reproducibility (default 42).
- Returns:
Sampled rows with their original index preserved.
- Return type:
- Raises:
ValueError – If n exceeds the number of rows when sampling without replacement.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"x": range(100)})
>>> sample = simple_random_sample(df, 10, seed=0)
>>> len(sample)
10
References
Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 2.
- morie.sampling.stratified_sample(df, strata_col, n_per_stratum, *, proportional=False, seed=42)[source]¶
Draw a stratified random sample.
Supports both fixed allocation (same n per stratum) and proportional allocation. When proportional=True, n_per_stratum is interpreted as the total desired sample size and is allocated proportionally across strata.
- Parameters:
df (pandas.DataFrame) – Population frame.
strata_col (str) – Column defining strata.
n_per_stratum (int or dict[str, int]) – If proportional=False: fixed integer per stratum (or a dict mapping stratum values to per-stratum sample sizes). If proportional=True: total desired sample size (must be int).
proportional (bool, optional) – If True, allocate sample sizes proportionally to stratum sizes (default False).
seed (int, optional) – Random seed (default 42).
- Returns:
Sampled rows with original index preserved.
- Return type:
- Raises:
ValueError – If a requested stratum size exceeds the stratum population.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"stratum": ["A"]*50 + ["B"]*50, "x": range(100)})
>>> sample = stratified_sample(df, "stratum", 10, seed=0)
>>> sample.groupby("stratum").size().to_dict()
{'A': 10, 'B': 10}
References
Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 5.
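The proportional-allocation rule above (n_h proportional to N_h) can be sketched directly. The helper below is illustrative, not morie's implementation; the largest-remainder rounding step is an assumption about how fractional shares are resolved.

```python
import numpy as np
import pandas as pd

def proportional_allocation(df, strata_col, n_total):
    # Exact (fractional) proportional shares: n_h = n_total * N_h / N
    sizes = df[strata_col].value_counts()
    exact = n_total * sizes / sizes.sum()
    alloc = np.floor(exact).astype(int)
    # Largest-remainder rounding so per-stratum sizes sum to n_total
    remainder = (exact - alloc).sort_values(ascending=False)
    for stratum in remainder.index[: n_total - alloc.sum()]:
        alloc.loc[stratum] += 1
    return alloc.to_dict()

df = pd.DataFrame({"s": ["A"] * 60 + ["B"] * 40})
alloc = proportional_allocation(df, "s", 10)
```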
- morie.sampling.cluster_sample(df, cluster_col, n_clusters, *, seed=42)[source]¶
Single-stage cluster sampling: select clusters, then take all units within.
Selected clusters are fully enumerated; no second-stage subsampling is performed.
- Parameters:
df (pandas.DataFrame) – Population frame with a cluster identifier column.
cluster_col (str) – Column identifying clusters (e.g., PSU, school, clinic).
n_clusters (int) – Number of clusters to select.
seed (int, optional) – Random seed (default 42).
- Returns:
All rows belonging to the selected clusters.
- Return type:
- Raises:
ValueError – If n_clusters exceeds the number of distinct clusters.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "cluster": [1]*10 + [2]*10 + [3]*10,
...     "y": range(30),
... })
>>> sample = cluster_sample(df, "cluster", 2, seed=0)
>>> sample["cluster"].nunique()
2
References
Cochran, W. G. (1977). Sampling Techniques (3rd ed.), Chapter 9.
- morie.sampling.pps_sample(df, size_col, n, *, seed=42)[source]¶
Probability-proportional-to-size (PPS) sampling.
Draws n units with probability proportional to the values in size_col. Negative or zero size values are excluded with a warning.
- Parameters:
df (pandas.DataFrame) – Population frame.
size_col (str) – Column containing the size measure (must be positive).
n (int) – Number of units to draw.
seed (int, optional) – Random seed (default 42).
- Returns:
Sampled rows.
- Return type:
- Raises:
ValueError – If no positive size values remain after filtering.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"pop": [1000, 500, 200, 100], "name": ["A", "B", "C", "D"]})
>>> sample = pps_sample(df, "pop", 2, seed=0)
>>> len(sample)
2
References
Brewer, K. R. W., & Hanif, M. (1983). Sampling with Unequal Probabilities. Springer.
- morie.sampling.bootstrap_sample(df, n_bootstrap=1000, *, statistic, seed=42)[source]¶
Bootstrap resampling for any scalar statistic.
Draws n_bootstrap resamples with replacement from df, applies statistic to each, and returns the bootstrap distribution with percentile confidence interval.
- Parameters:
df (pandas.DataFrame) – Input data.
n_bootstrap (int, optional) – Number of bootstrap replicates (default 1000).
statistic (callable) – Function f(df) -> float computing the statistic of interest.
seed (int, optional) – Random seed (default 42).
- Returns:
Keys: mean (float), se (float), ci_lower (float), ci_upper (float), distribution (numpy.ndarray of length n_bootstrap).
- Return type:
Examples
>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({"x": np.random.default_rng(0).normal(5, 1, 200)})
>>> result = bootstrap_sample(df, 500, statistic=lambda d: d["x"].mean(), seed=0)
>>> 4.5 < result["mean"] < 5.5
True
References
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
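The percentile bootstrap that bootstrap_sample performs can be sketched with plain numpy, using numpy.random.default_rng per the module's RNG policy. percentile_bootstrap is a hypothetical helper, not the library function.

```python
import numpy as np

def percentile_bootstrap(x, statistic, n_boot=1000, alpha=0.05, seed=0):
    # Resample with replacement, apply the statistic, take percentiles
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    boot = np.array([statistic(rng.choice(x, size=x.size, replace=True))
                     for _ in range(n_boot)])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return boot.mean(), boot.std(ddof=1), lo, hi

data = np.random.default_rng(1).normal(5, 1, 200)
mean, se, lo, hi = percentile_bootstrap(data, np.mean, n_boot=500)
```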
- morie.sampling.jackknife_estimate(df, *, statistic)[source]¶
Delete-1 jackknife variance estimate for a scalar statistic.
Computes the leave-one-out estimates \(\hat{\theta}_{(-i)}\) and derives the jackknife variance estimate:
\[\widehat{\text{Var}}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(-i)} - \bar{\theta}_{(\cdot)}\right)^2\]
- Parameters:
df (pandas.DataFrame) – Input data.
statistic (callable) – Function f(df) -> float computing the statistic of interest.
- Returns:
Keys: estimate (float — full-sample statistic), se (float — jackknife standard error), bias (float — jackknife bias estimate).
- Return type:
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> jk = jackknife_estimate(df, statistic=lambda d: d["x"].mean())
>>> abs(jk["estimate"] - 3.0) < 1e-10
True
References
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353–360.
Tukey, J. W. (1958). Bias and confidence in not-quite large samples (abstract). Annals of Mathematical Statistics, 29, 614.
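The variance formula above is easy to check numerically: for the sample mean, the jackknife SE reduces exactly to s/√n. A minimal sketch (jackknife_se is illustrative, not the library function):

```python
import numpy as np

def jackknife_se(x, statistic):
    # Leave-one-out estimates theta_(-i)
    x = np.asarray(x, dtype=float)
    n = x.size
    loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    # Jackknife variance: (n-1)/n * sum((theta_(-i) - mean(theta_(-i)))^2)
    var_j = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)
    return float(np.sqrt(var_j))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
se = jackknife_se(x, np.mean)  # equals x.std(ddof=1) / sqrt(5)
```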
- morie.sampling.compute_design_weights(df, strata_col, population_sizes)[source]¶
Compute inverse-probability design weights for a stratified sample.
For each stratum h, the design weight is:
\[w_{hi} = \frac{N_h}{n_h}\]
where \(N_h\) is the population size and \(n_h\) is the sample size in stratum h.
- Parameters:
df (pandas.DataFrame) – Sample data.
strata_col (str) – Column defining strata.
population_sizes (dict[str, int]) – Mapping of stratum value to population size.
- Returns:
Design weights aligned with the DataFrame index.
- Return type:
- Raises:
KeyError – If a stratum in df is not found in population_sizes.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"stratum": ["A"]*10 + ["B"]*20})
>>> w = compute_design_weights(df, "stratum", {"A": 1000, "B": 2000})
>>> float(w[df["stratum"] == "A"].iloc[0])
100.0
References
Kish, L. (1965). Survey Sampling, Chapter 2. Wiley.
- morie.sampling.effective_sample_size(weights)[source]¶
Compute the Kish effective sample size.
\[n_{\text{eff}} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}\]
- Parameters:
weights (numpy.ndarray or pandas.Series) – Sampling or analytic weights.
- Returns:
Effective sample size.
- Return type:
Examples
>>> import numpy as np
>>> effective_sample_size(np.ones(100))
100.0
>>> ess = effective_sample_size(np.array([1.0, 1.0, 10.0]))
>>> ess < 3.0
True
References
Kish, L. (1965). Survey Sampling, p. 162. Wiley.
- morie.sampling.design_effect(weights)[source]¶
Approximate design effect (DEFF) from weights.
The Kish approximation to the design effect is:
\[\text{DEFF} \approx \frac{n}{n_{\text{eff}}} = \frac{n \sum_i w_i^2}{\left(\sum_i w_i\right)^2}\]
A DEFF of 1.0 indicates no inflation (equal weights); values > 1 indicate variance inflation from unequal weighting.
- Parameters:
weights (numpy.ndarray or pandas.Series) – Sampling or analytic weights.
- Returns:
Approximate design effect.
- Return type:
Examples
>>> import numpy as np
>>> design_effect(np.ones(100))
1.0
References
Kish, L. (1965). Survey Sampling, p. 162. Wiley.
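The DEFF approximation follows directly from the Kish effective sample size as n / n_eff. A minimal sketch (deff is a hypothetical helper, not the library function):

```python
import numpy as np

def deff(weights):
    # DEFF ~ n / n_eff, with n_eff the Kish effective sample size
    w = np.asarray(weights, dtype=float)
    n_eff = w.sum() ** 2 / np.sum(w ** 2)
    return w.size / n_eff

d_equal = deff(np.ones(100))          # equal weights: no inflation
d_unequal = deff([1.0, 1.0, 10.0])    # unequal weights inflate variance
```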
Survival analysis toolkit for epidemiological time-to-event research.
This module provides a comprehensive set of tools for censored time-to-event
data commonly encountered in cohort studies, clinical trials, and public health
surveillance. Implementations are built on numpy, scipy, and
pandas; optional integration with lifelines is used where
available but never required.
Major components¶
Non-parametric estimators: Kaplan–Meier, Nelson–Aalen
Log-rank family tests: standard, Peto–Peto, Gehan–Wilcoxon, Tarone–Ware
Cox proportional hazards: partial likelihood estimation with Breslow and Efron tie-handling, Schoenfeld / Cox–Snell / martingale / deviance residuals
Parametric models: Exponential, Weibull, log-normal, log-logistic, Gompertz
Accelerated failure time (AFT) models
Competing risks: Cumulative incidence functions (CIF), Fine–Gray model
Discrimination: Concordance index (Harrell’s C)
RMST: Restricted mean survival time comparison
References
- Collett, D. (2015). Modelling Survival Data in Medical Research (3rd ed.). Chapman & Hall/CRC.
- Kalbfleisch, J. D., & Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data (2nd ed.). Wiley.
- Fine, J. P., & Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. JASA, 94(446), 496–509.
- Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data (2nd ed.). Springer.
- class morie.survival.SurvivalCurve(times, survival, ci_lower, ci_upper, at_risk, events, censored, method='Kaplan-Meier', median_survival=None)[source]¶
Bases: object
Stores a non-parametric survival or cumulative-hazard estimate.
- Parameters:
times (ndarray) – Ordered unique event times.
survival (ndarray) – Estimated survival probabilities (or cumulative hazard).
ci_lower (ndarray) – Lower confidence band.
ci_upper (ndarray) – Upper confidence band.
at_risk (ndarray) – Number at risk just before each event time.
events (ndarray) – Number of events at each event time.
censored (ndarray) – Number censored at each event time.
method (str) – Estimator name.
median_survival (float or None) – Median survival time (first time S(t) <= 0.5).
- class morie.survival.LogRankResult(method, test_statistic, p_value, df, n_groups, n_total, extra=<factory>)[source]¶
Bases: object
Result container for log-rank family tests.
- Parameters:
- class morie.survival.CoxResult(coefficients, standard_errors, hazard_ratios, z_scores, p_values, ci_lower, ci_upper, covariate_names, concordance, log_likelihood, n_events, n_observations, method='Cox PH', baseline_hazard=None, extra=<factory>)[source]¶
Bases: object
Result container for Cox PH model.
- Parameters:
- class morie.survival.ParametricSurvivalResult(distribution, coefficients, log_likelihood, aic, bic, n_observations, n_events, extra=<factory>)[source]¶
Bases: object
Result container for parametric survival / AFT models.
- Parameters:
- class morie.survival.CompetingRiskResult(times, cif, ci_lower, ci_upper, event_of_interest, n_total, method='Aalen-Johansen')[source]¶
Bases: object
Result container for competing-risk analyses.
- Parameters:
- morie.survival.kaplan_meier(time, event, confidence=0.95, ci_method='greenwood')[source]¶
Kaplan–Meier product-limit survival estimator.
- Parameters:
time (array-like) – Observed times (non-negative).
event (array-like) – Event indicator (1 = event, 0 = censored).
confidence (float, default 0.95) – Confidence level for survival bands.
ci_method (str, default "greenwood") – "greenwood" for Greenwood’s formula or "log-log" for the complementary log-log transformation.
- Return type:
References
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. JASA, 53(282), 457–481.
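The product-limit estimate itself is a short computation: at each distinct event time, multiply the running survival by (1 − d/n). A sketch without confidence bands (km_survival is illustrative, not morie's kaplan_meier):

```python
import numpy as np

def km_survival(time, event):
    # Kaplan-Meier product-limit estimate at distinct event times
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    uniq = np.unique(time[event == 1])          # distinct event times
    s, surv = 1.0, []
    for t in uniq:
        n_at_risk = np.sum(time >= t)           # at risk just before t
        d = np.sum((time == t) & (event == 1))  # events at t
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return uniq, np.array(surv)

times, surv = km_survival([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```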
- morie.survival.nelson_aalen(time, event, confidence=0.95)[source]¶
Nelson–Aalen cumulative hazard estimator.
- Parameters:
time (array-like) – Observed times.
event (array-like) – Event indicator (1 = event, 0 = censored).
confidence (float, default 0.95) – Confidence level.
- Returns:
The survival field here contains the cumulative hazard H(t), not S(t).
- Return type:
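The Nelson–Aalen estimator accumulates d_i / n_i over the distinct event times. A minimal sketch (nelson_aalen_H is a hypothetical helper, not the library function):

```python
import numpy as np

def nelson_aalen_H(time, event):
    # Nelson-Aalen cumulative hazard: H(t) = sum over event times of d_i / n_i
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    uniq = np.unique(time[event == 1])
    H, out = 0.0, []
    for t in uniq:
        n_at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        H += d / n_at_risk
        out.append(H)
    return uniq, np.array(out)

times, H = nelson_aalen_H([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```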
- morie.survival.logrank_test(time, event, group)[source]¶
Standard log-rank test (Mantel–Haenszel).
- Parameters:
time (array-like) – Observed times.
event (array-like) – Event indicators (1 = event, 0 = censored).
group (array-like) – Group labels.
- Return type:
- morie.survival.peto_peto_test(time, event, group)[source]¶
Peto–Peto modification of the Gehan–Wilcoxon test.
- Parameters:
time (array-like)
event (array-like)
group (array-like)
- Return type:
- morie.survival.gehan_wilcoxon_test(time, event, group)[source]¶
Gehan–Wilcoxon test (Breslow test).
- Parameters:
time (array-like)
event (array-like)
group (array-like)
- Return type:
- morie.survival.tarone_ware_test(time, event, group)[source]¶
Tarone–Ware weighted log-rank test.
- Parameters:
time (array-like)
event (array-like)
group (array-like)
- Return type:
- morie.survival.cox_ph(data, duration_col, event_col, covariate_cols, ties='efron', confidence=0.95, penalizer=0.0)[source]¶
Fit a Cox proportional hazards model.
Uses Newton–Raphson optimisation of the partial likelihood.
- Parameters:
data (DataFrame) – Analysis dataset.
duration_col (str) – Column with observed times.
event_col (str) – Column with event indicators (1 = event).
covariate_cols (list[str]) – Columns to include as covariates.
ties (str, default "efron") – Tie handling: "efron" or "breslow".
confidence (float, default 0.95) – Confidence level for hazard ratio CIs.
penalizer (float, default 0.0) – L2 regularisation strength.
- Return type:
- morie.survival.schoenfeld_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]¶
Compute Schoenfeld residuals for testing the proportional hazards assumption.
- Parameters:
- Returns:
One row per event with columns for each covariate’s Schoenfeld residual and the event time.
- Return type:
DataFrame
- morie.survival.cox_snell_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]¶
Compute Cox–Snell residuals.
The Cox–Snell residual for observation i is \(r_i = \hat{H}_0(t_i) \exp(x_i^T \hat{\beta})\).
If the model is correct these should follow a unit-rate exponential distribution.
- morie.survival.martingale_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]¶
Compute martingale residuals.
\(M_i = \delta_i - r_i^{CS}\) where \(r_i^{CS}\) is the Cox–Snell residual.
- morie.survival.deviance_residuals(data, duration_col, event_col, covariate_cols, cox_result)[source]¶
Compute deviance residuals from a Cox PH model.
\(d_i = \text{sign}(M_i) \sqrt{-2[M_i + \delta_i \log(\delta_i - M_i)]}\)
- morie.survival.test_ph_assumption(data, duration_col, event_col, covariate_cols, cox_result)[source]¶
Test the proportional hazards assumption via Schoenfeld residual correlation with time.
For each covariate, computes the Pearson correlation between scaled Schoenfeld residuals and event time and reports a p-value.
- morie.survival.exponential_model(time, event)[source]¶
Fit an exponential survival model (constant hazard).
The exponential distribution has hazard \(h(t) = \lambda\) and survival \(S(t) = e^{-\lambda t}\).
- Parameters:
time (array-like)
event (array-like)
- Return type:
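For the exponential model, the MLE has a closed form under right censoring: the number of events divided by total follow-up time. A sketch (exponential_mle is illustrative, not the library function):

```python
import numpy as np

def exponential_mle(time, event):
    # Closed-form MLE with right censoring:
    # lambda_hat = (number of events) / (total follow-up time)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    lam = event.sum() / time.sum()
    # Censored-data log-likelihood: d * log(lam) - lam * sum(t_i)
    loglik = event.sum() * np.log(lam) - lam * time.sum()
    return lam, loglik

lam, ll = exponential_mle([2.0, 3.0, 5.0], [1, 1, 0])  # 2 events, 10 person-time
```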
- morie.survival.weibull_model(time, event)[source]¶
Fit a Weibull survival model via MLE.
Parameterisation: \(S(t) = \exp(-(t/\lambda)^k)\).
- Parameters:
time (array-like)
event (array-like)
- Return type:
- morie.survival.lognormal_model(time, event)[source]¶
Fit a log-normal survival model via MLE.
- Parameters:
time (array-like)
event (array-like)
- Return type:
- morie.survival.loglogistic_model(time, event)[source]¶
Fit a log-logistic survival model via MLE.
\(S(t) = 1 / (1 + (t/\alpha)^\beta)\).
- Parameters:
time (array-like)
event (array-like)
- Return type:
- morie.survival.gompertz_model(time, event)[source]¶
Fit a Gompertz survival model via MLE.
Hazard: \(h(t) = b \exp(ct)\). Survival: \(S(t) = \exp(-(b/c)(e^{ct} - 1))\).
- Parameters:
time (array-like)
event (array-like)
- Return type:
- morie.survival.aft_weibull(data, duration_col, event_col, covariate_cols)[source]¶
Accelerated failure time model with Weibull baseline.
The AFT model specifies \(\log T = \mu + x^T \beta + \sigma W\) where W follows a standard extreme value distribution.
- Parameters:
- Return type:
- morie.survival.aft_lognormal(data, duration_col, event_col, covariate_cols)[source]¶
AFT model with log-normal baseline.
- Parameters:
- Return type:
- morie.survival.aft_loglogistic(data, duration_col, event_col, covariate_cols)[source]¶
AFT model with log-logistic baseline.
- Parameters:
- Return type:
- morie.survival.restricted_mean_survival_time(time, event, tau=None, confidence=0.95)[source]¶
Compute the restricted mean survival time (RMST).
\(\text{RMST}(\tau) = \int_0^\tau \hat{S}(t) \, dt\).
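Because the KM curve is a step function, the integral reduces to a weighted sum of step widths. A sketch operating on precomputed KM jump points (rmst_from_steps is a hypothetical helper, not the library function):

```python
import numpy as np

def rmst_from_steps(times, survival, tau):
    # Integrate a right-continuous KM step function from 0 to tau.
    # S(t) = 1 on [0, times[0]); survival[i] applies on [times[i], times[i+1]).
    t = np.concatenate(([0.0], np.asarray(times, dtype=float)))
    s = np.concatenate(([1.0], np.asarray(survival, dtype=float)))
    # Truncate the time grid at tau; segments beyond tau get zero width
    edges = np.clip(np.append(t, tau), 0.0, tau)
    return float(np.sum(s * np.diff(edges)))

rmst = rmst_from_steps([1, 2, 3], [0.8, 0.6, 0.3], tau=4.0)
```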
- morie.survival.rmst_difference(time1, event1, time2, event2, tau=None, confidence=0.95)[source]¶
Compare RMST between two groups.
- Parameters:
- Returns:
Keys: rmst_diff, se, z, p_value, ci_lower, ci_upper.
- Return type:
- morie.survival.cumulative_incidence_function(time, event, event_of_interest=1, confidence=0.95)[source]¶
Aalen–Johansen estimator of the cumulative incidence function (CIF) for competing risks.
- Parameters:
- Return type:
- morie.survival.fine_gray_weights(time, event, event_of_interest=1)[source]¶
Compute Fine–Gray IPCW weights for subdistribution hazard modelling.
These weights can be passed to a weighted Cox model to obtain Fine–Gray subdistribution hazard ratios.
- Parameters:
time (array-like)
event (array-like)
event_of_interest (int, default 1)
- Returns:
Observation-level weights.
- Return type:
ndarray
References
Fine, J. P., & Gray, R. J. (1999). A proportional hazards model for the subdistribution of a competing risk. JASA, 94(446), 496–509.
- morie.survival.concordance_index(event_times, predicted_scores, event_observed)[source]¶
Harrell’s concordance index (C-statistic) for survival models.
- Parameters:
event_times (array-like) – Observed times.
predicted_scores (array-like) – Model-predicted risk scores (higher = higher risk).
event_observed (array-like) – Event indicator (1 = event).
- Returns:
Concordance index in [0, 1].
- Return type:
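Harrell's C counts the fraction of usable pairs that the risk scores order correctly. An O(n²) sketch (harrell_c is illustrative; skipping pairs with tied times is a simplifying assumption):

```python
import numpy as np
from itertools import combinations

def harrell_c(event_times, scores, event_observed):
    # A pair is usable when the subject with the earlier time had an
    # observed event; tied scores count as half-concordant.
    t = np.asarray(event_times, dtype=float)
    s = np.asarray(scores, dtype=float)
    e = np.asarray(event_observed, dtype=int)
    num = den = 0.0
    for i, j in combinations(range(len(t)), 2):
        if t[j] < t[i]:
            i, j = j, i            # orient so i has the earlier time
        if t[i] == t[j] or e[i] == 0:
            continue               # pair not usable
        den += 1
        if s[i] > s[j]:
            num += 1               # higher risk failed earlier: concordant
        elif s[i] == s[j]:
            num += 0.5
    return num / den

c = harrell_c([1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 1, 1])  # perfectly concordant
```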
- morie.survival.hazard_ratio(time, event, group, confidence=0.95)[source]¶
Estimate the hazard ratio between two groups from a simple Cox model.
- morie.survival.survival_plot_data(time, event, group=None, confidence=0.95)[source]¶
Generate a long-format DataFrame suitable for survival curve plotting.
- Parameters:
time (array-like)
event (array-like)
group (array-like or None) – Group labels. If None, a single group is assumed.
confidence (float, default 0.95)
- Returns:
Columns: time, survival, ci_lower, ci_upper, at_risk, group.
- Return type:
DataFrame
- morie.survival.survival_calibration(predicted_survival, time, event, n_groups=10)[source]¶
Calibration of predicted survival probabilities.
Divides observations into n_groups deciles of predicted risk and compares predicted vs. observed KM survival within each group.
- Parameters:
predicted_survival (array-like) – Model-predicted survival probabilities at a fixed time horizon.
time (array-like)
event (array-like)
n_groups (int, default 10)
- Returns:
Columns: group, predicted_mean, observed_survival, n.
- Return type:
DataFrame
- morie.survival.landmark_dataset(data, duration_col, event_col, landmark_time)[source]¶
Create a landmark analysis dataset.
Excludes subjects who experienced an event or were censored before landmark_time and resets the time origin.
- morie.survival.left_truncated_km(entry_time, exit_time, event, confidence=0.95)[source]¶
Kaplan–Meier estimator with left truncation (delayed entry).
- Parameters:
entry_time (array-like) – Time at which each subject enters the risk set.
exit_time (array-like) – Observed failure/censoring time.
event (array-like) – Event indicator.
confidence (float, default 0.95)
- Return type:
- morie.survival.turnbull_estimator(left, right, max_iter=200, tol=1e-06)[source]¶
Turnbull’s self-consistency algorithm for interval-censored data.
- Parameters:
- Returns:
(unique_times, survival_probs).
- Return type:
tuple[ndarray, ndarray]
- morie.survival.compare_parametric_models(time, event)[source]¶
Fit all available parametric models and compare via AIC/BIC.
- Parameters:
time (array-like)
event (array-like)
- Returns:
Columns: distribution, log_likelihood, aic, bic, n_events.
- Return type:
DataFrame
Comprehensive hypothesis testing suite for epidemiological research.
This module provides a unified interface for the full spectrum of frequentist
hypothesis tests encountered in public health and biomedical research. Every
test function returns a TestResult dataclass containing the test
statistic, p-value, degrees of freedom, confidence interval bounds, effect
size, and method name so that downstream code can process results
programmatically without parsing text output.
Categories of tests¶
Location tests (t-tests, ANOVA, non-parametric rank tests)
Association tests (chi-squared, Fisher exact, correlation)
Distribution tests (normality, homogeneity of variance, goodness of fit)
Proportion tests (one- and two-sample z-tests, Fisher exact)
Agreement tests (Cohen’s kappa, Fleiss’ kappa, ICC)
All implementations delegate heavy numerics to scipy.stats and
statsmodels where possible; only formulas absent from those libraries
are coded from scratch.
References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
- Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley.
- Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). Wiley.
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- class morie.statistics.TestResult(method, test_statistic, p_value, df=None, ci_lower=None, ci_upper=None, effect_size=None, estimate=None, n=None, extra=<factory>)[source]¶
Bases: object
Standardised container for every hypothesis-test result in this module.
- Parameters:
method (str) – Human-readable name of the statistical test.
test_statistic (float) – Value of the test statistic (t, F, chi-squared, U, etc.).
p_value (float) – Two-sided p-value (or one-sided where documented).
df (float | None) – Degrees of freedom (None for non-parametric tests that do not define df).
ci_lower (float | None) – Lower bound of the confidence interval for the estimated parameter.
ci_upper (float | None) – Upper bound of the confidence interval for the estimated parameter.
effect_size (float | None) – A standardised effect-size measure appropriate to the test.
estimate (float | None) – Point estimate of the quantity being tested (mean difference, odds ratio, correlation, etc.).
n (int | None) – Total sample size used in the test.
extra (dict) – Arbitrary additional outputs (e.g. per-group means, residuals).
- morie.statistics.one_sample_ttest(x, mu0=0.0, confidence=0.95)[source]¶
One-sample Student’s t-test.
Tests \(H_0: \mu = \mu_0\) against \(H_1: \mu \neq \mu_0\).
- Parameters:
- Return type:
- morie.statistics.two_sample_ttest(x, y, equal_var=True, confidence=0.95)[source]¶
Independent two-sample t-test (equal or unequal variance).
- Parameters:
- Return type:
- morie.statistics.welch_ttest(x, y, confidence=0.95)[source]¶
Welch’s t-test – convenience wrapper for two_sample_ttest(equal_var=False).
- Parameters:
x (array-like) – Two independent samples.
y (array-like) – Two independent samples.
confidence (float, default 0.95) – Confidence level.
- Return type:
- morie.statistics.paired_ttest(x, y, confidence=0.95)[source]¶
Paired-sample t-test.
Tests \(H_0: \mu_d = 0\) where \(d_i = x_i - y_i\).
- Parameters:
x (array-like) – Paired observations (must have equal length).
y (array-like) – Paired observations (must have equal length).
confidence (float, default 0.95) – Confidence level for the mean difference CI.
- Return type:
- morie.statistics.one_way_anova(*groups)[source]¶
One-way between-subjects ANOVA (F-test).
- Parameters:
*groups (array-like) – Two or more independent samples.
- Returns:
Effect size is \(\eta^2 = SS_{between} / SS_{total}\).
- Return type:
- morie.statistics.two_way_anova(data, outcome, factor_a, factor_b)[source]¶
Two-way factorial ANOVA via OLS type-II sums of squares.
- Parameters:
- Returns:
The extra dict contains per-factor and interaction F and p values.
- Return type:
- morie.statistics.repeated_measures_anova(data, outcome, subject, within)[source]¶
One-way repeated-measures ANOVA (sphericity not assumed – uses Greenhouse–Geisser correction).
- Parameters:
- Return type:
- morie.statistics.kruskal_wallis(*groups)[source]¶
Kruskal–Wallis H-test (non-parametric one-way ANOVA).
- Parameters:
*groups (array-like) – Two or more independent samples.
- Returns:
Effect size is \(\eta^2_H = (H - k + 1) / (N - k)\).
- Return type:
- morie.statistics.friedman_test(*groups)[source]¶
Friedman test for repeated measures on ranks.
- Parameters:
*groups (array-like) – Three or more matched samples (equal length).
- Returns:
Effect size is Kendall’s W.
- Return type:
- morie.statistics.chi2_goodness_of_fit(observed, expected=None)[source]¶
Chi-squared goodness-of-fit test.
- Parameters:
observed (array-like) – Observed frequency counts.
expected (array-like or None) – Expected frequency counts. If None, a uniform distribution is assumed.
- Returns:
Effect size is Cohen’s w.
- Return type:
- morie.statistics.chi2_independence(contingency_table, correction=True)[source]¶
Chi-squared test of independence for a contingency table.
- Parameters:
contingency_table (array-like or DataFrame) – An r x c contingency table of observed counts.
correction (bool, default True) – Apply Yates’s continuity correction for 2x2 tables.
- Returns:
Effect size is Cramer’s V.
- Return type:
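Cramér's V derives from the chi-squared statistic as √(χ² / (n · (min(r, c) − 1))). A sketch via scipy (cramers_v is a hypothetical helper, not the library function):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    # Effect size for an r x c contingency table, in [0, 1]
    table = np.asarray(table, dtype=float)
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

v_indep = cramers_v([[10, 10], [10, 10]])   # perfect independence
v_assoc = cramers_v([[20, 0], [0, 20]])     # perfect association
```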
- morie.statistics.mcnemar_test(contingency_table, exact=False)[source]¶
McNemar’s test for paired nominal data (2x2 table).
- Parameters:
contingency_table (array-like) – A 2x2 contingency table [[a, b], [c, d]].
exact (bool, default False) – If True, use the exact binomial test instead of the chi-squared approximation.
- Return type:
- morie.statistics.cochrans_q(*groups)[source]¶
Cochran’s Q test for related samples with binary outcomes.
- Parameters:
*groups (array-like) – Three or more matched binary (0/1) samples.
- Return type:
- morie.statistics.pearson_correlation(x, y, confidence=0.95)[source]¶
Pearson product-moment correlation with Fisher z CI.
- Parameters:
x (array-like) – Two continuous variables.
y (array-like) – Two continuous variables.
confidence (float, default 0.95) – Confidence level.
- Return type:
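The Fisher z interval transforms r with arctanh, builds a normal CI on the z scale with SE ≈ 1/√(n − 3), and back-transforms with tanh. A sketch (pearson_fisher_ci is illustrative, not the library function):

```python
import numpy as np
from scipy import stats

def pearson_fisher_ci(x, y, confidence=0.95):
    r, p = stats.pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                          # Fisher z-transform
    se = 1.0 / np.sqrt(n - 3)                  # approximate SE on the z scale
    zc = stats.norm.ppf(1.0 - (1.0 - confidence) / 2.0)
    return r, p, float(np.tanh(z - zc * se)), float(np.tanh(z + zc * se))

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)
r, p, lo, hi = pearson_fisher_ci(x, y)
```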
- morie.statistics.spearman_correlation(x, y, confidence=0.95)[source]¶
Spearman rank correlation.
- Parameters:
x (array-like) – Two variables.
y (array-like) – Two variables.
confidence (float, default 0.95) – Confidence level.
- Return type:
- morie.statistics.kendall_correlation(x, y)[source]¶
Kendall’s tau-b rank correlation.
- Parameters:
x (array-like) – Two variables.
y (array-like) – Two variables.
- Return type:
- morie.statistics.point_biserial_correlation(binary, continuous, confidence=0.95)[source]¶
Point-biserial correlation between a binary and continuous variable.
- Parameters:
binary (array-like) – Binary (0/1) variable.
continuous (array-like) – Continuous variable.
confidence (float, default 0.95) – Confidence level.
- Return type:
- morie.statistics.partial_correlation(x, y, covariates, confidence=0.95)[source]¶
Partial Pearson correlation controlling for covariates.
Uses OLS residualisation: regress both x and y on the covariates, then correlate the residuals.
- Parameters:
x (array-like) – Variables of interest.
y (array-like) – Variables of interest.
covariates (array-like or DataFrame) – Matrix of control variables (n x p).
confidence (float, default 0.95) – Confidence level.
- Return type:
- morie.statistics.semi_partial_correlation(x, y, covariates)[source]¶
Semi-partial (part) correlation.
Residualises only x on the covariates and correlates the residual with the raw y.
- Parameters:
x (array-like) – Variables of interest.
y (array-like) – Variables of interest.
covariates (array-like or DataFrame) – Matrix of control variables.
- Return type:
- morie.statistics.mann_whitney_u(x, y, alternative='two-sided')[source]¶
Mann–Whitney U test (Wilcoxon rank-sum test).
- Parameters:
x (array-like) – Two independent samples.
y (array-like) – Two independent samples.
alternative (str, default "two-sided") – One of "two-sided", "less", "greater".
- Returns:
Effect size is rank-biserial correlation \(r = 1 - 2U / (n_1 n_2)\).
- Return type:
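The rank-biserial formula above can be computed directly from scipy's U statistic; note the sign convention depends on which sample's U is used (rank_biserial is a hypothetical helper, not the library function):

```python
from scipy.stats import mannwhitneyu

def rank_biserial(x, y):
    # scipy returns U for the first sample (pairs where x > y, ties half)
    u = mannwhitneyu(x, y, alternative="two-sided").statistic
    return 1.0 - 2.0 * u / (len(x) * len(y))

r = rank_biserial([1, 2, 3], [10, 11, 12])   # every x below every y: U = 0
```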
- morie.statistics.wilcoxon_signed_rank(x, y=None, alternative='two-sided')[source]¶
Wilcoxon signed-rank test for paired samples or one sample.
- Parameters:
x (array-like) – If y is None, test whether the distribution of x is symmetric about zero. Otherwise, test paired differences x - y.
y (array-like or None) – Second sample for paired test.
alternative (str, default "two-sided") – One of "two-sided", "less", "greater".
- Return type:
- morie.statistics.ks_test_one_sample(x, cdf='norm', args=())[source]¶
One-sample Kolmogorov–Smirnov test.
- Parameters:
x (array-like) – Sample data.
cdf (str, default "norm") – Name of a scipy.stats distribution (e.g. "norm", "expon").
args (tuple) – Extra arguments to the CDF (loc, scale, etc.).
- Return type:
- morie.statistics.ks_test_two_sample(x, y)[source]¶
Two-sample Kolmogorov–Smirnov test.
- Parameters:
x (array-like) – Two independent samples.
y (array-like) – Two independent samples.
- Return type:
- morie.statistics.anderson_darling(x, dist='norm')[source]¶
Anderson–Darling test for a specified distribution family.
- Parameters:
x (array-like) – Sample data.
dist (str, default "norm") – Distribution name ("norm", "expon", "logistic", "gumbel").
- Returns:
The extra dict contains critical values and significance levels.
- Return type:
- morie.statistics.levene_test(*groups, center='median')[source]¶
Levene’s test for equality of variances.
- Parameters:
*groups (array-like) – Two or more samples.
center (str, default "median") – "median" (Brown–Forsythe), "mean", or "trimmed".
- Return type:
- morie.statistics.bartlett_test(*groups)[source]¶
Bartlett’s test for equality of variances.
- Parameters:
*groups (array-like) – Two or more samples.
- Return type:
- morie.statistics.runs_test(x, cutoff=None)[source]¶
Wald–Wolfowitz runs test for randomness.
- Parameters:
x (array-like) – Numeric sequence.
cutoff (float or None) – If None, uses the median to dichotomise x.
- Return type:
- morie.statistics.shapiro_wilk(x)[source]¶
Shapiro–Wilk test for normality.
- Parameters:
x (array-like) – Sample data (n <= 5000 recommended by scipy).
- Return type:
- morie.statistics.dagostino_pearson(x)[source]¶
D’Agostino–Pearson omnibus normality test.
- Parameters:
x (array-like) – Sample data (n >= 20 recommended).
- Return type:
- morie.statistics.jarque_bera(x)[source]¶
Jarque–Bera test for normality (based on skewness and kurtosis).
- Parameters:
x (array-like) – Sample data.
- Return type:
- morie.statistics.lilliefors_test(x)[source]¶
Lilliefors test for normality (KS test with estimated mean and SD).
Uses the Kolmogorov–Smirnov statistic with parameters estimated from the data. Critical values and p-values are approximated via the D’Agostino–Stephens tables or, when statsmodels is available, via its lilliefors implementation.
- Parameters:
x (array-like) – Sample data.
- Return type:
- morie.statistics.one_proportion_ztest(count, nobs, value=0.5, confidence=0.95)[source]¶
One-sample z-test for a proportion.
- Parameters:
- Return type:
- morie.statistics.two_proportion_ztest(count1, nobs1, count2, nobs2, confidence=0.95)[source]¶
Two-sample z-test for the difference between two proportions.
- Parameters:
- Return type:
- morie.statistics.fisher_exact_test(contingency_table, alternative='two-sided')[source]¶
Fisher’s exact test for a 2x2 contingency table.
- Parameters:
contingency_table (array-like) – A 2x2 table of counts.
alternative (str, default "two-sided") – "two-sided", "less", or "greater".
- Returns:
estimate is the odds ratio.
- Return type:
- morie.statistics.cohens_kappa(rater1, rater2, confidence=0.95)[source]¶
Cohen’s kappa for inter-rater agreement between two raters.
- Parameters:
rater1 (array-like) – Categorical ratings from two raters (same length).
rater2 (array-like) – Categorical ratings from two raters (same length).
confidence (float, default 0.95) – Confidence level.
- Return type:
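Cohen's kappa is (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement from the marginal rating frequencies. A sketch without the CI (cohens_kappa here is illustrative, not the library function):

```python
import numpy as np

def cohens_kappa(r1, r2):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)                                  # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

k_perfect = cohens_kappa([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
k_inverse = cohens_kappa([0, 1, 0, 1], [1, 0, 1, 0])
```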
- morie.statistics.fleiss_kappa(ratings_matrix)[source]¶
Fleiss’ kappa for agreement among multiple raters.
- Parameters:
ratings_matrix (array-like) – An n x k matrix where rows are subjects and columns are rating categories. Cell (i, j) is the number of raters who assigned subject i to category j.
- Return type:
References
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- morie.statistics.intraclass_correlation(data, targets, raters, ratings, icc_type='ICC3k')[source]¶
Intraclass correlation coefficient (ICC).
Implements the Shrout & Fleiss (1979) taxonomy:
ICC1: One-way random, single measures
ICC1k: One-way random, average measures
ICC2: Two-way random, single measures
ICC2k: Two-way random, average measures
ICC3: Two-way mixed, single measures
ICC3k: Two-way mixed, average measures
- Parameters:
- Return type:
References
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- morie.statistics.normality_battery(x)[source]¶
Run all available normality tests on a single sample.
- Parameters:
x (array-like) – Sample data.
- Returns:
Results from Shapiro–Wilk, D’Agostino–Pearson, Jarque–Bera, and Lilliefors tests.
- Return type:
- morie.statistics.variance_equality_battery(*groups)[source]¶
Run Levene’s (median) and Bartlett’s tests for homogeneity of variance.
- Parameters:
*groups (array-like) – Two or more samples.
- Return type:
- morie.statistics.correlation_matrix(data, method='pearson')[source]¶
Compute a pairwise correlation matrix with p-values.
- Parameters:
data (DataFrame) – Numeric columns only.
method (str, default "pearson") –
"pearson","spearman", or"kendall".
- Returns:
MultiIndex columns with
("r", col)and("p", col)levels.- Return type:
DataFrame
- morie.statistics.auto_test(x, y=None, paired=False, confidence=0.95)[source]¶
Automatically select and run the most appropriate test.
Decision logic:
If y is None – one-sample t-test against zero.
If paired=True – paired t-test (if the differences are normal) or Wilcoxon signed-rank.
If two independent samples – check normality and variance equality; choose between Student’s t, Welch’s t, or Mann–Whitney U.
- Parameters:
- Return type:
Comprehensive bootstrap and resampling inference methods.
Provides nonparametric and parametric bootstrap, jackknife, permutation tests, subsampling, and cross-validation resampling for statistical inference in epidemiological and observational studies.
All methods support stratified resampling, cluster-aware resampling, and survey-weighted variants for complex survey designs.
- class morie.bootstrap_methods.BootstrapResult(estimate, se, ci_lower, ci_upper, bias, n_boot, method, ci_method, boot_distribution, original_estimate, acceleration=0.0)[source]¶
Bases: object
Result from a bootstrap procedure.
- Parameters:
- class morie.bootstrap_methods.JackknifeResult(estimate, se, ci_lower, ci_upper, bias, n, jackknife_estimates, pseudovalues, influence_values)[source]¶
Bases: object
Result from jackknife estimation.
- Parameters:
- class morie.bootstrap_methods.PermutationTestResult(observed_statistic, p_value, null_distribution, n_permutations, alternative, ci_lower=nan, ci_upper=nan)[source]¶
Bases: object
Result from a permutation test.
- Parameters:
- class morie.bootstrap_methods.CrossValidationResult(scores, mean_score, se_score, ci_lower, ci_upper, n_folds, metric, fold_sizes)[source]¶
Bases: object
Result from cross-validation.
- Parameters:
- morie.bootstrap_methods.bootstrap(data, statistic, n_boot=2000, ci_level=0.95, ci_method='bca', seed=42, stratify=None, cluster=None)[source]¶
Nonparametric bootstrap inference.
- Parameters:
data (array-like) – Input data (1D or 2D).
statistic (callable) – Function that computes the statistic from a data array.
n_boot (int) – Number of bootstrap replicates.
ci_level (float) – Confidence interval level.
ci_method (str) – CI method: ‘percentile’, ‘normal’, ‘basic’, ‘bca’, ‘studentized’.
seed (int) – Random seed.
stratify (array-like, optional) – Stratification variable for stratified bootstrap.
cluster (array-like, optional) – Cluster variable for cluster bootstrap.
- Return type:
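The default 'bca' method corrects for bias and acceleration; the simpler 'percentile' method can be sketched with the stdlib alone (helper name and return shape are illustrative, not the morie signature):

```python
import random
from statistics import mean

def percentile_bootstrap(data, statistic, n_boot=2000, ci_level=0.95, seed=42):
    """Nonparametric percentile-bootstrap CI (illustrative helper)."""
    rng = random.Random(seed)
    n = len(data)
    # resample with replacement and recompute the statistic each time
    boots = sorted(statistic([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int((1 - ci_level) / 2 * n_boot)]
    hi = boots[int((1 + ci_level) / 2 * n_boot) - 1]
    return statistic(data), lo, hi
```

The percentile interval simply reads the empirical quantiles of the bootstrap distribution; BCa shifts those quantiles using the estimated bias and acceleration.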
- morie.bootstrap_methods.parametric_bootstrap(data, statistic, distribution='normal', n_boot=2000, ci_level=0.95, seed=42, **dist_params)[source]¶
Parametric bootstrap.
Generates bootstrap samples from a fitted parametric distribution rather than resampling from the data.
- Parameters:
data (array-like) – Original data (used to fit the distribution).
statistic (callable) – Function that computes the statistic.
distribution (str) – Distribution: ‘normal’, ‘poisson’, ‘binomial’, ‘exponential’, ‘gamma’.
n_boot (int) – Number of bootstrap replicates.
ci_level (float) – Confidence interval level.
seed (int) – Random seed.
- Return type:
- morie.bootstrap_methods.wild_bootstrap(y, X, statistic_idx=1, n_boot=999, ci_level=0.95, weight_distribution='rademacher', seed=42)[source]¶
Wild bootstrap for linear regression with heteroskedasticity.
- Parameters:
y (array-like) – Response variable.
X (array-like) – Design matrix (including intercept if desired).
statistic_idx (int) – Index of the coefficient to test.
n_boot (int) – Number of bootstrap replicates.
ci_level (float) – Confidence interval level.
weight_distribution (str) – ‘rademacher’ (±1 with equal probability) or ‘mammen’.
seed (int) – Random seed.
- Return type:
- morie.bootstrap_methods.block_bootstrap(data, statistic, block_size, n_boot=2000, ci_level=0.95, method='circular', seed=42)[source]¶
Block bootstrap for dependent data.
- Parameters:
data (array-like) – Time series or dependent data.
statistic (callable) – Function that computes the statistic.
block_size (int) – Length of each block.
n_boot (int) – Number of bootstrap replicates.
ci_level (float) – Confidence interval level.
method (str) – ‘moving’ (overlapping blocks), ‘circular’ (wrap-around blocks), or ‘stationary’ (random block lengths).
seed (int) – Random seed.
- Return type:
- morie.bootstrap_methods.jackknife(data, statistic, ci_level=0.95)[source]¶
Delete-one jackknife estimation.
- Parameters:
data (array-like) – Input data.
statistic (callable) – Function that computes the statistic.
ci_level (float) – Confidence interval level.
- Return type:
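The delete-one jackknife recomputes the statistic n times, each time leaving one observation out, and turns the spread of those leave-one-out values into a standard error. A stdlib sketch of the SE alone (helper name illustrative):

```python
from math import sqrt
from statistics import mean

def jackknife_se(data, statistic):
    """Delete-one jackknife standard error (illustrative helper)."""
    n = len(data)
    # recompute the statistic with each observation left out in turn
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    loo_bar = mean(loo)
    # Tukey's jackknife variance formula
    var = (n - 1) / n * sum((v - loo_bar) ** 2 for v in loo)
    return sqrt(var)
```

For the sample mean this reproduces the textbook s/sqrt(n) standard error exactly.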
- morie.bootstrap_methods.delete_d_jackknife(data, statistic, d=2, ci_level=0.95, max_subsets=5000, seed=42)[source]¶
Delete-d jackknife (generalized jackknife).
- Parameters:
- Return type:
- morie.bootstrap_methods.permutation_test(group1, group2, statistic='mean_diff', n_permutations=9999, alternative='two-sided', seed=42)[source]¶
Two-sample permutation test.
- Parameters:
group1 (array-like) – First group.
group2 (array-like) – Second group.
statistic (str or callable) – Test statistic: ‘mean_diff’, ‘median_diff’, ‘t_stat’, or a callable(g1, g2) -> float.
n_permutations (int) – Number of permutations.
alternative (str) – ‘two-sided’, ‘greater’, ‘less’.
seed (int) – Random seed.
- Return type:
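The two-sided test shuffles group labels and counts how often the permuted statistic is at least as extreme as the observed one. A stdlib sketch using the mean difference, with the conventional add-one correction (helper name illustrative):

```python
import random
from statistics import mean

def perm_test(g1, g2, n_permutations=9999, seed=42):
    """Two-sided two-sample permutation test of the mean difference."""
    rng = random.Random(seed)
    observed = mean(g1) - mean(g2)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        stat = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(stat) >= abs(observed):
            extreme += 1
    # add-one correction so the p-value is never exactly zero
    return (extreme + 1) / (n_permutations + 1)
```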
- morie.bootstrap_methods.paired_permutation_test(x, y, statistic='mean_diff', n_permutations=9999, alternative='two-sided', seed=42)[source]¶
Paired permutation test (sign-flipping).
- Parameters:
- Return type:
- morie.bootstrap_methods.subsampling(data, statistic, subsample_size=None, n_subsamples=1000, ci_level=0.95, seed=42)[source]¶
Subsampling inference (Politis, Romano & Wolf).
Unlike the bootstrap, subsampling draws without replacement at a smaller sample size, providing valid inference under weaker conditions.
- Parameters:
- Return type:
- morie.bootstrap_methods.bootstrap_632(X, y, model_fn, score_fn, n_boot=200, seed=42)[source]¶
The .632 bootstrap estimator for prediction error.
- Parameters:
- Returns:
With keys: apparent_error, bootstrap_error, error_632, error_632plus.
- Return type:
- morie.bootstrap_methods.cross_validate(X, y, model_fn, score_fn, n_folds=10, stratify=None, groups=None, seed=42)[source]¶
K-fold cross-validation.
- Parameters:
X (array-like, shape (n, p)) – Feature matrix.
y (array-like, shape (n,)) – Response.
model_fn (callable) – model_fn(X_train, y_train) -> model with .predict().
score_fn (callable) – score_fn(y_true, y_pred) -> float.
n_folds (int) – Number of folds.
stratify (array-like, optional) – Stratification variable for stratified CV.
groups (array-like, optional) – Group variable for grouped CV (no group split across folds).
seed (int) – Random seed.
- Return type:
- morie.bootstrap_methods.repeated_cv(X, y, model_fn, score_fn, n_folds=10, n_repeats=10, seed=42)[source]¶
Repeated K-fold cross-validation.
- Parameters:
- Return type:
- morie.bootstrap_methods.leave_one_out_cv(X, y, model_fn, score_fn)[source]¶
Leave-one-out cross-validation.
- Parameters:
X (array-like) – Feature matrix.
y (array-like) – Response.
model_fn (callable) – Model fitting function.
score_fn (callable) – Scoring function.
- Return type:
Multiple testing correction and multiplicity-adjusted inference.
Provides comprehensive p-value adjustment methods, family-wise error rate (FWER) control, false discovery rate (FDR) control, and simultaneous confidence interval methods for multiple comparisons in epidemiological and biostatistical analyses.
All functions accept arrays of p-values and return adjusted p-values or decision boundaries. Compatible with the MORIE pipeline and can be applied to any collection of hypothesis tests from the analysis modules.
- class morie.multiple_testing.AdjustedPValues(original, adjusted, method, alpha, rejected, n_rejected, n_tests, labels=None)[source]¶
Bases: object
Container for multiplicity-adjusted p-values.
- Parameters:
- class morie.multiple_testing.GatekeepingResult(stages, overall_rejected, method, alpha)[source]¶
Bases: object
Result from a gatekeeping or hierarchical testing procedure.
- class morie.multiple_testing.ClosedTestResult(original_p, adjusted_p, rejected, intersection_tests, method)[source]¶
Bases: object
Result from a closed testing procedure.
- Parameters:
- morie.multiple_testing.bonferroni(p_values, alpha=0.05, labels=None)[source]¶
Bonferroni correction (FWER control).
- Parameters:
- Return type:
- morie.multiple_testing.sidak(p_values, alpha=0.05, labels=None)[source]¶
Sidak correction (FWER control, slightly less conservative than Bonferroni).
- Parameters:
- Return type:
- morie.multiple_testing.holm(p_values, alpha=0.05, labels=None)[source]¶
Holm step-down procedure (FWER control).
Uniformly more powerful than Bonferroni.
- Parameters:
- Return type:
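Holm's step-down adjustment multiplies the i-th smallest p-value by (m - i + 1) and enforces monotonicity with a running maximum. A stdlib sketch returning adjusted p-values in the original order (helper name illustrative):

```python
def holm(p_values):
    """Holm step-down adjusted p-values (illustrative helper)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # visit p-values from smallest to largest; multiplier shrinks at each step
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted
```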
- morie.multiple_testing.hochberg(p_values, alpha=0.05, labels=None)[source]¶
Hochberg step-up procedure (FWER control under PRDS).
- Parameters:
- Return type:
- morie.multiple_testing.hommel(p_values, alpha=0.05, labels=None)[source]¶
Hommel procedure (FWER control; step-wise and more powerful than Hochberg, valid under independence or positive dependence).
- Parameters:
- Return type:
- morie.multiple_testing.holm_sidak(p_values, alpha=0.05, labels=None)[source]¶
Holm-Sidak step-down procedure.
- Parameters:
- Return type:
- morie.multiple_testing.benjamini_hochberg(p_values, alpha=0.05, labels=None)[source]¶
Benjamini-Hochberg procedure (FDR control).
Controls the false discovery rate at level alpha under independence or positive regression dependency (PRDS).
- Parameters:
- Return type:
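The BH adjustment multiplies the k-th smallest p-value by m/k and takes a running minimum from the largest p-value downward. A stdlib sketch (helper name illustrative):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    m = len(p_values)
    # visit p-values from largest to smallest
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value counted from the smallest
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```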
- morie.multiple_testing.bh(p_values, alpha=0.05, labels=None)¶
Benjamini-Hochberg procedure (FDR control).
Controls the false discovery rate at level alpha under independence or positive regression dependency (PRDS).
- Parameters:
- Return type:
- morie.multiple_testing.benjamini_yekutieli(p_values, alpha=0.05, labels=None)[source]¶
Benjamini-Yekutieli procedure (FDR control under arbitrary dependence).
- Parameters:
- Return type:
- morie.multiple_testing.by(p_values, alpha=0.05, labels=None)¶
Benjamini-Yekutieli procedure (FDR control under arbitrary dependence).
- Parameters:
- Return type:
- morie.multiple_testing.storey_q(p_values, alpha=0.05, lambda_param=0.5, labels=None)[source]¶
Storey’s q-value procedure (adaptive FDR control).
Estimates the proportion of true null hypotheses (pi0) and uses it to achieve higher power than BH when many nulls are false.
- Parameters:
- Return type:
- morie.multiple_testing.local_fdr(p_values, labels=None)[source]¶
Estimate local false discovery rate from p-values.
Uses an empirical Bayes approach with a two-component mixture model.
- morie.multiple_testing.permutation_fwer(test_statistics, null_distribution, alternative='two-sided', alpha=0.05, labels=None)[source]¶
FWER control via max-T permutation test (Westfall-Young).
- morie.multiple_testing.permutation_fdr(p_values, null_p_values, alpha=0.05, labels=None)[source]¶
FDR control via permutation null distribution.
- Parameters:
- Return type:
- morie.multiple_testing.fisher_combined(p_values)[source]¶
Fisher’s method for combining independent p-values.
- morie.multiple_testing.stouffer_combined(p_values, weights=None)[source]¶
Stouffer’s method for combining independent p-values.
- morie.multiple_testing.tippett_combined(p_values)[source]¶
Tippett’s method: combine via minimum p-value.
- morie.multiple_testing.simes_combined(p_values)[source]¶
Simes’ test for the global null hypothesis.
- morie.multiple_testing.harmonic_mean_p(p_values)[source]¶
Harmonic mean p-value for combining dependent tests.
- Parameters:
p_values (array-like) – P-values (can be dependent).
- Returns:
Harmonic mean p-value.
- Return type:
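The unweighted statistic is simply the harmonic mean of the p-values; the exact significance calibration applied by the library is omitted in this stdlib sketch:

```python
def harmonic_mean_p(p_values):
    """Unweighted harmonic mean of p-values (illustrative helper)."""
    n = len(p_values)
    return n / sum(1 / p for p in p_values)
```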
- morie.multiple_testing.cauchy_combination(p_values, weights=None)[source]¶
Cauchy combination test (Liu & Xie 2020).
Robust to arbitrary correlation structures.
- morie.multiple_testing.hierarchical_bonferroni(p_values_by_family, alpha=0.05, propagate_alpha=True)[source]¶
Hierarchical (serial gatekeeping) Bonferroni procedure.
Tests are organized in families (stages). A family is tested only if at least one hypothesis in the previous family is rejected.
- Parameters:
- Return type:
- morie.multiple_testing.fixed_sequence(p_values, alpha=0.05, labels=None)[source]¶
Fixed-sequence (predetermined order) testing procedure.
Tests are evaluated in order; the procedure stops at the first non-rejection. No multiplicity adjustment needed for tests that are reached.
- Parameters:
- Return type:
- morie.multiple_testing.fallback_procedure(p_values, weights, alpha=0.05, labels=None)[source]¶
Fallback (fixed-sequence with alpha spending) procedure.
- Parameters:
- Return type:
- morie.multiple_testing.estimate_pi0(p_values, method='storey')[source]¶
Estimate the proportion of true null hypotheses (pi0).
- morie.multiple_testing.adjust_p_values(p_values, method='bh', alpha=0.05, labels=None)[source]¶
Convenience dispatcher for p-value adjustment methods.
- Parameters:
- Return type:
- morie.multiple_testing.n_effective_tests(p_values=None, correlation_matrix=None, method='galwey')[source]¶
Estimate the effective number of independent tests.
Useful for determining appropriate Bonferroni correction when tests are correlated.
- Parameters:
p_values (array-like, optional) – P-values (used for some methods).
correlation_matrix (array-like, optional) – Correlation matrix between test statistics.
method (str) – ‘galwey’ (eigenvalue-based), ‘li_ji’ (Li & Ji 2005), ‘nyholt’ (Nyholt 2004).
- Returns:
Estimated number of effective independent tests.
- Return type:
Comprehensive effect-size calculations for epidemiological research.
This module provides every major family of effect-size estimators used in biomedical and social-science research, each with analytic or bootstrap confidence intervals. Functions are organised by the type of comparison:
Standardised mean differences: Cohen’s d, Hedges’ g, Glass’s delta
Common-language effect sizes: CLES / probability of superiority
Correlation-based: r, R-squared, eta-squared, partial eta-squared, omega-squared, epsilon-squared
Contingency-table measures: odds ratio, risk ratio, risk difference, NNT, NNH, rate ratio, incidence rate difference
Association measures: Cohen’s w, Cramer’s V, phi coefficient
Non-parametric: rank-biserial correlation, Cliff’s delta, Vargha–Delaney A
Regression: standardised coefficients, coefficient of variation
Conversion: d <-> r, OR <-> d, etc.
Meta-analysis: fixed- and random-effects pooling, I-squared, prediction intervals
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Hedges, L. V., & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic Press.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. Wiley.
Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. JEBS, 25(2), 101–132.
- class morie.effect_sizes.EffectSizeResult(measure, estimate, ci_lower=None, ci_upper=None, se=None, n=None, extra=<factory>)[source]¶
Bases: object
Standardised result for every effect-size calculation.
- Parameters:
measure (str) – Name of the effect-size statistic.
estimate (float) – Point estimate.
ci_lower (float | None) – Lower confidence bound.
ci_upper (float | None) – Upper confidence bound.
se (float | None) – Standard error (analytic or bootstrap).
n (int | None) – Sample size used.
extra (dict) – Additional outputs.
- morie.effect_sizes.cohens_d(x, y, confidence=0.95)[source]¶
Cohen’s d for independent samples (pooled SD denominator).
\[d = \frac{\bar{x} - \bar{y}}{s_p}\]
- Parameters:
x (array-like) – First of two independent samples.
y (array-like) – Second of two independent samples.
confidence (float, default 0.95)
- Return type:
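The pooled SD weights each group's Bessel-corrected variance by its degrees of freedom. A stdlib sketch of the point estimate (the CI carried in EffectSizeResult is omitted):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d with pooled-SD denominator (illustrative helper)."""
    nx, ny = len(x), len(y)
    # pooled standard deviation from the two sample variances
    sp = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / sp
```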
- morie.effect_sizes.hedges_g(x, y, confidence=0.95)[source]¶
Hedges’ g – bias-corrected Cohen’s d.
Applies the exact correction factor \(J = 1 - 3/(4(n_1+n_2-2)-1)\).
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.glass_delta(x, y, control='y', confidence=0.95)[source]¶
Glass’s delta – uses the control group SD as denominator.
- Parameters:
- Return type:
- morie.effect_sizes.cles(x, y, confidence=0.95)[source]¶
Common Language Effect Size (probability of superiority).
Estimates \(P(X > Y)\) for randomly drawn observations from each group.
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.r_effect_size(x, y, confidence=0.95)[source]¶
Pearson r as an effect size with Fisher z CI.
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.r_squared(x, y)[source]¶
Coefficient of determination R-squared.
- Parameters:
x (array-like)
y (array-like)
- Return type:
- morie.effect_sizes.eta_squared(ss_effect, ss_total)[source]¶
Eta-squared from ANOVA sums of squares.
\(\eta^2 = SS_{effect} / SS_{total}\)
- Parameters:
- Return type:
- morie.effect_sizes.partial_eta_squared(ss_effect, ss_error)[source]¶
Partial eta-squared.
\(\eta^2_p = SS_{effect} / (SS_{effect} + SS_{error})\)
- Parameters:
- Return type:
- morie.effect_sizes.omega_squared(ss_effect, ss_total, df_effect, ms_error)[source]¶
Omega-squared – less biased than eta-squared.
\(\omega^2 = (SS_{effect} - df_{effect} \cdot MS_{error}) / (SS_{total} + MS_{error})\)
- Parameters:
- Return type:
- morie.effect_sizes.epsilon_squared(ss_effect, ss_total, df_effect, ms_error)[source]¶
Epsilon-squared (Kelley, 1935).
\(\varepsilon^2 = (SS_{effect} - df_{effect} \cdot MS_{error}) / SS_{total}\)
- Parameters:
- Return type:
- morie.effect_sizes.odds_ratio(a, b, c, d, confidence=0.95)[source]¶
Odds ratio for a 2x2 table [[a, b], [c, d]].
\(OR = (a \cdot d) / (b \cdot c)\)
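A standard choice for the CI is the Woolf (log) interval, with standard error sqrt(1/a + 1/b + 1/c + 1/d) on the log-OR scale; whether morie uses exactly this interval is an assumption. A stdlib sketch:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """Odds ratio with a Woolf (log-scale) CI (illustrative helper)."""
    or_ = (a * d) / (b * c)
    # standard error of log(OR) from cell counts
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)
```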
- morie.effect_sizes.risk_ratio(a, b, c, d, confidence=0.95)[source]¶
Risk ratio (relative risk) for a 2x2 table.
\(RR = [a/(a+b)] / [c/(c+d)]\)
- morie.effect_sizes.risk_difference(a, b, c, d, confidence=0.95)[source]¶
Risk difference (attributable risk) for a 2x2 table.
\(RD = a/(a+b) - c/(c+d)\)
- morie.effect_sizes.number_needed_to_treat(a, b, c, d, confidence=0.95)[source]¶
Number needed to treat (NNT).
\(NNT = 1 / |RD|\)
- morie.effect_sizes.number_needed_to_harm(a, b, c, d, confidence=0.95)[source]¶
Number needed to harm (NNH) – same as NNT but with reversed sign convention.
- morie.effect_sizes.rate_ratio(events1, person_time1, events2, person_time2, confidence=0.95)[source]¶
Incidence rate ratio.
\(IRR = (e_1/PT_1) / (e_2/PT_2)\)
- Parameters:
- Return type:
- morie.effect_sizes.incidence_rate_difference(events1, person_time1, events2, person_time2, confidence=0.95)[source]¶
Incidence rate difference.
\(IRD = (e_1/PT_1) - (e_2/PT_2)\)
- morie.effect_sizes.cohens_w(observed, expected=None)[source]¶
Cohen’s w for chi-squared.
\(w = \sqrt{\chi^2 / N}\)
- Parameters:
observed (array-like) – Observed frequencies.
expected (array-like or None) – Expected frequencies (uniform if None).
- Return type:
- morie.effect_sizes.cohens_f(eta2)[source]¶
Cohen’s f from eta-squared.
\(f = \sqrt{\eta^2 / (1 - \eta^2)}\)
- Parameters:
eta2 (float) – Eta-squared value.
- Return type:
- morie.effect_sizes.cramers_v(contingency_table, confidence=0.95)[source]¶
Cramer’s V for a contingency table.
\(V = \sqrt{\chi^2 / (N \cdot (k - 1))}\) where \(k = \min(r, c)\).
- Parameters:
contingency_table (array-like or DataFrame)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.phi_coefficient(contingency_table)[source]¶
Phi coefficient for a 2x2 contingency table.
\(\phi = \sqrt{\chi^2 / N}\)
- Parameters:
contingency_table (array-like)
- Return type:
- morie.effect_sizes.rank_biserial_correlation(x, y, confidence=0.95)[source]¶
Rank-biserial correlation (matched rank version).
\(r = 1 - 2U / (n_1 n_2)\) where U is the Mann–Whitney statistic.
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.cliffs_delta(x, y, confidence=0.95)[source]¶
Cliff’s delta (non-parametric effect size).
\(\delta = (\#(x_i > y_j) - \#(x_i < y_j)) / (n_x n_y)\)
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
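The point estimate counts concordant and discordant pairs directly, which is O(n_x n_y) as written. A stdlib sketch without the confidence interval (helper name illustrative):

```python
def cliffs_delta(x, y):
    """Cliff's delta point estimate (illustrative helper)."""
    # pairs where x exceeds y, minus pairs where y exceeds x; ties count zero
    gt = sum(xi > yj for xi in x for yj in y)
    lt = sum(xi < yj for xi in x for yj in y)
    return (gt - lt) / (len(x) * len(y))
```

Values range from -1 (complete inferiority) through 0 (full overlap) to +1 (complete superiority).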
- morie.effect_sizes.vargha_delaney_a(x, y, confidence=0.95)[source]¶
Vargha–Delaney A statistic.
\(A = U / (n_1 n_2)\) where U is the Mann–Whitney statistic (probability that a randomly chosen x exceeds a randomly chosen y).
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.standardized_coefficients(X, y)[source]¶
Compute standardised regression coefficients (beta weights).
Standardises X and y to zero mean and unit variance before OLS.
- Parameters:
X (array-like or DataFrame) – Predictor matrix (n x p).
y (array-like or Series) – Outcome variable.
- Returns:
Columns:
variable, beta, se, t, p_value.
- Return type:
DataFrame
- morie.effect_sizes.coefficient_of_variation(x)[source]¶
Coefficient of variation (CV).
\(CV = s / \bar{x}\)
- Parameters:
x (array-like)
- Return type:
- morie.effect_sizes.variance_ratio(x, y, confidence=0.95)[source]¶
Variance ratio (F-test for equality of variances).
- Parameters:
x (array-like)
y (array-like)
confidence (float, default 0.95)
- Return type:
- morie.effect_sizes.d_to_r(d, n1=None, n2=None)[source]¶
Convert Cohen’s d to Pearson r.
\(r = d / \sqrt{d^2 + a}\) where \(a = (n_1 + n_2)^2 / (n_1 n_2)\) or a = 4 when sample sizes are unknown.
- morie.effect_sizes.or_to_d(or_val)[source]¶
Convert odds ratio to Cohen’s d (Hasselblad & Hedges, 1995).
\(d = \log(OR) \cdot \sqrt{3} / \pi\)
- morie.effect_sizes.d_to_or(d)[source]¶
Convert Cohen’s d to an odds ratio.
\(OR = \exp(d \pi / \sqrt{3})\)
- morie.effect_sizes.d_to_nnt(d, base_rate=0.5)[source]¶
Convert Cohen’s d to NNT given a base rate.
Uses the Kraemer & Kupfer (2006) formula: \(NNT = 1 / (\Phi(d/2 + \Phi^{-1}(CER)) - CER)\)
- morie.effect_sizes.fixed_effects_meta(estimates, standard_errors, confidence=0.95)[source]¶
Fixed-effects (inverse-variance weighted) meta-analytic pooling.
- Parameters:
estimates (array-like) – Effect-size estimates from k studies.
standard_errors (array-like) – Standard errors of the estimates.
confidence (float, default 0.95)
- Return type:
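Each study is weighted by the inverse of its squared standard error, and the pooled SE is the square root of the reciprocal of the total weight. A stdlib sketch (helper name and tuple return are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def fixed_effects_meta(estimates, standard_errors, confidence=0.95):
    """Inverse-variance (fixed-effects) pooled estimate with CI."""
    # inverse-variance weights
    w = [1 / se ** 2 for se in standard_errors]
    pooled = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    se = sqrt(1 / sum(w))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return pooled, pooled - z * se, pooled + z * se
```

With equal standard errors the pooled estimate reduces to the simple mean of the study estimates.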
- morie.effect_sizes.random_effects_meta(estimates, standard_errors, confidence=0.95, method='DL')[source]¶
Random-effects meta-analytic pooling.
- Parameters:
- Return type:
References
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177–188.
- morie.effect_sizes.i_squared(estimates, standard_errors)[source]¶
Compute Higgins’ I-squared heterogeneity statistic.
\(I^2 = \max(0, (Q - (k-1))/Q) \times 100\)
- Parameters:
estimates (array-like)
standard_errors (array-like)
- Returns:
Percentage (0–100).
- Return type:
- morie.effect_sizes.prediction_interval(estimates, standard_errors, confidence=0.95)[source]¶
Prediction interval for a new study from a random-effects meta-analysis.
Datasets¶
Dataset-agnostic analysis engine: ingest, profile, and plan analyses for any dataset.
This module provides tools to ingest datasets in multiple formats (CSV, TSV, Excel, Parquet, JSON) without prior schema knowledge, infer levels of measurement (NOIR: Nominal, Ordinal, Interval, Ratio) for each column, auto-detect key variable roles (treatment, outcome, weight, stratum, cluster), produce comprehensive dataset profiles, and suggest analysis plans.
The measurement-level inference follows Stevens’ (1946) taxonomy as refined for applied statistical practice:
Nominal — unordered categories (e.g., province, gender)
Ordinal — ordered categories (e.g., Likert scales, education level)
Interval — numeric with no true zero (e.g., temperature, year, index)
Ratio — numeric with true zero (e.g., count, weight, income, eBAC)
References
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.1080/00031305.1993.10475938
- class morie.dataset.MeasurementLevel(*values)[source]¶
Bases: Enum
Stevens’ levels of measurement (NOIR taxonomy).
Level | Properties | Example variables
NOMINAL | Identity only (= / !=) | Province, gender, treatment indicator
ORDINAL | Identity + order (< / >) | Likert scale, education level, income bracket
INTERVAL | Identity + order + equal intervals, no true zero | Temperature (C), year, standardized score
RATIO | Identity + order + equal intervals + true zero | Count, weight, income, eBAC
References
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.
- NOMINAL = 'nominal'¶
- ORDINAL = 'ordinal'¶
- INTERVAL = 'interval'¶
- RATIO = 'ratio'¶
- class morie.dataset.ColumnProfile(name, dtype, level, n_unique, missing_pct, is_binary, is_constant, suggested_role, summary_stats=<factory>)[source]¶
Bases: object
Profile of a single column in the dataset.
- Variables:
name (str) – Column name.
dtype (str) – Pandas dtype as a string.
level (MeasurementLevel) – Inferred Stevens measurement level.
n_unique (int) – Number of distinct non-null values.
missing_pct (float) – Percentage of missing values (0.0–100.0).
is_binary (bool) – True if column has exactly two unique non-null values.
is_constant (bool) – True if column has at most one unique non-null value.
suggested_role (str) – One of: "treatment", "outcome", "covariate", "weight", "id", "stratum", "cluster", or "unknown".
summary_stats (dict) – For numeric columns: {mean, std, min, q25, median, q75, max}. For categorical columns: {top_counts} as a dict of value -> count.
- Parameters:
- level: MeasurementLevel¶
- class morie.dataset.DatasetProfile(n_rows, n_cols, columns, suggested_treatment=None, suggested_outcome=None, suggested_weights=None)[source]¶
Bases: object
Full profile of a dataset including all column profiles and summary diagnostics.
- Variables:
n_rows (int) – Number of rows.
n_cols (int) – Number of columns.
columns (dict[str, ColumnProfile]) – Per-column profiles keyed by column name.
suggested_treatment (str | None) – Best-guess treatment column, or None if not detected.
suggested_outcome (str | None) – Best-guess outcome column, or None if not detected.
suggested_weights (str | None) – Best-guess survey-weight column, or None if not detected.
- Parameters:
- columns: dict[str, ColumnProfile]¶
- morie.dataset.infer_measurement_level(series, *, ordinal_threshold=10)[source]¶
Infer the Stevens measurement level (NOIR) for a single column.
Detection rules (applied in order):
dtype object / category + ordinal name heuristic + n_unique <= ordinal_threshold: ORDINAL
dtype object / category (default): NOMINAL
dtype bool: NOMINAL
dtype int/float + is_binary (n_unique <= 2): NOMINAL (binary)
dtype int/float + n_unique <= 20 + ordinal name heuristic: ORDINAL
dtype float + interval name heuristic: INTERVAL
dtype float: RATIO
dtype int + min >= 0: RATIO
dtype int + min < 0: INTERVAL
datetime: INTERVAL
- Parameters:
series (pandas.Series) – Column data to classify.
ordinal_threshold (int, optional) – Maximum number of unique values for a categorical column to be considered ordinal (default 10).
- Returns:
Inferred measurement level.
- Return type:
Examples
>>> import pandas as pd
>>> infer_measurement_level(pd.Series(["M", "F", "M", "F"]))
<MeasurementLevel.NOMINAL: 'nominal'>
>>> infer_measurement_level(pd.Series([1, 2, 3, 4, 5], name="likert_q1"))
<MeasurementLevel.ORDINAL: 'ordinal'>
- morie.dataset.profile_dataset(df, *, hint_treatment=None, hint_outcome=None, hint_weights=None, ordinal_threshold=10, binary_threshold=2)[source]¶
Fully profile a DataFrame without prior schema knowledge.
Walks every column, infers its measurement level and epidemiological role, computes summary statistics, and auto-detects likely treatment, outcome, and survey-weight columns. User-supplied hints override heuristic detection.
- Parameters:
df (pandas.DataFrame) – Input data.
hint_treatment (str, optional) – If provided, use this column as the treatment indicator.
hint_outcome (str, optional) – If provided, use this column as the outcome.
hint_weights (str, optional) – If provided, use this column as the survey weight.
ordinal_threshold (int, optional) – Maximum unique values for categorical columns to be classified as ordinal (default 10).
binary_threshold (int, optional) – Maximum unique values for a column to be considered binary (default 2).
- Returns:
Complete dataset profile.
- Return type:
- Raises:
TypeError – If df is not a pandas DataFrame.
ValueError – If df has zero rows or zero columns.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "treatment": [0, 1, 0, 1],
...     "outcome": [1.2, 3.4, 2.1, 4.5],
...     "age": [25, 30, 35, 40],
... })
>>> profile = profile_dataset(df)
>>> profile.n_rows
4
>>> profile.suggested_treatment
'treatment'
- morie.dataset.load_dataset(path, *, encoding='utf-8', **read_kwargs)[source]¶
Load a dataset from CSV, TSV, Excel, Parquet, or JSON file.
File format is detected from the extension. Supported extensions:
.csv, .tsv, .xlsx/.xls, .parquet/.pq, .json/.jsonl.
- Parameters:
- Returns:
Loaded dataset.
- Return type:
- Raises:
FileNotFoundError – If path does not exist.
ValueError – If the file extension is not recognized.
Examples
>>> df = load_dataset("data.csv")
>>> df = load_dataset("survey.xlsx", sheet_name="wave1")
- morie.dataset.suggest_analysis_plan(profile)[source]¶
Suggest an ordered analysis plan based on the detected variable types.
Uses the inferred measurement levels, binary indicators, and detected treatment/outcome/weight columns to recommend epidemiological analyses.
- Parameters:
profile (DatasetProfile) – Output from profile_dataset().
- Returns:
Ordered list of suggested analyses. Each dict contains:
analysis (str): method identifier (e.g., "ipw_ate")
rationale (str): plain-English reason this analysis is suggested
required_vars (dict): mapping of role -> column name(s)
- Return type:
Examples
>>> plan = suggest_analysis_plan(profile)
>>> plan[0]["analysis"]
'descriptive_profile'
- morie.data.morie_db()[source]¶
Return path to morie.db – checks package-bundled location first, then cache.
Package-bundled DB ships with the installed package (via LFS/setuptools). Cache location is used for development and is gitignored.
- Return type:
- morie.data.cache_connect(db_path=None)[source]¶
Open (or create) the MORIE SQLite cache database.
- Parameters:
- Return type:
- morie.data.cache_store(df, table, db_path=None)[source]¶
Write a DataFrame to the SQLite cache, replacing any existing table.
- morie.data.cache_load(table, db_path=None)[source]¶
Load a table from the SQLite cache. Returns None if not cached.
- morie.data.fetch_ckan_to_cache(dataset_key='cpads', limit=32000, db_path=None, timeout=60)[source]¶
Fetch a dataset from CKAN and store it in the SQLite cache.
- Parameters:
- Returns:
The fetched (and optionally canonicalized) DataFrame.
- Return type:
pd.DataFrame
- morie.data.load_cpads(db_path=None, timeout=60)[source]¶
Load CPADS data: try local files, then cache, then CKAN API.
Resolution order:
1. Local RDS (via R bridge) or CSV files in standard locations
2. SQLite cache (data/cache/morie.db)
3. CKAN API fetch -> cache -> return
- morie.data.load_dataset(key, *, db_path=None, timeout=60)[source]¶
Load a dataset by catalog key.
Resolution order:
1. Built-in morie.db (ships with package)
2. User cache data/cache/morie.db
3. Local file (ingest to cache on the fly)
4. CKAN API (if resource ID available)
5. Error
Supports fuzzy matching: load_dataset("cpads") resolves to ocp21.
- morie.data.list_datasets(db_path=None)[source]¶
List all datasets with their cache status.
Returns a list of dicts with keys: key, name, source, survey, year, type, cached (bool), rows (int or None).
When db_path is explicitly provided, only that database is queried (the built-in DB is skipped). This keeps unit tests deterministic.
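A typical pattern is to filter the `list_datasets()` records before loading. The sketch below uses mock dicts that follow the documented keys (key, name, source, survey, year, type, cached, rows); no morie import is made.

```python
# Mock records following the documented list_datasets() keys.
datasets = [
    {"key": "ocp21", "name": "CPADS 2021", "source": "CKAN",
     "survey": "CPADS", "year": 2021, "type": "survey",
     "cached": True, "rows": 32000},
    {"key": "nyc311", "name": "NYC 311", "source": "BigQuery mirror",
     "survey": None, "year": 2023, "type": "admin",
     "cached": False, "rows": None},
]

# Keys of datasets already present in the SQLite cache.
cached_keys = [d["key"] for d in datasets if d["cached"]]
print(cached_keys)  # ['ocp21']
```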
- class morie.data.DatasetRegistry(data_dir='data/datasets/')[source]¶
Bases: object
A registry to manage, catalog, and load secure epidemiological datasets.
- Parameters:
data_dir (str)
- __init__(data_dir='data/datasets/')[source]¶
Initialize the dataset registry.
- Parameters:
data_dir (str) – The root file directory containing secure datasets, defaults to “data/datasets/”.
- fetch_ckan_records(name, *, limit=5, query=None, timeout=30)[source]¶
Fetch records for a dataset backed by a CKAN DataStore resource.
- fetch_ckan_dataframe(name, *, limit=100, query=None, timeout=30)[source]¶
Fetch a CKAN result set and return the records as a DataFrame.
- register_local_cpads(path, *, name='cpads_local')[source]¶
Register a user-provided local CPADS file.
- validate_cpads_frame(frame, *, strict=True)[source]¶
Validate a DataFrame against the canonical CPADS variable contract.
- load(name)[source]¶
Load a registered dataset securely.
- Parameters:
name (str) – Unique identifier for the dataset in the catalog.
- Raises:
ValueError – If the dataset name is not in the registry catalog.
FileNotFoundError – If the underlying file could not be queried or synced locally.
NotImplementedError – If the specified file format is not supported.
- Returns:
A pandas DataFrame containing the dataset content.
- Return type:
BigQuery-derived dataset analysis over a local SQLite catalog.
This module gives the morie analytics surface a typed entry point into a SQLite-mirrored slice of public BigQuery datasets – useful for local, network-free exploration and CI work.
Two access paths:
Local – point at the on-disk SQLite file directly. Fast, no network. Works against the bundled morie_datasets.db plus any user-supplied SQLite mirrors of public BigQuery slices.
Remote – a SQL-over-HTTP endpoint configured via the MORIE_REMOTE_URL env var. Useful for exploration without copying GBs of SQLite locally.
The high-level helpers (bq_describe, bq_columns, bq_summary,
bq_correlate) auto-pick local first, fall through to remote when
configured.
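The local-first/remote-fallback resolution can be sketched as below. This is an illustrative stand-in, not the library's internals: the function name `resolve_backend` and its return shape are assumptions; only the MORIE_REMOTE_URL env var comes from the docs above.

```python
import os

# Hypothetical resolver illustrating the documented precedence:
# local SQLite mirror first, then the MORIE_REMOTE_URL endpoint.
def resolve_backend(db_path, env=os.environ):
    if os.path.exists(db_path):
        return ("local", db_path)
    remote = env.get("MORIE_REMOTE_URL")
    if remote:
        return ("remote", remote)
    raise RuntimeError("no local mirror and MORIE_REMOTE_URL not set")

print(resolve_backend("/nonexistent.db",
                      env={"MORIE_REMOTE_URL": "https://example.org/sql"}))
```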
- class morie.bq.TableInfo(db, table, row_count, columns, sample)[source]¶
Bases: object
High-level summary of one table inside a BigQuery-mirror database.
- Parameters:
- class morie.bq.ColumnSummary(db, table, column, n, n_missing, mean, median, std, min, max)[source]¶
Bases: object
Univariate descriptive stats for one numeric column.
- Parameters:
- morie.bq.bq_query(db, sql)[source]¶
Run an arbitrary SQL query against the BigQuery-mirror database db.
Auto-picks local SQLite if available; falls through to the configured remote endpoint otherwise. Returns a list of row dicts.
Examples
>>> rows = bq_query("nyc_311_historic", "SELECT borough, count(*) FROM nyc_311_historic GROUP BY 1")
- morie.bq.bq_columns(db, table=None)[source]¶
Schema for table (defaults to the first user table in the db).
OTIS — Ontario Tracking and Information System¶
morie.otis – Ontario Restrictive Confinement (OTIS) analysis.
Correctional/sociolegal analysis for Ontario placement data (2023-2025). Orchestrates existing MORIE causal inference modules for the correctional domain: DML-IRM, propensity score matching, AIPW, mixed-effects models.
Short-name API (≤6 chars):
- rplace() – Regional placement analysis by age/sex/year
- astcmb() – Alert-state combination encoding (3 binary alerts -> 8 states -> complexity)
- volat() – Regional volatility/movement metric
- rctrnd() – Restrictive confinement trends over time
- otdesc() – Full OTIS descriptive statistics suite
- otdml() – Run DML IRM (ATE/ATT) on OTIS data
- Data:
data/cache/correctional_stats_report_environment1b.RData (via R)
data/cache/dt_expanded.rds (1.9M rows × 13 cols)
References
Ontario Ministry of the Solicitor General (2025). Restrictive Confinement Detailed Dataset. data.ontario.ca.
Jahn v. Ontario (2020). Settlement Agreement – Inmate Data Disclosure.
Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal.
- class morie.otis.RplRes(counts, props, year, sex=None)[source]¶
Bases: object
Regional placement result.
- class morie.otis.OtDmlR(ate, ate_se, ate_pval, att, att_se, att_pval, n, method='IRM')[source]¶
Bases: object
OTIS DML result.
- Parameters:
- morie.otis.rplace(df, year, sex=None, *, id_col='unique_individual_id', age_col='age_category', region_col='region_at_time_of_placement', year_col='end_fiscal_year', gender_col='gender')[source]¶
Regional placement analysis by age group.
Computes count matrix S and proportion matrix P of placements across regions, stratified by age group and optionally by sex.
- morie.otis.astcmb(df, *, alert_cols=('mental_health_alert', 'suicide_risk_alert', 'suicide_watch_alert'), id_col='unique_individual_id', year_col='end_fiscal_year', np_col='number_of_placements')[source]¶
Alert-state combination encoding.
Encodes 3 binary alert indicators into 8 possible combinations (a1-a8) and computes a complexity index (count of distinct combinations per person-year).
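The encoding can be sketched by hand: treat the three alert bits as a binary number 0–7, label the states a1–a8, and count distinct states per person-year. This is an illustrative re-implementation under the documented column defaults, not the astcmb() source.

```python
import pandas as pd

# Toy frame using the documented default column names.
df = pd.DataFrame({
    "unique_individual_id": [1, 1, 2],
    "end_fiscal_year": [2023, 2023, 2023],
    "mental_health_alert": [0, 1, 1],
    "suicide_risk_alert": [0, 0, 1],
    "suicide_watch_alert": [0, 1, 1],
})

alerts = ["mental_health_alert", "suicide_risk_alert", "suicide_watch_alert"]
# Three bits -> integer 0..7 -> labels a1..a8.
state_num = df[alerts[0]] * 4 + df[alerts[1]] * 2 + df[alerts[2]]
df["combo"] = "a" + (state_num + 1).astype(str)

# Complexity index: distinct alert-state combinations per person-year.
complexity = (
    df.groupby(["unique_individual_id", "end_fiscal_year"])["combo"]
    .nunique()
)
print(complexity.to_dict())  # {(1, 2023): 2, (2, 2023): 1}
```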
- morie.otis.volat(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', regA_col='region_at_time_of_placement', regB_col='region_most_recent_placement')[source]¶
Regional volatility/movement metric.
Counts the number of distinct regions an individual was placed in (combining initial and most recent placement regions).
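A minimal sketch of that metric, assuming the documented default column names: pool the initial and most-recent region columns into long form, then count distinct regions per individual. Not the volat() source.

```python
import pandas as pd

df = pd.DataFrame({
    "unique_individual_id": [1, 1, 2],
    "region_at_time_of_placement": ["Toronto", "Central", "North"],
    "region_most_recent_placement": ["Toronto", "East", "North"],
})

# Stack both region columns into one long column per individual.
long = pd.concat([
    df[["unique_individual_id", "region_at_time_of_placement"]]
      .rename(columns={"region_at_time_of_placement": "region"}),
    df[["unique_individual_id", "region_most_recent_placement"]]
      .rename(columns={"region_most_recent_placement": "region"}),
])
volatility = long.groupby("unique_individual_id")["region"].nunique()
print(volatility.to_dict())  # {1: 3, 2: 1}
```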
- morie.otis.rctrnd(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', region_col='region_at_time_of_placement')[source]¶
Restrictive confinement trends over time.
Returns per-year counts by region.
- morie.otis.otdesc(df, *, id_col='unique_individual_id', year_col='end_fiscal_year')[source]¶
Full OTIS descriptive statistics suite.
- morie.otis.otdml(df, outcome='Y', treatment='D', covariates=None, *, cluster=None, n_folds=3, seed=123)[source]¶
Run DML IRM (ATE/ATT) on OTIS data.
Wraps morie.effects.estimate_irm() for the correctional domain.
- Parameters:
df (DataFrame) – Must contain outcome, treatment, and covariate columns.
outcome (str) – Outcome variable name (default “Y” = suicide risk).
treatment (str) – Treatment variable name (default “D” = mental health alert).
covariates (list of str, optional) – Covariate column names. Defaults to standard OTIS set.
cluster (str, optional) – Cluster variable for clustered standard errors.
n_folds (int)
seed (int)
- Return type:
- morie.otis.regional_placement(df, year, sex=None, *, id_col='unique_individual_id', age_col='age_category', region_col='region_at_time_of_placement', year_col='end_fiscal_year', gender_col='gender')¶
Regional placement analysis by age group.
Computes count matrix S and proportion matrix P of placements across regions, stratified by age group and optionally by sex.
- morie.otis.alert_state_combo(df, *, alert_cols=('mental_health_alert', 'suicide_risk_alert', 'suicide_watch_alert'), id_col='unique_individual_id', year_col='end_fiscal_year', np_col='number_of_placements')¶
Alert-state combination encoding.
Encodes 3 binary alert indicators into 8 possible combinations (a1-a8) and computes a complexity index (count of distinct combinations per person-year).
- morie.otis.volatility(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', regA_col='region_at_time_of_placement', regB_col='region_most_recent_placement')¶
Regional volatility/movement metric.
Counts the number of distinct regions an individual was placed in (combining initial and most recent placement regions).
- morie.otis.rc_trends(df, *, id_col='unique_individual_id', year_col='end_fiscal_year', region_col='region_at_time_of_placement')¶
Restrictive confinement trends over time.
Returns per-year counts by region.
- morie.otis.otis_descriptives(df, *, id_col='unique_individual_id', year_col='end_fiscal_year')¶
Full OTIS descriptive statistics suite.
- morie.otis.otis_dml(df, outcome='Y', treatment='D', covariates=None, *, cluster=None, n_folds=3, seed=123)¶
Run DML IRM (ATE/ATT) on OTIS data.
Wraps morie.effects.estimate_irm() for the correctional domain.
- Parameters:
df (DataFrame) – Must contain outcome, treatment, and covariate columns.
outcome (str) – Outcome variable name (default “Y” = suicide risk).
treatment (str) – Treatment variable name (default “D” = mental health alert).
covariates (list of str, optional) – Covariate column names. Defaults to standard OTIS set.
cluster (str, optional) – Cluster variable for clustered standard errors.
n_folds (int)
seed (int)
- Return type:
OTIS analysis surfaces – RichResult wrappers around morie.otis primitives.
morie.otis provides 6 result-emitting callables that return specialized dataclasses (RplRes / AstRes / VolRes / OtDmlR / etc.). This module wraps each in a RichResult so OTIS analyses get the same multi-section output that Welch / Mann–Whitney / SIU analyses already emit.
OTIS = anonymized Ontario MCSCS correctional placement records (open data). The canonical 76,934-row table lives in the repo cache at data/cache/correctional_stats_report_environment.RData as the R object df. Use load_otis() (CSV mirror) or call the Rscript exporter at scripts/export_otis_csv.R to refresh the CSV.
- Public API:
rplace(df, year, sex=None) -> RichResult
astcmb(df) -> RichResult
volat(df) -> RichResult
rctrnd(df) -> RichResult
otdesc(df, year=None, sex=None) -> RichResult
otdml(df, treatment, outcome, …) -> RichResult
all_analyses(df, year, *, out_dir=None) -> dict[str, RichResult]
- morie.otis_analyze.load_otis(csv_path=None)[source]¶
Load the canonical OTIS DataFrame.
Defaults to data/cache/otis_main.csv (Rscript-exported from the correctional_stats_report_environment.RData fixture). Pass a custom csv_path to load a refreshed snapshot.
- Returns a 10-column DataFrame:
end_fiscal_year, unique_individual_id, region_at_time_of_placement, region_most_recent_placement, gender, age_category, mental_health_alert, suicide_risk_alert, suicide_watch_alert, number_of_placements
- morie.otis_analyze.rplace(df, year, sex=None, **kw)[source]¶
Regional placement analysis as RichResult.
- Parameters:
- Return type:
- morie.otis_analyze.astcmb(df, **kw)[source]¶
Alert-state combination analysis as RichResult.
- Parameters:
- Return type:
- morie.otis_analyze.volat(df, **kw)[source]¶
Regional volatility metric as RichResult.
- Parameters:
- Return type:
- morie.otis_analyze.rctrnd(df, **kw)[source]¶
Restrictive-confinement trends as RichResult.
- Parameters:
- Return type:
- morie.otis_analyze.otdesc(df, **kw)[source]¶
Full OTIS descriptive suite as RichResult.
The underlying morie.otis.otdesc does not accept year/sex filters – apply those upstream via df = df[df.end_fiscal_year == year] if needed.
- Parameters:
- Return type:
- morie.otis_analyze.otdml(df, *, treatment, outcome, covariates, **kw)[source]¶
OTIS DML IRM (ATE/ATT) as RichResult.
- morie.otis_analyze.all_analyses(df, year, *, sex=None, out_dir=None)[source]¶
Run rplace / astcmb / volat / rctrnd / otdesc on df and write to disk.
otdml is excluded from the bundle because it requires the user to specify (treatment, outcome, covariates) – call it separately.
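The bundling pattern behind all_analyses() can be sketched as a dict of zero-config analyzers run over the same frame, with otdml deliberately left out because it needs user-specified roles. The analyzer functions here are stand-ins, not the real morie.otis_analyze callables.

```python
# Stand-in analyzers: each takes the frame and returns a result record.
def rplace_stub(df):
    return {"analysis": "rplace", "n": len(df)}

def otdesc_stub(df):
    return {"analysis": "otdesc", "n": len(df)}

# otdml is intentionally absent: it requires (treatment, outcome, covariates).
BUNDLE = {"rplace": rplace_stub, "otdesc": otdesc_stub}

def run_bundle(df):
    return {name: fn(df) for name, fn in BUNDLE.items()}

results = run_bundle([1, 2, 3])
print(sorted(results))  # ['otdesc', 'rplace']
```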
morie.otis_all_analyze – RichResult-emitting analyses for ALL 28 OTIS datasets.
This module is the comprehensive OTIS analysis surface. It pairs with morie.otis_datasets (the registry/loader) and extends the narrower morie.otis_analyze (which only covered the restricted- confinement subset).
- For each dataset id (b01..d07), this module exposes:
analyze_<id>(df=None) -> RichResult
If df is None, the analysis loads from data/datasets/OTIS/<csv_filename>.
- Plus a top-level driver:
analyze_all(*, out_dir=None) -> dict[str, RichResult]
which runs every dataset’s analysis, writes per-dataset .txt and .json under data/manifest/outputs/otis/, and returns the in-memory results.
- Plus the Ruhela-formulation (RF) high-level entry points:
- analyze_a01_ruhela_formulations(df=None) – full DLRM on a01 – alias: analyze_a01_dlrm
- analyze_b01_ruhela_formulations(df=None) – full DLRM on b01 – alias: analyze_b01_dlrm
- analyze_b02_ruhela_formulations(df=None) – gender -> seg days on b02 – alias: analyze_b02_dlrm
analyze_a01_with_csi_context() – a01 causal + Toronto CSI
analyze_c_doob_chi2() – χ² + Cramér's V on c-series – homage to Doob's chi-square tradition
analyze_d_doob_chi2() – d-series + Alert χ²
- Naming abbreviations:
RF – Ruhela formulation
RDF – Ruhela Dual Formulation (RF + Naive-arm sensitivity)
DLRM – Doob-Levinsky-Ruhela-Medina (methodology attribution)
The DLRM methodology attribution and the wider acknowledgements
(Prof. Jauregui, Prof. Laniyonu) live at the top of
morie.otis_causal’s module docstring.
- morie.otis_all_analyze.analyze_b01(df=None)[source]¶
Person-level segregation placements (76,934 rows).
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02(df=None)[source]¶
Aggregate days in segregation per person per year.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b03(df=None)[source]¶
Segregation placements by alert × institution.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b04(df=None)[source]¶
Placement durations (max/median/mode) by region & gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b05(df=None)[source]¶
Distribution of placements by binned duration.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b06(df=None)[source]¶
Reasons for placement × institution × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b07(df=None)[source]¶
Alerts × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b08(df=None)[source]¶
Durations (median/mode) by institution & gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b09(df=None)[source]¶
Individuals by number of segregation placements × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c01(df=None)[source]¶
Total individuals × custody/RC/seg × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c02(df=None)[source]¶
Individuals in RC/seg by institution.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c03(df=None)[source]¶
Individuals × race × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c04(df=None)[source]¶
Individuals × race × region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c05(df=None)[source]¶
Individuals × religion × region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c06(df=None)[source]¶
Individuals × age × region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c07(df=None)[source]¶
Individuals × alerts × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c08(df=None)[source]¶
Individuals × religion × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c09(df=None)[source]¶
Individuals × age × gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c10(df=None)[source]¶
RC/seg aggregate durations by institution.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c11(df=None)[source]¶
Individuals by aggregate-duration bin.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c12(df=None)[source]¶
RC/seg aggregate durations by region & gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d01(df=None)[source]¶
Person-level deaths in custody.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01(df=None)[source]¶
High-level a01 (Restrictive Confinement Detailed) analysis.
Wraps the full causal pipeline that the OTIS-RC research runs against a01_restrictive_confinement_detailed_dataset.csv:
Person-year aggregation via 8-state alert-combo encoding (morie.otis_causal.make_pair_alert_to_volatility_a01).
MatchIt 1:1 NN PSM with caliper.
IRM-DML with cross-fitted random-forest nuisances.
Multi-way clustered standard errors (id, region, year).
Returns a RichResult with both ATE and ATTE estimates.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]¶
Ruhela formulations (full DLRM) on a01 – the canonical OTIS-RC dataset.
Runs the complete OTIS-RC methodology arc on the author's Ruhela formulation (T = high alert complexity, Y = regional volatility). See _ruhela_formulations_on for the full estimator list.
Attribution¶
The Ruhela formulation is the author's design end-to-end; the supporting-contributor credits (Doob – pointed the author to the dataset; Levinsky – course context, reviewed only the preliminary a01 analysis; Medina – reviewed the formal write-up) are documented in morie.otis_causal's module docstring.
- param df:
a01 DataFrame; if None, otis_datasets.load_otis_dataset('a01').
- type df:
pd.DataFrame, optional
- param cluster_col:
Cluster axis for cluster-robust SE in the IRM-DML estimator (default ‘EndFiscalYear’).
- type cluster_col:
str
- Parameters:
- Return type:
- morie.otis_all_analyze.analyze_a01_dual(*args, **kwargs)[source]¶
Deprecated alias for
analyze_a01_ruhela_formulations.
- morie.otis_all_analyze.analyze_b01_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]¶
Ruhela formulations (full DLRM) on b01 (Segregation Detailed).
Per-placement segregation complement to a01’s per-day RC records. Same Ruhela alert-complexity -> regional-volatility contrast, larger magnitude effect because segregation is a stronger institutional contrast.
See analyze_a01_ruhela_formulations for the full attribution and estimator catalogue.
- Parameters:
- Return type:
- morie.otis_all_analyze.analyze_b02_ruhela_formulations(df=None, *, cluster_col='EndFiscalYear')[source]¶
Ruhela formulation on b02 (Segregation Total Days per individual).
b02 has UniqueIndividual_ID + EndFiscalYear + Gender + Region + Age_Category + TotalAggregatedDays_Segregation but NO alert columns. The canonical alert-complexity formulation is impossible; instead the author’s b02-specific Ruhela formulation tests gender disparity in segregation-day burden:
T = Gender == “Female” indicator Y = TotalAggregatedDays_Segregation (count, person-year) covariates = [Region_MostRecentPlacement, Age_Category, EndFiscalYear]
Same DLRM ensemble as the canonical formulation. Sister formulations on the same b02 data could swap T for region or age.
- Parameters:
- Return type:
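The b02 treatment/outcome construction described above can be sketched directly in pandas. Column names follow the b02 layout quoted in the docstring; the frame itself is toy data.

```python
import pandas as pd

# Toy b02-style frame (real b02 has no alert columns, as noted above).
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "TotalAggregatedDays_Segregation": [12, 30, 5],
    "Age_Category": ["18-24", "50+", "25-34"],
})

# T = Female indicator; Y = segregation-day burden per person-year.
df["T"] = (df["Gender"] == "Female").astype(int)
df["Y"] = df["TotalAggregatedDays_Segregation"]
print(df["T"].tolist())  # [1, 0, 1]
```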
- morie.otis_all_analyze.analyze_ruhela_per_year(df, *, ds_id, treatment, outcome, covariates, year_col='EndFiscalYear', cluster_col='EndFiscalYear', propensity_calibration='none')[source]¶
Per-fiscal-year full DLRM on each Ruhela formulation.
Runs the complete DLRM (IPW + AIPW + g-comp + PSM-NN + PSM-subclass + IRM-DML + match_first + ATC + PLR + SuperLearner) separately on each fiscal year. ~7× the standard per-year IRM-DML runtime. Mirrors a per-year extension of the author's res_by_year.
- morie.otis_all_analyze.analyze_a01_ruhela_per_year(df=None)[source]¶
Per-year full DLRM on Ruhela formulations on a01.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_per_year(df=None)[source]¶
Per-year full DLRM on Ruhela formulations on b01.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b03_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on b03 (Alert × Institution).
T = Alert_Presence indicator -> Y = Number_SegregationPlacements. Year-FE + Region-FE + Alert_Type-FE; Institution as GEE cluster group (24 levels, marginal model with cluster-robust SE).
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b07_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on b07 (alerts × gender summary).
- b07 is a wide table with paired columns:
Number_Segregation_Placements_With_Alert
Number_Segregation_Placements_Without_Alert
We pivot to long form (one row per alert-state × gender × year), then T = with_alert indicator -> Y = count. Year + Gender + Alert_Type fixed effects.
- Parameters:
df (DataFrame | None)
- Return type:
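The wide-to-long pivot described for b07 can be sketched with `pandas.melt`: the paired with/without-alert count columns become one row per alert state, and T is derived from the column name. Toy data; not the analyzer's source.

```python
import pandas as pd

wide = pd.DataFrame({
    "FiscalYear": [2023],
    "Gender": ["Female"],
    "Number_Segregation_Placements_With_Alert": [40],
    "Number_Segregation_Placements_Without_Alert": [10],
})

# One row per alert state × gender × year.
long = wide.melt(
    id_vars=["FiscalYear", "Gender"],
    value_vars=["Number_Segregation_Placements_With_Alert",
                "Number_Segregation_Placements_Without_Alert"],
    var_name="state", value_name="count",
)
# T = with_alert indicator, read off the source column name.
long["with_alert"] = long["state"].str.contains("With_Alert").astype(int)
print(long[["with_alert", "count"]].values.tolist())  # [[1, 40], [0, 10]]
```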
- morie.otis_all_analyze.analyze_c01_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c01 – gender disparity in RC.
c01 has Gender × {InCustody, RestrictiveConfinement, Segregation} counts. T = Female indicator -> Y = NumberIndividuals_RC, Year-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c07_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c07 – alert × gender -> RC count.
Counterpart of the per-row Ruhela formulation on a01/b01 at the aggregate level. Useful triangulation: c07 reports NumberIndividuals_InCustody / RC / Segregation cross-tabulated by Alert_Type × Gender × FiscalYear. The Ruhela formulation here contrasts an indicator of alert presence (vs. no-alert reference) with the number-in-RC count.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b06_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on b06 – disciplinary reason -> seg.
T = (Reason contains “Disciplinary”) -> Y = Number of seg placements. Year-FE + Region-FE + Gender-FE; Institution as GEE cluster group (24 levels) for marginal-model cluster-robust inference.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c03_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c03 – racial disparity in RC.
T = (Race == “Indigenous”) indicator -> Y = NumberIndividuals_RC. Year-FE + Gender-FE. Indigenous vs all-other contrast tests overrepresentation.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c04_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c04 – racial disparity by region.
T = (Race == “Indigenous”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c06_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c06 – age-50+ disparity in RC.
T = (Age_Category contains “50”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c09_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c09 – age-50+ disparity by gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_alt_gender(df=None)[source]¶
Alt-T Ruhela formulation on a01 – gender -> regional volatility.
Same per-row Ruhela cell frame as the canonical formulation, but T = Female indicator -> Y = vm count. Covariates exclude Gender (the variable being treated). Runs the full 10-estimator DLRM ensemble. Tests whether women experience more or fewer regional transfers within a fiscal year of restrictive confinement.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_alt_age(df=None)[source]¶
Alt-T Ruhela formulation on a01 – age 50+ -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_alt_toronto(df=None)[source]¶
Alt-T Ruhela formulation on a01 – Toronto-region -> regional volatility.
T = (Region_AtTimeOfPlacement == “Toronto”) indicator. Tests whether Toronto-region placements produce different volatility than placements outside Toronto, controlling for demographics.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_alt_gender(df=None)[source]¶
Alt-T Ruhela formulation on b01 – gender -> regional volatility.
Same as analyze_a01_ruhela_alt_gender but on b01 (Segregation Detailed) cell frame. Tests whether women experience more or fewer regional transfers within a fiscal year of segregation.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_alt_age(df=None)[source]¶
Alt-T Ruhela formulation on b01 – age 50+ -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_alt_toronto(df=None)[source]¶
Alt-T Ruhela formulation on b01 – Toronto-region -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_subgroup_female(df=None)[source]¶
Canonical a01 Ruhela formulation on Female-only subset.
Tests effect-heterogeneity-by-gender: same T_high_ac -> vm contrast, but only on female cells. Compare ATE here to analyze_a01_ruhela_subgroup_male to see if alert-complexity matters more or less for women than men.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_ruhela_subgroup_male(df=None)[source]¶
Canonical a01 Ruhela formulation on Male-only subset.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_subgroup_female(df=None)[source]¶
Canonical b01 Ruhela formulation on Female-only subset.
Sister to analyze_a01_ruhela_subgroup_female. Tests effect- heterogeneity-by-gender on segregation records.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_ruhela_subgroup_male(df=None)[source]¶
Canonical b01 Ruhela formulation on Male-only subset.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02_ruhela_alt_region(df=None)[source]¶
b02 alt-T: T = Toronto region -> Y = total seg days within FY.
Sister to analyze_b02_ruhela_formulations (T = Female). Tests regional disparity in seg-day burden controlling for gender + age + year.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02_ruhela_alt_age(df=None)[source]¶
b02 alt-T: T = Age 50+ -> Y = total seg days within FY.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c05_ruhela_aggregate(df=None)[source]¶
Aggregate RF on c05 – non-Christian religion -> individuals in RC.
T = (Religion != “Christian” and not “No Religion” and not “Unknown”) indicator. Tests population-level religious-minority overrepresentation in RC, controlling for region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c08_ruhela_aggregate(df=None)[source]¶
Aggregate RF on c08 – non-Christian religion × gender -> RC.
T = (Religion != Christian / no religion / unknown) indicator. Year + Gender FE. Companion to c05 with gender control.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b04_ruhela_aggregate(df=None)[source]¶
Aggregate RF on b04 – gender disparity in median seg duration.
b04 has Region × Gender × Measure (max/median/mode) duration. We filter to Measure == “Median” and test T = Female -> Y = duration.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b08_ruhela_aggregate(df=None)[source]¶
Aggregate RF on b08 – gender disparity in median seg duration by institution.
Companion to b04; b08 adds Institution_AtTimeOfPlacement as a finer-grained spatial dimension. Region-FE; Institution as GEE cluster group (24 levels).
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b09_ruhela_aggregate(df=None)[source]¶
Aggregate RF on b09 – gender disparity in placement-count distribution.
b09 has rows of (Year × NumberPlacements_Segregation × Gender × NumberIndividuals_Segregation). We test T = Female -> Y = NumberIndividuals_Segregation, with NumberPlacements as a categorical FE so we adjust for the bin-cell structure.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c02_ruhela_aggregate(df=None)[source]¶
Aggregate RF on c02 – gender disparity by institution.
Companion to c01 with institution-level granularity. T = Female -> Y = NumberIndividuals_RC. Region + year as fixed effects; Institution as a GEE cluster group (marginal model with cluster-robust SE) – Institution has too many levels (~76) for an FE specification on the 148-row table without collinearity. Mirrors OTIS-RC's glmmTMB nbinom2 clustered approach.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c10_ruhela_aggregate(df=None)[source]¶
Aggregate RF on c10 – gender disparity in median RC days.
c10 has Region × Institution × Gender × Measure (Maximum/Median/Mode) duration. Filter to Median; T = Female -> Y = TotalAggregatedDays_RC. Region-FE + Year-FE; Institution as GEE cluster group (25 levels) for cluster-robust marginal inference.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c12_ruhela_aggregate(df=None)[source]¶
Aggregate RF on c12 – gender disparity in median RC days by region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c11_ruhela_aggregate(df=None)[source]¶
Aggregate Ruhela formulation on c11 – long-duration cell concentration.
c11 has rows of (FY × Aggregate_Duration_bin × NumberIndividuals_RC × NumberIndividuals_Seg). 11 duration bins from “1 day” to “Greater than 30 days”. Rather than treat duration as the treatment-FE (it is the categorical variable the data is binned by), we test T = (long-duration bin: ≥16 days) -> Y = NumberIndividuals_RC. Year-FE only.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c01_ruhela_aggregate_region_cluster(df=None)[source]¶
c01 alternative: Female -> RC count, GEE-clustered on year.
Sister analyzer to analyze_c01_ruhela_aggregate. The c01 case has only 6 cells (3 years × 2 genders), so cluster_group=”EndFiscalYear” gives a 3-cluster GEE – extreme small-cluster bias. This variant is shipped for transparency: shows that with so few clusters, Poisson and GEE-Poisson give nearly identical numbers because there’s no clustering structure to exploit.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c04_ruhela_aggregate_region_cluster(df=None)[source]¶
c04 alternative: Indigenous -> RC, GEE-clustered on region.
Sister to analyze_c04_ruhela_aggregate (which uses region as FE). With 5 regions, the GEE-cluster:Region inference treats region as a marginal-model grouping. Useful when wide region heterogeneity in case-mix matters for SE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d02_ruhela_aggregate(df=None)[source]¶
d02 – gender -> custodial death count, with year FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d03_ruhela_aggregate(df=None)[source]¶
d03 – Indigenous -> custodial death count, with year FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d04_ruhela_aggregate(df=None)[source]¶
d04 – non-majority religion -> custodial death count.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d05_ruhela_aggregate(df=None)[source]¶
d05 – age 50+ -> custodial death count.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_ruhela_grid()[source]¶
One-page Ruhela formulations grid summary (24 aggregates).
Runs every aggregate Ruhela formulation analyzer + the region-cluster GEE variants, and presents a single comparison table.
- Return type:
- morie.otis_all_analyze.analyze_ruhela_master(*, include_per_row=False)[source]¶
Paper-ready master report – every Ruhela formulation in one RichResult.
- Sections:
Aggregate Ruhela formulations (24 datasets) – IRR comparison
Per-row Ruhela formulations on a01/b01/b02 – primary IRM-DML ATE (only when
include_per_row=True; ~5-7 min runtime)
Doob chi-square family on c-series + d-series
For thesis / paper writing.
- Parameters:
include_per_row (bool) – If True, also runs the slow per-row RFs on a01/b01/b02 (full 10-estimator DLRM). Default False – runs in ~1 second from cached aggregates only.
- Return type:
- morie.otis_all_analyze.analyze_b02_dlrm(*args, **kwargs)[source]¶
DLRM short alias for
analyze_b02_ruhela_formulations.
- morie.otis_all_analyze.analyze_a01_dlrm(*args, **kwargs)[source]¶
DLRM short alias for
analyze_a01_ruhela_formulations.
- morie.otis_all_analyze.analyze_b01_dlrm(*args, **kwargs)[source]¶
DLRM short alias for
analyze_b01_ruhela_formulations.
- morie.otis_all_analyze.analyze_b05_mandela_classification(df=None)[source]¶
Mandela-RF on b05 – per-placement Mandela classification by year.
Applies the Sprott-Doob 15-day Mandela threshold to OTIS b05 (Ontario provincial segregation placement counts by binned duration). Reports proportions of placements ≤15 days (solitary confinement) vs ≥16 days (prolonged solitary confinement – torture under the Mandela Rules) per fiscal year + a federal-vs-provincial comparison.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c11_mandela_classification(df=None)[source]¶
Mandela-RF on c11 – per-individual Mandela classification by year.
Applies the 15-day threshold to OTIS c11 (Ontario provincial counts of INDIVIDUALS by binned aggregate duration). Reports both restrictive-confinement and segregation-only views; the latter is the more conservative and directly comparable to Sprott-Doob federal SIUs.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_otis_mandela_provincial_vs_federal()[source]¶
Side-by-side comparison: provincial OTIS vs federal SIU Mandela rates.
Cross-references analyze_c11_mandela_classification (Ontario provincial individuals) against the Sprott-Doob Feb 2021 federal SIU figures (Table 19, N=1960).
- Return type:
- morie.otis_all_analyze.analyze_c_doob_chi2(*, contingency_value='NumberIndividuals_RestrictiveConfinement')[source]¶
Doob chi-square family on c-series demographic contingency tables.
Honours Prof. Doob’s chi-square tradition in Canadian corrections. Runs Pearson χ² + Cramer’s V on every meaningful 2-way slice of the c-series datasets, using the chosen primary count column.
Sibling:
analyze_d_doob_chi2 (death counts). Federal counterpart context: morie.siuiap documents the Structured Intervention Unit Implementation Advisory Panel (Doob, Sprott, Sapers et al.).
- Parameters:
contingency_value (str)
- Return type:
- morie.otis_all_analyze.analyze_d_doob_chi2()[source]¶
Doob chi-square on d-series death data + yearly trend.
Honours Prof. Doob’s chi-square tradition; same payload as
analyze_d_aggregate (yearly Poisson 95% CI + Alert × MedicalCause χ² + Alert × Housing χ²).
- Return type:
- morie.otis_all_analyze.analyze_b01_dual(*args, **kwargs)[source]¶
Deprecated alias for
analyze_b01_ruhela_formulations.
- morie.otis_all_analyze.analyze_c_aggregate(*, contingency_value='NumberIndividuals_RestrictiveConfinement')[source]¶
Chi² + Cramer’s V family on every c-series 2-way slice.
c-series datasets are pre-aggregated counts, so the natural statistic on them is contingency-table independence. This runs χ² + Cramer’s V on each meaningful (categorical_a × categorical_b) slice using the chosen primary count column.
- Parameters:
contingency_value (str)
- Return type:
- morie.otis_all_analyze.analyze_d_aggregate()[source]¶
Yearly trend + alert-cause/housing χ² on d-series death data.
- Three things on the d-series:
Per-year custodial-death count (d02-d05 aggregate over the categorical dim) with Poisson 95% CIs.
Year-over-year rate ratio (last full year / first year).
χ² + Cramer’s V on d06 (Alert × MedicalCauseOfDeath) and d07 (Alert × Housing_Type).
All measures are at the population aggregate level – d-series has no per-individual alert columns, so the Ruhela dual is structurally impossible here.
- Return type:
- morie.otis_all_analyze.analyze_a01_with_csi_context(df=None, *, variant='total', rebase_to_year=2023)[source]¶
OTIS a01 causal pipeline + Toronto Crime Severity Index context.
Wires together three independent morie subsystems:
morie.otis_all_analyze.analyze_a01– the MatchIt-then- IRM-DML causal estimate of high-alert-complexity (ac ≥ 2) on regional volatility (vm) within OTIS a01.morie.otis_tps_overlay– year-by-year correlation between OTIS Toronto-region segregation counts and TPS incident counts.morie.tps_csi.analyze_csi_from_tps_dataframes– Statistics Canada Crime Severity Index per year + per ward, weighted by offence-specific severity weights and population-adjusted.
The narrative question: “Did the OTIS-observed increase in high-alert-complexity placements happen during a period of rising or falling Toronto crime severity?” The CSI rebased to
rebase_to_year (default 2023, OTIS’s first FY) gives a grandma-readable index.
- Parameters:
df (pd.DataFrame, optional) – a01 DataFrame; if None,
otis_all_analyze.analyze_a01() loads the canonical local copy.
variant ({"total", "violent"}) – StatsCan CSI variant.
rebase_to_year (int | None) – Anchor year for the CSI index column.
None skips rebasing.
- Return type:
- morie.otis_all_analyze.analyze_all(out_dir=None)[source]¶
Run all 29 OTIS analyses and write outputs.
- Parameters:
out_dir (Path | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_alt_age(df=None)¶
Alt-T Ruhela formulation on a01 – age 50+ -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_alt_gender(df=None)¶
Alt-T Ruhela formulation on a01 – gender -> regional volatility.
Same per-row Ruhela cell frame as the canonical formulation, but T = Female indicator -> Y = vm count. Covariates exclude Gender (the variable being treated). Runs the full 10-estimator DLRM ensemble. Tests whether women experience more or fewer regional transfers within a fiscal year of restrictive confinement.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_alt_toronto(df=None)¶
Alt-T Ruhela formulation on a01 – Toronto-region -> regional volatility.
T = (Region_AtTimeOfPlacement == “Toronto”) indicator. Tests whether Toronto-region placements produce different volatility than placements outside Toronto, controlling for demographics.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm(df=None, *, cluster_col='EndFiscalYear')¶
Ruhela formulations (full DLRM) on a01 – the canonical OTIS-RC dataset.
Runs the complete OTIS-RC methodology arc on the author’s Ruhela formulation (T = high alert complexity, Y = regional volatility). See
_ruhela_formulations_on for the full estimator list.
Attribution¶
The Ruhela formulation is the author’s design end-to-end; the supporting-contributor credits (Doob – pointed the author to the dataset; Levinsky – course context, reviewed only the preliminary a01 analysis; Medina – reviewed the formal write-up) are documented in
morie.otis_causal’s module docstring.
- param df:
a01 DataFrame; if None,
otis_datasets.load_otis_dataset('a01').- type df:
pd.DataFrame, optional
- param cluster_col:
Cluster axis for cluster-robust SE in the IRM-DML estimator (default ‘EndFiscalYear’).
- type cluster_col:
str
- Parameters:
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_per_year(df=None)¶
Per-year full DLRM on Ruhela formulations on a01.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_subgroup_female(df=None)¶
Canonical a01 Ruhela formulation on Female-only subset.
Tests effect-heterogeneity-by-gender: same T_high_ac -> vm contrast, but only on female cells. Compare ATE here to analyze_a01_ruhela_subgroup_male to see if alert-complexity matters more or less for women than men.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_a01_mrm_subgroup_male(df=None)¶
Canonical a01 Ruhela formulation on Male-only subset.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_alt_age(df=None)¶
Alt-T Ruhela formulation on b01 – age 50+ -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_alt_gender(df=None)¶
Alt-T Ruhela formulation on b01 – gender -> regional volatility.
Same as analyze_a01_ruhela_alt_gender but on b01 (Segregation Detailed) cell frame. Tests whether women experience more or fewer regional transfers within a fiscal year of segregation.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_alt_toronto(df=None)¶
Alt-T Ruhela formulation on b01 – Toronto-region -> regional volatility.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm(df=None, *, cluster_col='EndFiscalYear')¶
Ruhela formulations (full DLRM) on b01 (Segregation Detailed).
Per-placement segregation complement to a01’s per-day RC records. Same Ruhela alert-complexity -> regional-volatility contrast, with a larger-magnitude effect because segregation is a stronger institutional contrast.
See
analyze_a01_ruhela_formulations for the full attribution and estimator catalogue.
- Parameters:
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_per_year(df=None)¶
Per-year full DLRM on Ruhela formulations on b01.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_subgroup_female(df=None)¶
Canonical b01 Ruhela formulation on Female-only subset.
Sister to analyze_a01_ruhela_subgroup_female. Tests effect- heterogeneity-by-gender on segregation records.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b01_mrm_subgroup_male(df=None)¶
Canonical b01 Ruhela formulation on Male-only subset.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02_mrm_alt_age(df=None)¶
b02 alt-T: T = Age 50+ -> Y = total seg days within FY.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02_mrm_alt_region(df=None)¶
b02 alt-T: T = Toronto region -> Y = total seg days within FY.
Sister to analyze_b02_ruhela_formulations (T = Female). Tests regional disparity in seg-day burden controlling for gender + age + year.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b02_mrm(df=None, *, cluster_col='EndFiscalYear')¶
Ruhela formulation on b02 (Segregation Total Days per individual).
b02 has UniqueIndividual_ID + EndFiscalYear + Gender + Region + Age_Category + TotalAggregatedDays_Segregation but NO alert columns. The canonical alert-complexity formulation is impossible; instead the author’s b02-specific Ruhela formulation tests gender disparity in segregation-day burden:
T = Gender == “Female” indicator
Y = TotalAggregatedDays_Segregation (count, person-year)
covariates = [Region_MostRecentPlacement, Age_Category, EndFiscalYear]
Same DLRM ensemble as the canonical formulation. Sister formulations on the same b02 data could swap T for region or age.
- Parameters:
- Return type:
- morie.otis_all_analyze.analyze_b03_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on b03 (Alert × Institution).
T = Alert_Presence indicator -> Y = Number_SegregationPlacements. Year-FE + Region-FE + Alert_Type-FE; Institution as GEE cluster group (24 levels, marginal model with cluster-robust SE).
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b04_mrm_aggregate(df=None)¶
Aggregate RF on b04 – gender disparity in median seg duration.
b04 has Region × Gender × Measure (max/median/mode) duration. We filter to Measure == “Median” and test T = Female -> Y = duration.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b06_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on b06 – disciplinary reason -> seg.
T = (Reason contains “Disciplinary”) -> Y = Number of seg placements. Year-FE + Region-FE + Gender-FE; Institution as GEE cluster group (24 levels) for marginal-model cluster-robust inference.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b07_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on b07 (alerts × gender summary).
- b07 is a wide table with paired columns:
Number_Segregation_Placements_With_Alert
Number_Segregation_Placements_Without_Alert
We pivot to long form (one row per alert-state × gender × year), then T = with_alert indicator -> Y = count. Year + Gender + Alert_Type fixed effects.
- Parameters:
df (DataFrame | None)
- Return type:
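The wide-to-long pivot described above can be sketched with pandas.melt. The frame below is hypothetical – the numbers and the year/gender values are invented for illustration; only the paired count-column names come from the docstring:

```python
import pandas as pd

# Hypothetical b07-style wide frame (values invented for illustration).
wide = pd.DataFrame({
    "FiscalYear": [2023, 2023],
    "Gender": ["Female", "Male"],
    "Alert_Type": ["Mental Health", "Mental Health"],
    "Number_Segregation_Placements_With_Alert": [10, 40],
    "Number_Segregation_Placements_Without_Alert": [5, 60],
})

# One row per alert-state × gender × year, as the docstring describes.
long = wide.melt(
    id_vars=["FiscalYear", "Gender", "Alert_Type"],
    value_vars=["Number_Segregation_Placements_With_Alert",
                "Number_Segregation_Placements_Without_Alert"],
    var_name="alert_state", value_name="count",
)

# Binary treatment indicator: 1 on the with-alert rows.
long["with_alert"] = (long["alert_state"]
                      .str.endswith("With_Alert").astype(int))
```

The resulting with_alert column is the treatment indicator for the fixed-effects count model.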
- morie.otis_all_analyze.analyze_b08_mrm_aggregate(df=None)¶
Aggregate RF on b08 – gender disparity in median seg duration by institution.
Companion to b04; b08 adds Institution_AtTimeOfPlacement as a finer-grained spatial dimension. Region-FE; Institution as GEE cluster group (24 levels).
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_b09_mrm_aggregate(df=None)¶
Aggregate RF on b09 – gender disparity in placement-count distribution.
b09 has rows of (Year × NumberPlacements_Segregation × Gender × NumberIndividuals_Segregation). We test T = Female -> Y = NumberIndividuals_Segregation, with NumberPlacements as a categorical FE so we adjust for the bin-cell structure.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c01_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c01 – gender disparity in RC.
c01 has Gender × {InCustody, RestrictiveConfinement, Segregation} counts. T = Female indicator -> Y = NumberIndividuals_RC, Year-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c01_mrm_aggregate_region_cluster(df=None)¶
c01 alternative: Female -> RC count, GEE-clustered on year.
Sister analyzer to analyze_c01_ruhela_aggregate. The c01 case has only 6 cells (3 years × 2 genders), so cluster_group=”EndFiscalYear” gives a 3-cluster GEE – extreme small-cluster bias. This variant is shipped for transparency: it shows that with so few clusters, Poisson and GEE-Poisson give nearly identical numbers because there is no clustering structure to exploit.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c02_mrm_aggregate(df=None)¶
Aggregate RF on c02 – gender disparity by institution.
Companion to c01 with institution-level granularity. T = Female -> Y = NumberIndividuals_RC. Region + year as fixed effects; Institution as a GEE cluster group (marginal model with cluster-robust SE) – Institution has too many levels (~76) for an FE specification on the 148-row table without collinearity. Mirrors OTIS-RC’s
glmmTMB nbinom2 clustered approach.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c03_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c03 – racial disparity in RC.
T = (Race == “Indigenous”) indicator -> Y = NumberIndividuals_RC. Year-FE + Gender-FE. Indigenous vs all-other contrast tests overrepresentation.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c04_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c04 – racial disparity by region.
T = (Race == “Indigenous”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c04_mrm_aggregate_region_cluster(df=None)¶
c04 alternative: Indigenous -> RC, GEE-clustered on region.
Sister to analyze_c04_ruhela_aggregate (which uses region as FE). With 5 regions, the GEE-cluster:Region inference treats region as a marginal-model grouping. Useful when wide regional heterogeneity in case-mix matters for the standard errors.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c05_mrm_aggregate(df=None)¶
Aggregate RF on c05 – non-Christian religion -> individuals in RC.
T = indicator that Religion is none of “Christian”, “No Religion”, or “Unknown”. Tests population-level religious-minority overrepresentation in RC, controlling for region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c06_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c06 – age-50+ disparity in RC.
T = (Age_Category contains “50”) -> Y = NumberIndividuals_RC. Year-FE + Region-FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c07_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c07 – alert × gender -> RC count.
Counterpart of the per-row Ruhela formulation on a01/b01 at the aggregate level. Useful triangulation: c07 reports NumberIndividuals_InCustody / RC / Segregation cross-tabulated by Alert_Type × Gender × FiscalYear. The Ruhela formulation here contrasts an indicator of alert presence (vs. no-alert reference) with the number-in-RC count.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c08_mrm_aggregate(df=None)¶
Aggregate RF on c08 – non-Christian religion × gender -> RC.
T = (Religion != Christian / no religion / unknown) indicator. Year + Gender FE. Companion to c05 with gender control.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c09_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c09 – age-50+ disparity by gender.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c10_mrm_aggregate(df=None)¶
Aggregate RF on c10 – gender disparity in median RC days.
c10 has Region × Institution × Gender × Measure (Maximum/Median/Mode) duration. Filter to Median; T = Female -> Y = TotalAggregatedDays_RC. Region-FE + Year-FE; Institution as GEE cluster group (25 levels) for cluster-robust marginal inference.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c11_mrm_aggregate(df=None)¶
Aggregate Ruhela formulation on c11 – long-duration cell concentration.
c11 has rows of (FY × Aggregate_Duration_bin × NumberIndividuals_RC × NumberIndividuals_Seg). 11 duration bins from “1 day” to “Greater than 30 days”. Rather than treat duration as the treatment-FE (it IS the categorical the data is binned by), we test T = (long-duration bin: ≥16 days) -> Y = NumberIndividuals_RC. Year-FE only.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_c12_mrm_aggregate(df=None)¶
Aggregate RF on c12 – gender disparity in median RC days by region.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d02_mrm_aggregate(df=None)¶
d02 – gender -> custodial death count, with year FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d03_mrm_aggregate(df=None)¶
d03 – Indigenous -> custodial death count, with year FE.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d04_mrm_aggregate(df=None)¶
d04 – non-majority religion -> custodial death count.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_all_analyze.analyze_d05_mrm_aggregate(df=None)¶
d05 – age 50+ -> custodial death count.
- Parameters:
df (DataFrame | None)
- Return type:
morie.otis_causal – IPW, AIPW, and DML estimators for OTIS.
Companion to morie.otis.otdml (which already implements
cross-fitted Frisch-Waugh-Lovell DML). This module adds:
otis_ipw() – Hájek-stabilised inverse probability weighting with a logistic propensity model.
otis_aipw() – Augmented IPW (Robins-Rotnitzky-Zhao 1994 doubly-robust estimator) with cross-fitted nuisance models.
otis_causal_grid() – runs all three estimators (IPW, AIPW, DML) on the canonical three (treatment, outcome) pairs:
MentalHealth_Alert -> SuicideRisk_Alert
HighAlertComplexity -> AnyReadmission
RegionalVolatility -> ConsecutiveSegregationDays
The three pairs cover the three Goffmanian framings of OTIS most relevant to the MA-paired thesis: clinical-alert chain (a), multi-alert “complexity” (b), and inter-region churn (c).
Estimators¶
Let $D \in \{0,1\}$ be a binary treatment, $Y$ a real-valued outcome, $X \in \mathbb{R}^p$ a covariate vector, and write $e(X) \coloneqq \Pr(D=1 \mid X)$ for the propensity score and $\mu_d(X) \coloneqq \mathbb{E}[Y \mid D=d, X]$ for the outcome-regression mean.
IPW (Hájek):
- ATE_IPW = sum_i w1(i)·D_i·Y_i / sum_i w1(i)·D_i
  - sum_i w0(i)·(1-D_i)·Y_i / sum_i w0(i)·(1-D_i)
with w1(i) = 1/e(X_i), w0(i) = 1/(1-e(X_i)).
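In code, the Hájek estimator is just two self-normalised weighted means. A minimal numpy sketch, illustrative only (hajek_ipw_ate is not part of the morie API – otis_ipw() is the real entry point):

```python
import numpy as np

def hajek_ipw_ate(y, d, e):
    """Hajek-stabilised IPW ATE: self-normalised treated mean
    minus self-normalised control mean."""
    y, d, e = (np.asarray(a, dtype=float) for a in (y, d, e))
    w1, w0 = 1.0 / e, 1.0 / (1.0 - e)          # inverse-propensity weights
    mu1 = np.sum(w1 * d * y) / np.sum(w1 * d)        # treated arm
    mu0 = np.sum(w0 * (1 - d) * y) / np.sum(w0 * (1 - d))  # control arm
    return mu1 - mu0
```

The self-normalisation (dividing by the sum of weights rather than by n) is what distinguishes the Hájek estimator from the Horvitz-Thompson form and keeps it bounded under extreme propensities.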
AIPW (RRZ 1994 doubly-robust score):
- psi(O_i; eta) = mu_1(X_i) - mu_0(X_i)
  + D_i·(Y_i - mu_1(X_i)) / e(X_i)
  - (1-D_i)·(Y_i - mu_0(X_i)) / (1-e(X_i))
ATE_AIPW = (1/n) sum_i psi(O_i; eta_hat)
with cross-fitted nuisance estimates eta_hat = (e_hat, mu_1_hat, mu_0_hat).
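Given the cross-fitted nuisances, the AIPW estimate is just the sample mean of the score; a numpy sketch (aipw_ate is an illustrative helper, with e, mu1, mu0 assumed already cross-fitted):

```python
import numpy as np

def aipw_ate(y, d, e, mu1, mu0):
    """Average the RRZ doubly-robust score; SE from the influence function."""
    y, d, e, mu1, mu0 = (np.asarray(a, dtype=float)
                         for a in (y, d, e, mu1, mu0))
    psi = (mu1 - mu0
           + d * (y - mu1) / e            # augmentation, treated arm
           - (1 - d) * (y - mu0) / (1 - e))  # augmentation, control arm
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))  # influence-function SE
    return ate, se
```

When both nuisances are exactly correct the augmentation terms vanish in expectation and the score reduces to mu_1(X) - mu_0(X).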
DML PLR:
Frisch-Waugh-Lovell partialled-out regression with cross-fitted residuals (already implemented in
morie.otis.otdml()).
References
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89(427): 846-866.
Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21(1): C1-C68.
Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23(19): 2937-2960.
Robins, J. M. (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7: 1393-1512.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.
Attribution – Ruhela formulations¶
The OTIS-RC formulations and the entire codebase implemented here are the work of Vansh Singh Ruhela (hadesllm). No one co-formulated the analyses: the (T = ac ≥ 2, Y = vm count) alert-complexity -> regional-volatility contrast, the 8-state combo encoding, and the full method battery (IPW + AIPW + g-computation + PSM-NN + PSM-subclass + IRM-DML + match_first) are Ruhela’s design.
Three external contributors enabled the journey but did not co-author the formulations:
- Doob – Prof. Emeritus Anthony N. Doob, University of Toronto.
Pointed Ruhela to the existence of the OTIS data feed – the first piece of the puzzle. Without that pointer, this project does not begin.
- Levinsky – Prof. Zachary Levinsky, University of Toronto. His
graduate course on the Theory of Punishment was the broader context that motivated this work. He later reviewed the preliminary a01-only analysis from the
OTIS-RC/folder; he did not see the full method battery shipped here.- Medina – Prof. Zorro Medina, University of Toronto. Reviewed
the formal write-up; her feedback validated the statistical route and gave the author confidence in the pipeline’s current form.
Naming conventions¶
- RF – Ruhela formulation: a (treatment, outcome, covariates) design
choice for a specific OTIS dataset.
- RDF – Ruhela Dual Formulation: an RF paired with a Naive-arm
sensitivity contrast (e.g., on a01/b01 the Ruhela arm T_high_ac -> vm_count is paired with the Naive arm any-flag -> vm-binary).
- DLRM – Doob-Levinsky-Ruhela-Medina, short alias for the
attribution chain on the methodology side. Function names like
analyze_a01_dlrmare equivalent aliases foranalyze_a01_ruhela_formulations.
ACKNOWLEDGEMENTS – beyond the methodology attribution above¶
The DLRM attribution covers the contributors whose work directly shaped the OTIS-RC analyses. The wider intellectual journey that brought this codebase into being also owes credit to:
- Prof. Beatrice Jauregui, University of Toronto – for the network
of mentorship that connected Ruhela to Prof. Medina, which made this line of work possible.
- Prof. Ayobami Laniyonu, University of Toronto – for valuable
lessons over the broader period of this work that informed Ruhela’s thinking on policing and corrections research.
The separation between methodology attribution (DLRM, narrow) and acknowledgements (broader, enabling) follows standard academic- publishing convention.
- class morie.otis_causal.CausalEstimate(estimator: 'str', ate: 'float', ate_se: 'float', ate_pval: 'float', n: 'int', n_treated: 'int', p_treat: 'float', notes: 'list[str]' = <factory>)[source]¶
Bases:
object
- Parameters:
- morie.otis_causal.otis_ipw(df, *, treatment, outcome, covariates, eps=0.02, propensity_calibration='none')[source]¶
Hájek-stabilised IPW estimator of the ATE on OTIS data.
Estimates the propensity by logistic regression and weights treated/control outcomes by $1/\hat e$ and $1/(1-\hat e)$. Standard errors follow the Lunceford-Davidian sandwich estimator.
- Parameters:
propensity_calibration ("none" | "platt" | "isotonic") – Calibrate the raw logistic propensities. Platt = sigmoid(a + b * logit(p)) fitted on the same data; isotonic = monotonic regression (sklearn). Calibration improves Brier score / calibration-curve diagnostics; for OTIS data with logistic propensities it usually moves the predicted prevalence toward the observed prevalence.
df (DataFrame)
treatment (str)
outcome (str)
eps (float)
- Return type:
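The Platt option described above – a sigmoid refit on the logit of the raw propensities – can be sketched with scikit-learn; platt_calibrate is a hypothetical helper for illustration, not the morie implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(p_raw, d):
    """Fit sigmoid(a + b*logit(p)) on (logit(p), D), return calibrated probs."""
    z = np.log(p_raw / (1.0 - p_raw)).reshape(-1, 1)  # logit of raw scores
    # Very weak penalty (large C) approximates the unpenalised Platt MLE.
    lr = LogisticRegression(C=1e6).fit(z, d)
    return lr.predict_proba(z)[:, 1]
```

Because the logistic MLE solves the intercept score equation, the calibrated scores' mean matches the observed treated prevalence – the behaviour the docstring notes for OTIS data.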
- morie.otis_causal.otis_aipw(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02, propensity_calibration='none')[source]¶
Augmented IPW (Robins-Rotnitzky-Zhao) ATE on OTIS data.
Uses
n_folds cross-fitting: nuisance models (propensity, outcome-regression) are fit on $K-1$ folds and evaluated on the held-out fold, then aggregated. This is the doubly-robust influence-function plug-in estimator.
- Parameters:
- Return type:
- morie.otis_causal.otis_gcomputation(df, *, treatment, outcome, covariates, n_bootstrap=200, seed=123)[source]¶
Parametric g-computation (Robins 1986) of the ATE on OTIS data.
Fits an outcome regression $\hat\mu(d, X)$ via OLS, then computes the standardised mean difference
- \hat\Delta = n^{-1} \sum_i (\hat\mu(1, X_i) - \hat\mu(0, X_i)).
SE via non-parametric bootstrap (resample rows, refit, recompute). Complements IPW and AIPW: g-computation is consistent if and only if the outcome model is correctly specified (single-robust); IPW is consistent iff the propensity is correct; AIPW is doubly robust.
References
- Robins, J. M. (1986). A new approach to causal inference in
mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7: 1393-1512.
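A minimal numpy sketch of the point estimate (gcomp_ate is illustrative, not the morie implementation; the bootstrap SE is omitted for brevity). With a linear, no-interaction outcome model the standardised difference equals the treatment coefficient:

```python
import numpy as np

def gcomp_ate(y, d, X):
    """OLS outcome model y ~ 1 + d + X, then mean of mu(1, X) - mu(0, X)."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    Z = np.column_stack([np.ones_like(d), d, X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # fit mu-hat(d, X)
    Z1 = Z.copy(); Z1[:, 1] = 1.0   # counterfactual: everyone treated
    Z0 = Z.copy(); Z0[:, 1] = 0.0   # counterfactual: everyone untreated
    return float(np.mean(Z1 @ beta - Z0 @ beta))
```

Adding D × X interaction columns to Z (and to the counterfactual designs) would allow effect heterogeneity, at which point the standardised mean no longer collapses to a single coefficient.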
- morie.otis_causal.otis_psm_subclass(df, *, treatment, outcome, covariates, n_strata=5, eps=0.02)[source]¶
Propensity-score stratification (Rosenbaum & Rubin 1983).
Splits the sample into
n_strata strata by propensity-score quantile, computes within-stratum ATE = E[Y|D=1, S=s] - E[Y|D=0, S=s], and weights stratum-specific ATEs by stratum size to get the population ATE. Standard error: weighted average of within-stratum SEs (Rosenbaum-Rubin convention).
n_strata=5 is the standard “rule of thumb” – Cochran (1968) showed 5 strata remove ~90% of bias from a single confounder.
References
- Rosenbaum, P. R. & Rubin, D. B. (1983). The central role of the
propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.
- Cochran, W. G. (1968). The effectiveness of adjustment by
subclassification in removing bias in observational studies. Biometrics 24(2): 295-313.
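An illustrative numpy sketch of the stratified point estimate (psm_subclass_ate is hypothetical; dropping strata that lack treated or control units is an assumption of this sketch, one common convention):

```python
import numpy as np

def psm_subclass_ate(y, d, e, n_strata=5):
    """Quantile-stratify on e, size-weight within-stratum mean differences."""
    y, d, e = (np.asarray(a, dtype=float) for a in (y, d, e))
    edges = np.quantile(e, np.linspace(0.0, 1.0, n_strata + 1))
    # Assign each unit to a stratum by its propensity quantile bin.
    s = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)
    ate, n = 0.0, len(y)
    for k in range(n_strata):
        m = s == k
        if d[m].sum() == 0 or (1 - d[m]).sum() == 0:
            continue  # no overlap in this stratum: drop it (sketch convention)
        within = y[m][d[m] == 1].mean() - y[m][d[m] == 0].mean()
        ate += (m.sum() / n) * within   # weight by stratum size
    return ate
```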
- morie.otis_causal.otis_atc(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02)[source]¶
AIPW-flavoured Average Treatment effect on the Controls (ATC).
The IF for ATC = E[Y(1) - Y(0) | D = 0] is
- psi^{ATC}(O; eta) = (1 - e(X)) · [ mu_1(X) - mu_0(X)
  + D·(Y - mu_1(X)) / e(X)
  - (1-D)·(Y - mu_0(X)) / (1 - e(X)) ]
weighted by 1 / Pr(D=0). The implementation reuses the cross-fitted nuisance pipeline of otis_aipw and reweights the score on the untreated stratum.
- morie.otis_causal.otis_plr(df, *, treatment, outcome, covariates, n_folds=3, seed=123, ml_outcome='rf', ml_treatment='rf')[source]¶
Partially Linear Regression DML (Chernozhukov et al. 2018).
Fits the structural model
Y = θ * D + g(X) + ε, D = m(X) + V
via Frisch-Waugh-Lovell partialling-out: cross-fit nuisances g(X) = E[Y|X] and m(X) = E[D|X], then regress the residualised Y on the residualised D to recover θ.
Cross-fitted random-forest nuisance models match the OTIS-RC DoubleML/PLR pipeline. Differs from IRM-DML in that PLR assumes a constant (homogeneous) treatment effect θ; IRM allows D × X interactions in the outcome model. PLR is the simpler / more restricted estimator. Concordance with IRM-DML is the standard “no effect heterogeneity” robustness signal.
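The partialling-out step can be sketched in a few lines. Here linear (OLS) nuisances stand in for the random forests the pipeline actually uses, and plr_fwl_theta is illustrative, not the morie implementation:

```python
import numpy as np

def plr_fwl_theta(y, d, X, n_folds=2, seed=0):
    """Cross-fitted FWL: residualise Y and D on X out-of-fold,
    then regress the Y-residuals on the D-residuals to recover theta."""
    rng = np.random.default_rng(seed)
    y, d, X = (np.asarray(a, dtype=float) for a in (y, d, X))
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    ry, rd = np.empty_like(y), np.empty_like(d)
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Ztr = np.column_stack([np.ones(len(tr)), X[tr]])
        Zte = np.column_stack([np.ones(len(te)), X[te]])
        by, *_ = np.linalg.lstsq(Ztr, y[tr], rcond=None)  # g-hat = E[Y|X]
        bd, *_ = np.linalg.lstsq(Ztr, d[tr], rcond=None)  # m-hat = E[D|X]
        ry[te] = y[te] - Zte @ by   # out-of-fold outcome residual
        rd[te] = d[te] - Zte @ bd   # out-of-fold treatment residual
    return float(np.sum(rd * ry) / np.sum(rd * rd))  # residual-on-residual OLS
```

Swapping the lstsq fits for any regression learner (e.g. a random forest) gives the general cross-fitted PLR estimator; the final residual-on-residual step is unchanged.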
- morie.otis_causal.otis_aipw_superlearner(df, *, treatment, outcome, covariates, n_folds=5, seed=123, eps=0.02, propensity_calibration='none')[source]¶
SuperLearner-stacked AIPW (cross-fitted convex stack of learners).
Stacks four learners – random forest, ridge, OLS/logistic, mean – via cross-validated convex weights minimising MSE (regression) or log-loss (propensity). Mirrors the OTIS-RC AIPW SuperLearner pipeline (notez1a.qmd lines 2028-2048: SL.glmnet, SL.xgboost, SL.glm, SL.mean) as a lighter sklearn-only stack. xgboost is optional; if importable, it is added to the stack.
Falls back to plain AIPW if sklearn isn’t available.
References
- van der Laan, M. J. and Polley, E. C. and Hubbard, A. E. (2007).
Super Learner. Stat. Appl. Genet. Mol. Biol. 6(1): Article 25.
- morie.otis_causal.otis_psm(df, *, treatment, outcome, covariates, k=1, caliper_sd=0.2, with_replacement=False, eps=0.02)[source]¶
1:k nearest-neighbour propensity-score matching on logit(e(X)).
Estimates the average treatment effect on the treated (ATT) by pairing each treated unit with its $k$ nearest control units on the logit-propensity scale. An optional caliper (default $0.2\,\sigma_{\mathrm{logit}\,e}$, the Austin 2011 convention) discards treated units with no control inside the caliper. When $k=1$ and
with_replacement=False the SE is the paired-difference standard error $\sqrt{\mathrm{Var}(Y_t - Y_{c(t)}) / m}$, where $m$ is the number of matched pairs.
References
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioural Research 46(3): 399-424.
Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1): 235-267.
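An illustrative numpy sketch of the 1:1 logit-distance match (psm_att_1nn is hypothetical; for brevity it matches with replacement, whereas the documented default is with_replacement=False):

```python
import numpy as np

def psm_att_1nn(y, d, e, caliper_sd=0.2):
    """1:1 nearest-neighbour match on logit(e) with an Austin-style caliper;
    ATT and paired-difference SE from the matched pairs."""
    y, d, e = (np.asarray(a, dtype=float) for a in (y, d, e))
    z = np.log(e / (1.0 - e))                    # logit propensity
    caliper = caliper_sd * z.std(ddof=1)         # 0.2 * SD(logit e) by default
    zc, yc = z[d == 0], y[d == 0]
    diffs = []
    for zi, yi in zip(z[d == 1], y[d == 1]):
        j = np.argmin(np.abs(zc - zi))           # nearest control (reusable)
        if abs(zc[j] - zi) <= caliper:
            diffs.append(yi - yc[j])             # matched-pair difference
    diffs = np.asarray(diffs)
    att = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(len(diffs)) # paired-difference SE
    return att, se
```

Matching without replacement additionally requires removing each matched control from the candidate pool, which makes the result order-dependent unless treated units are processed greedily by propensity or by an optimal assignment.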
- morie.otis_causal.otis_balance(df, *, treatment, covariates, outcome=None, caliper_sd=0.2, eps=0.02)[source]¶
Per-covariate standardised mean differences before / after weighting / matching.
Returns a DataFrame with columns
covariate, smd_raw, smd_ipw, smd_psm. Conventionally $|\mathrm{SMD}| > 0.10$ flags meaningful imbalance and $|\mathrm{SMD}| > 0.25$ strong imbalance (Austin 2011).
- morie.otis_causal.otis_overlap(df, *, treatment, covariates, eps=0.02)[source]¶
Propensity-score overlap diagnostics.
Returns a dict with min/max propensity per group, fraction inside the trimming interval $[\varepsilon, 1-\varepsilon]$, Hellinger distance between treated/control propensity densities, and the count of clipped propensities.
- morie.otis_causal.otis_irm_dml(df, *, treatment, outcome, covariates, cluster_cols=None, n_folds=3, seed=123, eps=0.02, ml_outcome='rf', ml_propensity='rf', match_first=False, match_caliper_sd=0.2)[source]¶
Interactive-Regression-Model DML, matching DoubleML in R.
Computes both the ATE (population average treatment effect) and ATTE (average treatment effect on the treated) via the doubly-robust influence function
- psi^{ATE}(O; eta)
- = mu_1(X) - mu_0(X)
  + D·(Y - mu_1(X)) / e(X)
  - (1-D)·(Y - mu_0(X)) / (1 - e(X)),
- psi^{ATTE}(O; eta)
- = [ D·(Y - mu_0(X)) - e(X)·(1-D)·(Y - mu_0(X)) /
  (1 - e(X)) ] / bar D,
with cross-fitted random-forest nuisance models. Standard errors can be cluster-robust on one or two cluster axes (Cameron, Gelbach & Miller 2011; Liang & Zeger 1986).
Returns a dict with both estimands, their SEs, and 95% CIs.
- Parameters:
cluster_cols (str | list[str] | None) – Column names of cluster axes.
None ⇒ heteroskedasticity-consistent (no clustering); "id" or ["id"] ⇒ one-way cluster on id; ["id", "rc"] ⇒ two-way Cameron-Gelbach-Miller.
match_first (bool) – If True, first run 1:1 nearest-neighbour propensity-score matching on logit(ê(X)) with caliper match_caliper_sd × SD(logit(ê)), then fit IRM-DML on the matched subset only. Mirrors the MatchIt-then-DML pipeline in notez1a.qmd.
df (DataFrame)
treatment (str)
outcome (str)
n_folds (int)
seed (int)
eps (float)
ml_outcome (str)
ml_propensity (str)
match_caliper_sd (float | None)
- Return type:
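The ATE line of the influence function above can be evaluated directly once nuisance predictions are in hand. A minimal sketch, with plain arrays standing in for the cross-fitted random-forest predictions (the ATTE score and clustered SEs are omitted; names are illustrative):

```python
import numpy as np

def aipw_ate(y, d, mu1, mu0, e, eps=0.02):
    """Doubly-robust ATE from plug-in nuisances mu1, mu0, e."""
    e = np.clip(e, eps, 1 - eps)                     # trim propensities
    psi = (mu1 - mu0
           + d * (y - mu1) / e
           - (1 - d) * (y - mu0) / (1 - e))          # influence function
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))         # i.i.d. (unclustered) SE
    return ate, se
```

In the real pipeline the nuisances are cross-fitted, so each observation's prediction comes from a model trained on the other folds.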
- morie.otis_causal.otis_per_year_irm_dml(df, *, treatment, outcome, covariates, year_col='EndFiscalYear', cluster_cols=None, n_folds=3, seed=123, full_battery=False, propensity_calibration='none')[source]¶
Fit IRM-DML separately on each fiscal year.
Returns a dict keyed by year value. When full_battery=False (default), each year’s value is the standard otis_irm_dml output (mirrors Ruhela’s res_by_year in correctional_stats_report1z.RData).
When full_battery=True, each year’s value is a dict {estimator_name: CausalEstimate-or-irm-dict} with all 10 Ruhela-formulation estimators (IPW, AIPW, g-computation, PSM 1:1 NN, PSM subclass, IRM-DML, PSM->IRM-DML match_first, ATC-AIPW, PLR, SuperLearner-AIPW). Roughly 7× the per-year runtime.
- morie.otis_causal.make_pair_alert_to_volatility_ruhela(df)[source]¶
Ruhela’s primary causal contrast: high alert complexity (ac ≥ 2) causes more inter-region transfers (vm).
This is “Ruhela’s equation” for alert complexity and volatility – the 8-state combo encoding documented in
/path/to/workspace/OTIS-RC/notez1a.qmd and used to produce the published res_pool / res_by_year / res_all estimates in correctional_stats_report1z.RData. Sister function make_pair_alert_to_volatility_naive() is available as a robustness alternative; both are run side by side by make_pair_alert_to_volatility_all().
Encoding¶
For each placement-row, the three binary alerts (MentalHealth_Alert, SuicideRisk_Alert, SuicideWatch_Alert) encode one of $2^3 = 8$ states:
a1 = MH & ¬SR & ¬SW (mental-health only)
a2 = ¬MH & SR & ¬SW (suicide-risk only)
a3 = ¬MH & ¬SR & SW (suicide-watch only – empirically empty)
a4 = MH & SR & ¬SW (MH+SR)
a5 = ¬MH & SR & SW (SR+SW)
a6 = MH & ¬SR & SW (MH+SW – empirically empty)
a7 = MH & SR & SW (all three)
a8 = ¬MH & ¬SR & ¬SW (no alerts)
Per (UniqueIndividual_ID, EndFiscalYear), the alert-state complexity ac is the number of distinct combos with positive support across the year’s placement-rows for that person. This differs from the naïve “max simultaneous flags” – a person who had three a4 placements has ac = 1 (one distinct combo), not ac = 2 (two simultaneous flags). Treatment T = 1 iff ac ≥ 2.
Outcome¶
vm (regional volatility moves, count) sums two indicators across the person-year placement sequence:
regA ≠ regB within the same row (the person’s original-vs-most-recent regions differ in this placement);
regA ≠ regA(prev row) across rows (the person’s original-region label changed across consecutive placements).
This is a Poisson-style count outcome on $\{0, 1, 2, \ldots\}$.
- Returns:
T = “T_high_ac” binary, 1 iff ac ≥ 2; Y = “Y_vm_count” integer count, regional transfer count; covariates = [“Gender”, “Age_Category”, “EndFiscalYear”]
- Return type:
(data, T, Y, covariates) tuple
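The ac / vm construction described above can be sketched in a few lines of pandas. This is an illustrative reimplementation (rows are assumed already in within-year placement order; column names follow the docstring, the real builder may differ):

```python
import pandas as pd

def build_ac_vm(rows: pd.DataFrame) -> pd.DataFrame:
    df = rows.copy()  # rows assumed already in placement order within each year
    # 8-state combo label per placement-row from the three binary alerts
    df["combo"] = (df["MentalHealth_Alert"].astype(int).astype(str)
                   + df["SuicideRisk_Alert"].astype(int).astype(str)
                   + df["SuicideWatch_Alert"].astype(int).astype(str))
    keys = ["UniqueIndividual_ID", "EndFiscalYear"]

    def across_moves(s):
        prev = s.shift()
        # regA changed vs previous placement row (first row never counts)
        return (s.ne(prev) & prev.notna()).astype(int)

    within = (df["Region_AtTimeOfPlacement"]
              != df["Region_MostRecentPlacement"]).astype(int)
    across = df.groupby(keys)["Region_AtTimeOfPlacement"].transform(across_moves)
    df["vm_row"] = within + across
    grp = df.groupby(keys)
    out = grp["vm_row"].sum().to_frame("Y_vm_count")
    out["T_high_ac"] = (grp["combo"].nunique() >= 2).astype(int)  # ac >= 2
    return out.reset_index()
```

Note the distinct-combo count (`nunique`) is exactly what distinguishes this from the naive max-simultaneous-flags treatment.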
- morie.otis_causal.make_pair_alert_to_volatility_naive(df)[source]¶
Naïve (simpler) alternative to
make_pair_alert_to_volatility_ruhela().
- Treatment ac:
per (id, year), the maximum number of simultaneous alert flags across the year’s placement rows (i.e.
max(MH + SR + SW) per row). T = 1 iff this maximum is ≥ 2.
- Outcome vm (binary):
per (id, year), 1 iff any placement row has
Region_AtTimeOfPlacement ≠ Region_MostRecentPlacement.
This was the original morie formulation before we audited against
OTIS-RC/notez1a.qmd; it produces a different treatment marginal (~24% treated vs ~14% under Ruhela’s 8-state encoding) and a binary rather than count outcome. We keep it for robustness comparison: if both formulations agree on sign and rough magnitude, the substantive conclusion is invariant to the operationalisation.
- morie.otis_causal.make_pair_alert_to_volatility_all(df)[source]¶
Run both Ruhela and naïve alert->volatility formulations.
Returns a dict
{"ruhela": (...), "naive": (...)} where each value is a 4-tuple (person_year_data, T, Y, covariates) suitable for handing to otis_irm_dml().
- morie.otis_causal.make_pair_alert_to_volatility(df)¶
Same as make_pair_alert_to_volatility_ruhela(); see that entry for the full encoding and outcome documentation.
- morie.otis_causal.otis_dml(df, *, treatment, outcome, covariates, n_folds=5, seed=123)[source]¶
Cross-fitted DML PLR (partialling-out) wrapper.
Reproduces
morie.otis.otdml() but exposes the same CausalEstimate shape for grid comparison.
- morie.otis_causal.make_pair_a(df)[source]¶
Pair (a): MentalHealth_Alert -> SuicideRisk_Alert (binary->binary).
The clinical-alert chain: do mental-health flags causally elevate subsequent suicide-risk-alert occurrence, conditional on demographics and region?
- morie.otis_causal.make_pair_b(df)[source]¶
Pair (b): HighAlertComplexity -> AnyReadmission.
Treatment T_b = 1 iff at least 2 of {MentalHealth, SuicideRisk, SuicideWatch} alerts are simultaneously active in the year. Outcome Y_b = 1 iff this individual has ≥2 placements (proxy for any future readmission).
- morie.otis_causal.make_pair_c(df)[source]¶
Pair (c): RegionalVolatility -> SegregationDays.
Treatment T_c = 1 iff Region_AtTimeOfPlacement ≠ Region_MostRecentPlacement (the inter-region transfer flag). Outcome Y_c = NumberConsecutiveDays_Segregation (continuous, truncated at the 99th percentile to dampen the long tail).
- morie.otis_causal.make_pair_alert_to_volatility_a01(df=None)[source]¶
Same as
make_pair_alert_to_volatility_ruhela() but auto-loads a01 (Restrictive Confinement Detailed) instead of b01 (Segregation Detailed) if df is None.
a01 is the file the published res_pool / res_by_year / res_all are computed on; using it in our pipeline closes the “different bundled snapshot” gap.
- morie.otis_causal.otis_causal_grid(df=None, *, seed=123)[source]¶
Run IPW / AIPW / DML on all three canonical (T, Y) pairs.
Returns a
RichResult with a 9-row payload table – three estimators × three pairs – plus per-row standard errors, p-values, and 95% confidence intervals.
- Parameters:
- Return type:
morie.otis_churn – Goffmanian institutional-churn analyses on OTIS.
Goffman (1961) describes ‘total institutions’ as totalizing environments that produce a cyclical relationship to the inmate. This module operationalises the cyclical / mortifying / embedding dimensions as formal statistical tests:
repeat_placement_concentration(b09) – Gini + power-law fit + KS test
within_year_placement_count(b01) – per-(id,year) cell-size distribution
within_year_region_diversity(b01) – intra-year region cycling
mortification_cooccurrence(b01) – joint alert chi² + Cramer’s V
disciplinary_medical_overlap(b01) – disc × med-protect co-occurrence
embedding_distribution(b02) – total-days survival (lognormal vs Pareto AIC)
intra_year_transition_matrix(a01) – Markov regA->regA’ within FY
path_complexity_gini(b01) – Gini split by (year × region)
region_alert_state_richness(b01) – distinct (region × combo) per cell
regC_demog_contingency(b01) – multi-region × Gender × Age χ²
irr_glmm_vm(b01) – Poisson + NB IRR for vm
Each emits a RichResult.
NOTE: OTIS UniqueIndividual_ID is anonymised as YYYY-XXXXX-AA, where AA is the dataset acronym (RC for a01, SG for b01/b02, DC for d01). The YYYY prefix locks each ID to one fiscal year, AND IDs are re-randomized per dataset file even within the same year, so longitudinal individual-level linkage and cross-dataset linkage are both impossible by design. All metrics above are either aggregate (b09 counts), within-year individual (b01), or within-year total-days (b02). See docs/source/methods/otis_linkage.md.
- morie.otis_churn.repeat_placement_concentration(df=None)[source]¶
Goffman’s ‘cyclical inmate’ – heavy concentration of placements among a small fraction of individuals.
Uses OTIS b09 (NumberPlacements × NumberIndividuals).
- Parameters:
df (DataFrame | None)
- Return type:
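Because b09 ships a frequency table (placements-per-person × number of individuals), the Gini coefficient is computed on grouped data. An illustrative standalone sketch (Lorenz-curve trapezoid rule; not morie's implementation):

```python
import numpy as np

def gini_from_counts(k, n):
    """Gini of placement counts: value k held by n individuals each."""
    k, n = np.asarray(k, float), np.asarray(n, float)
    order = np.argsort(k)
    k, n = k[order], n[order]
    x = np.concatenate([[0.0], np.cumsum(n) / n.sum()])            # people share
    y = np.concatenate([[0.0], np.cumsum(k * n) / (k * n).sum()])  # placement share
    area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))               # area under Lorenz
    return 1.0 - 2.0 * area
```

Gini 0 means placements are spread evenly; values near 1 mean a tiny fraction of individuals accounts for nearly all placements, the "cyclical inmate" signature.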
- morie.otis_churn.within_year_placement_count(df=None)[source]¶
Distribution of segregation placements per (individual × fiscal year) cell.
Because OTIS UniqueIndividual_ID is anonymised per fiscal year (YYYY-XXXXX-AA), each (id, year) cell is one anonymous person-year. This metric describes how concentrated the placements are within a fiscal year – Goffman’s intra-year cycling. Cross-year readmission is not measurable in OTIS by design. See OTIS_LINKAGE.md.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.within_year_region_diversity(df=None)[source]¶
Distribution of distinct regions per individual within one fiscal year.
Because OTIS UniqueIndividual_ID is anonymised per fiscal year (YYYY-XXXXX-AA), this metric measures intra-year region cycling only – not multi-year mobility. See OTIS_LINKAGE.md.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.mortification_cooccurrence(df=None)[source]¶
Goffman’s ‘mortification of self’ is operationalised here as concurrent stigmatising-alert flags on a single placement.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.disciplinary_medical_overlap(df=None)[source]¶
Goffman’s ‘tinkering trades’ tension: the same institution classifies the same person via medical AND disciplinary rationales. We look at co-occurrence on individual records.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.embedding_distribution(df=None)[source]¶
TotalAggregatedDays_Segregation distribution per individual per fiscal year. Compare lognormal vs Pareto fit via AIC.
- Parameters:
df (DataFrame | None)
- Return type:
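The lognormal-vs-Pareto comparison can be sketched with scipy MLE fits and AIC (lower is better). Pinning the location at zero and, for the Pareto, the scale at the sample minimum are my illustrative assumptions, not necessarily morie's choices:

```python
import numpy as np
from scipy import stats

def compare_tail_fits(days):
    days = np.asarray(days, float)
    days = days[days > 0]
    # lognormal: floc=0 gives the usual 2-parameter form
    shape, loc, scale = stats.lognorm.fit(days, floc=0)
    ll_ln = stats.lognorm.logpdf(days, shape, loc, scale).sum()
    aic_ln = 2 * 2 - 2 * ll_ln
    # Pareto on [xmin, inf): fix loc=0 and scale=min(days), fit shape b only
    b, ploc, pscale = stats.pareto.fit(days, floc=0, fscale=days.min())
    ll_p = stats.pareto.logpdf(days, b, ploc, pscale).sum()
    aic_p = 2 * 1 - 2 * ll_p
    return {"aic_lognorm": aic_ln, "aic_pareto": aic_p,
            "preferred": "lognorm" if aic_ln < aic_p else "pareto"}
```

A Pareto win would indicate a genuinely heavy (power-law) tail of very long total-days spells; a lognormal win indicates a heavy but not scale-free tail.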
- morie.otis_churn.intra_year_transition_matrix(df=None)[source]¶
Intra-year Markov transition matrix of original-region labels (regA -> regA′) across consecutive placements within one fiscal year, built from the a01 detailed records.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.path_complexity_gini(df=None)[source]¶
Gini of placements-per-(id × year) cells, split by (year, region).
Extends within_year_placement_count by reporting per-region and per-year Gini coefficients, exposing where Goffmanian cycling is most concentrated.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.region_alert_state_richness(df=None)[source]¶
Distinct (region × alert-combo) states occupied per (id × year).
Goffman’s mortification × institutional-mobility hybrid: how many distinct role-zones (region paired with alert presentation) does a person traverse within one fiscal year?
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.regC_demog_contingency(df=None)[source]¶
Multi-region path indicator (regC = number of distinct regions visited within FY) × Gender × Age contingency, with chi² + Cramer’s V.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.otis_churn.irr_glmm_vm(df=None)[source]¶
Poisson + Negative-Binomial GLM IRR table for vm ~ T_high_ac plus demographic covariates, on the (UniqueIndividual_ID, FY) cell.
A simplified implementation of the OTIS-RC glmmTMB pipeline: no random effect (statsmodels can’t easily do mixed NB on this scale). Provides Poisson and NB2 IRR side by side as a robustness pair.
- Parameters:
df (DataFrame | None)
- Return type:
morie.otis_datasets – registry + loader for all 29 OTIS CSVs.
The Ontario Tracking Information System (OTIS) is the open-data feed for Ontario’s adult correctional system, published by the Ministry of the Solicitor General’s Service Management and Oversight Branch under the Jahn settlement framework. The 29 datasets in the official A01RCDD release fall into 4 series:
- a-series (1 dataset) – Restrictive confinement detailed (the canonical per-day record file used in Ruhela 2026 OTIS-RC)
- b-series (9 datasets) – Segregation placement records
- c-series (12 datasets) – Individuals in segregation & restrictive confinement
- d-series (7 datasets) – Deaths in custody
- CSV path convention:
data/datasets/OTIS/<id>_<short_label>.csv
Where <id> is the canonical dataset name from the official Ontario data dictionary (a01, b01, c03, d05, …).
- Public API:
DATASET_REGISTRY – dict mapping id -> metadata
load_otis_dataset(id) – returns DataFrame
list_otis_datasets() – list of all (id, description, n_cols)
download_otis_dataset(id, target_dir=None) – fetch a fresh copy from data.ontario.ca via the CKAN action API
download_all_otis(target_dir=None) – fetch every file in the package
OTIS_DATA_DIR – Path
OTIS_CKAN_PACKAGE – package id on data.ontario.ca
References
- Ministry of the Solicitor General. Data on Inmates in Ontario.
- Ministry of the Solicitor General. Restrictive Confinement, Segregation and Deaths in Custody Data Dictionary, 2025-11-03. https://data.ontario.ca/dataset/data-on-inmates-in-ontario/resource/d83fe893-9634-4794-a0c1-c17bf619a95a
- class morie.otis_datasets.OtisDataset(id, series, csv_filename, description, columns, panel, primary_metric)[source]¶
Bases:
object
One OTIS dataset’s metadata.
- Parameters:
- morie.otis_datasets.load_otis_dataset(dataset_id, data_dir=None)[source]¶
Load one OTIS dataset by id (b01, c03, d05, …).
Returns a pandas DataFrame with the columns documented in DATASET_REGISTRY[dataset_id].columns.
- morie.otis_datasets.list_otis_datasets()[source]¶
List all 29 OTIS datasets as (id, description, n_columns).
- morie.otis_datasets.download_otis_dataset(dataset_id, target_dir=None, *, overwrite=False)[source]¶
Fetch one OTIS CSV from CKAN and write to
target_dir.
Returns the local path to the downloaded file. No-op if the file already exists and overwrite=False.
- morie.otis_datasets.download_all_otis(target_dir=None, *, overwrite=False, skip_missing=True)[source]¶
Fetch every CSV in the OTIS registry from CKAN.
Returns
{dataset_id: local_path} for successfully downloaded files. When skip_missing=True (default), files that aren’t found on CKAN are silently skipped; otherwise the function raises.
morie.otis_tps_overlay – cross-link OTIS (Ontario corrections) with TPS (Toronto police) data.
Both feeds touch Toronto, so meaningful overlays:
Year-over-year correlation – does annual segregation/RC use in the OTIS “Toronto” region track annual TPS incident counts?
Per-OTIS-region rollups – total seg/RC counts per region, with the Toronto-region row joinable to TPS aggregates.
Crime × confinement composite – combine TPS composite-risk index (per-neighbourhood) with OTIS seg/RC totals (Toronto-region only) to surface high-correlation years.
All emit RichResult.
- morie.otis_tps_overlay.yoy_correlation(tps_categories=None, *, sample_rows=50000)[source]¶
Year-by-year correlation between OTIS Toronto-region segregation placements and TPS crime incident counts (per category).
- Parameters:
- Return type:
- morie.otis_tps_overlay.per_region_rollup(*, tps_total_by_year=None)[source]¶
OTIS seg/RC totals per region × year, with the Toronto row flagged for TPS-overlay use. Also splits by gender.
- Parameters:
tps_total_by_year (Series | None)
- Return type:
TPS — Toronto Police Service¶
morie.tps_io – multi-format readers for TPS open-data exports.
Each TPS category at data/datasets/TPS/<Category>/ has 9 sibling format exports of the same incident records:
CSV – pandas.read_csv (canonical fast path)
Excel – pandas.read_excel via openpyxl
GeoJSON – pure JSON, geometry as Python list-of-coords
FeatureCollection – ESRI JSON variant of GeoJSON
KML / KMZ – zipfile + ElementTree, geometry as coord list
GeoPackage – sqlite3, geometry stored as WKB blobs
SQLiteGeodatabase – sqlite3, ESRI variant of GeoPackage
Shapefile – needs pyshp; graceful degrade if not installed
FileGeoDatabase – needs fiona / gdal; graceful degrade
This module returns a pandas.DataFrame for every format, with the geometry (when available) attached as a geometry column whose values are simple Python lists/tuples – no shapely / geopandas dependency.
For NeighbourhoodCrimeRates the geometry is Polygon; for incident datasets the geometry is Point. Either way, the geometry column contains the (longitude, latitude) tuples in WGS84.
- Public API:
load_tps(name, format='csv', nrows=None) -> pd.DataFrame
list_tps_formats(name) -> dict[str, Path]
available_formats() -> list of supported format names
- morie.tps_io.load_tps(name, format='csv', nrows=None)[source]¶
Load TPS dataset name in the given format.
- Parameters:
name ("Assault", "Homicides", … (case-insensitive))
format (one of csv | excel | geojson | featurecollection |) – kml | geopackage | sqlitegeodatabase | shapefile. “filegeodatabase” requires fiona/gdal – not supported on this build.
nrows (sample size cap; None = full dataset.)
- Return type:
- morie.tps_io.list_tps_formats(name)[source]¶
Map format name -> path of the file that would be loaded. For formats not present on disk, the path is omitted from the dict.
morie.tps_datasets – registry + loader for Toronto Police Service public crime datasets.
- 13 categories at data/datasets/TPS/<Category>/CSV/:
Assault, AutoTheft, BicycleTheft, BreakandEnter, CommunitySafetyIndicators, HateCrimes, Homicides, IntimatePartnerAndFamilyViolence, NeighbourhoodCrimeRates, Robbery, ShootingAndFirearmDiscarges, TheftFromMovingVehicle, TheftOver
Each category also has Excel/Shapefile/GeoJSON/etc; this module reads the CSV by default but can be pointed at any sibling format.
- Schema (per-incident, varies slightly by category):
OBJECTID, EVENT_UNIQUE_ID, REPORT_DATE, OCC_DATE, REPORT_YEAR, REPORT_MONTH, REPORT_DAY, REPORT_DOY, REPORT_DOW, REPORT_HOUR, OCC_YEAR, OCC_MONTH, OCC_DAY, OCC_DOY, OCC_DOW, OCC_HOUR, DIVISION, LOCATION_TYPE, PREMISES_TYPE, UCR_CODE, UCR_EXT, OFFENCE, CSI_CATEGORY, HOOD_158, NEIGHBOURHOOD_158, HOOD_140, NEIGHBOURHOOD_140, LONG_WGS84, LAT_WGS84, x, y
- Public API:
TPS_REGISTRY – dict: name -> TpsDataset load_tps_dataset(name) -> pd.DataFrame list_tps_datasets() -> list of (name, n_rows_est, …)
- class morie.tps_datasets.TpsDataset(name, description, primary_date, has_geometry)[source]¶
Bases:
object
One TPS open-data category.
- morie.tps_datasets.load_tps_dataset(name, csv_filename=None, nrows=None, *, format=None)[source]¶
Load one TPS dataset by category name.
name is case-insensitive. Pass nrows=N for a quick sample while developing on the largest tables (Assault is 254k rows).
format (optional) – load from a non-CSV format. Valid:
csv, excel, geojson, featurecollection, kml, geopackage, sqlitegeodatabase, shapefile. Default csv. Geometric formats attach a geometry column (Python coord lists). See morie.tps_io.load_tps for details.
- morie.tps_datasets.list_tps_datasets()[source]¶
List all TPS datasets as (name, description, primary_date_col).
morie.tps_crime – cross-category crime analyses for TPS datasets.
Builds on top of morie.tps_datasets + tps_io to compare incidents ACROSS the 13 TPS categories. Useful for:
- Comparing year-over-year trends side-by-side
- Identifying neighbourhoods that are ‘hot’ across many crime types
- Computing bivariate spatial correlation (Moran’s BV) between two crime categories on the same neighbourhoods
- Building a composite crime-risk index per neighbourhood
All emit RichResult.
- morie.tps_crime.yoy_panel(categories=None, *, sample_rows=50000)[source]¶
Side-by-side year-over-year panel across TPS categories.
- Parameters:
- Return type:
- morie.tps_crime.composite_index(categories=None, *, sample_rows=50000, weights=None, top_n=25)[source]¶
Build a composite per-neighbourhood crime-risk index.
For each category, count incidents per HOOD_158, z-standardise across neighbourhoods, then sum (or weight-and-sum) to get a single composite. Higher index = neighbourhood with elevated incidence across many crime types.
weights lets you tune the blend (e.g. weights={'Homicides': 5.0, 'BicycleTheft': 0.5}). Default = unit weights for all loaded categories.
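The z-score-then-sum construction can be sketched standalone; a minimal illustration (the wide counts layout and column names are my assumptions for the sketch):

```python
import pandas as pd

def composite(counts, weights=None):
    """counts: rows = HOOD_158 ids, columns = categories, values = counts."""
    z = (counts - counts.mean()) / counts.std(ddof=0)   # z-score per category
    w = pd.Series({c: (weights or {}).get(c, 1.0) for c in counts.columns})
    return (z * w).sum(axis=1).sort_values(ascending=False)
```

Because each category is standardised before summing, a rare-but-serious category (Homicides) contributes on the same scale as a frequent one (BicycleTheft) unless reweighted.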
- morie.tps_crime.bivariate_morans_i(cat_a, cat_b, *, sample_rows=50000, k_neighbours=5)[source]¶
Bivariate Moran’s I – does category A’s count in a hood correlate with category B’s count in NEIGHBOURING hoods?
- Parameters:
- Return type:
- morie.tps_crime.category_correlation_matrix(*, sample_rows=50000)[source]¶
Pearson correlation of per-hood incident counts across all 13 TPS categories.
- Parameters:
sample_rows (int | None)
- Return type:
morie.tps_csi – Statistics Canada Crime Severity Index (CSI) weights for the 9 Toronto Police Service open-data categories.
The Crime Severity Index, introduced in Wallace et al. (2009) and maintained by the Canadian Centre for Justice and Community Safety Statistics, weights each Criminal Code offence by:

weight(offence) = (avg sentence in days) × (proportion of offenders incarcerated)

so that violent offences with high incarceration rates and long sentences contribute disproportionately to a city’s per-capita CSI score, while minor property offences contribute less. This module exposes the weights used for the 9 TPS open-data categories (Assault, Auto Theft, Bicycle Theft, Break and Enter, Homicide, Robbery, Shooting and Firearm Discharges, Theft from Motor Vehicle, Theft Over) and provides per-year + per-neighbourhood CSI aggregates.
Important caveats¶
TPS open-data categories aggregate over multiple Criminal Code sub-offences (e.g. “Assault” covers Assault Level 1, Level 2 with a weapon, Level 3 aggravated, plus a small share of family and assault-of-peace-officer offences). The weights here are representative blends reflecting the typical distribution of sub-offences within each TPS category for FY2023; for an exact reproduction of Statistics Canada’s CSI for the City of Toronto, one must work directly from the CCJS UCR microdata (which is not in TPS open data).
The weights below are pinned to the last published Statistics Canada methodology update (“Reweighting the Crime Severity Index”, Catalogue 85-004-X) and the Toronto-specific override tables in CCJS Annual Statistics 2023. Newer weight revisions (StatsCan revises every 5 years) may shift values by 5–15% without changing relative ordering. Override via the weights keyword argument when calling functions in this module.
Statistics Canada itself reports two CSI variants: “Total CSI” and “Violent CSI”. Functions here default to the Total CSI weights but accept a variant='violent' argument to use violent-only weights, where non-violent categories (break and enter, theft) are zeroed.
References
- Wallace, M., Turner, J., Babyak, C., & Matarazzo, A. (2009).
Measuring Crime in Canada: Introducing the Crime Severity Index and Improvements to the Uniform Crime Reporting Survey. Statistics Canada Catalogue 85-004-X.
- Statistics Canada (2024). Crime Severity Index, Census
Metropolitan Areas, 2023. Catalogue 35-10-0190-01.
- morie.tps_csi.csi_weight(category, *, variant='total', weights=None)[source]¶
Return the CSI weight for a TPS open-data category.
- morie.tps_csi.csi_per_year(counts_per_year, *, variant='total', weights=None, population=None, per_capita_unit=100000, rebase_to_year=None, rebase_to_value=100.0)[source]¶
Compute Toronto’s CSI per fiscal year from per-category counts.
- Returns a DataFrame indexed by year with columns:
raw_weighted_sum – Σ w_c · n_{c,year}
total_count – Σ n_{c,year}
population – Toronto population that year
csi_per_capita – raw_weighted_sum / population × per_capita_unit
simple_count_rate – total_count / population × per_capita_unit (for comparison)

counts_per_year may be either a dict-of-dicts or a long-format DataFrame with columns year, category, count.
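The per-year aggregation reduces to a weighted groupby; a minimal sketch for the long-format input (the weight and population numbers below are made up for illustration, not StatsCan values):

```python
import pandas as pd

def csi_per_year_sketch(long_df, weights, population, per_capita_unit=100_000):
    """long_df: columns year, category, count; weights/population: dicts."""
    df = long_df.copy()
    df["weighted"] = df["category"].map(weights) * df["count"]  # w_c * n_{c,year}
    g = df.groupby("year")
    out = pd.DataFrame({"raw_weighted_sum": g["weighted"].sum(),
                        "total_count": g["count"].sum()})
    out["population"] = out.index.map(population)
    out["csi_per_capita"] = (out["raw_weighted_sum"] / out["population"]
                             * per_capita_unit)
    return out
```

The real function adds the simple_count_rate comparison column and the rebasing options.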
- morie.tps_csi.csi_per_neighbourhood(counts_per_hood, *, variant='total', weights=None)[source]¶
CSI per neighbourhood (HOOD_158).
Mirrors
csi_per_year() but groups by neighbourhood rather than fiscal year. Population is not divided here because TPS open data does not ship a per-neighbourhood population table; users are expected to merge in the City of Toronto Open Data NeighbourhoodCrimeRates per-neighbourhood population for per-capita rates. Returns the un-normalised weighted sum + total count per neighbourhood.
- morie.tps_csi.analyze_csi_from_tps_dataframes(dfs, *, year_col='OCC_YEAR', hood_col='HOOD_158', variant='total')[source]¶
High-level orchestration: take
{category: tps_df} of the 9 TPS open-data feeds and return per-year + per-neighbourhood CSI.
morie.tps_temporal – temporal analyses for TPS crime data.
year_over_year_trend: linear regression on yearly counts
seasonal_pattern: monthly / DOW / hour-of-day cyclic stats
arima_forecast: simple ARIMA(1,1,1) on monthly counts
changepoint_detection: PELT-style changepoint on yearly counts
All emit RichResult.
- morie.tps_temporal.year_over_year_trend(df, *, year_col='OCC_YEAR', ds_name='?')[source]¶
Linear regression of incident counts vs year.
- Parameters:
- Return type:
- morie.tps_temporal.seasonal_pattern(df, *, ds_name='?')[source]¶
Month / day-of-week / hour-of-day cyclic patterns + chi-square test of uniformity.
- Parameters:
- Return type:
- morie.tps_temporal.changepoint_detection(df, *, year_col='OCC_YEAR', ds_name='?')[source]¶
Simple Pettitt-style change-point on yearly counts.
- Parameters:
- Return type:
- morie.tps_temporal.arima_forecast(df, *, h=12, ds_name='?')[source]¶
ARIMA(1,1,1) on monthly counts; forecast h periods ahead.
- Parameters:
- Return type:
morie.tps_spatial – spatial analyses for TPS crime data.
Wires the existing morie.fn spatial primitives to TPS data:
- Moran’s I (global) on neighbourhood-level counts
- LISA (Local Moran’s I) for hot/cold spots
- Ripley’s K for point-pattern clustering
- Kernel density estimation
- Gi*/Getis-Ord hotspot detection
All emit RichResult.
- morie.tps_spatial.morans_i_neighbourhood(df, *, hood_col='HOOD_158', ds_name='?', k_neighbours=5)[source]¶
Compute global Moran’s I on neighbourhood-level incident counts.
Builds a k-NN spatial weights matrix from neighbourhood centroids (mean LAT/LONG of incidents in each hood as a proxy centroid), then runs Moran’s I on the count vector.
- Parameters:
- Return type:
- morie.tps_spatial.local_morans_i(df, *, hood_col='HOOD_158', ds_name='?', k_neighbours=5, top_n=20)[source]¶
LISA – local Moran’s Ii per neighbourhood, with hot/cold spot classification.
- morie.tps_spatial.kde_density(df, *, bandwidth=0.005, ds_name='?')[source]¶
Kernel density estimate of incident lat/long.
Returns summary stats; not the full grid (too big for a RichResult).
- Parameters:
- Return type:
morie.tps_spatial_advanced – heavyweight spatial statistics for TPS.
Builds on morie.tps_spatial (Moran’s I global, LISA, KDE) with:
ripley_k point-pattern clustering at multiple radii
getis_ord_g_star local hot/cold-spot z-scores (Gi*)
dbscan_clusters density-based clusters on lat/lon
kernel_intensity_at_centroids smoothed counts at neighbourhood centroids
polygon_morans_i queen-contiguity Moran’s I from polygons
Polygon contiguity uses the actual GeoJSON polygons from morie.tps_io.load_tps("NeighbourhoodCrimeRates", format='geojson'), not the centroid-only k-NN approximation in tps_spatial.
- morie.tps_spatial_advanced.ripley_k(df, *, ds_name='?', radii_km=None, max_n=5000)[source]¶
Ripley’s K function – for each radius r, expected number of other points within r of a typical point, normalised by intensity.
For a homogeneous Poisson process K(r) = π r²; values above this indicate clustering, below indicate regularity.
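An uncorrected estimator of K(r) is easy to sketch; this illustrative version ignores edge correction (which biases K slightly low near boundaries) and assumes coordinates are already projected to a planar unit (real pipelines convert lat/long to km first):

```python
import numpy as np

def ripley_k_sketch(xy, radii, area):
    """Naive Ripley K: xy is (n, 2) planar coords, area the study region area."""
    xy = np.asarray(xy, float)
    n = len(xy)
    lam = n / area                                    # intensity estimate
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)                       # exclude self-pairs
    # K(r) = (1 / (n * lam)) * number of ordered pairs closer than r
    return np.array([(d < r).sum() / (n * lam) for r in radii])
```

Comparing the estimate against the Poisson benchmark π r² at each radius gives the clustering-vs-regularity read described above.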
- morie.tps_spatial_advanced.getis_ord_g_star(df, *, ds_name='?', hood_col='HOOD_158', k_neighbours=5, top_n=20)[source]¶
Local Getis-Ord Gi* statistic per neighbourhood.
- z-score interpretation:
Gi* > 1.96 – significant hotspot at α=0.05 (high values surrounded by other high values)
Gi* < -1.96 – significant cold spot
- morie.tps_spatial_advanced.dbscan_clusters(df, *, ds_name='?', eps_km=0.25, min_samples=30, max_n=30000)[source]¶
DBSCAN on point coordinates. Returns cluster summary.
- morie.tps_spatial_advanced.polygon_morans_i(*, ds_name='NeighbourhoodCrimeRates', value_col_prefix='ASSAULT_RATE', year=2024, k_neighbours=5)[source]¶
Polygon-aware Moran’s I using actual neighbourhood polygons from the GeoJSON export of NeighbourhoodCrimeRates, with a per-year crime-rate value column.
- Parameters:
- Return type:
- morie.tps_spatial_advanced.bivariate_moran(*, ds_name='NeighbourhoodCrimeRates', x_col_prefix='ASSAULT_RATE', y_col_prefix='HOMICIDE_RATE', year=2024, k_neighbours=5)[source]¶
Bivariate Moran’s I between two crime-rate columns.
Generalises
polygon_morans_i to two attributes: measures the cross-correlation between attribute X at location i and attribute Y at neighbouring locations j. Polygon centroids + k-nearest-neighbour weights, same construction as polygon_morans_i.
\[I_{xy} = \frac{n}{S_0}\, \frac{\sum_i \sum_j w_{ij}\, z^x_i\, z^y_j} {\sqrt{\sum_i (z^x_i)^2 \cdot \sum_i (z^y_i)^2}}\]
where \(z^x\) and \(z^y\) are z-scored attributes.
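The statistic above can be computed directly once the weights matrix is built. A minimal sketch with row-binary k-NN weights from centroids (illustrative, not morie's exact construction):

```python
import numpy as np

def bv_moran_sketch(coords, x, y, k=5):
    """Bivariate Moran I_xy with binary k-NN weights from centroid coords."""
    coords = np.asarray(coords, float)
    x = (x - x.mean()) / x.std()                      # z-score attribute X
    y = (y - y.mean()) / y.std()                      # z-score attribute Y
    n = len(x)
    d = np.sqrt(((coords[:, None] - coords[None, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    w = np.zeros((n, n))
    nn = np.argsort(d, axis=1)[:, :k]                 # k nearest neighbours
    w[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    s0 = w.sum()
    # I_xy = (n / S0) * sum_ij w_ij z^x_i z^y_j / sqrt(sum z^x^2 * sum z^y^2)
    return (n / s0) * (x @ w @ y) / np.sqrt((x ** 2).sum() * (y ** 2).sum())
```

With z-scored inputs the denominator is just n, so I_xy is the average cross-product of a unit's X score with its neighbours' Y scores.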
- morie.tps_spatial_advanced.moran_sweep_heatmap(*, ds_name='NeighbourhoodCrimeRates', category_prefixes=None, years=None, k_neighbours=5)[source]¶
Run univariate Moran’s I across (category x year) and return the resulting matrix as a RichResult payload.
morie.tps_stochastic – stochastic-physics-of-crime analyses on TPS.
Modelled on Frenkel-style crime-physics (Hawkes self-exciting + Fokker-Planck density evolution + Langevin SDE), with practical forecasting baselines (SARIMA, XGBoost-when-available) for hold-out comparison.
Functions:
- hawkes_temporal_fit(df) – fit μ, κ, ω of self-exciting kernel, compute branching ratio + AIC/BIC
- sarima_forecast(df, h=12) – seasonal ARIMA, train/test MAPE
- prophet_forecast(df, h=12) – Prophet trend+seasonality if installed
- langevin_simulate(…) – Euler-Maruyama Ornstein-Uhlenbeck path
- fokker_planck_grid(df) – finite-difference density evolution
- model_compare(df) – AIC across baselines + SDE
All emit RichResult; figure_path payload entry written when matplotlib is available (saved as PNG under data/manifest/outputs/figures/tps_stochastic/).
References (cited in description blocks):
- Mohler et al. 2011, “Self-exciting point process modeling of crime.”
- Pitcher 2010, “Adding integrated nested Laplace approximations to the Bayesian forecasting toolkit.”
- Stochastic physics of crime literature (Short, D’Orsogna, Bertozzi 2010).
- morie.tps_stochastic.hawkes_temporal_fit(df, *, ds_name='?', max_n=5000)[source]¶
Fit a temporal-only exponential Hawkes process to incident times.
Returns μ (background rate), κ (branching ratio), ω (decay), AIC/BIC.
- Parameters:
- Return type:
- morie.tps_stochastic.sarima_forecast(df, *, ds_name='?', h=12, order=(1, 1, 1), seasonal=(0, 1, 1, 12))[source]¶
Seasonal ARIMA forecast on monthly incident counts.
- morie.tps_stochastic.langevin_simulate(df, *, ds_name='?', n_paths=100, T_days=365, dt=1.0, seed=42)[source]¶
Euler-Maruyama simulation of an Ornstein-Uhlenbeck process fitted to incident counts:
dX_t = θ(μ − X_t) dt + σ dW_t
Fits θ, μ, σ via OLS on first-differences, then generates n_paths forward simulations of length T_days.
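The simulation half of this can be sketched in a few lines. A minimal Euler-Maruyama sketch (the OLS fit of θ, μ, σ on first differences is omitted; `simulate_ou` is a hypothetical helper rather than the module function):

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, T_days=365, dt=1.0, n_paths=100, seed=42):
    """Euler-Maruyama simulation of dX_t = theta*(mu - X_t)*dt + sigma*dW_t.

    Returns an array of shape (n_paths, n_steps + 1)."""
    rng = np.random.default_rng(seed)
    n_steps = int(T_days / dt)
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = x0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)   # Brownian increments
        x[:, k + 1] = x[:, k] + theta * (mu - x[:, k]) * dt + sigma * dw
    return x
```

Paths mean-revert toward μ at rate θ, which is why the scheme is a natural forward-simulation baseline for daily counts.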
- morie.tps_stochastic.fokker_planck_grid(df, *, ds_name='?', n_grid=64, n_steps=200)[source]¶
1-D Fokker-Planck density evolution of daily incident counts with the OU drift+diffusion fitted in langevin_simulate.
Uses Crank-Nicolson finite-difference scheme with reflective boundaries on a grid spanning [0, max(observed)*1.5].
- Parameters:
- Return type:
- morie.tps_stochastic.analyze(name='Assault', *, sample_rows=50000)[source]¶
Run the four stochastic-physics analyses on one TPS category.
- Parameters:
- Return type:
morie.tps_hawkes_advanced – non-stationary Hawkes with non-exponential kernels.
Implements the Kwan-Chen-Dunsmuir (2024, arXiv:2408.09710v1) methodology for Hawkes process likelihood inference when the baseline intensity is time-varying and the excitation kernel is non-exponential (so the intensity process is non-Markovian).
Companion to morie.tps_stochastic – that module contains the
classical exponential-kernel constant-baseline (Markovian / Mohler 2011)
fit; this module adds:
Gamma, Weibull, and power-law (Lomax) excitation kernels.
Sinusoidal-with-trend time-varying baseline ν(t).
Time-rescaling residuals (Brown et al. 2002) for goodness-of-fit via Kolmogorov-Smirnov + Q-Q plots.
Eight-way model comparison via AIC and an explicit Markovian-vs-non-Markovian likelihood-ratio surface.
The complete intensity function is
λ(t) = ν(t) + ∫_0^{t-} g(t - s) dN_s
with kernel decomposition g(u) = η · g̃(u) where η ∈ (0, 1) is the branching ratio (mean offspring per event) and g̃ is a probability density on [0, ∞). Stationarity requires η < 1.
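Under these definitions the intensity at time t is just the baseline plus η times the summed kernel density at the lags to past events. A minimal sketch (a hypothetical helper, not part of the module API):

```python
import numpy as np

def hawkes_intensity(t, events, nu, eta, kernel_pdf):
    """lambda(t) = nu(t) + eta * sum_{t_i < t} g~(t - t_i).

    nu         : callable, time-varying baseline
    eta        : branching ratio in (0, 1); stationarity requires eta < 1
    kernel_pdf : normalised density g~ on [0, inf), vectorised over lags
    """
    events = np.asarray(events, dtype=float)
    lags = t - events[events < t]          # only strictly past events excite
    return nu(t) + eta * kernel_pdf(lags).sum()

# Exponential kernel (the Markovian special case), decay omega = 1:
lam = hawkes_intensity(2.0, [0.0, 1.0],
                       nu=lambda t: 0.5, eta=0.8,
                       kernel_pdf=lambda u: np.exp(-u))
```

Swapping `kernel_pdf` for a gamma, Weibull, or Lomax density gives the non-Markovian cases; the log-likelihood then needs the kernel’s integrated form as well.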
- morie.tps_hawkes_advanced.fit_hawkes_general(t, T, kernel_kind='exponential', baseline_kind='constant')[source]¶
MLE of a non-stationary Hawkes process.
Returns a dict with
theta, nll, aic, bic, branching_ratio, baseline_params, kernel_params, and the time-rescaling KS statistic.
- morie.tps_hawkes_advanced.hawkes_advanced_fit(df, *, kernel='gamma', baseline='sinusoidal', ds_name='?', max_n=5000)[source]¶
Fit a single (kernel, baseline) combination with figures.
A sibling to morie.tps_stochastic.hawkes_temporal_fit, which is locked to the (exponential, constant) Markovian special case.
- morie.tps_hawkes_advanced.compare_hawkes_kernels(df, *, ds_name='?', max_n=4000, baselines=('constant', 'sinusoidal'), kernels=('exponential', 'gamma', 'weibull', 'lomax'))[source]¶
Fit every (kernel, baseline) combination and rank by AIC.
Mirrors Section 5 (numerical examples) of Kwan-Chen-Dunsmuir 2024: the Markovian classical Hawkes is the (exponential, constant) row; the non-Markovian non-stationary models are everything else.
- morie.tps_hawkes_advanced.hawkes_markovian_vs_nonmarkovian(df, *, ds_name='?', max_n=4000)[source]¶
Focused 2-way comparison: classical exp/const vs gamma/sinusoidal.
The two endpoints of the Kwan-Chen-Dunsmuir framework – quickest to run on the dashboard.
- Parameters:
- Return type:
morie.tps_statphysics – Statistical physics of crime applied to TPS.
Implements the 4 canonical methods reviewed by D’Orsogna & Perc, “Statistical physics of crime: A review”, Phys. Life Rev. 12 (2015) 1-21 (arxiv:1411.1743). Each function consumes one TPS category and returns a RichResult with a paragraph-level interpretation.
- 1. ``sdb_reaction_diffusion(category, …)`` -- Short-D'Orsogna-Brantingham 2008 hot-spot PDE on a Toronto grid:
∂A/∂t = η ∇²A − ω A + θ ρ
∂ρ/∂t = ∇·(D∇ρ − 2 ρ ∇log A) − ρ A + γ
A is the attractiveness field; ρ the criminal density. Crime incidents are absorbed into A, and ρ flows up the gradient of A. Stable hot spots emerge when (η, ω, θ, D, γ) place the system in the localised-spike regime. We fit the steady-state spike count to the observed cluster count from DBSCAN.
- 2. ``levy_flight_alpha(category)`` -- Brockmann-Hufnagel-Geisel 2006 Lévy-flight diagnostic. We compute step lengths between chronologically consecutive incidents and fit a power-law tail p(ℓ) ∝ ℓ^{−α} via Hill-MLE on the upper tail.
- 3. ``urban_scaling_beta(category)`` -- Bettencourt et al. 2007 / 2010 urban scaling: log(crime_i) = log(Y₀) + β · log(pop_i) + ε for the 158 Toronto wards. β > 1 = super-linear (crime grows faster than population), β = 1 = linear, β < 1 = sub-linear.
- 4. ``lotka_volterra_police_crime(category)`` -- Lotka-Volterra predator-prey on a coarse-grained yearly time series:
dx/dt = α x − β x y   (crime: prey)
dy/dt = δ x y − γ y   (police effort: predator)
Fit (α, β, γ, δ) by least-squares against TPS counts (proxy x) and normalised mass-stop / arrest counts (proxy y if available, else one quarter of incidents as a placeholder).
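The Hill-MLE tail fit used by method 2 reduces to a one-liner. A minimal sketch of the continuous power-law MLE on exceedances above ℓ_min (`hill_alpha` is a hypothetical helper, not the module function):

```python
import numpy as np

def hill_alpha(steps, lmin):
    """Continuous power-law MLE (Hill estimator) for the tail exponent alpha
    in p(l) ~ l^(-alpha), fitted on step lengths exceeding lmin."""
    tail = np.asarray(steps, dtype=float)
    tail = tail[tail >= lmin]              # upper-tail exceedances only
    n = len(tail)
    return 1.0 + n / np.log(tail / lmin).sum()
```

The estimator is consistent when the tail really is power-law above ℓ_min; choosing ℓ_min too low biases α toward the body of the distribution.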
References
D’Orsogna & Perc (2015) §2.1, §3.2, §4.1
Short, D’Orsogna, Brantingham et al. (2008) “Statistical model of criminal behavior”, M3AS 18.
Brockmann, Hufnagel, Geisel (2006) “Scaling laws of human travel”, Nature 439.
Bettencourt, Lobo, Helbing, Kühnert, West (2007) “Growth, innovation, scaling, and the pace of life in cities”, PNAS 104.
- morie.tps_statphysics.sdb_reaction_diffusion(category='Assault', *, sample_rows=30000, eta=0.05, omega=0.3, theta=1.5, D=0.1, gamma=0.05, n_steps=800, dt=0.04, nx=90, ny=60, save_fig=True)[source]¶
Solve the Short-D’Orsogna-Brantingham 2008 reaction-diffusion crime hot-spot PDE on a Toronto grid, seeded by observed incidents.
Returns a RichResult with the steady-state hot-spot count, mean attractiveness, mean criminal density, and a comparison to DBSCAN.
- morie.tps_statphysics.levy_flight_alpha(category='Assault', *, sample_rows=30000, lmin_km=0.5, save_fig=True)[source]¶
Hill-MLE Lévy exponent on step lengths between consecutive incidents in chronological order (proxy for offender mobility).
- Parameters:
- Return type:
- morie.tps_statphysics.urban_scaling_beta(category='Assault', *, year=2024, save_fig=True)[source]¶
Bettencourt 2007 super-linear urban scaling on the 158 wards.
Regresses log(crime per ward) on log(population per ward) -> β.
- Parameters:
- Return type:
- morie.tps_statphysics.lotka_volterra_police_crime(category='Assault', save_fig=True)[source]¶
Yearly Lotka-Volterra fit: crime as prey, mass-stop / arrest as predator.
With only one observed series we use the empirical x(t) and back out (α, β) from the LV equilibrium x* = γ/δ, y* = α/β under stationarity. For the predator series we use the cumulative count of incidents as a crude effort-density proxy; the absolute scale is unidentified, but the (α, γ) ratio and the cycle period are identified.
- Parameters:
- Return type:
- morie.tps_statphysics.sdb_turing_demo(*, eta=0.2, omega=0.033, theta=0.56, D=30.0, gamma=0.019, n_steps=6000, dt=0.005, n=80, save_fig=True)[source]¶
Canonical SDB Turing-instability demo on a clean periodic grid.
Reproduces the localized hot-spot lattice from Short, D’Orsogna & Brantingham (2008) Fig. 4 / D’Orsogna & Perc (2015) Fig. 5 – which is what the author’s reference image #3 / #4 shows. Periodic boundaries, homogeneous initial state + Gaussian noise, runs to steady state. Output: 1×3 panel (early / mid / late).
- morie.tps_statphysics.inspection_game_phase(n_temptations=20, n_costs=20, n_steps=600, save_fig=True)[source]¶
Helbing-Szolnoki-Perc (2010) cooperator-defector-inspector replicator dynamics; produces the (Temptation T, Inspection cost γ) crime-rate phase diagram from D’Orsogna & Perc 2015 §5 / Image #5.
- Three strategies (pure populations only):
C – cooperator
P – defector / ‘predator’
O – punisher / inspector
Replicator dynamics with payoff matrix that includes punishment of P by O at cost γ. Steady-state defector fraction -> ‘crime rate’.
- Parameters:
- Return type:
- morie.tps_statphysics.criminal_network_graph(category='Assault', *, sample_rows=30000, top_n_premises=20, save_fig=True)[source]¶
Co-occurrence network: nodes = top-N premise types, edges = sharing a common neighbourhood, weighted by joint incident count.
Replaces the criminal-role graph in D’Orsogna & Perc Fig. 9 / Diviák et al. (2019) – for our public TPS data we don’t have co-offender records, so we build the closest-meaning network from PREMISES_TYPE × HOOD_158 co-incidence.
- Parameters:
- Return type:
- morie.tps_statphysics.analyze_all(categories=None, save_fig=True)[source]¶
Run all 4 stat-phys analyses on each category.
morie.tps_render – clean Toronto map rendering.
Two design rules per the author, 2026-05-07:
NO floating neighbourhood text labels on the map (no “Downtown Yonge-East”, “Yonge-Bay Corridor”, “Mimico Queensway”, …). Hot-spot identification is delivered via the RichResult tables, not via on-canvas text.
Map is rotated ~17° CLOCKWISE in projected space so Lake Ontario’s shoreline sits level horizontally – matching the Sigar Li 2022 “Hotspot Policing for the City of Toronto” poster aesthetic and the Hohl 2024 ALMI homicide-cluster map.
Rotation works in cos-latitude-corrected metres (not raw degrees), so shapes stay metric-true regardless of where in Toronto’s bbox they sit.
Public API¶
project_xy(lat, lon) – degrees -> rotated planar metres (km)
render_choropleth(rate_col, year, ...) – polygon choropleth
render_dbscan(category, ...) – point-pattern + clusters
render_yearly_grid(prefix, years, ...) – small-multiples
- morie.tps_render.pretty_label(s)[source]¶
ASSAULT_RATE_2024 -> Assault rate · 2024. Strips underscores and normalises casing so titles & legend labels read like prose.
- morie.tps_render.project_xy(lat, lon, *, rot_deg_cw=17.5, lat_c=43.7, lon_c=-79.4)[source]¶
Project (lat, lon) -> (x_km, y_km) with a clockwise rotation.
Equirectangular projection centred at (lat_c, lon_c), then rotated rot_deg_cw degrees clockwise. Returns kilometres east-of-centre (after rotation) and kilometres north-of-centre (after rotation).
Clockwise convention: positive rot_deg_cw rotates the map so a line that previously sloped up-right slopes less (or down-right).
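A plausible reconstruction of this projection (the Earth-radius constant and the exact rotation sign convention are assumptions; only the signature matches the documented one):

```python
import numpy as np

R_EARTH_KM = 6371.0   # assumed mean Earth radius

def project_xy(lat, lon, *, rot_deg_cw=17.5, lat_c=43.7, lon_c=-79.4):
    """(lat, lon) degrees -> (x_km, y_km), rotated rot_deg_cw clockwise."""
    # equirectangular: cos-latitude correction keeps east-west distances metric-true
    x = np.radians(np.asarray(lon) - lon_c) * R_EARTH_KM * np.cos(np.radians(lat_c))
    y = np.radians(np.asarray(lat) - lat_c) * R_EARTH_KM
    th = np.radians(rot_deg_cw)
    # clockwise rotation: [x'; y'] = [[cos, sin], [-sin, cos]] @ [x; y]
    return x * np.cos(th) + y * np.sin(th), -x * np.sin(th) + y * np.cos(th)
```

The centre (lat_c, lon_c) maps to the origin; with rot_deg_cw=0 a point due north of the centre lands on the positive y axis.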
- morie.tps_render.render_choropleth(*, rate_col='ASSAULT_RATE_2024', title=None, cmap=None, outfile=None, fig_h=7.0, fig_w=12.0, show_ids=True, border_color='#1a1a1a', border_lw=0.7)[source]¶
Polygon choropleth using NeighbourhoodCrimeRates GeoJSON polys.
Style matches the Hohl 2024 ALMI / Sigar Li 2022 Toronto poster:
- thick black polygon borders (border_color/border_lw)
- sequential cmap chosen per category (default Reds family)
- small numeric polygon-ID labels (show_ids=True) – IDs only, never neighbourhood names per the author 2026-05-07
- title and colour-bar legend with proper spaces (no underscores)
- morie.tps_render.render_quad(category='Homicides', *, year=2024, outfile=None, fig_h=9.0, fig_w=13.0, sample_rows=30000, border_color='#1a1a1a', border_lw=0.6)[source]¶
Hohl 2024 ALMI-style 4-panel quad for any TPS category.
- Panels (a-d):
Kernel density (incidents per sq km) – sequential
Per-100k rate from NeighbourhoodCrimeRates – sequential
Spatial clusters / outliers (LISA HH/LL/HL/LH) – diverging
Significant Gi* hot/cold spots – diverging
- morie.tps_render.render_dbscan(category='Assault', *, eps_km=0.3, min_samples=20, sample_rows=30000, outfile=None, show_top=12, fig_h=7.5, fig_w=12.0)[source]¶
Point-pattern map: incidents + DBSCAN cluster IDs.
No floating hood labels; clusters are shown by colour with a compact legend. Aspect/rotation matches the rest of the suite.
- morie.tps_render.render_district_proportional(metric='ASSAULT_2024', *, title=None, outfile=None, fig_h=7.0, fig_w=12.5, border_color='#1a1a1a', border_lw=0.55)[source]¶
District-coloured map with proportional-symbol black circles.
Inspired by the author’s reference (Toronto news-coverage proportional-symbol map). Toronto’s 158 wards are coloured by former-borough district (Etobicoke / North York / Scarborough / York / Old Toronto / East York), with a black circle drawn at each ward centroid sized proportional to metric (any column from NeighbourhoodCrimeRates).
Includes a “Map Divisions” legend with district swatches and a “Number of <metric>” graduated-symbol legend with 5 size tiers.
- morie.tps_render.render_satscan_panel(category='Homicides', *, sample_rows=30000, n_clusters=2, n_mc=199, year_window_yrs=4, outfile=None, fig_h=7.5, fig_w=11.5, border_color='#1a1a1a', border_lw=0.55)[source]¶
Hohl 2024 ALMI “panel d” – Kulldorff-style space-time scan statistic with significant clusters as pink filled circles, red dots inside for sig. locations, yellow polygons for sig. wards, and an info-box (n_cases / expected / RR / p-value) on the right.
Implementation is a simplified Kulldorff Poisson scan over a grid of candidate (x, y, r, t-window) cells: for each cell compute a log-likelihood ratio assuming Poisson(λ_in) inside vs Poisson(λ_out) outside; null distribution by Monte-Carlo (n_mc permutations of incident timestamps within the bbox).
- morie.tps_render.render_yearly_grid(*, prefix='ASSAULT_RATE', years=range(2014, 2025), cmap='Reds', outfile=None, ncols=4)[source]¶
Small-multiples panel of per-year choropleths, no hood labels.
morie.tps_all_analyze – RichResult-emitting analyses for the 13 TPS crime datasets.
- Surfaces:
analyze_<name>(df=None) -> RichResult       # universal per-dataset
spatial_summary(df) -> RichResult           # neighbourhood + lat/lon
temporal_summary(df) -> RichResult          # year/month/dow/hour
crime_compare(*dfs, names=…) -> RichResult  # cross-category
analyze_all(out_dir=None) -> dict
Output goes to data/manifest/outputs/tps/.
Each TPS CSV has the same standard schema (Toronto Open Data convention), so a single set of helpers handles the full 13-dataset matrix.
- morie.tps_all_analyze.temporal_summary(df, *, ds_name='?')[source]¶
Year/month/dow/hour rollups.
- Parameters:
- Return type:
- morie.tps_all_analyze.spatial_summary(df, *, ds_name='?')[source]¶
Neighbourhood + division + premises rollups, plus lat/long bounding-box summary.
- Parameters:
- Return type:
- morie.tps_all_analyze.offence_summary(df, *, ds_name='?')[source]¶
OFFENCE / UCR / CSI_CATEGORY rollups.
- Parameters:
- Return type:
- morie.tps_all_analyze.gini_concentration(values)[source]¶
Gini coefficient of crime concentration across spatial units. G=0 perfectly even, G=1 perfectly concentrated.
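A minimal sketch of that Gini computation (a hypothetical standalone version of the documented helper, assuming non-negative values):

```python
import numpy as np

def gini(values):
    """Gini coefficient of concentration: 0 = perfectly even,
    (n-1)/n -> 1 = perfectly concentrated in one unit."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    cum = np.cumsum(v)
    # Lorenz-curve form: G = (n + 1 - 2 * sum_k(cum_k / cum_n)) / n
    return (n + 1 - 2.0 * (cum / cum[-1]).sum()) / n
```

For a finite sample the maximum attainable value is (n−1)/n, reached when all crime sits in a single spatial unit.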
- morie.tps_all_analyze.neighbourhood_concentration(df, *, ds_name='?')[source]¶
How concentrated is crime across neighbourhoods? Uses HOOD_158 (the 158-neighbourhood scheme) by default.
- Parameters:
- Return type:
- morie.tps_all_analyze.crime_compare(dfs)[source]¶
Compare counts and trends across multiple TPS categories. Returns one table per axis (year/division/neighbourhood).
- Parameters:
- Return type:
- morie.tps_all_analyze.analyze_one(name, df=None)[source]¶
Run the standard TPS analysis bundle on one dataset.
- Parameters:
- Return type:
- morie.tps_all_analyze.analyze_assault(df=None)¶
Analyze TPS Assault.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_autotheft(df=None)¶
Analyze TPS AutoTheft.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_bicycletheft(df=None)¶
Analyze TPS BicycleTheft.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_breakandenter(df=None)¶
Analyze TPS BreakandEnter.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_communitysafetyindicators(df=None)¶
Analyze TPS CommunitySafetyIndicators.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_hatecrimes(df=None)¶
Analyze TPS HateCrimes.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_homicides(df=None)¶
Analyze TPS Homicides.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_intimatepartnerandfamilyviolence(df=None)¶
Analyze TPS IntimatePartnerAndFamilyViolence.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_neighbourhoodcrimerates(df=None)¶
Analyze TPS NeighbourhoodCrimeRates.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_robbery(df=None)¶
Analyze TPS Robbery.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_shootingandfirearmdiscarges(df=None)¶
Analyze TPS ShootingAndFirearmDiscarges.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_theftfrommovingvehicle(df=None)¶
Analyze TPS TheftFromMovingVehicle.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_theftover(df=None)¶
Analyze TPS TheftOver.
- Parameters:
df (DataFrame | None)
- Return type:
- morie.tps_all_analyze.analyze_all(out_dir=None, *, sample_rows=50000)[source]¶
Run full bundle on each of the 13 TPS datasets + cross-compare.
sample_rows caps each load at N rows for fast development; pass None to load everything (Assault alone is 254k rows so the full 13-way load is several hundred MB).
- Parameters:
- Return type:
Federal SIU — Sprott / Doob / Iftene replication¶
morie.sprott_doob – Sprott & Doob (CRIMSL UToronto) SIU analyses.
Replicates the analytical contribution of the four CRIMSL UToronto research reports authored by Prof. Jane B. Sprott (Toronto Metropolitan University, formerly Ryerson) and Prof. Anthony N. Doob (University of Toronto), with Prof. Adelina Iftene (Dalhousie) co-authoring the May 2021 paper on Independent External Decision Makers.
The four reports¶
Sprott & Doob (Oct 2020) – Understanding the Operation of Correctional Service Canada’s Structured Intervention Units: Some Preliminary Findings. First systematic outside analysis of CSC’s SIU data; demonstrated that SIUs were not operating as the legislative framework (Bill C-83) required.
Sprott & Doob (Nov 2020) – Is there Clear Evidence that COVID-19 Was the Cause of Problems with the Operation of CSC’s Structured Intervention Units? Tests CSC’s COVID-attribution defense; finds the data did not support it (problems pre-existed COVID).
Sprott & Doob (Feb 2021) – Solitary Confinement, Torture, and Canada’s Structured Intervention Units. The most data- intensive of the four; introduces a Mandela-Rules classifier for SIU stays (solitary confinement = ≤15 days at ≤2 hrs out-of-cell; torture = ≥16 days under same conditions). Contains Tables 13, 19, 23 reproduced below.
Sprott, Doob & Iftene (May 2021) – Do Independent External Decision Makers Ensure that Solitary Confinement is no Longer Used in Canada’s Federal Penitentiaries? Evaluates the IEDM review mechanism added by Bill C-83.
Tables hardcoded (from Feb 2021 report)¶
- TABLE13_REGIONAL_RATES_PER_1000
SIU person-stays per 1000 regional penitentiary prisoners, split into short (≤15 days) and long (≥16 days) stays, by CSC region.
- TABLE19_MANDELA_CLASSIFICATION
Sprott-Doob Mandela-Rules classification of N=1960 SIU stays: solitary confinement (28.4%, N=556), torture (9.9%, N=195), all-other (61.7%, N=1209).
- TABLE23_REGIONAL_TORTURE_RATES
Solitary and torture rates per 1000 prisoners by region. Headline finding: Pacific torture rate (39.1) is 22.6× higher than Ontario’s (1.73).
Functions exposed¶
analyze_table13_regional_rates() -> RichResult
analyze_table19_mandela_classification() -> RichResult
analyze_table23_regional_torture_rates() -> RichResult
classify_mandela(days_in_siu, hrs_out_of_cell_avg, missed_full_4hrs) -> dict
Apply the Sprott-Doob Mandela-Rules classifier to a single SIU person-stay and return the category.
- analyze_full_sprott_doob_feb2021() -> RichResult
Comprehensive replication of the Feb 2021 paper’s main tables.
Citation¶
Sprott, J. B., & Doob, A. N. (2021, February). Solitary Confinement, Torture, and Canada’s Structured Intervention Units. Centre for Criminology & Sociolegal Studies, U. of Toronto. URL: https://www.crimsl.utoronto.ca/sites/www.crimsl.utoronto.ca/files/TortureSolitarySIUsSprottDoob23Feb2021_0.pdf
- morie.sprott_doob.classify_mandela(days_in_siu, hrs_out_of_cell_avg, missed_full_4hrs_pct_of_days)[source]¶
Apply the Sprott-Doob Mandela-Rules classifier.
- Parameters:
days_in_siu (int) – Length of the SIU person-stay in days.
hrs_out_of_cell_avg (float) – Average hours out of cell per day during the stay.
missed_full_4hrs_pct_of_days (float) – Percent of days during the stay where the inmate did not receive the legislatively-required 4 hours out of cell. (Range: 0-100.)
- Returns:
dict with keys ‘category’ (str), ‘rule’ (str), ‘reason’ (str).
Categories
“Solitary Confinement” – Mandela Rule 44: ≤2 hrs out of cell (all days missed the full 4 hrs), stay length ≤15 days.
“Torture” – Mandela Rules 43+44: same conditions, but stay length ≥16 days.
“All other” – none of the above thresholds met.
>>> r = classify_mandela(20, 1.5, 100)
>>> r["category"]
'Torture'
>>> r = classify_mandela(8, 1.5, 100)
>>> r["category"]
'Solitary Confinement'
>>> r = classify_mandela(20, 5, 50)
>>> r["category"]
'All other'
- Return type:
- morie.sprott_doob.analyze_table13_regional_rates()[source]¶
Sprott-Doob Table 13: SIU person-stay rates per 1000 prisoners.
- Return type:
- morie.sprott_doob.analyze_table19_mandela_classification()[source]¶
Sprott-Doob Table 19: Mandela-Rules classification of N=1960 SIU stays.
- Return type:
- morie.sprott_doob.analyze_table23_regional_torture_rates()[source]¶
Sprott-Doob Table 23: regional torture/solitary rates per 1000.
- Return type:
- morie.sprott_doob.analyze_iedm_table1_population()[source]¶
Sprott, Doob & Iftene (May 2021) Table 1: IEDM-reviewed population.
- Return type:
- morie.sprott_doob.analyze_iedm_review_outcomes()[source]¶
Sprott, Doob & Iftene May 2021: IEDM review outcomes & disparities.
- Return type:
- morie.sprott_doob.analyze_table4_length_of_stay()[source]¶
Sprott-Doob (Feb 2021) Table 4: SIU length-of-stay distribution.
- Return type:
- morie.sprott_doob.analyze_table11_region_x_stay_length()[source]¶
Sprott-Doob (Feb 2021) Table 11: Region × stay length.
- Return type:
- morie.sprott_doob.analyze_table12_regional_overrepresentation()[source]¶
Sprott-Doob (Feb 2021) Table 12: Regional over-/under-rep in SIU.
- Return type:
- morie.sprott_doob.analyze_table15_region_x_mental_health()[source]¶
Sprott-Doob (Feb 2021) Table 15: Region × MH-flag.
- Return type:
- morie.sprott_doob.analyze_table22_region_x_mandela()[source]¶
Sprott-Doob (Feb 2021) Table 22: Region × Mandela groups.
- Return type:
- morie.sprott_doob.analyze_table9_iedm_decisions()[source]¶
Sprott-Doob-Iftene (May 2021) Table 9: IEDM review outcomes.
- Return type:
- morie.sprott_doob.analyze_table10_per_iedm_variance()[source]¶
Sprott-Doob-Iftene (May 2021) Table 10: per-IEDM variance.
- Return type:
- morie.sprott_doob.analyze_table15_long_stay_no_iedm()[source]¶
Sprott-Doob-Iftene (May 2021) Table 15: long-stay no-IEDM cases.
- Return type:
- morie.sprott_doob.verify_chi2(observed)[source]¶
Recompute Pearson χ² from a 2D contingency table.
Pure-function rebuild of scipy.stats.chi2_contingency without Yates correction. Intended for quick self-checks of the transcribed cell counts against published χ² values.
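The rebuild amounts to a few lines of NumPy. A minimal sketch (`pearson_chi2` is a hypothetical standalone version, not the module function):

```python
import numpy as np

def pearson_chi2(observed):
    """Pearson chi-square for a 2D contingency table, no Yates correction.

    Expected cell counts are row_total * col_total / grand_total;
    returns (chi2, degrees_of_freedom)."""
    obs = np.asarray(observed, dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    chi2 = ((obs - expected) ** 2 / expected).sum()
    dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return chi2, dof
```

Comparing the recomputed statistic against a published one is a quick check that the transcribed cell counts are faithful.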
- morie.sprott_doob.verify_published_chi_squares()[source]¶
Re-derive every published χ² from its transcribed cell counts.
Cross-checks Sprott-Doob Tables 11, 15, 22 (Feb 2021) and Sprott-Doob-Iftene Tables 5, 7, 8, 10 (May 2021). For each: rebuilds the observed contingency table from the per-row dicts, runs verify_chi2(), and reports recomputed vs. published values.
- Return type:
- morie.sprott_doob.analyze_full_sprott_doob_feb2021()[source]¶
Comprehensive replication of Sprott & Doob (Feb 2021) main tables.
- Return type:
morie.siuiap – Structured Intervention Unit Implementation Advisory Panel (federal).
NOTE: this is the FEDERAL Structured Intervention Unit Implementation Advisory Panel module. It is distinct from morie.siu, which is the ONTARIO Special Investigations Unit (police-oversight) automation package. Both share the acronym SIU but refer to entirely different agencies – the rename siu_iap -> siuiap was made to avoid the underscore-prefix collision with morie.siu.
The federal counterpart to Ontario’s OTIS provincial restrictive- confinement data is Canada’s federal Structured Intervention Unit (SIU) system, which replaced administrative segregation in federal penitentiaries in 2019. From April 2021 to December 2024, the Structured Intervention Unit Implementation Advisory Panel (SIU IAP) monitored federal SIU implementation, chaired by Howard Sapers with members including Prof. Emeritus Anthony N. Doob and Prof. Jane B. Sprott.
In addition to SIU IAP panel reports, this module also indexes:
CRIMSL UToronto research reports (Sprott / Doob / Iftene 2020-2021): four independent academic reports on CSC’s SIU operation, the role of independent external decision makers, the COVID-19 attribution claim, and solitary-confinement-as-torture framing.
Doob 2020 Federal Court affidavit (T-539-20): expert evidence on CCRSO 2018 prisoner-flow + age + release statistics, used as the source for the analyses replicated in morie.doob_trends.
This module exposes:
PANEL_MEMBERS – list of SIU IAP members
REPORTS – dict of SIU IAP panel reports (5)
CRIMSL_REPORTS – dict of UToronto CRIMSL research reports (4)
AFFIDAVITS – dict of Federal Court / inquest affidavits
PANEL_MANDATE – text describing the panel’s mandate
cite() -> str – generate a citation string
It does NOT ingest data: SIU IAP outputs and CRIMSL papers are qualitative / mixed-method PDFs. The federal SIU dataset itself (CSC Research Branch) is not available as open data; this module exists to make the federal context citable from morie analyses.
References
- Public Safety Canada. Structured Intervention Unit Implementation Advisory Panel. https://www.publicsafety.gc.ca/cnt/cntrng-crm/crrctns/siuiap-ccuis-en.aspx
- Sapers, H. et al. (2024). Final Report – Structured Intervention Unit Implementation Advisory Panel.
- Centre for Criminology & Sociolegal Studies, U. of Toronto. Reports on Canada’s Structured Intervention Units. https://www.crimsl.utoronto.ca/news/reports-canada%27s-structured-intervention-units
- morie.siuiap.cite(report_id='final_2024')[source]¶
Build a citation string for an SIU IAP / CRIMSL / affidavit entry.
Searches REPORTS, CRIMSL_REPORTS, and AFFIDAVITS in order.
>>> cite("final_2024")
'SIU IAP (2024). SIU IAP Final Report. Public Safety Canada.'
>>> "Sprott" in cite("sprott_doob_torture_solitary_2021")
True
morie.doob_trends – National-aggregate analyses from Doob’s Federal Court affidavit.
Replicates the analytical contribution of Prof. Anthony N. Doob’s expert-witness affidavit in Canadian Civil Liberties Association et al. v. The Attorney General of Canada (Federal Court file T-539-20, Application Record Vol. 3 of 5, pp. 778-795).
Doob’s national-aggregate analyses (Figures 1-4 + Tables 1-3) sit ALONGSIDE the per-row MRM modules on OTIS provincial data and the Doob chi-square family on aggregate contingency tables. Where the MRM modules test causal contrasts at the individual level on Ontario provincial data, Doob’s affidavit work tests decoupling of imprisonment rates from crime rates at the Canadian and US national-aggregate level over 50+ years.
Constants exposed¶
- CCRSO_TABLE1_RELEASES – Table 1: 5-year average annual releases by type (day / full parole / statutory) with violent / non-violent / breach revocations and successful completions.
- CCRSO_TABLE2_FLOW – Table 2: 2013/14-2017/18 prisoner flow: count / admissions / deaths / full parole releases / statutory releases.
- CCRSO_TABLE3_AGE – Table 3: 2018 age distribution – adult population, CSC in-custody count, CSC admissions – by age group (18-49, 50-59, 60+).
Functions exposed¶
- analyze_doob_table1_releases() -> RichResult
Renders Table 1 + computes overall success / revocation rates.
- analyze_doob_table2_flow() -> RichResult
Renders Table 2 + year-over-year changes + 5-year averages.
- analyze_doob_table3_age_overrepresentation() -> RichResult
Renders Table 3 + computes age-group IRR for CSC custody vs adult population (over- / under-representation).
- decoupling_test(crime_series, imprisonment_series) -> RichResult
Tests Doob’s thesis that imprisonment is decoupled from crime rate. Reports Pearson correlation between the two time series with optional Pettitt change-point detection.
- pettitt_changepoint(series) -> dict
Pettitt non-parametric change-point detection for time series.
Data sources (cited in Doob affidavit)¶
- CCRSO 2018 – Public Safety Canada, Corrections and Conditional Release Statistical Overview 2018 (released 2019). https://www.publicsafety.gc.ca/cnt/rsrcs/pblctns/ccrso-2018/
- StatsCan CANSIM/Tables 35-10-0026-01, 35-10-0177-01, 35-10-0014-01.
- Pastore, A. L. & Maguire, K. (2004). Sourcebook of Criminal Justice Statistics.
Citation¶
Doob, A. N. (2020). Affidavit (T-539-20) of Anthony Doob – Federal Court of Canada, Application Record Vol. 3 of 5. CCLA et al. v. Attorney General of Canada.
- morie.doob_trends.analyze_doob_table1_releases()[source]¶
Doob Affidavit Table 1: 5-year average annual conditional releases.
- Return type:
- morie.doob_trends.analyze_doob_table2_flow()[source]¶
Doob Affidavit Table 2: prisoner flow 2013/14-2017/18.
- Return type:
- morie.doob_trends.analyze_doob_table3_age_overrepresentation()[source]¶
Doob Affidavit Table 3: age distribution – Canada vs CSC custody.
- Return type:
- morie.doob_trends.pettitt_changepoint(series)[source]¶
Pettitt’s non-parametric change-point test (Pettitt 1979).
Returns the location of the most likely single change-point in a time series, plus the test statistic and approximate p-value.
>>> r = pettitt_changepoint([1, 1, 1, 1, 5, 5, 5, 5])
>>> r["change_point_index"]
3
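The statistic itself is simple to state: U_t = Σ_{i≤t} Σ_{j>t} sgn(x_j − x_i), K = max_t |U_t|, with approximate p-value 2·exp(−6K²/(n³+n²)). A minimal O(n²) sketch (a hypothetical helper returning a tuple, not the module function, which returns a dict):

```python
import numpy as np

def pettitt(series):
    """Pettitt (1979) single change-point test.

    Returns (change_point_index, K statistic, approximate p-value)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    sgn = np.sign(x[None, :] - x[:, None])          # sgn(x_j - x_i)
    # U_t sums sign comparisons across the split at position t
    u = np.array([sgn[: t + 1, t + 1 :].sum() for t in range(n - 1)])
    k_idx = int(np.argmax(np.abs(u)))
    K = abs(u[k_idx])
    p = 2.0 * np.exp(-6.0 * K ** 2 / (n ** 3 + n ** 2))
    return k_idx, K, min(p, 1.0)
```

The returned index is the last position of the pre-change segment, matching the doctest above.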
- morie.doob_trends.decoupling_test(crime_series, imprisonment_series, years=None)[source]¶
Doob’s thesis test: imprisonment rate decoupled from crime rate.
Computes Pearson correlation between the two equal-length time series. A correlation near 0 (or with a sign mismatch) supports Doob’s claim that imprisonment is decoupled from crime trends. Optionally runs Pettitt change-point on each series.
- morie.doob_trends.analyze_doob_full_affidavit()[source]¶
Comprehensive replication of Doob’s Federal Court affidavit aggregate analyses (Tables 1-3 + decoupling test stub).
Renders Tables 1, 2, 3 + a placeholder for Figures 1-4 (which require StatsCan CANSIM time-series data not bundled with morie).
- Return type:
Psychometrics¶
morie.psymet – Psychometric analysis (CTT, reliability, factor analysis).
Short-name API (≤6 chars, no snake_case):
crba() – Cronbach’s coefficient alpha
mcdo() – McDonald’s omega total + hierarchical
itcor() – Corrected item-total correlations
adel() – Alpha if item deleted
crel() – Composite reliability from factor loadings
ave() – Average Variance Extracted
kmo() – Kaiser-Meyer-Olkin sampling adequacy
bart() – Bartlett’s test of sphericity
paran() – Horn’s parallel analysis
splhf() – Spearman-Brown split-half reliability
idisc() – Item discrimination index
References
Cronbach (1951). Psychometrika, 16(3), 297-334.
McDonald (1999). Test Theory: A Unified Treatment.
Revelle (2024). psych R package.
- class morie.psymet.RlbRes(raw, std, avgr, k, n, ci_lo=0.0, ci_hi=0.0)[source]¶
Bases: object
Reliability result (Cronbach’s alpha).
- class morie.psymet.OmgRes(total, hier, alpha, nf, expvar=0.0)[source]¶
Bases: object
Omega result (McDonald’s omega).
- morie.psymet.itcor(data)[source]¶
Corrected item-total correlations.
Returns DataFrame: item, r_total, r_corr
- morie.psymet.crel(loads)[source]¶
Composite reliability from standardized factor loadings.
CR = (Σλ)² / ((Σλ)² + Σ(1−λ²))
- morie.psymet.ave(loads)[source]¶
Average Variance Extracted from factor loadings.
AVE = mean(λ²). Values ≥ 0.5 indicate convergent validity.
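Both formulas above are one-liners over a loadings vector; a minimal sketch with hypothetical standardized loadings (not a call into `crel()`/`ave()`):

```python
import numpy as np

loads = np.array([0.7, 0.8, 0.6, 0.75])  # hypothetical standardized loadings

# CR = (sum λ)^2 / ((sum λ)^2 + sum(1 - λ^2))
cr = loads.sum()**2 / (loads.sum()**2 + (1 - loads**2).sum())
# AVE = mean(λ^2); >= 0.5 conventionally indicates convergent validity
ave = (loads**2).mean()
```

For these loadings CR is about 0.81 and AVE just clears the 0.5 convergent-validity cutoff.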
- morie.psymet.kmo(data)[source]¶
Kaiser-Meyer-Olkin sampling adequacy.
MSA > 0.6 adequate, > 0.8 meritorious (Kaiser 1974).
- morie.psymet.bart(data)[source]¶
Bartlett’s test of sphericity.
H0: correlation matrix = identity. Rejecting H0 (p < 0.05) suggests the data are factorable.
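The classic Bartlett statistic is a function of the determinant of the correlation matrix; a sketch using the standard formula (illustrative, not morie's `bart()`; the hardcoded 12.59 is the chi-square critical value for df = 6 at alpha = .05, where a library would use e.g. `scipy.stats.chi2.sf`):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4))
x[:, 1] += x[:, 0]            # induce correlation so H0 should be rejected

n, p = x.shape
R = np.corrcoef(x, rowvar=False)
# chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, with df = p(p-1)/2
stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2          # = 6 here
reject = stat > 12.59          # chi-square critical value, df=6, alpha=.05
```

A correlated pair shrinks |R| well below 1, so the statistic is large and H0 is rejected — the matrix looks factorable.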
- morie.psymet.paran(data, nsim=100, seed=42)[source]¶
Horn’s parallel analysis – suggested number of factors.
Compares observed eigenvalues to 95th percentile of random data.
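The comparison described above can be sketched in a few lines of NumPy (an illustrative re-implementation of Horn's procedure, not morie's `paran()`; the counting rule here is the simple "eigenvalues above threshold" variant):

```python
import numpy as np

def paran_sketch(data, nsim=100, seed=42):
    """Count observed eigenvalues exceeding the 95th percentile of
    eigenvalues from same-shaped random normal data (Horn's method)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((nsim, p))
    for s in range(nsim):
        r = rng.normal(size=(n, p))
        sims[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    thresh = np.percentile(sims, 95, axis=0)   # per-position random benchmark
    return int((obs > thresh).sum())           # suggested number of factors
```

On data generated from a single common factor, only the first observed eigenvalue beats its random benchmark, so the sketch suggests one factor.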
- morie.psymet.splhf(data, method='first_last')[source]¶
Spearman-Brown split-half reliability.
method: ‘first_last’ or ‘odd_even’.
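Both split methods reduce to correlating two half-test totals and applying the Spearman-Brown prophecy formula; a minimal sketch (illustrative, not morie's `splhf()`):

```python
import numpy as np

def splhf_sketch(items, method="odd_even"):
    """Split-half reliability with Spearman-Brown correction (illustrative)."""
    k = items.shape[1]
    if method == "odd_even":
        a, b = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
    else:  # 'first_last'
        h = k // 2
        a, b = items[:, :h].sum(axis=1), items[:, h:].sum(axis=1)
    r = np.corrcoef(a, b)[0, 1]
    return 2 * r / (1 + r)   # Spearman-Brown prophecy formula
```

The correction 2r/(1+r) steps the half-test correlation up to the reliability of the full-length test.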
- morie.psymet.idisc(data, pct=0.27)[source]¶
Item discrimination index (D-statistic).
Upper/lower groups by total score (default 27%, Kelley). Returns DataFrame: item, d
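The D-statistic is the per-item mean difference between the top and bottom score groups; a minimal sketch (illustrative, not morie's `idisc()`):

```python
import numpy as np

def idisc_sketch(scores, pct=0.27):
    """D = mean item score in the top pct group minus the bottom pct group."""
    total = scores.sum(axis=1)
    order = np.argsort(total)                 # examinees sorted by total score
    g = max(1, int(round(pct * len(scores))))
    lower, upper = scores[order[:g]], scores[order[-g:]]
    return upper.mean(axis=0) - lower.mean(axis=0)   # one D per item
```

For well-functioning binary items, D is positive: high scorers answer each item correctly more often than low scorers.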
- morie.psymet.cronbach_alpha(data, *, ci=0.95)¶
Cronbach’s coefficient alpha with Feldt CI.
- morie.psymet.mcdonalds_omega(data, nf=1)¶
McDonald’s omega total and hierarchical.
- morie.psymet.item_total_correlation(data)¶
Corrected item-total correlations.
Returns DataFrame: item, r_total, r_corr
- morie.psymet.alpha_if_deleted(data)¶
Alpha if each item deleted.
Returns DataFrame: item, adel
- morie.psymet.composite_reliability(loads)¶
Composite reliability from standardized factor loadings.
CR = (Σλ)² / ((Σλ)² + Σ(1−λ²))
- morie.psymet.average_variance_extracted(loads)¶
Average Variance Extracted from factor loadings.
AVE = mean(λ²). Values ≥ 0.5 indicate convergent validity.
- morie.psymet.kmo_test(data)¶
Kaiser-Meyer-Olkin sampling adequacy.
MSA > 0.6 adequate, > 0.8 meritorious (Kaiser 1974).
- morie.psymet.bartlett_sphericity(data)¶
Bartlett’s test of sphericity.
H0: correlation matrix = identity. Rejecting H0 (p < 0.05) suggests the data are factorable.
- morie.psymet.parallel_analysis(data, nsim=100, seed=42)¶
Horn’s parallel analysis – suggested number of factors.
Compares observed eigenvalues to 95th percentile of random data.
- morie.psymet.split_half_reliability(data, method='first_last')¶
Spearman-Brown split-half reliability.
method: ‘first_last’ or ‘odd_even’.
- morie.psymet.item_discrimination(data, pct=0.27)¶
Item discrimination index (D-statistic).
Upper/lower groups by total score (default 27%, Kelley). Returns DataFrame: item, d
Function namespace morie.fn¶
The morie.fn namespace exposes 36,000+ individual function
files, indexed by a registry. The full registry is the canonical
catalogue:
Star Wars naming registry for morie.fn functions.
Single source of truth: short name -> metadata. Used by fn/__init__.py for auto-export and stat_commands.py for CLI dispatch.
“Knowledge itself is power.” – Francis Bacon
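The registry pattern described above can be sketched as a frozen dataclass plus a name-keyed dict. Only the FnEntry field names below come from the documented signature; the `REGISTRY` dict, its example entry, and the `cheatsheet` body are hypothetical stand-ins:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FnEntry:
    """One registry entry: short name -> metadata."""
    short: str
    full: str
    category: str
    description: str
    quote: str = ""

# Hypothetical single-source-of-truth mapping, as used for auto-export
# (fn/__init__.py) and CLI dispatch (stat_commands.py).
REGISTRY = {
    "welcht": FnEntry("welcht", "welch_t_test", "Test",
                      "Welch's unequal-variance t-test"),
}

def cheatsheet(category=None):
    """Print entries, optionally filtered by category."""
    for name, e in REGISTRY.items():
        if category is None or e.category == category:
            print(f"{name:8s} {e.category:10s} {e.description}")
```

Keeping all metadata in one dict means the importable names and the CLI commands can never drift apart.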
- class morie.fn._registry.FnEntry(short, full, category, description, quote='')[source]¶
Bases: object
Registry entry for one fn/ function.
- morie.fn._registry.cheatsheet(category=None)[source]¶
Print a Star Wars-themed cheatsheet of all morie.fn functions.
>>> from morie.fn._registry import cheatsheet
>>> cheatsheet()
>>> cheatsheet("Test")  # filter by category
- Parameters:
category (str | None)
- Return type:
None
Result containers shared across the package:
Shared result dataclasses for morie.fn functions.
- class morie.fn._containers.RlbRes(raw, std, avgr, k, n, ci_lo=0.0, ci_hi=0.0)[source]¶
Bases: object
Reliability result (Cronbach’s alpha).
- class morie.fn._containers.OmgRes(total, hier, alpha, nf, expvar=0.0)[source]¶
Bases: object
Omega result (McDonald’s omega).
- class morie.fn._containers.BrtRes(chisq, df, pval)[source]¶
Bases: object
Bartlett’s sphericity result.
- class morie.fn._containers.RplRes(counts, props, year, sex=None)[source]¶
Bases: object
Regional placement result.
- class morie.fn._containers.AstRes(data, summary)[source]¶
Bases: object
Alert-state combination result.
- class morie.fn._containers.OtDmlR(ate, ate_se, ate_pval, att, att_se, att_pval, n, method='IRM')[source]¶
Bases: object
OTIS DML result.
- Parameters:
- class morie.fn._containers.ESRes(measure, estimate, ci_lower=None, ci_upper=None, se=None, n=None, extra=<factory>)[source]¶
Bases: object
Standardised result for every effect-size calculation.
- Parameters:
- class morie.fn._containers.TestResult(test_name, statistic, p_value, df=None, method='', n=None, extra=<factory>)[source]¶
Bases: object
Result from a hypothesis test.
- Parameters:
- class morie.fn._containers.RegressionResult(method, coefficients, se, p_values, r_squared=None, adj_r_squared=None, residuals=None, fitted=None, n=0, k=0, extra=<factory>)[source]¶
Bases: object
Result from a regression model.
- Parameters:
- class morie.fn._containers.DescriptiveResult(name, value=None, extra=<factory>)[source]¶
Bases: object
Result from a descriptive / exploratory function.
- class morie.fn._containers.TimeSeriesResult(name, values=None, lags=None, ci_upper=None, ci_lower=None, extra=<factory>)[source]¶
Bases: object
Result from a time-series analysis.
- Parameters:
- class morie.fn._containers.SurvivalResult(name, times=None, survival=None, ci_lower=None, ci_upper=None, median_survival=None, n_events=0, n_censored=0, extra=<factory>)[source]¶
Bases: object
Result from a survival analysis.
- Parameters:
- class morie.fn._containers.DiagnosticResult(name, estimate, ci_lower=None, ci_upper=None, n=0, extra=<factory>)[source]¶
Bases: object
Result from a diagnostic accuracy metric.
- Parameters:
- class morie.fn._containers.IRTResult(model, item_params, theta=None, se_theta=None, fit=<factory>, info=None)[source]¶
Bases: object
Result from an IRT model.
- Parameters:
- class morie.fn._containers.DIFResult(method, items=<factory>, flagged=<factory>, group_var='')[source]¶
Bases: object
Result from a DIF analysis.
- class morie.fn._containers.InvarianceResult(level, fit=<factory>, delta_fit=<factory>, passed=False)[source]¶
Bases: object
Result from a measurement invariance test.
- class morie.fn._containers.RecidivismResult(rate, n_recid, n_total, ci_lower=0.0, ci_upper=1.0, subgroup=None, method='proportion')[source]¶
Bases: object
Result from a recidivism analysis.
- Parameters:
- class morie.fn._containers.CustodyResult(metric, value, n=0, period=None, region=None, extra=<factory>)[source]¶
Bases: object
Result from a custody/institutional metric.
- class morie.fn._containers.PcaRes(components, explained_variance, explained_variance_ratio, scores, n=0, p=0)[source]¶
Bases: object
Principal Component Analysis result.
- Parameters:
- class morie.fn._containers.FaRes(loadings, communalities, eigenvalues, variance_explained, n_factors=0, rotation='varimax')[source]¶
Bases: object
Exploratory Factor Analysis result.
- Parameters:
- class morie.fn._containers.CfaRes(cfi, tli, rmsea, srmr, loadings=<factory>, residuals=None)[source]¶
Bases: object
Confirmatory Factor Analysis result.
- Parameters:
- class morie.fn._containers.MdsRes(coordinates, stress, eigenvalues)[source]¶
Bases: object
Multidimensional Scaling result.
- class morie.fn._containers.TsneRes(embedding)[source]¶
Bases: object
t-SNE result.
- Parameters:
embedding (ndarray)
- class morie.fn._containers.UmapRes(embedding)[source]¶
Bases: object
UMAP result.
- Parameters:
embedding (ndarray)
- class morie.fn._containers.KmeansRes(labels, centers, inertia, n_iter)[source]¶
Bases: object
K-means clustering result.
- class morie.fn._containers.HclstRes(labels, linkage_matrix, distances=None)[source]¶
Bases: object
Hierarchical clustering result.
- class morie.fn._containers.LdaRes(components, explained_variance_ratio, projected)[source]¶
Bases: object
Linear Discriminant Analysis result.
- class morie.fn._containers.DbscnRes(labels, n_clusters, n_noise)[source]¶
Bases: object
DBSCAN clustering result.
- class morie.fn._containers.SIRResult(model, t=None, S=None, I=None, R=None, E=None, R0=None, extra=<factory>)[source]¶
Bases: object
Result from a compartmental model (SIR/SEIR/etc).
- Parameters:
- class morie.fn._containers.SpatialResult(name, statistic, p_value=None, expected=None, variance=None, local_values=None, extra=<factory>)[source]¶
Bases: object
Result from a spatial statistic.
- Parameters:
- class morie.fn._containers.GenomicsResult(name, statistic, p_value=None, n=0, extra=<factory>)[source]¶
Bases: object
Result from a genomics/population genetics test.
- class morie.fn._containers.CrimeResult(name, rate, ci_lower=None, ci_upper=None, n=0, population=0, extra=<factory>)[source]¶
Bases: object
Result from a criminological metric.
- Parameters:
- class morie.fn._containers.SignalResult(name, filtered=None, fs=0.0, n_samples=0, extra=<factory>)[source]¶
Bases: object
Result from a signal processing function.
- class morie.fn._containers.CryptoResult(algorithm, operation, success=True, extra=<factory>)[source]¶
Bases: object
Result from a cryptographic operation.
Rich-output result containers for morie.fn.
R-style verbose __repr__ so users see paragraph-level summaries
when they print a result, not just a bare dict – modeled after
psych::omega(), lavaan summary(), survival::summary.coxph(), etc.
Usage:
res = welcht([1,2,3], [4,5,6])
print(res) # multi-section ASCII summary
res.statistic # still attribute-accessible like a dict
res.warnings # list of advisory strings
- class morie.fn._richresult.RichResult(title='', call='', summary_lines=<factory>, sections=<factory>, tables=<factory>, comparison=None, extras=<factory>, warnings=<factory>, interpretation='', payload=<factory>)[source]¶
Bases: dict
Generic rich-printing result dict-like – psych::omega style.
Inherits from dict so isinstance(result, dict) is True and legacy callers that expect a plain dict (.get(…), for k in r, result[“statistic”]) keep working unchanged. The RichResult layer adds title / sections / tables / interpretation / warnings on top – opt-in via print(result) or result.summary().
Supports multi-section output:
- title – short test/model name (top of output)
- call – optional call signature line (“Call: foo(x, y, …)”)
- summary_lines – headline metrics: list of (label, value) tuples
- sections – list of {“title”: str, “lines”: [(lbl, val)], “table”: […]} dicts for additional sections (each printed with its own header)
- tables – list of {“title”: str, “headers”: […], “rows”: [[…]]} for tabular output (column-aligned)
- comparison – optional dict with the same shape as a section, printed under a “Compare this with …” heading
- extras – list of free-form paragraph blocks
- warnings – list of advisory strings – surfaced VERBATIM, multi-line OK
- interpretation – optional plain-language concluding sentence(s)
- payload – arbitrary dict for programmatic access
- Parameters:
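The dict-subclass design can be sketched in miniature. Everything below is a simplified stand-in, not morie's code; it only illustrates why inheriting from dict keeps legacy callers working while print() gains a rich summary:

```python
class MiniRichResult(dict):
    """Toy version of the RichResult idea: dict compatibility + rich repr."""

    def __init__(self, title="", summary_lines=None, **payload):
        super().__init__(payload)          # plain-dict behaviour preserved
        self.title = title
        self.summary_lines = summary_lines or []

    def summary(self):
        out = [self.title, "-" * len(self.title)]
        out += [f"{lbl:>12}: {val}" for lbl, val in self.summary_lines]
        return "\n".join(out)

    __repr__ = summary                     # print(res) shows the summary

res = MiniRichResult("Welch t-test",
                     summary_lines=[("t", 2.31), ("p", 0.024)],
                     statistic=2.31, p_value=0.024)
res["statistic"]   # legacy dict access still works
```

Because the subclass stores the payload in the dict itself, `isinstance(res, dict)`, `.get(...)`, and key iteration all behave exactly as callers expecting a plain dict assume.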
- morie.fn._richresult.hypothesis_test_result(*, test_name, statistic, pvalue, df=None, alpha=0.05, extra_summary=None, warnings=None, extra_payload=None, callable_name=None)[source]¶
Factory for hypothesis-test results with a uniform layout.
callable_name (optional) – short name of the morie.fn callable that produced this result. If provided, a “Learn more” pointer suggesting describe(<name>) for the full pedagogical guide is appended to the extras. Callers can pass it explicitly or rely on the inspect-stack fallback in attach_describe_pointer().
- morie.fn._richresult.with_describe_pointer(result, callable_name)[source]¶
Add a describe() pointer to an existing RichResult.
Used by callables that return RichResult directly (without going through hypothesis_test_result()) to opt-in to the same convention.
- Parameters:
result (RichResult)
callable_name (str)
- Return type: