Double Machine Learning (DML)¶
Part of Statistical Methods — MORIE’s statistical-methods reference.
MORIE implements the Partially Linear Regression (PLR) model from Chernozhukov et al. (2018) via the :pypi:`DoubleML` package.
Partially Linear Regression¶
The PLR model posits

\[
Y = D\theta_0 + g_0(X) + \zeta, \qquad \mathbb{E}[\zeta \mid D, X] = 0,
\]
\[
D = m_0(X) + V, \qquad \mathbb{E}[V \mid X] = 0,
\]
where \(D\) is the treatment, \(X\) are confounders, and \(g_0, m_0\) are unknown nuisance functions estimated nonparametrically. The parameter of interest is \(\theta_0\) (ATE under PLR).
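To make the model concrete, here is a small NumPy simulation of a PLR data-generating process (the particular \(m_0\) and \(g_0\) below are illustrative choices, not MORIE defaults). Because \(D\) and \(g_0(X)\) both increase with the same confounder, a naive regression of \(Y\) on \(D\) is biased:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta_0 = 0.5                        # true treatment effect

X = rng.normal(size=(n, 2))          # confounders
m_0 = np.tanh(X[:, 0])               # illustrative propensity function m_0(X)
g_0 = X[:, 0] + np.cos(X[:, 1])      # illustrative outcome function g_0(X)

D = m_0 + rng.normal(size=n)                 # D = m_0(X) + V
Y = theta_0 * D + g_0 + rng.normal(size=n)   # Y = D*theta_0 + g_0(X) + zeta

# Naive OLS slope of Y on D: biased upward, since D and g_0(X)
# are both driven by the confounder X[:, 0].
naive = np.polyfit(D, Y, 1)[0]
```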
Cross-fitting¶
To avoid regularization bias from using the same sample for nuisance and target parameter estimation, DML uses K-fold cross-fitting:
1. Split the data into \(K\) folds \(\mathcal{I}_1, \ldots, \mathcal{I}_K\).
2. For each fold \(k\): fit the nuisance models on the complement \(\mathcal{I}^c_k\), predict on \(\mathcal{I}_k\).
3. Form residuals: \(\tilde{Y}_i = Y_i - \hat{g}_0(X_i)\), \(\tilde{D}_i = D_i - \hat{m}_0(X_i)\).
4. Regress \(\tilde{Y}\) on \(\tilde{D}\) to obtain \(\hat{\theta}_0\).
The resulting score is Neyman orthogonal, and \(\hat{\theta}_0\) achieves \(\sqrt{n}\)-consistency under mild rate conditions on the nuisance estimators.
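The four steps can be sketched directly with scikit-learn on simulated data. This is a minimal hand-rolled sketch of the algorithm for exposition; MORIE delegates this logic to the DoubleML package:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, theta_0 = 4000, 0.5
X = rng.normal(size=(n, 3))
D = np.tanh(X[:, 0]) + rng.normal(size=n)                          # D = m_0(X) + V
Y = theta_0 * D + X[:, 0] + np.cos(X[:, 1]) + rng.normal(size=n)   # PLR outcome

Y_res, D_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit nuisances on the complement of fold k, predict on the held-out fold.
    # g_hat estimates the outcome regression E[Y | X]; m_hat estimates E[D | X].
    g_hat = RandomForestRegressor(n_estimators=100, max_depth=5,
                                  random_state=0).fit(X[train], Y[train])
    m_hat = RandomForestRegressor(n_estimators=100, max_depth=5,
                                  random_state=0).fit(X[train], D[train])
    Y_res[test] = Y[test] - g_hat.predict(X[test])
    D_res[test] = D[test] - m_hat.predict(X[test])

# Final stage: regress residualized Y on residualized D (no intercept).
theta_hat = (D_res @ Y_res) / (D_res @ D_res)
```

Because every observation is predicted by models fit on the other folds, the regularization bias of the random forests does not propagate into \(\hat{\theta}_0\).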
Neyman orthogonality¶
The score function \(\psi(W; \theta, \eta)\), with nuisance parameter \(\eta\), satisfies

\[
\left.\partial_r \, \mathbb{E}\bigl[\psi\bigl(W; \theta_0, \eta_0 + r\,(\eta - \eta_0)\bigr)\bigr]\right|_{r=0} = 0,
\]

ensuring that first-order errors in \(\hat{\eta}\) do not bias \(\hat{\theta}\).
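Concretely, the partialling-out score used for PLR is Neyman orthogonal. Writing \(\ell_0(X) = \mathbb{E}[Y \mid X]\) for the outcome regression fitted in the cross-fitting step, and \(\zeta\), \(V\) for the outcome and treatment errors of the PLR model, a derivation sketch:

```latex
% Partialling-out score for PLR, with nuisance \eta = (\ell, m):
\psi(W; \theta, \eta) = \bigl(Y - \ell(X) - \theta\,(D - m(X))\bigr)\,(D - m(X))

% At the truth, Y - \ell_0(X) = \theta_0 V + \zeta, so the score reduces to \zeta V.
% The Gateaux derivative in a nuisance direction (\delta\ell, \delta m) is
\partial_r \, \mathbb{E}\bigl[\psi\bigl(W; \theta_0,
    \eta_0 + r\,(\delta\ell, \delta m)\bigr)\bigr]\Big|_{r=0}
  = -\,\mathbb{E}\bigl[V \,\delta\ell(X)\bigr]
    + \mathbb{E}\bigl[(\theta_0 V - \zeta)\,\delta m(X)\bigr] = 0

% Both terms vanish by iterated expectations,
% since E[V | X] = 0 and E[\zeta | D, X] = 0.
```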
MORIE implementation¶
Python entry point: morie.effects.estimate_ate()
Default nuisance learners:
Outcome nuisance \(g_0\): sklearn.ensemble.RandomForestRegressor
Propensity nuisance \(m_0\): sklearn.ensemble.RandomForestClassifier
Default: n_folds=5, n_rep=1.
from morie.effects import estimate_ate

result = estimate_ate(
    df,
    treatment="cannabis_any_use",
    outcome="heavy_drinking_30d",
    covariates=["age_group", "gender", "province_region", "mental_health"],
)
print(result)  # {"ate": ..., "se": ..., "ci_lower": ..., "ci_upper": ...}
Interactive Regression Model (IRM)¶
The IRM extends PLR to allow heterogeneous treatment effects for a binary treatment \(D \in \{0, 1\}\). The model posits:

\[
Y = g_0(D, X) + U, \qquad \mathbb{E}[U \mid D, X] = 0,
\]
\[
D = m_0(X) + V, \qquad \mathbb{E}[V \mid X] = 0.
\]
Unlike PLR, the outcome function \(g_0(D, X)\) interacts treatment \(D\) with covariates \(X\), making the model suitable when the conditional average treatment effect (CATE) varies across individuals. The target estimand is the ATE:

\[
\theta_0 = \mathbb{E}\bigl[g_0(1, X) - g_0(0, X)\bigr],
\]

with the doubly-robust (AIPW) score:

\[
\psi(W; \theta, \eta) = g(1, X) - g(0, X)
  + \frac{D\,(Y - g(1, X))}{m(X)}
  - \frac{(1 - D)\,(Y - g(0, X))}{1 - m(X)} - \theta.
\]
Python entry point: morie.causal.estimate_irm()
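A minimal NumPy sketch of the doubly-robust score, evaluated here with the oracle (true) nuisance functions so the mechanics are visible; in practice \(g\) and \(m\) are cross-fitted exactly as in the PLR section:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

X = rng.normal(size=(n, 2))
m_0 = 0.2 + 0.6 / (1.0 + np.exp(-X[:, 0]))   # true propensity, bounded in (0.2, 0.8)
D = (rng.random(n) < m_0).astype(float)      # binary treatment

def g_0(d, x):
    # True outcome function with a heterogeneous effect; since E[X_2] = 0,
    # the ATE is E[g_0(1, X) - g_0(0, X)] = 1.0.
    return d * (1.0 + 0.5 * x[:, 1]) + x[:, 0]

Y = g_0(D, X) + rng.normal(size=n)

# Doubly-robust (AIPW) score: regression term plus inverse-propensity
# corrections for treated and control observations.
g1, g0, m = g_0(1.0, X), g_0(0.0, X), m_0
scores = g1 - g0 + D * (Y - g1) / m - (1 - D) * (Y - g0) / (1 - m)
theta_hat = scores.mean()
```

The score is "doubly robust": its mean recovers \(\theta_0\) if either the outcome model or the propensity model is correct.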
Partially Linear IV (PLIV) — LATE estimation¶
When treatment \(D\) is endogenous, a valid instrument \(Z\) (correlated with \(D\) but independent of \(\varepsilon\) given \(X\)) identifies the Local Average Treatment Effect (LATE):

\[
\theta_{\mathrm{LATE}} = \mathbb{E}\bigl[Y(1) - Y(0) \mid D(1) > D(0)\bigr],
\]

i.e. the average effect among compliers (Imbens and Angrist, 1994).
The Partially Linear IV model is:

\[
Y = D\theta_0 + g_0(X) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid Z, X] = 0,
\]
\[
D = m_0(Z, X) + V, \qquad \mathbb{E}[V \mid Z, X] = 0.
\]
Cross-fitting proceeds as in PLR, with the additional first stage estimating \(\hat{m}_0(Z, X)\).
Python entry point: morie.effects.estimate_pliv()
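For a binary instrument without covariates, the LATE coincides with the Wald ratio. A small simulation illustrates the estimand (this is not the PLIV estimator itself, just the identification logic):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
Z = rng.integers(0, 2, size=n)       # binary instrument, e.g. random encouragement
complier = rng.random(n) < 0.6       # 60% compliers; the rest are never-takers,
D = np.where(complier, Z, 0)         # so monotonicity D(1) >= D(0) holds
tau = 2.0                            # true effect among compliers
Y = tau * D + rng.normal(size=n)

# Wald ratio: intention-to-treat effect divided by the first-stage effect.
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
```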
Nuisance learner defaults¶
PLR, \(g_0\) (outcome): sklearn.ensemble.RandomForestRegressor (100 trees, max_depth=5).
PLR, \(m_0\) (propensity): sklearn.ensemble.RandomForestClassifier (100 trees, max_depth=5).
IRM, \(g_0(d, X)\) (outcome × treatment): sklearn.ensemble.RandomForestRegressor.
IRM, \(m_0(X)\) (propensity): sklearn.ensemble.RandomForestClassifier.
PLIV, \(g_0(X)\) (outcome residual): sklearn.ensemble.RandomForestRegressor.
PLIV, \(m_0(Z, X)\) (first stage): sklearn.ensemble.RandomForestRegressor.
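For reference, the PLR defaults above correspond to the following scikit-learn constructors; any estimator exposing the same fit/predict interface could be substituted:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# PLR defaults as listed above: 100 trees, maximum depth 5.
ml_g = RandomForestRegressor(n_estimators=100, max_depth=5)   # outcome nuisance g_0
ml_m = RandomForestClassifier(n_estimators=100, max_depth=5)  # propensity nuisance m_0
```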
References¶
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68. https://doi.org/10.1111/ectj.12097
Bach P, Chernozhukov V, Kurz MS, Spindler M (2022). DoubleML — An object-oriented implementation of double machine learning in Python. JMLR, 23(53):1–6.
Imbens GW, Angrist JD (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475. https://doi.org/10.2307/2951620