Dataset-Agnostic Analysis¶
Part of Statistical Methods — MORIE’s statistical-methods reference.
MORIE can profile and analyse any tabular dataset without prior knowledge
of its schema via the morie.dataset module. This capability is essential
when working with novel administrative health data, new survey waves, or
datasets outside of CPADS.
Levels of Measurement¶
MORIE follows Stevens (1946) four-level typology when classifying columns:
Nominal — categories, no order. Operations: =, ≠. Examples: sex, province, ethnicity.
Ordinal — ordered categories. Operations: =, ≠, <, >. Examples: Likert scale (1–5), severity grade.
Interval — equal intervals, no true zero. Operations: +, −, mean. Examples: year, temperature (°C), index score.
Ratio — equal intervals + true zero. Operations: +, −, ×, ÷, geometric mean. Examples: income, age, count, weight.
Inference rules used by morie.dataset.infer_measurement_level():
Object / category dtype + ≤
ordinal_thresholdunique values → OrdinalObject / category dtype + >
ordinal_thresholdunique values → NominalInteger / float + exactly 2 unique values → Nominal (binary)
Integer / float + ≤ 20 unique values + name suggests rank/scale → Ordinal
Float + name contains year, index, score → Interval
Float or integer + min ≥ 0 → Ratio
All other numeric → Interval
Dataset Profiling¶
morie.dataset.profile_dataset() builds a morie.dataset.DatasetProfile
containing a morie.dataset.ColumnProfile for every column.
Each ColumnProfile records:
level— inferred measurement level (NOIR)missing_pct— fraction of missing valuesis_binary— True when only two distinct non-null values existsuggested_role— one oftreatment,outcome,covariate,weight,stratum,cluster,idsummary_stats— mean/SD (numeric) or top-k category counts (categorical)
Role detection heuristics¶
Treatment column candidates: binary column with a name containing
treat, cannabis, drug, alcohol, intervention,
exposed, or assigned.
Outcome column candidates: column named with a suffix / prefix matching
outcome, result, y_, response, _freq, _harm,
_drink, or disorder.
Weight column candidates: column name containing weight, wt,
pw, or survey_wt.
If the user supplies hints via hint_treatment, hint_outcome, or
hint_weights, these override the heuristic detection.
Analysis Plan Suggestion¶
morie.dataset.suggest_analysis_plan() inspects the
DatasetProfile and returns an ordered list of suggested analyses.
Each suggestion is a dict:
{
"analysis": "ipw_ate",
"rationale": "Binary treatment + continuous outcome detected",
"required_vars": {"treatment": "cannabis_use", "outcome": "drink_freq"},
"optional_vars": {"weights": "weight_var"},
}
The plan covers associational statistics (prevalence, χ², correlation), causal estimates (IPW, AIPW, DML), and regression models, ordered from simplest to most assumption-intensive.
Usage Example¶
import pandas as pd
from morie.dataset import load_dataset, profile_dataset, suggest_analysis_plan
df = load_dataset("data/my_survey.csv")
profile = profile_dataset(df, hint_treatment="cannabis_use")
# Print rich-formatted summary table
profile.summary_table()
# Get suggested analysis plan
for step in suggest_analysis_plan(profile):
print(step["analysis"], "—", step["rationale"])
CLI usage:
morie profile-dataset --csv data/my_survey.csv
References¶
Stevens SS (1946). On the theory of scales of measurement. Science, 103(2684):677–680. https://doi.org/10.1126/science.103.2684.677
Wickham H (2014). Tidy Data. JOSS, 59(10):1–23. https://doi.org/10.18637/jss.v059.i10