Dataset-Agnostic Analysis

Part of Statistical Methods — MORIE’s statistical-methods reference.

Through the morie.dataset module, MORIE can profile and analyse any tabular dataset without prior knowledge of its schema. This capability is essential when working with novel administrative health data, new survey waves, or datasets other than CPADS.

Levels of Measurement

MORIE follows Stevens' (1946) four-level typology when classifying columns:

Nominal — categories, no order. Operations: =, ≠. Examples: sex, province, ethnicity.
Ordinal — ordered categories. Operations: =, ≠, <, >. Examples: Likert scale (1–5), severity grade.
Interval — equal intervals, no true zero. Operations: +, −, mean. Examples: year, temperature (°C), index score.
Ratio — equal intervals + true zero. Operations: +, −, ×, ÷, geometric mean. Examples: income, age, count, weight.
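
The hierarchy above is cumulative: each level permits everything the levels below it permit. A minimal sketch of that idea, using the classic pairing of summary statistics with levels (mode, median, mean, geometric mean). The names `Level` and `admissible_stats` are hypothetical and are not part of morie.dataset.

```python
from enum import IntEnum


class Level(IntEnum):
    """Stevens' four levels, ordered so that comparisons reflect the hierarchy."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Summary statistics that first become meaningful at each level.
_STATS_INTRODUCED = {
    Level.NOMINAL: ["mode"],
    Level.ORDINAL: ["median"],
    Level.INTERVAL: ["mean", "standard deviation"],
    Level.RATIO: ["geometric mean", "coefficient of variation"],
}


def admissible_stats(level: Level) -> list[str]:
    """Return every statistic permitted at `level`, including inherited ones."""
    stats: list[str] = []
    for lvl in Level:  # iterates in definition order, lowest level first
        if lvl <= level:
            stats.extend(_STATS_INTRODUCED[lvl])
    return stats
```

For example, `admissible_stats(Level.ORDINAL)` yields the mode and the median but not the mean, which is why a Likert item is usually summarised by its median.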

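Inference rules used by morie.dataset.infer_measurement_level(). Several rules overlap (a 0/1 integer column also satisfies min ≥ 0), so the list is best read top to bottom, with earlier rules taking precedence: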

Object / category dtype + ≤ ordinal_threshold unique values → Ordinal
Object / category dtype + > ordinal_threshold unique values → Nominal
Integer / float + exactly 2 unique values → Nominal (binary)
Integer / float + ≤ 20 unique values + name suggests rank/scale → Ordinal
Float + name contains year, index, score → Interval
Float or integer + min ≥ 0 → Ratio
All other numeric → Interval
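
The rule list can be read as a first-match cascade. The sketch below is one way to express it with pandas and is not the MORIE implementation: the function name `infer_level_sketch`, the default `ordinal_threshold=10`, the rank/scale name tokens, and the top-to-bottom evaluation order are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype

# Assumed name hints; the tokens MORIE checks for "rank/scale" are not documented here.
RANK_HINTS = ("rank", "scale", "likert", "grade")
INTERVAL_HINTS = ("year", "index", "score")


def infer_level_sketch(s: pd.Series, ordinal_threshold: int = 10) -> str:
    """Apply the documented rules top to bottom; first match wins (assumed)."""
    name = str(s.name).lower()
    n_unique = s.nunique(dropna=True)

    # Non-numeric (object / category) columns: split on cardinality alone.
    if not is_numeric_dtype(s):
        return "ordinal" if n_unique <= ordinal_threshold else "nominal"
    # Numeric columns with exactly two values are treated as binary nominal.
    if n_unique == 2:
        return "nominal"
    # Few distinct values plus a rank-like name suggests an ordered scale.
    if n_unique <= 20 and any(h in name for h in RANK_HINTS):
        return "ordinal"
    # Floats whose name suggests an arbitrary zero point.
    if is_float_dtype(s) and any(h in name for h in INTERVAL_HINTS):
        return "interval"
    # Non-negative numerics are assumed to have a true zero.
    if s.min() >= 0:
        return "ratio"
    return "interval"
```

Because the rules rely on dtypes and column names, a sketch like this will misclassify some columns (an unordered low-cardinality string column such as province comes out as ordinal), so inferred levels are worth reviewing before analysis.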

Dataset Profiling

morie.dataset.profile_dataset() builds a morie.dataset.DatasetProfile containing a morie.dataset.ColumnProfile for every column.

Each ColumnProfile records:

level — inferred measurement level (nominal, ordinal, interval, or ratio; abbreviated NOIR)
missing_pct — fraction of missing values
is_binary — True when only two distinct non-null values exist
suggested_role — one of treatment, outcome, covariate, weight, stratum, cluster, id
summary_stats — mean/SD (numeric) or top-k category counts (categorical)
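
To make the per-column record concrete, here is a minimal sketch of how three of these fields could be computed with pandas. `ColumnProfileSketch` and `profile_column_sketch` are hypothetical names, the `top_k` parameter is an assumption, and the `level` and `suggested_role` fields are omitted; the real morie.dataset.ColumnProfile may differ.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class ColumnProfileSketch:
    """Illustrative stand-in for a per-column profile (not the MORIE class)."""
    name: str
    missing_pct: float          # fraction of missing values, in [0, 1]
    is_binary: bool             # exactly two distinct non-null values
    summary_stats: dict = field(default_factory=dict)


def profile_column_sketch(s: pd.Series, top_k: int = 3) -> ColumnProfileSketch:
    non_null = s.dropna()
    if pd.api.types.is_numeric_dtype(s):
        # Numeric columns: mean and sample standard deviation.
        stats = {"mean": float(non_null.mean()), "sd": float(non_null.std())}
    else:
        # Categorical columns: counts of the top_k most frequent categories.
        stats = non_null.value_counts().head(top_k).to_dict()
    return ColumnProfileSketch(
        name=str(s.name),
        missing_pct=float(s.isna().mean()),
        is_binary=non_null.nunique() == 2,
        summary_stats=stats,
    )
```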

Role detection heuristics

Treatment column candidates: binary column with a name containing treat, cannabis, drug, alcohol, intervention, exposed, or assigned.

Outcome column candidates: column whose name starts or ends with outcome, result, y_, response, _freq, _harm, _drink, or disorder.

Weight column candidates: column name containing weight, wt, pw, or survey_wt.

If the user supplies hints via hint_treatment, hint_outcome, or hint_weights, these override the heuristic detection.
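
The heuristics and the hint override can be sketched as simple name matching. This is an illustration only: `detect_roles_sketch` is a hypothetical function, column names are assumed to be strings, and picking the first matching column per role is an assumption about tie-breaking that the text does not specify.

```python
import pandas as pd

TREATMENT_TOKENS = ("treat", "cannabis", "drug", "alcohol",
                    "intervention", "exposed", "assigned")
OUTCOME_TOKENS = ("outcome", "result", "y_", "response",
                  "_freq", "_harm", "_drink", "disorder")
WEIGHT_TOKENS = ("weight", "wt", "pw", "survey_wt")


def _first(cols, pred):
    """Return the first column satisfying pred, or None."""
    return next((c for c in cols if pred(c)), None)


def detect_roles_sketch(df: pd.DataFrame, hint_treatment=None,
                        hint_outcome=None, hint_weights=None) -> dict:
    cols = list(df.columns)

    def is_binary(c):
        return df[c].dropna().nunique() == 2

    # A user-supplied hint always wins over the name-based heuristic.
    treatment = hint_treatment or _first(
        cols, lambda c: is_binary(c)
        and any(t in c.lower() for t in TREATMENT_TOKENS))
    weights = hint_weights or _first(
        cols, lambda c: any(t in c.lower() for t in WEIGHT_TOKENS))
    # Outcome tokens are matched as a prefix or suffix of the column name.
    outcome = hint_outcome or _first(
        cols, lambda c: c not in (treatment, weights)
        and (c.lower().startswith(OUTCOME_TOKENS)
             or c.lower().endswith(OUTCOME_TOKENS)))
    return {"treatment": treatment, "outcome": outcome, "weights": weights}
```

Short tokens such as wt and pw match many unrelated names under substring matching, which is one reason explicit hints are useful on unfamiliar data.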

Analysis Plan Suggestion

morie.dataset.suggest_analysis_plan() inspects the DatasetProfile and returns an ordered list of suggested analyses. Each suggestion is a dict:

{
    "analysis": "ipw_ate",
    "rationale": "Binary treatment + continuous outcome detected",
    "required_vars": {"treatment": "cannabis_use", "outcome": "drink_freq"},
    "optional_vars": {"weights": "weight_var"},
}

The plan covers associational statistics (prevalence, χ², correlation), causal estimates (IPW, AIPW, DML), and regression models, ordered from simplest to most assumption-intensive.
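
Because each suggestion is a plain dict, a plan can be filtered with ordinary Python before anything is run. The sketch below keeps only the steps whose required variables exist in a given set of columns. `runnable_steps` is a hypothetical helper, and the second entry in the example plan (including its "prevalence" key and column name) is made up for illustration.

```python
def runnable_steps(plan: list[dict], columns) -> list[dict]:
    """Keep suggestions whose required variables are all present in `columns`."""
    available = set(columns)
    return [
        step for step in plan
        if set(step["required_vars"].values()) <= available
    ]


# Example plan in the documented dict shape; the second entry is hypothetical.
example_plan = [
    {
        "analysis": "ipw_ate",
        "rationale": "Binary treatment + continuous outcome detected",
        "required_vars": {"treatment": "cannabis_use", "outcome": "drink_freq"},
        "optional_vars": {"weights": "weight_var"},
    },
    {
        "analysis": "prevalence",
        "rationale": "Hypothetical step that needs a column we do not have",
        "required_vars": {"outcome": "not_in_data"},
        "optional_vars": {},
    },
]

kept = runnable_steps(example_plan, ["cannabis_use", "drink_freq", "age"])
```

Here `kept` contains only the ipw_ate step, since not_in_data is absent from the column list. The list comprehension preserves the plan's original ordering.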

Usage Example

import pandas as pd
from morie.dataset import load_dataset, profile_dataset, suggest_analysis_plan

df = load_dataset("data/my_survey.csv")
profile = profile_dataset(df, hint_treatment="cannabis_use")

# Print rich-formatted summary table
profile.summary_table()

# Get suggested analysis plan
for step in suggest_analysis_plan(profile):
    print(step["analysis"], "—", step["rationale"])

CLI usage:

morie profile-dataset --csv data/my_survey.csv

References

Stevens SS (1946). On the theory of scales of measurement. Science, 103(2684):677–680. https://doi.org/10.1126/science.103.2684.677
Wickham H (2014). Tidy data. Journal of Statistical Software, 59(10):1–23. https://doi.org/10.18637/jss.v059.i10
