R API

Part of the MORIE API Reference.

Reference for every public function exported by the morie R package. Signatures and descriptions come from the Roxygen2 .Rd files in r-package/morie/man/; see Statistical Methods for the methodology behind each function.

Causal estimators

Augmented IPW (AIPW) doubly-robust ATE estimator.

Combines IPW and outcome regression corrections. Consistent if either the propensity model or the outcome model is correctly specified.

Usage

estimate_aipw(data, treatment, outcome, covariates,
                propensity_col = NULL,
                outcome_model = c("linear", "logistic"))

Arguments

data

A data frame containing treatment, outcome, and covariate columns.

treatment

Name of the binary treatment column (0/1).

outcome

Name of the outcome column.

covariates

Character vector of covariate column names.

propensity_col

Optional name of a pre-computed propensity column. If NULL, propensity is fit via logistic regression on covariates.

outcome_model

Family for the outcome model: "linear" or "logistic".

Returns

Named list: ate, se, ci_lower, ci_upper, n.

Estimate the Average Treatment Effect on the Controls (ATC)

Control units receive weight 1; treated units receive \(w_i = (1-\hat{e}(X_i))/\hat{e}(X_i)\).

Returns

Named list: ‘atc’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_control’.

Estimate the Average Treatment Effect (ATE) via Hajek IPW

The Hajek estimator uses stabilised IPW weights:

\[\widehat{ATE} = \bar{y}_1^{w} - \bar{y}_0^{w}\]

where \(\bar{y}_t^{w} = \sum_{T_i=t} w_i Y_i / \sum_{T_i=t} w_i\) and \(w_i = T_i/\hat{e}(X_i) + (1-T_i)/(1-\hat{e}(X_i))\).

Usage

estimate_ate(data, treatment, outcome, covariates, propensity_col)

Arguments

data

A data frame.

treatment

Name of the binary treatment column.

outcome

Name of the outcome column.

covariates

Character vector of covariate names.

propensity_col

Optional: name of a pre-computed propensity score column.

Returns

Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n’, ‘ess’.

Examples

set.seed(1)
df <- data.frame(
  t = rbinom(200, 1, 0.4),
  y = rnorm(200),
  x = rnorm(200)
)
estimate_ate(df, "t", "y", "x")
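The stabilised-weight construction above can also be sketched directly in base R (an illustration of the formula only, not the package implementation; the simulated data and models here are hypothetical):

```r
# Hajek IPW sketch: weighted outcome means within each treatment arm.
set.seed(10)
n <- 500
x <- rnorm(n)
t <- rbinom(n, 1, plogis(0.5 * x))               # confounded treatment
y <- x + t + rnorm(n)                            # true ATE = 1
e_hat <- fitted(glm(t ~ x, family = binomial))   # estimated propensity
w <- t / e_hat + (1 - t) / (1 - e_hat)           # IPW weights
ate <- weighted.mean(y[t == 1], w[t == 1]) -
       weighted.mean(y[t == 0], w[t == 0])       # close to 1
```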

Estimate the Average Treatment Effect on the Treated (ATT)

Treated units receive weight 1; controls receive \(w_i = \hat{e}(X_i)/(1-\hat{e}(X_i))\).

Returns

Named list: ‘att’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_treated’.

Examples

set.seed(2)
df <- data.frame(t = rbinom(200, 1, 0.4), y = rnorm(200), x = rnorm(200))
estimate_att(df, "t", "y", "x")

Estimate per-unit Conditional Average Treatment Effects via either a T-learner or an S-learner meta-learner.

The T-learner fits separate outcome models on treated and control units, then predicts the counterfactual for each unit: \(\widehat{CATE}_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\).

The S-learner fits one model with treatment as a feature and predicts \(\hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0)\) per unit.

Usage

estimate_cate(data, treatment, outcome, covariates,
              propensity_col = NULL,
              outcome_model = c("linear", "logistic"),
              meta_learner = c("t_learner", "s_learner"))

Arguments

data

A data frame containing the treatment, outcome, and covariates.

treatment

Name of the binary treatment column in data.

outcome

Name of the outcome column in data.

covariates

Character vector of covariate column names in data.

propensity_col

Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.

outcome_model

Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.

meta_learner

Meta-learner: "t_learner" (default) or "s_learner".

Returns

A data frame with one row per unit in data, containing per-unit CATE estimates and supporting columns.

Examples

\donttest{
df <- data.frame(
  y = rnorm(100),
  z = sample(0:1, 100, replace = TRUE),
  x1 = rnorm(100), x2 = rnorm(100)
)
estimate_cate(df, treatment = "z", outcome = "y",
              covariates = c("x1", "x2"))
}
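The T-learner logic can be sketched in a few lines of base R (illustration only; plain ‘lm()’ stands in for whichever outcome model is configured, and the simulated data are hypothetical):

```r
# T-learner sketch: separate linear models per arm, counterfactual contrast.
set.seed(11)
n <- 300
x <- rnorm(n)
z <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * x + 2 * z + rnorm(n)        # constant true CATE of 2
df <- data.frame(y, z, x)
m1 <- lm(y ~ x, data = df[df$z == 1, ])    # fitted on treated units
m0 <- lm(y ~ x, data = df[df$z == 0, ])    # fitted on control units
cate <- predict(m1, newdata = df) - predict(m0, newdata = df)
mean(cate)                                  # close to 2
```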

G-computation (outcome regression) ATE estimator

Estimates the ATE by:

\[\widehat{ATE} = \frac{1}{n}\sum_i \bigl[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\bigr]\]

Returns

Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’.
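A base-R sketch of the estimator above (illustration only, not the package implementation; data and model are hypothetical):

```r
# G-computation sketch: one outcome model, predictions under both arms.
set.seed(12)
n <- 400
x <- rnorm(n)
t <- rbinom(n, 1, plogis(x))
y <- x + 1.5 * t + rnorm(n)                       # true ATE = 1.5
df <- data.frame(y, t, x)
fit <- lm(y ~ t + x, data = df)
mu1 <- predict(fit, newdata = transform(df, t = 1))
mu0 <- predict(fit, newdata = transform(df, t = 0))
mean(mu1 - mu0)                                   # close to 1.5
```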

Estimate Group Average Treatment Effects by applying AIPW within each level of group_col to obtain stratum-specific treatment-effect estimates.

Usage

estimate_gate(data, treatment, outcome, covariates, group_col,
              propensity_col = NULL,
              outcome_model = c("linear", "logistic"))

Arguments

data

A data frame containing the treatment, outcome, covariates, and grouping column.

treatment

Name of the binary treatment column in data.

outcome

Name of the outcome column in data.

covariates

Character vector of covariate column names.

group_col

Name of the grouping variable (e.g. "gender", "region").

propensity_col

Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.

outcome_model

Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.

Returns

A data frame with one row per group level, containing the columns group, ate, se, ci_lower, ci_upper, n.

Examples

\donttest{
set.seed(3)
df <- data.frame(
  t = rbinom(300, 1, 0.4),
  y = rnorm(300),
  x = rnorm(300),
  g = sample(c("A", "B"), 300, replace = TRUE)
)
estimate_gate(df, treatment = "t", outcome = "y",
              covariates = "x", group_col = "g")
}

Estimate the Local Average Treatment Effect (LATE) via 2SLS / Wald

Uses a binary instrument \(Z\) to identify the LATE (Imbens & Angrist, 1994):

\[LATE = \frac{Cov(Y, Z)}{Cov(T, Z)}\]

With covariates, uses two-stage OLS (Wald within residuals). Uses ‘ivreg::ivreg()’ if available; otherwise falls back to the closed-form Wald estimator.

Usage

estimate_late(data, treatment, outcome, instrument, covariates)

Arguments

data

A data frame.

treatment

Name of the binary endogenous treatment column.

outcome

Name of the outcome column.

instrument

Name of the binary instrument column.

covariates

Optional character vector of exogenous covariates.

Returns

Named list: ‘late’, ‘se’, ‘ci_lower’, ‘ci_upper’.
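The closed-form Wald version can be sketched in base R (illustration only; the simulated data assume one-sided non-compliance):

```r
# Wald estimator sketch: LATE = Cov(Y, Z) / Cov(T, Z).
set.seed(13)
n <- 1000
z <- rbinom(n, 1, 0.5)                     # binary instrument
t <- as.integer(z == 1 & runif(n) < 0.8)   # ~80% take-up when encouraged
y <- 2 * t + rnorm(n)                      # true effect = 2
late <- cov(y, z) / cov(t, z)              # close to 2
```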

Estimate propensity scores via logistic regression

Usage

estimate_propensity_scores(data, treatment, covariates, trim)

Arguments

data

A data frame.

treatment

Name of the binary treatment column.

covariates

Character vector of covariate names.

trim

Quantile pair used to winsorize extreme scores (default 0.01, 0.99).

Returns

Numeric vector of propensity scores (same length as ‘nrow(data)’).

Examples

df <- data.frame(t = c(0,1,0,1,0,1), x = rnorm(6))
ps <- estimate_propensity_scores(df, "t", "x")

Effect sizes + tests

One-way ANOVA

Usage

anova_one_way(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘F’, ‘df_between’, ‘df_within’, ‘p_value’.

Examples

anova_one_way(rnorm(30, 0), rnorm(30, 0.5), rnorm(30, 1))

Chi-square test of independence or goodness-of-fit

Usage

chi_square_test(observed, expected)

Arguments

observed

Observed counts (matrix for independence, vector for GOF).

expected

Expected counts for GOF (optional; uniform if NULL).

Returns

Named list: ‘chi_sq’, ‘df’, ‘p_value’, ‘cramers_v’.

Cohen’s d effect size

Usage

cohens_d(x1, x2, pooled)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

pooled

Use pooled SD (default ‘TRUE’). If ‘FALSE’, uses ‘sd(x2)’.

Returns

Numeric Cohen’s d.

Cramer’s V for categorical association

Usage

cramers_v(contingency_table)

Arguments

contingency_table

A numeric matrix of observed counts.

Returns

Numeric Cramer’s V in [0, 1].
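A base-R sketch of the computation, using the standard formula \(V = \sqrt{\chi^2 / (n\,(\min(r, c) - 1))}\) (illustration only; the counts are hypothetical):

```r
# Cramer's V sketch from a 2x2 table of observed counts.
tab <- matrix(c(30, 10, 15, 45), nrow = 2)
chi <- chisq.test(tab, correct = FALSE)$statistic
v <- sqrt(as.numeric(chi) / (sum(tab) * (min(dim(tab)) - 1)))  # about 0.49
```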

Compute the E-value for unmeasured confounding (VanderWeele and Ding, 2017).

The E-value quantifies the minimum strength of confounding association needed to fully explain away an observed treatment effect. For a risk ratio less than 1, use its reciprocal before applying the formula.

Usage

e_value(rr, rr_lower)

Arguments

rr

Risk ratio estimate (> 0). Supply > 1; if < 1, pass its reciprocal.

rr_lower

Lower bound of the 95% CI (used to compute E-value for CI).

Returns

Named list with components e_value and e_value_ci.

Examples

e_value(rr = 3.9, rr_lower = 2.4)
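The point E-value follows the closed form \(E = RR + \sqrt{RR(RR - 1)}\) from VanderWeele and Ding (2017); a base-R sketch (illustration only, not the package implementation):

```r
# E-value sketch for a risk ratio above 1.
rr <- 3.9
e_val <- rr + sqrt(rr * (rr - 1))   # about 7.26
```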

Kish effective sample size

Usage

effective_sample_size(weights)

Arguments

weights

Numeric vector of sampling weights.

Returns

Numeric ESS.
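The Kish formula is \(ESS = (\sum_i w_i)^2 / \sum_i w_i^2\); a base-R sketch (illustration only, with hypothetical weights):

```r
# Kish effective sample size: unequal weights shrink the ESS below n.
w <- c(1, 1, 1, 4)           # one unit carries most of the weight
ess <- sum(w)^2 / sum(w^2)   # 49 / 19, about 2.58 for n = 4
```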

Eta-squared from F-statistic

Usage

eta_squared(f_stat, df_between, df_within)

Arguments

f_stat

F statistic.

df_between

Degrees of freedom (numerator).

df_within

Degrees of freedom (denominator).

Returns

Numeric eta-squared.
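The conversion is \(\eta^2 = F\,df_b / (F\,df_b + df_w)\); a base-R sketch (illustration only):

```r
# Eta-squared from the F statistic and its degrees of freedom.
eta_sq <- function(f, dfb, dfw) f * dfb / (f * dfb + dfw)
eta_sq(5.2, 2, 87)   # about 0.107
```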

Fisher’s exact test for 2x2 tables

Usage

fisher_exact_test(table_2x2, alternative)

Arguments

table_2x2

A 2x2 matrix or data frame of counts.

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘odds_ratio’, ‘ci’, ‘p_value’.

Hedges’ g (bias-corrected Cohen’s d)

Returns

Numeric Hedges’ g.

Kendall’s tau-b

Usage

kendall_tau(x, y)

Arguments

x

Numeric vector.

y

Numeric vector.

Returns

Named list: ‘tau’, ‘p_value’.

Kruskal-Wallis non-parametric ANOVA

Usage

kruskal_wallis_test(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘H’, ‘df’, ‘p_value’.

Levene test for equality of variances

Usage

levene_test(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘F’, ‘p_value’.

Mann-Whitney U test (Wilcoxon rank-sum)

Usage

mann_whitney_test(x1, x2, alternative)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘W’, ‘p_value’, ‘r’ (effect size).

Odds ratio and 95% CI from a 2x2 contingency table.

Usage

odds_ratio_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are treatment, columns are outcome.

alpha

Significance level.

Returns

Named list: ‘or’, ‘ci_lower’, ‘ci_upper’, ‘p_value’.

Compute omega-squared as a less-biased effect-size estimator than eta-squared for one-way ANOVA designs.

Usage

omega_squared(f_stat, df_between, df_within, n)

Arguments

f_stat

The F statistic from the one-way ANOVA.

df_between

Between-groups degrees of freedom.

df_within

Within-groups (residual) degrees of freedom.

n

Total sample size.

Returns

Numeric omega-squared.

Examples

omega_squared(f_stat = 5.2, df_between = 2, df_within = 87, n = 90)

One-sample t-test

Usage

one_sample_t_test(x, mu0, alternative)

Arguments

x

Numeric vector.

mu0

Null hypothesis mean (default 0).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci’.

Survey + sampling

Bootstrap resampling for any statistic

Usage

bootstrap_sample(df, statistic, n_bootstrap, seed)

Arguments

df

A data frame.

statistic

A function taking a data frame and returning a scalar.

n_bootstrap

Number of bootstrap replicates.

seed

Random seed.

Returns

Named list: ‘estimate’, ‘se’, ‘ci_lower’, ‘ci_upper’.

Examples

df <- data.frame(x = rnorm(100))
bootstrap_sample(df, statistic = function(d) mean(d$x))

Calibration weights via iterative proportional fitting (raking)

Adjusts initial design weights so that weighted marginal totals match known population totals for each auxiliary variable.

Usage

calibration_weights(df, aux_vars, population_totals, initial_weights, max_iter, tol)

Arguments

df

A data frame.

aux_vars

Character vector of categorical auxiliary variable names.

population_totals

Named list mapping ‘var_level’ keys to population counts.

initial_weights

Optional numeric vector of starting weights.

max_iter

Maximum IPF iterations.

tol

Convergence tolerance.

Returns

Numeric vector of calibrated weights.

Cluster sampling

Randomly selects ‘n_clusters’ clusters, then takes all units within selected clusters.

Usage

cluster_sample(df, cluster_col, n_clusters, seed)

Arguments

df

A data frame.

cluster_col

Name of the cluster identifier column.

n_clusters

Number of clusters to select.

seed

Random seed.

Returns

Data frame of selected units with ‘.weight’ column.

Compute inverse-probability design weights

Usage

compute_design_weights(df, strata_col, population_sizes)

Arguments

df

A data frame.

strata_col

Name of the stratification column.

population_sizes

Named integer vector: stratum level -> population size.

Returns

Numeric vector of design weights (same length as ‘nrow(df)’).

Design effect (DEFF)

Usage

design_effect(weights)

Arguments

weights

Numeric vector of sampling weights.

Returns

Numeric design effect (= n / ESS).
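Since DEFF = n / ESS, it can be sketched directly from the weights (illustration only, with the same hypothetical weights as the Kish ESS entry):

```r
# Design effect sketch: variance inflation from unequal weights.
w <- c(1, 1, 1, 4)
deff <- length(w) * sum(w^2) / sum(w)^2   # 76 / 49, about 1.55
```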

Generate synthetic epidemiology-style tabular data

Generates non-identifying synthetic data suitable for development, testing, and demos. The generator uses a canonical variable set and allows output column renaming through ‘name_map’ so it can be adapted to multiple studies. Synthetic data should not be used for final inferential reporting.

Usage

generate_synthetic_data(n, seed, special_code_rate, profile, name_map)

Arguments

n

Number of rows.

seed

Random seed for reproducibility.

special_code_rate

Proportion of values replaced with survey-style missing codes.

profile

Convenience profile for output naming; ignored when ‘name_map’ is supplied.

name_map

Optional named character vector mapping canonical keys to output column names.

Returns

A data.frame with synthetic records.

Delete-1 jackknife variance estimate

Usage

jackknife_estimate(df, statistic)

Arguments

df

A data frame.

statistic

A function taking a data frame and returning a scalar.

Returns

Named list: ‘estimate’, ‘se’, ‘bias’.

Datasets + I/O

Canonicalize raw CPADS PUMF columns

Usage

canonicalize_cpads_data(data)

Arguments

data

Raw CPADS data frame.

Returns

Data frame with canonical MORIE analysis columns.

Load the real CPADS CSV from this repository

Usage

load_cpads_data(cpads_csv)

Arguments

cpads_csv

Path to the CPADS CSV.

Returns

Canonicalized CPADS data frame.

Note

Documentation for R function morie_assistant_query() is pending. Run roxygen2 to generate the .Rd file.

Returns the path to morie.db that ships with the package. This database contains all CPADS, CCS, CSADS, CSUS, HealthInfobase, and CIHI datasets pre-loaded as SQLite tables.

Usage

morie_builtin_db()

Returns

File path string.

Reads a local file and writes it to the cache so that CI and Docker environments (which may lack the original files) can still run tests.

Usage

morie_cache_file(path, table_name, db_path = NULL)

Arguments

path

Path to a CSV or RDS file.

table_name

Name for the cached table.

db_path

Optional override for the database path.

Returns

Number of rows cached (invisible).

List all tables in the MORIE cache

Usage

morie_cache_list(db_path = NULL)

Arguments

db_path

Optional override for the database path.

Returns

A data.frame with columns table and rows.

Load a table from the MORIE cache

Usage

morie_cache_load(table_name, db_path = NULL)

Arguments

table_name

Name of the SQLite table.

db_path

Optional override for the database path.

Returns

A data.frame, or NULL if the table does not exist.

Writes (or replaces) a table in the shared SQLite cache.

Usage

morie_cache_store(data, table_name, db_path = NULL)

Arguments

data

A data.frame to cache.

table_name

Name of the SQLite table.

db_path

Optional override for the database path.

Returns

Number of rows written (invisible).

Returns a data.frame describing every dataset available through the MORIE data management system. Each row maps a short catalog key to its source, survey, year, file format, local path, SQLite table name, and CKAN resource ID (if available).

Usage

morie_dataset_catalog()

Returns

A data.frame with 36 rows (one per dataset) and columns: key, name, source, survey, year, format, type, large_file, local_path, table_name, ckan_resource_id.

Details

Keys match the Python DATASET_CATALOG in data.py exactly. Use ‘morie_load_dataset()’ to load by key.

Get metadata for a single dataset

Usage

morie_dataset_info(key)

Arguments

key

Dataset catalog key (or fuzzy match).

Returns

A named list with dataset metadata.

Opens (or creates) the shared cache at data/cache/morie.db. Both R (DBI/RSQLite) and Python (sqlite3) read/write this same file.

Usage

morie_db_connect(db_path = NULL)

Arguments

db_path

Path to the SQLite file. Defaults to MORIE_CACHE_DB env var or data/cache/morie.db.

Returns

A DBI connection object.

Downloads large bootstrap weight CSVs that are too big to ship with the package. Data is cached in the user cache database for future use.

Usage

morie_download_bootstrap(survey = "all", limit = 32000L, db_path = NULL)

Arguments

survey

One of "csads_2021", "csads_2023", "csus_2019", "csus_2023", or "all" (default).

limit

Max records per CKAN request (default 32000).

db_path

Optional override for cache database path.

Returns

Invisibly, the number of CSV files successfully downloaded.

Fetch data from the CKAN API and cache it

Usage

morie_fetch_ckan(dataset_key = "cpads", limit = 32000L, db_path = NULL)

Arguments

dataset_key

One of "cpads", "csads", "csus".

limit

Max records to fetch.

db_path

Optional override for the database path.

Returns

A data.frame.

List all datasets with cache status

Usage

morie_list_datasets(db_path = NULL)

Arguments

db_path

Optional override for the database path.

Returns

A data.frame with columns: key, name, source, survey, year, type, cached (logical), rows (integer or NA).

Resolution order:

1. Local RDS/CSV files in standard project locations
2. SQLite cache (data/cache/morie.db)
3. CKAN API fetch (requires internet)

Usage

morie_load_cpads(db_path = NULL, use_ckan = TRUE)

Arguments

db_path

Optional override for the database path.

use_ckan

Logical; if TRUE and data not found locally or in cache, attempt to fetch from the CKAN API.

Returns

A data.frame with canonical CPADS columns.

Resolution: SQLite cache -> local file ingest -> CKAN API -> error. Supports fuzzy matching: morie_load_dataset("cpads_2021") resolves to oc_cpads_2021.

Usage

morie_load_dataset(key, db_path = NULL)

Arguments

key

Dataset catalog key (or fuzzy match).

db_path

Optional override for the database path.

Returns

A data.frame.

Resolve standard project paths

Usage

morie_paths(project_root = NULL)

Arguments

project_root

Project root directory. If NULL, inferred from the current working directory.

Returns

Named list of key paths.

Lists or retrieves bundled userguide PDF files. These are the official PUMF codebooks and user guides from Health Canada / Statistics Canada.

Usage

morie_userguide(name = NULL)

Arguments

name

Filename (e.g., "20212022-cpads-pumf-user-guide.pdf"). If NULL, lists all available userguides.

Returns

File path string, or character vector of filenames.

Workflow + audit

Send a free-form question to the Perseus assistant via the bundled Python LLM bridge. Returns the assistant’s text response.

Usage

ask_percy(question, context = NULL,
            python_bin = Sys.getenv("MORIE_PYTHON_BIN", "python3"))

Arguments

question

Character string. The natural-language question.

context

Optional named list of context variables (data summaries, column descriptions, etc.) appended to the prompt.

python_bin

Path to the Python interpreter that has the morie Python package installed. Defaults to $MORIE_PYTHON_BIN or python3.

Returns

Character string containing the assistant’s response.

Audit declared outputs against files on disk

Usage

audit_public_outputs(project_root, manifest)

Arguments

project_root

Project root directory.

manifest

Manifest data frame. If ‘NULL’, loaded from disk.

Returns

Data frame containing declared and observed output status.

Build a MORIE assistant prompt

Usage

build_assistant_prompt(question, context)

Arguments

question

User question.

context

Optional context string.

Returns

Character scalar prompt.

Build an outputs manifest from a directory of artifacts

Usage

build_outputs_manifest(output_dir, manifest_path, public_prefix, extensions)

Arguments

output_dir

Directory containing output files.

manifest_path

CSV path to write.

public_prefix

Prefix used in ‘public_path’ values.

extensions

File extensions to include (without dots).

Returns

Manifest data frame.

Compose a question and optional context into the structured prompt format expected by ‘ask_percy()’.

Usage

build_prompt(question, context = NULL)

Arguments

question

Character string. The natural-language question.

context

Optional named list of context variables.

Returns

Character string. The composed prompt.

Return the canonical CPADS local-data contract

Returns

Named list describing the expected local CPADS contract.

Default synthetic-data variable name map

Returns a named character vector mapping canonical variable keys used by [generate_synthetic_data()] to output column names.

Usage

default_synthetic_name_map(profile)

Arguments

profile

Name profile. ‘“generic”’ is recommended for new projects.

Returns

Named character vector.

Default workflow step map

Returns the default named map of workflow steps to project script paths.

Returns

Named character vector.

Find a project root directory

Searches upward from ‘start’ for a directory containing the current Sphinx/package-root markers, while still tolerating legacy Quarto-era markers in older checkouts.

Usage

find_project_root(start, max_up)

Arguments

start

Starting directory.

max_up

Maximum number of parent traversals.

Returns

Absolute path to the detected project root.

List implemented MORIE CPADS modules

Usage

list_morie_modules()

Returns

Data frame describing the implemented module surface.

Other

Paired t-test

Usage

paired_t_test(x1, x2, alternative)

Arguments

x1

Numeric vector (before/condition 1).

x2

Numeric vector (after/condition 2).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘mean_diff’.

Point-biserial correlation

Usage

point_biserial_r(binary_var, continuous_var)

Arguments

binary_var

Binary numeric vector (0/1).

continuous_var

Continuous numeric vector.

Returns

Named list: ‘r’, ‘p_value’.

Power for a two-proportion z-test

Mirrors R’s ‘power.prop.test()’.

Usage

power_prop_test(n, p1, p2, sig_level, power, alternative)

Arguments

n

Sample size per group.

p1

Proportion in group 1.

p2

Proportion in group 2.

sig_level

Type I error rate.

power

Desired power.

alternative

‘“two.sided”’ or ‘“one.sided”’.

Returns

Result of ‘stats::power.prop.test()’.

Examples

power_prop_test(p1 = 0.30, p2 = 0.20, power = 0.80)

Power for a two-sample t-test

Solve for any missing parameter (‘n’, ‘delta’, ‘sd’, ‘sig.level’, or ‘power’). Mirrors R’s ‘power.t.test()’.

Usage

power_t_test(n, delta, sd, sig_level, power, alternative, type)

Arguments

n

Sample size per group (NULL to solve for it).

delta

Effect size (difference in means).

sd

Standard deviation (pooled).

sig_level

Type I error rate (alpha).

power

Desired power (1 - beta).

alternative

‘“two.sided”’ or ‘“one.sided”’.

type

‘“two.sample”’, ‘“one.sample”’, or ‘“paired”’.

Returns

Result of ‘stats::power.t.test()’.

Examples

power_t_test(n = NULL, delta = 0.5, power = 0.80)

Probability proportional to size (PPS) sampling

Usage

pps_sample(df, size_col, n, seed)

Arguments

df

A data frame.

size_col

Name of the size measure column.

n

Number of units to select.

seed

Random seed.

Returns

Data frame of selected units with ‘.weight’ (Hansen-Hurwitz weights).

Compute a confidence interval for a binomial proportion using the Wilson score method (default), exact Clopper-Pearson, or Wald method.

Usage

proportion_ci(successes, n, alpha = 0.05,
              method = c("wilson", "exact", "wald"))

Arguments

successes

Number of successes.

n

Total observations.

alpha

Significance level (default 0.05 for a 95% CI).

method

"wilson" (default), "exact" (Clopper-Pearson), or "wald".

Returns

Named list with components p_hat, ci_lower, ci_upper.

Examples

proportion_ci(35, 100)
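For reference, the Wilson score interval itself can be sketched in base R (illustration only, not the package implementation; ‘wilson_ci’ is a hypothetical helper name):

```r
# Wilson score interval for a binomial proportion.
wilson_ci <- function(k, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p <- k / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(ci_lower = centre - half, ci_upper = centre + half)
}
wilson_ci(35, 100)   # roughly (0.264, 0.447)
```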

Read outputs manifest from a project

Usage

read_outputs_manifest(project_root, manifest_path, validate)

Arguments

project_root

Project root path.

manifest_path

Optional explicit manifest path.

validate

If ‘TRUE’, validate schema.

Returns

Manifest data frame.

Risk difference (ARD) with Newcombe CI

Usage

risk_difference_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are exposure, columns are outcome.

alpha

Significance level.

Returns

Named list: ‘rd’, ‘ci_lower’, ‘ci_upper’.

Risk ratio (relative risk) with log-normal CI

Usage

risk_ratio_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are exposure, columns are outcome (disease = col 1).

alpha

Significance level.

Returns

Named list: ‘rr’, ‘ci_lower’, ‘ci_upper’.

Run the eBAC selection-adjusted IPW workflow

Mirrors the core outputs of the old ‘07_ebac_ipw.R’ workflow.

Usage

run_ebac_selection_ipw_analysis(data, output_dir, treatment, covariates)

Arguments

data

Analysis data frame.

output_dir

Optional directory for CSV outputs.

treatment

Treatment column name.

covariates

Covariate names used in the observation model.

Returns

Named list of output tables and the observed-domain analysis frame.

Run one implemented MORIE module against CPADS data

Usage

run_morie_module(
  module_name,
  cpads_csv = .cpads_default_csv(),
  output_dir = NULL
)

Arguments

module_name

Module name.

cpads_csv

Path to the CPADS CSV.

output_dir

Optional directory for CSV outputs.

Returns

Named list of data-frame outputs.

Run multiple implemented MORIE modules

Usage

run_morie_modules(
  modules = list_morie_modules()$name,
  cpads_csv = .cpads_default_csv(),
  output_dir = NULL
)

Arguments

modules

Character vector of module names.

cpads_csv

Path to the CPADS CSV.

output_dir

Optional directory for CSV outputs.

Returns

Named list of module outputs.

Run multiple workflow steps

Usage

run_pipeline(steps, project_root, script_map, stop_on_error, verbose)

Arguments

steps

Ordered vector of workflow step names.

project_root

Project root directory.

script_map

Named character vector mapping steps to script paths.

stop_on_error

If ‘TRUE’, stop at first failure.

verbose

If ‘TRUE’, streams command output.

Returns

Data frame of step statuses.

Run the CPADS propensity/IPW workflow

Mirrors the core outputs of the old ‘07_propensity.R’ workflow.

Usage

run_propensity_ipw_analysis(data, output_dir, trim, treatment, outcome, covariates)

Arguments

data

Analysis data frame.

output_dir

Optional directory for CSV outputs.

trim

Quantile pair used to trim extreme IPW values.

treatment

Binary treatment column.

outcome

Binary outcome column.

covariates

Covariate names for the propensity model.

Returns

Named list of output tables and the analysis data.

Run one project workflow step

Usage

run_workflow_step(step, project_root, script_map, rscript_bin, verbose)

Arguments

step

Step name present in ‘script_map’.

project_root

Project root directory.

script_map

Named character vector mapping steps to script paths.

rscript_bin

Optional path to ‘Rscript’ binary.

verbose

If ‘TRUE’, streams command output.

Returns

Named list with step metadata and exit status.

Sample size for logistic regression detecting a target odds ratio

Uses the formula from Hsieh et al. (1998):

\[n = \frac{(z_{\alpha/2} + z_\beta)^2}{p_1(1-p_1) [\log(OR)]^2}\]

Usage

sample_size_logistic(p0, or, alpha, power, two_sided)

Arguments

p0

Prevalence under control.

or

Target odds ratio.

alpha

Significance level.

power

Desired power.

two_sided

Logical.

Returns

Integer sample size.
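The Hsieh formula can be sketched directly in base R (illustration only; the parameter values are hypothetical):

```r
# Sample size sketch for detecting OR = 1.5 at baseline prevalence 0.3.
p <- 0.3; or <- 1.5; alpha <- 0.05; power <- 0.8
z <- qnorm(1 - alpha / 2) + qnorm(power)        # z_{alpha/2} + z_beta
n <- ceiling(z^2 / (p * (1 - p) * log(or)^2))   # 228
```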

Rosenbaum bounds sensitivity analysis

For a range of hidden-confounding levels \(\Gamma\), tests whether the treatment effect remains significant. A large \(\Gamma\) at which the result remains significant indicates robustness.

Uses Wilcoxon signed-rank statistic bounds for matched designs. For unmatched data, computes sign-score bounds.

Usage

sensitivity_rosenbaum(treated, control, gamma_range)

Arguments

treated

Numeric vector of outcomes for treated units.

control

Numeric vector of outcomes for control units.

gamma_range

Numeric vector of \(\Gamma\) values to test.

Returns

Data frame with columns: ‘gamma’, ‘p_lower’, ‘p_upper’.

Shapiro-Wilk normality test

Usage

shapiro_wilk_test(x, alpha)

Arguments

x

Numeric vector.

alpha

Significance level for the ‘is_normal’ flag (default 0.05).

Returns

Named list: ‘W’, ‘p_value’, ‘is_normal’.

Simple random sample from a data frame

Usage

simple_random_sample(df, n, replace, seed)

Arguments

df

A data frame.

n

Number of units to select.

replace

Sample with replacement? Default ‘FALSE’.

seed

Random seed for reproducibility.

Returns

A data frame of ‘n’ sampled rows with a ‘.weight’ column added.

Examples

df <- data.frame(x = 1:100)
srs_sample <- simple_random_sample(df, 20)

Spearman rank correlation

Usage

spearman_rho(x, y)

Arguments

x

Numeric vector.

y

Numeric vector.

Returns

Named list: ‘rho’, ‘p_value’.

Proportional or fixed stratified random sample

Usage

stratified_sample(df, strata_col, n_per_stratum, proportional, seed)

Arguments

df

A data frame.

strata_col

Name of the stratification column.

n_per_stratum

Either an integer (equal allocation) or a named integer vector of per-stratum sample sizes.

proportional

Logical; if ‘TRUE’, allocate proportionally to strata sizes.

seed

Random seed.

Returns

Data frame of sampled rows with a ‘.weight’ column.

Examples

df <- data.frame(g = c(rep("A", 60), rep("B", 40)), x = rnorm(100))
stratified_sample(df, "g", n_per_stratum = 10)

Summarize an output audit

Usage

summarize_output_audit(audit_tbl)

Arguments

audit_tbl

Result from [audit_public_outputs()].

Returns

Named list with high-level diagnostics.

Two-sample t-test with tidy output

Usage

two_sample_t_test(x1, x2, equal_var, alternative)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

equal_var

Assume equal variances? Default ‘FALSE’ (Welch test).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘cohens_d’.

Examples

two_sample_t_test(rnorm(50, 0.5), rnorm(50, 0))

Validate a CPADS analysis data frame

Usage

validate_cpads_data(data, strict)

Arguments

data

Data frame to validate.

strict

If ‘TRUE’, stop when required variables are missing.

Returns

Character vector of missing variable names.

Validate outputs manifest structure

Usage

validate_outputs_manifest(manifest, strict)

Arguments

manifest

Data frame to validate.

strict

If ‘TRUE’, stop on validation failures.

Returns

‘TRUE’ when validation passes.

Wilcoxon signed-rank test (paired)

Usage

wilcoxon_signed_rank_test(x1, x2, alternative)

Arguments

x1

Numeric vector (before).

x2

Numeric vector (after).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘V’, ‘p_value’.

Write synthetic epidemiology-style data to CSV

Usage

write_synthetic_data(path, n, seed, special_code_rate, profile, name_map, overwrite)

Arguments

path

Output CSV path.

n

Number of rows.

seed

Random seed.

special_code_rate

Proportion of survey-style missing codes.

profile

Naming profile for output columns.

name_map

Optional custom variable name map.

overwrite

If ‘TRUE’, overwrite existing file.

Returns

Normalized output path.