R API

Part of the MORIE API Reference.

Reference for every public function exported by the morie R package. Signatures and descriptions come from the Roxygen2 .Rd files in r-package/morie/man/; see Statistical Methods for the methodology behind each function.

Causal estimators

Augmented IPW (AIPW) doubly-robust ATE estimator.

Combines IPW and outcome regression corrections. Consistent if either the propensity model or the outcome model is correctly specified.

Usage

estimate_aipw(data, treatment, outcome, covariates,
                propensity_col = NULL,
                outcome_model = c("linear", "logistic"))

Arguments

data

A data frame containing treatment, outcome, and covariate columns.

treatment

Name of the binary treatment column (0/1).

outcome

Name of the outcome column.

covariates

Character vector of covariate column names.

propensity_col

Optional name of a pre-computed propensity column. If NULL, propensity is fit via logistic regression on covariates.

outcome_model

Family for the outcome model: "linear" or "logistic".

Returns

Named list: ate, se, ci_lower, ci_upper, n.

Estimate the Average Treatment Effect on the Controls (ATC)

Control units receive weight 1; treated units receive \(w_i = (1-\hat{e}(X_i))/\hat{e}(X_i)\).

Returns

Named list: ‘atc’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_control’.

Estimate the Average Treatment Effect (ATE) via Hajek IPW

The Hajek estimator uses stabilised IPW weights:

\[\widehat{ATE} = \bar{y}_1^{w} - \bar{y}_0^{w}\]

where \(\bar{y}_t^{w} = \sum_{T_i=t} w_i Y_i / \sum_{T_i=t} w_i\) and \(w_i = T_i/\hat{e}(X_i) + (1-T_i)/(1-\hat{e}(X_i))\).

Usage

estimate_ate(data, treatment, outcome, covariates, propensity_col)

Arguments

data

A data frame.

treatment

Name of the binary treatment column.

outcome

Name of the outcome column.

covariates

Character vector of covariate names.

propensity_col

Optional: name of a pre-computed propensity score column.

Returns

Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n’, ‘ess’.

Examples

set.seed(1)
df <- data.frame(
  t = rbinom(200, 1, 0.4),
  y = rnorm(200),
  x = rnorm(200)
)
estimate_ate(df, "t", "y", "x")
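The stabilised-weight construction above can also be sketched directly in base R (an illustration of the formula only, not the package implementation; the simulated data and models here are hypothetical):

```r
# Hajek IPW sketch: weighted outcome means within each treatment arm.
set.seed(10)
n <- 500
x <- rnorm(n)
t <- rbinom(n, 1, plogis(0.5 * x))               # confounded treatment
y <- x + t + rnorm(n)                            # true ATE = 1
e_hat <- fitted(glm(t ~ x, family = binomial))   # estimated propensity
w <- t / e_hat + (1 - t) / (1 - e_hat)           # IPW weights
ate <- weighted.mean(y[t == 1], w[t == 1]) -
       weighted.mean(y[t == 0], w[t == 0])       # close to 1
```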

Estimate the Average Treatment Effect on the Treated (ATT)

Treated units receive weight 1; controls receive \(w_i = \hat{e}(X_i)/(1-\hat{e}(X_i))\).

Returns

Named list: ‘att’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_treated’.

Examples

set.seed(2)
df <- data.frame(t = rbinom(200, 1, 0.4), y = rnorm(200), x = rnorm(200))
estimate_att(df, "t", "y", "x")

Estimate per-unit Conditional Average Treatment Effects via either a T-learner or an S-learner meta-learner.

The T-learner fits separate outcome models on treated and control units, then predicts the counterfactual for each unit: \(\widehat{CATE}_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\).

The S-learner fits one model with treatment as a feature and predicts \(\hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0)\) per unit.

Usage

estimate_cate(data, treatment, outcome, covariates,
              propensity_col = NULL,
              outcome_model = c("linear", "logistic"),
              meta_learner = c("t_learner", "s_learner"))

Arguments

data

A data frame containing the treatment, outcome, and covariates.

treatment

Name of the binary treatment column in data.

outcome

Name of the outcome column in data.

covariates

Character vector of covariate column names in data.

propensity_col

Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.

outcome_model

Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.

meta_learner

Meta-learner: "t_learner" (default) or "s_learner".

Returns

A data frame with one row per unit in data, containing per-unit CATE estimates and supporting columns.

Examples

\donttest{
df <- data.frame(
  y = rnorm(100),
  z = sample(0:1, 100, replace = TRUE),
  x1 = rnorm(100), x2 = rnorm(100)
)
estimate_cate(df, treatment = "z", outcome = "y",
              covariates = c("x1", "x2"))
}
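The T-learner logic can be sketched in a few lines of base R (illustration only; plain ‘lm()’ stands in for whichever outcome model is configured, and the simulated data are hypothetical):

```r
# T-learner sketch: separate linear models per arm, counterfactual contrast.
set.seed(11)
n <- 300
x <- rnorm(n)
z <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * x + 2 * z + rnorm(n)        # constant true CATE of 2
df <- data.frame(y, z, x)
m1 <- lm(y ~ x, data = df[df$z == 1, ])    # fitted on treated units
m0 <- lm(y ~ x, data = df[df$z == 0, ])    # fitted on control units
cate <- predict(m1, newdata = df) - predict(m0, newdata = df)
mean(cate)                                  # close to 2
```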

G-computation (outcome regression) ATE estimator

Estimates the ATE by:

\[\widehat{ATE} = \frac{1}{n}\sum_i \bigl[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\bigr]\]

Returns

Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’.
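A base-R sketch of the estimator above (illustration only, not the package implementation; data and model are hypothetical):

```r
# G-computation sketch: one outcome model, predictions under both arms.
set.seed(12)
n <- 400
x <- rnorm(n)
t <- rbinom(n, 1, plogis(x))
y <- x + 1.5 * t + rnorm(n)                       # true ATE = 1.5
df <- data.frame(y, t, x)
fit <- lm(y ~ t + x, data = df)
mu1 <- predict(fit, newdata = transform(df, t = 1))
mu0 <- predict(fit, newdata = transform(df, t = 0))
mean(mu1 - mu0)                                   # close to 1.5
```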

Estimate Group Average Treatment Effects by applying AIPW within each level of group_col to obtain stratum-specific treatment-effect estimates.

Usage

estimate_gate(data, treatment, outcome, covariates, group_col,
              propensity_col = NULL,
              outcome_model = c("linear", "logistic"))

Arguments

data

A data frame containing the treatment, outcome, covariates, and grouping column.

treatment

Name of the binary treatment column in data.

outcome

Name of the outcome column in data.

covariates

Character vector of covariate column names.

group_col

Name of the grouping variable (e.g. "gender", "region").

propensity_col

Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.

outcome_model

Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.

Returns

A data frame with one row per group level, containing the columns group, ate, se, ci_lower, ci_upper, n.

Examples

\donttest{
set.seed(3)
df <- data.frame(
  t = rbinom(300, 1, 0.4),
  y = rnorm(300),
  x = rnorm(300),
  g = sample(c("A", "B"), 300, replace = TRUE)
)
estimate_gate(df, treatment = "t", outcome = "y",
              covariates = "x", group_col = "g")
}

Estimate the Local Average Treatment Effect (LATE) via 2SLS / Wald

Uses a binary instrument \(Z\) to identify the LATE (Imbens & Angrist, 1994):

\[LATE = \frac{Cov(Y, Z)}{Cov(T, Z)}\]

With covariates, uses two-stage OLS (Wald within residuals). Uses ‘ivreg::ivreg()’ if available; otherwise falls back to the closed-form Wald estimator.

Usage

estimate_late(data, treatment, outcome, instrument, covariates)

Arguments

data

A data frame.

treatment

Name of the binary endogenous treatment column.

outcome

Name of the outcome column.

instrument

Name of the binary instrument column.

covariates

Optional character vector of exogenous covariates.

Returns

Named list: ‘late’, ‘se’, ‘ci_lower’, ‘ci_upper’.
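The closed-form Wald version can be sketched in base R (illustration only; the simulated data assume one-sided non-compliance):

```r
# Wald estimator sketch: LATE = Cov(Y, Z) / Cov(T, Z).
set.seed(13)
n <- 1000
z <- rbinom(n, 1, 0.5)                     # binary instrument
t <- as.integer(z == 1 & runif(n) < 0.8)   # ~80% take-up when encouraged
y <- 2 * t + rnorm(n)                      # true effect = 2
late <- cov(y, z) / cov(t, z)              # close to 2
```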

Estimate propensity scores via logistic regression

Usage

estimate_propensity_scores(data, treatment, covariates, trim)

Arguments

data

A data frame.

treatment

Name of the binary treatment column.

covariates

Character vector of covariate names.

trim

Quantile pair used to winsorize extreme scores (default 0.01, 0.99).

Returns

Numeric vector of propensity scores (same length as ‘nrow(data)’).

Examples

df <- data.frame(t = c(0,1,0,1,0,1), x = rnorm(6))
ps <- estimate_propensity_scores(df, "t", "x")

Effect sizes + tests

One-way ANOVA

Usage

anova_one_way(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘F’, ‘df_between’, ‘df_within’, ‘p_value’.

Examples

anova_one_way(rnorm(30, 0), rnorm(30, 0.5), rnorm(30, 1))

Chi-square test of independence or goodness-of-fit

Usage

chi_square_test(observed, expected)

Arguments

observed

Observed counts (matrix for independence, vector for GOF).

expected

Expected counts for GOF (optional; uniform if NULL).

Returns

Named list: ‘chi_sq’, ‘df’, ‘p_value’, ‘cramers_v’.

Cohen’s d effect size

Usage

cohens_d(x1, x2, pooled)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

pooled

Use pooled SD (default ‘TRUE’). If ‘FALSE’, uses ‘sd(x2)’.

Returns

Numeric Cohen’s d.

Cramer’s V for categorical association

Usage

cramers_v(contingency_table)

Arguments

contingency_table

A numeric matrix of observed counts.

Returns

Numeric Cramer’s V in [0, 1].
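A base-R sketch of the computation, using the standard formula \(V = \sqrt{\chi^2 / (n\,(\min(r, c) - 1))}\) (illustration only; the counts are hypothetical):

```r
# Cramer's V sketch from a 2x2 table of observed counts.
tab <- matrix(c(30, 10, 15, 45), nrow = 2)
chi <- chisq.test(tab, correct = FALSE)$statistic
v <- sqrt(as.numeric(chi) / (sum(tab) * (min(dim(tab)) - 1)))  # about 0.49
```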

Compute the E-value for unmeasured confounding (VanderWeele and Ding, 2017).

The E-value quantifies the minimum strength of confounding association needed to fully explain away an observed treatment effect. For a risk ratio less than 1, use its reciprocal before applying the formula.

Usage

e_value(rr, rr_lower)

Arguments

rr

Risk ratio estimate (> 0). Supply > 1; if < 1, pass its reciprocal.

rr_lower

Lower bound of the 95% CI (used to compute E-value for CI).

Returns

Named list with components e_value and e_value_ci.

Examples

e_value(rr = 3.9, rr_lower = 2.4)
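The point E-value follows the closed form \(E = RR + \sqrt{RR(RR - 1)}\) from VanderWeele and Ding (2017); a base-R sketch (illustration only, not the package implementation):

```r
# E-value sketch for a risk ratio above 1.
rr <- 3.9
e_val <- rr + sqrt(rr * (rr - 1))   # about 7.26
```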

Kish effective sample size

Usage

effective_sample_size(weights)

Arguments

weights

Numeric vector of sampling weights.

Returns

Numeric ESS.
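The Kish formula is \(ESS = (\sum_i w_i)^2 / \sum_i w_i^2\); a base-R sketch (illustration only, with hypothetical weights):

```r
# Kish effective sample size: unequal weights shrink the ESS below n.
w <- c(1, 1, 1, 4)           # one unit carries most of the weight
ess <- sum(w)^2 / sum(w^2)   # 49 / 19, about 2.58 for n = 4
```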

Eta-squared from F-statistic

Usage

eta_squared(f_stat, df_between, df_within)

Arguments

f_stat

F statistic.

df_between

Degrees of freedom (numerator).

df_within

Degrees of freedom (denominator).

Returns

Numeric eta-squared.
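The conversion is \(\eta^2 = F\,df_b / (F\,df_b + df_w)\); a base-R sketch (illustration only):

```r
# Eta-squared from the F statistic and its degrees of freedom.
eta_sq <- function(f, dfb, dfw) f * dfb / (f * dfb + dfw)
eta_sq(5.2, 2, 87)   # about 0.107
```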

Fisher’s exact test for 2x2 tables

Usage

fisher_exact_test(table_2x2, alternative)

Arguments

table_2x2

A 2x2 matrix or data frame of counts.

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘odds_ratio’, ‘ci’, ‘p_value’.

Hedges’ g (bias-corrected Cohen’s d)

Returns

Numeric Hedges’ g.

Kendall’s tau-b

Usage

kendall_tau(x, y)

Arguments

x

Numeric vector.

y

Numeric vector.

Returns

Named list: ‘tau’, ‘p_value’.

Kruskal-Wallis non-parametric ANOVA

Usage

kruskal_wallis_test(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘H’, ‘df’, ‘p_value’.

Levene test for equality of variances

Usage

levene_test(...)

Arguments

...

Numeric vectors, one per group.

Returns

Named list: ‘F’, ‘p_value’.

Mann-Whitney U test (Wilcoxon rank-sum)

Usage

mann_whitney_test(x1, x2, alternative)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘W’, ‘p_value’, ‘r’ (effect size).

Odds ratio and 95% CI from a 2x2 contingency table.

Usage

odds_ratio_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are treatment, columns are outcome.

alpha

Significance level.

Returns

Named list: ‘or’, ‘ci_lower’, ‘ci_upper’, ‘p_value’.

Compute omega-squared as a less-biased effect-size estimator than eta-squared for one-way ANOVA designs.

Usage

omega_squared(f_stat, df_between, df_within, n)

Arguments

f_stat

The F statistic from the one-way ANOVA.

df_between

Between-groups degrees of freedom.

df_within

Within-groups (residual) degrees of freedom.

n

Total sample size.

Returns

Numeric omega-squared.

Examples

omega_squared(f_stat = 5.2, df_between = 2, df_within = 87, n = 90)

One-sample t-test

Usage

one_sample_t_test(x, mu0, alternative)

Arguments

x

Numeric vector.

mu0

Null hypothesis mean (default 0).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci’.

Survey + sampling

Bootstrap resampling for any statistic

Usage

bootstrap_sample(df, statistic, n_bootstrap, seed)

Arguments

df

A data frame.

statistic

A function taking a data frame and returning a scalar.

n_bootstrap

Number of bootstrap replicates.

seed

Random seed.

Returns

Named list: ‘estimate’, ‘se’, ‘ci_lower’, ‘ci_upper’.

Examples

df <- data.frame(x = rnorm(100))
bootstrap_sample(df, statistic = function(d) mean(d$x))

Calibration weights via iterative proportional fitting (raking)

Adjusts initial design weights so that weighted marginal totals match known population totals for each auxiliary variable.

Usage

calibration_weights(df, aux_vars, population_totals, initial_weights, max_iter, tol)

Arguments

df

A data frame.

aux_vars

Character vector of categorical auxiliary variable names.

population_totals

Named list mapping ‘var_level’ keys to population counts.

initial_weights

Optional numeric vector of starting weights.

max_iter

Maximum IPF iterations.

tol

Convergence tolerance.

Returns

Numeric vector of calibrated weights.

Cluster sampling

Randomly selects ‘n_clusters’ clusters, then takes all units within selected clusters.

Usage

cluster_sample(df, cluster_col, n_clusters, seed)

Arguments

df

A data frame.

cluster_col

Name of the cluster identifier column.

n_clusters

Number of clusters to select.

seed

Random seed.

Returns

Data frame of selected units with ‘.weight’ column.

Compute inverse-probability design weights

Usage

compute_design_weights(df, strata_col, population_sizes)

Arguments

df

A data frame.

strata_col

Name of the stratification column.

population_sizes

Named integer vector: stratum level -> population size.

Returns

Numeric vector of design weights (same length as ‘nrow(df)’).

Design effect (DEFF)

Usage

design_effect(weights)

Arguments

weights

Numeric vector of sampling weights.

Returns

Numeric design effect (= n / ESS).
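Since DEFF = n / ESS, it can be sketched directly from the weights (illustration only, with the same hypothetical weights as the Kish ESS entry):

```r
# Design effect sketch: variance inflation from unequal weights.
w <- c(1, 1, 1, 4)
deff <- length(w) * sum(w^2) / sum(w)^2   # 76 / 49, about 1.55
```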

Generate synthetic epidemiology-style tabular data

Generates non-identifying synthetic data suitable for development, testing, and demos. The generator uses a canonical variable set and allows output column renaming through ‘name_map’ so it can be adapted to multiple studies. Synthetic data should not be used for final inferential reporting.

Usage

generate_synthetic_data(n, seed, special_code_rate, profile, name_map)

Arguments

n

Number of rows.

seed

Random seed for reproducibility.

special_code_rate

Proportion of values replaced with survey-style missing codes.

profile

Convenience profile for output naming; ignored when ‘name_map’ is supplied.

name_map

Optional named character vector mapping canonical keys to output column names.

Returns

A data.frame with synthetic records.

Delete-1 jackknife variance estimate

Usage

jackknife_estimate(df, statistic)

Arguments

df

A data frame.

statistic

A function taking a data frame and returning a scalar.

Returns

Named list: ‘estimate’, ‘se’, ‘bias’.

Datasets + I/O

Canonicalize raw CPADS PUMF columns

Usage

canonicalize_cpads_data(data)

Arguments

data

Raw CPADS data frame.

Returns

Data frame with canonical MORIE analysis columns.

Load the real CPADS CSV from this repository

Usage

load_cpads_data(cpads_csv)

Arguments

cpads_csv

Path to the CPADS CSV.

Returns

Canonicalized CPADS data frame.

Note

Documentation for R function morie_assistant_query() is pending. Run roxygen2 to generate the .Rd file.

Returns the path to morie.db that ships with the package. This database contains all CPADS, CCS, CSADS, CSUS, HealthInfobase, and CIHI datasets pre-loaded as SQLite tables.

Usage

morie_builtin_db()

Returns

File path string.

Reads a local file and writes it to the cache so that CI and Docker environments (which may lack the original files) can still run tests.

Usage

morie_cache_file(path, table_name, db_path = NULL)

Arguments

path

Path to a CSV or RDS file.

table_name

Name for the cached table.

db_path

Optional override for the database path.

Returns

Number of rows cached (invisible).

List all tables in the MORIE cache

Usage

morie_cache_list(db_path = NULL)

Arguments

db_path

Optional override for the database path.

Returns

A data.frame with columns table and rows.

Load a table from the MORIE cache

Usage

morie_cache_load(table_name, db_path = NULL)

Arguments

table_name

Name of the SQLite table.

db_path

Optional override for the database path.

Returns

A data.frame, or NULL if the table does not exist.

Writes (or replaces) a table in the shared SQLite cache.

Usage

morie_cache_store(data, table_name, db_path = NULL)

Arguments

data

A data.frame to cache.

table_name

Name of the SQLite table.

db_path

Optional override for the database path.

Returns

Number of rows written (invisible).

Returns a data.frame describing every dataset available through the MORIE data management system. Each row maps a short catalog key to its source, survey, year, file format, local path, SQLite table name, and CKAN resource ID (if available).

Usage

morie_dataset_catalog()

Returns

A data.frame with 36 rows (one per dataset) and columns: key, name, source, survey, year, format, type, large_file, local_path, table_name, ckan_resource_id.

Details

Keys match the Python DATASET_CATALOG in data.py exactly. Use ‘morie_load_dataset()’ to load by key.

Get metadata for a single dataset

Usage

morie_dataset_info(key)

Arguments

key

Dataset catalog key (or fuzzy match).

Returns

A named list with dataset metadata.

Opens (or creates) the shared cache at data/cache/morie.db. Both R (DBI/RSQLite) and Python (sqlite3) read/write this same file.

Usage

morie_db_connect(db_path = NULL)

Arguments

db_path

Path to the SQLite file. Defaults to MORIE_CACHE_DB env var or data/cache/morie.db.

Returns

A DBI connection object.

Downloads large bootstrap weight CSVs that are too big to ship with the package. Data is cached in the user cache database for future use.

Usage

morie_download_bootstrap(survey = "all", limit = 32000L, db_path = NULL)

Arguments

survey

One of "csads_2021", "csads_2023", "csus_2019", "csus_2023", or "all" (default).

limit

Max records per CKAN request (default 32000).

db_path

Optional override for cache database path.

Returns

Invisibly, the number of CSV files successfully downloaded.

Fetch data from the CKAN API and cache it

Usage

morie_fetch_ckan(dataset_key = "cpads", limit = 32000L, db_path = NULL)

Arguments

dataset_key

One of "cpads", "csads", "csus".

limit

Max records to fetch.

db_path

Optional override for the database path.

Returns

A data.frame.

List all datasets with cache status

Usage

morie_list_datasets(db_path = NULL)

Arguments

db_path

Optional override for the database path.

Returns

A data.frame with columns: key, name, source, survey, year, type, cached (logical), rows (integer or NA).

Resolution order:

1. Local RDS/CSV files in standard project locations
2. SQLite cache (data/cache/morie.db)
3. CKAN API fetch (requires internet)

Usage

morie_load_cpads(db_path = NULL, use_ckan = TRUE)

Arguments

db_path

Optional override for the database path.

use_ckan

Logical; if TRUE and data not found locally or in cache, attempt to fetch from the CKAN API.

Returns

A data.frame with canonical CPADS columns.

Resolution: SQLite cache -> local file ingest -> CKAN API -> error. Supports fuzzy matching: morie_load_dataset("cpads_2021") resolves to oc_cpads_2021.

Usage

morie_load_dataset(key, db_path = NULL)

Arguments

key

Dataset catalog key (or fuzzy match).

db_path

Optional override for the database path.

Returns

A data.frame.

Resolve standard project paths

Usage

morie_paths(project_root = NULL)

Arguments

project_root

Project root directory. If NULL, inferred from the current working directory.

Returns

Named list of key paths.

Lists or retrieves bundled userguide PDF files. These are the official PUMF codebooks and user guides from Health Canada / Statistics Canada.

Usage

morie_userguide(name = NULL)

Arguments

name

Filename (e.g., "20212022-cpads-pumf-user-guide.pdf"). If NULL, lists all available userguides.

Returns

File path string, or character vector of filenames.

Workflow + audit

Send a free-form question to the Perseus assistant via the bundled Python LLM bridge. Returns the assistant’s text response.

Usage

ask_percy(question, context = NULL,
            python_bin = Sys.getenv("MORIE_PYTHON_BIN", "python3"))

Arguments

question

Character string. The natural-language question.

context

Optional named list of context variables (data summaries, column descriptions, etc.) appended to the prompt.

python_bin

Path to the Python interpreter that has the morie Python package installed. Defaults to $MORIE_PYTHON_BIN or python3.

Returns

Character string containing the assistant’s response.

Audit declared outputs against files on disk

Usage

audit_public_outputs(project_root, manifest)

Arguments

project_root

Project root directory.

manifest

Manifest data frame. If ‘NULL’, loaded from disk.

Returns

Data frame containing declared and observed output status.

Build a MORIE assistant prompt

Usage

build_assistant_prompt(question, context)

Arguments

question

User question.

context

Optional context string.

Returns

Character scalar prompt.

Build an outputs manifest from a directory of artifacts

Usage

build_outputs_manifest(output_dir, manifest_path, public_prefix, extensions)

Arguments

output_dir

Directory containing output files.

manifest_path

CSV path to write.

public_prefix

Prefix used in ‘public_path’ values.

extensions

File extensions to include (without dots).

Returns

Manifest data frame.

Compose a question and optional context into the structured prompt format expected by ‘ask_percy()’.

Usage

build_prompt(question, context = NULL)

Arguments

question

Character string. The natural-language question.

context

Optional named list of context variables.

Returns

Character string. The composed prompt.

Return the canonical CPADS local-data contract

Returns

Named list describing the expected local CPADS contract.

Default synthetic-data variable name map

Returns a named character vector mapping canonical variable keys used by [generate_synthetic_data()] to output column names.

Usage

default_synthetic_name_map(profile)

Arguments

profile

Name profile. ‘“generic”’ is recommended for new projects.

Returns

Named character vector.

Default workflow step map

Returns the default named map of workflow steps to project script paths.

Returns

Named character vector.

Find a project root directory

Searches upward from ‘start’ for a directory containing the current Sphinx/package-root markers, while still tolerating legacy Quarto-era markers in older checkouts.

Usage

find_project_root(start, max_up)

Arguments

start

Starting directory.

max_up

Maximum number of parent traversals.

Returns

Absolute path to the detected project root.

List implemented MORIE CPADS modules

Usage

list_morie_modules()

Returns

Data frame describing the implemented module surface.

Other

Paired t-test

Usage

paired_t_test(x1, x2, alternative)

Arguments

x1

Numeric vector (before/condition 1).

x2

Numeric vector (after/condition 2).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘mean_diff’.

Point-biserial correlation

Usage

point_biserial_r(binary_var, continuous_var)

Arguments

binary_var

Binary numeric vector (0/1).

continuous_var

Continuous numeric vector.

Returns

Named list: ‘r’, ‘p_value’.

Power for a two-proportion z-test

Mirrors R’s ‘power.prop.test()’.

Usage

power_prop_test(n, p1, p2, sig_level, power, alternative)

Arguments

n

Sample size per group.

p1

Proportion in group 1.

p2

Proportion in group 2.

sig_level

Type I error rate.

power

Desired power.

alternative

‘“two.sided”’ or ‘“one.sided”’.

Returns

Result of ‘stats::power.prop.test()’.

Examples

power_prop_test(p1 = 0.30, p2 = 0.20, power = 0.80)

Power for a two-sample t-test

Solve for any missing parameter (‘n’, ‘delta’, ‘sd’, ‘sig.level’, or ‘power’). Mirrors R’s ‘power.t.test()’.

Usage

power_t_test(n, delta, sd, sig_level, power, alternative, type)

Arguments

n

Sample size per group (NULL to solve for it).

delta

Effect size (difference in means).

sd

Standard deviation (pooled).

sig_level

Type I error rate (alpha).

power

Desired power (1 - beta).

alternative

‘“two.sided”’ or ‘“one.sided”’.

type

‘“two.sample”’, ‘“one.sample”’, or ‘“paired”’.

Returns

Result of ‘stats::power.t.test()’.

Examples

power_t_test(n = NULL, delta = 0.5, power = 0.80)

Probability proportional to size (PPS) sampling

Usage

pps_sample(df, size_col, n, seed)

Arguments

df

A data frame.

size_col

Name of the size measure column.

n

Number of units to select.

seed

Random seed.

Returns

Data frame of selected units with ‘.weight’ (Hansen-Hurwitz weights).

Compute a confidence interval for a binomial proportion using the Wilson score method (default), exact Clopper-Pearson, or Wald method.

Usage

proportion_ci(successes, n, alpha = 0.05,
              method = c("wilson", "exact", "wald"))

Arguments

successes

Number of successes.

n

Total observations.

alpha

Significance level (default 0.05 for a 95% CI).

method

"wilson" (default), "exact" (Clopper-Pearson), or "wald".

Returns

Named list with components p_hat, ci_lower, ci_upper.

Examples

proportion_ci(35, 100)
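For reference, the Wilson score interval itself can be sketched in base R (illustration only, not the package implementation; ‘wilson_ci’ is a hypothetical helper name):

```r
# Wilson score interval for a binomial proportion.
wilson_ci <- function(k, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p <- k / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(ci_lower = centre - half, ci_upper = centre + half)
}
wilson_ci(35, 100)   # roughly (0.264, 0.447)
```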

Read outputs manifest from a project

Usage

read_outputs_manifest(project_root, manifest_path, validate)

Arguments

project_root

Project root path.

manifest_path

Optional explicit manifest path.

validate

If ‘TRUE’, validate schema.

Returns

Manifest data frame.

Risk difference (ARD) with Newcombe CI

Usage

risk_difference_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are exposure, columns are outcome.

alpha

Significance level.

Returns

Named list: ‘rd’, ‘ci_lower’, ‘ci_upper’.

Risk ratio (relative risk) with log-normal CI

Usage

risk_ratio_ci(table_2x2, alpha)

Arguments

table_2x2

A 2x2 matrix: rows are exposure, columns are outcome (disease = col 1).

alpha

Significance level.

Returns

Named list: ‘rr’, ‘ci_lower’, ‘ci_upper’.

Run the eBAC selection-adjusted IPW workflow

Mirrors the core outputs of the old ‘07_ebac_ipw.R’ workflow.

Usage

run_ebac_selection_ipw_analysis(data, output_dir, treatment, covariates)

Arguments

data

Analysis data frame.

output_dir

Optional directory for CSV outputs.

treatment

Treatment column name.

covariates

Covariate names used in the observation model.

Returns

Named list of output tables and the observed-domain analysis frame.

Run one implemented MORIE module against CPADS data

Usage

run_morie_module(
  module_name,
  cpads_csv = .cpads_default_csv(),
  output_dir = NULL
)

Arguments

module_name

Module name.

cpads_csv

Path to the CPADS CSV.

output_dir

Optional directory for CSV outputs.

Returns

Named list of data-frame outputs.

Run multiple implemented MORIE modules

Usage

run_morie_modules(
  modules = list_morie_modules()$name,
  cpads_csv = .cpads_default_csv(),
  output_dir = NULL
)

Arguments

modules

Character vector of module names.

cpads_csv

Path to the CPADS CSV.

output_dir

Optional directory for CSV outputs.

Returns

Named list of module outputs.

Run multiple workflow steps

Usage

run_pipeline(steps, project_root, script_map, stop_on_error, verbose)

Arguments

steps

Ordered vector of workflow step names.

project_root

Project root directory.

script_map

Named character vector mapping steps to script paths.

stop_on_error

If ‘TRUE’, stop at first failure.

verbose

If ‘TRUE’, streams command output.

Returns

Data frame of step statuses.

Run the CPADS propensity/IPW workflow

Mirrors the core outputs of the old ‘07_propensity.R’ workflow.

Usage

run_propensity_ipw_analysis(data, output_dir, trim, treatment, outcome, covariates)

Arguments

data

Analysis data frame.

output_dir

Optional directory for CSV outputs.

trim

Quantile pair used to trim extreme IPW values.

treatment

Binary treatment column.

outcome

Binary outcome column.

covariates

Covariate names for the propensity model.

Returns

Named list of output tables and the analysis data.

Run one project workflow step

Usage

run_workflow_step(step, project_root, script_map, rscript_bin, verbose)

Arguments

step

Step name present in ‘script_map’.

project_root

Project root directory.

script_map

Named character vector mapping steps to script paths.

rscript_bin

Optional path to ‘Rscript’ binary.

verbose

If ‘TRUE’, streams command output.

Returns

Named list with step metadata and exit status.

Sample size for logistic regression detecting a target odds ratio

Uses the formula from Hsieh et al. (1998):

\[n = \frac{(z_{\alpha/2} + z_\beta)^2}{p_1(1-p_1) [\log(OR)]^2}\]

Usage

sample_size_logistic(p0, or, alpha, power, two_sided)

Arguments

p0

Prevalence under control.

or

Target odds ratio.

alpha

Significance level.

power

Desired power.

two_sided

Logical.

Returns

Integer sample size.
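The Hsieh formula can be sketched directly in base R (illustration only; the parameter values are hypothetical):

```r
# Sample size sketch for detecting OR = 1.5 at baseline prevalence 0.3.
p <- 0.3; or <- 1.5; alpha <- 0.05; power <- 0.8
z <- qnorm(1 - alpha / 2) + qnorm(power)        # z_{alpha/2} + z_beta
n <- ceiling(z^2 / (p * (1 - p) * log(or)^2))   # 228
```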

Rosenbaum bounds sensitivity analysis

For a range of hidden-confounding levels \(\Gamma\), tests whether the treatment effect remains significant. A large \(\Gamma\) at which the result remains significant indicates robustness.

Uses Wilcoxon signed-rank statistic bounds for matched designs. For unmatched data, computes sign-score bounds.

Usage

sensitivity_rosenbaum(treated, control, gamma_range)

Arguments

treated

Numeric vector of outcomes for treated units.

control

Numeric vector of outcomes for control units.

gamma_range

Numeric vector of \(\Gamma\) values to test.

Returns

Data frame with columns: ‘gamma’, ‘p_lower’, ‘p_upper’.

Shapiro-Wilk normality test

Usage

shapiro_wilk_test(x, alpha)

Arguments

x

Numeric vector.

alpha

Significance level for the ‘is_normal’ flag (default 0.05).

Returns

Named list: ‘W’, ‘p_value’, ‘is_normal’.

Simple random sample from a data frame

Usage

simple_random_sample(df, n, replace, seed)

Arguments

df

A data frame.

n

Number of units to select.

replace

Sample with replacement? Default ‘FALSE’.

seed

Random seed for reproducibility.

Returns

A data frame of ‘n’ sampled rows with a ‘.weight’ column added.

Examples

df <- data.frame(x = 1:100)
srs_sample <- simple_random_sample(df, 20)

Spearman rank correlation

Usage

spearman_rho(x, y)

Arguments

x

Numeric vector.

y

Numeric vector.

Returns

Named list: ‘rho’, ‘p_value’.

Proportional or fixed stratified random sample

Usage

stratified_sample(df, strata_col, n_per_stratum, proportional, seed)

Arguments

df

A data frame.

strata_col

Name of the stratification column.

n_per_stratum

Either an integer (equal allocation) or a named integer vector of per-stratum sample sizes.

proportional

Logical; if ‘TRUE’, allocate proportionally to strata sizes.

seed

Random seed.

Returns

Data frame of sampled rows with a ‘.weight’ column.

Examples

df <- data.frame(g = c(rep("A", 60), rep("B", 40)), x = rnorm(100))
stratified_sample(df, "g", n_per_stratum = 10)

Summarize an output audit

Usage

summarize_output_audit(audit_tbl)

Arguments

audit_tbl

Result from [audit_public_outputs()].

Returns

Named list with high-level diagnostics.

Two-sample t-test with tidy output

Usage

two_sample_t_test(x1, x2, equal_var, alternative)

Arguments

x1

Numeric vector (group 1).

x2

Numeric vector (group 2).

equal_var

Assume equal variances? Default ‘FALSE’ (Welch test).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘cohens_d’.

Examples

two_sample_t_test(rnorm(50, 0.5), rnorm(50, 0))

Validate a CPADS analysis data frame

Usage

validate_cpads_data(data, strict)

Arguments

data

Data frame to validate.

strict

If ‘TRUE’, stop when required variables are missing.

Returns

Character vector of missing variable names.

Validate outputs manifest structure

Usage

validate_outputs_manifest(manifest, strict)

Arguments

manifest

Data frame to validate.

strict

If ‘TRUE’, stop on validation failures.

Returns

‘TRUE’ when validation passes.

Wilcoxon signed-rank test (paired)

Usage

wilcoxon_signed_rank_test(x1, x2, alternative)

Arguments

x1

Numeric vector (before).

x2

Numeric vector (after).

alternative

‘“two.sided”’, ‘“greater”’, or ‘“less”’.

Returns

Named list: ‘V’, ‘p_value’.

Write synthetic epidemiology-style data to CSV

Usage

write_synthetic_data(path, n, seed, special_code_rate, profile, name_map, overwrite)

Arguments

path

Output CSV path.

n

Number of rows.

seed

Random seed.

special_code_rate

Proportion of survey-style missing codes.

profile

Naming profile for output columns.

name_map

Optional custom variable name map.

overwrite

If ‘TRUE’, overwrite existing file.

Returns

Normalized output path.