R API
Reference for every public function exported by the
morie R package. Signatures and descriptions come from the
Roxygen2 .Rd files in r-package/morie/man/; see
Statistical Methods for the methodology behind each function.
Causal estimators
Augmented IPW (AIPW) doubly-robust ATE estimator.
Combines IPW and outcome-regression corrections. Consistent if either the propensity model or the outcome model is correctly specified.
Usage
estimate_aipw(data, treatment, outcome, covariates,
propensity_col = NULL,
outcome_model = c("linear", "logistic"))
Arguments
data: A data frame containing treatment, outcome, and covariate columns.
treatment: Name of the binary treatment column (0/1).
outcome: Name of the outcome column.
covariates: Character vector of covariate column names.
propensity_col: Optional name of a pre-computed propensity column. If NULL, propensity is fit via logistic regression on covariates.
outcome_model: Family for the outcome model: "linear" or "logistic".
Returns
Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n’.
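The doubly-robust score can be sketched in a few lines of base R. This is an illustrative sketch of the standard AIPW construction on simulated data (true ATE = 2), not the package's internal code:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
t <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + 2 * t + x + rnorm(n)                 # true ATE = 2

# nuisance models: propensity e(x), outcome regressions mu1(x) and mu0(x)
e  <- fitted(glm(t ~ x, family = binomial()))
m1 <- predict(lm(y ~ x, subset = t == 1), newdata = data.frame(x = x))
m0 <- predict(lm(y ~ x, subset = t == 0), newdata = data.frame(x = x))

# AIPW score: outcome-model prediction plus an IPW residual correction
psi <- t * (y - m1) / e + m1 - (1 - t) * (y - m0) / (1 - e) - m0
ate <- mean(psi)
se  <- sd(psi) / sqrt(n)
```

Both nuisance models are correctly specified here, so the estimate is consistent; the doubly-robust property means consistency would survive misspecification of either one alone.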
Estimate the Average Treatment Effect on the Controls (ATC)
Control units receive weight 1; treated units receive \(w_i = (1-\hat{e}(X_i))/\hat{e}(X_i)\).
Returns
Named list: ‘atc’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_control’.
Estimate the Average Treatment Effect (ATE) via Hajek IPW
The Hajek estimator uses stabilised IPW weights:
\(\widehat{ATE} = \bar{y}_1^{w} - \bar{y}_0^{w}\),
where \(\bar{y}_t^{w} = \sum_{T_i=t} w_i Y_i / \sum_{T_i=t} w_i\) and \(w_i = T_i/\hat{e}(X_i) + (1-T_i)/(1-\hat{e}(X_i))\).
Usage
estimate_ate(data, treatment, outcome, covariates, propensity_col)
Arguments
data: A data frame.
treatment: Name of the binary treatment column.
outcome: Name of the outcome column.
covariates: Character vector of covariate names.
propensity_col: Optional name of a pre-computed propensity-score column.
Returns
Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n’, ‘ess’.
Examples
set.seed(1)
df <- data.frame(
t = rbinom(200, 1, 0.4),
y = rnorm(200),
x = rnorm(200)
)
estimate_ate(df, "t", "y", "x")
Estimate the Average Treatment Effect on the Treated (ATT)
Treated units receive weight 1; controls receive \(w_i = \hat{e}(X_i)/(1-\hat{e}(X_i))\).
Returns
Named list: ‘att’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n_treated’.
Examples
set.seed(2)
df <- data.frame(t = rbinom(200, 1, 0.4), y = rnorm(200), x = rnorm(200))
estimate_att(df, "t", "y", "x")
Estimate per-unit Conditional Average Treatment Effects via either a T-learner or an S-learner meta-learner.
The T-learner fits separate outcome models on treated and control units, then predicts the counterfactual for each unit: \(\widehat{CATE}_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\).
The S-learner fits one model with treatment as a feature and predicts \(\hat{\mu}(X_i, 1) - \hat{\mu}(X_i, 0)\) per unit.
Usage
estimate_cate(data, treatment, outcome, covariates,
propensity_col = NULL,
outcome_model = c("linear", "logistic"),
meta_learner = c("t_learner", "s_learner"))
Arguments
data: A data frame containing the treatment, outcome, and covariates.
treatment: Name of the binary treatment column in data.
outcome: Name of the outcome column in data.
covariates: Character vector of covariate column names in data.
propensity_col: Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.
outcome_model: Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.
meta_learner: Meta-learner: "t_learner" (default) or "s_learner".
Returns
A data frame with one row per unit in data, containing per-unit CATE estimates and supporting columns.
Examples
df <- data.frame(
y = rnorm(100),
z = sample(0:1, 100, replace = TRUE),
x1 = rnorm(100), x2 = rnorm(100)
)
estimate_cate(df, treatment = "z", outcome = "y",
covariates = c("x1", "x2"))
G-computation (outcome regression) ATE estimator
Estimates the ATE by fitting an outcome regression, predicting each unit's potential outcomes under treatment and under control, and averaging the difference in predictions.
Returns
Named list: ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’.
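The g-computation recipe can be illustrated in base R (a sketch of the standard steps, not the package's internal code; simulated data with a true ATE of 2):

```r
set.seed(10)
n <- 500
x <- rnorm(n)
t <- rbinom(n, 1, 0.5)
y <- 1 + 2 * t + x + rnorm(n)                     # true ATE = 2

fit <- lm(y ~ t + x)                              # outcome regression
mu1 <- predict(fit, newdata = data.frame(t = 1, x = x))  # everyone treated
mu0 <- predict(fit, newdata = data.frame(t = 0, x = x))  # everyone control
ate <- mean(mu1 - mu0)                            # average predicted contrast
```

With a linear outcome model this average contrast equals the fitted coefficient on t; the recipe generalises to non-linear outcome models where no single coefficient summarises the effect.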
Estimate Group Average Treatment Effects by applying AIPW within each
level of group_col to obtain stratum-specific treatment-effect
estimates.
Usage
estimate_gate(data, treatment, outcome, covariates, group_col,
propensity_col = NULL,
outcome_model = c("linear", "logistic"))
Arguments
data: A data frame containing the treatment, outcome, covariates, and grouping column.
treatment: Name of the binary treatment column in data.
outcome: Name of the outcome column in data.
covariates: Character vector of covariate column names.
group_col: Name of the grouping variable (e.g. "gender", "region").
propensity_col: Optional name of a pre-computed propensity-score column. If NULL, propensities are estimated internally.
outcome_model: Outcome-model family: "linear" (default) for continuous outcomes or "logistic" for binary outcomes.
Returns
A data frame with one row per group level, containing the columns ‘group’, ‘ate’, ‘se’, ‘ci_lower’, ‘ci_upper’, ‘n’.
Examples
set.seed(3)
df <- data.frame(
t = rbinom(300, 1, 0.4),
y = rnorm(300),
x = rnorm(300),
g = sample(c("A", "B"), 300, replace = TRUE)
)
estimate_gate(df, treatment = "t", outcome = "y",
covariates = "x", group_col = "g")
Estimate the Local Average Treatment Effect (LATE) via 2SLS / Wald
Uses a binary instrument \(Z\) to identify the LATE (Imbens & Angrist, 1994):
\(\widehat{LATE} = (E[Y \mid Z=1] - E[Y \mid Z=0]) / (E[T \mid Z=1] - E[T \mid Z=0])\).
With covariates, uses two-stage OLS (Wald within residuals). Uses ‘ivreg::ivreg()’ when available; otherwise falls back to the closed-form Wald estimator.
Usage
estimate_late(data, treatment, outcome, instrument, covariates)
Arguments
data: A data frame.
treatment: Name of the binary endogenous treatment column.
outcome: Name of the outcome column.
instrument: Name of the binary instrument column.
covariates: Optional character vector of exogenous covariates.
Returns
Named list: ‘late’, ‘se’, ‘ci_lower’, ‘ci_upper’.
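Without covariates, the closed-form Wald fallback has a direct base-R analogue (illustrative sketch on simulated data with imperfect compliance and a constant true effect of 2):

```r
set.seed(4)
n <- 1000
z <- rbinom(n, 1, 0.5)
# instrument shifts treatment probability from 0.2 to 0.8
t <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))
y <- 2 * t + rnorm(n)

# Wald ratio: ITT effect on the outcome over ITT effect on treatment uptake
late <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(t[z == 1]) - mean(t[z == 0]))
```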
Estimate propensity scores via logistic regression
Usage
estimate_propensity_scores(data, treatment, covariates, trim)
Arguments
data: A data frame.
treatment: Name of the binary treatment column.
covariates: Character vector of covariate names.
trim: Quantile pair used to winsorize extreme scores (default 0.01, 0.99).
Returns
Numeric vector of propensity scores (same length as ‘nrow(data)’).
Examples
df <- data.frame(t = c(0,1,0,1,0,1), x = rnorm(6))
ps <- estimate_propensity_scores(df, "t", "x")
Effect sizes + tests
One-way ANOVA
Usage
anova_one_way(...)
Arguments
...: Numeric vectors, one per group.
Returns
Named list: ‘F’, ‘df_between’, ‘df_within’, ‘p_value’.
Examples
anova_one_way(rnorm(30, 0), rnorm(30, 0.5), rnorm(30, 1))
Chi-square test of independence or goodness-of-fit
Usage
chi_square_test(observed, expected)
Arguments
observed: Observed counts (matrix for independence, vector for GOF).
expected: Expected counts for GOF (optional; uniform if NULL).
Returns
Named list: ‘chi_sq’, ‘df’, ‘p_value’, ‘cramers_v’.
Cohen’s d effect size
Usage
cohens_d(x1, x2, pooled)
Arguments
x1: Numeric vector (group 1).
x2: Numeric vector (group 2).
pooled: Use pooled SD (default ‘TRUE’). If ‘FALSE’, uses ‘sd(x2)’.
Returns
Numeric Cohen’s d.
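The pooled-SD definition can be computed by hand (illustrative sketch of the standard formula under ‘pooled = TRUE’ semantics; the function's internals may differ):

```r
x1 <- c(2, 4, 6, 8)
x2 <- c(1, 3, 5, 7)
n1 <- length(x1); n2 <- length(x2)

# pooled standard deviation across the two groups
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
d  <- (mean(x1) - mean(x2)) / sp   # standardised mean difference
```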
Cramer’s V for categorical association
Usage
cramers_v(contingency_table)
Arguments
contingency_table: A numeric matrix of observed counts.
Returns
Numeric Cramer’s V in [0, 1].
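Cramér's V rescales chi-square by table size, \(V = \sqrt{\chi^2 / (n\,(k-1))}\) with \(k = \min(r, c)\). An illustrative base-R check using ‘stats::chisq.test()’ without continuity correction:

```r
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
chi <- unname(chisq.test(tab, correct = FALSE)$statistic)
v   <- sqrt(chi / (sum(tab) * (min(dim(tab)) - 1)))
```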
Compute the E-value for unmeasured confounding (VanderWeele and Ding, 2017).
The E-value quantifies the minimum strength of confounding association needed to fully explain away an observed treatment effect. For a risk ratio less than 1, use its reciprocal before applying the formula.
Usage
e_value(rr, rr_lower)
Arguments
rr: Risk ratio estimate (> 0). Supply > 1; if < 1, pass its reciprocal.
rr_lower: Lower bound of the 95% CI (used to compute the E-value for the CI).
Returns
Named list with components ‘e_value’ and ‘e_value_ci’.
Examples
e_value(rr = 3.9, rr_lower = 2.4)
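The VanderWeele-Ding point formula, \(E = RR + \sqrt{RR(RR-1)}\), can be checked directly in base R (an illustrative sketch with the same inputs as the example above):

```r
rr    <- 3.9
e_val <- rr + sqrt(rr * (rr - 1))        # E-value for the point estimate
rr_lo <- 2.4
e_ci  <- rr_lo + sqrt(rr_lo * (rr_lo - 1))  # E-value for the CI bound
```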
Kish effective sample size
Usage
effective_sample_size(weights)
Arguments
weights: Numeric vector of sampling weights.
Returns
Numeric ESS.
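Kish's formula, \(ESS = (\sum w_i)^2 / \sum w_i^2\), is a one-liner (illustrative sketch):

```r
w   <- c(1, 1, 2, 2)
ess <- sum(w)^2 / sum(w^2)   # 36 / 10 = 3.6 effective observations out of 4
```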
Eta-squared from F-statistic
Usage
eta_squared(f_stat, df_between, df_within)
Arguments
f_stat: F statistic.
df_between: Degrees of freedom (numerator).
df_within: Degrees of freedom (denominator).
Returns
Numeric eta-squared.
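Eta-squared follows directly from the F statistic, \(\eta^2 = F\,df_b / (F\,df_b + df_w)\) (illustrative base-R check):

```r
f_stat <- 5.2; df_between <- 2; df_within <- 87
eta_sq <- (f_stat * df_between) / (f_stat * df_between + df_within)
```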
Fisher’s exact test for 2x2 tables
Usage
fisher_exact_test(table_2x2, alternative)
Arguments
table_2x2: A 2x2 matrix or data frame of counts.
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘odds_ratio’, ‘ci’, ‘p_value’.
Hedges’ g (bias-corrected Cohen’s d)
Returns
Numeric Hedges’ g.
Kendall’s tau-b
Usage
kendall_tau(x, y)
Arguments
x: Numeric vector.
y: Numeric vector.
Returns
Named list: ‘tau’, ‘p_value’.
Kruskal-Wallis non-parametric ANOVA
Usage
kruskal_wallis_test(...)
Arguments
...: Numeric vectors, one per group.
Returns
Named list: ‘H’, ‘df’, ‘p_value’.
Levene test for equality of variances
Usage
levene_test(...)
Arguments
...: Numeric vectors, one per group.
Returns
Named list: ‘F’, ‘p_value’.
Mann-Whitney U test (Wilcoxon rank-sum)
Usage
mann_whitney_test(x1, x2, alternative)
Arguments
x1: Numeric vector (group 1).
x2: Numeric vector (group 2).
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘W’, ‘p_value’, ‘r’ (effect size).
Odds ratio and 95% CI from a 2x2 contingency table.
Usage
odds_ratio_ci(table_2x2, alpha)
Arguments
table_2x2: A 2x2 matrix: rows are treatment, columns are outcome.
alpha: Significance level.
Returns
Named list: ‘or’, ‘ci_lower’, ‘ci_upper’, ‘p_value’.
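The OR and its log-scale Wald CI can be computed by hand from the 2x2 counts (an illustrative sketch of the standard formulas; the function's internals may differ):

```r
tab <- matrix(c(20, 80,
                10, 90), nrow = 2, byrow = TRUE)  # rows: treated / control
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # ad / bc
se_log <- sqrt(sum(1 / tab))                             # Wald SE on the log scale
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log)
```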
Compute omega-squared as a less-biased effect-size estimator than eta-squared for one-way ANOVA designs.
Usage
omega_squared(f_stat, df_between, df_within, n)
Arguments
f_stat: The F statistic from the one-way ANOVA.
df_between: Between-groups degrees of freedom.
df_within: Within-groups (residual) degrees of freedom.
n: Total sample size.
Returns
Numeric omega-squared.
Examples
omega_squared(f_stat = 5.2, df_between = 2, df_within = 87, n = 90)
One-sample t-test
Usage
one_sample_t_test(x, mu0, alternative)
Arguments
x: Numeric vector.
mu0: Null hypothesis mean (default 0).
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘t’, ‘df’, ‘p_value’, ‘ci’.
Survey + sampling
Bootstrap resampling for any statistic
Usage
bootstrap_sample(df, statistic, n_bootstrap, seed)
Arguments
df: A data frame.
statistic: A function taking a data frame and returning a scalar.
n_bootstrap: Number of bootstrap replicates.
seed: Random seed.
Returns
Named list: ‘estimate’, ‘se’, ‘ci_lower’, ‘ci_upper’.
Examples
df <- data.frame(x = rnorm(100))
bootstrap_sample(df, statistic = function(d) mean(d$x))
Calibration weights via iterative proportional fitting (raking)
Adjusts initial design weights so that weighted marginal totals match known population totals for each auxiliary variable.
Usage
calibration_weights(df, aux_vars, population_totals, initial_weights, max_iter, tol)
Arguments
df: A data frame.
aux_vars: Character vector of categorical auxiliary variable names.
population_totals: Named list: ‘“var_level”’ -> population count.
initial_weights: Optional numeric vector of starting weights.
max_iter: Maximum IPF iterations.
tol: Convergence tolerance.
Returns
Numeric vector of calibrated weights.
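The IPF loop is short enough to sketch in base R. This is a toy illustration of raking with two categorical margins and known population totals, not the package's implementation:

```r
df  <- data.frame(sex    = c("m", "m", "f", "f", "f"),
                  region = c("e", "w", "e", "w", "w"))
pop <- list(sex = c(m = 60, f = 40), region = c(e = 50, w = 50))
w   <- rep(1, nrow(df))

for (iter in 1:100) {
  for (v in names(pop)) {
    totals <- tapply(w, df[[v]], sum)           # current weighted margin
    ratio  <- pop[[v]][names(totals)] / totals  # scale factor per level
    w <- w * ratio[as.character(df[[v]])]       # rescale within each level
  }
}
tapply(w, df$sex, sum)   # matches the known population margin
```

Each pass rescales the weights so one variable's weighted margin matches its population total; alternating over variables converges to weights satisfying all margins at once.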
Two-stage cluster sampling
Randomly selects ‘n_clusters’ clusters, then takes all units within selected clusters.
Usage
cluster_sample(df, cluster_col, n_clusters, seed)
Arguments
df: A data frame.
cluster_col: Name of the cluster identifier column.
n_clusters: Number of clusters to select.
seed: Random seed.
Returns
Data frame of selected units with ‘.weight’ column.
Compute inverse-probability design weights
Usage
compute_design_weights(df, strata_col, population_sizes)
Arguments
df: A data frame.
strata_col: Name of the stratification column.
population_sizes: Named integer vector: stratum level -> population size.
Returns
Numeric vector of design weights (same length as ‘nrow(df)’).
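For stratified designs the weight is the inverse inclusion probability, \(w_h = N_h / n_h\) (illustrative base-R sketch, not the package code):

```r
df <- data.frame(stratum = c("A", "A", "A", "B", "B"))
N  <- c(A = 300, B = 100)      # known population sizes per stratum
n  <- table(df$stratum)        # realised sample sizes: A = 3, B = 2

# per-stratum weight, expanded to one value per row
w <- (N[names(n)] / as.numeric(n))[df$stratum]
```

The weights sum to the population size (here 400), which is a useful sanity check after computing design weights.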
Design effect (DEFF)
Usage
design_effect(weights)
Arguments
weights: Numeric vector of sampling weights.
Returns
Numeric design effect (= n / ESS).
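Equivalently, \(DEFF = n \sum w_i^2 / (\sum w_i)^2\) (illustrative check):

```r
w    <- c(1, 1, 2, 2)
deff <- length(w) * sum(w^2) / sum(w)^2   # 4 * 10 / 36
```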
Generate synthetic epidemiology-style tabular data
Generates non-identifying synthetic data suitable for development, testing, and demos. The generator uses a canonical variable set and allows output column renaming through ‘name_map’ so it can be adapted to multiple studies. Synthetic data should not be used for final inferential reporting.
Usage
generate_synthetic_data(n, seed, special_code_rate, profile, name_map)
Arguments
n: Number of rows.
seed: Random seed for reproducibility.
special_code_rate: Proportion of values replaced with survey-style missing codes.
profile: Convenience profile for output naming; ignored when ‘name_map’ is supplied.
name_map: Optional named character vector mapping canonical keys to output column names.
Returns
A data.frame with synthetic records.
Delete-1 jackknife variance estimate
Usage
jackknife_estimate(df, statistic)
Arguments
df: A data frame.
statistic: A function taking a data frame and returning a scalar.
Returns
Named list: ‘estimate’, ‘se’, ‘bias’.
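The delete-1 scheme can be sketched in base R for the sample mean (illustrative sketch; for the mean the jackknife SE reduces exactly to \(s/\sqrt{n}\), and the bias estimate is zero):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
loo  <- sapply(seq_len(n), function(i) mean(x[-i]))  # leave-one-out estimates
est  <- mean(x)
se   <- sqrt((n - 1) / n * sum((loo - mean(loo))^2)) # jackknife SE
bias <- (n - 1) * (mean(loo) - est)                  # jackknife bias estimate
```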
Datasets + I/O
Canonicalize raw CPADS PUMF columns
Usage
canonicalize_cpads_data(data)
Arguments
data: Raw CPADS data frame.
Returns
Data frame with canonical MORIE analysis columns.
Load the real CPADS CSV from this repository
Usage
load_cpads_data(cpads_csv)
Arguments
cpads_csv: Path to the CPADS CSV.
Returns
Canonicalized CPADS data frame.
Note
Documentation for R function morie_assistant_query() is pending. Run roxygen2 to generate the .Rd file.
Returns the path to morie.db that ships with the package.
This database contains all CPADS, CCS, CSADS, CSUS, HealthInfobase, and
CIHI datasets pre-loaded as SQLite tables.
Usage
morie_builtin_db()
Returns
File path string.
Reads a local file and writes it to the cache so that CI and Docker environments (which may lack the original files) can still run tests.
Usage
morie_cache_file(path, table_name, db_path = NULL)
Arguments
path: Path to a CSV or RDS file.
table_name: Name for the cached table.
db_path: Optional override for the database path.
Returns
Number of rows cached (invisible).
List all tables in the MORIE cache
Usage
morie_cache_list(db_path = NULL)
Arguments
db_path: Optional override for the database path.
Returns
A data.frame with columns ‘table’ and ‘rows’.
Load a table from the MORIE cache
Usage
morie_cache_load(table_name, db_path = NULL)
Arguments
table_name: Name of the SQLite table.
db_path: Optional override for the database path.
Returns
A data.frame, or NULL if the table does not exist.
Writes (or replaces) a table in the shared SQLite cache.
Usage
morie_cache_store(data, table_name, db_path = NULL)
Arguments
data: A data.frame to cache.
table_name: Name of the SQLite table.
db_path: Optional override for the database path.
Returns
Number of rows written (invisible).
Returns a data.frame describing every dataset available through the MORIE data management system. Each row maps a short catalog key to its source, survey, year, file format, local path, SQLite table name, and CKAN resource ID (if available).
Usage
morie_dataset_catalog()
Returns
A data.frame with 36 rows (one per dataset) and columns: key, name, source, survey, year, format, type, large_file, local_path, table_name, ckan_resource_id.
Details
Keys match the Python DATASET_CATALOG in data.py exactly.
Use ‘morie_load_dataset()’ to load by key.
Get metadata for a single dataset
Usage
morie_dataset_info(key)
Arguments
key: Dataset catalog key (or fuzzy match).
Returns
A named list with dataset metadata.
Opens (or creates) the shared cache at data/cache/morie.db.
Both R (DBI/RSQLite) and Python (sqlite3) read/write this same file.
Usage
morie_db_connect(db_path = NULL)
Arguments
db_path: Path to the SQLite file. Defaults to the MORIE_CACHE_DB env var or data/cache/morie.db.
Returns
A DBI connection object.
Downloads large bootstrap weight CSVs that are too big to ship with the package. Data is cached in the user cache database for future use.
Usage
morie_download_bootstrap(survey = "all", limit = 32000L, db_path = NULL)
Arguments
survey: One of "csads_2021", "csads_2023", "csus_2019", "csus_2023", or "all" (default).
limit: Max records per CKAN request (default 32000).
db_path: Optional override for the cache database path.
Returns
Invisibly, the number of CSV files successfully downloaded.
Fetch data from the CKAN API and cache it
Usage
morie_fetch_ckan(dataset_key = "cpads", limit = 32000L, db_path = NULL)
Arguments
dataset_key: One of "cpads", "csads", "csus".
limit: Max records to fetch.
db_path: Optional override for the database path.
Returns
A data.frame.
List all datasets with cache status
Usage
morie_list_datasets(db_path = NULL)
Arguments
db_path: Optional override for the database path.
Returns
A data.frame with columns: key, name, source, survey, year, type, cached (logical), rows (integer or NA).
Resolution order:
1. Local RDS/CSV files in standard project locations
2. SQLite cache (data/cache/morie.db)
3. CKAN API fetch (requires internet)
Usage
morie_load_cpads(db_path = NULL, use_ckan = TRUE)
Arguments
db_path: Optional override for the database path.
use_ckan: Logical; if TRUE and data not found locally or in cache, attempt to fetch from the CKAN API.
Returns
A data.frame with canonical CPADS columns.
Resolution: SQLite cache -> local file ingest -> CKAN API -> error.
Supports fuzzy matching: morie_load_dataset("cpads_2021") resolves
to oc_cpads_2021.
Usage
morie_load_dataset(key, db_path = NULL)
Arguments
key: Dataset catalog key (or fuzzy match).
db_path: Optional override for the database path.
Returns
A data.frame.
Resolve standard project paths
Usage
morie_paths(project_root = NULL)
Arguments
project_root: Project root directory. If NULL, inferred from the current working directory.
Returns
Named list of key paths.
Lists or retrieves bundled userguide PDF files. These are the official PUMF codebooks and user guides from Health Canada / Statistics Canada.
Usage
morie_userguide(name = NULL)
Arguments
name: Filename (e.g., "20212022-cpads-pumf-user-guide.pdf"). If NULL, lists all available userguides.
Returns
File path string, or character vector of filenames.
Workflow + audit
Send a free-form question to the Perseus assistant via the bundled Python LLM bridge. Returns the assistant’s text response.
Usage
ask_percy(question, context = NULL,
python_bin = Sys.getenv("MORIE_PYTHON_BIN", "python3"))
Arguments
question: Character string. The natural-language question.
context: Optional named list of context variables (data summaries, column descriptions, etc.) appended to the prompt.
python_bin: Path to the Python interpreter that has the morie Python package installed. Defaults to $MORIE_PYTHON_BIN or "python3".
Returns
Character string containing the assistant’s response.
Audit declared outputs against files on disk
Usage
audit_public_outputs(project_root, manifest)
Arguments
project_root: Project root directory.
manifest: Manifest data frame. If ‘NULL’, loaded from disk.
Returns
Data frame containing declared and observed output status.
Build a MORIE assistant prompt
Usage
build_assistant_prompt(question, context)
Arguments
question: User question.
context: Optional context string.
Returns
Character scalar prompt.
Build an outputs manifest from a directory of artifacts
Usage
build_outputs_manifest(output_dir, manifest_path, public_prefix, extensions)
Arguments
output_dir: Directory containing output files.
manifest_path: CSV path to write.
public_prefix: Prefix used in ‘public_path’ values.
extensions: File extensions to include (without dots).
Returns
Manifest data frame.
Compose a question and optional context into the structured prompt format expected by ‘ask_percy()’.
Usage
build_prompt(question, context = NULL)
Arguments
question: Character string. The natural-language question.
context: Optional named list of context variables.
Returns
Character string. The composed prompt.
Return the canonical CPADS local-data contract
Returns
Named list describing the expected local CPADS contract.
Default synthetic-data variable name map
Returns a named character vector mapping canonical variable keys used by ‘generate_synthetic_data()’ to output column names.
Usage
default_synthetic_name_map(profile)
Arguments
profile: Name profile. ‘“generic”’ is recommended for new projects.
Returns
Named character vector.
Default workflow step map
Returns the default named map of workflow steps to project script paths.
Returns
Named character vector.
Find a project root directory
Searches upward from ‘start’ for a directory containing the current Sphinx/package-root markers, while still tolerating legacy Quarto-era markers in older checkouts.
Usage
find_project_root(start, max_up)
Arguments
start: Starting directory.
max_up: Maximum number of parent traversals.
Returns
Absolute path to the detected project root.
List implemented MORIE CPADS modules
Usage
list_morie_modules()
Returns
Data frame describing the implemented module surface.
Other
Paired t-test
Usage
paired_t_test(x1, x2, alternative)
Arguments
x1: Numeric vector (before/condition 1).
x2: Numeric vector (after/condition 2).
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘mean_diff’.
Point-biserial correlation
Usage
point_biserial_r(binary_var, continuous_var)
Arguments
binary_var: Binary numeric vector (0/1).
continuous_var: Continuous numeric vector.
Returns
Named list: ‘r’, ‘p_value’.
Power for a two-proportion z-test
Mirrors R’s ‘power.prop.test()’.
Usage
power_prop_test(n, p1, p2, sig_level, power, alternative)
Arguments
n: Sample size per group.
p1: Proportion in group 1.
p2: Proportion in group 2.
sig_level: Type I error rate.
power: Desired power.
alternative: ‘“two.sided”’ or ‘“one.sided”’.
Returns
Result of ‘stats::power.prop.test()’.
Examples
power_prop_test(p1 = 0.30, p2 = 0.20, power = 0.80)
Power for a two-sample t-test
Solve for any missing parameter (‘n’, ‘delta’, ‘sd’, ‘sig.level’, or ‘power’). Mirrors R’s ‘power.t.test()’.
Usage
power_t_test(n, delta, sd, sig_level, power, alternative, type)
Arguments
n: Sample size per group (NULL to solve for it).
delta: Effect size (difference in means).
sd: Standard deviation (pooled).
sig_level: Type I error rate (alpha).
power: Desired power (1 - beta).
alternative: ‘“two.sided”’ or ‘“one.sided”’.
type: ‘“two.sample”’, ‘“one.sample”’, or ‘“paired”’.
Returns
Result of ‘stats::power.t.test()’.
Examples
power_t_test(n = NULL, delta = 0.5, power = 0.80)
Probability proportional to size (PPS) sampling
Usage
pps_sample(df, size_col, n, seed)
Arguments
df: A data frame.
size_col: Name of the size measure column.
n: Number of units to select.
seed: Random seed.
Returns
Data frame of selected units with ‘.weight’ (Hansen-Hurwitz weights).
Compute a confidence interval for a binomial proportion using the Wilson score method (default), exact Clopper-Pearson, or Wald method.
Usage
proportion_ci(successes, n, alpha = 0.05,
method = c("wilson", "exact", "wald"))
Arguments
successes: Number of successes.
n: Total observations.
alpha: Significance level (default 0.05 for a 95% CI).
method: "wilson" (default), "exact" (Clopper-Pearson), or "wald".
Returns
Named list with components ‘p_hat’, ‘ci_lower’, ‘ci_upper’.
Examples
proportion_ci(35, 100)
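The Wilson interval has a closed form (illustrative sketch of the standard score formula underlying the default method; the function's internals may differ):

```r
successes <- 35; n <- 100; z <- qnorm(0.975)
p_hat  <- successes / n

# Wilson score interval: recentre and shrink toward 0.5, then add the half-width
centre <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
ci <- c(ci_lower = centre - half, ci_upper = centre + half)
```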
Read outputs manifest from a project
Usage
read_outputs_manifest(project_root, manifest_path, validate)
Arguments
project_root: Project root path.
manifest_path: Optional explicit manifest path.
validate: If ‘TRUE’, validate schema.
Returns
Manifest data frame.
Risk difference (ARD) with Newcombe CI
Usage
risk_difference_ci(table_2x2, alpha)
Arguments
table_2x2: A 2x2 matrix: rows are exposure, columns are outcome.
alpha: Significance level.
Returns
Named list: ‘rd’, ‘ci_lower’, ‘ci_upper’.
Risk ratio (relative risk) with log-normal CI
Usage
risk_ratio_ci(table_2x2, alpha)
Arguments
table_2x2: A 2x2 matrix: rows are exposure, columns are outcome (disease = col 1).
alpha: Significance level.
Returns
Named list: ‘rr’, ‘ci_lower’, ‘ci_upper’.
Run the eBAC selection-adjusted IPW workflow
Mirrors the core outputs of the old ‘07_ebac_ipw.R’ workflow.
Usage
run_ebac_selection_ipw_analysis(data, output_dir, treatment, covariates)
Arguments
data: Analysis data frame.
output_dir: Optional directory for CSV outputs.
treatment: Treatment column name.
covariates: Covariate names used in the observation model.
Returns
Named list of output tables and the observed-domain analysis frame.
Run one implemented MORIE module against CPADS data
Usage
run_morie_module(
module_name,
cpads_csv = .cpads_default_csv(),
output_dir = NULL
)
Arguments
module_name: Module name.
cpads_csv: Path to the CPADS CSV.
output_dir: Optional directory for CSV outputs.
Returns
Named list of data-frame outputs.
Run multiple implemented MORIE modules
Usage
run_morie_modules(
modules = list_morie_modules()$name,
cpads_csv = .cpads_default_csv(),
output_dir = NULL
)
Arguments
modules: Character vector of module names.
cpads_csv: Path to the CPADS CSV.
output_dir: Optional directory for CSV outputs.
Returns
Named list of module outputs.
Run multiple workflow steps
Usage
run_pipeline(steps, project_root, script_map, stop_on_error, verbose)
Arguments
steps: Ordered vector of workflow step names.
project_root: Project root directory.
script_map: Named character vector mapping steps to script paths.
stop_on_error: If ‘TRUE’, stop at first failure.
verbose: If ‘TRUE’, streams command output.
Returns
Data frame of step statuses.
Run the CPADS propensity/IPW workflow
Mirrors the core outputs of the old ‘07_propensity.R’ workflow.
Usage
run_propensity_ipw_analysis(data, output_dir, trim, treatment, outcome, covariates)
Arguments
data: Analysis data frame.
output_dir: Optional directory for CSV outputs.
trim: Quantile pair used to trim extreme IPW values.
treatment: Binary treatment column.
outcome: Binary outcome column.
covariates: Covariate names for the propensity model.
Returns
Named list of output tables and the analysis data.
Run one project workflow step
Usage
run_workflow_step(step, project_root, script_map, rscript_bin, verbose)
Arguments
step: Step name present in ‘script_map’.
project_root: Project root directory.
script_map: Named character vector mapping steps to script paths.
rscript_bin: Optional path to ‘Rscript’ binary.
verbose: If ‘TRUE’, streams command output.
Returns
Named list with step metadata and exit status.
Sample size for logistic regression detecting a target odds ratio
Uses the sample-size formula from Hsieh et al. (1998).
Usage
sample_size_logistic(p0, or, alpha, power, two_sided)
Arguments
p0: Prevalence under control.
or: Target odds ratio.
alpha: Significance level.
power: Desired power.
two_sided: Logical.
Returns
Integer sample size.
Rosenbaum bounds sensitivity analysis
For a range of hidden-confounding levels \(\Gamma\), tests whether the treatment effect remains significant. A large \(\Gamma\) at which the result remains significant indicates robustness.
Uses Wilcoxon signed-rank statistic bounds for matched designs. For unmatched data, computes sign-score bounds.
Usage
sensitivity_rosenbaum(treated, control, gamma_range)
Arguments
treated: Numeric vector of outcomes for treated units.
control: Numeric vector of outcomes for control units.
gamma_range: Numeric vector of \(\Gamma\) values to test.
Returns
Data frame with columns: ‘gamma’, ‘p_lower’, ‘p_upper’.
Shapiro-Wilk normality test
Usage
shapiro_wilk_test(x, alpha)
Arguments
x: Numeric vector.
alpha: Significance level for the ‘is_normal’ flag (default 0.05).
Returns
Named list: ‘W’, ‘p_value’, ‘is_normal’.
Simple random sample from a data frame
Usage
simple_random_sample(df, n, replace, seed)
Arguments
df: A data frame.
n: Number of units to select.
replace: Sample with replacement? Default ‘FALSE’.
seed: Random seed for reproducibility.
Returns
A data frame of ‘n’ sampled rows with a ‘.weight’ column added.
Examples
df <- data.frame(x = 1:100)
srs_sample <- simple_random_sample(df, 20)
Spearman rank correlation
Usage
spearman_rho(x, y)
Arguments
x: Numeric vector.
y: Numeric vector.
Returns
Named list: ‘rho’, ‘p_value’.
Proportional or fixed stratified random sample
Usage
stratified_sample(df, strata_col, n_per_stratum, proportional, seed)
Arguments
df: A data frame.
strata_col: Name of the stratification column.
n_per_stratum: Either an integer (equal allocation) or a named integer vector of per-stratum sample sizes.
proportional: Logical; if ‘TRUE’, allocate proportionally to strata sizes.
seed: Random seed.
Returns
Data frame of sampled rows with a ‘.weight’ column.
Examples
df <- data.frame(g = c(rep("A", 60), rep("B", 40)), x = rnorm(100))
stratified_sample(df, "g", n_per_stratum = 10)
Summarize an output audit
Usage
summarize_output_audit(audit_tbl)
Arguments
audit_tblResult from [audit_public_outputs()].
Returns
Named list with high-level diagnostics.
Two-sample t-test with tidy output
Usage
two_sample_t_test(x1, x2, equal_var, alternative)
Arguments
x1: Numeric vector (group 1).
x2: Numeric vector (group 2).
equal_var: Assume equal variances? Default ‘FALSE’ (Welch test).
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘t’, ‘df’, ‘p_value’, ‘ci_diff’, ‘cohens_d’.
Examples
two_sample_t_test(rnorm(50, 0.5), rnorm(50, 0))
Validate a CPADS analysis data frame
Usage
validate_cpads_data(data, strict)
Arguments
data: Data frame to validate.
strict: If ‘TRUE’, stop when required variables are missing.
Returns
Character vector of missing variable names.
Validate outputs manifest structure
Usage
validate_outputs_manifest(manifest, strict)
Arguments
manifest: Data frame to validate.
strict: If ‘TRUE’, stop on validation failures.
Returns
‘TRUE’ when validation passes.
Wilcoxon signed-rank test (paired)
Usage
wilcoxon_signed_rank_test(x1, x2, alternative)
Arguments
x1: Numeric vector (before).
x2: Numeric vector (after).
alternative: ‘“two.sided”’, ‘“greater”’, or ‘“less”’.
Returns
Named list: ‘V’, ‘p_value’.
Write synthetic epidemiology-style data to CSV
Usage
write_synthetic_data(path, n, seed, special_code_rate, profile, name_map, overwrite)
Arguments
path: Output CSV path.
n: Number of rows.
seed: Random seed.
special_code_rate: Proportion of survey-style missing codes.
profile: Naming profile for output columns.
name_map: Optional custom variable name map.
overwrite: If ‘TRUE’, overwrite existing file.
Returns
Normalized output path.