Title: | Stratification and Matching for Large Observational Data Sets |
---|---|
Description: | A pilot matching design to automatically stratify and match large datasets. The manual_stratify() function allows users to manually stratify a dataset based on categorical variables of interest, while the auto_stratify() function does automatically by allocating a held-aside (pilot) data set, fitting a prognostic score (see Hansen (2008) <doi:10.1093/biomet/asn004>) on the pilot set, and stratifying the data set based on prognostic score quantiles. The strata_match() function then does optimal matching of the data set in parallel within strata. |
Authors: | Rachael C. Aikens [aut, cre], Joseph Rigdon [aut], Justin Lee [aut], Michael Baiocchi [aut], Jonathan Chen [aut] |
Maintainer: | Rachael C. Aikens <[email protected]> |
License: | GPL-3 |
Version: | 0.1.9 |
Built: | 2025-02-11 05:13:39 UTC |
Source: | https://github.com/cran/stratamatch |
Automatically creates strata for matching based on a prognostic score formula
or a vector of prognostic scores already estimated by the user. Creates a
auto_strata
object, which can be passed to strata_match
for stratified matching or unpacked by the user to be matched by some other
means.
auto_stratify( data, treat, prognosis, outcome = NULL, size = 2500, pilot_fraction = 0.1, pilot_size = NULL, pilot_sample = NULL, group_by_covariates = NULL )
auto_stratify( data, treat, prognosis, outcome = NULL, size = 2500, pilot_fraction = 0.1, pilot_size = NULL, pilot_sample = NULL, group_by_covariates = NULL )
data |
|
treat |
string giving the name of column designating treatment assignment |
prognosis |
information on how to build prognostic scores. Three different input types are allowed:
|
outcome |
string giving the name of column with outcome information. Required if prognostic_scores is specified. Otherwise it will be inferred from prog_formula |
size |
numeric, desired size of strata (default = 2500) |
pilot_fraction |
numeric between 0 and 1 giving the proportion of controls to be allotted for building the prognostic score (default = 0.1) |
pilot_size |
alternative to pilot_fraction. Approximate number of
observations to be used in pilot set. Note that the actual pilot set size
returned may not be exactly |
pilot_sample |
a data.frame of held aside samples for building
prognostic score model. If |
group_by_covariates |
character vector giving the names of covariates to be grouped by (optional). If specified, the pilot set will be sampled in a stratified manner, so that the composition of the pilot set reflects the composition of the whole data set in terms of these covariates. The specified covariates must be categorical. |
Stratifying by prognostic score quantiles can be more effective than manually stratifying a data set because the prognostic score is continuous, thus the strata produced tend to be of equal size with similar prognosis.
Automatic stratification requires information on how the prognostic scores
should be derived. This is primarily determined by the specifciation of the
prognosis
argument. Three main forms of input for prognosis
are allowed:
A vector of prognostic scores. This vector
should be the same length and order of the rows in the data set. If this
method is used, the outcome
argument must also be specified; this is
simply a string giving the name of the column which contains outcome
information.
A formula for prognosis (e.g. outcome ~ X1 + X2
).
If this method is used, auto_stratify
will automatically split the
data set into a pilot_set
and an analysis_set
. The pilot set
will be used to fit a logistic regression model for outcome in the absence of
treatment, and this model will be used to estimate prognostic scores on the
analysis set. The analysis set will then be stratified based on the
estimated prognostic scores. In this case the outcome
argument need
not be specified since it can be inferred from the input formula.
A
model for prognosis (e.g. a glm
object). If this method is used, the
outcome
argument must also be specified
Returns an auto_strata
object. This contains:
outcome
- a string giving the name of the column where outcome
information is stored
treat
- a string giving the name of the column encoding
treatment assignment
analysis_set
- the data set with strata assignments
call
- the call to auto_stratify
used to generate this
object
issue_table
- a table of each stratum and potential issues of
size and treat:control balance. In small or imbalanced strata, it may be
difficult or infeasible to find high-quality matches, while very large
strata may be computationally intensive to match.
strata_table
- a table of each stratum and the prognostic
score quantile bin to which it corresponds
prognostic_scores
- a vector of prognostic scores.
prognostic_model
- a model for prognosis fit on a pilot data
set. Will be NULL
if a vector of prognostic scores was provided as
the prognosis
argument to auto_stratify
rather than a model
or formula.
pilot_set
- the set of controls used to fit the prognostic
model. These are excluded from subsequent analysis so that the prognostic
score is not overfit to the data used to estimate the treatment effect.
Will be NULL
if a pre-fit model or a vector of prognostic scores was
provided as the prognosis
argument to auto_stratify
rather
than formula.
This section suggests fixes for common errors that appear while fitting the prognostic score or using it to estimate prognostic scores on the analysis set.
Encountered an error while fitting the prognostic model...
numeric probabilities 0 or 1 produced
. This error means that the
prognostic model can perfectly separate positive from negative outcomes.
Estimating a treatment effect in this case is unwise since an individual's
baseline characteristics perfectly determine their outcome, regardless of
whether they recieve the treatment. This error may also appear on rare
occaisions when your pilot set is very small (number of observations
approximately <= number of covariates in the prognostic model), so that
perfect separation happens by chance.
Encountered an error while estimating prognostic scores ...
factor X has new levels ...
This may indicate that some value(s) of one
or more categorical variables appear in the analysis set which were not
seen in the pilot set. This means that when we try to obtain prognostic
scores for our analysis set, we run into some new value that our prognostic
model was not prepared to handle. There are a few options we have to
troubleshoot this problem:
Rejection sampling. Run auto_stratify
again with the
same arguments until this error does not occur (i.e. until some
observations with the missing value are randomly selected into the pilot
set)
Eliminate this covariate from the prognostic formula.
Remove observations with the rare covariate value from the entire data set. Consider carefully how this exclusion might affect your results.
Other errors or warnings can occur if the pilot set is too small and the prognostic formula is too complicated. Always make sure that the number of observations in the pilot set is large enough that you can confidently fit a prognostic model with the number of covariates you want.
manual_stratify
, new_auto_strata
# make sample data set set.seed(111) dat <- make_sample_data(n = 75) # construct a pilot set, build a prognostic score for `outcome` based on X2 # and stratify the data set based on the scores into sets of about 25 # observations a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # stratify the data set based on a model for prognosis pilot_data <- make_sample_data(n = 30) prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial") a.strat_model <- auto_stratify(dat, "treat", prognostic_model, outcome = "outcome", size = 25 ) # stratify the data set based on a vector of prognostic scores prognostic_scores <- predict(prognostic_model, newdata = dat, type = "response" ) a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores, outcome = "outcome", size = 25 ) # diagnostic plots plot(a.strat_formula) plot(a.strat_formula, type = "AC", propensity = treat ~ X1, stratum = 1) plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1) plot(a.strat_formula, type = "residual")
# make sample data set set.seed(111) dat <- make_sample_data(n = 75) # construct a pilot set, build a prognostic score for `outcome` based on X2 # and stratify the data set based on the scores into sets of about 25 # observations a.strat_formula <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # stratify the data set based on a model for prognosis pilot_data <- make_sample_data(n = 30) prognostic_model <- glm(outcome ~ X2, pilot_data, family = "binomial") a.strat_model <- auto_stratify(dat, "treat", prognostic_model, outcome = "outcome", size = 25 ) # stratify the data set based on a vector of prognostic scores prognostic_scores <- predict(prognostic_model, newdata = dat, type = "response" ) a.strat_scores <- auto_stratify(dat, "treat", prognostic_scores, outcome = "outcome", size = 25 ) # diagnostic plots plot(a.strat_formula) plot(a.strat_formula, type = "AC", propensity = treat ~ X1, stratum = 1) plot(a.strat_formula, type = "hist", propensity = treat ~ X1, stratum = 1) plot(a.strat_formula, type = "residual")
By default, returns only the internal cut points. Cutoffs at 0 and 1 are implied.
extract_cut_points(x)
extract_cut_points(x)
x |
an autostrata object |
a vector of the score values delineating cutoffs between strata
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) cutoffs <- extract_cut_points(a.strat)
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) cutoffs <- extract_cut_points(a.strat)
Extract cutoffs between strata
## S3 method for class 'auto_strata' extract_cut_points(x)
## S3 method for class 'auto_strata' extract_cut_points(x)
x |
an autostrata object |
a vector of the score values delineating cutoffs between strata
An deidentified data set containing the demographics, comorbidities, DNR code
status, and surgical team assignment of 10,157 patients in the Stanford
University Hospital Intensive Care Unit (ICU). This data was extracted from
the electronic record system, deidentified, and made publically available by
Chavez et al (2018) <doi:10.1371/journal.pone.0190569>. It was reprocessed
for use in the stratamatch
package as a sample data set. For more
details on the data extraction and inclusion criteria, see Chavez et al.
ICU_data
ICU_data
A data frame with 10157 rows and 29 variables:
patient id, numeric
age of patient at time of admission to the ICU in days, numeric
whether the patient was documented to be female prior to ICU visit, binary
whether the patient's race/ethnicity was documented as Asian prior to ICU visit, binary
whether the patient's race/ethnicity was unknown prior to ICU visit, binary
whether the patient's race/ethnicity was documented as Other" prior to ICU visit, binary
whether the patient's race/ethnicity was documented as Black/African American prior to ICU visit, binary
whether the patient's race/ethnicity was documented as PacificIslander prior to ICU visit, binary
whether the patient's race/ethnicity was documented as Native American prior to ICU visit, binary
whether the patient was "self pay" (i.e. uninsured), binary
whether the patient was documented to be latino prior to ICU visit, binary
whether the patient had code status set to any DNR "Do not resuscitate" order at any point during their ICU stay, binary
whether the patient was assigned to a surgical team at any point during their ICU stay, binary
License information for this data is as follows:
Copyright (c) 2016, Stanford University
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
https://simtk.org/frs/download_confirm.php/latestzip/1969/ICUDNR-latest.zip?group_id=892
auto_strata
classChecks if the target object is an auto_strata
object.
is.auto_strata(object)
is.auto_strata(object)
object |
any R object |
Returns TRUE
if its argument has auto_strata
among its
classes and FALSE
otherwise.
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) is.auto_strata(a.strat) # returns TRUE
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) is.auto_strata(a.strat) # returns TRUE
manual_strata
classChecks if the target object is a manual_strata
object.
is.manual_strata(object)
is.manual_strata(object)
object |
any R object |
Returns TRUE
if its argument has manual_strata
among
its classes and FALSE
otherwise.
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) is.manual_strata(m.strat) # returns TRUE
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) is.manual_strata(m.strat) # returns TRUE
strata
classChecks if the target object is a strata
object.
is.strata(object)
is.strata(object)
object |
any R object |
Returns TRUE
if its argument has strata
among its
classes and FALSE
otherwise.
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) is.strata(m.strat) # returns TRUE
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) is.strata(m.strat) # returns TRUE
Makes the match distance with strata specifications for strata_match
.
This function is largely unecessary to call outside of stratamatch, but it is
exported for the benefit of the user to aid in debugging. Note that this
function requires that the R package optmatch
is installed.
make_match_distances(object, model, method)
make_match_distances(object, model, method)
object |
a strata object |
model |
(optional) formula for matching. If left blank, all
columns of the analysis set in |
method |
either "prop" for propensity score matching based on a glm fit
with model |
a match distance matrix for optmatch
https://cran.r-project.org/package=optmatch
dat <- make_sample_data(n = 75) # stratify with auto_stratify a.strat <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # make match distances. Requires optmatch package to be installed. md <- make_match_distances(a.strat, treat ~ X1 + X2, method = "mahal")
dat <- make_sample_data(n = 75) # stratify with auto_stratify a.strat <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # make match distances. Requires optmatch package to be installed. md <- make_match_distances(a.strat, treat ~ X1 + X2, method = "mahal")
Makes a simple data frame with treat (binary), outcome (binary), and five covariates: X1 (continuous), X2 (continuous), B1 (binary), B2 (binary), and C1 (categorical). Probability outcome = 1 is sigmoid(treat + X1). Probability treatment = 1 is sigmoid(- 0.2 * X1 + X2 - B1 + 2 * B2)
make_sample_data(n = 100)
make_sample_data(n = 100)
n |
the size of the desired data set |
# make sample data set of 30 observations dat <- make_sample_data(n = 30)
# make sample data set of 30 observations dat <- make_sample_data(n = 30)
Stratifies a data set based on a set of blocking covariates specified by the
user. Creates a manual_strata
object, which can be passed to
strata_match
for stratified matching or unpacked by the user to be
matched by some other means.
manual_stratify(data, strata_formula, force = FALSE)
manual_stratify(data, strata_formula, force = FALSE)
data |
data.frame with observations as rows, features as columns |
strata_formula |
the formula to be used for stratification. (e.g. |
force |
a boolean. If true, run even if a variable appears continuous. (default = FALSE) |
Returns a manual_strata
object. This contains:
treat
- a string giving the name of the column encoding
treatment assignment
covariates
- a character vector with the names of the
categorical columns on which the data were stratified
analysis_set
- the data set with strata assignments
call
- the call to manual_stratify
used to generate this
object
issue_table
- a table of each stratum and potential issues of
size and treat:control balance. In small or imbalanced strata, it may be
difficult or infeasible to find high-quality matches, while very large
strata may be computationally intensive to match.
strata_table
- a table of each stratum and the covariate bin
to which it corresponds
auto_stratify
, new_manual_strata
# make sample data set dat <- make_sample_data(n = 75) # stratify based on B1 and B2 m.strat <- manual_stratify(dat, treat ~ B1 + B2) # diagnostic plot plot(m.strat)
# make sample data set dat <- make_sample_data(n = 75) # stratify based on B1 and B2 m.strat <- manual_stratify(dat, treat ~ B1 + B2) # diagnostic plot plot(m.strat)
auto_strata
objectGenerates diagnostic plots for the product of a stratification by
auto_stratify
. There are four plot types:
"SR"
(default) - produces a scatter plot of strata by size and
treat:control ratio
"hist"
- produces a histogram of propensity
scores within a stratum
"AC"
- produces a Assignment-Control
plot of individuals within a stratum
"residual"
- produces a
residual plot for the prognostic model
## S3 method for class 'auto_strata' plot( x, type = "SR", label = FALSE, stratum = "all", strata_lines = TRUE, jitter_prognosis, jitter_propensity, propensity, ... )
## S3 method for class 'auto_strata' plot( x, type = "SR", label = FALSE, stratum = "all", strata_lines = TRUE, jitter_prognosis, jitter_propensity, propensity, ... )
x |
an |
type |
string giving the plot type (default = |
label |
ignored unless |
stratum |
ignored unless |
strata_lines |
default = |
jitter_prognosis |
ignored unless |
jitter_propensity |
ignored unless |
propensity |
ignored unless |
... |
other arguments |
Aikens, Greaves, and Baiocchi (2020) in Statistics in Medicine, Section 3.2 for an explaination of Assignment-Control plots (formerly "Fisher-Mill" plots).
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) plot(a.strat) # makes size-ratio scatter plot plot(a.strat, type = "hist", propensity = treat ~ X1, stratum = 1) plot(a.strat, type = "AC", propensity = treat ~ X1, stratum = 1) plot(a.strat, type = "residual")
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) plot(a.strat) # makes size-ratio scatter plot plot(a.strat, type = "hist", propensity = treat ~ X1, stratum = 1) plot(a.strat, type = "AC", propensity = treat ~ X1, stratum = 1) plot(a.strat, type = "residual")
manual_strata
objectGenerates diagnostic plots for the product of a stratification by
manual_stratify
. There are two plot types:
"SR"
(default) - produces a scatter plot of strata by size and
treat:control ratio
"hist"
- produces a histogram of propensity
scores within a stratum.
Note that residual plots and AC plots are not
supported for manual_strata
objects because no prognostic model is
fit.
## S3 method for class 'manual_strata' plot(x, type = "SR", label = FALSE, stratum = "all", propensity, ...)
## S3 method for class 'manual_strata' plot(x, type = "SR", label = FALSE, stratum = "all", propensity, ...)
x |
a |
type |
string giving the plot type (default = |
label |
ignored unless |
stratum |
ignored unless |
propensity |
ignored unless |
... |
other arguments |
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) plot(m.strat) # makes size-ratio scatter plot plot(m.strat, type = "hist", propensity = treat ~ X1, stratum = 1)
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) plot(m.strat) # makes size-ratio scatter plot plot(m.strat, type = "hist", propensity = treat ~ X1, stratum = 1)
Print method for auto_strata
object
## S3 method for class 'auto_strata' print(x, ...)
## S3 method for class 'auto_strata' print(x, ...)
x |
an |
... |
other arguments |
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) print(a.strat) # prints information about a.strat
dat <- make_sample_data() a.strat <- auto_stratify(dat, "treat", outcome ~ X1 + X2) print(a.strat) # prints information about a.strat
Print method for manual_strata
object
## S3 method for class 'manual_strata' print(x, ...)
## S3 method for class 'manual_strata' print(x, ...)
x |
a |
... |
other arguments |
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) print(m.strat) # prints information about m.strat
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) print(m.strat) # prints information about m.strat
Given a data set and some parameters about how to split the data, this
function partitions the data accordingly and returns the partitioned data as
a list containing the analysis_set
and pilot_set
.
split_pilot_set( data, treat, pilot_fraction = 0.1, pilot_size = NULL, group_by_covariates = NULL )
split_pilot_set( data, treat, pilot_fraction = 0.1, pilot_size = NULL, group_by_covariates = NULL )
data |
|
treat |
string giving the name of column designating treatment assignment |
pilot_fraction |
numeric between 0 and 1 giving the proportion of controls to be allotted for building the prognostic score (default = 0.1) |
pilot_size |
alternative to pilot_fraction. Approximate number of
observations to be used in pilot set. Note that the actual pilot set size
returned may not be exactly |
group_by_covariates |
character vector giving the names of covariates to be grouped by (optional). If specified, the pilot set will be sampled in a stratified manner, so that the composition of the pilot set reflects the composition of the whole data set in terms of these covariates. The specified covariates must be categorical. |
a list with analaysis_set and pilot_set
dat <- make_sample_data() splt <- split_pilot_set(dat, "treat", 0.2) # can be passed into auto_stratify if desired a.strat <- auto_stratify(splt$analysis_set, "treat", outcome ~ X1, pilot_sample = splt$pilot_set )
dat <- make_sample_data() splt <- split_pilot_set(dat, "treat", 0.2) # can be passed into auto_stratify if desired a.strat <- auto_stratify(splt$analysis_set, "treat", outcome ~ X1, pilot_sample = splt$pilot_set )
Match within strata in series using optmatch. Note that this function
requires that the R package optmatch
is installed.
strata_match(object, model = NULL, method = "prop", k = 1)
strata_match(object, model = NULL, method = "prop", k = 1)
object |
a strata object |
model |
(optional) formula for matching. If left blank, all
columns of the analysis set in |
method |
either "prop" for propensity score matching based on a glm fit
with model |
k |
the number of control individuals to be matched to each treated
individual. If |
a named factor with matching assignments
https://cran.r-project.org/package=optmatch
# make a sample data set set.seed(1) dat <- make_sample_data(n = 75) # stratify with auto_stratify a.strat <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # 1:1 match based on propensity formula: treat ~ X1 + X2 # Requires optmatch package to be installed. strata_match(a.strat, model = treat ~ X1 + X2, k = 1) # full match within strata based on mahalanobis distance. # Requires optmatch package to be installed. strata_match(a.strat, model = treat ~ X1 + X2, method = "mahal", k = 1)
# make a sample data set set.seed(1) dat <- make_sample_data(n = 75) # stratify with auto_stratify a.strat <- auto_stratify(dat, "treat", outcome ~ X2, size = 25) # 1:1 match based on propensity formula: treat ~ X1 + X2 # Requires optmatch package to be installed. strata_match(a.strat, model = treat ~ X1 + X2, k = 1) # full match within strata based on mahalanobis distance. # Requires optmatch package to be installed. strata_match(a.strat, model = treat ~ X1 + X2, method = "mahal", k = 1)
This package employs a pilot matching design to automatically stratify and
match large datasets. The manual_stratify
function allows
users to manually stratify a dataset based on categorical variables of
interest, while the auto_stratify
function does automatically
by allocating a held-aside (pilot) data set, fitting a prognostic score (see
Hansen (2008) <doi:10.1093/biomet/asn004>) on the pilot set, and stratifying
the data set based on prognostic score quantiles. The
strata_match
function then does optimal matching of the data
set within strata.
Summarize number and sizes of strata in a strata
object. Also prints
number of strata with potential issues.
## S3 method for class 'strata' summary(object, ...)
## S3 method for class 'strata' summary(object, ...)
object |
a |
... |
other arguments |
For more information, access the issue table for your strata object with
mystrata$issue_table
.
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) summary(m.strat) # Summarizes strata in m.strat
dat <- make_sample_data() m.strat <- manual_stratify(dat, treat ~ C1) summary(m.strat) # Summarizes strata in m.strat