Select control units that optimize covariate balance — optimize

Select control units within strata that optimize covariate balance. Uses randomized rounding of a linear program or a mixed integer linear program.

optimize_controls(
  z,
  X,
  st,
  importances = NULL,
  treated = 1,
  ratio = NULL,
  q_s = NULL,
  treated_star = NULL,
  q_star_s = NULL,
  weight_star = 1,
  integer = FALSE,
  solver = "Rglpk",
  seed = NULL,
  runs = 1,
  time_limit = Inf,
  threads = 1,
  correct_sizes = TRUE,
  low_memory = FALSE
)

Arguments

z: a factor with the ith entry equal to the treatment of unit i.
X: a matrix or data frame containing constraints in the columns. The number of rows should equal the length of z. Balance is achieved when a constraint sums to 0, such that numbers closer to 0 are better. When a constraint does not apply to a particular unit, the entry should be NA. This should typically be generated using generate_constraints().
st: a stratum vector with the ith entry equal to the stratum of unit i. This should have the same order of units and length as z.
importances: a vector with length equal to the number of constraints or columns in X. This can be generated using generate_constraints() and each nonnegative value denotes how much to prioritize each constraint, with the default being 1 for all constraints.
treated: which treatment value should be considered the treated units. This must be one of the values of z.
ratio: a numeric or vector specifying the desired ratio of controls to `treated` in each stratum. If there is one control group and all treated units should be included, this can be a numeric. Otherwise, this should be a vector with one entry per treatment group, in the same order as the levels of z, including the treated level. If NULL, q_s should be specified.
q_s: a named vector or matrix indicating how many units are to be selected from each stratum. If there is one control group and all treated units are desired, this can be a vector; otherwise, this should have one row per treatment group, where the order of the rows matches the order of the levels of z, including the treated level. If NULL, ratio should be specified. If both are specified, q_s will take priority. Typically, if the desired ratio is not feasible for every stratum, q_s should be generated using generate_qs().
treated_star: which treatment value should be considered the treated units for the supplemental comparison. This must be one of the values of z. If multiple supplemental comparisons are desired, this should be a vector with one entry per supplemental comparison.
q_star_s: a named vector or matrix, indicating how many supplemental units are to be selected from each stratum. The matrix should have one row per treatment group, where the order of the rows matches the order of the levels of z, including the treated level. If multiple supplemental comparisons are desired, this should be a list with one entry per supplemental comparison.
weight_star: a numeric stating how much to prioritize balance between the supplemental units as compared to balance between the main units. If multiple supplemental comparisons are desired, this should be a vector with one entry per supplemental comparison.
integer: a logical stating whether to use a mixed integer programming solver instead of randomized rounding. Default is FALSE.
solver: a character stating which solver to use to run the linear program. Options are "Rglpk" (default) or "gurobi". You must have the 'gurobi' package installed to use the "gurobi" option. If available, this is the recommended solver.
seed: the seed to use when doing the randomized rounding of the linear program. This will allow results to be reproduced if desired. The default is NULL, which will choose a random seed to use and return.
runs: the number of times to run randomized rounding of the linear solution. The objective values of all runs will be reported, but the detailed results will only be reported for the run with the lowest objective value. The default is 1.
time_limit: numeric stating maximum amount of seconds for which the program is allowed to run before aborting. Default is Inf for no time limit.
threads: The maximum number of threads that should be used. This is only applicable if solver = 'gurobi'.
correct_sizes: boolean stating whether the desired sample sizes should be exactly correct (if correct_sizes = TRUE) or only need to be correct in expectation. For multiple comparisons, sample sizes may only be correct in expectation.
low_memory: boolean stating whether some outputs should not be included due to the scale of the problem being too large compared to memory space. If TRUE, eps and eps_star will not be reported. Imbalances can be computed post hoc using the check_balance() instead.

Value

List containing:

objective

objective value of the randomized rounding or mixed integer linear program solution.

objective_wo_importances

objective value of the randomized rounding or mixed integer linear program solution not weighted by the variable importances.

eps

the amount of imbalance obtained in each constraint from the linear program. The row names specify the covariate, the population of interest, and, if there are more than two comparison groups, which groups are being compared.

eps_star

same as eps but for the supplemental units instead of the units in the main comparison. If there are multiple supplemental comparisons, this is a list. If there are none, this is NULL.

importances

the importance of each on the balance constraints.

weight_star

the importance of balancing in the supplemental comparison relative to the main comparison. If there are multiple supplemental comparisons, this is a vector. If there are none, this is NULL.

selected

whether each unit was selected for the main comparison.

selected_star

whether each unit was selected for the supplement. If there are multiple supplemental comparisons, this is a list. If there are none, this is NULL.

pr

the linear program weight assigned to each unit for the main comparison.

pr_star

the linear program weight assigned to each unit for the supplement. If there are multiple supplemental comparisons, this is a list. If there are none, this is NULL.

rrdetails

A list containing:

seed: the seed used before commencing the random sampling.
run_objectives: the objective values for each run of randomized rounding.
run_objectives_wo_importances: the objective values for each run of randomized rounding, not scaled by constraint importances.

lpdetails

the full return of the function Rglpk_solve_LP() or gurobi() plus information about the epsilons and objective values for the linear program solution.

Examples


data('nh0506')

# Create strata
age_cat <- cut(nh0506$age,
               breaks = c(19, 39, 50, 85),
               labels = c('< 40 years', '40 - 50 years', '> 50 years'))
strata <- age_cat : nh0506$sex

# Balance age, race, education, poverty ratio, and bmi both across and within the levels of strata
constraints <- generate_constraints(
                 balance_formulas = list(age + race + education + povertyr + bmi ~ 1 + strata),
                 z = nh0506$z,
                 data = nh0506)

# Choose one control for every treated unit in each stratum,
# balancing the covariates as described by the constraints
results <- optimize_controls(z = nh0506$z,
                             X = constraints$X,
                             st = strata,
                             importances = constraints$importances,
                             ratio = 1)

# If you want to use a ratio that's not feasible,
# you can supply a vector of the desired number of controls per stratum, q_s,
# typically generated by creating a distance matrix between strata and using
# generate_qs():

if (FALSE) {
age_dist <- matrix(data = c(0, 1, 2, 1, 0, 1, 2, 1, 0),
                   nrow = 3,
                   byrow = TRUE,
                   dimnames = list(levels(age_cat), levels(age_cat)))

sex_dist <- matrix(data = c(0, 1, 1, 0),
                   nrow = 2,
                   dimnames = list(levels(nh0506$sex), levels(nh0506$sex)))

strata_dist <- create_dist_matrix(age_dist, sex_dist)

qs <- generate_qs(z = nh0506$z,
                  st = strata,
                  ratio = 2.5,
                  max_ratio = 2.6,
                  max_extra_s = 0,
                  strata_dist = strata_dist)

results <- optimize_controls(z = nh0506$z,
                             X = constraints$X,
                             st = strata,
                             importances = constraints$importances,
                             q_s = qs)

}

# We can also have multiple treatment and control groups,
# as well as multiple simultaneous comparisons:

if (FALSE) {
data('nh0506_3groups')
strata2 <- cut(nh0506_3groups$age, breaks = c(19, 39, 50, 85),
              labels = c('< 40 years', '40 - 50 years', '> 50 years'))
constraints2 <- generate_constraints(
  balance_formulas = list(age + race + education + povertyr + bmi + sex ~ 1 + strata2),
  z = nh0506_3groups$z,
  data = nh0506_3groups,
  treated = 'daily smoker')
q_star_s <- matrix(c(rep(table(nh0506_3groups$z, strata2)['some smoking', ] -
                           table(nh0506_3groups$z, strata2)['daily smoker', ], 2),
                     rep(0, 3)), byrow = TRUE, nrow = 3,
                   dimnames = list(levels(nh0506_3groups$z), levels(strata2)))

results <- optimize_controls(z = nh0506_3groups$z,
                             X = constraints2$X,
                             importances = constraints2$importances,
                             st = strata2,
                             ratio = 1,
                             treated = 'daily smoker',
                             treated_star = 'some smoking',
                             q_star_s = q_star_s,
                             correct_sizes = FALSE)

}