Processes the arguments passed to the main methods in the
did package and runs some checks to make sure the data is in
the proper format, throwing helpful error messages when it is not.
Usage
pre_process_did(
yname,
tname,
idname,
gname,
xformla = NULL,
data,
panel = TRUE,
allow_unbalanced_panel,
control_group = c("nevertreated", "notyettreated"),
anticipation = 0,
weightsname = NULL,
alp = 0.05,
bstrap = FALSE,
cband = FALSE,
biters = 1000,
clustervars = NULL,
est_method = "dr",
base_period = "varying",
print_details = TRUE,
faster_mode = FALSE,
pl = FALSE,
cores = 1,
call = NULL
)
Arguments
- yname
The name of the outcome variable
- tname
The name of the column containing the time periods
- idname
The individual (cross-sectional unit) id name
- gname
The name of the variable in data that contains the first period when a particular observation is treated. This should be a positive number for all observations in treated groups. It defines which "group" a unit belongs to. It should be 0 for units in the untreated group.
- xformla
A formula for the covariates to include in the model. It should be of the form ~ X1 + X2. Default is NULL which is equivalent to xformla=~1. This is used to create a matrix of covariates which is then passed to the 2x2 DID estimator chosen in est_method.
For time-varying covariates: (1) With balanced panel data, in each 2x2 comparison, the covariates are taken to be the value of the covariates in the earlier time period, and all of the underlying computations involve the change in Y as a function of those values of the covariates. (2) With repeated cross sections data and unbalanced panel data, the covariates are taken from each time period and computations involve Y_post conditional on X_post minus Y_pre conditional on X_pre. A byproduct of this is that, with balanced panel data and in the presence of time-varying covariates, it is possible to get different numerical results according to whether allow_unbalanced_panel=TRUE or FALSE.
- data
The name of the data.frame that contains the data
- panel
Whether or not the data is a panel dataset. The panel dataset should be provided in long format, that is, where each row corresponds to a unit observed at a particular point in time. The default is TRUE. When using a panel dataset, the variable idname must be set. When panel=FALSE, the data is treated as repeated cross sections.
- allow_unbalanced_panel
Controls whether the function "balances" the panel with respect to time and id. The default value is FALSE, which means that att_gt() will drop all units where data is not observed in all periods. The advantage of this is that the computations are faster (sometimes substantially).
- control_group
Which units to use as the control group. The default is "nevertreated" which sets the control group to be the group of units that never participate in the treatment. This group does not change across groups or time periods. The other option is to set control_group="notyettreated". In this case, the control group is set to the group of units that have not yet participated in the treatment in that time period. This includes all never treated units, and it also includes units that eventually participate in the treatment but have not participated yet.
- anticipation
The number of time periods before participating in the treatment in which units can anticipate participating in the treatment, so that the treatment can already affect their untreated potential outcomes.
- weightsname
The name of the column containing the sampling weights. If not set, all observations have the same weight.
- alp
The significance level. The default is 0.05.
- bstrap
Boolean for whether or not to compute standard errors using the multiplier bootstrap. If standard errors are clustered, then one must set bstrap=TRUE. If bstrap is FALSE, then analytical standard errors are reported. The signature under Usage sets the default to FALSE (att_gt() itself defaults to TRUE for both bstrap and cband, so that uniform confidence bands are returned).
- cband
Boolean for whether or not to compute a uniform confidence band that covers all of the group-time average treatment effects with fixed probability 1-alp. In order to compute uniform confidence bands, bstrap must also be set to TRUE. The signature under Usage sets the default to FALSE (att_gt() itself defaults to TRUE).
- biters
The number of bootstrap iterations to use. The default is 1000, and this is only applicable if bstrap=TRUE.
- clustervars
A vector of variable names to cluster on. At most, there can be two variables (otherwise an error is thrown) and one of these must be the same as idname, which allows for clustering at the individual level. By default, we cluster at the individual level (when bstrap=TRUE).
- est_method
The method to compute group-time average treatment effects. The default is "dr" which uses the doubly robust approach in the DRDID package. Other built-in methods include "ipw" for inverse probability weighting and "reg" for first step regression estimators. The user can also pass their own function for estimating group-time average treatment effects. This should be a function f(Y1,Y0,treat,covariates) where Y1 is an n x 1 vector of post-treatment outcomes, Y0 is an n x 1 vector of pre-treatment outcomes, treat is a vector indicating whether or not an individual participates in the treatment, and covariates is an n x k matrix of covariates. The function should return a list that includes ATT (an estimated average treatment effect), and inf.func (an n x 1 influence function). The function can return other things as well, but these are the only two that are required. est_method is only used if covariates are included.
- base_period
Whether to use a "varying" base period or a "universal" base period. Either choice results in the same post-treatment estimates of ATT(g,t)'s. In pre-treatment periods, using a varying base period amounts to computing a pseudo-ATT in each treatment period by comparing the change in outcomes for a particular group relative to its comparison group in the pre-treatment periods (i.e., in pre-treatment periods this setting computes changes from period t-1 to period t, but repeatedly changes the value of t).
A universal base period fixes the base period to always be (g-anticipation-1). This does not compute pseudo-ATT(g,t)'s in pre-treatment periods, but rather reports average changes in outcomes from period t to (g-anticipation-1) for a particular group relative to its comparison group. This is analogous to what is often reported in event study regressions.
Using a varying base period results in an estimate of ATT(g,t) being reported in the period immediately before treatment. Using a universal base period normalizes the estimate in the period right before treatment (or earlier when the user allows for anticipation) to be equal to 0, but reports one extra estimate in an earlier period.
- print_details
Whether or not to show details/progress of computations. The signature under Usage sets the default to TRUE (att_gt() itself defaults to FALSE).
- faster_mode
This option enables a faster version of did, optimizing computation time for large datasets by improving data management within the package. The default is set to FALSE. While the difference is minimal for small datasets, it is recommended for use with large datasets.
- pl
Whether or not to use parallel processing
- cores
The number of cores to use for parallel processing
- call
Function call to att_gt
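To make the gname coding and the two control_group options concrete, here is a minimal base R sketch. It is not the package's implementation: the toy data frame and the helper control_units() are hypothetical, and the helper only illustrates which units would qualify as controls for a group g in period t under the descriptions above.

```r
# Toy long-format panel: 4 units observed in periods 1..4.
# first_treat follows the gname convention: 0 = never treated,
# otherwise the first period in which the unit is treated.
toy <- data.frame(
  id          = rep(1:4, each = 4),
  period      = rep(1:4, times = 4),
  first_treat = rep(c(0, 2, 3, 4), each = 4)
)

# Hypothetical helper (not part of did): ids usable as controls for
# group g at time t under each control_group option.
control_units <- function(df, g, t, control_group = "nevertreated",
                          anticipation = 0) {
  never <- df$first_treat == 0
  if (control_group == "nevertreated") {
    keep <- never
  } else {
    # Not yet treated by period t (allowing for anticipation),
    # excluding the group under study itself.
    keep <- never |
      (df$first_treat > t + anticipation & df$first_treat != g)
  }
  sort(unique(df$id[keep]))
}

never_ids  <- control_units(toy, g = 2, t = 2, "nevertreated")
notyet_ids <- control_units(toy, g = 2, t = 2, "notyettreated")
```

In this toy setup the never-treated control set is fixed across g and t, while the not-yet-treated set shrinks as later-treated units start treatment.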
Value
a DIDparams object
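Examples
A hedged sketch of a direct call, assuming the did package is installed, that pre_process_did is exported, and that the mpdta example dataset shipped with the package is available (columns lemp, year, countyreal, first.treat, lpop). Because allow_unbalanced_panel has no default in the signature above, it is supplied explicitly. The call is skipped when the package is not available.

```r
# Assumes the did package is installed; skipped otherwise.
if (requireNamespace("did", quietly = TRUE)) {
  data(mpdta, package = "did")

  dp <- did::pre_process_did(
    yname = "lemp",
    tname = "year",
    idname = "countyreal",
    gname = "first.treat",
    xformla = ~lpop,
    data = mpdta,
    panel = TRUE,
    allow_unbalanced_panel = FALSE,  # no default in the signature
    control_group = "nevertreated",
    print_details = FALSE
  )

  # Per the Value section, this should be a DIDparams object.
  class(dp)
}
```

In ordinary use this function is called internally by the main methods, so a direct call like this is mostly useful for checking how the inputs are interpreted before estimation.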
