did 2.5.0
This is a large release that consolidates all development since 2.3.0. Headline changes: a substantially faster engine (group-time ATTs are roughly 2.5-3x faster in common settings, and the conditional pre-test is several times faster and far lighter on memory), first-class clustered and unbalanced-panel inference, support for transformation and factor covariates, a point-estimates-only mode, and a long list of correctness fixes. Numerical results are unchanged up to floating-point precision except where a bug fix is explicitly noted.
Dependencies
- Requires DRDID (>= 1.3.0), which provides faster and more robust 2x2 DiD estimators (used internally for every group-time ATT) and guards against silently-incorrect standard errors on ill-conditioned (near-singular) designs.
New features
- New compute_inffunc argument in att_gt() (default TRUE). Set compute_inffunc = FALSE for a point-estimates-only run: it returns the group-time ATT point estimates (identical to a full run) without influence functions, standard errors, uniform bands, or the pre-test. This is faster and uses much less memory (no influence-function matrix is ever formed or bootstrapped), which helps for quick exploration or very large datasets. The result cannot be passed to aggte() (which now errors with a clear message); bstrap and cband are set to FALSE automatically.
- Covariate formulas (xformla) may now use transformations and other model-matrix terms, e.g. ~ I(X^2), ~ log(X), ~ poly(X, 2), ~ X1 * X2. These previously errored because pre-processing stored the evaluated model frame (losing the raw variable and creating matrix-valued columns) instead of the raw covariates. Pre-processing now keeps the raw covariates so the design matrix can be rebuilt, and drops rows whose evaluated design is non-finite (e.g. log() of a non-positive value).
- Factor covariates now produce exactly the same estimates, standard errors, and warning messages as adding their dummy columns by hand. Previously the faster_mode = FALSE path applied droplevels() within each 2x2 comparison, so a factor level absent from a comparison changed the design (and could error with “contrasts can be applied only to factors with 2 or more levels”). The design matrix is now built once over the full sample, with global factor levels, and row-subset per cell.
- New fix_weights argument in att_gt() for explicit control over how time-varying sampling weights are resolved in each 2x2 comparison: NULL (default, prior behavior), "varying" (per-observation weights via the RC estimators), "base_period" (fix at g-1), or "first_period". See ?att_gt. A runtime message points users to it when time-varying weights are detected in balanced panel data.
- att_gt() accepts ... to forward additional arguments to a custom est_method function.
- Added nobs() S3 methods for MP and AGGTEobj objects (number of unique cross-sectional units), and statistic (t-statistic) and p.value (pointwise, two-sided) columns to tidy() output for both classes, following broom conventions.
- The influence-function matrix returned in MP$inffunc now carries the unit ids as rownames (the idname values; an internal observation index for repeated cross sections), and its row-order contract is documented in ?att_gt and ?MP. The row ORDER is mode-specific – faster_mode = FALSE sorts units by id while faster_mode = TRUE uses an internal (period, cohort, id) ordering – so external consumers of the influence functions (e.g. sensitivity analyses or custom cluster aggregation) must align rows by rowname, never by position. Values are unchanged; only the labels are new.
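The factor-covariate change can be illustrated with base R alone. The following is a minimal sketch, not the package's code: the data frame df, its columns, and cell_rows are hypothetical. It builds the design matrix once with global factor levels and row-subsets it for one cell, then shows that applying droplevels() inside the cell triggers the contrasts error quoted in the entry.

```r
# Hypothetical toy data: one numeric covariate and one factor covariate.
df <- data.frame(
  x = c(0.5, 1.2, -0.3, 2.0, 0.7, 1.1),
  f = factor(c("a", "b", "c", "a", "b", "a"))
)

# Build the design matrix once over the full sample (global factor levels).
X_full <- model.matrix(~ x + f, data = df)

# A hypothetical 2x2 cell in which only level "a" is present.
cell_rows <- c(1, 4, 6)
X_cell <- X_full[cell_rows, , drop = FALSE]

# The cell keeps exactly the same columns as the full design.
same_columns <- identical(colnames(X_cell), colnames(X_full))

# Dropping unused levels inside the cell leaves a one-level factor,
# and model.matrix() then fails when it tries to build contrasts.
cell_df <- droplevels(df[cell_rows, ])
msg <- tryCatch(
  {
    model.matrix(~ x + f, data = cell_df)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(same_columns)
print(msg)
```

Row-subsetting a design that was built once keeps the column set fixed across cells, which is why a level missing from one comparison no longer changes the design.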
Performance and memory
- att_gt(faster_mode = TRUE) (the default) is about 2.5-3x faster on common problems, with identical results. Each (g,t) cell previously built and extracted a data.table even though the cohort vectors are already materialized in the pre-computed tensors; the per-cell cohort is now a plain list of vectors passed straight to the DRDID estimators (panel, repeated cross-section, and unbalanced-panel paths), and the unbalanced-panel influence-function aggregation applies a sparse unit-aggregation operator built once per call (bit-identical to the previous per-cell rowsum()/data.table group-by, but without re-hashing the row-to-unit map for every (g,t) cell; about 2x faster end-to-end on large unbalanced panels). Earlier (g,t) work also feeds in: cumulative cohort sizes, cached pre-treatment periods, pre-computed cohort and period-membership masks on the repeated-cross-section / unbalanced-panel path, and sparse-triplet influence-matrix construction. One intentional edge-case change in the unbalanced-panel aggregation: NA/NaN entries in a 2x2 influence function (only reachable with a custom est_method that returns a finite ATT alongside a partially-NA influence function) now propagate to that unit’s aggregated influence value, so the affected (g,t) standard error is NA, matching faster_mode = FALSE; previously the fast path silently zeroed them, understating the standard error.
- The per-(g,t) overlap-check propensity logit (fit for every dr/ipw cell to detect propensity-score overlap violations) now uses fastglm’s low-level entry point (fastglmPure) instead of the fastglm() wrapper, skipping its per-call input coercion and family/deviance bookkeeping. The fitted values – and therefore the overlap decision – are bit-identical.
- Two further per-cell guard speedups, in both faster_mode paths, with bit-identical estimates, standard errors, influence functions, and warnings. (1) For intercept-only designs (xformla = NULL / ~1, the default) the overlap and regression-feasibility guards use their closed forms – an unweighted intercept-only logit fits every unit at mean(D), and the control-unit Gram matrix is the scalar control count – skipping the per-cell logit fit entirely; within 1e-6 of the 0.999 cutoff the real fit is still used, so knife-edge decisions are unchanged. (2) For panel data with control_group = "nevertreated" and non-varying weights, the guard booleans are computed once per (group, covariate-period, weight-period) instead of being refit for every post-treatment cell of a group (the guards’ inputs are bit-identical across those cells); failed cells still warn once per cell, and options(did.disable_check_cache = TRUE) restores the per-cell checks. Together these make a default no-covariate att_gt() run roughly 10-15% faster, with similar gains on covariate runs using never-treated controls.
- att_gt(faster_mode = FALSE) builds the covariate design matrix once and assembles each 2x2 cell directly from precomputed per-period blocks (outcomes, weights, design) indexed by position, instead of rebuilding model.matrix() and reshaping (get_wide_data()) the long data for every cell. Bit-identical, with about half the transient allocation; options(did.disable_precompute = TRUE) restores the original per-cell assembly (the once-built design matrix is used either way, so the option is a debugging escape hatch for the cell assembly only, with identical results). The repeated-cross-section / unbalanced-panel slow path gets the same treatment: each (g,t) cell is assembled positionally from per-period row indices and plain column vectors precomputed once per call, replacing two full data.table subsets, a full-data %in%, and a droplevels() per cell (about 1.7x faster end-to-end on both the repeated-cross-section and unbalanced-panel paths, bit-identical, behind the same escape hatch); the fix_weights = "base_period"/"first_period" weight lookup likewise uses per-period vectors instead of a per-cell full-table subset. The repeated-cross-section / unbalanced influence-function aggregation now uses rowsum() instead of stats::aggregate() (about 40x faster on that step). faster_mode = TRUE and faster_mode = FALSE remain identical to numerical precision for every supported option.
- Pre-processing for faster_mode = TRUE is leaner, with identical outputs: only the columns the call references (id/time/group/outcome/weights/cluster plus the raw xformla variables) are copied out of the input data, so wide data frames no longer pay a full-table copy (cutting the transient memory peak by roughly the size of the unused columns); the balanced-panel checks use an arithmetic row-count test instead of full by-unit groupings (the grouping now only runs when unbalanced units actually need to be identified); guaranteed no-op complete-case passes are short-circuited behind anyNA(); the temporary asif_never_treated/treated_first_period columns are replaced by local vectors; and the period/crosstable count tables are derived without re-grouping the unit-level table.
- The conditional pre-test (conditional_did_pretest()) is several times faster end-to-end and far lighter on memory. Its multiplier bootstrap (test.mboot()) previously looped over biters draws, each multiplying the full n x k x nX influence array by fresh weights – O(n^2 k) work and an O(n^2 k) transient allocation per draw (over 1 GB per draw at a few thousand units). It is now a single tiled matrix contraction (100x+ faster on that step, with the per-draw gigabyte allocations eliminated), numerically identical to the old loop up to floating-point summation order; the indicator() weighting function is also vectorized.
- Internal speedups with identical results: vectorized the multiplier-bootstrap post-processing (mboot()), removed a duplicated n x k matrix construction in the aggregation estimated-weight influence term (wif()), preallocated the sparse influence-function assembly, and removed redundant work in pre-processing and simulation.
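The switch from stats::aggregate() to rowsum() for summing influence-function rows by unit can be sketched with base R. This is an illustration under assumed inputs, not the package's internals: unit and inf are hypothetical stand-ins for the row-to-unit map and the observation-level influence matrix.

```r
set.seed(1)

# Hypothetical inputs: 7 observation rows belonging to 3 units,
# with influence-function values for two (g,t) cells.
unit <- c(3, 1, 2, 1, 3, 2, 1)
inf  <- matrix(rnorm(7 * 2), nrow = 7,
               dimnames = list(NULL, c("cell1", "cell2")))

# rowsum(): sums rows by group; output rows are sorted by group value
# and labelled with it.
by_rowsum <- rowsum(inf, group = unit)

# stats::aggregate(): same sums, but goes through a data frame.
agg <- stats::aggregate(inf, by = list(unit = unit), FUN = sum)
by_aggregate <- as.matrix(agg[, c("cell1", "cell2")])
rownames(by_aggregate) <- agg$unit

print(by_rowsum)
print(all.equal(by_rowsum, by_aggregate, check.attributes = FALSE))
```

Both calls return one row per unit in sorted unit order, so the result can be swapped in without reordering.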
Clustered and unbalanced-panel inference
- Clustered standard errors are now available without the bootstrap. With clustervars set and bstrap = FALSE, att_gt() and aggte() report cluster-robust standard errors computed analytically from the cluster sums of the influence function, at every aggregation level (group-time, simple, group, dynamic, calendar), and the pre-test Wald statistic is reported under clustering.
- The cluster-robust multiplier bootstrap (mboot) now follows Callaway & Sant’Anna (2021), Remark 10: it draws one multiplier per cluster and aggregates the influence function to cluster sums. Identical to before for equal-sized clusters; correct cluster-sum aggregation for unbalanced clusters and repeated cross sections.
- Clustered inference (bootstrap and analytical) is supported for panel data, unbalanced panels, and repeated cross sections, and is identical under faster_mode = TRUE and faster_mode = FALSE. For repeated cross sections without an idname, the internal observation id is used to align the cluster identifiers with the influence function (idname itself is required whenever panel = TRUE; see below).
- aggte() no longer silently ignores a clustervars request it cannot honor (the aggregation can only use the cluster information att_gt() retained). It now warns – including when an override names a different variable than att_gt() clustered on – and falls back to non-clustered standard errors, instead of silently returning the i.i.d. error or crashing in mboot().
- Fixed two faster_mode = TRUE clustered-standard-error bugs on unbalanced panels so the fast path reproduces the slow path: (1) the analytical clustered SE silently fell back to the i.i.d. error because the stored per-unit cluster vector was observation-length and no longer aligned with the influence function; and (2) in aggte(), the estimated-weight influence term was added in id-sorted order while the influence function is in first-appearance order, misattributing it and giving a wrong aggregated SE (point estimates were unaffected). Balanced panels and repeated cross sections were unaffected.
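The idea of computing a cluster-robust standard error from cluster sums of the influence function can be sketched as follows. This is a simplified illustration, not the package's implementation: se_iid() and se_cluster() are hypothetical helpers, psi and cluster are simulated, and no finite-sample correction is applied (the entries above do not state the exact scaling the package uses).

```r
# psi: influence-function values for one parameter, one entry per unit.
# cluster: cluster id of each unit. Both are hypothetical inputs.
se_iid <- function(psi) {
  n <- length(psi)
  sqrt(sum(psi^2)) / n
}

se_cluster <- function(psi, cluster) {
  n <- length(psi)
  cluster_sums <- rowsum(psi, group = cluster)  # one row per cluster
  sqrt(sum(cluster_sums^2)) / n
}

set.seed(42)
n <- 200
cluster <- rep(1:20, each = 10)
shock <- rnorm(20)[cluster]      # component shared within each cluster
psi <- shock + rnorm(n)
psi <- psi - mean(psi)           # influence functions are mean zero

print(c(iid = se_iid(psi), clustered = se_cluster(psi, cluster)))

# With every unit in its own cluster, the two formulas coincide.
se_singleton <- se_cluster(psi, seq_len(n))
```

When units within a cluster share a common component, the cluster sums do not average out, which is what the clustered formula picks up and the i.i.d. formula misses.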
Bug fixes
- Fixed the conditional parallel-trends pre-test (conditional_did_pretest()), which had silently broken under R >= 4.0 and spuriously rejected almost always whenever there was more than one pre-treatment ATT(g,t) cell. The observed Cramér-von Mises statistic was left in (n_gt x nX) orientation while its bootstrap null distribution is (nX x n_gt), scaling the observed statistic by n / n_gt and driving the p-value to ~0. The root cause was ifelse(class(J) == "matrix", ...): class() of a matrix became length-2 (c("matrix","array")) in R 4.0, so ifelse() evaluated both branches and the no-transpose branch always won. The orientation is now selected with is.matrix().
- aggte(type = "group", na.rm = TRUE) with a finite max_e no longer errors (“No valid att_gt() estimates found …”) when a group’s only non-missing ATT(g,t) lies past max_e; the group filter now applies the same max_e window. The default max_e = Inf is unchanged.
- Duplicated (idname, tname) rows (the same unit observed more than once in a period, a common long-format data-prep mistake) are now rejected with a clear error in both code paths. Previously only faster_mode = TRUE caught this; faster_mode = FALSE silently produced incorrect estimates.
- A weightsname column with negative values or a non-positive mean is now rejected with a clear, identical error in both code paths, instead of silently producing NA/NaN estimates.
- Fixed faster_mode = TRUE vs FALSE ATT disagreement when sampling weights (weightsname) vary across time: the fast path was always using first-period weights and now uses the same period’s weights as the slow path.
- Fixed influence-function aggregation for fix_weights = "varying" on balanced panels (now aggregates by unit id with rowsum() rather than assuming stacked order), and a length mismatch for fix_weights = "base_period"/"first_period" on unbalanced panels after weight-based unit dropping.
- Fixed glance.MP() returning NULL for ngroup/ntime under faster_mode = TRUE.
- Fixed an aggte() crash (“Error in get(gname): invalid first argument”) when the group column is literally named gname and dreamerr >= 1.5.0 is installed (data.table’s get() was intercepted; replaced with set()).
- aggte() no longer modifies the data stored inside the MP object by reference: under faster_mode = TRUE it previously recoded the never-treated gname value Inf to 0 in MP$DIDparams$data as a side effect. The input object is now left untouched; all results are unchanged.
- Fixed groups treated after the last observed period but within the anticipation window being coerced to never-treated (contaminating the control group with anticipation effects), and a data-filter inconsistency for always-treated units when anticipation > 0. Added an informative message clarifying that never-treated units are unaffected by anticipation.
- When internal 2x2 estimation fails for a specific (g,t) cell (e.g. a singular design), att_gt() now warns and sets that cell’s ATT to NA instead of crashing, in both faster_mode = TRUE and FALSE (#185, #190).
- pl = TRUE on Windows now warns and falls back to sequential processing instead of crashing (#176).
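The root cause of the pre-test bug can be reproduced in a few lines of base R. The sketch below mirrors the pattern named in the entry rather than the package's actual code; J, x_old, and x_new are hypothetical names.

```r
J <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix

# In R >= 4.0, class(J) is c("matrix", "array"), so this test has length 2
# and evaluates to c(TRUE, FALSE).
test <- class(J) == "matrix"

# Fragile idiom: with a mixed TRUE/FALSE test, ifelse() evaluates both
# branches. The "no" branch runs last, so its assignment is the one that sticks.
invisible(ifelse(test, x_old <- t(J), x_old <- J))

# Robust replacement: a scalar condition with if/else.
x_new <- if (is.matrix(J)) t(J) else J

print(dim(x_old))  # under R >= 4.0 this is the untransposed shape
print(dim(x_new))
```

Under R >= 4.0 the fragile idiom leaves the matrix untransposed, while is.matrix() with if/else selects the transpose regardless of how many classes the object reports.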
Validation and clearer errors
- Misspelled yname, idname, tname, gname, weightsname, or clustervars now produce a clear message listing the missing columns (#203).
- Column names reserved for internal use by did (.w, .rowid, .G, .C, post, asif_never_treated, treated_first_period) are now rejected with a clear error when used as yname, tname, idname, gname, weightsname, or clustervars, or referenced in xformla. Previously they could silently collide with the columns did creates internally; rename such columns before calling att_gt().
- control_group and base_period must now exactly match one of their documented values, in both modes. Previously faster_mode = TRUE accepted partial and case-insensitive abbreviations (e.g. control_group = "never"), and faster_mode = FALSE silently treated any unrecognized base_period value as "varying".
- An invalid est_method (an unrecognized string or an unquoted variable) now errors clearly instead of silently defaulting to "dr" (#194).
- fix_weights = "base_period"/"first_period" are blocked for repeated cross sections (panel = FALSE); fix_weights = "varying" is blocked with a custom est_method function (whose signature differs from the internal RC path). Both with clear messages.
- anticipation must now be a non-negative number in both modes. Previously only faster_mode = TRUE enforced this; the faster_mode = FALSE path silently accepted negative values (shifting the base period later than the treatment period), which was undocumented and inconsistent across modes.
- panel = TRUE (the default) without idname now errors with “Must provide idname when panel = TRUE. Set panel = FALSE for repeated cross sections.” Previously this failed with a cryptic internal data.table error (faster_mode = TRUE) or a misleading “All observations dropped while converting data to balanced panel” message (faster_mode = FALSE).
- A non-numeric outcome variable (yname) is now rejected up front with a clear message in both code paths (logical 0/1 outcomes remain allowed). Previously a character or factor outcome “ran” to completion with all-NA ATTs and misleading per-cell warnings, and a list-column outcome failed with a cryptic complete.cases() error.
- The per-(g,t) regression-feasibility check now reports the real cause when it fails: “Covariate matrix for control units is singular or numerically ill-conditioned … consider centering/rescaling covariates or removing collinear terms” instead of the misleading “Not enough control units … to run specified regression” (which fired even with thousands of control units, e.g. for a quadratic in a year-scale covariate). The check itself is unchanged (and now uses crossprod()); affected cells still return NA with a warning.
- alp must now be a single number strictly between 0 and 1 (e.g. alp = 1.5 previously inverted the confidence bands silently or errored deep inside quantile()), and biters must be a single positive whole number when bstrap = TRUE (a negative value previously crashed inside the bootstrap’s linear-algebra code with no hint about the cause).
- Cleaner failed-cell warnings under faster_mode = FALSE: each failed (g,t) cell now warns exactly once with the same text as faster_mode = TRUE (“overlap condition violated for group g in time period t”). Previously the slow path warned twice per failed overlap/rank check – the diagnostic plus a wrapper warning leaking the internal sentinel ("... : overlap. The ATT for this cell will be set to NA."). Genuine estimator errors are still surfaced by the wrapper warning. Additionally, when the Wald pre-test is unavailable, the warning now distinguishes “pre-treatment ATT(g,t) estimates exist but all have missing/zero variance” from “no pre-treatment cells exist at all” (the latter previously mis-diagnosed the former as “all groups are first treated early in the panel”).
- The documented clustervars contract – at most one cluster variable beyond idname, and it must be time-invariant within unit – is now enforced up front in both code paths, with one shared, plainly-worded error message (also used by mboot()). Previously faster_mode = FALSE with bstrap = FALSE (the analytical clustered-SE path) accepted extra cluster variables and silently clustered on the first one only; a time-varying cluster variable on that path triggered a fallback warning advising bstrap = TRUE, advice that then errored in mboot() for the very same input; and the faster_mode = TRUE error exposed internal argument names (“args$clustervars must be … a character scalar”), contradicting the documented vector interface.
- Per-cell empty-cell warnings under faster_mode = FALSE now name the period that is actually empty. Under base_period = "universal" (and for post-treatment cells under "varying"), the repeated-cross-section path warned “No units in group g in time period t” with the cell’s current period – a period where the group does have observations – instead of the empty base period; faster_mode = TRUE already reported the base period correctly.
- Aligned two pre-processing warnings across modes: the faster_mode = TRUE “no never-treated group” warning now also discloses that data from periods on/after the last cohort’s treatment date is filtered out (both modes always dropped those periods; only the slow path said so), and the faster_mode = FALSE balanced-panel coercion warning now reports the number of dropped units (“k units are missing in some periods. Converting to balanced panel by dropping them.”, same text as faster_mode = TRUE) instead of mislabeling the unit count as “observations” (an undercount of the rows actually removed).
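The "quadratic in a year-scale covariate" case mentioned above can be made concrete with base R. This is a sketch under assumed inputs: the year vector and matrices are hypothetical, and the package's exact conditioning threshold is not stated in the entry, so only the reciprocal condition numbers are compared.

```r
# Hypothetical control sample: 100 units observed in each of 5 years.
year <- rep(2003:2007, times = 100)

# Raw quadratic design: intercept, year, year^2.
X_raw <- cbind(1, year, year^2)

# Same design after centering the covariate.
yc <- year - mean(year)
X_centered <- cbind(1, yc, yc^2)

# Reciprocal condition number of the Gram matrix X'X (near 0 = ill-conditioned).
rc_raw      <- rcond(crossprod(X_raw))
rc_centered <- rcond(crossprod(X_centered))

print(c(raw = rc_raw, centered = rc_centered))
```

The raw design is numerically singular despite 500 observations, while the centered design is well conditioned, which is why the new message recommends centering or rescaling rather than blaming the number of control units.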
Documentation, namespace, and internals
- Reduced namespace pollution: replaced blanket import(stats), import(utils), and import(BMisc) with selective importFrom() calls. did no longer re-exports stats::filter/stats::lag (which previously masked dplyr::filter/dplyr::lag). R CMD check passes with 0 code-related NOTEs.
- Replaced fragile ifelse(cond, x <- a, x <- b) side-effect idioms (which relied on R’s branch-evaluation order) with if/else, and get()/:= with set() inside data.table loops, throughout; behavior is unchanged.
- Substantially expanded the test suite: glance(), ggdid(), error handling, edge cases, all aggregation types, systematic faster_mode consistency across dozens of parameter combinations, and JEL replication tests. The suite now runs with 0 warnings (previously 66+). Added a GitHub Action to auto-bump the dev version in DESCRIPTION on PR merge.
- Expanded weightsname documentation (how time-varying weights are handled for balanced panels vs. repeated cross sections / unbalanced panels); grammar and typo fixes across docs, vignettes, and error messages; corrected the mpdta data documentation.
- Replaced deprecated BMisc function names (getListElement, rhs.vars) with their snake_case equivalents (get_list_element, rhs_vars).
did 2.3.0
CRAN release: 2025-12-13
Code improvements that make the package faster and more memory efficient
Improved automated testing and regression testing
Check if data is balanced if panel = TRUE and allow_unbalanced_panel = TRUE. If it is, disable allow_unbalanced_panel and proceed with panel data setup. This is different from the previous behavior, which would always proceed as if panel = FALSE.
Significantly reduced the number of recursive package dependencies, enabling faster installation times and a smaller build footprint.
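One simple way to test whether long-format data form a balanced panel is a row-count comparison. The helper below is a hypothetical sketch, not a function exported by did; it assumes there are no duplicated (id, period) rows and checks that assumption first.

```r
# Hypothetical helper: TRUE if every unit is observed in every period.
is_balanced_panel <- function(data, idname, tname) {
  keys <- data[, c(idname, tname)]
  # The row-count test is only valid without duplicated (id, period) rows.
  stopifnot(!anyDuplicated(keys))
  n_units   <- length(unique(data[[idname]]))
  n_periods <- length(unique(data[[tname]]))
  nrow(data) == n_units * n_periods
}

# Toy data: 3 units x 2 periods, then the same data with one row removed.
balanced   <- expand.grid(id = 1:3, year = 2001:2002)
unbalanced <- balanced[-1, ]

print(is_balanced_panel(balanced, "id", "year"))
print(is_balanced_panel(unbalanced, "id", "year"))
```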
did 2.1.2
CRAN release: 2022-07-20
Added wrapper function for HonestDiD package
Fix bug for setups where gname is not contained in tname (but is in the tname range)
Fix bug for including too many groups with universal base period in pre-treatment periods
Bug fix for anticipation using notyettreated comparison group
did 2.1.1
CRAN release: 2022-01-27
Bug fixes related to unbalanced panel and clustered standard errors
Bug fixes for conditional_did_pretest
Even faster bootstrap code (thanks Kyle Butts)
Updated version requirement for BMisc package
Bug fix for unbalanced panel and repeated cross sections in pre-treatment periods using universal base period
did 2.1.0
CRAN release: 2021-12-10
Code is substantially faster/more memory efficient
Support for universal base period
Major improvements to unit testing
Completely removed mp.spatt and mp.spatt.test functions (which were the original names for att_gt)
Simulation/testing code now exported
Removed some slow running checks
Multiplier bootstrap code is now written in C++
Improvements to error handling, added some additional warning messages, removed some unnecessary warning messages
Bug fixes for NA standard errors that occur with very small groups
did 2.0.1
Improved plots
Maximum event time for event studies
Compute critical value for simultaneous confidence bands even when some standard error is zero (set these to NA)
Improved code for unbalanced panel data: faster and more memory efficient
Correct estimates of P(G=g|Eventually treated) with unbalanced panel data. This affects aggte objects with unbalanced panel data
Bug fixes for summary aggte objects
Allow clustering for unbalanced panel data
Fixed error in calendar-type aggregation within aggte function (point estimates were not being weighted by group-size; now they are).
Additional error handling
did 2.0.0
CRAN release: 2020-12-11
Big improvement on code base / functionality / testing
Deprecated mp.spatt function and replaced it with att_gt function
Calling att_gt is similar to calling mp.spatt; instead of a formula for the outcome of the form y~treat, now just pass the name of the outcome variable
Deprecated mp.spatt.test function and replaced it with conditional_did_pretest function
New est_method parameter. Can call any function for 2x2 DID in the DRDID package (default is now doubly robust estimation, but inverse probability weights and regression estimators are also supported) as well as provide custom 2x2 DID estimators
Bug fixes for including groups that are already treated in the first period
Allow for user to select control group – either never treated or not yet treated
Add functionality for uniform confidence bands for all aggregated treatment effect parameters
Introduced dynamic effects in pre-treatment periods. These allow users to report event study plots that include pre-treatment periods, which are common in applied work. The event study plots in the did package are robust to selective treatment timing (unlike standard regression event study plots)
Support for using repeated cross sections data instead of panel data is much improved
Support for using sampling weights is much improved
Big improvement to website, vignettes, and code documentation
Code for dealing with unbalanced panels
Allow for event studies to be computed over subsets of event times
Allow for treatment anticipation via anticipation argument
did 1.2.2
CRAN release: 2019-06-21
Improved ways to summarize aggregated treatment effect parameters
Fixed bug related to needing new version of BMisc
Fixed bug related to plotting with no pre-treatment periods
Improved ways to easily plot aggregated treatment effect parameters
did 1.2.1
CRAN release: 2019-06-14
Added some error handling for some cases with small group sizes, and fixed some cryptic error messages
Fixes handling for data being in format besides data.frame (e.g. tibble)
Add warnings about small group sizes which are a very common source of trouble
did 1.2.0
CRAN release: 2018-10-16
- Updates for handling repeated cross sections data, both estimation and inference
did 1.1.2
CRAN release: 2018-09-11
- Bug fixes for testing without covariates; NULL can now be passed in addition to ~1
