DiD with Multiple Time Periods#
The did module implements the difference-in-differences (DiD) methodology for settings with multiple time periods and
variation in treatment timing from the work of Callaway and Sant’Anna (2020).
This approach addresses the challenges of staggered DiD designs by providing flexible estimators for group-time average treatment effects and
various aggregation schemes to summarize treatment effect heterogeneity.
Setup and Notation#
We consider a setup with \(\mathcal{T}\) time periods. Let \(D_{it}\) be a binary variable indicating if unit \(i\) is treated in period \(t\). The treatment adoption process follows two key assumptions:
No treatment in the first period. \(D_{i1} = 0\) for all units.
Irreversibility of Treatment (Staggered Adoption). Once a unit is treated, it remains treated. Formally, \(D_{i,t-1} = 1\) implies \(D_{it} = 1\) for \(t = 2, \dots, \mathcal{T}\).
Let \(G_i\) be the time period when unit \(i\) is first treated. If a unit is never treated, we set \(G_i = \infty\). Units are thus partitioned into groups based on their treatment adoption time. Let \(G_{ig} = 1\{G_i = g\}\) indicate that unit \(i\) is first treated in period \(g\), and let \(C_i\) be an indicator for units that are never treated (\(G_i = \infty\)). The unit index is dropped (\(G_g\), \(C\), \(Y_t\)) when no confusion arises.
We use the potential outcomes framework and let \(Y_{it}(g)\) be the potential outcome for unit \(i\) at time \(t\) if it were first treated in period \(g\). The potential outcome under no treatment is \(Y_{it}(0)\). The observed outcome is a combination of these potential outcomes, determined by the group to which unit \(i\) belongs:
\[ Y_{it} = Y_{it}(0) + \sum_{g=2}^{\mathcal{T}} \big( Y_{it}(g) - Y_{it}(0) \big) \cdot 1\{G_i = g\}. \]
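As a small illustration of this notation, the sketch below recovers \(G_i\) and \(C_i\) from a matrix of treatment indicators \(D_{it}\) and checks the two adoption assumptions along the way. The function name `first_treatment_period` and the toy matrix are hypothetical and are not part of the did module.

```python
import numpy as np

# Hypothetical treatment paths: rows are units, columns are periods 1..4.
D = np.array([
    [0, 0, 1, 1],  # first treated in period 3
    [0, 1, 1, 1],  # first treated in period 2
    [0, 0, 0, 0],  # never treated
    [0, 0, 0, 1],  # first treated in period 4
])


def first_treatment_period(D):
    """Return G_i for each unit, with np.inf for never-treated units."""
    D = np.asarray(D)
    if np.any(D[:, 0] != 0):
        raise ValueError("some units are treated in the first period")
    if np.any(np.diff(D, axis=1) < 0):
        raise ValueError("treatment is reversed for some units")
    ever_treated = D.any(axis=1)
    first = D.argmax(axis=1) + 1  # periods are 1-indexed
    return np.where(ever_treated, first.astype(float), np.inf)


G = first_treatment_period(D)  # adoption period per unit
C = np.isinf(G).astype(int)    # never-treated indicator
```

Encoding never-treated units as `np.inf` mirrors the convention \(G_i = \infty\) used in the text.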
The Group-Time Average Treatment Effect#
A key insight of the Callaway and Sant’Anna approach is that rather than estimating a single aggregate treatment effect, we can instead target more disaggregated parameters that capture treatment effect heterogeneity. In staggered adoption designs, effects may differ both across groups (early versus late adopters may respond differently to treatment) and over time (effects may grow, fade, or evolve as units accumulate exposure). The group-time framework provides the flexibility to capture all of these patterns.
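The fundamental parameter of interest is the group-time average treatment effect, \(ATT(g, t)\), which is the average treatment effect for group \(g\) at time \(t\) given by
\[ ATT(g, t) = \mathbb{E}\big[ Y_t(g) - Y_t(0) \mid G_g = 1 \big], \]
where \(G_g = 1\) indicates units first treated in period \(g\).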
This parameter is flexible and does not impose homogeneity across groups or time. The set of all \(ATT(g, t)\)’s can be used to understand treatment effect dynamics and heterogeneity.
Identifying Assumptions#
As with all causal inference methods, identifying treatment effects from observational data requires assumptions. The DiD framework replaces the strong assumption of unconfoundedness (that treatment assignment is as good as random conditional on observables) with assumptions about how outcomes would have evolved in the absence of treatment. The assumptions below formalize conditions under which comparison groups can be used to construct valid counterfactuals for what would have happened to treated units had they not been treated.
The identification of \(ATT(g, t)\) relies on the following key assumptions.
Assumption 1 (Limited Treatment Anticipation)
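Potential outcomes are not affected by the treatment in periods far enough before it is implemented. For a known anticipation horizon \(\delta \ge 0\),
\[ \mathbb{E}[Y_t(g) \mid X, G_g = 1] = \mathbb{E}[Y_t(0) \mid X, G_g = 1] \quad \text{a.s. for all groups } g \text{ and periods } t < g - \delta. \]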
When \(\delta = 0\), this is a “no anticipation” assumption.
Assumption 2 (Conditional Parallel Trends)
The average evolution of untreated potential outcomes is the same for a treatment group and a comparison group, conditional on a set of pre-treatment covariates \(X\). Two alternative formulations are available.
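Based on a “Never-Treated” Group. For each group \(g\) and for periods \(t \ge g - \delta\),
\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, G_g = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, C = 1] \quad \text{a.s.} \]
Based on “Not-Yet-Treated” Groups. For each group \(g\) and for periods \(t \ge g - \delta\),
\[ \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, G_g = 1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid X, D_s = 0, G_g = 0] \quad \text{a.s.,} \]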
where \(s\) is a time period such that \(t + \delta \le s\).
Assumption 3 (Overlap)
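A positive fraction of the population starts treatment in each period \(g\), and at every value of the covariates there is a positive probability of belonging to the comparison group. Formally, for some \(\varepsilon > 0\), \(P(G_g = 1) > \varepsilon\), where \(G_g\) indicates units first treated in period \(g\), and the generalized propensity score
\[ p_{g,t}(X) = P\big( G_g = 1 \mid X, \; G_g + (1 - D_t)(1 - G_g) = 1 \big) \]
is bounded away from 1, that is, \(p_{g,t}(X) < 1 - \varepsilon\) almost surely.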
Nonparametric Identification of ATT(g,t)#
Under the assumptions above, \(ATT(g, t)\) is non-parametrically identified. The paper provides three types of estimands that can be used to identify these effects, each with different strengths and properties. For what follows, let \(\Delta Y_{t,g,\delta} = Y_t - Y_{g-\delta-1}\) denote the change in outcomes from the pre-treatment base period to the current period.
Never-Treated Comparison Group Estimands#
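When using never-treated units as the comparison group, we first define the following key quantities. The propensity score for being in group \(g\) conditional on being either in group \(g\) or never-treated is
\[ p_g(X) = P\big( G_g = 1 \mid X, \; G_g + C = 1 \big), \]
and the expected outcome change for never-treated units is
\[ m_{g,t,\delta}^{nev}(X) = \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid X, C = 1 \big]. \]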
Inverse Probability Weighting (IPW) Estimand
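The IPW estimand reweights observations to balance the covariate distributions between the treatment and comparison groups. With \(p_g(X)\) denoting the propensity score above, it is given by
\[ ATT_{ipw}^{nev}(g, t; \delta) = \mathbb{E}\left[ \left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)\, C}{1 - p_g(X)}}{\mathbb{E}\left[ \frac{p_g(X)\, C}{1 - p_g(X)} \right]} \right) \big( Y_t - Y_{g-\delta-1} \big) \right]. \]
Estimators based on this estimand are consistent when the propensity score model is correctly specified.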
Outcome Regression (OR) Estimand
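The OR estimand uses regression adjustment to control for differences in covariates. With \(m_{g,t,\delta}^{nev}(X)\) denoting the expected outcome change for never-treated units, it is given by
\[ ATT_{or}^{nev}(g, t; \delta) = \mathbb{E}\left[ \frac{G_g}{\mathbb{E}[G_g]} \big( Y_t - Y_{g-\delta-1} - m_{g,t,\delta}^{nev}(X) \big) \right]. \]
Estimators based on this estimand are consistent when the outcome regression model is correctly specified.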
Doubly Robust (DR) Estimand
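The DR estimand combines the IPW and OR approaches, so that the resulting estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified, without requiring both. The DR estimand is given by
\[ ATT_{dr}^{nev}(g, t; \delta) = \mathbb{E}\left[ \left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)\, C}{1 - p_g(X)}}{\mathbb{E}\left[ \frac{p_g(X)\, C}{1 - p_g(X)} \right]} \right) \big( Y_t - Y_{g-\delta-1} - m_{g,t,\delta}^{nev}(X) \big) \right]. \]
Because it remains valid when one of the two working models is misspecified, this estimand is more robust than either the IPW or the OR estimand alone, and estimators based on it can also have improved efficiency properties.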
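To make the three never-treated estimands concrete, here is a minimal NumPy sketch for a single \((g, t)\) pair. It assumes the sample has already been restricted to group-\(g\) and never-treated units, that `dy` holds the long difference \(Y_t - Y_{g-\delta-1}\), and that a logit propensity score and a linear outcome regression are adequate working models. The helper names `fit_logit` and `att_estimates` and the simulated data are hypothetical; this is not the did module's implementation.

```python
import numpy as np


def fit_logit(Z, d, n_iter=25):
    """Logistic regression by Newton-Raphson; Z must include a constant."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        hessian = (Z * (p * (1.0 - p))[:, None]).T @ Z
        beta += np.linalg.solve(hessian, Z.T @ (d - p))
    return beta


def att_estimates(dy, treated, X):
    """IPW, OR and DR estimates from group-g units (treated == 1)
    and never-treated units (treated == 0)."""
    Z = np.column_stack([np.ones(len(dy)), X])
    C = 1.0 - treated
    # Working model 1: propensity score p_g(X), fitted by logit.
    p = 1.0 / (1.0 + np.exp(-Z @ fit_logit(Z, treated)))
    # Working model 2: outcome change regression, fitted on never-treated units.
    coef, *_ = np.linalg.lstsq(Z[C == 1], dy[C == 1], rcond=None)
    m = Z @ coef
    w_treat = treated / treated.mean()
    w_comp = p * C / (1.0 - p)
    w_comp = w_comp / w_comp.mean()  # normalized comparison weights
    return {
        "ipw": np.mean((w_treat - w_comp) * dy),
        "or": np.mean(w_treat * (dy - m)),
        "dr": np.mean((w_treat - w_comp) * (dy - m)),
    }


# Simulated example: trends depend on x, so unconditional parallel trends fails.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
treated = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-0.5 * x))).astype(float)
dy = 0.5 + x + 1.0 * treated + rng.normal(size=n)  # true ATT is 1.0

estimates = att_estimates(dy, treated, x)
naive = dy[treated == 1].mean() - dy[treated == 0].mean()
```

In this simulation both working models are correctly specified, so all three estimates should land near the true effect of 1.0, while the unadjusted difference in mean outcome changes (`naive`) is biased upward because treated units have larger values of `x`.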
Not-Yet-Treated Comparison Group Estimands#
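When using not-yet-treated units as the comparison group, we work with different propensity score and outcome regression functions given by
\[ p_{g,t+\delta}(X) = P\big( G_g = 1 \mid X, \; G_g + (1 - D_{t+\delta})(1 - G_g) = 1 \big) \]
and
\[ m_{g,t,\delta}^{ny}(X) = \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid X, D_{t+\delta} = 0, G_g = 0 \big]. \]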
Inverse Probability Weighting (IPW) Estimand
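The IPW estimand for the not-yet-treated comparison adapts the weighting scheme to use units that have not been treated by time \(t + \delta\). With \(p_{g,t+\delta}(X)\) denoting the corresponding propensity score, it is given by
\[ ATT_{ipw}^{ny}(g, t; \delta) = \mathbb{E}\left[ \left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_{g,t+\delta}(X)\,(1 - D_{t+\delta})(1 - G_g)}{1 - p_{g,t+\delta}(X)}}{\mathbb{E}\left[ \frac{p_{g,t+\delta}(X)\,(1 - D_{t+\delta})(1 - G_g)}{1 - p_{g,t+\delta}(X)} \right]} \right) \big( Y_t - Y_{g-\delta-1} \big) \right]. \]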
Outcome Regression (OR) Estimand
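The OR estimand adjusts for covariate differences using the expected outcome change of not-yet-treated units, \(m_{g,t,\delta}^{ny}(X)\), and is given by
\[ ATT_{or}^{ny}(g, t; \delta) = \mathbb{E}\left[ \frac{G_g}{\mathbb{E}[G_g]} \big( Y_t - Y_{g-\delta-1} - m_{g,t,\delta}^{ny}(X) \big) \right]. \]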
Doubly Robust (DR) Estimand
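The DR estimand for not-yet-treated comparisons combines both approaches and is given by
\[ ATT_{dr}^{ny}(g, t; \delta) = \mathbb{E}\left[ \left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_{g,t+\delta}(X)\,(1 - D_{t+\delta})(1 - G_g)}{1 - p_{g,t+\delta}(X)}}{\mathbb{E}\left[ \frac{p_{g,t+\delta}(X)\,(1 - D_{t+\delta})(1 - G_g)}{1 - p_{g,t+\delta}(X)} \right]} \right) \big( Y_t - Y_{g-\delta-1} - m_{g,t,\delta}^{ny}(X) \big) \right]. \]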
The choice between never-treated and not-yet-treated comparison groups depends on the specific empirical context. Never-treated comparisons may be more stable but require the existence of a sufficiently large never-treated group. Not-yet-treated comparisons can utilize more data but may be less appropriate when treatment timing is endogenous.
Unconditional Estimands#
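When pre-treatment covariates play no role in identification (i.e., the parallel trends assumption holds unconditionally on \(X\)), the estimands simplify considerably. For the never-treated comparison group, the unconditional estimand is
\[ ATT_{unc}^{nev}(g, t; \delta) = \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid G_g = 1 \big] - \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid C = 1 \big]. \]
For the not-yet-treated comparison group, the unconditional estimand is
\[ ATT_{unc}^{ny}(g, t; \delta) = \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid G_g = 1 \big] - \mathbb{E}\big[ Y_t - Y_{g-\delta-1} \mid D_{t+\delta} = 0, G_g = 0 \big]. \]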
These expressions clearly resemble the canonical two-period, two-group DiD estimand. The average effect for group \(g\) is identified by comparing the outcome path experienced by that group to the path experienced by the comparison group. Under parallel trends, this latter path represents the counterfactual outcome path that group \(g\) would have experienced without treatment.
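The comparison of outcome paths described above can be sketched in a few lines of NumPy. The function `att_unconditional` and the noise-free toy panel are hypothetical illustrations, not the did module's API; the panel is built so that untreated outcomes follow a common trend and the effect for a treated unit equals its number of treated periods.

```python
import numpy as np


def att_unconditional(Y, G, g, t, delta=0, control="never"):
    """Unconditional DiD estimate of ATT(g, t).

    Y: (n_units, n_periods) outcomes, column j holds period j + 1.
    G: first-treatment period per unit, np.inf for never-treated units.
    """
    base = g - delta - 1                 # last period before anticipation
    dy = Y[:, t - 1] - Y[:, base - 1]    # long difference Y_t - Y_{g-delta-1}
    treated = G == g
    if control == "never":
        comparison = np.isinf(G)
    else:
        # Not yet treated by period t + delta (includes never-treated units).
        comparison = (G > t + delta) & (G != g)
    return dy[treated].mean() - dy[comparison].mean()


# Toy panel: unit fixed effect + common trend + effect growing with exposure.
alpha = np.array([1.0, 2.0, 0.5, 1.5, 3.0, 0.0])
G = np.array([2, 2, 3, 3, np.inf, np.inf])
periods = np.arange(1, 5)
exposure = np.clip(periods[None, :] - G[:, None] + 1, 0, None)
Y = alpha[:, None] + 0.5 * periods[None, :] + exposure

att_23_never = att_unconditional(Y, G, g=2, t=3)                    # two treated periods
att_22_notyet = att_unconditional(Y, G, g=2, t=2, control="notyet") # one treated period
```

Because the toy data contain no noise and satisfy parallel trends exactly, each estimate equals the true effect for that group and period.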
Doubly Robust Estimation#
Although the IPW, OR, and DR estimands are equivalent from an identification standpoint, the DR approach has important advantages for estimation and inference. The DR estimators are consistent if either the propensity score model or the outcome regression model is correctly specified, but not necessarily both. This double robustness provides important protection against model misspecification.
Additionally, DR estimators allow for the use of flexible estimation methods, including those involving regularization and model selection, making them particularly attractive when the number of covariates is moderate or large relative to the sample size.
Aggregation of Effects#
While the group-time average treatment effects \(ATT(g, t)\) provide a complete characterization of treatment effect heterogeneity, the number of such parameters can be large in applications with many groups and time periods. Researchers often want to summarize these effects to answer specific policy questions. For example, do effects grow or fade over time? Do early adopters experience different effects than late adopters? What is the overall average effect of the policy?
A key feature of this methodology is the ability to aggregate the \(ATT(g, t)\)’s into meaningful summary measures. This allows researchers to answer specific policy questions and understand different dimensions of treatment effect heterogeneity.
Event-Study Aggregation#
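Event-study plots aggregate effects by length of exposure to treatment, where \(e = t - g\) represents the time elapsed since treatment adoption. This aggregation reveals how treatment effects evolve dynamically after implementation. With \(\mathcal{G}\) denoting the set of treatment-adoption periods, the event-study parameter is
\[ \theta_{es}(e) = \sum_{g \in \mathcal{G}} 1\{g + e \le \mathcal{T}\} \, P(G = g \mid G + e \le \mathcal{T}) \, ATT(g, g + e). \]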
This parameter weights the group-time effects by the relative size of each group among those observed \(e\) periods after treatment, providing insights into whether effects strengthen, weaken, or remain stable over time.
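The weighting just described can be sketched as follows, assuming the group-time effects are already available. The dictionary `att`, the group counts in `group_sizes`, and the function `event_study` are hypothetical toy inputs; group shares are estimated by relative sample sizes.

```python
# Hypothetical ATT(g, t) values with 4 periods and groups adopting in 2, 3, 4.
att = {
    (2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0,
    (3, 3): 1.0, (3, 4): 1.5,
    (4, 4): 0.5,
}
group_sizes = {2: 50, 3: 30, 4: 20}  # number of units per adoption group
n_periods = 4


def event_study(att, group_sizes, e, n_periods):
    """Average ATT(g, g + e) over groups observed e periods after adoption,
    weighting each group by its share among those groups."""
    groups = [g for g in group_sizes if g + e <= n_periods]
    if not groups:
        raise ValueError("no group is observed at this event time")
    total = sum(group_sizes[g] for g in groups)
    return sum(group_sizes[g] / total * att[(g, g + e)] for g in groups)


theta_es = {e: event_study(att, group_sizes, e, n_periods) for e in range(3)}
```

At event time 2 only the earliest group contributes, which is the compositional issue that motivates a balanced version of this parameter.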
Compositional Changes and Balanced Event-Study
When comparing \(\theta_{es}(e)\) across different values of \(e\), one must be aware that compositional changes can complicate interpretation. For example, the difference between \(\theta_{es}(e_2)\) and \(\theta_{es}(e_1)\) reflects not only the dynamic effect of treatment but also two additional terms that arise because different sets of groups contribute at different event times.
To address this, the paper proposes a “balanced” event-study parameter that uses a fixed set of groups across all event times:
\[ \theta_{es}^{bal}(e; e') = \sum_{g \in \mathcal{G}} 1\{g + e' \le \mathcal{T}\} \, P(G = g \mid G + e' \le \mathcal{T}) \, ATT(g, g + e), \]
where \(\mathcal{G}\) is the set of treatment-adoption periods.
This calculates the average treatment effect for units whose event time is \(e\) among those observed for at least \(e'\) periods. Since the composition of groups is the same across all values of \(e \le e'\), differences in \(\theta_{es}^{bal}(e; e')\) across event times cannot be attributed to compositional changes. The trade-off is that fewer groups are used, potentially leading to less precise inference.
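A minimal sketch of the balanced version, under the same assumptions as any toy aggregation: the group-time effects and group sizes below are hypothetical, and `balanced_event_study` is an illustrative name rather than a function of the did module. The only change relative to the unbalanced aggregation is that the set of groups is fixed by `e_max` (the \(e'\) of the text) instead of by `e`.

```python
# Hypothetical ATT(g, t) values with 4 periods and groups adopting in 2, 3, 4.
att = {
    (2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0,
    (3, 3): 1.0, (3, 4): 1.5,
    (4, 4): 0.5,
}
group_sizes = {2: 50, 3: 30, 4: 20}
n_periods = 4


def balanced_event_study(att, group_sizes, e, e_max, n_periods):
    """Average ATT(g, g + e) over the fixed set of groups that are
    observed for at least e_max periods after adoption."""
    if not 0 <= e <= e_max:
        raise ValueError("e must lie between 0 and e_max")
    groups = [g for g in group_sizes if g + e_max <= n_periods]
    if not groups:
        raise ValueError("no group is observed e_max periods after adoption")
    total = sum(group_sizes[g] for g in groups)
    return sum(group_sizes[g] / total * att[(g, g + e)] for g in groups)


# With e_max = 1 the group adopting in period 4 is dropped at every event time.
theta_bal = {e: balanced_event_study(att, group_sizes, e, 1, n_periods)
             for e in (0, 1)}
```

Both event times now average over the same two groups, so the change from `theta_bal[0]` to `theta_bal[1]` reflects dynamics only.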
Group-Specific Effects#
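To understand whether treatment timing matters, we can average effects over time for each group. This allows us to understand whether early adopters experience different treatment effects compared to late adopters. For a specific group \(\tilde{g}\), the average effect over its post-treatment periods is
\[ \theta_{sel}(\tilde{g}) = \frac{1}{\mathcal{T} - \tilde{g} + 1} \sum_{t = \tilde{g}}^{\mathcal{T}} ATT(\tilde{g}, t). \]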
This measure helps identify whether there are advantages or disadvantages to adopting treatment earlier versus later in the sample period.
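As a sketch, the group-specific average is a plain mean of a group's post-treatment effects. The values in `att` and the function name `group_effect` are hypothetical.

```python
# Hypothetical ATT(g, t) values with 4 periods and groups adopting in 2, 3, 4.
att = {
    (2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0,
    (3, 3): 1.0, (3, 4): 1.5,
    (4, 4): 0.5,
}
n_periods = 4


def group_effect(att, g, n_periods):
    """Unweighted average of ATT(g, t) over post-treatment periods t >= g."""
    effects = [att[(g, t)] for t in range(g, n_periods + 1)]
    return sum(effects) / len(effects)


theta_sel = {g: group_effect(att, g, n_periods) for g in (2, 3, 4)}
```

Note that later adopters are averaged over fewer periods, so differences across groups mix timing with length of exposure.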
Calendar-Time Effects#
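Calendar-time aggregation averages effects across all treated groups for each time period, revealing how treatment effects vary with time-specific factors such as macroeconomic conditions or concurrent policy changes. For a specific time period \(\tilde{t}\), with \(\mathcal{G}\) the set of treatment-adoption periods, the calendar-time effect is
\[ \theta_{c}(\tilde{t}) = \sum_{g \in \mathcal{G}} 1\{\tilde{t} \ge g\} \, P(G = g \mid G \le \tilde{t}) \, ATT(g, \tilde{t}). \]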
This aggregation weights each group’s contribution by its relative size among all groups treated by time \(\tilde{t}\).
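A short sketch of this weighting, with hypothetical inputs (`att`, `group_sizes`) and an illustrative function name `calendar_effect`; group shares are again estimated by relative sample sizes.

```python
# Hypothetical ATT(g, t) values with 4 periods and groups adopting in 2, 3, 4.
att = {
    (2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0,
    (3, 3): 1.0, (3, 4): 1.5,
    (4, 4): 0.5,
}
group_sizes = {2: 50, 3: 30, 4: 20}


def calendar_effect(att, group_sizes, t):
    """Average ATT(g, t) over groups already treated in period t,
    weighting each group by its share among those groups."""
    groups = [g for g in group_sizes if g <= t]
    if not groups:
        raise ValueError("no group is treated by this period")
    total = sum(group_sizes[g] for g in groups)
    return sum(group_sizes[g] / total * att[(g, t)] for g in groups)


theta_c = {t: calendar_effect(att, group_sizes, t) for t in (2, 3, 4)}
```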
Overall Average Treatment Effect#
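When a single summary measure is needed, we can compute an overall average that aggregates across all groups and post-treatment time periods. One such measure weights the group-specific average effects \(\theta_{sel}(g)\) by the distribution of treatment timing among ever-treated units:
\[ \theta_{sel}^{O} = \sum_{g \in \mathcal{G}} \theta_{sel}(g) \, P(G = g \mid G \le \mathcal{T}), \]
where \(\mathcal{G}\) is the set of treatment-adoption periods.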
This provides a single number summarizing the average treatment effect across the entire treated population, properly accounting for the staggered adoption pattern.
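The two-step structure of this summary (average within each group over time, then average across groups by group share) can be sketched as below. All inputs and names are hypothetical toy values, not output of the did module.

```python
# Hypothetical ATT(g, t) values with 4 periods and groups adopting in 2, 3, 4.
att = {
    (2, 2): 1.0, (2, 3): 2.0, (2, 4): 3.0,
    (3, 3): 1.0, (3, 4): 1.5,
    (4, 4): 0.5,
}
group_sizes = {2: 50, 3: 30, 4: 20}  # ever-treated units only
n_periods = 4

# Step 1: average each group's effects over its post-treatment periods.
group_avg = {}
for g in group_sizes:
    effects = [att[(g, t)] for t in range(g, n_periods + 1)]
    group_avg[g] = sum(effects) / len(effects)

# Step 2: weight the group averages by each group's share of treated units.
n_treated = sum(group_sizes.values())
overall_att = sum(group_sizes[g] / n_treated * group_avg[g] for g in group_sizes)
```

Averaging within groups first keeps early adopters, who contribute more post-treatment periods, from dominating the summary.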
These aggregations provide transparent and interpretable ways to summarize treatment effect heterogeneity, with researcher-specified non-negative weights that directly reflect the policy questions of interest.
Inference and Pre-Treatment Testing#
Asymptotic Properties#
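The DR estimators for \(ATT(g, t)\) are asymptotically normal. Under the doubly robust condition (either the propensity score or outcome regression model correctly specified), the estimators admit an influence function representation
\[ \sqrt{n} \, \big( \widehat{ATT}(g, t) - ATT(g, t) \big) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \psi_{g,t}(W_i) + o_p(1), \]
where \(n\) is the number of units, \(W_i\) collects the observed data for unit \(i\), and \(\psi_{g,t}\) is the influence function. This representation enables straightforward computation of standard errors and forms the basis for bootstrap inference procedures.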
Simultaneous Confidence Bands#
The paper proposes a multiplier bootstrap procedure for constructing simultaneous confidence bands that cover all \(ATT(g, t)\) with probability \(1 - \alpha\). Unlike pointwise confidence intervals, simultaneous bands account for the dependency across different group-time average treatment effect estimators and avoid multiple testing problems. This is particularly important when visualizing the overall estimation uncertainty across all group-time effects.
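A rough sketch of how a multiplier bootstrap can turn estimated influence functions into simultaneous bands is given below. It assumes an `(n, k)` matrix `psi` of influence-function values for `k` parameters, uses Rademacher multipliers, and scales each parameter by an interquartile-range estimate of its standard deviation; these are illustrative choices, and the function `simultaneous_bands` is hypothetical rather than the did module's routine. The demo uses sample means, whose influence function is simply the centered observation.

```python
from statistics import NormalDist

import numpy as np


def simultaneous_bands(estimates, psi, alpha=0.05, n_boot=2000, seed=0):
    """Simultaneous (1 - alpha) bands from an (n, k) influence-function matrix."""
    rng = np.random.default_rng(seed)
    n, _ = psi.shape
    # Rademacher multipliers: mean zero, unit variance, independent of the data.
    V = rng.choice([-1.0, 1.0], size=(n_boot, n))
    draws = V @ psi / np.sqrt(n)  # bootstrap draws of sqrt(n) * (est* - est)
    # Robust scale per parameter from the interquartile range of the draws.
    q75, q25 = np.percentile(draws, [75, 25], axis=0)
    z = NormalDist()
    scale = (q75 - q25) / (z.inv_cdf(0.75) - z.inv_cdf(0.25))
    # Critical value: (1 - alpha) quantile of the maximum absolute t-statistic.
    t_max = np.max(np.abs(draws) / scale, axis=1)
    crit = np.quantile(t_max, 1 - alpha)
    half_width = crit * scale / np.sqrt(n)
    return estimates - half_width, estimates + half_width, crit


# Demo with 5 independent sample means.
data = np.random.default_rng(1).normal(size=(500, 5))
estimates = data.mean(axis=0)
psi = data - estimates
lower, upper, crit = simultaneous_bands(estimates, psi)
```

The resulting critical value exceeds the pointwise normal value of about 1.96, which is what makes the bands wide enough to cover all parameters jointly.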
Pre-Treatment Placebo Tests#
Although the limited anticipation assumption implies \(ATT(g, t) = 0\) for all \(t < g - \delta\), it is common practice to estimate these pre-treatment parameters and use them to assess the credibility of the parallel trends assumption. If the estimated pre-treatment effects are significantly different from zero, this provides evidence against the identifying assumptions. The DiD estimands can be adjusted to include pre-treatment periods by replacing the “long differences” \((Y_t - Y_{g-\delta-1})\) with “short differences” \((Y_t - Y_{t-1})\) for \(t < g - \delta\).
Note
For the full theoretical details, including efficiency bounds, asymptotic properties, and the multiplier bootstrap algorithm, please refer to the original paper by Callaway and Sant’Anna (2020).