Introduction to Difference-in-Differences

Difference-in-differences (DiD) is one of the most widely used methods for estimating causal effects from observational data. It has become a cornerstone of empirical research across economics, public health, political science, sociology, and many other fields where randomized experiments are impractical or impossible. The method’s appeal lies in its intuitive logic and relatively mild assumptions compared to alternatives.

The core idea is to compare how outcomes change over time for units that receive a treatment to how outcomes change for units that do not. If both groups would have evolved similarly absent treatment, then any divergence after treatment can be attributed to the treatment itself.

We start with the canonical two-period case to build intuition, then extend to multiple periods and staggered treatment timing.

Notation and Potential Outcomes

The potential outcomes framework provides precise definitions of causal effects. For unit \(i\) in period \(s\), we define

  • \(Y_{is}(0)\) — the untreated potential outcome, what unit \(i\) would experience in period \(s\) without treatment

  • \(Y_{is}(1)\) — the treated potential outcome, what unit \(i\) would experience in period \(s\) with treatment

  • \(D_i\) — a group membership indicator equal to 1 for treated units and 0 for untreated units

In the canonical two-period setup with periods \(t-1\) and \(t\), no one is treated in the first period, and units in the treated group become treated in the second period. This means observed outcomes are given by

\[Y_{i,t-1} = Y_{i,t-1}(0) \quad \text{and} \quad Y_{it} = D_i Y_{it}(1) + (1-D_i) Y_{it}(0).\]

In the first period, we observe untreated potential outcomes for everyone since no treatment has occurred yet. There is an implicit no-anticipation assumption here, meaning units do not change their behavior in anticipation of future treatment. In the second period, we observe treated potential outcomes for units that actually participate in the treatment and untreated potential outcomes for units that do not participate.

The Average Treatment Effect on the Treated

The primary parameter of interest in DiD designs is the Average Treatment Effect on the Treated, or ATT. It is defined as

\[ATT = \mathbb{E}[Y_t(1) - Y_t(0) \mid D=1].\]

This quantity represents the average difference between treated and untreated potential outcomes for units in the treated group. It answers a specific causal question, namely what was the average effect of the treatment on those who actually received it.

The fundamental challenge of causal inference is that we never observe \(Y_t(0)\) for treated units. We see what happened to them after treatment, but we cannot observe what would have happened to them in the absence of treatment. This unobserved quantity is called the counterfactual. DiD attempts to solve this problem by using the untreated group to construct an estimate of this counterfactual.
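A short derivation shows how the untreated group supplies the missing counterfactual. It assumes the standard parallel trends condition, stated here for reference in the same notation as the ATT definition above: absent treatment, average outcomes in both groups would have changed by the same amount,

\[\mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid D=1] = \mathbb{E}[Y_t(0) - Y_{t-1}(0) \mid D=0].\]

Adding and subtracting \(\mathbb{E}[Y_{t-1}(0) \mid D=1]\) in the ATT definition, and using no anticipation so that \(Y_{t-1} = Y_{t-1}(0)\) for every unit, gives

\[ATT = \mathbb{E}[Y_t - Y_{t-1} \mid D=1] - \mathbb{E}[Y_t - Y_{t-1} \mid D=0].\]

Every term on the right-hand side is observed: the first is the average outcome change in the treated group, and the second is the average change in the untreated group, which stands in for the change the treated group would have experienced without treatment.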

Two-Way Fixed Effects and Its Limitations

The two-period setup provides clean intuition, but most empirical applications involve richer settings. Policies often roll out gradually across regions or over time, creating variation in when different units receive treatment. This staggered adoption is common in policy evaluation, where reforms phase in across states, countries implement regulations at different dates, or firms adopt new practices over several years. While this variation seems helpful for identifying causal effects, the traditional estimation approach can produce misleading results.

Consider a setting with \(\mathcal{T}\) total time periods in which different units become treated at different times.

The TWFE Regression

The traditional approach to estimating treatment effects in this setting is the two-way fixed effects (TWFE) linear regression

\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + \varepsilon_{it},\]

where

  • \(\theta_t\) — time fixed effect

  • \(\eta_i\) — unit fixed effect

  • \(D_{it}\) — treatment indicator equal to 1 if unit \(i\) has been treated by time \(t\)

  • \(\varepsilon_{it}\) — time-varying unobservables

  • \(\alpha\) — the parameter of interest, typically interpreted as the average effect of participating in treatment

When there are only two time periods, this approach works well. The estimate of \(\alpha\) is numerically equal to the difference in mean outcome changes between the treated and untreated groups, which identifies the ATT under parallel trends and is robust to treatment effect heterogeneity. Even if the effect varies across individual units, the TWFE estimate correctly captures the average treatment effect on the treated.
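The two-period equivalence can be checked numerically. The sketch below is an illustration under stated assumptions, not ModernDiD code: the panel is made up, the helper `twfe_alpha` is a hypothetical name, and the regression is fit with plain NumPy least squares. Untreated outcomes follow a common trend, and the two treated units have different effects (1 and 3), so the ATT is 2.

```python
import numpy as np

def twfe_alpha(y, unit, time, d):
    """OLS coefficient on d in a regression with unit and time dummies."""
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones(len(y))]                                 # intercept
    cols += [(unit == u).astype(float) for u in units[1:]]   # unit fixed effects
    cols += [(time == s).astype(float) for s in times[1:]]   # time fixed effects
    cols.append(d.astype(float))                             # treatment indicator D_it
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return coef[-1]

# Hypothetical panel: units 0 and 1 are treated in period 2, units 2 and 3 never are.
unit = np.repeat(np.arange(4), 2)
time = np.tile(np.array([1, 2]), 4)
group = np.array([1, 1, 0, 0])             # D_i
level = np.array([5.0, 7.0, 1.0, 2.0])     # unit-specific levels
effect = np.array([1.0, 3.0, 0.0, 0.0])    # heterogeneous effects, ATT = 2

post = (time == 2).astype(float)
d = group[unit] * post                     # D_it
y = level[unit] + 2.0 * post + effect[unit] * d   # common trend of 2 in Y(0)

# Difference in mean outcome changes between the two groups
dy = y[time == 2] - y[time == 1]
did = dy[group == 1].mean() - dy[group == 0].mean()

alpha_hat = twfe_alpha(y, unit, time, d)
print(did, alpha_hat)
```

Both numbers equal 2 up to floating-point error, even though the unit-level effects differ.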

The Problem with Staggered Adoption

This robustness does not extend to settings with multiple time periods and variation in treatment timing. Goodman-Bacon (2021) showed that the TWFE estimator equals a weighted average of all possible two-group/two-period DiD estimators in the data, and that some of these implicit comparisons are problematic.

Consider the three types of comparisons that TWFE implicitly makes. First, it compares newly treated units to never-treated units, which is exactly in the spirit of DiD. We adjust the path of outcomes for newly treated units by the path of outcomes for units that never participate in treatment. Second, it compares newly treated units to not-yet-treated units, which is also reasonable since these units have not yet been affected by treatment and can serve as valid comparisons for the current period.

The third comparison, however, is problematic. TWFE also compares newly treated units to already-treated units, those that received treatment in earlier periods. But already-treated units do not represent untreated potential outcomes. Their outcomes in later periods reflect the ongoing effects of treatment, including any treatment effect dynamics. Using them as controls means that treatment effect dynamics from earlier-treated groups contaminate the estimate of \(\alpha\).
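A single comparison of this third type shows where the contamination comes from. The notation is introduced only for this sketch: \(\tau_{\text{late}}(t)\) is the average effect at time \(t\) for a group first treated at \(t\), and \(\tau_{\text{early}}(\cdot)\) is the average effect for a group that was already treated by period \(t-1\). Assuming parallel trends in untreated potential outcomes and no anticipation, the late group's outcome change is the untreated trend plus its effect, while the early group's change is the untreated trend plus the change in its effect. Differencing the two removes the common trend:

\[\mathbb{E}[Y_t - Y_{t-1} \mid \text{late}] - \mathbb{E}[Y_t - Y_{t-1} \mid \text{early}] = \tau_{\text{late}}(t) - \big[\tau_{\text{early}}(t) - \tau_{\text{early}}(t-1)\big].\]

The bracketed term vanishes only if the early group's effect is the same in both periods. If that effect is growing, the comparison understates the late group's effect and can even flip its sign.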

Consequences of Contamination

This contamination can have severe consequences. It is possible to construct examples where the effect of treatment is positive for all units in all time periods, yet the TWFE estimate is negative. Effects can appear smaller than they actually are, and spurious “pre-trends” can appear in the data even when the parallel trends assumption genuinely holds. The estimated coefficient \(\alpha\) does not correspond to any clearly interpretable causal parameter.
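A minimal constructed example of a sign reversal is sketched below. Everything in it is hypothetical: two units, three periods, untreated potential outcomes fixed at zero so that observed outcomes equal treatment effects, and a helper `twfe_alpha` that fits the TWFE regression with NumPy least squares. The early unit's effect grows from 1 to 5, and the late unit's effect is 1, so every treated observation has a positive effect.

```python
import numpy as np

def twfe_alpha(y, unit, time, d):
    """OLS coefficient on d in a regression with unit and time dummies."""
    units, times = np.unique(unit), np.unique(time)
    cols = [np.ones(len(y))]                                 # intercept
    cols += [(unit == u).astype(float) for u in units[1:]]   # unit fixed effects
    cols += [(time == s).astype(float) for s in times[1:]]   # time fixed effects
    cols.append(d.astype(float))                             # treatment indicator D_it
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return coef[-1]

# Hypothetical panel: unit 0 is first treated in period 2, unit 1 in period 3.
unit = np.repeat(np.arange(2), 3)
time = np.tile(np.array([1, 2, 3]), 2)
first_treated = np.array([2, 3])
d = (time >= first_treated[unit]).astype(float)

# Y(0) = 0 everywhere, so y is the treatment effect wherever d == 1.
y = np.array([0.0, 1.0, 5.0,    # early unit: effect grows from 1 to 5
              0.0, 0.0, 1.0])   # late unit: effect of 1 in period 3

alpha_hat = twfe_alpha(y, unit, time, d)
print(alpha_hat)
```

The coefficient works out to -1 up to floating-point error. When the late unit switches into treatment in period 3, the regression uses the early unit as a comparison, and the early unit's outcome rose by 4 over the same interval because its own effect grew.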

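These problems can arise even when treatment effects are homogeneous across groups, as long as effects change with time since treatment. The issues are structural, stemming from which comparisons TWFE makes rather than from heterogeneity across groups per se. Heterogeneity across groups makes the problems worse, but homogeneity across groups does not eliminate them. Only when effects are constant across both groups and time does \(\alpha\) recover the common effect.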

Event-Study Regressions

A common extension of TWFE is the event-study regression

\[Y_{it} = \theta_t + \eta_i + \sum_{k \neq -1} \gamma_k D_{it}^k + \varepsilon_{it},\]

where \(D_{it}^k = \mathbf{1}\{t - G_i = k\}\) is an indicator for unit \(i\) being exactly \(k\) periods from initial treatment at time \(t\), and \(G_i\) denotes the period when unit \(i\) first receives treatment. For instance, \(D_{it}^0\) equals one if unit \(i\) is first treated at time \(t\), while \(D_{it}^{-2}\) equals one if unit \(i\) will be treated in two periods.
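The event-time indicators can be built directly from \(t\) and \(G_i\). The sketch below uses a made-up panel and codes never-treated units with `np.inf`, which is a convention assumed here so that their event time never matches any finite \(k\).

```python
import numpy as np

# Hypothetical balanced panel: three units observed in periods 1 to 4.
t = np.tile(np.arange(1, 5), 3).astype(float)
G = np.repeat(np.array([2.0, 3.0, np.inf]), 4)   # np.inf marks a never-treated unit

rel = t - G                                       # event time t - G_i (-inf if never treated)
ks = [k for k in range(-2, 3) if k != -1]         # omit k = -1 as the reference period
D = {k: (rel == k).astype(int) for k in ks}       # D[k][row] = 1{t - G_i = k}

for k in ks:
    print(k, D[k])
```

Each array in `D` is one regressor column. The never-treated unit has zeros in every column, so it contributes only to the fixed effects.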

Researchers typically interpret the coefficients \(\gamma_k\) for \(k \geq 0\) as dynamic treatment effects, showing how the impact of treatment evolves over time since implementation. The coefficients \(\gamma_k\) for \(k < 0\) are interpreted as pre-trends, serving as placebo tests of the parallel trends assumption. If these pre-treatment coefficients are close to zero, it suggests the treated and comparison groups were evolving similarly before treatment occurred.

Unfortunately, these interpretations can be severely misleading under staggered adoption. The estimated post-treatment coefficients \(\hat{\gamma}_k\) for \(k \geq 0\) can be biased for the true dynamic effects, because each coefficient can absorb effects from other relative periods when the path of effects differs across cohorts. The pre-treatment coefficients \(\hat{\gamma}_k\) for \(k < 0\) can appear statistically significant even when parallel trends genuinely holds, making pre-trend tests unreliable. These problems occur because event-study regressions suffer from the same fundamental issue as TWFE, implicitly using already-treated units as part of the comparison group.

Group-Time Average Treatment Effects

The problems with TWFE stem from pooling all variation into a single regression coefficient. This forces the estimator to make implicit comparisons, including problematic ones that use already-treated units as controls. The solution is to avoid this pooling entirely.

Rather than estimating a single treatment effect, modern methods estimate separate effects for each combination of treatment cohort and time period, using only valid comparisons. This disaggregated approach yields a richer set of parameters that can then be aggregated in transparent ways.

Let \(G_i\) denote the time period when unit \(i\) first receives treatment. If a unit is never treated, we set \(G_i = \infty\). Units with the same treatment timing form a group or cohort. For example, if some states raised their minimum wage in 2010 and others in 2012, there are two treatment groups, the 2010 cohort and the 2012 cohort. Units that never receive treatment form the comparison group.

The group-time average treatment effect is defined as

\[ATT(g, t) = \mathbb{E}[Y_t(g) - Y_t(0) \mid G = g].\]

This is the average effect of participating in treatment for units in group \(g\) at time period \(t\). The notation \(Y_t(g)\) denotes the potential outcome at time \(t\) if a unit were first treated in period \(g\). This parameter is flexible and does not impose homogeneity across groups or time. When there are only two time periods and two groups, and the periods are labeled 1 and 2, the ATT from the canonical case equals \(ATT(g=2, t=2)\).

To give a concrete example, suppose a researcher has access to data from 2010 to 2015, with some units first treated in 2012 and others first treated in 2014. Then \(ATT(g=2012, t=2014)\) is the average effect of participating in treatment for the group of units that became treated in 2012, measured in 2014, which is two years after their treatment began. Similarly, \(ATT(g=2014, t=2014)\) is the effect for units treated in 2014, measured in that same year.
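One way to make \(ATT(g, t)\) concrete is to estimate it by comparing the outcome change from period \(g-1\) to \(t\) for cohort \(g\) with the same change among never-treated units. That comparison group, the choice of \(g-1\) as the base period, the function name `att_gt`, and the simulated data are all assumptions of this sketch; ModernDiD's own estimators may differ. The simulated effects are 1 per period of exposure for the 2012 cohort and 2 per period for the 2014 cohort.

```python
import numpy as np

years = np.arange(2010, 2016)
G = np.array([2012, 2012, 2014, 2014, np.inf, np.inf])   # np.inf = never treated
level = np.array([3.0, 4.0, 1.0, 2.0, 0.0, 5.0])          # unit-specific levels
trend = 0.5 * (years - 2010)                              # common trend in Y(0)

# Hypothetical effects: slope times periods of exposure, zero before treatment.
slope = np.where(G == 2012, 1.0, 2.0)
exposure = years[None, :] - G[:, None] + 1                # -inf for never-treated units
effect = np.where(exposure > 0, slope[:, None] * exposure, 0.0)

Y = level[:, None] + trend[None, :] + effect              # rows: units, columns: years

def att_gt(Y, G, years, g, t):
    """Mean change from g-1 to t for cohort g minus the same change for never-treated."""
    base = int(np.flatnonzero(years == g - 1)[0])
    cur = int(np.flatnonzero(years == t)[0])
    change = Y[:, cur] - Y[:, base]
    return change[G == g].mean() - change[np.isinf(G)].mean()

print(att_gt(Y, G, years, 2012, 2014))   # cohort 2012, two years after first treatment
print(att_gt(Y, G, years, 2014, 2014))   # cohort 2014, in its first treated year
```

Because the simulated untreated outcomes share a common trend, the two calls recover the built-in effects of 3 and 2.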

Aggregating Group-Time Effects

Group-time average treatment effects are the fundamental building blocks of causal inference in staggered adoption settings. However, with many groups and time periods, the set of all \(ATT(g, t)\) can be large. There are both benefits and costs to working with these disaggregated effects. The main benefit is that it is relatively straightforward to examine heterogeneous effects across groups and time. The cost is that summarizing many parameters into interpretable conclusions can be challenging. Several aggregation schemes address this challenge.

Event-Study Aggregation

Event-study aggregation answers the question of how treatment effects evolve with time since treatment. Define \(e = t - g\) as the event time or relative time, representing the number of periods since treatment started. The event-study parameter is

\[\theta_D(e) = \sum_g \mathbf{1}\{g + e \leq \mathcal{T}\} \cdot ATT(g, g + e) \cdot P(G = g \mid G + e \leq \mathcal{T}).\]

This averages \(ATT(g, t)\) across all groups that are observed \(e\) periods from treatment, weighted by the relative size of each group. Event time \(e = 0\) is the period of first treatment. Positive event times show how treatment effects evolve in the periods following implementation, revealing whether effects strengthen, weaken, or remain stable over time.

Negative event times represent pre-treatment periods. Since treatment has not yet occurred, non-zero estimates in these periods suggest that groups were already diverging before the policy change, or that units anticipated treatment, either of which casts doubt on the identifying assumptions. Estimates near zero for negative event times are consistent with parallel trends.
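The event-study formula can be sketched as a weighted average over a table of group-time effects. The inputs below are hypothetical: cohort sizes of 60 and 40 units, a final period of 2015, and \(ATT(g, t)\) values that grow by 1 per period for the 2012 cohort and by 2 per period for the 2014 cohort. Only non-negative event times are covered, since the table holds post-treatment effects only.

```python
# Hypothetical inputs: cohort sizes and a table of group-time effects.
T = 2015
n = {2012: 60, 2014: 40}                      # units per treated cohort
slope = {2012: 1.0, 2014: 2.0}
att = {(g, t): slope[g] * (t - g + 1) for g in n for t in range(g, T + 1)}

def event_study(att, n, T, e):
    """theta_D(e): size-weighted mean of ATT(g, g+e) over cohorts with g + e <= T."""
    groups = [g for g in n if g + e <= T]
    total = sum(n[g] for g in groups)         # denominator of P(G = g | G + e <= T)
    return sum(att[(g, g + e)] * n[g] / total for g in groups)

for e in range(4):
    print(e, event_study(att, n, T, e))
```

Both cohorts contribute at \(e = 0\) and \(e = 1\), but only the 2012 cohort is observed at \(e = 2\) and \(e = 3\). Changes in the estimate across event times therefore partly reflect which cohorts are being averaged.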

Group Aggregation

Group aggregation addresses whether early adopters experience different treatment effects than late adopters. For each group \(g\), we average effects over all post-treatment periods

\[\theta_S(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=g}^{\mathcal{T}} ATT(g, t).\]

This reveals treatment effect heterogeneity across cohorts. Differences might reflect how the policy was implemented differently over time, compositional differences between early and late adopters, or genuine variation in treatment effectiveness.

Calendar-Time Aggregation

Calendar-time aggregation shows how the aggregate treatment effect evolves over calendar time, accounting for the staggered rollout of treatment. For each calendar period \(t\),

\[\theta_C(t) = \sum_g \mathbf{1}\{t \geq g\} \cdot ATT(g, t) \cdot P(G = g \mid G \leq t).\]

This weights each group’s contribution by its relative size among all groups treated by time \(t\). It is useful for understanding the total impact of a policy at each point in time, particularly when aggregate effects may vary with macroeconomic conditions or concurrent policy changes.
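A small sketch of the calendar-time formula follows. The inputs are hypothetical: cohorts first treated in 2012 and 2014 with 60 and 40 units, a final period of 2015, and made-up \(ATT(g, t)\) values.

```python
# Hypothetical inputs: cohort sizes and a table of group-time effects.
T = 2015
n = {2012: 60, 2014: 40}                      # units per treated cohort
slope = {2012: 1.0, 2014: 2.0}
att = {(g, t): slope[g] * (t - g + 1) for g in n for t in range(g, T + 1)}

def calendar_time(att, n, t):
    """theta_C(t): size-weighted mean of ATT(g, t) over cohorts already treated at t."""
    groups = [g for g in n if g <= t]
    total = sum(n[g] for g in groups)         # denominator of P(G = g | G <= t)
    return sum(att[(g, t)] * n[g] / total for g in groups)

for t in range(2012, T + 1):
    print(t, calendar_time(att, n, t))
```

In 2012 and 2013 only the 2012 cohort is treated, so it receives all the weight. From 2014 onward the weights are 0.6 and 0.4.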

Overall Average Treatment Effect

When a single summary measure is needed, the overall effect aggregates across all groups and post-treatment periods

\[\theta_S^O = \sum_g \theta_S(g) \cdot P(G = g \mid G \leq \mathcal{T}).\]

This is the average effect of participating in treatment across the entire treated population, properly accounting for the staggered adoption pattern. It is the natural multi-period analogue of the ATT in the two-period case. If a researcher must report a single treatment effect summary, this is the recommended parameter.
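The overall parameter is built in two steps: average each cohort's effects over its post-treatment periods to get \(\theta_S(g)\), then weight the cohorts by their share of treated units. The sketch below uses hypothetical inputs: cohort sizes of 60 and 40, a final period of 2015, and made-up \(ATT(g, t)\) values.

```python
# Hypothetical inputs: cohort sizes and a table of group-time effects.
T = 2015
n = {2012: 60, 2014: 40}                      # units per treated cohort
slope = {2012: 1.0, 2014: 2.0}
att = {(g, t): slope[g] * (t - g + 1) for g in n for t in range(g, T + 1)}

def group_effect(att, g, T):
    """theta_S(g): simple average of ATT(g, t) over t = g, ..., T."""
    return sum(att[(g, t)] for t in range(g, T + 1)) / (T - g + 1)

def overall_effect(att, n, T):
    """theta_S^O: cohort-size-weighted average of theta_S(g) over treated cohorts."""
    total = sum(n.values())                   # every cohort here satisfies g <= T
    return sum(group_effect(att, g, T) * n[g] / total for g in n)

for g in n:
    print(g, group_effect(att, g, T))
print(overall_effect(att, n, T))
```

Each cohort's periods are averaged first, so a cohort observed for many post-treatment periods does not receive extra weight merely for being treated earlier.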

What’s Next

ModernDiD implements these staggered treatment timing methods along with extensions for continuous treatments, intertemporal effects, triple differences, and sensitivity analysis. To see these ideas in practice, continue to the ModernDiD User Guide.