DiD with Continuous Treatments
The didcont module implements difference-in-differences estimation for settings
where treatment intensity varies continuously across units, following the methodology
of Callaway, Goodman-Bacon, and Sant’Anna (2024).
This approach addresses the unique challenges that arise when treatment is not simply
binary but operates with varying intensity or “dose” across units.
Continuous treatments arise naturally in many empirical settings. Pollution exposure dissipates across space, affecting locations near sources more severely than distant ones. Localities spend different amounts on public goods and services. Students choose how long to stay in school. Medicare subsidies vary with hospital patient composition. In all these cases, treatment intensity varies substantially, and researchers often care about both the overall effect of the policy and how effects vary with dose.
This module provides tools for identifying, estimating, and conducting inference on well-defined causal parameters in continuous DiD designs. A central insight is that with continuous treatments, there are fundamentally two types of causal parameters, level effects and causal responses, each requiring different identifying assumptions.
Setup and Notation
Consider a setup with two time periods, \(t = 1\) (pre-treatment) and \(t = 2\) (post-treatment). In the first period, no unit is treated. In the second period, units receive a treatment “dose” denoted \(D_i\), which can be continuous or multi-valued discrete. The support of \(D\) is \(\mathcal{D} = \{0\} \cup \mathcal{D}_{+}\), where \(\mathcal{D}_{+}\) contains all positive doses and zero represents untreated units.
Assumption 1 (Random Sampling)
The observed data consist of \(\{Y_{i,t=2}, Y_{i,t=1}, D_i\}_{i=1}^n\), which are independent and identically distributed.
Assumption 2 (Continuous or Multi-Valued Discrete Treatment)
In period \(t = 1\), no unit is treated, while in period \(t = 2\), the treatment dosage \(D\) has support \(\mathcal{D} = \{0\} \cup \mathcal{D}_{+}\) and is either
(a) Continuous. \(\mathcal{D}_{+} = \mathcal{D}_{+}^c = [d_L, d_U]\) with \(0 < d_L < d_U < \bar{d} < \infty\). The density \(f_{D|D>0}\) satisfies \(a_f^{-1} < f_{D|D>0}(d) < a_f\) for some positive constant \(a_f < \infty\) and all \(d \in \mathcal{D}_{+}^c\), and \(\mathbb{E}[\Delta Y | D = d]\) is continuously differentiable on \(\mathcal{D}_{+}^c\).
(b) Multi-valued discrete. \(\mathcal{D}_{+} = \mathcal{D}_{+}^{mv} = \{d_1, d_2, \ldots, d_J\}\) where \(0 < d_1 < d_2 < \cdots < d_J < \bar{d} < \infty\), and \(\mathbb{P}(D = d) > 0\) for all \(d \in \mathcal{D}\).
In both cases, we require a positive mass of untreated units, \(\mathbb{P}(D = 0) > 0\).
Potential Outcomes Framework
We adopt the potential outcomes framework where \(Y_{i,t}(d)\) denotes the potential outcome for unit \(i\) at time \(t\) under dose \(d\). The observed outcome in each period satisfies the following assumption.
Assumption 3 (No-Anticipation and Observed Outcomes)
For all units and all \(d \in \mathcal{D}\),
\[Y_{i,t=1} = Y_{i,t=1}(d) = Y_{i,t=1}(0) \quad \text{and} \quad Y_{i,t=2} = Y_{i,t=2}(D_i).\]
This assumption rules out anticipatory effects, ensuring that in the pre-treatment period, all units exhibit their untreated potential outcomes regardless of their future dose. In the post-treatment period, we observe the potential outcome corresponding to the actual dose received. Let \(\Delta Y = Y_{t=2} - Y_{t=1}\) denote the change in outcomes from period 1 to period 2.
Parameters of Interest
With continuous treatments, two fundamentally different types of causal effects can be defined. Understanding the distinction between these parameters is crucial for proper interpretation of continuous DiD results.
Level Treatment Effects
The level treatment effect of dose \(d\) for a given unit is the difference between its potential outcome under dose \(d\) and its untreated potential outcome,
\[Y_{i,t=2}(d) - Y_{i,t=2}(0).\]
This extends the binary treatment effect concept to a “dose-response function.” The average treatment effect on the treated at dose \(d\) among units receiving dose \(d'\) is
\[ATT(d | d') = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0) | D = d'].\]
When \(d' = d\), this yields \(ATT(d | d)\), the average effect of dose \(d\) compared to no treatment among units that actually received dose \(d\). This is the natural extension of the binary ATT to the continuous case.
The population-level average treatment effect is
\[ATE(d) = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0)].\]
Note that \(ATT(d | d)\) and \(ATE(d)\) differ when there is selection into dose group \(d\) on the basis of treatment effects. When units with larger treatment effects systematically choose higher doses, we have \(ATT(d | d) \neq ATE(d)\).
Causal Responses
The causal response at dose \(d\) measures the effect of a marginal change in the dose. For continuous treatments, the causal response is defined as the derivative of the potential outcome with respect to dose,
\[Y_{i,t=2}'(d) = \frac{\partial Y_{i,t=2}(d)}{\partial d}.\]
For discrete treatments, the causal response between adjacent doses \(d_j\) and \(d_{j-1}\) is
\[Y_{i,t=2}(d_j) - Y_{i,t=2}(d_{j-1}).\]
When treatment is binary, level treatment effects and causal responses coincide, but they do not under a continuous treatment. This distinction has important practical implications since even if all \(ATT(d | d)\) parameters are large and positive, some causal response parameters could be zero or negative.
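As a small numeric illustration of this distinction, the sketch below takes a made-up dose-response curve for a single population (so selection across dose groups plays no role) and differences it across adjacent doses. The dose grid and effect values are hypothetical, chosen only so that every level effect is positive while some causal responses are not.

```python
import numpy as np

# Hypothetical doses and level effects (effect of dose d versus no treatment).
doses = np.array([1.0, 2.0, 3.0, 4.0])
level_effects = np.array([2.0, 3.0, 3.0, 2.5])

# Causal response between adjacent doses, with dose 0 (effect 0) as the first baseline.
causal_responses = np.diff(np.concatenate(([0.0], level_effects)))

print(causal_responses.tolist())  # [2.0, 1.0, 0.0, -0.5]
```

All four level effects are positive, yet moving from dose 2 to dose 3 changes nothing and moving from dose 3 to dose 4 lowers the outcome.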
The average causal response on the treated (ACRT) for continuous treatments is
\[ACRT(d | d') = \frac{\partial \mathbb{E}[Y_{t=2}(l) | D = d']}{\partial l}\bigg|_{l=d}.\]
When \(d' = d\), this gives the average marginal effect of increasing the dose among units at that dose level. Equivalently, \(ACRT(d | d)\) equals the derivative of the \(t = 2\) average potential outcome for units that received dose \(d\), evaluated at \(d\).
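The population-level average causal response is
\[ACR(d) = \frac{\partial \mathbb{E}[Y_{t=2}(d)]}{\partial d}.\]
For discrete treatments, the analogous parameters are
\[ACRT(d_j | d') = \mathbb{E}[Y_{t=2}(d_j) - Y_{t=2}(d_{j-1}) | D = d'] \quad \text{and} \quad ACR(d_j) = \mathbb{E}[Y_{t=2}(d_j) - Y_{t=2}(d_{j-1})].\]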
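Summary Parameters
In practice, researchers often want to aggregate these functional parameters into lower-dimensional summary measures. Natural aggregations use the dose distribution among treated units,
\[ATT^o = \mathbb{E}[ATT(D | D) | D > 0], \quad ATE^o = \mathbb{E}[ATE(D) | D > 0], \quad ACRT^o = \mathbb{E}[ACRT(D | D) | D > 0], \quad ACR^o = \mathbb{E}[ACR(D) | D > 0].\]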
These provide “best” approximations in the sense of minimizing the mean squared distance between the summary parameter and the underlying functional parameters. The parameters \(ACRT^o\) and \(ACR^o\) are average derivative-type parameters, which have been extensively studied in the econometrics literature on efficient estimation.
Identification Assumptions
The identification of treatment effect parameters relies on assumptions that restrict how untreated potential outcomes evolve over time across dose groups.
Parallel Trends
The standard parallel trends assumption extends naturally from the binary case.
Assumption 4 (Parallel Trends)
For all \(d \in \mathcal{D}\),
\[\mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = d] = \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = 0].\]
This assumption states that the average evolution of untreated potential outcomes would be the same across all dose groups in the absence of treatment. Under parallel trends, the untreated group provides a valid counterfactual for the path of outcomes that treated units would have experienced without treatment.
Parallel trends is an assumption about untreated potential outcomes \(Y_t(0)\) only. It says nothing about how treated potential outcomes \(Y_t(d)\) for \(d > 0\) evolve across dose groups.
Strong Parallel Trends
A different assumption is required to identify causal response parameters and to make valid comparisons across dose groups.
Assumption 5 (Strong Parallel Trends)
For all \(d \in \mathcal{D}\),
\[\mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0)] = \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D = d].\]
Under Assumption 3, the right-hand side of this equation is the observed average evolution of outcomes for dose group \(d\). Strong parallel trends says that the average evolution of outcomes for the entire population if all experienced dose \(d\) (the left-hand side) equals the path of outcomes that dose group \(d\) actually experienced.
An equivalent characterization under Assumption 4 is that strong parallel trends holds if and only if, for all \(d \in \mathcal{D}_{+}\),
\[ATT(d | d) = ATE(d).\]
This means strong parallel trends rules out selection-on-gains into particular dose groups. While this condition does not impose full treatment effect homogeneity, it does ensure that observed outcome changes for each dose group reflect what would have happened to all other groups had they received that dose.
Note
Conventional pre-tests for differential pre-trends cannot distinguish between Assumptions 4 and 5. Because only untreated potential outcomes are observed before treatment, pre-treatment periods cannot test the additional content of strong parallel trends, which necessarily involves treated potential outcomes \(Y_t(d)\) for \(d > 0\).
Relationship Between Assumptions
In general, Assumptions 4 and 5 are non-nested, though Assumption 5 is the stronger requirement in most applications. To see this, consider that Assumption 4 restricts only the evolution of \(Y_t(0)\) across dose groups, while Assumption 5 restricts the evolution of \(Y_t(d)\) for each \(d \in \mathcal{D}\).
When maintained jointly with Assumption 4, Assumption 5 can be understood as a structural assumption that allows extrapolation of treatment effects, ensuring that the treatment effects of dose \(d\) among dose group \(d\) equal the treatment effects of dose \(d\) for the entire population.
Identification Results
Identification Under Parallel Trends
Under parallel trends (Assumption 4), the dose-specific average treatment effect on the treated is identified. Specifically, under Assumptions 1 to 4, \(ATT(d | d)\) is identified for all \(d \in \mathcal{D}_{+}\), and it is given by
\[ATT(d | d) = \mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = 0].\]
Furthermore, \(ATT^o = \mathbb{E}[\Delta Y | D > 0] - \mathbb{E}[\Delta Y | D = 0]\).
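The identification argument proceeds as follows. By definition,
\[ATT(d | d) = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0) | D = d].\]
Adding and subtracting \(\mathbb{E}[Y_{t=1}(0) | D = d]\) and applying Assumption 4,
\[\begin{split}ATT(d | d) &= \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D = d] - \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = d] \\ &= \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D = d] - \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = 0] \\ &= \mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = 0],\end{split}\]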
where the final equality uses the fact that \(Y_{t=2}(d)\) and \(Y_{t=1}(0)\) are observed for units with \(D = d\).
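The identification result can be checked on simulated data. The sketch below is illustrative only: the data-generating process, the dose values, and the helper name `att_dd` are assumptions made for this example and are not part of the didcont API. Parallel trends holds by construction because every dose group shares the same trend in untreated outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical design: doses in {0, 1, 2, 3}, with 40% of units untreated.
dose = rng.choice([0.0, 1.0, 2.0, 3.0], size=n, p=[0.4, 0.2, 0.2, 0.2])
true_att = {1.0: 1.0, 2.0: 1.5, 3.0: 1.75}  # assumed ATT(d|d) for each positive dose
effect = np.zeros(n)
for d, att in true_att.items():
    effect[dose == d] = att

unit_effect = rng.normal(size=n)
y1 = unit_effect + rng.normal(size=n)                  # period 1: untreated outcomes
y2 = unit_effect + 0.5 + effect + rng.normal(size=n)   # period 2: common trend 0.5 plus effect
delta_y = y2 - y1


def att_dd(delta_y, dose, d):
    """Sample analogue of E[dY | D = d] - E[dY | D = 0]."""
    return delta_y[dose == d].mean() - delta_y[dose == 0].mean()


att_hat = {d: att_dd(delta_y, dose, d) for d in true_att}

# ATT^o: all treated units versus untreated units.
att_o_hat = delta_y[dose > 0].mean() - delta_y[dose == 0].mean()

# The same number as a dose-share-weighted average of the ATT(d|d) estimates.
treated_doses = dose[dose > 0]
weighted = sum(att_hat[d] * np.mean(treated_doses == d) for d in att_hat)

print(att_hat, att_o_hat, weighted)
```

In the sample, `att_o_hat` and `weighted` agree up to floating-point error, which mirrors the fact that \(ATT^o\) averages \(ATT(d | d)\) over the dose distribution of treated units.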
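Non-Identification of Causal Responses Under Parallel Trends
A central result is that causal response parameters are not identified under parallel trends alone. Under Assumptions 1 to 4, the following decompositions reveal the source of the identification failure.
(a) For continuous treatments with \(d \in \mathcal{D}_{+}^c\),
\[\frac{\partial \mathbb{E}[\Delta Y | D = d]}{\partial d} = ACRT(d | d) + \underbrace{\frac{\partial ATT(d | l)}{\partial l}\bigg|_{l=d}}_{\text{selection bias}}.\]
(b) For any \((h, l) \in \mathcal{D} \times \mathcal{D}\) with \(h > l\),
\[\mathbb{E}[\Delta Y | D = h] - \mathbb{E}[\Delta Y | D = l] = \mathbb{E}[Y_{t=2}(h) - Y_{t=2}(l) | D = h] + \underbrace{ATT(l | h) - ATT(l | l)}_{\text{selection bias}}.\]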
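The proof for part (b) is instructive. Starting from the identification result above,
\[\mathbb{E}[\Delta Y | D = h] - \mathbb{E}[\Delta Y | D = l] = ATT(h | h) - ATT(l | l).\]
Adding and subtracting \(\mathbb{E}[Y_{t=2}(l) | D = h]\),
\[\begin{split}ATT(h | h) - ATT(l | l) &= \mathbb{E}[Y_{t=2}(h) - Y_{t=2}(l) | D = h] + \mathbb{E}[Y_{t=2}(l) - Y_{t=2}(0) | D = h] - ATT(l | l) \\ &= \mathbb{E}[Y_{t=2}(h) - Y_{t=2}(l) | D = h] + \left(ATT(l | h) - ATT(l | l)\right).\end{split}\]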
The selection bias term \(ATT(l | h) - ATT(l | l)\) captures the fact that different dose groups may experience different treatment effects at the same dose \(l\). Even if untreated potential outcomes evolve identically (parallel trends), comparing outcome paths between dose groups conflates causal responses with this selection-on-gains phenomenon.
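A simulation makes the selection bias term concrete. In the sketch below, which rests on an assumed data-generating process, units that choose the high dose also have larger per-unit-of-dose gains, while untreated outcomes follow the same trend in every dose group. Because all potential outcomes are simulated, each piece of the decomposition can be computed directly; none of the names come from the module.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000
low, high = 1.0, 2.0
dose = rng.choice([0.0, low, high], size=n)

# Assumed selection on gains: high-dose units gain about 2 per unit of dose, others about 1.
gain = np.where(dose == high, 2.0, 1.0) + 0.1 * rng.normal(size=n)

y1 = rng.normal(size=n)
y2_untreated = y1 + 0.5 + rng.normal(size=n)  # identical trend for every dose group


def y2_at(d):
    """Period-2 potential outcome of every unit under dose d."""
    return y2_untreated + gain * d


delta_y = y2_at(dose) - y1  # observed change: each unit receives its own dose

is_h, is_l = dose == high, dose == low
observed_gap = delta_y[is_h].mean() - delta_y[is_l].mean()

# Causal response of moving from low to high, among high-dose units.
causal_response = (y2_at(high) - y2_at(low))[is_h].mean()

# Selection bias: ATT(l|h) - ATT(l|l).
att_l_h = (y2_at(low) - y2_untreated)[is_h].mean()
att_l_l = (y2_at(low) - y2_untreated)[is_l].mean()
selection_bias = att_l_h - att_l_l

print(observed_gap, causal_response, selection_bias)
```

Under these assumptions the observed gap between dose groups is close to 3, of which only about 2 is a causal response; the remaining 1 is selection bias.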
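For discrete treatments, taking \(h = d_j\) and \(l = d_{j-1}\) yields
\[\mathbb{E}[\Delta Y | D = d_j] - \mathbb{E}[\Delta Y | D = d_{j-1}] = ACRT(d_j | d_j) + \left(ATT(d_{j-1} | d_j) - ATT(d_{j-1} | d_{j-1})\right).\]
Identification Under Strong Parallel Trends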
Under strong parallel trends (Assumption 5), both level effects and causal responses are identified without selection bias. The following results hold under Assumptions 1 to 3 and 5.
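(a) For \(d \in \mathcal{D}_{+}\),
\[ATE(d) = \mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = 0].\]
(b) When treatment is continuous, for \(d \in \mathcal{D}_{+}^c\),
\[ACR(d) = \frac{\partial \mathbb{E}[\Delta Y | D = d]}{\partial d}.\]
(c) For any \((h, l) \in \mathcal{D} \times \mathcal{D}\),
\[\mathbb{E}[\Delta Y | D = h] - \mathbb{E}[\Delta Y | D = l] = ATE(h) - ATE(l) = \mathbb{E}[Y_{t=2}(h) - Y_{t=2}(l)].\]
For part (a), the argument is similar to the identification under parallel trends but uses Assumption 5 instead,
\[\begin{split}ATE(d) &= \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0)] \\ &= \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0)] - \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0)] \\ &= \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D = d] - \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = 0] \\ &= \mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = 0],\end{split}\]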
where the third equality applies Assumption 5 to both terms.
Parts (b) and (c) follow because strong parallel trends ensures that the observed outcome path of each dose group equals the path the whole population would have followed under that dose. Comparisons across dose groups therefore recover \(ATE(h) - ATE(l)\), with no selection bias term.
Under Assumptions 1 to 3 and 5, the summary parameters have the following identification results.
\(ATE^o = \mathbb{E}[\Delta Y | D > 0] - \mathbb{E}[\Delta Y | D = 0]\).
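For continuous treatments,
\[ACR^o = \mathbb{E}\left[\frac{\partial \mathbb{E}[\Delta Y | D = d]}{\partial d}\bigg|_{d=D} \,\Big|\, D > 0\right].\]
For discrete treatments, with \(d_0 = 0\),
\[ACR^o = \sum_{j=1}^{J} \left(\mathbb{E}[\Delta Y | D = d_j] - \mathbb{E}[\Delta Y | D = d_{j-1}]\right) \mathbb{P}(D = d_j | D > 0).\]
The Case Without Untreated Units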
In some applications, all units receive some positive amount of treatment. Without untreated units, it is infeasible to directly recover \(ATT(d | d)\) or \(ATE(d)\). However, a natural alternative is to compare dose group \(d\) to dose group \(d_L\) (the lowest dose).
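Under parallel trends, when there are no untreated units,
\[\mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = d_L] = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(d_L) | D = d] + \left(ATT(d_L | d) - ATT(d_L | d_L)\right).\]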
This comparison is related to underlying causal parameters, but the right-hand side mixes together the average causal response of moving from \(d_L\) to \(d\) with selection bias.
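Under strong parallel trends,
\[\mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = d_L] = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(d_L)],\]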
which has a clean causal interpretation without selection bias.
Estimation Methods
Given the identification results above, this section describes estimation procedures that target well-defined causal parameters.
Discrete Treatments
When the treatment is multi-valued discrete, estimation is straightforward. Regressing outcome changes on a saturated set of dose indicators with untreated units as the omitted category,
\[\Delta Y_i = \beta_0 + \sum_{j=1}^{J} \beta_j \mathbf{1}\{D_i = d_j\} + \varepsilon_i,\]
yields OLS coefficients \(\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_J)'\) that consistently estimate \(ATT(d_j | d_j)\) under parallel trends. Under strong parallel trends, each \(\widehat{\beta}_j\) estimates \(ATE(d_j)\), and \(\widehat{\beta}_j - \widehat{\beta}_{j-1}\) estimates \(ACR(d_j)\).
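The saturated regression can be sketched with plain NumPy least squares. The simulated data and variable names below are assumptions for illustration, not the module's implementation. Because the design is saturated, each dose coefficient equals the difference between that dose group's mean outcome change and the untreated mean.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
doses = np.array([1.0, 2.0, 3.0])
dose = rng.choice(np.concatenate(([0.0], doses)), size=n)

# Assumed truth: E[dY | D = d] - E[dY | D = 0] = sqrt(d).
delta_y = 0.5 + np.sqrt(dose) + rng.normal(size=n)

# Intercept plus one indicator per positive dose; untreated units are the omitted category.
X = np.column_stack([np.ones(n)] + [(dose == d).astype(float) for d in doses])
coef, *_ = np.linalg.lstsq(X, delta_y, rcond=None)
beta = coef[1:]

# Differences between adjacent dose coefficients (the first one is relative to dose 0).
adjacent_diffs = np.diff(np.concatenate(([0.0], beta)))

# Same estimates computed directly from group means.
mean_diffs = np.array(
    [delta_y[dose == d].mean() - delta_y[dose == 0].mean() for d in doses]
)
print(beta, adjacent_diffs, mean_diffs)
```

Which causal parameter `beta` and `adjacent_diffs` estimate depends on the maintained assumption, as described above; the code itself only computes differences in means.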
Continuous Treatments - Sieve Estimation
For continuous treatments, the module provides sieve-based estimation using B-spline basis functions. Consider regression specifications of the form
\[\Delta Y_i = \psi^K(D_i)'\beta_K + \varepsilon_i,\]
where \(\psi^K(d) = (\psi_{K1}(d), \ldots, \psi_{KK}(d))'\) is a \(K\)-dimensional vector of B-spline basis functions (including an intercept), \(\beta_K = (\beta_{K1}, \ldots, \beta_{KK})'\) is a vector of unknown parameters, and \(\varepsilon_i\) is an idiosyncratic error term.
The OLS estimator is
where for a given matrix \(A\), \(A^{-}\) denotes the Moore-Penrose inverse, and
The estimators for the dose-response function and its derivative are
where \(\partial \psi^K(d) = (d\psi_{K1}(d)/dd, \ldots, d\psi_{KK}(d)/dd)'\) contains the derivatives of the basis functions.
The user controls the spline degree and number of interior knots, allowing flexible
modeling of the dose-response relationship. With degree=3 and num_knots=0
(the default), this fits a global cubic polynomial.
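To make the global cubic case concrete, the sketch below fits a cubic polynomial to outcome changes of treated units and evaluates the fitted curve and its derivative on a dose grid. It is a stand-in under stated assumptions, not the module's code: a monomial basis replaces the B-spline basis (with no interior knots both span the cubic polynomials, so fitted values coincide), the untreated mean is subtracted to express the curve as a level effect, and the simulated data are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(3)
n = 4_000
treated = rng.random(n) < 0.7
dose = np.where(treated, rng.uniform(0.5, 2.0, size=n), 0.0)

# Assumed truth: level effect d**2 (derivative 2d) on top of a common trend of 0.3.
delta_y = 0.3 + dose**2 + rng.normal(scale=0.5, size=n)

# Cubic fit on treated units; coefficients are ordered from the constant term upward.
coefs = P.polyfit(dose[treated], delta_y[treated], deg=3)
untreated_mean = delta_y[~treated].mean()

grid = np.linspace(0.5, 2.0, 16)
level_hat = P.polyval(grid, coefs) - untreated_mean  # estimated dose-response curve
slope_hat = P.polyval(grid, P.polyder(coefs))        # estimated derivative curve

# Plug-in average derivative over the doses of treated units.
acr_o_hat = P.polyval(dose[treated], P.polyder(coefs)).mean()
print(level_hat[:3], slope_hat[:3], acr_o_hat)
```

With doses uniform on [0.5, 2], the average of the true derivative 2d is 2.5, which the plug-in average should approximate. Adding interior knots would require an actual spline basis rather than the monomial shortcut used here.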
Data-Driven Nonparametric Estimation (CCK)
For fully nonparametric estimation without arbitrary tuning parameter choices, the module implements the data-driven sieve estimator of Chen, Christensen, and Kankanala (2024). This approach uses dyadic cubic B-splines with adaptive selection of the sieve dimension.
Let \(\mathcal{K} = \{(2^k + 3) : k \in \mathbb{N} \cup \{0\}\}\) be the set of candidate sieve dimensions. The data-driven choice \(\widehat{K}\) uses a Lepskii-type selection procedure. The key idea is to select the most parsimonious specification across all candidates, provided that the estimated \(ATE_K(d)\) curves are not “statistically different” from each other.
Algorithm (Data-Driven Sieve Dimension Selection)
Compute the data-driven index set of sieve dimensions
\[\widehat{\mathcal{K}} = \left\{K \in \mathcal{K} : 0.1(\log \widehat{K}_{ \max})^2 \le K \le \widehat{K}_{\max}\right\},\]where \(\widehat{K}_{\max} = \min\{K \in \mathcal{K} : K\sqrt{\log K} v_n \le 10\sqrt{n} < K^+\sqrt{\log K^+} v_n\}\) with \(v_n = \max\{1, (0.1 \log n)^4\}\) and \(K^+ = \min\{k \in \mathcal{K} : k > K\}\).
For bootstrap draws \(\{\omega_i\}_{i=1}^n\) (iid standard normal, independent of the data), compute the sup-t statistic
\[\sup_{(d, K, K_2) \in \mathcal{D}_{+}^c \times \widehat{\mathcal{K}} \times \widehat{\mathcal{K}} : K_2 > K} \left|\mathbb{Z}_n^*(d, K, K_2)\right|,\]where \(\mathbb{Z}_n^*(d, K, K_2)\) is a normalized bootstrap process comparing estimators at different sieve dimensions. Let \(\gamma_{1- \widehat{\alpha}}^*\) denote the \((1 - \widehat{\alpha})\) quantile.
The data-driven choice is
\[\begin{split}\widehat{K} = \inf\Bigg\{K \in \widehat{\mathcal{K}} : \sup_{\substack{(d, K_2) \in \mathcal{D}_{+}^c \times \widehat{\mathcal{K}} \\ K_2 > K}} \frac{\sqrt{n}|\widehat{ATE}_K(d) - \widehat{ATE}_{K_2}(d)|}{ \widehat{\sigma}_{K,K_2}(d)} \le 1.1 \gamma_{1-\widehat{\alpha}}^*\Bigg\}.\end{split}\]
The intuition is that if increasing \(K\) leads to a statistically different estimate of \(ATE_K(d)\), then it is “worth it” to increase the dimension. This is how the algorithm trades off bias and variance.
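The first step of the algorithm depends only on the sample size, so it can be computed directly from the displayed formulas. The sketch below does this; the function name and the cap on the exponent are assumptions for this example. The bootstrap and selection steps are not sketched, since they depend on estimator details not given here.

```python
import math


def candidate_dimensions(n, max_exponent=30):
    """Return (K_max, index set) of sieve dimensions for sample size n.

    Candidates are K = 2**k + 3 for k = 0, 1, 2, ...; max_exponent only bounds the search.
    """
    v_n = max(1.0, (0.1 * math.log(n)) ** 4)
    target = 10.0 * math.sqrt(n)
    dims = [2**k + 3 for k in range(max_exponent + 1)]

    def scaled(K):
        return K * math.sqrt(math.log(K)) * v_n

    # K_max is the candidate whose scaled size is at most the target
    # while the next candidate's scaled size exceeds it.
    k_max = None
    for K, K_next in zip(dims, dims[1:]):
        if scaled(K) <= target < scaled(K_next):
            k_max = K
            break
    if k_max is None:
        raise ValueError("no candidate dimension satisfies the K_max condition")

    lower = 0.1 * math.log(k_max) ** 2
    index_set = [K for K in dims if lower <= K <= k_max]
    return k_max, index_set


k_max, index_set = candidate_dimensions(1000)
print(k_max, index_set)  # 131 [4, 5, 7, 11, 19, 35, 67, 131]
```

For n = 1000 the factor \(v_n\) equals 1, so the bound is driven by \(10\sqrt{n} \approx 316\), and the lower bound \(0.1(\log 131)^2 \approx 2.4\) excludes no candidate.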
Convergence Rates and Confidence Bands
The data-driven estimators achieve the minimax rate for estimating \(ATE(d)\) and \(ACR(d)\) in sup-norm. Under appropriate regularity conditions, let \(\mathcal{H}^p\) denote the Hölder ball of smoothness \(p\) and let \(p \in [\underline{p}, \bar{p}]\) with \(\bar{p} > \underline{p} > 0.5\). The following convergence results hold.
For level effects, there exists a universal constant \(C_1 > 0\) for which
For derivatives, when \(\underline{p} > 1\), there exists a universal constant \(C_1' > 0\) for which
The convergence rates \((\log n / n)^{p/(2p+1)}\) for level effects and \((\log n / n)^{(p-1)/(2p+1)}\) for derivatives are the minimax rates for estimating functions in Hölder balls under sup-norm loss. As expected, the derivative estimator converges more slowly.
Uniform Confidence Bands. The module provides data-driven uniform confidence bands (UCBs) that are both honest (asymptotically correct coverage) and adaptive (contract at the minimax rate). For \(ATE(d)\),
where \(z_{1-\alpha}^*\) is the \((1-\alpha)\) quantile of a bootstrap sup-t statistic and \(\widehat{A} = \log \log \widehat{K}\) inflates critical values to account for potential bias.
Summary Parameter Estimation
Binarized DiD for \(ATT^o\)
The summary parameter \(ATT^o\) is estimated by a simple regression
\[\Delta Y_i = \beta_0 + \beta^{bin} D_i^{>0} + \varepsilon_i,\]
where \(D_i^{>0} = \mathbf{1}\{D_i > 0\}\). The OLS coefficient \(\widehat{\beta}^{bin}\) consistently estimates \(ATT^o\) under parallel trends (or \(ATE^o\) under strong parallel trends).
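Average Causal Response
The summary parameter \(ACR^o\) is estimated using the plug-in principle, averaging the estimated derivative \(\widehat{ACR}(d)\) over the doses of treated units,
\[\widehat{ACR}^o = \frac{1}{n_{D>0}} \sum_{i : D_i > 0} \widehat{ACR}(D_i),\]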
where \(n_{D>0} = \sum_{i=1}^n \mathbf{1}\{D_i > 0\}\).
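Under appropriate regularity conditions, the estimator is asymptotically normal,
\[\frac{\sqrt{n}\left(\widehat{ACR}^o - ACR^o\right)}{\widehat{\sigma}_{ACR^o}} \xrightarrow{d} N(0, 1),\]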
where \(\widehat{\sigma}_{ACR^o}^2 \xrightarrow{p} V_{ACR}\) with \(V_{ACR}\) being the semiparametric efficiency bound
Extensions to Staggered Adoption
The methodology extends to settings with multiple time periods and variation in treatment timing. Let \(G_i\) denote the time period when unit \(i\) first receives a positive dose, with \(G_i = \infty\) for never-treated units. The potential outcomes are indexed by both timing and dose, \(Y_{i,t}(g, d)\).
The group-time-dose average treatment effect is
\[ATE(g, t, d) = \mathbb{E}[Y_t(g, d) - Y_t(0) | G = g],\]
which measures the average effect in period \(t\) of becoming treated in period \(g\) with dose \(d\), among units in timing group \(g\).
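Under a multi-period version of strong parallel trends, this is identified as
\[ATE(g, t, d) = \mathbb{E}[Y_t - Y_{g-1} | G = g, D = d] - \mathbb{E}[Y_t - Y_{g-1} | G = \infty].\]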
The expression involves “long differences” in outcomes from period \(g - 1\) (the last period before treatment) to \(t\). Not-yet-treated units can also be used as a comparison group.
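The long-difference comparison can be sketched for one group, period, and dose. Everything below is an assumption for illustration: the simulated panel, the use of never-treated units as the comparison group, and the helper name `ate_gtd`, which is not a didcont function.

```python
import numpy as np

rng = np.random.default_rng(4)
n, periods = 12_000, 4

# G_i: first treated period (2 or 3), or infinity for never-treated units.
group = rng.choice([2.0, 3.0, np.inf], size=n)
dose = np.where(np.isinf(group), 0.0, rng.choice([1.0, 2.0], size=n))

# y[:, t - 1] stores Y_{i,t}. Assumed effect: dose times periods of exposure so far.
unit_effect = rng.normal(size=n)
y = np.empty((n, periods))
for t in range(1, periods + 1):
    exposure = np.where(t >= group, t - group + 1, 0.0)
    y[:, t - 1] = unit_effect + 0.2 * t + dose * exposure + rng.normal(scale=0.5, size=n)


def ate_gtd(y, group, dose, g, t, d):
    """Long difference Y_t - Y_{g-1}: timing group g at dose d minus never-treated units."""
    long_diff = y[:, t - 1] - y[:, g - 2]
    treated_cell = (group == g) & (dose == d)
    never_treated = np.isinf(group)
    return long_diff[treated_cell].mean() - long_diff[never_treated].mean()


# Group first treated in period 2, evaluated in period 3, at dose 2:
# two periods of exposure, so the assumed true effect is 2 * 2 = 4.
estimate = ate_gtd(y, group, dose, g=2, t=3, d=2.0)
print(estimate)
```

Swapping the `never_treated` mask for units with `group > t` would give a not-yet-treated comparison group instead.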
Aggregation Strategies
The high-dimensional \(ATE(g, t, d)\) parameters can be aggregated in two main ways.
Dose Aggregation. Averaging across timing groups and time periods yields dose-response functions
which highlight heterogeneity across different dose levels. These are analogous to \(ATE(d)\) and \(ACR(d)\) in the two-period case.
Event-Study Aggregation. Averaging across doses while keeping event-time structure yields
where \(e = t - g\) is the time since treatment. These highlight how treatment effects and causal responses evolve with length of exposure.
Pre-treatment event-study estimates (\(e < 0\)) can be used to assess the plausibility of parallel trends assumptions. However, such tests cannot distinguish between standard parallel trends and strong parallel trends, since pre-treatment periods only involve untreated potential outcomes.
Note
For complete theoretical details including formal assumptions, asymptotic properties, and efficiency results, refer to Callaway, Goodman-Bacon, and Sant’Anna (2024). The nonparametric estimation procedures build on Chen, Christensen, and Kankanala (2024).