DiD with Continuous Treatments#

The didcont module implements difference-in-differences estimation for settings where treatment intensity varies continuously across units, following the methodology of Callaway, Goodman-Bacon, and Sant’Anna (2024). This approach addresses the unique challenges that arise when treatment is not simply binary but operates with varying intensity or “dose” across units.

Continuous treatments arise naturally in many empirical settings. Pollution exposure dissipates across space, affecting locations near sources more severely than distant ones. Localities spend different amounts on public goods and services. Students choose how long to stay in school. Medicare subsidies vary with hospital patient composition. In all these cases, treatment intensity varies substantially, and researchers often care about both the overall effect of the policy and how effects vary with dose.

This module provides tools for identifying, estimating, and conducting inference on well-defined causal parameters in continuous DiD designs. A central insight is that with continuous treatments, there are fundamentally two types of causal parameters, level effects and causal responses, each requiring different identifying assumptions.

Setup and Notation#

Consider a setup with two time periods, \(t = 1\) (pre-treatment) and \(t = 2\) (post-treatment). In the first period, no unit is treated. In the second period, units receive a treatment “dose” denoted \(D_i\), which can be continuous or multi-valued discrete. The support of \(D\) is \(\mathcal{D} = \{0\} \cup \mathcal{D}_{+}\), where \(\mathcal{D}_{+}\) contains all positive doses and zero represents untreated units.

Assumption 1 (Random Sampling)

The observed data consist of \(\{Y_{i,t=2}, Y_{i,t=1}, D_i\}_{i=1}^n\), which are independent and identically distributed.

Assumption 2 (Continuous or Multi-Valued Discrete Treatment)

In period \(t = 1\), no unit is treated, while in period \(t = 2\), the treatment dosage \(D\) has support \(\mathcal{D} = \{0\} \cup \mathcal{D}_{+}\) and is either

(a) Continuous. \(\mathcal{D}_{+} = \mathcal{D}_{+}^c = [d_L, d_U]\) with \(0 < d_L < d_U < \bar{d} < \infty\). The density \(f_{D|D>0}\) satisfies \(a_f^{-1} < f_{D|D>0}(d) < a_f\) for some positive constant \(a_f < \infty\) and all \(d \in \mathcal{D}_{+}^c\), and \(\mathbb{E}[\Delta Y | D = d]\) is continuously differentiable on \(\mathcal{D}_{+}^c\).

(b) Multi-valued discrete. \(\mathcal{D}_{+} = \mathcal{D}_{+}^{mv} = \{d_1, d_2, \ldots, d_J\}\) where \(0 < d_1 < d_2 < \cdots < d_J < \bar{d} < \infty\), and \(\mathbb{P}(D = d) > 0\) for all \(d \in \mathcal{D}\).

In both cases, we require a positive mass of untreated units, \(\mathbb{P}(D = 0) > 0\).

Potential Outcomes Framework#

We adopt the potential outcomes framework where \(Y_{i,t}(d)\) denotes the potential outcome for unit \(i\) at time \(t\) under dose \(d\). The observed outcome in each period satisfies

\[Y_{i,t=1} = Y_{i,t=1}(0), \quad Y_{i,t=2} = Y_{i,t=2}(D_i).\]

Assumption 3 (No-Anticipation and Observed Outcomes)

For all units and all \(d \in \mathcal{D}\),

\[Y_{i,t=1} = Y_{i,t=1}(d) = Y_{i,t=1}(0), \quad Y_{i,t=2} = Y_{i,t=2}(D_i).\]

This assumption rules out anticipatory effects, ensuring that in the pre-treatment period, all units exhibit their untreated potential outcomes regardless of their future dose. In the post-treatment period, we observe the potential outcome corresponding to the actual dose received. Let \(\Delta Y = Y_{t=2} - Y_{t=1}\) denote the change in outcomes from period 1 to period 2.

Parameters of Interest#

With continuous treatments, two fundamentally different types of causal effects can be defined. The distinction between these parameters matters for proper interpretation of continuous DiD results.

Level Treatment Effects#

The level treatment effect of dose \(d\) for a given unit is the difference between its potential outcome under dose \(d\) and its untreated potential outcome

\[Y_{t=2}(d) - Y_{t=2}(0).\]

This extends the binary treatment effect concept to a “dose-response function.” The average treatment effect on the treated at dose \(d\) among units receiving dose \(d'\) is

\[ATT(d | d') = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0) | D = d'].\]

When \(d' = d\), this yields \(ATT(d | d)\), the average effect of dose \(d\) compared to no treatment among units that actually received dose \(d\). This is the natural extension of the binary ATT to the continuous case.

The population-level average treatment effect is

\[ATE(d) = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(0)].\]

Note that \(ATT(d | d)\) and \(ATE(d)\) differ when units select into dose group \(d\) on the basis of their treatment effects. For example, when units with larger treatment effects systematically choose higher doses, \(ATT(d | d) \neq ATE(d)\).

Causal Responses#

The causal response at dose \(d\) measures the effect of a marginal change in the dose. For continuous treatments, the causal response is defined as the derivative of the potential outcome with respect to dose

\[Y'_{t=2}(d) = \lim_{h \to 0^+} \frac{Y_{t=2}(d + h) - Y_{t=2}(d)}{h}.\]

For discrete treatments, the causal response between adjacent doses \(d_j\) and \(d_{j-1}\) is

\[Y_{t=2}(d_j) - Y_{t=2}(d_{j-1}).\]

When treatment is binary, level treatment effects and causal responses coincide, but they do not under a continuous treatment. This distinction has important practical implications since even if all \(ATT(d | d)\) parameters are large and positive, some causal response parameters could be zero or negative.
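A small numeric illustration may help here. The potential outcomes below are hypothetical values for a single unit, chosen only to show that every level effect can be positive while one causal response is negative.

```python
# Hypothetical potential outcomes Y_{t=2}(d) for one unit at doses 0, 1, 2, 3.
potential_outcomes = {0: 10.0, 1: 13.0, 2: 14.0, 3: 13.0}

doses = sorted(potential_outcomes)

# Level effects: Y(d) - Y(0) for each positive dose.
level_effects = {
    d: potential_outcomes[d] - potential_outcomes[0] for d in doses if d > 0
}

# Causal responses: Y(d_j) - Y(d_{j-1}) between adjacent doses.
causal_responses = {
    d: potential_outcomes[d] - potential_outcomes[prev]
    for prev, d in zip(doses[:-1], doses[1:])
}

print(level_effects)     # {1: 3.0, 2: 4.0, 3: 3.0}
print(causal_responses)  # {1: 3.0, 2: 1.0, 3: -1.0}
```

Moving from dose 2 to dose 3 lowers the outcome even though dose 3 is still better than no treatment.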

The average causal response on the treated (ACRT) for continuous treatments is

\[ACRT(d | d') = \left.\frac{\partial ATT(l | d')}{\partial l}\right|_{l=d} = \left.\frac{\partial \mathbb{E}[Y_{t=2}(l) | D = d']}{\partial l}\right|_{l=d}.\]

When \(d' = d\), this gives the average marginal effect of increasing the dose among units at that dose level. Equivalently, \(ACRT(d | d)\) equals the derivative of the \(t = 2\) average potential outcome for units that received dose \(d\), evaluated at \(d\).

The population-level average causal response is

\[ACR(d) = \frac{\partial ATE(d)}{\partial d} = \frac{\partial \mathbb{E}[Y_{t=2}(d)]}{\partial d}.\]

For discrete treatments, the analogous parameters are

\[\begin{split}ACRT(d_j | d_k) &= \mathbb{E}[Y_{t=2}(d_j) - Y_{t=2}(d_{j-1}) | D = d_k], \\ ACR(d_j) &= \mathbb{E}[Y_{t=2}(d_j) - Y_{t=2}(d_{j-1})].\end{split}\]

Summary Parameters#

In practice, researchers often want to aggregate these functional parameters into lower-dimensional summary measures. Natural aggregations use the dose distribution among treated units

\[\begin{split}ATT^o &= \mathbb{E}[ATT(D | D) | D > 0], \quad & ATE^o &= \mathbb{E}[ATE(D) | D > 0], \\ ACRT^o &= \mathbb{E}[ACRT(D | D) | D > 0], \quad & ACR^o &= \mathbb{E}[ACR(D) | D > 0].\end{split}\]

These provide “best” approximations in the sense of minimizing the mean squared distance between the summary parameter and the underlying functional parameters. The parameters \(ACRT^o\) and \(ACR^o\) are average derivative-type parameters, which have been extensively studied in the econometrics literature on efficient estimation.

Identification Assumptions#

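The identification of treatment effect parameters relies on assumptions that restrict how potential outcomes evolve over time across dose groups. Two versions of parallel trends are used.

Assumption 4 (Parallel Trends)

For all \(d \in \mathcal{D}\),

\[\mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = d] = \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = 0].\]

Assumption 5 (Strong Parallel Trends)

For all \(d \in \mathcal{D}\),

\[\mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0)] = \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D = d].\]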

Relationship Between Assumptions#

Assumption 4 (parallel trends) and Assumption 5 (strong parallel trends) are non-nested in general, though Assumption 5 is the stronger requirement in most applications. To see this, note that Assumption 4 restricts only the evolution of \(Y_t(0)\) across dose groups, while Assumption 5 restricts the evolution of \(Y_t(d)\) for each \(d \in \mathcal{D}\).

When maintained jointly with Assumption 4, Assumption 5 can be understood as a structural assumption that allows extrapolation of treatment effects, ensuring that the treatment effects of dose \(d\) among dose group \(d\) equal the treatment effects of dose \(d\) for the entire population.

Identification Results#

Which parameters can be recovered from the data depends on the strength of the parallel trends assumption. Standard parallel trends (PT, Assumption 4) identifies level effects but not causal responses. Strong parallel trends (SPT, Assumption 5) identifies both.
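A short derivation from the definitions above sketches why. It takes parallel trends to mean that average untreated outcome changes are equal across dose groups. By Assumption 3,

\[\mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = 0] = ATT(d | d) + \Big(\mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = d] - \mathbb{E}[Y_{t=2}(0) - Y_{t=1}(0) | D = 0]\Big),\]

and the term in parentheses is zero under parallel trends, so the left-hand side identifies \(ATT(d | d)\). Differentiating with respect to \(d\), however, changes both the dose and the group being evaluated,

\[\frac{\partial \mathbb{E}[\Delta Y | D = d]}{\partial d} = ACRT(d | d) + \left.\frac{\partial ATT(d | l)}{\partial l}\right|_{l=d},\]

where the second term is a selection bias. Under strong parallel trends the same difference in means equals \(ATE(d)\), which depends on the dose alone, so its derivative is \(ACR(d)\).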

The Case Without Untreated Units#

In some applications, all units receive some positive amount of treatment. Without untreated units, it is infeasible to directly recover \(ATT(d | d)\) or \(ATE(d)\). However, a natural alternative is to compare dose group \(d\) to dose group \(d_L\) (the lowest dose).

Under parallel trends, when there are no untreated units,

\[\mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = d_L] = ATT(d | d) - ATT(d_L | d_L).\]

This comparison is related to underlying causal parameters, but the right-hand side mixes together the average causal response of moving from \(d_L\) to \(d\) with selection bias.

Under strong parallel trends,

\[\mathbb{E}[\Delta Y | D = d] - \mathbb{E}[\Delta Y | D = d_L] = ATE(d) - ATE(d_L) = \mathbb{E}[Y_{t=2}(d) - Y_{t=2}(d_L)],\]

which has a clean causal interpretation without selection bias.

What Does TWFE Estimate with a Continuous Treatment?#

The negative weighting problems of TWFE in binary staggered settings are well documented (see DiD with Multiple Time Periods). With a continuous treatment, TWFE has additional problems that are specific to the dose variation. The coefficient \(\hat{\beta}^{TWFE}\) from regressing \(\Delta Y\) on \(D\) admits several different decompositions, none of which cleanly recovers a single well-defined causal parameter.

Causal response decomposition. Under parallel trends, \(\hat{\beta}^{TWFE}\) estimates a weighted average of \(ACRT(d \mid d)\) across doses, with positive weights that integrate to one. However, it also includes a selection bias term. Even if the weights are well-behaved, the estimand conflates causal responses with differential selection into dose groups. Under strong parallel trends the selection bias vanishes, but the weights still do not match the dose distribution among treated units. The TWFE-implied weights are concentrated around the mean dose and underweight the tails.

Level effects decomposition. Under parallel trends, \(\hat{\beta}^{TWFE}\) can also be written as a weighted average of \(ATT(d \mid d)\) values, but with weights that integrate to zero rather than one and that can be negative. TWFE implicitly treats above-average doses as “treated” and below-average doses as part of the “comparison group,” which produces a Wald-type estimand that divides the difference in outcome changes by the difference in doses. This means TWFE does not estimate any recognizable average of level treatment effects.

Implications. Even when outcome changes are linear in dose (which eliminates the weighting issues), selection bias persists under parallel trends. And even under strong parallel trends (which eliminates selection bias), TWFE’s implicit weighting scheme does not match the dose distribution. The same TWFE coefficient has multiple interpretations depending on which decomposition one adopts, none of which corresponds to a parameter a researcher would deliberately target. This motivates using the explicitly targeted estimators described below for \(ATT^o\) and \(ACR^o\).
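The gap between the TWFE coefficient and a targeted summary can be seen in a small simulation. The sketch below uses hypothetical noiseless data in which strong parallel trends holds by construction (no untreated trend, and every unit has \(ATE(d) = d^2\)). It uses only NumPy and does not call the module's own API.

```python
import numpy as np

# Hypothetical noiseless data: 50 untreated units plus an evenly spaced dose grid.
# Assumption for illustration: untreated outcomes do not trend and ATE(d) = d**2.
dose = np.concatenate([np.zeros(50), np.linspace(0.5, 3.0, 50)])
delta_y = dose ** 2

# With two periods, TWFE reduces to the OLS slope of delta_y on dose.
beta_twfe = np.cov(dose, delta_y, bias=True)[0, 1] / np.var(dose)

# Targeted summary: ACR(d) = 2d averaged over the treated dose distribution.
treated = dose > 0
acr_o = np.mean(2.0 * dose[treated])

print(f"TWFE slope: {beta_twfe:.3f}, ACR^o: {acr_o:.3f}")
```

Even with no selection bias and no noise, the two numbers differ because the regression slope applies its own implicit weights rather than the dose distribution among treated units.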

Estimation Methods#

Given the identification results above, this section describes estimation procedures that target well-defined causal parameters.

Discrete Treatments#

When the treatment is multi-valued discrete, estimation is simple. Regressing outcome changes on a saturated set of dose indicators with untreated units as the omitted category,

\[\Delta Y_i = \beta_0 + \sum_{j=1}^{J} \mathbf{1}\{D_i = d_j\} \beta_j + \varepsilon_i,\]

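yields OLS coefficients \(\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_J)'\) that consistently estimate \(ATT(d_j | d_j)\) under parallel trends. Under strong parallel trends, each \(\widehat{\beta}_j\) estimates \(ATE(d_j)\), and \(\widehat{\beta}_j - \widehat{\beta}_{j-1}\) estimates \(ACR(d_j)\) for \(j \ge 2\). For \(j = 1\), taking \(d_0 = 0\), \(\widehat{\beta}_1\) itself estimates \(ACR(d_1)\), the response to moving from no treatment to the lowest dose.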
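The saturated regression can be sketched with NumPy alone. The data below are hypothetical, and the code is an illustration of the regression above rather than the module's implementation. It also checks that each coefficient equals a difference in group means.

```python
import numpy as np

# Hypothetical data: doses in {0, 1, 2} and observed outcome changes.
dose = np.array([0, 0, 0, 1, 1, 2, 2, 2])
delta_y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])

positive_doses = np.unique(dose[dose > 0])

# Design matrix: intercept plus one indicator per positive dose
# (untreated units are the omitted category).
X = np.column_stack(
    [np.ones_like(delta_y)] + [(dose == d).astype(float) for d in positive_doses]
)
coef, *_ = np.linalg.lstsq(X, delta_y, rcond=None)
beta = dict(zip(positive_doses.tolist(), coef[1:]))

# In a saturated model each coefficient is a difference in group means.
for d in positive_doses.tolist():
    diff = delta_y[dose == d].mean() - delta_y[dose == 0].mean()
    print(d, round(float(beta[d]), 6), round(float(diff), 6))
```

Because the model is saturated, nothing is imposed on the shape of the dose-response relationship. Each dose group is simply compared with the untreated group.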

Continuous Treatments - Sieve Estimation#

For continuous treatments, the module provides sieve-based estimation using B-spline basis functions. Consider regression specifications of the form

\[\Delta Y_i = \sum_{k=1}^{K} \psi_{Kk}(D_i) \beta_{Kk} + \varepsilon_i,\]

where \(\psi^K(d) = (\psi_{K1}(d), \ldots, \psi_{KK}(d))'\) is a \(K\)-dimensional vector of B-spline basis functions (including an intercept), \(\beta_K = (\beta_{K1}, \ldots, \beta_{KK})'\) is a vector of unknown parameters, and \(\varepsilon_i\) is an idiosyncratic error term.

The OLS estimator is

\[\widehat{\beta}_K = \mathbb{E}_n\Big[\mathbf{1}\{D > 0\} \psi^K(D) \psi^K(D)' \Big]^{-} \mathbb{E}_n\Big[\mathbf{1}\{D > 0\} \psi^K(D) (\Delta Y - \mathbb{E}_n[\Delta Y | D = 0])\Big],\]

where for a given matrix \(A\), \(A^{-}\) denotes the Moore-Penrose inverse, and

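\[\mathbb{E}_n[B | D > 0] = \frac{\sum_{i=1}^n \mathbf{1}\{D_i > 0\} B_i}{ \sum_{i=1}^n \mathbf{1}\{D_i > 0\}},\]

with \(\mathbb{E}_n[B | D = 0]\) defined analogously using \(\mathbf{1}\{D_i = 0\}\) in place of \(\mathbf{1}\{D_i > 0\}\).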

The estimators for the dose-response function and its derivative are

\[\widehat{ATE}_K(d) = (\psi^K(d))' \widehat{\beta}_K, \quad \widehat{ACR}_K(d) = (\partial \psi^K(d))' \widehat{\beta}_K,\]

where

\[\partial \psi^K(d) = \left(\frac{d\psi_{K1}(d)}{dd}, \ldots, \frac{d\psi_{KK}(d)}{dd}\right)'\]

contains the derivatives of the basis functions.

The user controls the spline degree and number of interior knots, allowing flexible modeling of the dose-response relationship. With degree=3 and num_knots=0 (the default), this fits a global cubic polynomial.
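The estimator above can be sketched for the global cubic case. The code below is a minimal illustration, not the module's implementation. It uses the power basis \((1, d, d^2, d^3)\), which spans the same function space as a cubic B-spline with no interior knots, and the helper names `cubic_basis`, `cubic_basis_deriv`, and `fit_sieve` are hypothetical.

```python
import numpy as np

def cubic_basis(d):
    """Power basis (1, d, d^2, d^3) standing in for a cubic B-spline with no interior knots."""
    d = np.asarray(d, dtype=float)
    return np.column_stack([np.ones_like(d), d, d**2, d**3])

def cubic_basis_deriv(d):
    """Derivative of each basis function with respect to dose."""
    d = np.asarray(d, dtype=float)
    return np.column_stack([np.zeros_like(d), np.ones_like(d), 2 * d, 3 * d**2])

def fit_sieve(dose, delta_y):
    """Regress delta_y minus the untreated mean on the basis, using treated units only."""
    treated = dose > 0
    y = delta_y[treated] - delta_y[~treated].mean()
    psi = cubic_basis(dose[treated])
    # Moore-Penrose inverse of the Gram matrix, mirroring the estimator's formula.
    return np.linalg.pinv(psi.T @ psi) @ (psi.T @ y)

# Hypothetical noiseless data: common trend of 1.0 plus a dose effect 2d - 0.5d^2.
dose = np.concatenate([np.zeros(20), np.linspace(0.5, 3.0, 30)])
delta_y = 1.0 + 2.0 * dose - 0.5 * dose**2

beta_hat = fit_sieve(dose, delta_y)
grid = np.array([1.0, 2.0])
ate_hat = cubic_basis(grid) @ beta_hat        # true ATE at these doses: 1.5 and 2.0
acr_hat = cubic_basis_deriv(grid) @ beta_hat  # true ACR at these doses: 1.0 and 0.0
print(ate_hat, acr_hat)
```

Subtracting the untreated mean before the regression is what turns the fitted curve into an effect relative to no treatment. The derivative estimate reuses the same coefficients with the differentiated basis.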

Data-Driven Nonparametric Estimation (CCK)#

For fully nonparametric estimation without arbitrary tuning parameter choices, the module implements the data-driven sieve estimator of Chen, Christensen, and Kankanala (2024). This approach uses dyadic cubic B-splines with adaptive selection of the sieve dimension.

Let \(\mathcal{K} = \{(2^k + 3) : k \in \mathbb{N} \cup \{0\}\}\) be the set of candidate sieve dimensions. The data-driven choice \(\widehat{K}\) uses a Lepskii-type selection procedure. The key idea is to select the smallest candidate dimension whose estimated \(ATE_K(d)\) curve is not “statistically different” from the curves estimated at every larger candidate dimension.

Algorithm (Data-Driven Sieve Dimension Selection)

  1. Compute the data-driven index set of sieve dimensions

    \[\widehat{\mathcal{K}} = \left\{K \in \mathcal{K} : 0.1(\log \widehat{K}_{ \max})^2 \le K \le \widehat{K}_{\max}\right\},\]

    where \(\widehat{K}_{\max} = \min\{K \in \mathcal{K} : K\sqrt{\log K} v_n \le 10\sqrt{n} < K^+\sqrt{\log K^+} v_n\}\) with \(v_n = \max\{1, (0.1 \log n)^4\}\) and \(K^+ = \min\{k \in \mathcal{K} : k > K\}\).

  2. For bootstrap draws \(\{\omega_i\}_{i=1}^n\) (iid standard normal, independent of the data), compute the sup-t statistic

    \[\sup_{(d, K, K_2) \in \mathcal{D}_{+}^c \times \widehat{\mathcal{K}} \times \widehat{\mathcal{K}} : K_2 > K} \left|\mathbb{Z}_n^*(d, K, K_2)\right|,\]

    where \(\mathbb{Z}_n^*(d, K, K_2)\) is a normalized bootstrap process comparing estimators at different sieve dimensions. Let \(\gamma_{1- \widehat{\alpha}}^*\) denote the \((1 - \widehat{\alpha})\) quantile of this statistic across bootstrap draws.

  3. The data-driven choice is

    \[\begin{split}\widehat{K} = \inf\Bigg\{K \in \widehat{\mathcal{K}} : \sup_{\substack{(d, K_2) \in \mathcal{D}_{+}^c \times \widehat{\mathcal{K}} \\ K_2 > K}} \frac{\sqrt{n}|\widehat{ATE}_K(d) - \widehat{ATE}_{K_2}(d)|}{ \widehat{\sigma}_{K,K_2}(d)} \le 1.1 \gamma_{1-\widehat{\alpha}}^*\Bigg\}.\end{split}\]

The intuition is that if increasing \(K\) leads to a statistically different estimate of \(ATE_K(d)\), then it is “worth it” to increase the dimension. This is how the algorithm trades off bias and variance.
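Step 1 of the algorithm is purely arithmetic and can be sketched directly from the displayed formulas. The function below is a hypothetical helper, not part of the module, and it covers only the candidate grid, not the bootstrap in steps 2 and 3. The cap `max_exponent` is an assumption added to keep the candidate list finite.

```python
import math

def sieve_dimension_grid(n, max_exponent=30):
    """Return (K_max, index set) from the step 1 formulas for sample size n."""
    # Dyadic candidate set {2^k + 3} = {4, 5, 7, 11, 19, ...}.
    candidates = [2**k + 3 for k in range(max_exponent + 1)]
    v_n = max(1.0, (0.1 * math.log(n)) ** 4)
    bound = 10.0 * math.sqrt(n)

    def scale(k):
        return k * math.sqrt(math.log(k)) * v_n

    # K_max is the candidate whose scaled size is within the bound
    # while the next candidate up exceeds it.
    k_max = None
    for k, k_next in zip(candidates[:-1], candidates[1:]):
        if scale(k) <= bound < scale(k_next):
            k_max = k
            break
    if k_max is None:
        raise ValueError("no candidate satisfies the K_max condition")

    lower = 0.1 * math.log(k_max) ** 2
    index_set = [k for k in candidates if lower <= k <= k_max]
    return k_max, index_set

k_max, index_set = sieve_dimension_grid(1000)
print(k_max, index_set)  # 131 [4, 5, 7, 11, 19, 35, 67, 131]
```

For \(n = 1000\) the lower bound \(0.1 (\log \widehat{K}_{\max})^2\) is below the smallest candidate, so every dyadic dimension up to \(\widehat{K}_{\max}\) enters the comparison.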

Convergence Rates and Confidence Bands#

The data-driven estimators achieve the minimax rate for estimating \(ATE(d)\) and \(ACR(d)\) in sup-norm. Let \(\mathcal{H}^p\) denote the Hölder ball of smoothness \(p\), with \(p \in [\underline{p}, \bar{p}]\) and \(\bar{p} > \underline{p} > 0.5\). Under appropriate regularity conditions, the following convergence results hold.

For level effects, there exists a universal constant \(C_1 > 0\) for which

\[\sup_{p \in [\underline{p}, \bar{p}]} \sup_{ATE(\cdot) \in \mathcal{H}^p} \mathbb{P}_{ATE}\Bigg(\sup_{d \in \mathcal{D}_{+}^c} |(\widehat{ATE}_{ \widehat{K}} - ATE)(d)| > C_1 \left(\frac{\log n}{n}\right)^{\frac{p}{2p+1}} \Bigg) \to 0.\]

For derivatives, when \(\underline{p} > 1\), there exists a universal constant \(C_1' > 0\) for which

\[\sup_{p \in [\underline{p}, \bar{p}]} \sup_{ATE(\cdot) \in \mathcal{H}^p} \mathbb{P}_{ATE}\Bigg(\sup_{d \in \mathcal{D}_{+}^c} |(\widehat{ACR}_{ \widehat{K}} - ACR)(d)| > C_1' \left(\frac{\log n}{n}\right)^{\frac{p-1}{2p+1}} \Bigg) \to 0.\]

The convergence rates \((\log n / n)^{p/(2p+1)}\) for level effects and \((\log n / n)^{(p-1)/(2p+1)}\) for derivatives are the minimax rates for estimating functions in Hölder balls under sup-norm loss. As expected, the derivative estimator converges more slowly.

Uniform Confidence Bands. The module provides data-driven uniform confidence bands (UCBs) that are both honest (asymptotically correct coverage uniformly over the class of admissible dose-response functions) and adaptive (contract at the minimax rate). For \(ATE(d)\),

\[C_n(d) = \Bigg[\widehat{ATE}_{\widehat{K}}(d) - (z_{1-\alpha}^* + \widehat{A}\gamma_{1-\widehat{\alpha}}^*) \frac{\widehat{\sigma}_{\widehat{K}}(d)} {\sqrt{n}}, \; \widehat{ATE}_{\widehat{K}}(d) + (z_{1-\alpha}^* + \widehat{A}\gamma_{1-\widehat{\alpha}}^*) \frac{\widehat{\sigma}_{\widehat{K}}(d)} {\sqrt{n}}\Bigg],\]

where \(z_{1-\alpha}^*\) is the \((1-\alpha)\) quantile of a bootstrap sup-t statistic and \(\widehat{A} = \log \log \widehat{K}\) inflates critical values to account for potential bias.
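Given the ingredients, assembling the band is a one-line computation per evaluation point. The sketch below assumes the point estimates, standard errors, and both bootstrap critical values have already been computed elsewhere. The function name `uniform_band` and all input values are hypothetical.

```python
import math

def uniform_band(ate_hat, sigma_hat, n, k_hat, z_star, gamma_star):
    """Return (lower, upper) lists for the band at each evaluation point."""
    a_hat = math.log(math.log(k_hat))  # inflation factor log log K-hat
    crit = z_star + a_hat * gamma_star
    half = [crit * s / math.sqrt(n) for s in sigma_hat]
    lower = [a - h for a, h in zip(ate_hat, half)]
    upper = [a + h for a, h in zip(ate_hat, half)]
    return lower, upper

# Hypothetical inputs for two evaluation points.
lower, upper = uniform_band(
    ate_hat=[1.5, 2.0],
    sigma_hat=[2.0, 4.0],
    n=400,
    k_hat=11,
    z_star=2.5,
    gamma_star=3.0,
)
print(lower, upper)
```

The same critical value is applied at every dose, so the band's width varies only through \(\widehat{\sigma}_{\widehat{K}}(d)\).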

Summary Parameter Estimation#

The full dose-response curve is informative but can be hard to summarize. Two scalar summary parameters distill the curve into single numbers that are easy to report and compare.

Binarized DiD for \(ATT^o\)#

The summary parameter \(ATT^o\) is estimated by a simple regression

\[\Delta Y_i = \beta_0^{bin} + D_i^{>0} \beta^{bin} + \epsilon_i,\]

where \(D_i^{>0} = \mathbf{1}\{D_i > 0\}\). The OLS coefficient \(\widehat{\beta}^{bin}\) consistently estimates \(ATT^o\) under parallel trends (or \(ATE^o\) under strong parallel trends).
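The binarized regression can be sketched as follows. The data are hypothetical and the code uses NumPy only. With a single binary regressor the OLS slope equals the difference in mean outcome changes between treated and untreated units.

```python
import numpy as np

# Hypothetical doses and outcome changes.
dose = np.array([0.0, 0.0, 0.0, 0.5, 1.2, 2.0, 3.1])
delta_y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0, 8.0])

# Binarize the dose and regress delta_y on an intercept and the indicator.
treated = (dose > 0).astype(float)
X = np.column_stack([np.ones_like(treated), treated])
(beta0_bin, beta_bin), *_ = np.linalg.lstsq(X, delta_y, rcond=None)

# Equivalent difference in means.
diff_in_means = delta_y[dose > 0].mean() - delta_y[dose == 0].mean()
print(beta_bin, diff_in_means)
```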

Average Causal Response#

The summary parameter \(ACR^o\) is estimated using the plug-in principle

\[\widehat{ACR}^o = \mathbb{E}_n[\widehat{ACR}_{\widehat{K}}(D) | D > 0] = \frac{1}{n_{D>0}} \sum_{i : D_i > 0} \widehat{ACR}_{\widehat{K}}(D_i),\]

where \(n_{D>0} = \sum_{i=1}^n \mathbf{1}\{D_i > 0\}\).

Under appropriate regularity conditions, the estimator is asymptotically normal,

\[\sqrt{n_{D>0}} \frac{(\widehat{ACR}^o - ACR^o)}{\widehat{\sigma}_{ACR^o}} \xrightarrow{d} \mathcal{N}(0, 1),\]

where \(\widehat{\sigma}_{ACR^o}^2 \xrightarrow{p} V_{ACR}\) with \(V_{ACR}\) being the semiparametric efficiency bound

\[V_{ACR} = \text{Var}\Bigg[ACR(D) - (\Delta Y - \mathbb{E}[\Delta Y | D, D > 0]) \frac{f'_{D|D>0}(D)}{f_{D|D>0}(D)} \,\Big|\, D > 0\Bigg].\]
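The plug-in step can be sketched by averaging an estimated derivative curve over the treated doses. The example below is an illustration under stated assumptions: it substitutes a cubic polynomial fitted with `np.polyfit` for the data-driven sieve, uses hypothetical noiseless data, and does not compute the standard error.

```python
import numpy as np

# Hypothetical noiseless data: trend 0.5, ATE(d) = d^2, so ACR(d) = 2d.
dose = np.concatenate([np.zeros(10), np.linspace(0.5, 2.5, 21)])
delta_y = 0.5 + dose**2

treated = dose > 0
y = delta_y[treated] - delta_y[~treated].mean()

# A cubic least-squares fit stands in for the selected sieve.
coefs = np.polyfit(dose[treated], y, deg=3)
acr_curve = np.polyder(np.poly1d(coefs))

# Plug-in: average the estimated derivative over the treated doses.
acr_o_hat = acr_curve(dose[treated]).mean()
print(acr_o_hat)
```

Here the true value is twice the mean treated dose, so the estimate should be close to 3.0.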

Extensions to Staggered Adoption#

The methodology extends to settings with multiple time periods and variation in treatment timing. Let \(G_i\) denote the time period when unit \(i\) first receives a positive dose, with \(G_i = \infty\) for never-treated units. The potential outcomes are indexed by both timing and dose, \(Y_{i,t}(g, d)\).

The group-time-dose average treatment effect is

\[ATE(g, t, d) = \mathbb{E}[Y_t(g, d) - Y_t(0) | G = g],\]

which measures the average effect in period \(t\) of becoming treated in period \(g\) with dose \(d\), among units in timing group \(g\).

Under a multi-period version of strong parallel trends, this is identified as

\[ATE(g, t, d) = \mathbb{E}[Y_t - Y_{g-1} | G = g, D = d] - \mathbb{E}[Y_t - Y_{g-1} | G = \infty, D = 0].\]

The expression involves “long differences” in outcomes from period \(g - 1\) (the last period before treatment) to \(t\). Not-yet-treated units can also be used as a comparison group.
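The long-difference comparison can be sketched on a tiny panel. The data and the helper `long_difference_ate` are hypothetical, and conditioning on an exact dose value is a simplification that suits a discrete dose. A continuous dose would need the sieve methods described earlier.

```python
import numpy as np

# Hypothetical balanced panel: rows are units, columns are periods 1..4.
outcomes = np.array([
    [1.0, 2.0, 3.0, 4.0],    # never treated
    [2.0, 3.0, 4.0, 5.0],    # never treated
    [1.0, 2.0, 6.0, 8.0],    # first treated in period 3, dose 2.0
    [3.0, 4.0, 8.0, 10.0],   # first treated in period 3, dose 2.0
])
group = np.array([np.inf, np.inf, 3, 3])
dose = np.array([0.0, 0.0, 2.0, 2.0])

def long_difference_ate(g, t, d):
    """Long difference from period g-1 to t, treated cell minus never-treated."""
    base, post = g - 2, t - 1  # zero-based columns for periods g-1 and t
    change = outcomes[:, post] - outcomes[:, base]
    treated = (group == g) & (dose == d)
    comparison = np.isinf(group) & (dose == 0)
    return change[treated].mean() - change[comparison].mean()

print(long_difference_ate(3, 3, 2.0))  # 3.0
print(long_difference_ate(3, 4, 2.0))  # 4.0
```

Both estimates use period 2 as the base period, so the second one measures the effect after two periods of exposure.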

Aggregation Strategies#

The high-dimensional \(ATE(g, t, d)\) parameters can be aggregated in two main ways.

Dose Aggregation. Averaging across timing groups and time periods yields dose-response functions

\[ATE^{dose}(d), \quad ACR^{dose}(d),\]

which highlight heterogeneity across different dose levels. These are analogous to \(ATE(d)\) and \(ACR(d)\) in the two-period case.

Event-Study Aggregation. Averaging across doses while keeping event-time structure yields

\[ATT^{es}(e), \quad ACR^{es}(e),\]

where \(e = t - g\) is the time since treatment. These highlight how treatment effects and causal responses evolve with length of exposure.

Pre-treatment event-study estimates (\(e < 0\)) can be used to assess the plausibility of the identifying assumptions. Estimates of \(ATT^{es}(e)\) for \(e < 0\) that are close to zero are consistent with untreated outcome paths being parallel across dose groups (standard PT).

Tip

To assess strong parallel trends specifically, examine \(ACR^{es}(e)\) for \(e < 0\), which checks whether the dose-response relationship is stable in the pre-treatment period. Nonzero pre-treatment \(ACR^{es}\) estimates provide evidence against SPT that \(ATT^{es}\) pre-trends cannot detect, since the additional content of SPT involves treated potential outcomes \(Y_t(d)\) for \(d > 0\).

Note

For complete theoretical details including formal assumptions, asymptotic properties, and efficiency results, refer to Callaway, Goodman-Bacon, and Sant’Anna (2024). The nonparametric estimation procedures build on Chen, Christensen, and Kankanala (2024).