Panel Data Utilities#
ModernDiD’s panel module provides tools for inspecting and
cleaning panel data before estimation. Every estimator has a robust
preprocessing pipeline that automatically handles most panel irregularities,
so these utilities are optional. They are useful when you want to understand
what the pipeline is doing under the hood, or when you want to make cleaning
decisions yourself rather than relying on the defaults.
Like the estimators, every panel utility function accepts any Arrow-compatible DataFrame, converts to Polars internally for speed, and returns results in your original dataframe format.
Diagnosing the Data#
diagnose_panel gives you a quick summary of
the panel’s structure before you hand it to an estimator. Here we load
the Favara and Imbs (2015)
banking-deregulation dataset, a county-level panel that, like many real
datasets, is not perfect.
import moderndid as did
data = did.load_favara_imbs()
diag = did.diagnose_panel(data,
idname="county",
tname="year",
treatname="inter_bra")
print(diag)
==========================================================================================
Panel Diagnostics
==========================================================================================
┌───────────────────────────┬───────┐
│ Metric │ Value │
├───────────────────────────┼───────┤
│ Units │ 1048 │
│ Periods │ 12 │
│ Observations │ 12538 │
│ Balanced │ No │
│ Duplicate unit-time pairs │ 0 │
│ Unbalanced units │ 5 │
│ Gaps │ 38 │
│ Rows with missing values │ 524 │
│ Single-period units │ 1 │
│ Early-treated units │ 0 │
│ Treatment time-varying │ Yes │
└───────────────────────────┴───────┘
------------------------------------------------------------------------------------------
Suggestions
------------------------------------------------------------------------------------------
Call fill_panel_gaps() to fill 38 missing unit-time pairs
Call make_balanced_panel() to drop 5 units not observed in all periods
524 rows contain missing values and will be dropped during preprocessing
Call complete_data() or make_balanced_panel() to drop 1 units observed in only one period
Treatment varies within units — verify this is expected or call get_group()
==========================================================================================
A balanced 1048 x 12 panel would have 12,576 observations, but we only
have 12,538. The report shows that 5 counties are not observed in every
year, creating 38 missing county-year pairs. It also flags 524 rows with
missing values that the preprocessing pipeline will silently drop, and
one county observed in only a single year. The inter_bra column
changes within counties over time. That is expected here because
interstate branching deregulation rolls out at different dates, but
exactly the kind of thing you want to catch early if your treatment is
supposed to be time-invariant.
You could pass this data directly to
did_multiplegt and it would work. The
preprocessing pipeline would silently drop the 5 incomplete counties.
The value of running diagnostics first is that you see what gets
dropped and can decide whether that is acceptable for your analysis.
Fixing the Gaps#
If you do want to handle the gaps yourself, the diagnostics suggest a couple of strategies.
fill_panel_gaps keeps every county and fills
the 38 missing county-year pairs with null rows. This preserves as
many units as possible, which is useful when you plan to impute the
missing values or pass the data to an estimator with
allow_unbalanced_panel=True.
filled = did.fill_panel_gaps(data, idname="county", tname="year")
filled.shape
(12576, 7)
The panel is now a full 1048 x 12 rectangle.
make_balanced_panel takes the opposite
approach and drops the 5 incomplete counties entirely. You lose a few
units, but every remaining county is observed in all 12 years with no
nulls. This is what the preprocessing pipeline does by default when
allow_unbalanced_panel=False.
balanced = did.make_balanced_panel(data, idname="county", tname="year")
balanced.shape
(12516, 7)
That gives 1043 counties x 12 years.
If your data had duplicate unit-time pairs, you would need to resolve
those before calling any estimator, since duplicates cause a hard
error in the preprocessing pipeline.
deduplicate_panel handles this by keeping the
last occurrence by default, or can average numeric columns with
strategy="mean".
Building the Group-Timing Variable#
Most ModernDiD estimators take gname as an argument, a column indicating the first
period each unit was treated (0 for never-treated). Many datasets
instead store a raw binary treatment indicator that flips from 0 to 1
when treatment begins. get_group converts
between the two. It looks at when each unit’s treatment first turns on
and writes that period into a new "G" column.
groups = did.get_group(data, idname="county", tname="year", treatname="inter_bra")
groups["G"].unique().sort()
[0, 1995, 1996, 1997, 1998, 2000, 2001]
The output shows six distinct deregulation cohorts plus the
never-treated group (0). This "G" column can be passed directly
to gname in any estimator.
Inspection Helpers#
Several lightweight functions answer common questions about a panel without running full diagnostics.
# Quick boolean checks
did.is_balanced_panel(data, idname="county", tname="year")
did.has_gaps(data, idname="county", tname="year")
# Which columns change within units over time?
did.are_varying(data, idname="county", cols=["inter_bra", "state"])
# {"inter_bra": True, "state": False}
# List the exact missing unit-time pairs
gaps = did.scan_gaps(data, idname="county", tname="year")
complete_data keeps only units observed in at
least min_periods time periods, which is useful for dropping units
with too few observations.
# Keep units observed in at least 10 of 12 periods
trimmed = did.complete_data(data, idname="county", tname="year", min_periods=10)
deduplicate_panel removes duplicate unit-time
pairs. The default keeps the last occurrence; strategy="mean" averages
numeric columns instead.
deduped = did.deduplicate_panel(data, idname="county", tname="year", strategy="last")
Reshaping and Transformations#
These functions convert between panel formats and compute common transformations.
# Pivot long panel to wide (one column per period)
wide = did.panel_to_wide(data, idname="county", tname="year")
# Unpivot wide back to long
long = did.wide_to_panel(wide, idname="county", stub_names=["outcome"], tname="year")
# First-difference the outcome variable (adds a "dy" column)
diffed = did.get_first_difference(data, idname="county", yname="outcome", tname="year")
For repeated cross-section data (no unit tracked over time),
assign_rc_ids adds a unique "rowid" column
that some estimators require.
rc_data = did.assign_rc_ids(data)
Next steps#
Once your data is clean, you are ready to estimate treatment effects.
Quickstart walks through
att_gtestimation, aggregation, and all available options.Estimator Overview surveys additional estimators for continuous treatments, triple differences, and more.