moderndid.core.panel.deduplicate_panel
- moderndid.core.panel.deduplicate_panel(data: Any, idname: str, tname: str, strategy: str = 'last') → Any
Remove duplicate unit-time pairs.
Duplicate unit-time rows cause hard errors during the preprocessing pipeline because the data cannot be unambiguously reshaped or differenced. Run diagnose_panel first to see how many duplicates exist, then call this function to resolve them before estimation.
- Parameters:
- data
DataFrame Panel data. Accepts any object implementing the Arrow PyCapsule Interface (__arrow_c_stream__), including polars, pandas, pyarrow Table, and cudf DataFrames.
- idname
str Unit identifier column.
- tname
str Time period column.
- strategy
"first" | "last" | "mean" How to resolve duplicates. "mean" averages numeric columns and keeps the first value for non-numeric columns.
- Returns:
DataFrame Deduplicated panel in the same format as data.
- Raises:
ValueError If strategy is not one of "first", "last", "mean".
See also
diagnose_panel Detect duplicates before removing them.
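To make "duplicate unit-time pair" concrete, here is a minimal pure-Python sketch that counts repeated (unit, time) keys in a list of row dictionaries. It does not call diagnose_panel, whose signature and output are not shown on this page; the helper name count_duplicate_pairs and the toy rows are hypothetical and serve only as an illustration.

```python
from collections import Counter


def count_duplicate_pairs(rows, idname, tname):
    """Return {(unit, time): n_rows} for every pair that occurs more than once.

    Hypothetical helper for illustration; not part of moderndid.
    """
    counts = Counter((row[idname], row[tname]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy panel: county 1 has two rows for year 2000.
rows = [
    {"county": 1, "year": 2000, "y": 1.0},
    {"county": 1, "year": 2000, "y": 3.0},
    {"county": 1, "year": 2001, "y": 2.0},
    {"county": 2, "year": 2000, "y": 5.0},
]

dups = count_duplicate_pairs(rows, "county", "year")
print(dups)
```

Any non-empty result here corresponds to the situation this function is meant to resolve.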
Examples
In [1]: import polars as pl
   ...: from moderndid import deduplicate_panel, load_favara_imbs
   ...:
   ...: df = load_favara_imbs()
   ...: df_with_dups = pl.concat([df, df.head(5)])
   ...: deduped = deduplicate_panel(df_with_dups, idname="county", tname="year")
   ...: print(f"Before: {df_with_dups.shape[0]} rows, After: {deduped.shape[0]} rows")
   ...:
Before: 12543 rows, After: 12538 rows
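The three strategies described under Parameters can be sketched in plain Python on a list of row dictionaries. This is an illustration of the documented semantics only, not the library's implementation: the function name dedupe_rows, the toy rows, and the choice to return groups in order of first appearance are all assumptions made for this sketch.

```python
from numbers import Number
from statistics import fmean

VALID_STRATEGIES = ("first", "last", "mean")


def dedupe_rows(rows, idname, tname, strategy="last"):
    """Collapse rows that share the same (idname, tname) pair.

    Hypothetical stand-in that mirrors the documented behaviour.
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"strategy must be one of {VALID_STRATEGIES}, got {strategy!r}")

    # Group rows by their unit-time key, preserving input order.
    groups = {}
    for row in rows:
        groups.setdefault((row[idname], row[tname]), []).append(row)

    result = []
    for group in groups.values():
        if strategy == "first":
            result.append(dict(group[0]))
        elif strategy == "last":
            result.append(dict(group[-1]))
        else:
            # "mean": average numeric columns, keep the first value otherwise.
            merged = dict(group[0])
            for col, value in group[0].items():
                if col in (idname, tname):
                    continue
                if isinstance(value, Number) and not isinstance(value, bool):
                    merged[col] = fmean(r[col] for r in group)
            result.append(merged)
    return result


rows = [
    {"county": 1, "year": 2000, "y": 1.0, "state": "A"},
    {"county": 1, "year": 2000, "y": 3.0, "state": "B"},
    {"county": 2, "year": 2000, "y": 5.0, "state": "C"},
]

for strategy in VALID_STRATEGIES:
    print(strategy, dedupe_rows(rows, "county", "year", strategy=strategy))
```

With "mean", the duplicated county-year pair ends up with the average of y and the first state value, which matches the description of the strategy parameter above.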