moderndid.core.panel.deduplicate_panel

moderndid.core.panel.deduplicate_panel(data: Any, idname: str, tname: str, strategy: str = 'last') -> Any

Remove duplicate unit-time pairs.

Duplicate unit-time rows cause hard errors during the preprocessing pipeline because the data cannot be unambiguously reshaped or differenced. Run diagnose_panel first to see how many duplicates exist, then call this function to resolve them before estimation.
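To make "duplicate unit-time pairs" concrete, here is a minimal pure-Python sketch that counts them in a toy list of rows. It is not the diagnose_panel API: the helper count_duplicate_pairs and the toy data are hypothetical and only illustrate what is being counted.

```python
from collections import Counter

# Hypothetical toy rows; "county" and "year" stand in for idname and tname.
rows = [
    {"county": 1, "year": 2000, "y": 1.0},
    {"county": 1, "year": 2001, "y": 2.0},
    {"county": 1, "year": 2001, "y": 4.0},  # second row for the pair (1, 2001)
    {"county": 2, "year": 2000, "y": 3.0},
]


def count_duplicate_pairs(rows, idname, tname):
    """Return {(unit, time): n_rows} for every pair that appears more than once."""
    counts = Counter((row[idname], row[tname]) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


duplicates = count_duplicate_pairs(rows, "county", "year")
print(duplicates)  # {(1, 2001): 2}
```

Any pair reported here has more than one row for the same unit and period, which is the ambiguity this function resolves.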

Parameters:
data : DataFrame

Panel data. Accepts any object implementing the Arrow PyCapsule Interface (__arrow_c_stream__), including polars, pandas, pyarrow Table, and cudf DataFrames.

idname : str

Unit identifier column.

tname : str

Time period column.

strategy : "first" | "last" | "mean", default "last"

How to resolve duplicates. "mean" averages numeric columns and keeps the first value for non-numeric columns.

Returns:
DataFrame

Deduplicated panel in the same format as data.

Raises:
ValueError

If strategy is not one of "first", "last", "mean".

See also

diagnose_panel

Detect duplicates before removing them.

Examples

In [1]: import polars as pl
   ...: from moderndid import deduplicate_panel, load_favara_imbs
   ...: 
   ...: df = load_favara_imbs()
   ...: df_with_dups = pl.concat([df, df.head(5)])
   ...: deduped = deduplicate_panel(df_with_dups, idname="county", tname="year")
   ...: print(f"Before: {df_with_dups.shape[0]} rows, After: {deduped.shape[0]} rows")
   ...: 
Before: 12543 rows, After: 12538 rows
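The three strategies can also be illustrated without the library. The following is a pure-Python sketch of the documented semantics on a list of dictionaries, not moderndid's implementation. The helper deduplicate_rows is hypothetical, and two details are assumptions: "first" and "last" refer to row order in the input, and the key columns are left out of the averaging.

```python
from numbers import Number


def deduplicate_rows(rows, idname, tname, strategy="last"):
    """Illustrative stand-in for the documented behaviour (hypothetical helper)."""
    if strategy not in ("first", "last", "mean"):
        raise ValueError(
            f"strategy must be 'first', 'last', or 'mean', got {strategy!r}"
        )

    # Group rows by (unit, time), keeping pairs in first-seen order.
    groups = {}
    for row in rows:
        groups.setdefault((row[idname], row[tname]), []).append(row)

    result = []
    for group in groups.values():
        if strategy == "first":
            result.append(dict(group[0]))
        elif strategy == "last":
            result.append(dict(group[-1]))
        else:
            # "mean": start from the first row so non-numeric columns keep
            # their first value, then average the numeric columns.
            merged = dict(group[0])
            for col, value in group[0].items():
                if col in (idname, tname):
                    continue  # assumption: key columns are never averaged
                if isinstance(value, Number) and not isinstance(value, bool):
                    merged[col] = sum(r[col] for r in group) / len(group)
            result.append(merged)
    return result


# Hypothetical toy data with one duplicated pair, (1, 2001).
rows = [
    {"county": 1, "year": 2001, "y": 2.0, "state": "A"},
    {"county": 1, "year": 2001, "y": 4.0, "state": "B"},
    {"county": 2, "year": 2000, "y": 3.0, "state": "C"},
]

first = deduplicate_rows(rows, "county", "year", strategy="first")
last = deduplicate_rows(rows, "county", "year", strategy="last")
mean = deduplicate_rows(rows, "county", "year", strategy="mean")
```

On this toy input, first keeps y = 2.0, last keeps y = 4.0, and mean yields y = 3.0 with state "A". An unrecognised strategy raises ValueError, mirroring the Raises section above.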