GPU Acceleration with CuPy
ModernDiD can offload numerical operations to NVIDIA GPUs via CuPy. When the GPU backend is active, matrix operations in the two-period doubly robust estimators (weighted least squares, logistic IRLS, influence function computation) and in the continuous treatment estimator (CCK/NPIV path and multiplier bootstrap) run on the GPU using cuBLAS and cuSOLVER, which can substantially reduce runtime for large datasets on powerful GPUs.
Requirements
You need an NVIDIA GPU with CUDA support and a CuPy installation that matches your CUDA toolkit version.
Install the GPU extra:
uv pip install 'moderndid[gpu]'
This installs a CuPy wheel that matches the CUDA version specified in the package metadata. If you need a different CuPy wheel for your CUDA runtime, install it directly and then install ModernDiD without the extra:
uv pip install cupy-cuda11x # example for CUDA 11
uv pip install moderndid
Verify the installation:
import moderndid as did
print(did.HAS_CUPY) # True if CuPy is available
Note
HAS_CUPY only checks whether CuPy can be imported. It does not
verify that a CUDA GPU is present. GPU availability is validated when
you first call set_backend("cupy") or pass backend="cupy" to
an estimator.
Enabling the backend
The GPU backend is opt-in. Pass backend="cupy" to
att_gt, ddd, or
cont_did to run a single call on the GPU. The
backend activates only for that call and reverts automatically when it
returns:
import moderndid as did

result = did.att_gt(
    data=data,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    est_method="dr",
    backend="cupy",
)
For multiple consecutive GPU calls, you can either set the backend
globally or use the use_backend context manager:
# Option 1: global setting
did.set_backend("cupy")
result1 = did.att_gt(...)
result2 = did.ddd(...)
did.set_backend("numpy")  # revert when done

# Option 2: context manager (reverts automatically)
from moderndid import use_backend

with use_backend("cupy"):
    result1 = did.att_gt(...)
    result2 = did.ddd(...)
All three approaches are thread-safe and compose correctly with
n_jobs > 1. When data is a Dask or Spark DataFrame,
backend="cupy" enables GPU-accelerated linear algebra on worker GPUs
(see Combining GPU and Dask and
Combining GPU and Spark).
If CuPy is installed but no GPU is available, backend="cupy"
raises a RuntimeError with an actionable message. If CuPy is not
installed at all, it raises an ImportError.
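If you would rather fall back to the CPU than handle these exceptions, one option is to probe for CuPy and a CUDA device before choosing the backend string. The sketch below is only an illustration: pick_backend is a hypothetical helper, not part of ModernDiD, and it relies solely on the standard library and CuPy's cupy.cuda.runtime.getDeviceCount.

```python
import importlib.util


def pick_backend() -> str:
    """Return "cupy" when CuPy imports and sees a CUDA device, else "numpy".

    Hypothetical helper for illustration; not part of ModernDiD.
    """
    if importlib.util.find_spec("cupy") is None:
        return "numpy"  # CuPy is not installed
    try:
        import cupy as cp

        if cp.cuda.runtime.getDeviceCount() > 0:
            return "cupy"
    except Exception:
        # The import failed or the CUDA runtime reported no usable device
        pass
    return "numpy"


backend = pick_backend()
print(backend)
```

The returned string can then be passed as the backend argument of att_gt, ddd, or cont_did.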
To check the active backend at any point:
xp = did.get_backend()
print(xp.__name__) # "numpy" or "cupy"
What gets accelerated
The GPU backend accelerates the low-level numerical operations inside
the two-period estimators that att_gt and
ddd call for each group-time cell, for both panel
and repeated cross-section data with any est_method. It also
accelerates the continuous treatment estimator
cont_did.
Weighted least squares (reg, dr): Design matrix multiplication, normal equation solve via cuSOLVER, and fitted value computation via cuBLAS.
Logistic IRLS (ipw, dr): Iteratively reweighted least squares for the propensity score model. Each iteration runs sigmoid evaluation, Gram matrix accumulation, and a linear solve on the GPU.
Influence function computation: All matrix algebra in the influence function (inverse Hessians, score products, weighted sums) runs on GPU arrays. Results transfer back to CPU only at function boundaries.
Multiplier bootstrap: Random Mammen weight generation and the batched matrix multiply for bootstrap replication run on the GPU. Draws are batched to stay within a configurable memory budget (1 GB by default) so that large bootstrap runs do not exhaust GPU memory.
Cluster aggregation: Scatter-add operations to aggregate influence functions at the cluster level use GPU kernels.
Continuous treatment CCK/NPIV estimation (cont_did with dose_est_method="cck"): Spline basis construction, regression solves, and derivative computation run on the GPU via cuBLAS.
Continuous treatment bootstrap: The multiplier bootstrap for both the parametric and CCK paths of cont_did uses GPU-accelerated batched matrix multiplication when backend="cupy" is active. The parametric path uses CuPy B-spline basis construction on the GPU but converts the results back to NumPy for the per-group least squares solves, since the per-group matrices are too small to benefit from keeping the full solve on the GPU.
These operations are dominated by dense linear algebra (matrix multiplication, triangular solves) that maps well to GPU hardware. The group-time loop, cell scheduling, and aggregation logic remain on the CPU.
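To make the linear algebra concrete, here is a minimal NumPy sketch of two of the kernels named above: a weighted least squares solve through the normal equations, and logistic IRLS built on top of it. It is an illustration under stated assumptions, not ModernDiD's internal code. The function names wls and logit_irls and the xp argument are hypothetical; xp stands for the array module, so passing cupy together with CuPy arrays would run the same steps on a GPU.

```python
import numpy as np


def wls(X, y, w, xp=np):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    Xw = X * w[:, None]           # scale each row of X by its weight
    gram = Xw.T @ X               # Gram matrix X'WX (matrix multiply)
    rhs = Xw.T @ y                # right-hand side X'Wy
    return xp.linalg.solve(gram, rhs)


def logit_irls(X, d, xp=np, max_iter=25, tol=1e-8):
    """Logistic regression by iteratively reweighted least squares."""
    beta = xp.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + xp.exp(-eta))          # sigmoid evaluation
        w = xp.maximum(p * (1.0 - p), 1e-10)    # IRLS weights, kept away from zero
        z = eta + (d - p) / w                   # working response
        new_beta = wls(X, z, w, xp=xp)          # one weighted solve per iteration
        converged = float(xp.max(xp.abs(new_beta - beta))) < tol
        beta = new_beta
        if converged:
            break
    return beta


rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Outcome regression on noiseless data recovers the coefficients
y = X @ np.array([1.0, 2.0])
w = rng.uniform(0.5, 1.5, size=n)
beta_wls = wls(X, y, w)

# Propensity score model on simulated treatment indicators
true_p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.3, -0.5]))))
d = (rng.uniform(size=n) < true_p).astype(float)
beta_ps = logit_irls(X, d)
```

Each IRLS iteration is a sigmoid evaluation, a Gram matrix product, and a small linear solve, which is why the per-cell cost scales with the number of rows and why large cells benefit most from the GPU.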
Note
cont_did supports GPU acceleration but scaling benchmarks have
not been collected yet.
The intertemporal estimator (did_multiplegt) and
the sensitivity analysis module (honest_did) do
not use the GPU backend. These estimators operate on small matrices
(per-group comparisons and LP constraints respectively) where GPU
kernel launch and data transfer overhead would exceed any computation
benefit.
When it helps
GPU acceleration provides the largest speedups when the per-cell
sample sizes are large enough to saturate the GPU. This typically
means thousands of units per group-time cell, multiple covariates
producing larger design matrices, and doubly robust estimation
(est_method="dr") which runs both outcome regression and propensity
score estimation per cell.
For small datasets (a few hundred units per cell), the overhead of transferring data to and from the GPU can outweigh the computation savings. In those cases, the CPU backend is faster.
The benefit also depends on estimation method. Doubly robust estimation
performs roughly twice as much linear algebra per cell as pure regression
or pure IPW, so the GPU speedup is more pronounced with est_method="dr".
Bootstrap inference multiplies the work by biters, making the GPU
advantage larger when boot=True with many iterations.
How data moves between CPU and GPU
ModernDiD handles data transfer automatically. You do not need to create CuPy arrays yourself.
Input data (Polars or pandas DataFrames) is preprocessed on the CPU as usual.
During the tensor construction step, arrays are transferred to the GPU in bulk using to_device.
All cell-level computation runs on GPU arrays.
Results (ATT estimates, influence functions) are transferred back to the CPU using to_numpy before being stored in the result object.
Because the bulk transfer happens once and results transfer once, the CPU-GPU communication overhead is small relative to the computation.
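The pattern can be sketched as a single round trip. The helpers to_device_sketch and to_numpy_sketch below are hypothetical stand-ins written for this example (the signatures of ModernDiD's own to_device and to_numpy are not shown here), and the code falls back to NumPy when CuPy or a GPU is unavailable so that it runs anywhere.

```python
import numpy as np

try:
    import cupy as cp

    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    cp = None
    xp = np  # CPU fallback so the sketch runs anywhere


def to_device_sketch(arr):
    """Copy a host array to the active array module (GPU when available)."""
    return xp.asarray(arr)


def to_numpy_sketch(arr):
    """Copy a result back to host memory as a NumPy array."""
    return np.asarray(arr) if cp is None else cp.asnumpy(arr)


rng = np.random.default_rng(1)
X_host = rng.normal(size=(1000, 3))
y_host = X_host @ np.array([0.5, -1.0, 2.0])

# 1. One bulk transfer of the prepared arrays
X_dev, y_dev = to_device_sketch(X_host), to_device_sketch(y_host)

# 2. All linear algebra stays on the device
beta_dev = xp.linalg.solve(X_dev.T @ X_dev, X_dev.T @ y_dev)

# 3. One transfer back before the result is stored
beta_host = to_numpy_sketch(beta_dev)
```

Keeping intermediate arrays on the device between steps 1 and 3 is what avoids repeated host-device copies.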
Memory management
ModernDiD ships with RAPIDS Memory Manager (RMM) as part of the [gpu]
extra. When backend="cupy" is activated, ModernDiD automatically
configures CuPy to use RMM’s pool allocator instead of the default
per-allocation cudaMalloc calls. This eliminates the ~1 ms
overhead per GPU allocation that otherwise dominates tight loops such
as the multiplier bootstrap inner loop.
The pool starts empty (initial_pool_size=0) and grows on demand.
Allocations are reused across calls within the same process, so
repeated estimator calls do not pay repeated allocation costs. No user
configuration is required. The pool is initialized the first time
set_backend("cupy") or backend="cupy" is used and remains
active for the rest of the process.
If RMM is not installed (for example, when CuPy is installed manually
without the [gpu] extra), ModernDiD falls back to CuPy’s built-in
memory pool silently.
Advanced pool configuration. If you need to control pool sizing (for example, to share GPU memory with other frameworks), you can initialize RMM yourself before calling any ModernDiD estimator. ModernDiD will detect that RMM is already initialized and skip its own setup:
import cupy as cp
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

# Build a pool with explicit bounds before calling any ModernDiD estimator
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size="2GiB",
    maximum_pool_size="8GiB",
)
rmm.mr.set_current_device_resource(pool)
cp.cuda.set_allocator(rmm_cupy_allocator)

import moderndid as did

result = did.att_gt(..., backend="cupy")
Visible memory usage. RMM’s pool (or CuPy’s built-in pool, when
RMM is not installed) caches allocated GPU memory for reuse rather
than returning it to the OS after each operation. This means
nvidia-smi may show high memory usage even when arrays have been
freed. This is expected behavior and does not indicate a memory leak.
Profiling GPU memory. RMM provides built-in memory statistics and profiling that can help diagnose allocation issues:
import rmm
import rmm.statistics
rmm.statistics.enable_statistics()
result = did.att_gt(..., backend="cupy")
print(rmm.statistics.get_statistics())
If a regression or IRLS solve exhausts GPU memory, ModernDiD raises a
MemoryError with a message suggesting you reduce the problem size
or switch back to backend='numpy'. The bootstrap implementation
batches draws to stay within a 1 GB GPU allocation per batch, but very
large influence function matrices can still exceed available memory.
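The batching idea can be illustrated with a small NumPy sketch. It assumes the standard Mammen two-point multipliers (mean 0, variance 1) and a hypothetical budget_bytes parameter; the function names are invented for this example, and ModernDiD's actual batching logic may differ.

```python
import numpy as np


def mammen_weights(rng, shape):
    """Draw Mammen two-point multipliers (mean 0, variance 1)."""
    s5 = np.sqrt(5.0)
    p_low = (s5 + 1.0) / (2.0 * s5)            # probability of the negative point
    low, high = -(s5 - 1.0) / 2.0, (s5 + 1.0) / 2.0
    return np.where(rng.random(shape) < p_low, low, high)


def batched_multiplier_bootstrap(inf_func, biters, budget_bytes=1_000_000_000, seed=0):
    """Bootstrap draws of the mean influence function, one weight batch at a time.

    Hypothetical sketch: budget_bytes caps the size of each weight matrix.
    """
    n, k = inf_func.shape
    rng = np.random.default_rng(seed)
    # Number of weight rows that fit in the budget (8 bytes per float64 weight)
    batch = max(1, min(biters, budget_bytes // (8 * n)))
    out = np.empty((biters, k))
    for start in range(0, biters, batch):
        stop = min(start + batch, biters)
        V = mammen_weights(rng, (stop - start, n))   # (batch, n) weight matrix
        out[start:stop] = V @ inf_func / n           # batched matrix multiply
    return out


rng = np.random.default_rng(2)
inf_func = rng.normal(size=(300, 2))
# A deliberately tiny budget forces several batches of 7 draws each
draws = batched_multiplier_bootstrap(inf_func, biters=50, budget_bytes=8 * 300 * 7)
```

Only the weight matrix is bounded by the budget in this sketch. The influence function matrix itself must still fit in memory, which is the case the paragraph above warns about.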
GPU device selection
If your machine has multiple GPUs, CuPy uses device 0 by default. All computation runs on a single GPU; ModernDiD does not split work across devices. To select a different GPU, wrap the call in a CuPy device context:
import cupy as cp
import moderndid as did

# Run the call on the second GPU (device index 1)
with cp.cuda.Device(1):
    result = did.att_gt(
        data=data, yname="y", tname="time",
        idname="id", gname="group", backend="cupy",
    )
For multi-GPU parallelism, use a distributed backend to pin one worker
or executor per GPU. See Combining GPU and Dask
(using dask-cuda) or Combining GPU and Spark
(using Spark GPU resource scheduling).
Benchmarking correctly
GPU execution is asynchronous: CuPy calls return once their kernels are
queued, not when they finish. Standard Python timing
(time.perf_counter, %timeit) can therefore stop the clock while the
GPU is still working. For accurate benchmarks, synchronize the GPU
before taking each timestamp:
import time

import cupy as cp
import moderndid as did

cp.cuda.Stream.null.synchronize()  # drain pending work before starting the clock
start = time.perf_counter()
result = did.att_gt(
    data=data, yname="y", tname="time",
    idname="id", gname="group", backend="cupy",
)
cp.cuda.Stream.null.synchronize()  # wait for all queued kernels to finish
elapsed = time.perf_counter() - start
The first call in a process incurs one-time overhead from CUDA context
initialization and kernel compilation. CuPy caches compiled kernels in
~/.cupy/kernel_cache, so subsequent calls in the same or later
sessions are faster.
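Because of that one-time overhead, it usually makes sense to run a warm-up call and then time several repeats. The helper below is a minimal sketch of that pattern; gpu_timer and bench are hypothetical names, and the code skips synchronization when CuPy or a GPU is unavailable so that it also runs on CPU-only machines.

```python
import time
from contextlib import contextmanager

try:
    import cupy as cp
except Exception:
    cp = None  # CuPy missing or broken; timing still works on the CPU


def _sync():
    """Wait for queued GPU work, if a GPU is available."""
    if cp is not None:
        try:
            cp.cuda.Stream.null.synchronize()
        except Exception:
            pass  # no usable GPU; nothing to wait for


@contextmanager
def gpu_timer(label, results):
    """Store the synchronized wall time of the with-block in results[label]."""
    _sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        _sync()
        results[label] = time.perf_counter() - start


def bench(fn, repeats=3):
    """Run one untimed warm-up call, then return the best of `repeats` timings."""
    fn()  # warm-up: CUDA context creation and kernel compilation
    times = {}
    for i in range(repeats):
        with gpu_timer(i, times):
            fn()
    return min(times.values())


best = bench(lambda: sum(range(10_000)))
print(f"best of 3: {best:.6f} s")
```

In practice fn would be a zero-argument callable that wraps the estimator call, for example a lambda around did.att_gt with backend="cupy".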
Combining GPU and Dask
The GPU backend and the Dask distributed backend can be combined.
Pass backend="cupy" to att_gt or
ddd with a Dask DataFrame to run partition-level
linear algebra on worker GPUs. The low-level functions
dask_att_gt and dask_ddd
also accept the backend parameter:
import dask.dataframe as dd
import moderndid as did

ddf = dd.read_parquet("panel_data.parquet")

result = did.att_gt(
    data=ddf,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    est_method="dr",
    backend="cupy",
)
When backend="cupy" is active, each worker converts its partition
arrays to CuPy after building them from pandas. All Gram matrix
accumulation, IRLS iterations, and influence function computation run on
the worker’s GPU. Results are converted back to NumPy before leaving the
worker, so driver-side aggregation (tree-reduce, precomputation) stays
on the CPU.
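The division of labor can be pictured with a small NumPy sketch: each partition contributes its own Gram matrix and right-hand side, and the driver sums those small pieces pairwise before solving. This is a conceptual illustration only. partition_gram and tree_reduce are hypothetical names, and the code does not reproduce ModernDiD's Dask implementation.

```python
import numpy as np


def partition_gram(X, y, w):
    """Per-partition sufficient statistics: (X'WX, X'Wy)."""
    Xw = X * w[:, None]
    return Xw.T @ X, Xw.T @ y


def tree_reduce(parts):
    """Sum (gram, rhs) pairs pairwise until a single pair remains."""
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts) - 1, 2):
            (g1, r1), (g2, r2) = parts[i], parts[i + 1]
            merged.append((g1 + g2, r1 + r2))
        if len(parts) % 2:
            merged.append(parts[-1])  # odd one out moves up a level unchanged
        parts = merged
    return parts[0]


rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = rng.uniform(0.5, 1.5, size=1000)

# "Worker" side: one small (k x k, k) pair per partition
chunks = zip(np.array_split(X, 7), np.array_split(y, 7), np.array_split(w, 7))
parts = [partition_gram(Xc, yc, wc) for Xc, yc, wc in chunks]

# "Driver" side: reduce the small pieces and solve once
gram, rhs = tree_reduce(parts)
beta = np.linalg.solve(gram, rhs)
```

Only the small k-by-k matrices cross the worker boundary in this picture, which is why returning them as NumPy arrays costs little.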
CuPy must be installed on every worker. For multi-GPU machines, use
dask-cuda with a LocalCUDACluster to pin one worker per GPU:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker process per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)

result = did.att_gt(
    data=ddf,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    est_method="dr",
    backend="cupy",
)
Neither set_backend nor the use_backend context manager propagates to
Dask worker processes. Always pass the backend parameter on the
estimator call instead.
The following example shows a complete workflow where we connect to a multi-GPU cluster, read data, run estimation, and clean up.
import dask.dataframe as dd
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import moderndid as did

# Start one worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Read and persist input data
ddf = dd.read_parquet("panel_data.parquet").persist()
wait(ddf)

result = did.att_gt(
    data=ddf,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    xformla="~ x1 + x2",
    est_method="dr",
    backend="cupy",
)

# Post-estimation stays the same
event_study = did.aggte(result, type="dynamic")
did.plot_event_study(event_study)

client.close()
cluster.close()
When you need explicit control over the client (for example on
Databricks or a managed cluster), use the low-level
dask_att_gt entry point which accepts a
client parameter directly.
Combining GPU and Spark
The GPU backend and the Spark distributed backend can be combined.
Pass backend="cupy" to att_gt or
ddd with a PySpark DataFrame to run partition-level
linear algebra on executor GPUs. The low-level functions
spark_att_gt and spark_ddd
also accept the backend parameter:
from pyspark.sql import SparkSession
import moderndid as did

spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.read.parquet("panel_data.parquet")

result = did.att_gt(
    data=sdf,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    est_method="dr",
    backend="cupy",
)
When backend="cupy" is active, partition arrays are converted to CuPy
after collection. All Gram matrix accumulation, IRLS iterations, and
influence function computation run on the GPU. Results are converted back
to NumPy before being stored in the result object.
CuPy must be installed on the driver (and on executors if using Spark’s
mapInPandas GPU paths). On GPU-enabled Spark clusters (e.g., Databricks
ML Runtime with GPU instances, or YARN with GPU resource scheduling),
configure executors with GPU resources:
from pyspark.sql import SparkSession

# Request one GPU per executor and one GPU per task
spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)

result = did.att_gt(
    data=sdf,
    yname="y",
    tname="time",
    idname="id",
    gname="group",
    est_method="dr",
    backend="cupy",
)
Neither set_backend nor the use_backend context manager propagates to
Spark executor processes. Always pass the backend parameter on the
estimator call instead.
Local GPU setup
Cloud GPU environments (Colab, SageMaker, Databricks) generally ship
with CUDA drivers and runtime libraries pre-installed. On a local
machine you may need a few extra steps after installing the [gpu]
extra.
Verify that CuPy can compile and execute GPU kernels:
import cupy as cp
print(f"CuPy version: {cp.__version__}")
print(f"GPU: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
print(f"Devices: {cp.cuda.runtime.getDeviceCount()}")
# Triggers kernel compilation; fails if NVRTC or headers are missing
a = cp.array([1, 2, 3])
print(f"Test compute: {cp.sum(a)}") # Should print 6
If this snippet fails, the most common cause is missing CUDA runtime libraries. See the platform-specific notes below.
Windows
The CuPy wheel does not bundle all required CUDA runtime libraries on
Windows. CuPy relies on cuda-pathfinder to locate DLLs at runtime.
Missing libraries surface as errors such as
No such file: nvrtc*.dll,
cannot open source file "cuda_fp16.h", or
No such file: cublasLt*.dll. Install the full set via pip.
uv pip install nvidia-cuda-nvrtc nvidia-cuda-cccl nvidia-cuda-runtime
uv pip install nvidia-cublas nvidia-cusparse nvidia-cusolver nvidia-cufft nvidia-curand nvidia-nvjitlink
Linux
These libraries are typically included with a full CUDA Toolkit
installation (apt install nvidia-cuda-toolkit or the NVIDIA runfile
installer). If you installed CUDA through the system package manager,
no additional pip packages are needed.
macOS
macOS does not have local NVIDIA GPU support. Apple dropped CUDA after
macOS 10.13 (High Sierra), and Apple Silicon uses Metal instead of CUDA.
backend="cupy" still works from macOS when connected to a remote GPU
such as a cloud notebook, an SSH session to a GPU server, or a
Dask/Spark cluster with GPU workers. Install the [gpu] extra on the
remote environment where CuPy has access to an NVIDIA GPU.
Corrupted installs
If did.HAS_CUPY is False even though CuPy appears installed, pip
may have recorded the package while the actual library files are missing.
Force reinstall to fix this.
uv pip install --force-reinstall cupy-cuda12x # match your CUDA version
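Before reinstalling, it can help to confirm the mismatch between what the installer has recorded and what Python can actually import. The sketch below uses only the standard library; diagnose_cupy is a hypothetical helper written for this example, not a ModernDiD function.

```python
from importlib import metadata


def diagnose_cupy():
    """Compare recorded CuPy distributions with whether `import cupy` works."""
    recorded = []
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name.lower().startswith("cupy"):
            recorded.append(name)

    try:
        import cupy  # noqa: F401

        importable = True
    except Exception:
        importable = False

    return {"recorded": sorted(set(recorded)), "importable": importable}


report = diagnose_cupy()
print(report)
```

A non-empty recorded list combined with importable set to False matches the corrupted-install case described above. More than one recorded CuPy distribution is also worth investigating, since mixing wheels for different CUDA versions in one environment can cause conflicts.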
Verifying GPU usage
After running an estimator with backend="cupy" you can confirm the
GPU was used.
From Python
import cupy as cp
import moderndid as did

# Release cached blocks so the free-memory reading starts from a clean state
cp.get_default_memory_pool().free_all_blocks()
mem_before = cp.cuda.runtime.memGetInfo()[0]  # free bytes on the device

result = did.att_gt(
    data=data, yname="y", tname="time",
    idname="id", gname="group", backend="cupy",
)

mem_after = cp.cuda.runtime.memGetInfo()[0]
print(f"GPU memory consumed: {(mem_before - mem_after) / 1024**2:.1f} MB")
A value greater than 0 MB confirms GPU execution.
From a separate terminal
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -l 1
This prints GPU utilization every second so you can watch it spike during computation.
Troubleshooting
“CuPy is not installed” when calling set_backend("cupy")
The most common cause is installing the generic cupy package, which
tries to compile from source. Install a prebuilt wheel that matches
your CUDA driver version instead (e.g. uv pip install cupy-cuda12x).
Run nvidia-smi to check which CUDA version your driver supports.
After installing, restart your Python process (or notebook runtime)
before importing ModernDiD. CuPy availability is checked once at
import time.
“cudaErrorInsufficientDriver”
The installed CuPy wheel expects a newer CUDA version than your driver
provides. Check nvidia-smi and switch to the matching wheel.
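The same comparison can be made from Python. The sketch below assumes CuPy's cupy.cuda.runtime.driverGetVersion and runtimeGetVersion, which return integers encoded as 1000 * major + 10 * minor. The helper names are hypothetical, and the function returns a message instead of raising when CuPy or the driver cannot be queried.

```python
def decode_cuda_version(code):
    """Decode CUDA's integer version (1000 * major + 10 * minor), e.g. 12040 -> (12, 4)."""
    return code // 1000, (code % 1000) // 10


def check_driver_vs_runtime():
    """Describe whether the driver's CUDA version covers the runtime CuPy uses."""
    try:
        import cupy as cp

        driver = decode_cuda_version(cp.cuda.runtime.driverGetVersion())
        runtime = decode_cuda_version(cp.cuda.runtime.runtimeGetVersion())
    except Exception as exc:
        return f"could not query CUDA versions: {exc!r}"
    if driver < runtime:
        return (
            f"driver supports CUDA {driver[0]}.{driver[1]} but the runtime is "
            f"{runtime[0]}.{runtime[1]}; the driver may be too old for this wheel"
        )
    return f"driver CUDA {driver[0]}.{driver[1]} covers runtime {runtime[0]}.{runtime[1]}"


print(check_driver_vs_runtime())
```

Treat the message as a hint and not a verdict: CUDA's minor-version compatibility can allow a slightly older driver within the same major version.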
“No CUDA GPU is available”
Make sure nvidia-smi shows a device. In cloud notebooks, verify
that a GPU runtime is selected.
Next steps
Quickstart covers estimation options, aggregation types, and visualization for local workflows.
Distributed Estimation describes the Dask and Spark backends for datasets that exceed single-machine memory.
The Examples section walks through each estimator end-to-end with real and simulated data.