test-parallel.R now skips on CRAN. The previous submission's
pretest exceeded CRAN's 2-core cap (_R_CHECK_LIMIT_CORES_) despite
the test honouring the env var, producing the only test failure.
Both test_that blocks in test-parallel.R now call
skip_on_cran(). The parallel path is fully exercised in our GitHub
Actions CI matrix on every push.

WINDOWS_BUILD_STATUS.md, BENCHMARKS.md) are intentionally excluded
from the source tarball via .Rbuildignore; the README now points to
their canonical GitHub URLs so they resolve from the rendered README
on CRAN. One stale link (../NOTES_did_gpu_checkpointed.Rmd, an
out-of-tree file that no longer exists) was removed.

'Rcpp' and 'CUDA' per CRAN convention.

sm_120) GPUs. The GPU build previously
targeted only Turing–Hopper (sm_75–sm_90) with no PTX fallback, so on
Blackwell parts (RTX PRO Blackwell, RTX 50xx) the kernels had no device
image and effect estimates silently came back as all zeros while the
CPU/R backends were correct. Added compute_120,sm_120 plus a
compute_120 PTX target to both src/Makevars and src/Makevars.win.
GPU effects are again bit-identical to the CPU/R backends on Blackwell
(verified on an RTX PRO 5000 Blackwell, CUDA 13.2).

src/didgpu_cuda.def. The DLL export list was
git-ignored (/src/*.def), so clean checkouts, as well as the r-universe and
install_github source builds, could not link didgpu_cuda.dll. It is
now tracked.

bin/x64. CUDA 13
moved the redistributable DLLs from bin/ to bin/x64/; the bundling
step now searches both so didgpu_cuda.dll loads at runtime.

didgpu was originally validated bit-for-bit against an older
DIDmultiplegtDYN. Two of its outputs were deliberately changed upstream;
didgpu now tracks the current (fixed) behavior:
predict_het standard errors now use HC2. The reference switched the
heterogeneity-regression variance from HC1 to HC2
(sandwich::vcovHC(type = "HC2")) in v2.3.1 ("explicit CI formulas"
fix). didgpu now does the same (new sandwich dependency), so the
predict_het SE/t/LB/UB/pF columns match again.

N counts each contributing cell once. For bidirectional
panels the reported placebo sample size was double-counting controls
shared between the switcher-in and switcher-out comparisons. It now uses
the in-direction count, matching the reference's per-row
coalesce(in, out) combiner across both in>out and out>in panels.
Point estimates were never affected.

same_switchers: the placebo now uses the same restricted switcher set
as the effects. Under same_switchers = TRUE the placebo block was
computed on the full switcher set rather than the consistent-switchers
subset, producing a biased placebo estimate (and inflated placebo N)
relative to did_multiplegt_dyn. The placebo distance now honours the
effects-based still_switcher restriction (the reference derives the
placebo distribution from the same_switchers-gated effect distance), so
placebo estimates and counts match again.

trends_lin: no longer crashes on panels with zero estimable effects.
On short panels where no group has the F_g-2 pre-period that
trends_lin requires, result aggregation crashed with
"length of 'dimnames' [1] not equal to array extent" (an unguarded
row-name build on a 0-row table). It now cleanly returns an empty
no-estimable-effects result.

N / Switchers columns now separate unweighted counts from
weighted sums. On weighted panels (weight =) the four reported count
columns were all populated from the same (weighted, truncated) switcher
mass, so N / Switchers reported weighted sums instead of observation
counts, and N.w lost the fractional weight (each cell's weight was
floored before summing). The estimator now reports N and Switchers as
the true unweighted observation / switcher counts and N.w /
Switchers.w as the (unfloored) weighted sums, matching
did_multiplegt_dyn exactly. Point estimates were never affected: the
weighted switcher mass that drives the Neyman pooling and ATE weights is
unchanged; only the reported count columns moved. On unweighted panels all
four columns coincide as before (bit-identical output).

w_in = N_in / (N_in + N_out)) and as the
across-horizon ATE weight, biasing the pooled event-study estimates by
~1e-3. The exact (unfloored) mass is now used throughout the estimate path,
matching did_multiplegt_dyn. Single-direction (switchers = "in"/"out")
and unweighted panels were never affected (the weight is 0/1 or the mass is
already integer). Found by randomized weighted×flag differential testing.

max L_g),
which can be one step more permissive than did_multiplegt_dyn's
cohort-level T_g clamp. When that extra horizon has no switcher reaching it
(NA estimate, zero switchers), the reference omits the row; didgpu now trims
the trailing block of such unestimable effect/placebo rows so the reported
horizon count matches. Estimable horizons, estimates, and all four count
columns are unchanged. Verified across 359 boundary-stress panels
(switchers in/out/both × effects 4–5 × weighted/unweighted): zero horizon
count mismatches and, importantly, no case where didgpu reported an extra
horizon with positive switchers. Found by overnight differential testing.

trends_lin (and other short-panel cases) the placebo block can
be entirely unestimable: no group has the F_g - q - 1 pre-period that any
placebo horizon needs. did_multiplegt_dyn returns NULL placebos in that
case; didgpu used to emit a placebo matrix of NAs. Aggregation now drops
the placebo block when every placebo point estimate is NA, matching the
reference. Found by extended weighted×flag differential testing
(4 of 4 affected panels now match; all other scenarios are untouched,
verified on 30 seeds × 14 scenarios = 420 weighted comparisons, 0 fails).

First public release. Five estimator families plus a sensitivity layer, each with CUDA kernels for the hot paths.
est_method = c("OR", "IPW", "DR"): all three inner estimators. DR is
the Sant'Anna-Zhao (2020) doubly robust estimator.

control_group = c("never", "notyet"): never-treated or not-yet-treated.

covariates = for OR/IPW/DR adjustment.

fit$placebo.

event / group / calendar / overall) with
didgpu_cs_aggregate() for switching post-fit.

bootstrap_kind = c("cluster", "multiplier"): cluster bootstrap on
units, or multiplier wild bootstrap on per-unit influence functions
(much faster for large B).

src/cuda_cs_inner.cu) with per-row influence functions. OR uses
an in-thread Cholesky per cell; IPW/DR add a per-cell IRLS logistic
propensity model replicating stats::glm.fit (ATT agrees with R to
~1e-8), and DR layers on the outcome-regression augmentation. With
per-cell IFs, the cluster + multiplier bootstrap SEs all run on the
GPU; the DR cluster bootstrap is ~192x faster than R.

did package on simulated
panels (max abs diff < 0.25 on event-study estimates).

"CS" (Cox-Shi 2023), "ARP"
(Andrews-Roth-Pakes 2023), "FSST" (Fang-Santos-Shaikh-Torgovitsky 2023).

.testmechs_cs_test(theta_hat, Sigma, A, A_eq, b_eq)
is reusable for any moment-inequality test.

src/cuda_testmechs_bootstrap.cu (cuRAND
multinomial; the main acceleration target). Live on Linux/WSL and
wired through .testmechs_bootstrap_cuda; the "nonparametric"
method runs on the GPU, "bayes" uses the R path. cuRAND vs R's
MT19937 differ per-replicate, so bootstrap moments match within
Monte-Carlo error rather than bit-for-bit.

didgpu_result,
didgpu_cs_result, didgpu_fect_result.

didgpu_loo_result data.frame with leave_out,
estimate, delta, and delta_pct columns, sorted by
abs(delta) descending so the most influential drop is on top.

print() shows the top-N most influential rows with an
interpretation hint; plot() draws a tornado plot of deltas.

by = "cohort" (leave-one-cohort-out, the standard DiD
diagnostic). Pass by = "unit", "cluster", or any column name
to drop on a different key.

method = c("RM", "M"): relative-magnitudes or smoothness bounds.

didgpu_result (DIDmultiplegtDYN-style) and
didgpu_cs_result (Callaway-Sant'Anna) fits.

effects, placebo, switchers = ""/"in"/"out", ATE.weight, controls, trends_nonparam cohort extension.

only_never_switchers, same_switchers, dont_drop_larger_lower.

normalized = TRUE (per-unit-of-treatment), trends_lin = TRUE
(linear cohort trends with cumulative-recovery placebos).

same_switchers_pl (placebo-side same-switchers gate; mirrors the
reference's constraint that it must be paired with same_switchers).

predict_het (heterogeneity regression with HC1 robust SEs and joint
F-test).

didgpu_by_path() for treatment-trajectory subgroup analysis (the
equivalent of the reference's by_path argument).

N, Switchers, N.w, Switchers.w) match
the reference exactly.

didgpu_equivalence(fit, delta): pre-trends equivalence (TOST) test on
the placebo estimates. Instead of relying on "failed to reject a zero
pre-trend" (weak, and weakest exactly when the test is underpowered), it tests
H0: |pre-trend| >= delta, so rejecting H0 is positive evidence that the
pre-trend lies within +/- delta. Reports per-horizon and joint
(intersection-union) verdicts plus the smallest defensible margin
(breakdown_delta). Mirrors didgpu_fect_equivalence().

didgpu_joint_placebo(fit, horizons): the joint chi-square placebo
test (p_jointplacebo) restricted to a chosen pre-treatment window,
reusing the stored bootstrap covariance. Test parallel trends only over
the leads you care about; the full-window call reproduces the headline
p_jointplacebo exactly.

didgpu_bacon(): Goodman-Bacon (2021) decomposition of the static TWFE
DiD into its 2x2 timing-group comparisons, with the total weight on
"forbidden" already-treated-control comparisons as the bias diagnostic.
Validated by the exact identity (weighted 2x2 sum == the TWFE
coefficient from didgpu_twfe()). Balanced, binary, absorbing panels.

didgpu_did_static(): de Chaisemartin & D'Haultfoeuille (2020) DID_M
instantaneous estimator. Unlike the staggered-adoption methods it allows
treatment to turn on AND off (non-absorbing): it compares each switcher's
period-over-period outcome change to same-baseline stayers and averages
over all switch events, with a cluster bootstrap SE. Native
reimplementation; cross-checked against DIDmultiplegt::did_multiplegt.

didgpu_freyaldenhoven(): Freyaldenhoven, Hansen & Shapiro (2019)
pre-event panel event study. estimator = "OLS" is the two-way FE
event study; estimator = "FHS" adds an auxiliary proxy covariate as an
endogenous regressor and 2SLS-instruments it with a far policy lead to
purge a confound that generates pre-trends. Native reimplementation of
the first-difference parameterization; coefficients match
eventstudyr::EventStudy (OLS and FHS) to machine precision.

didgpu_cs_continuous(): Callaway, Goodman-Bacon & Sant'Anna (2024)
difference-in-differences with a CONTINUOUS treatment (dose). Estimates
the dose-response curve: the level effect ATT(d) and the causal response
ACRT(d) = ATT'(d), via a B-spline regression of the within-unit outcome
change on the dose, vs a never-treated comparison; multiplier-bootstrap
SEs. Native reimplementation (spline basis via splines2); ATT(d)/ACRT(d)
match contdid::cont_did exactly.

didgpu_did_continuous(): de Chaisemartin & D'Haultfoeuille (2024)
continuous treatment with NO STAYERS. When the dose changes for (almost)
every unit there is no pure control group, so identification is in first
differences: with dY, dD the within-unit changes, the common trend
E[dY|dD=0] is recovered from quasi-stayers (dD near 0), giving the level
effect effect(d) = E[dY|dD=d] - E[dY|dD=0] and the average causal response
ACR(d). estimator = "parametric" fits a polynomial in dD (sqrt(n) rate);
estimator = "nonparametric" is a local-linear (kernel) fit (n^(2/5) rate; flagged
EXPERIMENTAL because no maintained R reference exists to bit-validate it).
Multiplier-bootstrap SEs. Both estimators were validated by simulation against a
known dose-response.

BENCHMARKS.md). The audit
also caught and fixed two quadratic bootstraps: didgpu_did_static's cluster
bootstrap was O(n_units^2) per replicate (pre-splitting by cluster makes it
O(n); 12-23x faster, bit-identical SEs), and didgpu_did_continuous no longer
recomputes the O(n^2) overall-ACR on every bootstrap replicate (nonparametric
bootstrap ~675x faster; reported effect(d)/ACR(d) unchanged).

saveRDS
tmp + rename) and append-only manifest.csv. Resumable on crash.

didgpu_resume(checkpoint_dir, df, ...): re-invokes with every
stored arg restored from meta.json.

didgpu_bootstrap_more(checkpoint_dir, df, extra_reps): extend a
finished run with more bootstrap reps without rework.

didgpu_by(df, by_var, ...): per-subgroup fits, each with its own
checkpoint subdirectory.

n_workers > 1L parallelises the bootstrap loop via
parallel::makeCluster; bit-identical to sequential at the same seed.

"r": pure R via data.table. 60× faster than the reference at
200K rows.

"reference": delegate to DIDmultiplegtDYN::did_multiplegt_dyn,
used as the parity oracle.

"cuda": live on Windows and Linux/WSL2 (built and verified
end-to-end on an NVIDIA RTX 4000 Ada, CUDA 12.6; bit-identical results
on both). On Windows it needs no admin rights: a user-local CUDA
toolkit plus a two-DLL split (didgpu_cuda.dll built by nvcc/MSVC,
the R-facing didgpu.dll built by Rtools/MinGW, bridged by a pure-C
ABI) sidesteps the MinGW↔MSVC link barrier — see
WINDOWS_BUILD_STATUS.md. Live GPU paths: the CS cluster
bootstrap (179–228× faster than R via the influence-function
shortcut), the CS multiplier bootstrap, the CS OR point estimate
(bit-exact vs R), and the TestMechs nonparametric bootstrap. The
fect SVD path is size-gated — it only engages for very large
balanced panels, since cuSOLVER loses to CPU LAPACK on the small
matrices typical of fect. Every GPU path falls back transparently
to R when CUDA is unavailable or would be slower, so backend = "cuda" is always safe. See BENCHMARKS.md and
tests/testthat/test-cuda-equivalence-grid.R (142 lock-step
assertions). Tests skip GPU paths when nvcc or a device is absent.

"cpu": Rcpp+Eigen, scaffolded only.

print, summary, coef, confint, vcov, plot,
plus tidy, glance, augment via broom.

ggplot2 dependency); event-study with stored CIs.

didgpu_summarize_panel, didgpu_estimate_runtime,
didgpu_compare (compare against the reference).

didgpu_fect(method = "fe"): two-way fixed effects, iterative
demeaning of the controls-only outcome matrix.

didgpu_fect(method = "ife"): Bai (2009) interactive fixed effects.
Alternating fe-step + rank-r SVD of the residual matrix until
convergence.

didgpu_fect(method = "mc"): Athey et al. (2021) matrix completion.
Iterative soft-thresholded SVD on the controls-only matrix.

didgpu()'s checkpoint / resume / parallel bootstrap
infrastructure. Results are returned as didgpu_fect_result (extends
didgpu_result) so all the standard accessors (coef, confint,
vcov, plot, tidy, glance) work the same way.

src/cuda_fect_fe.cu and
src/cuda_fect_svd.cu (the latter uses cuSOLVER's
cusolverDnDgesvdj for the SVD primitive shared by ife and mc)
and are wired through R. However, they are size-gated: on the
small, tall-skinny matrices typical of fect panels the per-iteration
cuSOLVER SVD is 100–300× slower than R's LAPACK (cuSOLVER handle +
H2D/D2H overhead dwarfs the tiny SVD). .fect_cuda_svd_worthwhile()
only routes to the GPU for very large balanced panels
(n_units ≥ 2000 and n_units·n_periods ≥ 2e5); below that
backend = "cuda" transparently uses R's svd(). See BENCHMARKS.md.

R CMD check passes with only the
pre-existing intentional WARNING (CUDA .cu files in src/) and the
declared GNU make SystemRequirements NOTE.

tests/testthat/test-fuzz.R) covers 21
scenarios: vanilla / weight / controls / switchers / normalized /
placebos / trends_lin / only_never + same_switchers / kitchen sink /
trends_lin sink / multivalued / bootstrap-stability /
parallel-equals-sequential / checkpoint round-trip / degenerate /
very-small. Default DIDGPU_FUZZ_N = 8 for fast CI; bump via env
var for deep local runs (validated at N = 200, no failures).
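
The DIDGPU_FUZZ_N override described above could be resolved along the following lines. This is only a sketch in base R: the helper names fuzz_n() and fuzz_seeds() are hypothetical and are not taken from test-fuzz.R, whose actual parsing of the variable is not shown here. It assumes an unset, non-numeric, or non-positive value should fall back to the default of 8.

```r
# Hypothetical helper (not taken from test-fuzz.R): resolve the number of
# fuzz iterations from the DIDGPU_FUZZ_N environment variable.
fuzz_n <- function(default = 8L) {
  raw <- Sys.getenv("DIDGPU_FUZZ_N", unset = "")
  n <- suppressWarnings(as.integer(raw))  # "" or non-numeric text becomes NA
  if (is.na(n) || n < 1L) default else n
}

# Hypothetical: one seed per fuzz iteration, so a failing case can be
# replayed by its seed.
fuzz_seeds <- function() seq_len(fuzz_n())
```

Under these assumptions, calling Sys.setenv(DIDGPU_FUZZ_N = "200") before running the tests would request the deeper local run, while CI, which leaves the variable unset, keeps the fast default.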