diff --git a/ROADMAP.md b/ROADMAP.md
index a4406e8..625f4b8 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -2,190 +2,235 @@
 This document outlines the feature roadmap for diff-diff, prioritized by practitioner value and academic credibility.
-## What Makes a Credible 1.0?
+For past changes and release history, see [CHANGELOG.md](CHANGELOG.md).
-A production-ready DiD library needs:
+---
+
+## Current Status (v1.0.2)
-1. ✅ **Core estimators** - Basic DiD, TWFE, MultiPeriod, Staggered (Callaway-Sant'Anna), Synthetic DiD
-2. ✅ **Valid inference** - Robust SEs, cluster SEs, wild bootstrap for few clusters
-3. ✅ **Assumption diagnostics** - Parallel trends tests, placebo tests
-4. ✅ **Sensitivity analysis** - What if parallel trends is violated? (Rambachan-Roth)
-5. ✅ **Conditional parallel trends** - Covariate adjustment for staggered DiD
-6. ✅ **Documentation** - API reference site for discoverability
+diff-diff is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` ecosystem for core DiD analysis:
-**All 1.0 blockers are complete.** diff-diff has feature parity with R's `did` + `HonestDiD` ecosystem for core DiD analysis.
+- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Synthetic DiD
+- **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap
+- **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition
+- **Sensitivity analysis**: Honest DiD (Rambachan-Roth)
+- **Study design**: Power analysis tools
 ---
-## Status Overview
-
-| Feature | Status | Priority | Why It Matters |
-|---------|--------|----------|----------------|
-| Honest DiD (Rambachan-Roth) | ✅ Done | — | Reviewers expect sensitivity analysis |
-| CallawaySantAnna Covariates | ✅ Done | — | Conditional PT often required in practice |
-| API Documentation Site | ✅ Done | — | Credibility and discoverability |
-| Goodman-Bacon Decomposition | ✅ Done | — | Explains when TWFE fails |
-| Power Analysis | ✅ Done | — | Study design tool |
-| CallawaySantAnna Bootstrap | ✅ Done | — | Valid inference with few clusters |
-| Sun-Abraham Estimator | Not Started | Post-1.0 | Alternative to CS, some prefer it |
-| Gardner's did2s | Not Started | Post-1.0 | Two-stage approach, available in pyfixest |
-| Local Projections DiD | Not Started | Post-1.0 | Dynamic effects (Dube et al. 2023) |
-| Borusyak-Jaravel-Spiess | Not Started | Post-1.0 | More efficient under homogeneous effects |
-| Double/Debiased ML | Not Started | Post-1.0 | High-dimensional covariates |
+## Near-Term Enhancements (v1.1–v1.2)
----
+High-value additions building on our existing foundation.
+
+### Sun-Abraham Estimator
+
+Interaction-weighted estimator providing an alternative to Callaway-Sant'Anna. Many practitioners run both as a robustness check.
+
+- Event-study coefficients via saturated regression with cohort-time interactions
+- Different weighting scheme than CS; can give different results under heterogeneous effects
+- Useful robustness check when CS and SA agree
+
+**Reference**: Sun & Abraham (2021). *Journal of Econometrics*.
-## 1.0 Target Features
+### Borusyak-Jaravel-Spiess Imputation Estimator
+
+More efficient than Callaway-Sant'Anna when treatment effects are homogeneous across groups/time. Uses imputation rather than aggregation.
-These would strengthen the 1.0 release but aren't strictly blocking.
+- Imputes untreated potential outcomes using pre-treatment data
+- More efficient under homogeneous effects assumption
+- Can handle unbalanced panels more naturally
-### ✅ Goodman-Bacon Decomposition (Done)
+**Reference**: Borusyak, Jaravel, and Spiess (2024). *Review of Economic Studies*.
-Helps users understand *why* TWFE can be biased with staggered adoption. Shows weights on "forbidden comparisons" (already-treated as controls). Essential diagnostic before deciding whether to use Callaway-Sant'Anna.
+### Gardner's Two-Stage DiD (did2s)
-- ✅ Decompose TWFE into 2x2 comparisons
-- ✅ Show weights by comparison type (clean vs. forbidden)
-- ✅ Visualization of decomposition (scatter and bar charts)
-- ✅ Integration with `TwoWayFixedEffects.decompose()` method
-- ✅ Automatic warning when TWFE detects staggered treatment timing
+Two-stage approach gaining traction in applied work. First residualizes outcomes, then estimates effects.
-**Reference**: Goodman-Bacon (2021). *Journal of Econometrics*.
+- Stage 1: Estimate unit and time FEs using only untreated observations
+- Stage 2: Regress residualized outcomes on treatment indicators
+- Clean separation of identification and estimation
-### ✅ Power Analysis Tools (Done)
+**Reference**: Gardner (2022). *Working Paper*.
-Practitioners need to know "how many units/periods do I need to detect an effect of size X?" Now available in diff-diff.
+### Triple Difference (DDD) Estimators
-- ✅ Minimum detectable effect given sample size
-- ✅ Required sample size for target power
-- ✅ Simulation-based power for any estimator (including staggered designs)
-- ✅ Visualization of power curves
-- ✅ Panel data considerations (ICC, multiple periods)
+Extends DiD to settings requiring a third differencing dimension. Common DDD implementations are invalid when covariates are needed for identification.
-**References**: Bloom (1995); Burlig, Preonas, & Woerman (2020).
+- Regression adjustment, IPW, and doubly robust DDD estimators
+- Staggered adoption support with multiple comparison groups
+- Proper covariate integration (naive "two DiD difference" approaches fail)
+- Bias reduction and precision gains over standard approaches
-### ✅ CallawaySantAnna Bootstrap Inference (Done)
+**Reference**: [Ortiz-Villavicencio & Sant'Anna (2025)](https://arxiv.org/abs/2505.09942). *Working Paper*. R package: `triplediff`.
-With few clusters or groups, analytical SEs may be unreliable. Multiplier bootstrap provides valid inference following the R `did` package approach.
+### Pre-Trends Power Analysis
-- ✅ Multiplier bootstrap at unit level with influence function perturbation
-- ✅ Aggregate bootstrap samples for overall ATT, event study, and group effects
-- ✅ Rademacher, Mammen, and Webb weight distributions
-- ✅ Percentile confidence intervals and bootstrap p-values
+Assess whether pre-trends tests have adequate power to detect meaningful parallel trends violations. Complements our Honest DiD implementation.
-**Reference**: Callaway & Sant'Anna (2021). *Journal of Econometrics*.
+- Minimum detectable violation size for pre-trends tests
+- Visualization of power against various violation magnitudes
+- Integration with existing parallel trends diagnostics
+
+**Reference**: [Roth (2022)](https://www.aeaweb.org/articles?id=10.1257/aeri.20210236). *AER: Insights*. R package: `pretrends`.
 ### Enhanced Visualization
 - Synthetic control weight visualization (bar chart of unit weights)
-- ✅ Bacon decomposition visualization (scatter and bar charts)
-- Treatment adoption "staircase" plot
+- Treatment adoption "staircase" plot for staggered designs
+- Interactive plots with plotly backend option
 ---
-## Post-1.0 Features
+## Medium-Term Enhancements (v1.3+)
-These are valuable but can wait for future versions.
+Extending diff-diff to handle more complex settings.
-### Sun-Abraham Estimator
+### Continuous Treatment DiD
-Alternative to Callaway-Sant'Anna using interaction-weighted approach. Some practitioners prefer it; provides a robustness check.
+Many treatments have dose/intensity rather than binary on/off. Active research area with recent breakthroughs.
-**Reference**: Sun & Abraham (2021). *Journal of Econometrics*.
+- Treatment effect on treated (ATT) parameters under generalized parallel trends
+- Dose-response curves and marginal effects
+- Handle settings where "dose" varies across units and time
+- Event studies with continuous treatments
-### Gardner's Two-Stage DiD (did2s)
+**References**:
+- [Callaway, Goodman-Bacon & Sant'Anna (2024)](https://arxiv.org/abs/2107.02637). *NBER Working Paper*.
+- [de Chaisemartin, D'Haultfœuille & Vazquez-Bare (2024)](https://arxiv.org/abs/2402.05432). *AEA Papers and Proceedings*.
+
+### de Chaisemartin-D'Haultfœuille Estimator
+
+Handles treatment that switches on and off (reversible treatments), unlike most other methods.
-Two-stage approach to staggered DiD that first residualizes outcomes using untreated observations, then estimates treatment effects. Available in pyfixest (Python) and did2s (R).
+- Allows units to move into and out of treatment
+- Time-varying, heterogeneous treatment effects
+- Comparison with never-switchers or flexible control groups
+- Different assumptions than CS/SA—useful for different settings
-**Reference**: Gardner (2022). *Two-stage differences in differences*.
+**Reference**: [de Chaisemartin & D'Haultfœuille (2020, 2024)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3980758). *American Economic Review*.
 ### Local Projections DiD
-Implements local projections for dynamic treatment effects. Flexible approach that doesn't require specifying the full dynamic structure. Gaining traction in applied work.
+Implements local projections for dynamic treatment effects. Doesn't require specifying full dynamic structure.
+
+- Flexible impulse response estimation
+- Robust to misspecification of dynamics
+- Natural handling of anticipation effects
+- Growing use in macroeconomics and policy evaluation
 **Reference**: Dube, Girardi, Jordà, and Taylor (2023).
-### Borusyak-Jaravel-Spiess Imputation Estimator
+### Nonlinear DiD
-More efficient than Callaway-Sant'Anna when parallel trends holds across all periods. Uses imputation approach.
+For outcomes where linear models are inappropriate (binary, count, bounded).
-**Reference**: Borusyak, Jaravel, and Spiess (2024).
+- Logit/probit DiD for binary outcomes
+- Poisson DiD for count outcomes
+- Flexible strategies for staggered designs with nonlinear models
+- Proper handling of incidence rate ratios and odds ratios
-### Double/Debiased ML for DiD
+**Reference**: [Wooldridge (2023)](https://academic.oup.com/ectj/article/26/3/C31/7250479). *The Econometrics Journal*.
-For high-dimensional settings with many covariates. Uses ML for nuisance parameter estimation with cross-fitting.
+### Doubly Robust DiD + Synthetic Control
-**Reference**: Chernozhukov et al. (2018), Chang (2020).
+Unified framework combining DiD and synthetic control with doubly robust identification—valid under *either* parallel trends or synthetic control assumptions.
-### Alternative Inference Methods
+- ATT identified under parallel trends OR group-level SC condition
+- Semiparametric estimation framework
+- Multiplier bootstrap for valid inference under either assumption
+- Strengthens credibility by avoiding the DiD vs. SC trade-off
+
+**Reference**: [Sun, Xie & Zhang (2025)](https://arxiv.org/abs/2503.11375). *Working Paper*.
-- Randomization inference for small samples
-- Bayesian DiD with prior on parallel trends
-- Conformal inference for prediction intervals
+### Causal Duration Analysis with DiD
+
+Extends DiD to duration/survival outcomes where standard methods fail (hazard rates, time-to-event).
+
+- Duration analogue of parallel trends on hazard rates
+- Avoids distributional assumptions and hazard function specification
+- Visual and formal pre-trends assessment for duration data
+- Handles absorbing states approaching probability bounds
+
+**Reference**: [Deaner & Ku (2025)](https://www.aeaweb.org/conference/2025/program/paper/k77Kh8iS). *AEA Conference Paper*.
 ---
-## Release History
+## Long-Term Research Directions (v2.0+)
+
+Frontier methods requiring more research investment.
-### v0.9.0 (Current)
+### Matrix Completion Methods
-- ✅ Callaway-Sant'Anna multiplier bootstrap inference
-- ✅ Rademacher, Mammen, and Webb weight distributions
-- ✅ Bootstrap SEs, CIs, and p-values for all aggregations (overall ATT, event study, group effects)
-- ✅ `CSBootstrapResults` dataclass for bootstrap results
+Unified framework encompassing synthetic control and regression approaches. Moves seamlessly between cross-sectional and time-series patterns.
-### v0.8.0
+- Nuclear norm regularization for low-rank structure
+- Handles missing data patterns common in panel settings
+- Bridges synthetic control (few units, many periods) and regression (many units, few periods)
+- Confidence intervals via debiasing
-- ✅ Power analysis tools (`PowerAnalysis`, `simulate_power`)
-- ✅ MDE, sample size, and power calculations
-- ✅ Simulation-based power for any DiD estimator
-- ✅ Power curve visualization (`plot_power_curve`)
-- ✅ Panel data support with ICC adjustment
+**Reference**: [Athey et al. (2021)](https://arxiv.org/abs/1710.10251). *Journal of the American Statistical Association*.
-### v0.7.0
+### Causal Forests for DiD
-- ✅ Goodman-Bacon decomposition for TWFE diagnostics
-- ✅ `plot_bacon()` visualization (scatter and bar charts)
-- ✅ `TwoWayFixedEffects.decompose()` integration
-- ✅ Automatic staggered treatment warning in TWFE
+Machine learning methods for discovering heterogeneous treatment effects in DiD settings.
-### v0.6.0
+- Estimate treatment effect heterogeneity across covariates
+- Data-driven subgroup discovery
+- Combine with DiD identification for observational data
+- Honest confidence intervals for discovered heterogeneity
-- ✅ **All 1.0 Blockers Complete**
-- ✅ Honest DiD sensitivity analysis (Rambachan & Roth 2023)
-- ✅ CallawaySantAnna covariate adjustment (DR, IPW, Reg)
-- ✅ API documentation site with Sphinx
+**References**:
+- [Kattenberg, Scheer & Thiel (2023)](https://ideas.repec.org/p/cpb/discus/452.html). *CPB Discussion Paper*.
+- Athey & Wager (2019). *Annals of Statistics*.
-### v0.5.0
+### Double/Debiased ML for DiD
+
+For high-dimensional settings with many potential confounders.
-- Wild cluster bootstrap (Rademacher, Webb, Mammen weights)
-- Placebo tests module
-- Tutorial notebooks
+- ML for nuisance parameter estimation (propensity, outcome models)
+- Cross-fitting for valid inference
+- Handles many covariates without overfitting concerns
+- Doubly-robust estimation with ML flexibility
-### v0.4.0
+**Reference**: Chernozhukov et al. (2018). *The Econometrics Journal*.
-- Callaway-Sant'Anna estimator for staggered DiD
-- Event study and group effects visualization
-- Parallel trends testing utilities
+### Alternative Inference Methods
-### v0.3.0
+- **Randomization inference**: Exact p-values for small samples
+- **Bayesian DiD**: Priors on parallel trends violations
+- **Conformal inference**: Prediction intervals with finite-sample guarantees
+
+---
-- Synthetic Difference-in-Differences
-- Multi-period DiD with event study
-- Data preparation utilities
+## Infrastructure Improvements
-### v0.2.0
+Ongoing maintenance and developer experience.
-- Two-Way Fixed Effects estimator
-- Fixed effects support (absorb parameter)
-- Cluster-robust standard errors
-- Formula interface
+### Performance
-### v0.1.0
+- JIT compilation for bootstrap loops (numba)
+- Parallel bootstrap iterations
+- Sparse matrix handling for large fixed effects
+- Memory-efficient estimation for large panels
-- Initial release with basic DiD estimator
+### Code Quality
+
+- Extract shared within-transformation logic to utils
+- Consolidate linear regression helpers
+- Consider splitting `staggered.py` (1800+ lines)
+
+### Documentation
+
+- Real-world data examples (beyond synthetic)
+- Performance benchmarks vs. R packages
+- Video tutorials and worked examples
 ---
 ## Contributing
-Interested in contributing? See the [GitHub repository](https://github.com/igerber/diff-diff) for open issues. Features marked "Not Started" are good candidates for contributions.
+Interested in contributing? Features in the "Near-Term" and "Medium-Term" sections are good candidates.
See the [GitHub repository](https://github.com/igerber/diff-diff) for open issues.
+
+Key references for implementation:
+- [Roth et al. (2023)](https://www.sciencedirect.com/science/article/abs/pii/S0304407623001318). "What's Trending in Difference-in-Differences?" *Journal of Econometrics*.
+- [Baker et al. (2025)](https://arxiv.org/pdf/2503.13323). "Difference-in-Differences Designs: A Practitioner's Guide."
diff --git a/TODO.md b/TODO.md
index b4b2ae1..31eab5f 100644
--- a/TODO.md
+++ b/TODO.md
@@ -6,97 +6,46 @@ For the public feature roadmap, see [ROADMAP.md](ROADMAP.md).
 ---
-## Priority Items for 1.0.1
-
-### Linter/Type Errors (Blocking) - COMPLETED
-
-| Issue | Location | Status |
-|-------|----------|--------|
-| ~~Unused import `Union`~~ | `power.py:25` | Fixed |
-| ~~Unsorted imports~~ | `staggered.py:8` | Fixed |
-| ~~10 mypy errors - Optional type handling~~ | `staggered.py:843-1631` | Fixed |
-
-### Quick Wins - COMPLETED
-
-- [x] Fix ruff errors (2 auto-fixable)
-- [x] Fix mypy errors in staggered.py (Optional dict access needs guards)
-- [x] Remove duplicate `_get_significance_stars()` from `diagnostics.py` (now imports from `results.py`)
-
----
-
 ## Known Limitations
+Current limitations that may affect users:
+
 | Issue | Location | Priority | Notes |
 |-------|----------|----------|-------|
 | MultiPeriodDiD wild bootstrap not supported | `estimators.py:1068-1074` | Low | Edge case |
 | `predict()` raises NotImplementedError | `estimators.py:532-554` | Low | Rarely needed |
-| SyntheticDiD bootstrap can fail silently | `estimators.py:1580-1654` | Medium | Needs error handling |
+| SyntheticDiD bootstrap can fail silently | `estimators.py:1580-1654` | Medium | Needs better error handling |
 | Diagnostics module error handling | `diagnostics.py:782-885` | Medium | Improve robustness |
 ---
-## Code Quality Issues
-
-### Bare Exception Handling - COMPLETED
-
-~~Replace broad `except Exception` with specific exceptions:~~
-
-| Location | Status |
-|----------|--------|
-| ~~`diagnostics.py:624`~~ | Fixed - catches `ValueError`, `KeyError`, `LinAlgError` |
-| ~~`diagnostics.py:735`~~ | Fixed - catches `ValueError`, `KeyError`, `LinAlgError` |
-| ~~`honest_did.py:807`~~ | Fixed - catches `ValueError`, `TypeError` |
-| ~~`honest_did.py:822`~~ | Fixed - catches `ValueError`, `TypeError` |
+## Code Quality
 ### Code Duplication
-| Duplicate Code | Locations | Status |
-|---------------|-----------|--------|
-| ~~`_get_significance_stars()`~~ | `results.py:183`, ~~`diagnostics.py`~~ | Fixed in 1.0.1 |
-| Wild bootstrap inference block | `estimators.py:278-296`, `estimators.py:725-748` | Future: extract to shared method |
-| Within-transformation logic | `estimators.py:217-232`, `estimators.py:787-833`, `bacon.py:567-642` | Future: extract to utils.py |
-| Linear regression helper | `staggered.py:205-240`, `estimators.py:366-408` | Future: consider consolidation |
-
-### API Inconsistencies - PARTIALLY ADDRESSED
+Consolidation opportunities for cleaner maintenance:
-**Bootstrap parameter naming:**
-| Estimator | Parameter | Status |
-|-----------|-----------|--------|
-| DifferenceInDifferences | `bootstrap_weights` | OK |
-| CallawaySantAnna | `bootstrap_weights` | Fixed in 1.0.1 (deprecated `bootstrap_weight_type`) |
-| TwoWayFixedEffects | `bootstrap_weights` | OK |
+| Duplicate Code | Locations | Notes |
+|---------------|-----------|-------|
+| Wild bootstrap inference block | `estimators.py:278-296`, `estimators.py:725-748` | Extract to shared method |
+| Within-transformation logic | `estimators.py:217-232`, `estimators.py:787-833`, `bacon.py:567-642` | Extract to utils.py |
+| Linear regression helper | `staggered.py:205-240`, `estimators.py:366-408` | Consider consolidation |
-**Cluster variable defaults:**
-- ~~`TwoWayFixedEffects` silently defaults cluster to `unit` at runtime~~ - Documented in 1.0.1
-
----
+### Large Module Files
-## Large Module Files
+Target: < 1000 lines per module for
maintainability.
-Current line counts (target: < 1000 lines per module):
-
-| File | Lines | Status |
+| File | Lines | Action |
 |------|-------|--------|
-| `staggered.py` | 1822 | Consider splitting |
-| `estimators.py` | ~975 | OK (refactored) |
-| `twfe.py` | ~355 | OK (new) |
-| `synthetic_did.py` | ~540 | OK (new) |
+| `staggered.py` | 1822 | Consider splitting to `staggered_bootstrap.py` |
 | `honest_did.py` | 1491 | Acceptable |
+| `visualization.py` | 1388 | Acceptable |
 | `utils.py` | 1350 | Acceptable |
 | `power.py` | 1350 | Acceptable |
 | `prep.py` | 1338 | Acceptable |
-| `visualization.py` | 1388 | Acceptable |
 | `bacon.py` | 1027 | OK |
-**Completed splits:**
-- ~~`estimators.py` → `twfe.py`, `synthetic_did.py` (keep base classes in estimators.py)~~ - Done in 1.0.2
-
-**Potential splits:**
-- `staggered.py` → `staggered_bootstrap.py` (move bootstrap logic)
-
----
-
-## Standard Error Consistency
+### Standard Error Consistency
 Different estimators compute SEs differently. Consider unified interface.
@@ -107,7 +56,7 @@ Different estimators compute SEs differently. Consider unified interface.
 | CallawaySantAnna | Simple difference-in-means SE |
 | SyntheticDiD | Bootstrap or placebo-based |
-**Action**: Audit and document SE computation across estimators. Consider adding `se_type` parameter for consistency.
+**Action**: Consider adding `se_type` parameter for consistency across estimators.
 ---
@@ -122,7 +71,7 @@ Edge cases needing tests:
 - [ ] CallawaySantAnna with single cohort
 - [ ] SyntheticDiD with insufficient pre-periods
-**Note**: 21 visualization tests are skipped when matplotlib unavailable - this is expected.
+**Note**: 21 visualization tests are skipped when matplotlib unavailable—this is expected.
 ---
@@ -135,18 +84,9 @@ Edge cases needing tests:
 ---
-## CallawaySantAnna Bootstrap Improvements
-
-Deferred improvements from code review (PR #32):
-
-- [ ] Refactor `_run_multiplier_bootstrap` into smaller helper methods for maintainability
-- [ ] Consider aligning p-value computation with R `did` package (symmetric percentile method)
-
----
-
-## Honest DiD Future Improvements
+## Honest DiD Improvements
-Post-1.0 enhancements for `honest_did.py`:
+Enhancements for `honest_did.py`:
 - [ ] Improved C-LF implementation with direct optimization instead of grid search
 - [ ] Support for CallawaySantAnnaResults (currently only MultiPeriodDiDResults)
@@ -156,19 +96,29 @@ Post-1.0 enhancements for `honest_did.py`:
 ---
-## Performance
+## CallawaySantAnna Bootstrap Improvements
+
+From code review (PR #32):
+
+- [ ] Refactor `_run_multiplier_bootstrap` into smaller helper methods
+- [ ] Consider aligning p-value computation with R `did` package (symmetric percentile method)
+
+---
+
+## Performance Optimizations
 No major performance issues identified. Potential future optimizations:
-- JIT compilation for bootstrap loops (numba)
-- Parallel bootstrap iterations
-- Sparse matrix handling for large fixed effects
+- [ ] JIT compilation for bootstrap loops (numba)
+- [ ] Parallel bootstrap iterations
+- [ ] Sparse matrix handling for large fixed effects
 ---
 ## Type Hints
 Missing type hints in internal functions:
+
 - `utils.py:593` - `compute_trend()` nested function
 - `staggered.py:173, 180` - Nested functions in `_logistic_regression()`
 - `prep.py:604` - `format_label()` nested function