
Conversation

@jdebacker
Member

This commit adds a complete benchmark suite for measuring and optimizing Dask performance in OG-Core, with particular focus on Windows performance issues.

New files:

  • tests/test_dask_benchmarks.py: Mock benchmark tests with synthetic workloads
  • tests/test_real_txfunc_benchmarks.py: Real-world tax function benchmarks
  • tests/run_benchmarks.py: Automated benchmark runner with reporting
  • tests/BENCHMARK_README.md: Comprehensive documentation and usage guide
  • pytest.ini: Updated with benchmark test markers

Key features:

  • Platform-specific optimization tests (Windows, macOS, Linux)
  • Memory usage and compute time benchmarking
  • Baseline establishment and performance regression detection
  • Comparison of different Dask schedulers and client configurations
  • Real tax function estimation performance measurement
  • Automated identification of optimal Dask settings per platform

Benefits:

  • Establishes performance baselines before optimization work
  • Identifies Windows-specific Dask performance bottlenecks
  • Provides automated regression detection for future changes
  • Enables data-driven optimization decisions
  • Supports continuous performance monitoring

Usage:
python tests/run_benchmarks.py # Run all benchmarks
python tests/run_benchmarks.py --quick # Quick benchmarks only
python tests/run_benchmarks.py --save-baseline # Save performance baseline
python tests/run_benchmarks.py --compare-baseline # Compare against baseline
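
The same tests can also be selected directly with pytest via the markers registered in pytest.ini; for example (assuming the 'real' and 'platform' markers mentioned later in this thread are attached to the corresponding tests):

pytest tests/test_real_txfunc_benchmarks.py -m real # Real tax function benchmarks only
pytest tests/test_dask_benchmarks.py -m platform # Platform-specific optimization tests only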

🤖 Generated with Claude Code

@jdebacker requested a review from Copilot on August 22, 2025 16:42
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a comprehensive benchmark suite for measuring and optimizing Dask performance in OG-Core, with particular focus on Windows performance issues. The suite establishes performance baselines, provides automated regression detection, and enables data-driven optimization decisions through platform-specific testing.

  • Mock benchmark tests with synthetic workloads that mimic tax function patterns
  • Real tax function benchmarks using actual txfunc.tax_func_estimate calls
  • Platform-specific optimization detection with automated configuration recommendations

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:
  • tests/test_real_txfunc_benchmarks.py: Real-world tax function benchmarks with realistic data generation and platform-specific testing
  • tests/test_dask_benchmarks.py: Mock benchmark framework with synthetic workloads and comprehensive performance measurement utilities
  • tests/run_benchmarks.py: Automated benchmark runner with baseline comparison and reporting capabilities
  • tests/BENCHMARK_README.md: Comprehensive documentation covering usage, interpretation, and troubleshooting
  • pytest.ini: Updated with benchmark-specific test markers for easy test selection
Comments suppressed due to low confidence (1)


@jdebacker
Member Author

Some of the initial benchmarking results:

✅ Key Results Summary

Platform: macOS (Darwin) - which behaves similarly to Linux for Dask
performance patterns

📊 Performance Analysis

Threading vs Multiprocessing Performance Gap

  • Threaded scheduler: ~0.024s average ⚡️ (FAST)
  • Multiprocessing scheduler: ~2.241s average 🐌 (93x SLOWER!)
  • Distributed processes: ~1.742s average (72x SLOWER than threads)

Memory Efficiency

  • Threaded: ~309.7MB average peak memory
  • Multiprocessing: ~301.7MB average peak memory
  • Distributed: ~317MB average peak memory

🔍 Key Findings

  1. Massive Performance Gap: Multiprocessing is 93x slower than threading
    on macOS
    - This confirms the serialization overhead issue we identified
    - Windows would likely show even worse multiprocessing performance
  2. Threading is Optimal: The threaded scheduler consistently outperforms
    all other options
    - Best compute time: ~0.024s
    - Reasonable memory usage: ~310MB
    - Most reliable across different test scenarios
  3. Distributed Client Overhead: Even distributed clients add some
    overhead vs direct threading
    - Distributed threaded: ~0.026s (slightly slower than direct threading)
    - Still much better than multiprocessing approaches
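
For reference, the scheduler gap reported above can be reproduced with a minimal timing loop along these lines. This is only a sketch with a made-up synthetic_workload; the numbers above come from the actual suite in tests/test_dask_benchmarks.py:

import time
from dask import compute, delayed

def synthetic_workload(i):
    # Stand-in task; the real suite uses workloads that mimic tax function patterns.
    return sum(x ** 0.5 for x in range(200_000 + i))

if __name__ == "__main__":
    lazy_values = [delayed(synthetic_workload)(i) for i in range(16)]
    for scheduler in ("threads", "processes"):
        start = time.perf_counter()
        compute(*lazy_values, scheduler=scheduler)
        print(f"{scheduler}: {time.perf_counter() - start:.3f}s")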

@jdebacker
Member Author

jdebacker commented Aug 22, 2025

🎯 Recommendations Based on Results

For OG-Core Optimization:

  1. Immediate Win: Switch Windows users to threaded scheduler

Instead of:

results = compute(*lazy_values, scheduler=dask.multiprocessing.get)

Use:

results = compute(*lazy_values, scheduler="threads")

  2. Platform Detection: Implement platform-aware Dask configuration

if platform.system() == "Windows":
    scheduler = "threads"  # 93x faster!
else:
    scheduler = dask.multiprocessing.get  # OK on Unix/macOS

  3. Memory Optimization: All schedulers show reasonable memory usage
    (~300-320MB)
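
In OG-Core this check would most naturally live in a small helper that builds the Dask client. The sketch below is illustrative only; make_client and its defaults are hypothetical, not code from this PR:

import platform
from distributed import Client

def make_client(num_workers=4):
    # Hypothetical helper: prefer thread-based execution on Windows, where
    # process-based schedulers pay a large serialization/startup penalty.
    if platform.system() == "Windows":
        return Client(processes=False, threads_per_worker=num_workers)
    return Client(n_workers=num_workers, threads_per_worker=1)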

jdebacker and others added 2 commits August 22, 2025 14:19
This commit fixes the real tax function benchmark tests that were failing
with "'market_income'" KeyError by:

Changes:
- Added missing 'market_income' column to generated test data
- Added all required columns that txfunc expects: year, total_tax_liab,
  payroll_tax_liab, weight, mtr_labinc, mtr_capinc
- Fixed variable naming inconsistency (error_msg vs error_message)
- Increased sample sizes for more realistic benchmarking
- Disabled age-specific estimation for faster benchmark execution
- Added missing pytest markers ('real', 'platform') to pytest.ini

The real tax function benchmarks now successfully run and can measure
actual OG-Core performance with different Dask configurations.

Test results show ~25s execution time for real tax function estimation,
providing valuable baseline data for optimization efforts.
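
For reference, the generated micro data needs at least the columns listed above. A minimal sketch (column names are those named in this commit; the values are arbitrary placeholders, and the real generator in tests/test_real_txfunc_benchmarks.py draws more realistic distributions):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
micro_data = pd.DataFrame({
    "year": np.full(n, 2030),
    "market_income": rng.lognormal(10.5, 0.8, n),
    "total_tax_liab": rng.lognormal(8.5, 0.8, n),
    "payroll_tax_liab": rng.lognormal(7.5, 0.8, n),
    "weight": rng.uniform(0.5, 1.5, n),
    "mtr_labinc": rng.uniform(0.0, 0.5, n),
    "mtr_capinc": rng.uniform(0.0, 0.5, n),
})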

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
This commit fixes the run_benchmarks.py script to properly run both
mock and real benchmark test files, addressing the errors reported
when running --save-baseline with-real-benchmarks.

Changes:
- Modified run_benchmark_tests() to include both test_dask_benchmarks.py
  and test_real_txfunc_benchmarks.py
- Previously was only running mock benchmarks, missing real txfunc tests
- Cleaned up old confusing benchmark results to avoid stale error messages

Test results:
- All 13 benchmark tests now pass successfully (7 mock + 6 real)
- Full benchmark suite runs in ~4:42 minutes
- Successfully saved 45 benchmark results as baseline
- Both mock and real benchmarks working with all Dask configurations

Performance insights from real benchmarks:
- Real tax function estimation: 22-44 seconds (baseline performance)
- Mock benchmarks: 0.024 seconds (for regression testing)
- Threaded scheduler remains fastest for all configurations
- Platform-specific optimization tests working correctly
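
Conceptually, the run_benchmark_tests() change amounts to pointing the runner at both test files. A rough sketch, not the actual implementation in tests/run_benchmarks.py:

import pytest

def run_benchmark_tests(extra_args=None):
    # Run both the mock and the real benchmark test files in one pytest session.
    test_files = [
        "tests/test_dask_benchmarks.py",
        "tests/test_real_txfunc_benchmarks.py",
    ]
    return pytest.main(test_files + ["-v"] + (extra_args or []))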

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@jdebacker
Member Author

To compare performance improvements after optimizations:
python run_benchmarks.py --compare-baseline with-real-benchmarks

@codecov-commenter

codecov-commenter commented Aug 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.63%. Comparing base (ef6695e) to head (689bb83).

Additional details and impacted files


@@            Coverage Diff             @@
##           master    #1049      +/-   ##
==========================================
+ Coverage   72.61%   72.63%   +0.02%     
==========================================
  Files          20       20              
  Lines        5076     5080       +4     
==========================================
+ Hits         3686     3690       +4     
  Misses       1390     1390              
Flag        Coverage Δ
unittests   72.63% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines   Coverage Δ
ogcore/__init__.py         100.00% <100.00%> (ø)
ogcore/household.py        88.08% <100.00%> (+0.20%) ⬆️

rickecon and others added 3 commits August 26, 2025 01:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@rickecon marked this pull request as ready for review on August 26, 2025 07:01
@rickecon
Member

rickecon commented Aug 26, 2025

@jdebacker. I have reviewed this PR and am ready to merge it as soon as you review and merge my PR to your branch. That PR just updates the OG-Core version number and applies black formatting to one of the new test files. I also made some of the changes Copilot suggested in this thread and committed them directly to your branch, so you will probably want to pull from your remote branch before merging my PR. Let me know if you have any questions.

For any of the other Copilot-suggested changes that I didn't make, I opened issues (Issue #1051 and Issue #1050) so that we can easily address them later.

@rickecon
Member

@jdebacker. Looks great. Merging.

@rickecon merged commit 938073b into PSLmodels:master on Aug 28, 2025
8 checks passed