Description
Summary
The quantile regression forest (QRF) model for tip income trains on a weighted subsample of only 10,000 SIPP records, even though SIPP 2023 has 100K+ person-month observations. A larger training sample would better capture the thin right tail of high-tip workers, who are rare but matter for total tip income.
Current code
sipp.py, lines ~99-108:
# Weighted subsample of 10,000 records
Suggested change
Increase the subsample to 25-50K records, or use the full SIPP dataset if compute time permits. A QRF with 100 trees on 50K records should still train in under a minute.
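A minimal sketch of what a configurable subsample size could look like. This is not the code in sipp.py: the function name `weighted_subsample`, the column names `tip_income` and `weight`, and the choice to sample with replacement are all assumptions made for illustration, and the data frame is synthetic.

```python
import numpy as np
import pandas as pd


def weighted_subsample(df, weight_col, n, seed=0):
    """Draw n rows with probability proportional to weight_col.

    Hypothetical helper: sampling with replacement is an assumption,
    chosen so that any n works regardless of the frame's size.
    """
    return df.sample(n=n, weights=df[weight_col], replace=True, random_state=seed)


# Synthetic stand-in for the SIPP frame (NOT real SIPP data).
rng = np.random.default_rng(0)
records = pd.DataFrame(
    {
        "tip_income": rng.lognormal(mean=8.0, sigma=1.2, size=100_000),
        "weight": rng.uniform(0.5, 2.0, size=100_000),
    }
)

# The proposal amounts to raising n from 10,000 to something like 50,000.
train_small = weighted_subsample(records, "weight", n=10_000)
train_large = weighted_subsample(records, "weight", n=50_000)
```

Exposing the size as a parameter would make it easy to time the QRF fit at 10K, 25K, and 50K records before settling on a default.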
Expected impact
Better representation of the distribution's tails, particularly high earners in tipped occupations (e.g., high-end restaurant servers, bartenders in major cities) who may earn $50K+ in tips annually.
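To illustrate why sample size matters for the tail, the sketch below counts how many records above a $50K threshold land in a 10K versus a 50K draw. The population is a synthetic lognormal chosen only for illustration, not SIPP data, and the draw is unweighted for simplicity; the names `population`, `threshold`, and `tail_count` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative population of annual tip income (NOT SIPP data).
population = rng.lognormal(mean=8.0, sigma=1.2, size=100_000)
threshold = 50_000  # the "$50K+ in tips" tail discussed above


def tail_count(sample_size):
    """Number of records above the threshold in a simple random sample."""
    sample = rng.choice(population, size=sample_size, replace=False)
    return int((sample > threshold).sum())


# Compare the current subsample size with the proposed one.
tail_counts = {n: tail_count(n) for n in (10_000, 50_000)}
print(tail_counts)
```

The number of tail records grows roughly in proportion to the sample size, so the QRF has more high-tip observations from which to estimate upper quantiles.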
Context
Part of a series of improvements to tip income imputation identified while comparing PolicyEngine's "no tax on tips" estimate ($4.7B) against JCT's score ($10.0B).