Increase QRF training sample size for tip income model beyond 10K #529

@MaxGhenis

Summary

The quantile regression forest (QRF) model for tip income trains on a weighted subsample of only 10,000 SIPP records, while SIPP 2023 has 100K+ person-month observations. A larger training sample would better capture the thin right tail of high-tip workers, who are rare but account for a disproportionate share of total tip income.

Current code

sipp.py, lines ~99-108:

# Weighted subsample of 10,000 records

Suggested change

Increase the subsample to 25,000-50,000 records, or train on the full SIPP dataset if compute time permits. A QRF with 100 trees on 50,000 records should still train in under a minute.
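A minimal sketch of what a configurable sample size could look like, assuming the subsample is drawn with replacement with probability proportional to the survey weight. The function name `weighted_subsample`, the record layout, and the synthetic data are hypothetical and are not taken from `sipp.py`:

```python
import random


def weighted_subsample(records, weights, n, seed=0):
    """Draw n records with replacement, with probability proportional to weight.

    Hypothetical helper for illustration; not the code in sipp.py.
    """
    if len(records) != len(weights):
        raise ValueError("records and weights must have the same length")
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=n)


# Synthetic stand-in for SIPP person-month records (made-up values, not real data).
_rng = random.Random(42)
N_RECORDS = 100_000
records = [
    {"id": i, "tip_income": _rng.lognormvariate(6, 1.5)} for i in range(N_RECORDS)
]
weights = [_rng.uniform(500, 5000) for _ in range(N_RECORDS)]

# Compare the largest tip value captured at each candidate sample size.
for n in (10_000, 25_000, 50_000):
    sample = weighted_subsample(records, weights, n)
    largest = max(r["tip_income"] for r in sample)
    print(f"n={n:>6}: {len(sample)} records, max tip_income={largest:,.0f}")
```

Exposing `n` as a parameter would make it easy to time the QRF fit at each size before settling on a default.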

Expected impact

Better representation of the distribution tails, particularly high earners in tipped occupations (e.g., high-end restaurant servers, bartenders in major cities) who may earn $50K+ in tips annually.
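As a rough illustration of why sample size matters for the tail: if the subsample is drawn with replacement in proportion to survey weights, the number of tail records it contains is approximately Binomial(n, p), where p is the weighted population share of the tail group. The 0.1% share below is a made-up figure for illustration, not an estimate from SIPP:

```python
import math


def tail_record_stats(n, tail_share):
    """Expected tail-record count and its relative standard error.

    Assumes the count of tail records is Binomial(n, tail_share).
    """
    expected = n * tail_share
    relative_se = math.sqrt((1 - tail_share) / expected)
    return expected, relative_se


HYPOTHETICAL_TAIL_SHARE = 0.001  # assumption for illustration only

for n in (10_000, 25_000, 50_000):
    expected, relative_se = tail_record_stats(n, HYPOTHETICAL_TAIL_SHARE)
    print(f"n={n:>6}: ~{expected:.0f} tail records, relative SE {relative_se:.0%}")
```

Under that assumption, a 10,000-record sample would contain only about ten tail records, so the upper quantiles the QRF learns would rest on very few observations; a fivefold larger sample gives roughly five times as many.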

Context

Part of a series of improvements to tip income imputation identified while comparing PolicyEngine's "no tax on tips" estimate ($4.7B) against JCT's score ($10.0B).

Labels: enhancement (New feature or request)
