docs: add Benjamini-Hochberg correction and configurable CUPED lookback #2729
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -289,7 +289,8 @@ | |
|
|
||
| | Technique | What It Does | When to Use | Actions to Take | | ||
| | --- | --- | --- | --- | | ||
| | **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. | | ||
| | **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants and want maximum protection against any false positive | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. | | ||
| | **Benjamini-Hochberg Correction** | Controls the proportion of significant results that are false positives (false discovery rate) | You're tracking many metrics and want to correct for multiple comparisons without being overly conservative | Treat the surviving significant results as your set of likely winners, knowing a small share may still be false positives. Validate the most important ones in a follow-up. | | ||
|
Check failure on line 293 in pages/docs/experiments.mdx
|
||
| | **Winsorization** | Caps extreme outlier values at a percentile you choose | Revenue or value-based metrics where outliers are common | If results change substantially after Winsorization, this indicates your original results were driven by outliers. Decide whether your business decision is about typical users or extreme ones. | | ||
| | **SRM** | Checks whether your variant split matches the configured allocation | Always on as a health check | Pause the experiment. Identify and fix the root cause of the mismatch, then restart the experiment. | | ||
| | **Pre-Experiment Bias** | Checks whether variant groups were already different *before* the experiment started | Always on as a health check | Your groups had pre-existing bias. Consider enabling CUPED to correct for it. Investigate your assignment logic to prevent it in future experiments. | | ||
|
|
@@ -309,7 +310,37 @@ | |
| - You have multiple treatment variants competing against control | ||
| - You want higher confidence that significant results are real | ||
|
|
||
| Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction. | ||
| Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction. If you're testing many metrics and Bonferroni feels too strict, consider Benjamini-Hochberg below. | ||
|
Check failure on line 313 in pages/docs/experiments.mdx
|
||
|
|
||
| ### Benjamini-Hochberg Correction | ||
|
Check failure on line 315 in pages/docs/experiments.mdx
|
||
|
|
||
| Benjamini-Hochberg (BH) is a more balanced alternative to Bonferroni for handling multiple comparisons. Instead of controlling the probability that *any* significant result is a false positive (the family-wise error rate that Bonferroni targets), BH controls the **false discovery rate (FDR)** — the expected *proportion* of your "winners" that are actually false positives. | ||
|
Check failure on line 317 in pages/docs/experiments.mdx
|
||
|
|
||
| This better matches how experiment decisions are actually made. You only ship winners, so what matters is the share of those winners that are real. If you have 4 significant metrics and on average 1 is a false positive, your false discovery rate is 25% — regardless of how many metrics you tested in total. | ||
|
|
||
| **How it works:** Mixpanel ranks all p-values for the experiment from smallest to largest. Each p-value is compared to a threshold of (rank / total tests) × α, where α is determined by your confidence level. The largest p-value that passes its threshold — and every smaller p-value — is marked significant. | ||
|
|
||
| For example, with 5 metrics at a 95% confidence level (α = 0.05): | ||
|
|
||
| | Metric | P-value | Rank | BH Threshold (0.05 × rank / 5) | Significant? | | ||
| | --- | --- | --- | --- | --- | | ||
| | Revenue | 0.003 | 1 | 0.010 | ✅ | | ||
| | Signups | 0.012 | 2 | 0.020 | ✅ | | ||
| | Click-through | 0.018 | 3 | 0.030 | ✅ | | ||
| | Time on page | 0.041 | 4 | 0.040 | ❌ | | ||
| | Bounce rate | 0.062 | 5 | 0.050 | ❌ | | ||
|
|
||
| Under BH, the first three metrics are significant. Under Bonferroni at the same confidence level, only Revenue would pass. | ||
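The ranking procedure and the worked table above can be reproduced with a short sketch. This is an illustration of the standard Benjamini-Hochberg step-up rule, not Mixpanel's implementation; the function names `benjamini_hochberg` and `bonferroni` are hypothetical, and treating a p-value equal to its threshold as passing (`<=`) is an assumption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans aligned with p_values marking BH-significant tests."""
    m = len(p_values)
    # Indices of the tests ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below (rank / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # That p-value and every smaller one are marked significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


def bonferroni(p_values, alpha=0.05):
    """Return booleans marking tests that pass the alpha / m threshold."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# P-values from the example table: Revenue, Signups, Click-through,
# Time on page, Bounce rate.
p_values = [0.003, 0.012, 0.018, 0.041, 0.062]
bh_result = benjamini_hochberg(p_values)
bonferroni_result = bonferroni(p_values)
print(bh_result)          # [True, True, True, False, False]
print(bonferroni_result)  # [True, False, False, False, False]
```

Note that the cutoff is the largest passing rank, so a p-value that misses its own threshold can still be marked significant if a larger p-value passes further down the list.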
|
|
||
| **When to use Benjamini-Hochberg Correction:** | ||
|
Check failure on line 335 in pages/docs/experiments.mdx
|
||
|
|
||
| - You're tracking many metrics and want correction without sacrificing too much statistical power | ||
| - You're comfortable accepting that a small proportion of your significant results may be false positives, in exchange for catching more real effects | ||
| - You want a correction method that scales gracefully as you add metrics | ||
|
|
||
| A small proportion of significant results may still be false positives under BH, but you are much more likely to detect real effects than with Bonferroni. BH is the recommended choice when you're tracking many metrics or running multi-variant experiments and want to preserve statistical power. | ||
|
|
||
| You can only apply one correction method at a time. Choose Bonferroni when the cost of acting on any false winner is high, and Benjamini-Hochberg when you're testing many metrics and want a more balanced trade-off between false positives and missed effects. | ||
|
|
||
| ### Winsorization | ||
|
|
||
|
|
@@ -363,6 +394,8 @@ | |
|
|
||
| **How it works:** For each user, Mixpanel looks at their metric value during a pre-exposure period of your choosing and their metric value during the experiment. If these values are strongly correlated (users with high pre-experiment values tend to have high post-experiment values), CUPED uses this relationship to reduce variance in the experiment results. The mean values remain unchanged—CUPED only tightens the confidence intervals. This is applied to all metric categories: primary, secondary, and guardrail metrics. | ||
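As a rough illustration of the paragraph above, the textbook CUPED adjustment subtracts the part of each user's experiment-period value that is predicted by their pre-exposure value. The sketch below assumes that textbook form; the function name `cuped_adjust` and the sample numbers are hypothetical, and Mixpanel's actual computation may differ.

```python
from statistics import mean, pvariance


def cuped_adjust(pre, post):
    """Textbook CUPED: remove the part of `post` explained by `pre`."""
    pre_mean = mean(pre)
    post_mean = mean(post)
    var_pre = pvariance(pre)
    if var_pre == 0:
        # No variation in the pre-exposure values, so nothing to adjust.
        return list(post)
    # theta is the regression slope of post on pre: cov(pre, post) / var(pre).
    cov = mean([(x - pre_mean) * (y - post_mean) for x, y in zip(pre, post)])
    theta = cov / var_pre
    # Centering on pre_mean keeps the overall mean of the metric unchanged.
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]


# Hypothetical per-user values: pre-exposure period, then experiment period.
pre = [1.0, 2.0, 3.0, 4.0, 5.0]
post = [2.1, 3.9, 6.2, 8.0, 9.8]
adjusted = cuped_adjust(pre, post)
print(mean(post), mean(adjusted))
print(pvariance(post), pvariance(adjusted))
```

The stronger the correlation between the two periods, the more variance the adjustment removes; with uncorrelated values, theta is near zero and the metric is left essentially as it was.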
|
|
||
| **Configuring the pre-exposure period:** When you enable CUPED, you can choose the lookback window under **Configuration → CUPED Pre-Exposure Period**. The available options are **1 Week**, **2 Weeks** (default), **4 Weeks**, **60 Days**, and **90 Days**. Longer windows give CUPED more historical data per user, which can improve variance reduction when behavior is stable over time, but they also exclude users whose history doesn't reach that far back. Shorter windows include more users but may capture less predictive signal. Two weeks is a good default for most experiments; consider a longer window if your metric has slow-moving or seasonal patterns. | ||
|
Contributor
This isn't correct. When you set the pre-exposure period, that's the length of time we'll query going back. The more you expand this pre-exposure period, the more users we'll actually pick up. We only filter out people who were active during this period but weren't exposed to the experiment. I think the right point to make here is that you want to choose the pre-exposure period so that it doesn't overlap with other experiments on the same user population as this one, otherwise there is loss of sensitivity and potential bias. And long enough to capture user behavior (say, if some of your users are active only once a month, you want to make sure you choose a range long enough to see them). I also asked Claude for feedback, I'll add a relevant point it gave:
|
||
|
|
||
| **Handling users without pre-experiment data:** Not all users in your experiment will have activity during the pre-exposure period. New users, or users who simply didn't perform the relevant event before the experiment, are assigned a value of zero for the pre-exposure metric. This allows all experiment users to be included while still benefiting from variance reduction for users who do have historical data. | ||
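The zero-fill rule described above can be sketched as a simple lookup. The function name `pre_exposure_covariate` and the data shapes are hypothetical, chosen only to show that every exposed user gets a pre-exposure value and that users outside the experiment are ignored.

```python
def pre_exposure_covariate(exposed_users, pre_values):
    """Return one pre-exposure value per exposed user, defaulting to zero."""
    # Users with no activity in the pre-exposure period get 0.0.
    # Users present in pre_values but never exposed are simply not looked up.
    return [pre_values.get(user, 0.0) for user in exposed_users]


exposed_users = ["a", "b", "c"]
pre_values = {"a": 3.0, "c": 1.5, "z": 9.0}  # "z" was active but not exposed
covariate = pre_exposure_covariate(exposed_users, pre_values)
print(covariate)  # [3.0, 0.0, 1.5]
```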
|
|
||
| **When to use CUPED:** | ||
|
|
||
We never say "FDR" again, so probably don't need to include the abbreviation here.