mixpanel · russell-loube-mixpanel · May 27, 2026 · kaan-barmore-genc-mixpanel · May 28, 2026 · kaan-barmore-genc-mixpanel
@@ -289,7 +289,8 @@
 
 | Technique | What It Does | When to Use | Actions to Take |
 | --- | --- | --- | --- |
-| **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. |
+| **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants and want maximum protection against any false positive | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. |
+| **Benjamini-Hochberg Correction** | Controls the proportion of significant results that are false positives (false discovery rate) | You're tracking many metrics and want to correct for multiple comparisons without being overly conservative | Treat the surviving significant results as your set of likely winners, knowing a small share may still be false positives. Validate the most important ones in a follow-up. |
 | **Winsorization** | Caps extreme outlier values at a percentile you choose | Revenue or value-based metrics where outliers are common | If results change substantially after Winsorization, this indicates your original results were driven by outliers. Decide whether your business decision is about typical users or extreme ones. |
 | **SRM** | Checks whether your variant split matches the configured allocation | Always on as a health check | Pause the experiment. Identify and fix the root cause of the mismatch, then restart the experiment. |
 | **Pre-Experiment Bias** | Checks whether variant groups were already different *before* the experiment started | Always on as a health check | Your groups had pre-existing bias. Consider enabling CUPED to correct for it. Investigate your assignment logic to prevent it in future experiments. |
@@ -309,7 +310,37 @@
 - You have multiple treatment variants competing against control
 - You want higher confidence that significant results are real
 
-Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction.
+Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction. If you're testing many metrics and Bonferroni feels too strict, consider Benjamini-Hochberg below.
+
+### Benjamini-Hochberg Correction
+
+Benjamini-Hochberg (BH) is a more balanced alternative to Bonferroni for handling multiple comparisons. Instead of controlling the probability that *any* significant result is a false positive (the family-wise error rate that Bonferroni targets), BH controls the **false discovery rate (FDR)** — the expected *proportion* of your "winners" that are actually false positives.
+
+This better matches how experiment decisions are actually made. You only ship winners, so what matters is the share of those winners that are real. If you have 4 significant metrics and on average 1 is a false positive, your false discovery rate is 25% — regardless of how many metrics you tested in total.
+
+**How it works:** Mixpanel ranks all p-values for the experiment from smallest to largest. Each p-value is compared to a threshold of (rank / total tests) × α, where α is 1 minus your confidence level (for example, 0.05 at 95% confidence). Mixpanel then finds the largest p-value that is at or below its threshold and marks it, along with every smaller p-value, as significant. This holds even if one of those smaller p-values missed its own threshold.
+
+For example, with 5 metrics at a 95% confidence level (α = 0.05):
+
+| Metric | P-value | Rank | BH Threshold (0.05 × rank / 5) | Significant? |
+| --- | --- | --- | --- | --- |
+| Revenue | 0.003 | 1 | 0.010 | ✅ |
+| Signups | 0.012 | 2 | 0.020 | ✅ |
+| Click-through | 0.018 | 3 | 0.030 | ✅ |
+| Time on page | 0.041 | 4 | 0.040 | ❌ |
+| Bounce rate | 0.062 | 5 | 0.050 | ❌ |
+
+Under BH, the first three metrics are significant. Under Bonferroni at the same confidence level, only Revenue would pass.
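
The ranking rule above can be sketched in a few lines of Python. This is a minimal illustration of the standard BH step-up procedure, not Mixpanel's implementation; the function names `benjamini_hochberg` and `bonferroni` are hypothetical, and the p-values are the ones from the example table.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if significant under BH."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below alpha * rank / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            max_rank = rank
    # That p-value and every smaller one are marked significant.
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant


def bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


metrics = ["Revenue", "Signups", "Click-through", "Time on page", "Bounce rate"]
p_values = [0.003, 0.012, 0.018, 0.041, 0.062]

bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)
for name, p, is_bh, is_bonf in zip(metrics, p_values, bh, bonf):
    print(f"{name:14s} p={p:.3f}  BH={is_bh}  Bonferroni={is_bonf}")
```

Running this marks Revenue, Signups, and Click-through as significant under BH and only Revenue under Bonferroni, matching the table.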
+
+**When to use Benjamini-Hochberg Correction:**
+
+- You're tracking many metrics and want correction without sacrificing too much statistical power
+- You're comfortable accepting that a small proportion of your significant results may be false positives, in exchange for catching more real effects
+- You want a correction method that scales gracefully as you add metrics
+
+Compared with Bonferroni, BH is much more likely to detect real effects, at the cost of a small, controlled share of false positives. It is the recommended choice when you're tracking many metrics or running multi-variant experiments and want to preserve statistical power.
+
+You can only apply one correction method at a time. Choose Bonferroni when the cost of acting on any false winner is high, and Benjamini-Hochberg when you're testing many metrics and want a more balanced trade-off between false positives and missed effects.
 
 ### Winsorization
 
@@ -363,6 +394,8 @@
 
 **How it works:** For each user, Mixpanel looks at their metric value during a pre-exposure period of your choosing and their metric value during the experiment. If these values are strongly correlated (users with high pre-experiment values tend to have high post-experiment values), CUPED uses this relationship to reduce variance in the experiment results. The mean values remain unchanged—CUPED only tightens the confidence intervals. This is applied to all metric categories: primary, secondary, and guardrail metrics.
 
+**Configuring the pre-exposure period:** When you enable CUPED, you can choose the lookback window under **Configuration → CUPED Pre-Exposure Period**. The available options are **1 Week**, **2 Weeks** (default), **4 Weeks**, **60 Days**, and **90 Days**. Longer windows give CUPED more historical data per user, which can improve variance reduction when behavior is stable over time, but fewer users have history that reaches that far back. Shorter windows cover more users with complete history but may capture less predictive signal. Two weeks is a good default for most experiments; consider a longer window if your metric has slow-moving or seasonal patterns.
+
 **Handling users without pre-experiment data:** Not all users in your experiment will have activity during the pre-exposure period. New users, or users who simply didn't perform the relevant event before the experiment, are assigned a value of zero for the pre-exposure metric. This allows all experiment users to be included while still benefiting from variance reduction for users who do have historical data.
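
The adjustment and the zero-fill rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not Mixpanel's implementation: it uses the textbook CUPED formula `adjusted = post - theta * (pre - mean(pre))` with `theta = cov(pre, post) / var(pre)`, works on a single group of users, and the function name `cuped_adjust` and the sample values are hypothetical.

```python
from statistics import mean, variance


def cuped_adjust(pre, post):
    """Return CUPED-adjusted metric values for one group of users.

    pre:  pre-exposure metric per user, None when the user has no history
    post: metric per user during the experiment
    """
    # Users without pre-exposure data are assigned zero, so nobody is dropped.
    x = [0.0 if v is None else float(v) for v in pre]
    y = [float(v) for v in post]
    x_bar, y_bar = mean(x), mean(y)
    var_x = variance(x)
    if var_x == 0:
        return y  # no variation in the covariate, nothing to adjust
    # Sample covariance between pre-exposure and experiment values.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / var_x
    # Subtract the part of each value that is predicted by pre-exposure behavior.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


pre = [10, None, 30, 5, None, 20]          # hypothetical pre-exposure values
post = [12.0, 1.0, 33.0, 6.0, 2.0, 19.0]   # hypothetical experiment values
adjusted = cuped_adjust(pre, post)

print("mean before/after:", mean(post), mean(adjusted))
print("variance before/after:", variance(post), variance(adjusted))
```

In this sketch the mean is unchanged because the correction term averages to zero, while the variance shrinks in proportion to how strongly the pre-exposure and experiment values are correlated.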
 
 **When to use CUPED:**