37 changes: 35 additions & 2 deletions pages/docs/experiments.mdx
@@ -289,7 +289,8 @@

| Technique | What It Does | When to Use | Actions to Take |
| --- | --- | --- | --- |
| **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. |
| **Bonferroni Correction** | Makes significance thresholds stricter to account for testing multiple metrics/variants at once | You're tracking multiple metrics or testing multiple variants and want maximum protection against any false positive | If a metric loses significance after correction, don't treat it as a confirmed winner. Run a follow-up experiment focused on that metric alone. |
| **Benjamini-Hochberg Correction** | Controls the proportion of significant results that are false positives (false discovery rate) | You're tracking many metrics and want to correct for multiple comparisons without being overly conservative | Treat the surviving significant results as your set of likely winners, knowing a small share may still be false positives. Validate the most important ones in a follow-up. |
| **Winsorization** | Caps extreme outlier values at a percentile you choose | Revenue or value-based metrics where outliers are common | If results change substantially after Winsorization, this indicates your original results were driven by outliers. Decide whether your business decision is about typical users or extreme ones. |
| **SRM** | Checks whether your variant split matches the configured allocation | Always on as a health check | Pause the experiment. Identify and fix the root cause of the mismatch, then restart the experiment. |
| **Pre-Experiment Bias** | Checks whether variant groups were already different *before* the experiment started | Always on as a health check | Your groups had pre-existing bias. Consider enabling CUPED to correct for it. Investigate your assignment logic to prevent it in future experiments. |
@@ -309,7 +310,37 @@
- You have multiple treatment variants competing against control
- You want higher confidence that significant results are real

Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction.
Bonferroni Correction is conservative. It reduces false positives but also makes it harder to detect true effects. If you have a single metric that matters most to you, you may prefer to focus on that primary metric without correction. If you're testing many metrics and Bonferroni feels too strict, consider Benjamini-Hochberg below.

GitHub Actions / spellcheck: check failures on lines 293, 313, 315, 317, and 335 in pages/docs/experiments.mdx: Unknown word (Benjamini), Unknown word (Hochberg)

### Benjamini-Hochberg Correction

Benjamini-Hochberg (BH) is a more balanced alternative to Bonferroni for handling multiple comparisons. Instead of controlling the probability that *any* significant result is a false positive (the family-wise error rate that Bonferroni targets), BH controls the **false discovery rate (FDR)** — the expected *proportion* of your "winners" that are actually false positives.

Contributor

We never say "FDR" again, so probably don't need to include the abbreviation here.


This better matches how experiment decisions are actually made. You only ship winners, so what matters is the share of those winners that are real. If you have 4 significant metrics and on average 1 is a false positive, your false discovery rate is 25% — regardless of how many metrics you tested in total.

**How it works:** Mixpanel ranks all p-values for the experiment from smallest to largest. Each p-value is compared to a threshold of (rank / total tests) × α, where α is 1 minus your confidence level (for example, α = 0.05 at 95% confidence). The largest p-value that passes its threshold, and every p-value smaller than it, is marked significant, even if one of those smaller p-values misses its own threshold.

For example, with 5 metrics at a 95% confidence level (α = 0.05):

| Metric | P-value | Rank | BH Threshold (0.05 × rank / 5) | Significant? |
| --- | --- | --- | --- | --- |
| Revenue | 0.003 | 1 | 0.010 | ✅ |
| Signups | 0.012 | 2 | 0.020 | ✅ |
| Click-through | 0.018 | 3 | 0.030 | ✅ |
| Time on page | 0.041 | 4 | 0.040 | ❌ |
| Bounce rate | 0.062 | 5 | 0.050 | ❌ |

Under BH, the first three metrics are significant. Under Bonferroni at the same confidence level, only Revenue would pass.
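
The worked example above can be reproduced with a short Python sketch. This is an illustration of the standard Benjamini-Hochberg step-up procedure and the Bonferroni rule, not Mixpanel's implementation; the function names `benjamini_hochberg` and `bonferroni` and the `metrics` dictionary are assumptions made for the example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if significant under BH.

    Hypothetical helper for illustration only.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value is at or below alpha * rank / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


def bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# P-values from the table above (95% confidence level, so alpha = 0.05).
metrics = {
    "Revenue": 0.003,
    "Signups": 0.012,
    "Click-through": 0.018,
    "Time on page": 0.041,
    "Bounce rate": 0.062,
}
p_values = list(metrics.values())

bh_result = benjamini_hochberg(p_values)
bonf_result = bonferroni(p_values)

print(bh_result)    # [True, True, True, False, False]
print(bonf_result)  # [True, False, False, False, False]
```

Note that the cutoff is the largest passing rank, so a p-value can be marked significant even when it misses its own threshold, as long as a larger p-value passes.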

**When to use Benjamini-Hochberg Correction:**

- You're tracking many metrics and want correction without sacrificing too much statistical power
- You're comfortable accepting that a small proportion of your significant results may be false positives, in exchange for catching more real effects
- You want a correction method that scales gracefully as you add metrics

Compared with Bonferroni, BH is much more likely to detect real effects, at the cost of letting a small proportion of false positives through. BH is the recommended choice when you're tracking many metrics or running multi-variant experiments and want to preserve statistical power.

You can only apply one correction method at a time. Choose Bonferroni when the cost of acting on any false winner is high, and Benjamini-Hochberg when you're testing many metrics and want a more balanced trade-off between false positives and missed effects.

### Winsorization

@@ -363,6 +394,8 @@

**How it works:** For each user, Mixpanel looks at their metric value during a pre-exposure period of your choosing and their metric value during the experiment. If these values are strongly correlated (users with high pre-experiment values tend to have high post-experiment values), CUPED uses this relationship to reduce variance in the experiment results. The mean values remain unchanged—CUPED only tightens the confidence intervals. This is applied to all metric categories: primary, secondary, and guardrail metrics.

**Configuring the pre-exposure period:** When you enable CUPED, you can choose the lookback window under **Configuration → CUPED Pre-Exposure Period**. The available options are **1 Week**, **2 Weeks** (default), **4 Weeks**, **60 Days**, and **90 Days**. Longer windows give CUPED more historical data per user, which can improve variance reduction when behavior is stable over time, but they also exclude users whose history doesn't reach that far back. Shorter windows include more users but may capture less predictive signal. Two weeks is a good default for most experiments; consider a longer window if your metric has slow-moving or seasonal patterns.
Contributor

This isn't correct. When you set the pre-exposure period, that's the length of time we'll query going back. The more you expand this pre-exposure period, the more users we'll actually pick up. We only filter out people who were active during this period but weren't exposed to the experiment.

I think the right point to make here is that you want to choose the pre-exposure period so that it doesn't overlap with other experiments on the same user population as this one; otherwise there is loss of sensitivity and potential bias. It should also be long enough to capture user behavior (say, if some of your users are active only once a month, you want to make sure you choose a range long enough to see them).

I also asked Claude for feedback; I'll add a relevant point it gave:

A longer window is not always better — stale behavior from far in the past can be less predictive of current behavior, and flooding the population with X = 0 (new users, infrequent users) drags ρ toward zero.


**Handling users without pre-experiment data:** Not all users in your experiment will have activity during the pre-exposure period. New users, or users who simply didn't perform the relevant event before the experiment, are assigned a value of zero for the pre-exposure metric. This allows all experiment users to be included while still benefiting from variance reduction for users who do have historical data.
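
The zero-fill and variance reduction described above can be sketched in Python. This is a minimal illustration of the standard CUPED adjustment, not Mixpanel's implementation: the function name `cuped_adjust`, the use of `None` to mark users with no pre-exposure activity, the sample values, and the choice to pool all users into a single list are assumptions made for the example.

```python
from statistics import fmean, pvariance


def cuped_adjust(pre, post):
    """Return CUPED-adjusted experiment values (hypothetical helper).

    pre:  pre-exposure metric value per user, or None if the user had
          no activity in the pre-exposure period.
    post: metric value per user during the experiment.
    """
    # Users without pre-exposure data are assigned a value of zero.
    x = [0.0 if v is None else float(v) for v in pre]
    y = [float(v) for v in post]

    x_bar, y_bar = fmean(x), fmean(y)
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    if var_x == 0:
        # No variation in the covariate, so there is nothing to adjust.
        return y

    # theta = cov(x, y) / var(x): how strongly pre-exposure values
    # predict values during the experiment.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    theta = cov_xy / var_x

    # Subtract the part of each value explained by pre-exposure behavior.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Assumed sample data: two users have no pre-exposure activity.
pre = [10, 12, None, 8, 15, None, 9, 11]
post = [11, 13, 2, 9, 16, 3, 10, 12]

adjusted = cuped_adjust(pre, post)

print(fmean(post), fmean(adjusted))          # means match up to rounding
print(pvariance(post), pvariance(adjusted))  # adjusted variance is smaller
```

In this sketch the overall mean is preserved because the adjustment terms sum to zero, while the variance shrinks in proportion to how well pre-exposure values predict experiment values. A population with many zero-filled users weakens that correlation and therefore the variance reduction.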

**When to use CUPED:**