statistics: collect singleton sketches in row sampler by 0xPoe · Pull Request #68499 · pingcap/tidb

0xPoe · 2026-05-19T15:42:45Z

What problem does this PR solve?

Issue Number: ref #67449

Problem Summary:

Distributed row-sampling ANALYZE has no way to derive a population NDV when sketches are gathered from a sub-sample of rows. The existing FM sketch counts every distinct value, which biases the estimate downward.

What changed and how does it work?

Split out of #68157 — the pkg/statistics half only. Executor wiring stays in #68157.

Add a singletonSketchBuilder that tracks values seen exactly once per node; only singletons feed a separate FM sketch, which is the basis for population NDV estimation under sub-sampling.
Add NDVSampleSkipRate on RowSampleBuilder (default 1 = sketches collected from every row; no behavior change in this PR).
Rescale NullCount / TotalSizes to per-population estimates when a sub-sample is used.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Summary by CodeRabbit

New Features
- NDV sub-sampling added to statistics collection with a default NDV sample rate.
Tests
- New and expanded tests covering default NDV behavior, singleton-sketch generation and serialization, rescaling under sub-sampling, and sketch size limits.
Chores
- Full-sampling analyzer path and test setup configured to use the default NDV sample rate.

pantheon-ai · 2026-05-19T15:42:53Z

@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

_{ℹ️ Learn more details on Pantheon AI.}

coderabbitai · 2026-05-19T15:43:13Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds configurable NDV sub-sampling to row statistics via per-node singleton FM-sketches, extends collection state/flow (gating, sample counts, rescaling), updates proto (de)serialization and merge logic, adds tests, and bumps a test shard count.

Changes

NDV Sub-Sampling Feature

Layer / File(s)	Summary
NDV constants, singleton sketch types, and helpers `pkg/statistics/constants.go`, `pkg/statistics/row_sampler.go`	`DefaultNDVSampleRate` added. `baseCollector` gets `SingletonSketches` and `SketchSampleCount`. `singletonSketchBuilder` introduced to deduplicate hashes; `ShouldBuildSingletonSketches` helper and imports added; `RowSampleBuilder` gains `NDVSampleRate`.
Collector initialization and analyzer wiring `pkg/statistics/row_sampler.go`, `pkg/store/mockstore/unistore/cophandler/analyze.go`	Reservoir and Bernoulli collectors allocate `SingletonSketches`. `RowSampleBuilder.Collect` validates `NDVSampleRate`, initializes singleton builders and FM sketch slots. Full-sampling analyzer sets `NDVSampleRate: statistics.DefaultNDVSampleRate`.
Row collection with NDV sub-sampling gate and rescaling `pkg/statistics/row_sampler.go`	Per-row NDV sampling gate controls whether rows feed FM sketches and singleton builders; sampled rows increment `SketchSampleCount`. Single-column groups clone singleton builders; multi-column groups use row-level `InsertRowValue`. `NullCount` and `TotalSizes` are rescaled to full-population estimates after subsampling.
Serialization, deserialization, and merge for singleton sketches `pkg/statistics/row_sampler.go`	`ToProto` emits `SingletonSketches` and `SketchSampleCount`. `FromProto` reconstructs singleton sketches and reads `SketchSampleCount`. Merge logic accumulates `SketchSampleCount` and merges singleton sketches across collectors.
Tests, serialization tests, and BUILD tweak `pkg/statistics/sample_test.go`, `pkg/statistics/BUILD.bazel`	Adds `TestRescaleSampledValue` and several `SubTestRowSample*` subtests validating default NDV behavior, singleton-sketch determinism/serialization, null-count rescaling, and max-size enforcement. Existing sampling tests set `NDVSampleRate: DefaultNDVSampleRate`. `statistics_test` shard_count incremented from 44 to 45.

Sequence Diagram

sequenceDiagram
  participant Builder as RowSampleBuilder
  participant RNG as NDVSampler
  participant SBuilder as singletonSketchBuilder
  participant Collector as baseCollector
  Builder->>RNG: decide collectSketch (NDVSampleRate)
  alt collectSketch == true
    Builder->>SBuilder: insertRowValue / insertHash
    Builder->>Collector: feed FM sketches
    Builder->>Collector: increment SketchSampleCount
  else collectSketch == false
    Builder->>Collector: skip sketch inputs
  end
  Builder->>Collector: rescale NullCount & TotalSizes if sub-sampled

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

pingcap/tidb#68158: Adds singleton-sketch protobuf fields (SingletonSketch, SketchSampleCount) referenced by this PR.
pingcap/tidb#67593: Consumes per-node singletonSketches to estimate global distinct values.
pingcap/tidb#68414: Updates tipb protobuf and FM-sketch hashing used by singleton-sketch serialization.

Suggested reviewers

time-and-fate
mjonss
henrybw

Poem

I nibble hashes, one by one,
I mark the singletons under the sun.
Sampled rows I count and keep,
Rescaled totals from the heap.
Hop, sketch, estimate — a rabbit’s leap! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding singleton sketches collection in the row sampler for better NDV estimation under sub-sampling.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description includes the required issue number (ref `#67449`), a clear problem summary explaining the NDV bias issue, detailed description of changes made, and appropriate checkboxes. All required sections are present and substantively completed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 246-248: The code currently reads s.NDVSampleRate directly causing
a panic when NDVSampleRate is omitted (zero); change the initialization to honor
the default by setting ndvSampleRate := s.NDVSampleRate; if ndvSampleRate == 0 {
ndvSampleRate = DefaultNDVSampleRate } and then keep the existing validation
(intest.Assert(ndvSampleRate > 0 && ndvSampleRate <= DefaultNDVSampleRate, ...))
before calling ShouldBuildSingletonSketches(ndvSampleRate); this preserves the
old zero-value behavior while still validating explicit values.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bccaac0f-5a91-4885-9ceb-efb53fad6117

📥 Commits

Reviewing files that changed from the base of the PR and between 548ad7d and 54da5cb.

📒 Files selected for processing (4)

pkg/statistics/BUILD.bazel
pkg/statistics/constants.go
pkg/statistics/row_sampler.go
pkg/statistics/sample_test.go

codecov · 2026-05-19T16:03:50Z

Codecov Report

❌ Patch coverage is 29.41176% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.8614%. Comparing base (2b9527b) to head (60e5514).
⚠️ Report is 31 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #68499        +/-   ##
================================================
- Coverage   77.2806%   74.8614%   -2.4193%     
================================================
  Files          2010       2030        +20     
  Lines        555351     573653     +18302     
================================================
+ Hits         429179     429445       +266     
- Misses       125252     143570     +18318     
+ Partials        920        638       -282

Flag	Coverage Δ
integration	`41.4005% <29.4117%> (+1.6075%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`60.4679% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`48.7418% <ø> (-14.2673%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 407-415: The function rescaleSampledValue currently asserts
sampleCount>0 but doesn't clamp negative or zero inputs as the comment promises;
update rescaleSampledValue to return 0 early when sampled <= 0 or totalRowCount
<= 0 (keeping the existing intest.Assert(sampleCount > 0)), and otherwise
perform the same round-half-up scaling using (sampled*totalRowCount +
sampleCount/2) / sampleCount so negative sampled values no longer produce
negative estimates.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e3d22ff4-8bc8-4630-a8f1-6fbe7a01f647

📥 Commits

Reviewing files that changed from the base of the PR and between 9f3a14b and b0259e3.

📒 Files selected for processing (5)

pkg/statistics/BUILD.bazel
pkg/statistics/constants.go
pkg/statistics/row_sampler.go
pkg/statistics/sample_test.go
pkg/store/mockstore/unistore/cophandler/analyze.go

✅ Files skipped from review due to trivial changes (2)

pkg/statistics/constants.go
pkg/store/mockstore/unistore/cophandler/analyze.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/statistics/BUILD.bazel

0xPoe

🔢 Self-check (PR reviewed by myself and ready for feedback)

Code compiles successfully
Unit tests added
No AI-generated elegant nonsense in PR.
Comments added where necessary
PR title and description updated
Documentation PR created (or confirmed not needed)
PR size is reasonable

/cc @time-and-fate @winoros

0xPoe · 2026-05-21T19:27:19Z

/retest

ti-chi-bot · 2026-05-23T11:56:11Z

[LGTM Timeline notifier]

Timeline:

2026-05-23 11:56:11.093685811 +0000 UTC m=+94041.063850873: ☑️ agreed by winoros.

Signed-off-by: 0xPoe <techregister@pm.me>

time-and-fate · 2026-05-25T07:27:40Z

 	ColGroups       [][]int64
 	MaxSampleSize   int
 	SampleRate      float64
+	NDVSampleRate   float64


I think current comments on NDVSampleSkipRate are not clear, and we should clearly explain the relationship and the difference between the two sample rates.

Explain that these are two independent sampling mechanisms.

Use unified terminology. Currently, we are using NDVSampleRate but SketchSampleCount.

Explain that this doesn't only affect the NDV sampling but also affects the collection of NULL count and row size, and it's also related to the collection of the new singleton sketch.

I have addressed 1 and 3 in 49f2459 (this PR).

For 2, I think it is hard to unify them at that level. I called it SketchSampleCount because it is used to collect the FMSketches and SingletonSketches, so SketchSampleCount seems natural. This is mainly meant for implementation. But for NDV rate, users do not need to know all these sketches, so NDV rate would be easier to understand. I think changing it to NDVSampleCount is also OK, but SketchSampleCount seems more tightly tied to these two sketches we use to build NDV.

Added more comments for it in ef312c7 (this PR)

time-and-fate · 2026-05-25T07:28:57Z

+	// NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row.
+	NDVSampleSkipRate = 1


Why don't we just call it DefaultNDVSampleRate?

Suggested change

// NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row.

NDVSampleSkipRate = 1

// NDVSampleSkipRate is the sample rate at which sampling is skipped. The default value is 1, so NDV sketches are built from every row.

NDVSampleSkipRate = 1

See #68499 (comment)

DefaultNDVSampleRate would be used when we integrate it with the priority queue. Then its default value would be 0.05, I guess. This is the value we use to skip sampling by default.

Or otherwise, we would end up with two defaults:

DefaultNDVSampleRate for normal tables as 1.

DefaultNDVSampleRateForBigTable for big tables as 0.05??

time-and-fate · 2026-05-25T07:48:53Z

+	for i, singletonSketch := range singletonSketches {
+		s.SingletonSketches[i].MergeFMSketch(singletonSketch)
+	}


Is this the correct behavior? I think we need to keep the sketch from every region.

You are right. We should not do the worker-level merge first.

When I did my tests, I think I did it with this worker-level merge. It seems that doing it per region would consume much more memory and burn more CPU.

time-and-fate · 2026-05-25T07:51:21Z

+func ShouldBuildSingletonSketches(rate float64) bool {
+	return rate < NDVSampleSkipRate
+}


I think we don't need an extra function since it's used only once.

The reason I moved it out is that it would be easier to add some comments for it. I am OK to inline it and add the comment for the variable.

And it would be used twice in the following PR. See https://patch-diff.githubusercontent.com/raw/pingcap/tidb/pull/68157.diff and search for it.

Signed-off-by: 0xPoe <techregister@pm.me>

ti-chi-bot · 2026-05-25T09:10:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: winoros
Once this PR has been reviewed and has the lgtm label, please assign terry1purcell for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [winoros]
pkg/statistics/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: 0xPoe <techregister@pm.me>

wuhuizuo · 2026-05-25T11:16:51Z

/test pull-integration-realcluster-test-next-gen

MergeCollector unioned the per-region singleton FM sketches. Union conflates them: a value that is a singleton in two regions occurs twice overall, yet stays in the union and was miscounted as a global singleton, inflating the NDV estimate. Keep each region's NDV and singleton sketches separate in a new RegionSketchSummary so EstimateGlobalSingletonBySketches can discount such values across regions. The merge collects one summary per region instead of unioning, preserving per-region granularity through every merge level. ref pingcap#67449 Signed-off-by: 0xPoe <techregister@pm.me>

tiprow · 2026-05-25T11:17:07Z

@wuhuizuo: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow

/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test pull-integration-realcluster-test-next-gen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot · 2026-05-25T11:30:33Z

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-mysql-client-test-next-gen	`60e5514`	link	true	`/test pull-mysql-client-test-next-gen`
pull-mysql-client-test	`60e5514`	link	true	`/test pull-mysql-client-test`
pull-lightning-integration-test	`60e5514`	link	true	`/test pull-lightning-integration-test`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

0xPoe · 2026-05-25T15:16:04Z

I need to rework this solution, so I’m closing this one. I will resend it when it is ready.

ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label May 19, 2026

ti-chi-bot Bot added component/statistics sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 19, 2026

0xPoe mentioned this pull request May 19, 2026

WIP: statistics, executor: collect singleton sketches for row sampling #68157

Closed

13 tasks

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread pkg/statistics/row_sampler.go

0xPoe force-pushed the stats-collect-singleton-sketches branch 4 times, most recently from 42d9035 to b0259e3 Compare May 19, 2026 16:57

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread pkg/statistics/row_sampler.go

0xPoe force-pushed the stats-collect-singleton-sketches branch 3 times, most recently from 4944e09 to b46e664 Compare May 19, 2026 17:24

0xPoe commented May 19, 2026

View reviewed changes

ti-chi-bot Bot requested review from time-and-fate and winoros May 19, 2026 18:13

winoros reviewed May 20, 2026

View reviewed changes

Comment thread pkg/statistics/row_sampler.go Outdated

0xPoe requested a review from winoros May 21, 2026 08:17

0xPoe force-pushed the stats-collect-singleton-sketches branch from b46e664 to 1625571 Compare May 21, 2026 17:28

0xPoe force-pushed the stats-collect-singleton-sketches branch 2 times, most recently from 6e4f581 to 9e72f06 Compare May 21, 2026 19:36

winoros approved these changes May 23, 2026

View reviewed changes

Comment thread pkg/statistics/row_sampler.go Outdated

Comment thread pkg/statistics/constants.go Outdated

ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 23, 2026

statistics: collect singleton sketches in row sampler

7eebdb8

Signed-off-by: 0xPoe <techregister@pm.me>

0xPoe force-pushed the stats-collect-singleton-sketches branch from 9e72f06 to 7eebdb8 Compare May 24, 2026 14:23

time-and-fate reviewed May 25, 2026

View reviewed changes

docs: add comments for two sample rates

49f2459

Signed-off-by: 0xPoe <techregister@pm.me>

0xPoe force-pushed the stats-collect-singleton-sketches branch from 38c13a3 to 49f2459 Compare May 25, 2026 09:22

0xPoe requested a review from time-and-fate May 25, 2026 09:25

docs: add comments for Sketeches

ef312c7

Signed-off-by: 0xPoe <techregister@pm.me>

ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 25, 2026

0xPoe closed this May 25, 2026

		// NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row.
		NDVSampleSkipRate = 1

Conversation

0xPoe commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

pantheon-ai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

0xPoe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

0xPoe commented May 21, 2026

Uh oh!

Uh oh!

Uh oh!

ti-chi-bot Bot commented May 23, 2026

[LGTM Timeline notifier]

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot Bot commented May 25, 2026

Uh oh!

wuhuizuo commented May 25, 2026

Uh oh!

tiprow Bot commented May 25, 2026

0xPoe commented May 19, 2026 •

edited

Loading

pantheon-ai Bot commented May 19, 2026 •

edited

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading