Skip to content

statistics: collect singleton sketches in row sampler#68499

Closed
0xPoe wants to merge 4 commits into
pingcap:masterfrom
0xPoe:stats-collect-singleton-sketches
Closed

statistics: collect singleton sketches in row sampler#68499
0xPoe wants to merge 4 commits into
pingcap:masterfrom
0xPoe:stats-collect-singleton-sketches

Conversation

@0xPoe
Copy link
Copy Markdown
Member

@0xPoe 0xPoe commented May 19, 2026

What problem does this PR solve?

Issue Number: ref #67449

Problem Summary:

Distributed row-sampling ANALYZE has no way to derive a population NDV when sketches are gathered from a sub-sample of rows. The existing FM sketch counts every distinct value, which biases the estimate downward.

What changed and how does it work?

Split out of #68157 — the pkg/statistics half only. Executor wiring stays in #68157.

  • Add a singletonSketchBuilder that tracks values seen exactly once per node; only singletons feed a separate FM sketch, which is the basis for population NDV estimation under sub-sampling.
  • Add NDVSampleSkipRate on RowSampleBuilder (default 1 = sketches collected from every row; no behavior change in this PR).
  • Rescale NullCount / TotalSizes to per-population estimates when a sub-sample is used.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • NDV sub-sampling added to statistics collection with a default NDV sample rate.
  • Tests

    • New and expanded tests covering default NDV behavior, singleton-sketch generation and serialization, rescaling under sub-sampling, and sketch size limits.
  • Chores

    • Full-sampling analyzer path and test setup configured to use the default NDV sample rate.

Review Change Stack

@ti-chi-bot ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label May 19, 2026
@pantheon-ai
Copy link
Copy Markdown

pantheon-ai Bot commented May 19, 2026

@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added component/statistics sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 19, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds configurable NDV sub-sampling to row statistics via per-node singleton FM-sketches, extends collection state/flow (gating, sample counts, rescaling), updates proto (de)serialization and merge logic, adds tests, and bumps a test shard count.

Changes

NDV Sub-Sampling Feature

Layer / File(s) Summary
NDV constants, singleton sketch types, and helpers
pkg/statistics/constants.go, pkg/statistics/row_sampler.go
DefaultNDVSampleRate added. baseCollector gets SingletonSketches and SketchSampleCount. singletonSketchBuilder introduced to deduplicate hashes; ShouldBuildSingletonSketches helper and imports added; RowSampleBuilder gains NDVSampleRate.
Collector initialization and analyzer wiring
pkg/statistics/row_sampler.go, pkg/store/mockstore/unistore/cophandler/analyze.go
Reservoir and Bernoulli collectors allocate SingletonSketches. RowSampleBuilder.Collect validates NDVSampleRate, initializes singleton builders and FM sketch slots. Full-sampling analyzer sets NDVSampleRate: statistics.DefaultNDVSampleRate.
Row collection with NDV sub-sampling gate and rescaling
pkg/statistics/row_sampler.go
Per-row NDV sampling gate controls whether rows feed FM sketches and singleton builders; sampled rows increment SketchSampleCount. Single-column groups clone singleton builders; multi-column groups use row-level InsertRowValue. NullCount and TotalSizes are rescaled to full-population estimates after subsampling.
Serialization, deserialization, and merge for singleton sketches
pkg/statistics/row_sampler.go
ToProto emits SingletonSketches and SketchSampleCount. FromProto reconstructs singleton sketches and reads SketchSampleCount. Merge logic accumulates SketchSampleCount and merges singleton sketches across collectors.
Tests, serialization tests, and BUILD tweak
pkg/statistics/sample_test.go, pkg/statistics/BUILD.bazel
Adds TestRescaleSampledValue and several SubTestRowSample* subtests validating default NDV behavior, singleton-sketch determinism/serialization, null-count rescaling, and max-size enforcement. Existing sampling tests set NDVSampleRate: DefaultNDVSampleRate. statistics_test shard_count incremented from 44 to 45.

Sequence Diagram

sequenceDiagram
  participant Builder as RowSampleBuilder
  participant RNG as NDVSampler
  participant SBuilder as singletonSketchBuilder
  participant Collector as baseCollector
  Builder->>RNG: decide collectSketch (NDVSampleRate)
  alt collectSketch == true
    Builder->>SBuilder: insertRowValue / insertHash
    Builder->>Collector: feed FM sketches
    Builder->>Collector: increment SketchSampleCount
  else collectSketch == false
    Builder->>Collector: skip sketch inputs
  end
  Builder->>Collector: rescale NullCount & TotalSizes if sub-sampled
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • pingcap/tidb#68158: Adds singleton-sketch protobuf fields (SingletonSketch, SketchSampleCount) referenced by this PR.
  • pingcap/tidb#67593: Consumes per-node singletonSketches to estimate global distinct values.
  • pingcap/tidb#68414: Updates tipb protobuf and FM-sketch hashing used by singleton-sketch serialization.

Suggested reviewers

  • time-and-fate
  • mjonss
  • henrybw

Poem

I nibble hashes, one by one,
I mark the singletons under the sun.
Sampled rows I count and keep,
Rescaled totals from the heap.
Hop, sketch, estimate — a rabbit’s leap! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding singleton sketches collection in the row sampler for better NDV estimation under sub-sampling.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description includes the required issue number (ref #67449), a clear problem summary explaining the NDV bias issue, detailed description of changes made, and appropriate checkboxes. All required sections are present and substantively completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 246-248: The code currently reads s.NDVSampleRate directly causing
a panic when NDVSampleRate is omitted (zero); change the initialization to honor
the default by setting ndvSampleRate := s.NDVSampleRate; if ndvSampleRate == 0 {
ndvSampleRate = DefaultNDVSampleRate } and then keep the existing validation
(intest.Assert(ndvSampleRate > 0 && ndvSampleRate <= DefaultNDVSampleRate, ...))
before calling ShouldBuildSingletonSketches(ndvSampleRate); this preserves the
old zero-value behavior while still validating explicit values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bccaac0f-5a91-4885-9ceb-efb53fad6117

📥 Commits

Reviewing files that changed from the base of the PR and between 548ad7d and 54da5cb.

📒 Files selected for processing (4)
  • pkg/statistics/BUILD.bazel
  • pkg/statistics/constants.go
  • pkg/statistics/row_sampler.go
  • pkg/statistics/sample_test.go

Comment thread pkg/statistics/row_sampler.go
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 29.41176% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.8614%. Comparing base (2b9527b) to head (60e5514).
⚠️ Report is 31 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68499        +/-   ##
================================================
- Coverage   77.2806%   74.8614%   -2.4193%     
================================================
  Files          2010       2030        +20     
  Lines        555351     573653     +18302     
================================================
+ Hits         429179     429445       +266     
- Misses       125252     143570     +18318     
+ Partials        920        638       -282     
Flag Coverage Δ
integration 41.4005% <29.4117%> (+1.6075%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4679% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7418% <ø> (-14.2673%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch 4 times, most recently from 42d9035 to b0259e3 Compare May 19, 2026 16:57
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 407-415: The function rescaleSampledValue currently asserts
sampleCount>0 but doesn't clamp negative or zero inputs as the comment promises;
update rescaleSampledValue to return 0 early when sampled <= 0 or totalRowCount
<= 0 (keeping the existing intest.Assert(sampleCount > 0)), and otherwise
perform the same round-half-up scaling using (sampled*totalRowCount +
sampleCount/2) / sampleCount so negative sampled values no longer produce
negative estimates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e3d22ff4-8bc8-4630-a8f1-6fbe7a01f647

📥 Commits

Reviewing files that changed from the base of the PR and between 9f3a14b and b0259e3.

📒 Files selected for processing (5)
  • pkg/statistics/BUILD.bazel
  • pkg/statistics/constants.go
  • pkg/statistics/row_sampler.go
  • pkg/statistics/sample_test.go
  • pkg/store/mockstore/unistore/cophandler/analyze.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/statistics/constants.go
  • pkg/store/mockstore/unistore/cophandler/analyze.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/statistics/BUILD.bazel

Comment thread pkg/statistics/row_sampler.go
@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch 3 times, most recently from 4944e09 to b46e664 Compare May 19, 2026 17:24
Copy link
Copy Markdown
Member Author

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔢 Self-check (PR reviewed by myself and ready for feedback)

  • Code compiles successfully

  • Unit tests added

  • No AI-generated elegant nonsense in PR.

  • Comments added where necessary

  • PR title and description updated

  • Documentation PR created (or confirmed not needed)

  • PR size is reasonable

/cc @time-and-fate @winoros

@ti-chi-bot ti-chi-bot Bot requested review from time-and-fate and winoros May 19, 2026 18:13
Comment thread pkg/statistics/row_sampler.go Outdated
@0xPoe 0xPoe requested a review from winoros May 21, 2026 08:17
@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch from b46e664 to 1625571 Compare May 21, 2026 17:28
@0xPoe
Copy link
Copy Markdown
Member Author

0xPoe commented May 21, 2026

/retest

@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch 2 times, most recently from 6e4f581 to 9e72f06 Compare May 21, 2026 19:36
Comment thread pkg/statistics/row_sampler.go Outdated
Comment thread pkg/statistics/constants.go Outdated
@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 23, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 23, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-23 11:56:11.093685811 +0000 UTC m=+94041.063850873: ☑️ agreed by winoros.

Signed-off-by: 0xPoe <techregister@pm.me>
@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch from 9e72f06 to 7eebdb8 Compare May 24, 2026 14:23
ColGroups [][]int64
MaxSampleSize int
SampleRate float64
NDVSampleRate float64
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think current comments on NDVSampleSkipRate are not clear, and we should clearly explain the relationship and the difference between the two sample rates.

  1. Explain that these are two independent sampling mechanisms.
  2. Use unified terminology. Currently, we are using NDVSampleRate but SketchSampleCount.
  3. Explain that this doesn't only affect the NDV sampling but also affects the collection of NULL count and row size, and it's also related to the collection of the new singleton sketch.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed 1 and 3 in 49f2459 (this PR).

For 2, I think it is hard to unify them at that level. I called it SketchSampleCount because it is used to collect the FMSketches and SingletonSketches, so SketchSampleCount seems natural. This is mainly meant for implementation. But for NDV rate, users do not need to know all these sketches, so NDV rate would be easier to understand. I think changing it to NDVSampleCount is also OK, but SketchSampleCount seems more tightly tied to these two sketches we use to build NDV.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments for it in ef312c7 (this PR)

Comment on lines +23 to +24
// NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row.
NDVSampleSkipRate = 1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just call it DefaultNDVSampleRate?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row.
NDVSampleSkipRate = 1
// NDVSampleSkipRate is the sample rate at which sampling is skipped. The default value is 1, so NDV sketches are built from every row.
NDVSampleSkipRate = 1

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DefaultNDVSampleRate would be used when we integrate it with the priority queue. Then its default value would be 0.05, I guess. This is the value we use to skip sampling by default.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or otherwise, we would end up with two defaults:

DefaultNDVSampleRate for normal tables as 1.

DefaultNDVSampleRateForBigTable for big tables as 0.05??

Comment thread pkg/statistics/row_sampler.go Outdated
Comment on lines +464 to +466
for i, singletonSketch := range singletonSketches {
s.SingletonSketches[i].MergeFMSketch(singletonSketch)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the correct behavior? I think we need to keep the sketch from every region.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. We should not do the worker-level merge first.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I did my tests, I think I did it with this worker-level merge. It seems that doing it per region would consume much more memory and burn more CPU.

Comment on lines +152 to +154
func ShouldBuildSingletonSketches(rate float64) bool {
return rate < NDVSampleSkipRate
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need an extra function since it's used only once.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I moved it out is that it would be easier to add some comments for it. I am OK to inline it and add the comment for the variable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it would be used twice in the following PR. See https://patch-diff.githubusercontent.com/raw/pingcap/tidb/pull/68157.diff and search for it.

Signed-off-by: 0xPoe <techregister@pm.me>
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: winoros
Once this PR has been reviewed and has the lgtm label, please assign terry1purcell for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@0xPoe 0xPoe force-pushed the stats-collect-singleton-sketches branch from 38c13a3 to 49f2459 Compare May 25, 2026 09:22
@0xPoe 0xPoe requested a review from time-and-fate May 25, 2026 09:25
Signed-off-by: 0xPoe <techregister@pm.me>
@wuhuizuo
Copy link
Copy Markdown
Contributor

/test pull-integration-realcluster-test-next-gen

MergeCollector unioned the per-region singleton FM sketches. Union conflates them: a value that is a singleton in two regions occurs twice overall, yet stays in the union and was miscounted as a global singleton, inflating the NDV estimate.

Keep each region's NDV and singleton sketches separate in a new RegionSketchSummary so EstimateGlobalSingletonBySketches can discount such values across regions. The merge collects one summary per region instead of unioning, preserving per-region granularity through every merge level.

ref pingcap#67449

Signed-off-by: 0xPoe <techregister@pm.me>
@tiprow
Copy link
Copy Markdown

tiprow Bot commented May 25, 2026

@wuhuizuo: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test fast_test_tiprow
/test tidb_parser_test

Use /test all to run all jobs.

Details

In response to this:

/test pull-integration-realcluster-test-next-gen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 25, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented May 25, 2026

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-mysql-client-test-next-gen 60e5514 link true /test pull-mysql-client-test-next-gen
pull-mysql-client-test 60e5514 link true /test pull-mysql-client-test
pull-lightning-integration-test 60e5514 link true /test pull-lightning-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@0xPoe
Copy link
Copy Markdown
Member Author

0xPoe commented May 25, 2026

I need to rework this solution, so I’m closing this one. I will resend it when it is ready.

@0xPoe 0xPoe closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/statistics needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants