statistics: collect singleton sketches in row sampler#68499
Conversation
|
@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds configurable NDV sub-sampling to row statistics via per-node singleton FM-sketches, extends collection state/flow (gating, sample counts, rescaling), updates proto (de)serialization and merge logic, adds tests, and bumps a test shard count. ChangesNDV Sub-Sampling Feature
Sequence DiagramsequenceDiagram
participant Builder as RowSampleBuilder
participant RNG as NDVSampler
participant SBuilder as singletonSketchBuilder
participant Collector as baseCollector
Builder->>RNG: decide collectSketch (NDVSampleRate)
alt collectSketch == true
Builder->>SBuilder: insertRowValue / insertHash
Builder->>Collector: feed FM sketches
Builder->>Collector: increment SketchSampleCount
else collectSketch == false
Builder->>Collector: skip sketch inputs
end
Builder->>Collector: rescale NullCount & TotalSizes if sub-sampled
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 246-248: The code currently reads s.NDVSampleRate directly causing
a panic when NDVSampleRate is omitted (zero); change the initialization to honor
the default by setting ndvSampleRate := s.NDVSampleRate; if ndvSampleRate == 0 {
ndvSampleRate = DefaultNDVSampleRate } and then keep the existing validation
(intest.Assert(ndvSampleRate > 0 && ndvSampleRate <= DefaultNDVSampleRate, ...))
before calling ShouldBuildSingletonSketches(ndvSampleRate); this preserves the
old zero-value behavior while still validating explicit values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: bccaac0f-5a91-4885-9ceb-efb53fad6117
📒 Files selected for processing (4)
pkg/statistics/BUILD.bazelpkg/statistics/constants.gopkg/statistics/row_sampler.gopkg/statistics/sample_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #68499 +/- ##
================================================
- Coverage 77.2806% 74.8614% -2.4193%
================================================
Files 2010 2030 +20
Lines 555351 573653 +18302
================================================
+ Hits 429179 429445 +266
- Misses 125252 143570 +18318
+ Partials 920 638 -282
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
42d9035 to
b0259e3
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/statistics/row_sampler.go`:
- Around line 407-415: The function rescaleSampledValue currently asserts
sampleCount>0 but doesn't clamp negative or zero inputs as the comment promises;
update rescaleSampledValue to return 0 early when sampled <= 0 or totalRowCount
<= 0 (keeping the existing intest.Assert(sampleCount > 0)), and otherwise
perform the same round-half-up scaling using (sampled*totalRowCount +
sampleCount/2) / sampleCount so negative sampled values no longer produce
negative estimates.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e3d22ff4-8bc8-4630-a8f1-6fbe7a01f647
📒 Files selected for processing (5)
pkg/statistics/BUILD.bazelpkg/statistics/constants.gopkg/statistics/row_sampler.gopkg/statistics/sample_test.gopkg/store/mockstore/unistore/cophandler/analyze.go
✅ Files skipped from review due to trivial changes (2)
- pkg/statistics/constants.go
- pkg/store/mockstore/unistore/cophandler/analyze.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/statistics/BUILD.bazel
4944e09 to
b46e664
Compare
0xPoe
left a comment
There was a problem hiding this comment.
🔢 Self-check (PR reviewed by myself and ready for feedback)
-
Code compiles successfully
-
Unit tests added
-
No AI-generated elegant nonsense in PR.
-
Comments added where necessary
-
PR title and description updated
-
Documentation PR created (or confirmed not needed)
-
PR size is reasonable
b46e664 to
1625571
Compare
|
/retest |
6e4f581 to
9e72f06
Compare
[LGTM Timeline notifier]Timeline:
|
Signed-off-by: 0xPoe <techregister@pm.me>
9e72f06 to
7eebdb8
Compare
| ColGroups [][]int64 | ||
| MaxSampleSize int | ||
| SampleRate float64 | ||
| NDVSampleRate float64 |
There was a problem hiding this comment.
I think current comments on NDVSampleSkipRate are not clear, and we should clearly explain the relationship and the difference between the two sample rates.
- Explain that these are two independent sampling mechanisms.
- Use unified terminology. Currently, we are using
NDVSampleRatebutSketchSampleCount. - Explain that this doesn't only affect the NDV sampling but also affects the collection of
NULLcount and row size, and it's also related to the collection of the new singleton sketch.
There was a problem hiding this comment.
I have addressed 1 and 3 in 49f2459 (this PR).
For 2, I think it is hard to unify them at that level. I called it SketchSampleCount because it is used to collect the FMSketches and SingletonSketches, so SketchSampleCount seems natural. This is mainly meant for implementation. But for NDV rate, users do not need to know all these sketches, so NDV rate would be easier to understand. I think changing it to NDVSampleCount is also OK, but SketchSampleCount seems more tightly tied to these two sketches we use to build NDV.
There was a problem hiding this comment.
Added more comments for it in ef312c7 (this PR)
| // NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row. | ||
| NDVSampleSkipRate = 1 |
There was a problem hiding this comment.
Why don't we just call it DefaultNDVSampleRate?
There was a problem hiding this comment.
| // NDVSampleSkipRate is the sample rate at which sampling is skipped, so NDV sketches are built from every row. | |
| NDVSampleSkipRate = 1 | |
| // NDVSampleSkipRate is the sample rate at which sampling is skipped. The default value is 1, so NDV sketches are built from every row. | |
| NDVSampleSkipRate = 1 |
There was a problem hiding this comment.
DefaultNDVSampleRate would be used when we integrate it with the priority queue. Then its default value would be 0.05, I guess. This is the value we use to skip sampling by default.
There was a problem hiding this comment.
Or otherwise, we would end up with two defaults:
DefaultNDVSampleRate for normal tables as 1.
DefaultNDVSampleRateForBigTable for big tables as 0.05??
| for i, singletonSketch := range singletonSketches { | ||
| s.SingletonSketches[i].MergeFMSketch(singletonSketch) | ||
| } |
There was a problem hiding this comment.
Is this the correct behavior? I think we need to keep the sketch from every region.
There was a problem hiding this comment.
You are right. We should not do the worker-level merge first.
There was a problem hiding this comment.
When I did my tests, I think I did it with this worker-level merge. It seems that doing it per region would consume much more memory and burn more CPU.
| func ShouldBuildSingletonSketches(rate float64) bool { | ||
| return rate < NDVSampleSkipRate | ||
| } |
There was a problem hiding this comment.
I think we don't need an extra function since it's used only once.
There was a problem hiding this comment.
The reason I moved it out is that it would be easier to add some comments for it. I am OK to inline it and add the comment for the variable.
There was a problem hiding this comment.
And it would be used twice in the following PR. See https://patch-diff.githubusercontent.com/raw/pingcap/tidb/pull/68157.diff and search for it.
Signed-off-by: 0xPoe <techregister@pm.me>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: winoros The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
38c13a3 to
49f2459
Compare
Signed-off-by: 0xPoe <techregister@pm.me>
|
/test pull-integration-realcluster-test-next-gen |
MergeCollector unioned the per-region singleton FM sketches. Union conflates them: a value that is a singleton in two regions occurs twice overall, yet stays in the union and was miscounted as a global singleton, inflating the NDV estimate. Keep each region's NDV and singleton sketches separate in a new RegionSketchSummary so EstimateGlobalSingletonBySketches can discount such values across regions. The merge collects one summary per region instead of unioning, preserving per-region granularity through every merge level. ref pingcap#67449 Signed-off-by: 0xPoe <techregister@pm.me>
|
@wuhuizuo: The specified target(s) for Use DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@0xPoe: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
I need to rework this solution, so I’m closing this one. I will resend it when it is ready. |
What problem does this PR solve?
Issue Number: ref #67449
Problem Summary:
Distributed row-sampling ANALYZE has no way to derive a population NDV when sketches are gathered from a sub-sample of rows. The existing FM sketch counts every distinct value, which biases the estimate downward.
What changed and how does it work?
Split out of #68157 — the
pkg/statisticshalf only. Executor wiring stays in #68157.singletonSketchBuilderthat tracks values seen exactly once per node; only singletons feed a separate FM sketch, which is the basis for population NDV estimation under sub-sampling.NDVSampleSkipRateonRowSampleBuilder(default1= sketches collected from every row; no behavior change in this PR).NullCount/TotalSizesto per-population estimates when a sub-sample is used.Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Tests
Chores