Skip to content

[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1#56031

Open
cloud-fan wants to merge 1 commit into
apache:masterfrom
cloud-fan:SPARK-55978-followup
Open

[SPARK-55978][SQL][FOLLOWUP] Don't block V2 join pushdown when pushed Sample has fraction=1#56031
cloud-fan wants to merge 1 commit into
apache:masterfrom
cloud-fan:SPARK-55978-followup

Conversation

@cloud-fan
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Followup to #54972.

Narrow the V2ScanRelationPushDown join-pushdown guard so it only blocks pushdown when at least one side has a pushed Sample with fraction < 1. At fraction = 1 the sample is a no-op on the result set, so dropping it inside the merged scan builder is safe.

Why are the changes needed?

The guard added in SPARK-55978 exists because the merged scan builder for SupportsPushDownJoin cannot carry a pushed sample and would silently discard it. The hazard is silent result change. At fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at TABLESAMPLE SYSTEM (100 PERCENT) (parameterized queries, query generators, environment-tuned fractions).

Does this PR introduce any user-facing change?

No behavior change for queries with fraction < 1. For queries where the pushed sample has fraction = 1, join pushdown now proceeds — same result set, faster plan.

How was this patch tested?

  • Existing "join pushdown is skipped when a side has a pushed sample" test moved from 100 PERCENT to 50 PERCENT so it keeps exercising the (still-active) fraction < 1 branch of the guard.
  • New "100% SYSTEM sample does not block join pushdown" test asserts the new fraction = 1 short-circuit, locking in the contract.

Was this patch authored or co-authored using generative AI tooling?

No.

… Sample has fraction=1

### What changes were proposed in this pull request?

Followup to apache#54972.

Narrow the `V2ScanRelationPushDown` join-pushdown guard so it only blocks pushdown when at least one side has a pushed `Sample` with fraction < 1. At fraction = 1 the sample is a no-op on the result set, so dropping it inside the merged scan builder is safe.

### Why are the changes needed?

The guard added in SPARK-55978 exists because the merged scan builder for `SupportsPushDownJoin` cannot carry a pushed sample and would silently discard it. The hazard is *silent result change*. At fraction = 1, no rows are excluded, so dropping the sample changes nothing observable. The current guard is therefore stricter than its rationale requires, and unnecessarily skips join pushdown for queries that land at `TABLESAMPLE SYSTEM (100 PERCENT)` (parameterized queries, query generators, environment-tuned fractions).

### Does this PR introduce _any_ user-facing change?

No behavior change for queries with fraction < 1. For queries where the pushed sample has fraction = 1, join pushdown now proceeds — same result set, faster plan.

### How was this patch tested?

- Existing `"join pushdown is skipped when a side has a pushed sample"` test moved from `100 PERCENT` to `50 PERCENT` so it keeps exercising the (still-active) fraction < 1 branch of the guard.
- New `"100% SYSTEM sample does not block join pushdown"` test asserts the new fraction = 1 short-circuit, locking in the contract.

### Was this patch authored or co-authored using generative AI tooling?

No.
@cloud-fan
Copy link
Copy Markdown
Contributor Author

@stanyao @szehon-ho

// fraction < 1, because the merged scan builder would silently
// discard it and change the result. At fraction = 1 the sample is
// a no-op on the result set, so dropping it is safe.
// TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is @stanyao 's todo JIRA, can we make this pr target that and remove the TODO?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see, this is separate ?

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up — narrowing the guard to fraction < 1 matches the original SPARK-55978 rationale well (avoid silently discarding a sample that actually filters rows).

Assumption (full table scan): This PR relies on the assumption that a pushed sample with upperBound - lowerBound >= 1.0 is a no-op on the result row set — i.e. equivalent to reading the full table without sampling — so it is safe when join pushdown drops/discards the sample (merged scan builder cannot carry it; JDBC join subqueries omit TABLESAMPLE, etc.). That is reasonable for SQL TABLESAMPLE at 100% (withReplacement = false, lowerBound = 0) and for the motivating SYSTEM (100 PERCENT) case, but it is still an assumption: execution becomes "scan without sample," not "scan with 100% sample," and correctness depends on connectors/dialects treating 100% as including all rows/blocks (especially SYSTEM, which is implementation-defined).

Tests: Moving the blocking case to 50 PERCENT and adding the 100 PERCENT join-pushed test look good. Optional follow-ups: TABLESAMPLE (100 PERCENT) (Bernoulli), and asserting row results match the no-sample join.

Left an inline note on withReplacement.

// TODO(SPARK-56504): Extend SupportsPushDownJoin to accept pushed
// samples so sources supporting both can handle the composition.
leftHolder.pushedSample.isEmpty && rightHolder.pushedSample.isEmpty &&
leftHolder.pushedSample.forall(s => s.upperBound - s.lowerBound >= 1.0) &&
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upperBound - lowerBound >= 1.0 does not check withReplacement. With withReplacement = true, SampleExec uses PoissonSampler: even at rate 1.0, each input row can be emitted 0, 1, 2, … times, so discarding the pushed sample during join pushdown is not the same as a full scan. SQL TABLESAMPLE always sets withReplacement = false, but DataFrame.sample(withReplacement = true, fraction = 1.0) can be pushed to DSv2.

Suggest tightening the guard, e.g.:

!s.withReplacement && s.upperBound - s.lowerBound >= 1.0

(optionally also s.lowerBound <= RandomSampler.roundingEpsilon for clarity).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants