[SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree#55984
[SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree#55984mihailoale-db wants to merge 1 commit into
Conversation
|
@cloud-fan PTAL when you find time. Thanks! |
cloud-fan
left a comment
There was a problem hiding this comment.
Thanks Mihailo, the fix is reasonable and follows the pattern from SPARK-54871.
One question about completeness, and one about test coverage — see inline. Otherwise the change is straightforward.
| } | ||
| if (someOrder.size == 1 && someOrder.head.child.semanticEquals(child)) { | ||
| if (someOrder.size == 1 && | ||
| trimAliases(someOrder.head.child).semanticEquals(trimAliases(child))) { |
There was a problem hiding this comment.
The PR description says "trim aliases from ListAgg expression subtree", but checkOrderValueDeterminism (line 684) still does orderExpressions.head.child.semanticEquals(castChild) without trimming. With spark.sql.listagg.allowDistinctCastWithOrder.enabled=true (default), a query like
SELECT listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a) FROM ...reaches that path, and the two analyzers still produce different errors:
- single-pass (aliases still present): the
Alias-vs-AliassemanticEqualsis false because the ExprIds differ, so we returnNonDeterministicMismatchand throwfunctionAndOrderExpressionMismatchError. - fixed-point (post-
CleanupAliases):semanticEqualsis true, thenisCastEqualityPreserving(VariantType)is false, so we throwfunctionAndOrderExpressionUnsafeCastError.
Should this comparison also wrap both sides in trimAliases? Otherwise the single-pass/fixed-point divergence is only partly closed.
|
|
||
| -- LISTAGG with semi-structured extract (parser wraps v:a in Alias with fresh ExprId) | ||
| -- Tests that isOrderCompatible strips Alias wrappers before comparing via semanticEquals | ||
| SELECT listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a::string) FROM (SELECT parse_json('{"a": "x"}') v UNION ALL SELECT parse_json('{"a": "y"}') UNION ALL SELECT parse_json('{"a": "x"}')); |
There was a problem hiding this comment.
SQLQueryTestSuite only runs the legacy analyzer here, where CleanupAliases has already stripped the parser-introduced Alias before CheckAnalysis calls isOrderCompatible. So these golden files pass on master without the fix too — they don't actually exercise the single-pass path the PR is fixing.
Consider adding a targeted single-pass unit test (e.g. in AggregateExpressionResolverSuite or alongside ResolverRunner) that constructs LISTAGG(DISTINCT v:a::string) WITHIN GROUP (ORDER BY v:a::string) and goes through Resolver directly, so the regression is pinned at the layer where it actually manifested.
What changes were proposed in this pull request?
In this PR I propose to trim aliases from
ListAggexpression subtree in order to fix a discrepancy between single-pass and fixed-point analyzers.It is a safe change since it would otherwise be removed in
CleanupAliases.Why are the changes needed?
To fix a dual-run issue.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added + existing tests.
Was this patch authored or co-authored using generative AI tooling?
Yes.