[SPARK-55897][SQL][4.0] Handle UserDefinedType in ColumnarRow, ColumnarBatchRow, and ColumnarArray get() #55990
Open
james-willis wants to merge 1 commit into
…chRow, and ColumnarArray get()

### What changes were proposed in this pull request?

`ColumnarRow.get()`, `ColumnarBatchRow.get()`, and `ColumnarArray.get()` throw `SparkUnsupportedOperationException` when called with a `UserDefinedType` because they have no branch to handle UDTs.

This PR adds UDT handling to all three methods:

- **ColumnarRow** and **ColumnarBatchRow**: Add an `instanceof UserDefinedType` branch that recurses with `udt.sqlType()`, matching the pattern already used in `SpecializedGettersReader.read()`.
- **ColumnarArray**: Change the `handleUserDefinedType` flag from `false` to `true` in the existing call to `SpecializedGettersReader.read()`.

### Why are the changes needed?

The codegen path (`CodeGenerator.getValue()`) unwraps `udt.sqlType()` before generating accessor calls, so UDT columns work when whole-stage codegen is active. However, on the interpreted eval path (when codegen is disabled, falls back, or the number of fields exceeds `spark.sql.codegen.maxFields`), `GetStructField.nullSafeEval` calls `ColumnarRow.get(ordinal, udtType)` directly, which hits the unhandled branch and throws.

### Does this PR introduce _any_ user-facing change?

Yes. UDT columns in columnar data sources (e.g., Parquet) now work correctly on the interpreted evaluation path. Previously they would throw `SparkUnsupportedOperationException`.

### How was this patch tested?

Added 6 new tests in `ColumnarBatchSuite` covering all 3 methods × 2 UDT backing types (primitive `IntegerType` and complex `StructType`). Each test creates columnar vectors with UDT data and verifies that `get()` returns the correct value. Two helper UDT classes (`TestIntUDT`, `TestStructWrapperUDT`) are defined for the tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Opus 4.6

Closes apache#54701 from james-willis/columnar-row-udt-test.

Authored-by: jameswillis <james@wherobots.com>
Signed-off-by: Huaxin Gao <huaxin.gao11@gmail.com>
(cherry picked from commit 472735c)
Contributor
Author
@huaxingao here is the 4.0 port.
Contributor
@james-willis Could you check why the CI failed?
Contributor
Author
@huaxingao It was that flaky Protobuf breaking change action. A retry fixed it.
Backport of #54701 to branch-4.0.
### What changes were proposed in this pull request?

`ColumnarRow.get()`, `ColumnarBatchRow.get()`, and `ColumnarArray.get()` throw `SparkUnsupportedOperationException` when called with a `UserDefinedType` because they have no branch to handle UDTs.

This PR adds UDT handling to all three methods:

- **ColumnarRow** and **ColumnarBatchRow**: Add an `instanceof UserDefinedType` branch that recurses with `udt.sqlType()`, matching the pattern already used in `SpecializedGettersReader.read()`.
- **ColumnarArray**: Change the `handleUserDefinedType` flag from `false` to `true` in the existing call to `SpecializedGettersReader.read()`.

### Why are the changes needed?

The codegen path (`CodeGenerator.getValue()`) unwraps `udt.sqlType()` before generating accessor calls, so UDT columns work when whole-stage codegen is active. However, on the interpreted eval path (when codegen is disabled, falls back, or the number of fields exceeds `spark.sql.codegen.maxFields`), `GetStructField.nullSafeEval` calls `ColumnarRow.get(ordinal, udtType)` directly, which hits the unhandled branch and throws.

### Does this PR introduce _any_ user-facing change?

Yes. UDT columns in columnar data sources (e.g., Parquet) now work correctly on the interpreted evaluation path. Previously they would throw `SparkUnsupportedOperationException`.

### How was this patch tested?

Added 6 new tests in `ColumnarBatchSuite` covering all 3 methods x 2 UDT backing types (primitive `IntegerType` and complex `StructType`). Each test creates columnar vectors with UDT data and verifies that `get()` returns the correct value. Two helper UDT classes (`TestIntUDT`, `TestStructWrapperUDT`) are defined for the tests.

Cherry-picked from 472735c on master. The cherry-pick had a trivial conflict in `ColumnarBatchSuite.scala`: the neighboring `[SPARK-55552] Variant` test exists on branch-4.1+ but not on branch-4.0, so its insertion point was contested. Resolved by keeping only the SPARK-55897 tests (the Variant test is unrelated).

### Was this patch authored or co-authored using generative AI tooling?

Yes. Opus 4.6
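
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the "unwrap the UDT and recurse with `sqlType()`" branch described in this PR. It is not the Spark code: `ToyRow`, `IntType`, `StringType`, `UnknownType`, and `CelsiusUDT` are hypothetical stand-ins invented for illustration, and the real `ColumnarRow.get()` dispatches over many more types and reads from column vectors rather than a list.

```java
import java.util.List;

public class UdtGetSketch {
    // Hypothetical stand-ins for a data type hierarchy (NOT Spark's real classes).
    interface DataType {}
    static final class IntType implements DataType {}
    static final class StringType implements DataType {}
    static final class UnknownType implements DataType {}

    // A user-defined type is stored using an underlying type, exposed via sqlType().
    static abstract class UserDefinedType implements DataType {
        abstract DataType sqlType();
    }

    // Hypothetical UDT backed by an integer.
    static final class CelsiusUDT extends UserDefinedType {
        @Override
        DataType sqlType() {
            return new IntType();
        }
    }

    // Toy row: values are plain objects indexed by ordinal.
    static final class ToyRow {
        private final List<Object> values;

        ToyRow(List<Object> values) {
            this.values = values;
        }

        Object get(int ordinal, DataType dataType) {
            if (dataType instanceof IntType) {
                return (Integer) values.get(ordinal);
            } else if (dataType instanceof StringType) {
                return (String) values.get(ordinal);
            } else if (dataType instanceof UserDefinedType) {
                // The kind of branch this PR adds: unwrap the UDT and
                // retry the lookup with its underlying storage type.
                return get(ordinal, ((UserDefinedType) dataType).sqlType());
            } else {
                // Without the UDT branch above, a UDT would fall through to here.
                throw new UnsupportedOperationException(
                    "Datatype not supported: " + dataType.getClass().getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        ToyRow row = new ToyRow(List.of(21, "label"));
        // Reads ordinal 0 through the UDT, which resolves to the IntType branch.
        System.out.println(row.get(0, new CelsiusUDT()));
    }
}
```

In this toy model, removing the `UserDefinedType` branch makes `row.get(0, new CelsiusUDT())` fall through to the final `else` and throw, which mirrors the failure mode the description attributes to the interpreted path.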