feat: implement array_exists with lambda support via JVM UDF bridge #4223

Draft
andygrove wants to merge 2 commits into apache:main from andygrove:array-exists-lambda

@andygrove
Member

@andygrove andygrove commented May 5, 2026

Which issue does this PR close?

Part of #4193

Rationale for this change

This PR adds a new Comet JVM UDF feature, where Comet can have JVM implementations of expressions that operate on Arrow data.

array_exists is implemented as the first example.

The advantage of this approach is that we can quickly implement these features with 100% Spark compatibility without re-implementing the expressions in native code: we just call existing Java/Spark code, but operate on Arrow data, and avoid the expensive transition of falling back to Spark.

In the benchmark below, Comet (Scan + Exec) runs at 1.8x the speed of Spark.

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 26.2
Apple M3 Max
array_exists - int array (x -> x < 0):    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               584            591           7          7.2         139.3       1.0X
Comet (Scan)                                        588            623          40          7.1         140.1       1.0X
Comet (Scan + Exec)                                 322            329           3         13.0          76.8       1.8X
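
For readers checking the Relative column: the figures above are consistent with dividing the baseline (Spark) best time by each case's best time. The snippet below is only a sketch of that arithmetic under that assumption; the class and method names are hypothetical and are not part of the benchmark harness.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class RelativeSpeedup {
    private RelativeSpeedup() {}

    // Assumes "Relative" = baseline best time / candidate best time.
    static String relative(double baselineBestMs, double candidateBestMs) {
        return String.format(Locale.ROOT, "%.1fX", baselineBestMs / candidateBestMs);
    }

    public static void main(String[] args) {
        // Best Time(ms) values taken from the table above.
        System.out.println(relative(584, 588)); // Comet (Scan)
        System.out.println(relative(584, 322)); // Comet (Scan + Exec)
    }
}
```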

What changes are included in this PR?

Experimental support for Spark's exists(array, x -> predicate(x)) — the first lambda-based expression accelerated by Comet.

  • CometLambdaRegistry: static concurrent map bridging plan-time lambda expressions to execution-time UDF lookup
  • ArrayExistsUDF: iterates ListVector elements, evaluates the lambda predicate via Spark's NamedLambdaVariable, implements three-valued null logic
  • CometArrayExists serde: registers the lambda, emits JvmScalarUdf proto
  • Scope: single-argument lambdas referencing only the array element; primitive + string element types
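
To illustrate the registry idea from the list above, here is a minimal sketch of a static concurrent map that hands out a key at plan time and resolves it at execution time. It is not the PR's code: the class name, the method names, the UUID key scheme, and the use of `IntPredicate` as a stand-in for a Spark lambda expression are all assumptions made for the example.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a plan-time to execution-time lambda registry.
// IntPredicate replaces the real Spark lambda expression for illustration.
final class LambdaRegistrySketch {
    private static final ConcurrentMap<String, IntPredicate> LAMBDAS =
        new ConcurrentHashMap<>();

    private LambdaRegistrySketch() {}

    // Plan time: store the lambda and return a key that can travel with the plan.
    static String register(IntPredicate lambda) {
        String key = UUID.randomUUID().toString();
        LAMBDAS.put(key, lambda);
        return key;
    }

    // Execution time: resolve the lambda from the key carried in the plan.
    static IntPredicate lookup(String key) {
        IntPredicate lambda = LAMBDAS.get(key);
        if (lambda == null) {
            throw new IllegalStateException("No lambda registered for key " + key);
        }
        return lambda;
    }

    // Remove the entry when it is no longer needed so the map does not grow unbounded.
    static void unregister(String key) {
        LAMBDAS.remove(key);
    }
}
```

A static map like this only works when registration and lookup happen in the same JVM, and it needs an explicit cleanup path. Both points are design questions for the real implementation rather than something this sketch answers.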

How are these changes tested?

Five end-to-end tests in CometArrayExpressionSuite cover integer predicates, string predicates, null elements with three-valued logic, the all-match case, and empty arrays. All use checkSparkAnswerAndOperator to verify both correctness and native execution.
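
The three-valued logic exercised by these tests can be sketched on plain Java collections: the result is true if any element satisfies the predicate, otherwise NULL if any predicate result was NULL, otherwise false. This is an illustration of the semantics only, not the ArrayExistsUDF code. The class and method names are hypothetical, and a Java `null` stands in for SQL NULL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of exists() with three-valued logic.
final class ExistsThreeValuedSketch {
    private ExistsThreeValuedSketch() {}

    // Returns Boolean.TRUE, Boolean.FALSE, or null (standing in for SQL NULL).
    static <T> Boolean exists(List<T> array, Function<T, Boolean> predicate) {
        if (array == null) {
            return null; // NULL array yields NULL
        }
        boolean sawNull = false;
        for (T element : array) {
            Boolean result = predicate.apply(element);
            if (result == null) {
                sawNull = true;      // unknown result, keep scanning for a match
            } else if (result) {
                return Boolean.TRUE; // any match decides the result
            }
        }
        return sawNull ? null : Boolean.FALSE;
    }

    // x -> x < 0 with SQL semantics: comparing NULL yields NULL.
    static Boolean isNegative(Integer x) {
        return x == null ? null : x < 0;
    }

    public static void main(String[] args) {
        List<Integer> withNull = Arrays.asList(1, null, 3);
        System.out.println(exists(withNull, ExistsThreeValuedSketch::isNegative));
    }
}
```

With this definition an empty array yields false, and an array such as `[1, null, 3]` yields NULL for `x -> x < 0`, because no element matches but one comparison is unknown.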

@andygrove andygrove marked this pull request as ready for review May 5, 2026 15:06
@andygrove andygrove requested a review from comphead May 5, 2026 15:06
@andygrove
Member Author

@hsiang-c fyi

Adds a new JVM UDF bridge framework that allows Spark expressions to be
evaluated on the JVM side via Arrow C Data Interface, while keeping the
native execution pipeline intact. Includes array_exists as the first
lambda-based expression using this framework.
@comphead
Contributor

comphead commented May 5, 2026

Are we planning to merge it ASAP or wait for DF 54.0?

@comphead
Contributor

comphead commented May 5, 2026

We can try to use apache/datafusion#21903 directly or create a datafusion-spark counterpart.

@andygrove
Member Author

> Are we planning to merge it ASAP or wait for DF 54.0?

I would love to get the JVM UDF framework in (once reviewed).

There are many applications where it can help us get acceleration by default rather than opt-in

  • lambdas
  • JSON
  • regex

What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas?

I could split the JVM UDF work out into a separate PR but there would be no tests if we don't have an example of an expression using it

@comphead
Contributor

comphead commented May 5, 2026

> Are we planning to merge it ASAP or wait for DF 54.0?
>
> I would love to get the JVM UDF framework in (once reviewed).
> There are many applications where it can help us get acceleration by default rather than opt-in
>
>   • lambdas
>   • JSON
>   • regex
>
> What would be the advantage of waiting for DF 54? Does that give us 100% compatibility for array_exists with lambdas?
>
> I could split the JVM UDF work out into a separate PR but there would be no tests if we don't have an example of an expression using it

Having no tests for lambda is fine IMO as we do not expose the feature to users right away.

> Does that give us 100% compatibility for array_exists with lambdas?

That's the entire intention of the datafusion-spark module, but unfortunately full compatibility is not always achieved.

My main concern is that we could end up with multiple lambda implementations, in DF and in Comet, which might cause confusion and conflicts. The small PoC PR showed that array_exists works with basic lambdas (no column capture, no nested lambdas): https://github.com/apache/datafusion-comet/pull/4127/changes#diff-7411f5845a2488bb1509b95d8ad1e014422e21d70e0b802bd7624eabc4621c66

For customers, we can build another branch on top of the DF 54 migration branch and include lambda functions there so they can test it. WDYT?

@andygrove andygrove marked this pull request as draft May 5, 2026 21:28
@andygrove
Member Author

> Having no tests for lambda is fine IMO as we do not expose the feature to users right away.

OK, here is a new PR with just the framework: #4232

Moving this PR to draft
