Skip to content

[SPARK-56718][PYTHON] Refactor SQL_COGROUPED_MAP_PANDAS_UDF#55674

Open
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:refactor/cogrouped-map-pandas-udf
Open

[SPARK-56718][PYTHON] Refactor SQL_COGROUPED_MAP_PANDAS_UDF#55674
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:refactor/cogrouped-map-pandas-udf

Conversation

@Yicong-Huang
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Refactor SQL_COGROUPED_MAP_PANDAS_UDF to be self-contained in read_udfs(), moving logic from CogroupPandasUDFSerializer into worker.py, using ArrowStreamCoGroupSerializer as pure I/O.

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Making each eval type self-contained in read_udfs() improves readability and makes it easier to reason about the data flow for each eval type independently.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. No behavior change.

ASV benchmark (CogroupedMapPandasUDFTimeBench), 3 runs each, averaged:

master: ebfdf8b6b4  vs  PR: 87d96ebe40

Time (ms, lower = better)
scenario          udf                  master      PR        diff
---------------------------------------------------------------------
few_groups_sm     identity_udf          160.3     159.0   -0.81%
few_groups_sm     concat_udf            181.0     181.0    0.00%
few_groups_sm     left_semi_udf         182.7     181.7   -0.55%
few_groups_sm     key_identity_udf      160.3     162.0   +1.04%
few_groups_lg     identity_udf          413.3     411.3   -0.48%
few_groups_lg     concat_udf            545.7     537.3   -1.54%
few_groups_lg     left_semi_udf         508.3     500.7   -1.51%
few_groups_lg     key_identity_udf      412.3     412.3    0.00%
many_groups_sm    identity_udf         1356.7    1356.7    0.00%
many_groups_sm    concat_udf           1480.0    1473.3   -0.45%
many_groups_sm    left_semi_udf        1496.7    1503.3   +0.44%
many_groups_sm    key_identity_udf     1370.0    1363.3   -0.49%
many_groups_lg    identity_udf          765.3     760.7   -0.61%
many_groups_lg    concat_udf            922.7     908.7   -1.52%
many_groups_lg    left_semi_udf         887.0     886.0   -0.11%
many_groups_lg    key_identity_udf      764.7     767.0   +0.30%
wide_values       identity_udf         1073.3    1073.3    0.00%
wide_values       concat_udf           1276.7    1266.7   -0.78%
wide_values       left_semi_udf        1130.0    1130.0    0.00%
wide_values       key_identity_udf     1080.0    1073.3   -0.62%
multi_key         identity_udf          425.3     423.7   -0.39%
multi_key         concat_udf            473.7     473.0   -0.14%
multi_key         left_semi_udf         463.0     451.3   -2.52%
multi_key         key_identity_udf      430.3     428.3   -0.46%
---------------------------------------------------------------------
SUM                                  17995.0   17834.0   -0.90%

Aggregate essentially flat (-0.90%); per-scenario variation within run-to-run noise.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Copy Markdown
Contributor

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Yicong-Huang Yicong-Huang force-pushed the refactor/cogrouped-map-pandas-udf branch from 87d96eb to 9960e00 Compare May 20, 2026 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants