[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time by zhengruifeng · Pull Request #56026 · apache/spark

zhengruifeng · 2026-05-21T00:44:48Z

What changes were proposed in this pull request?

This PR rewrites the PR title processing in dev/merge_spark_pr.py:

New Component registry — a typed Component class and a COMPONENTS tuple listing all recognized Spark JIRA components with their canonical tags, aliases, and a primary flag. primary=True means the tag alone satisfies the merge-time requirement; non-primary tags (e.g. [TEST], [SHUFFLE]) must be paired with a primary one. The primary set covers the main subsystems: BUILD, CONNECT, CORE, DOC, DOCKER, GRAPHX, INFRA, K8S, ML, MLLIB, PS, PYTHON, R, SDP, SECURITY, SQL, SS, STREAMING, UI, WINDOWS, YARN.

New Title parser — a Title class with leading (SPARK-NNNNN / MINOR / TRIVIAL), components, and text fields. Title.parse() is strict:

the title must open with a leading tag;
tags may appear in any order and with arbitrary surrounding whitespace;
bracket-tag characters include letters, digits, _, -, and . (so version tags like [4.X] and [3.5] are recognized);
SPARK-NNNNN IDs must all precede any component tags;
[MINOR] and [TRIVIAL] cannot coexist with each other or with a SPARK-ID.

Malformed titles raise ValueError.

Replaced standardize_jira_ref — the old lenient regex rewriter that tolerated bare SPARK 1234 refs, [Project Infra] multi-word tags, etc. is removed. Titles must now be well-formed before reaching the merge step.

New title pipeline in main():

Hard-fail on [WIP] or [DO-NOT-MERGE] (previously just a soft prompt).
Revert "..." and Reapply "..." titles are kept verbatim.
Title.parse() — fail with a clear message if malformed.
Normalize each component tag via the registry (e.g. PYSPARK→PYTHON, DOCS→DOC, FOLLOW-UP→FOLLOWUP, TESTS→TEST) and track whether any primary tag is present.
If no primary component is present, prompt the committer to enter at least one; insert the entered tag(s) right after the leading refs.
Deduplicate component tags in insertion order.
Move backport version tags ([4.X], [4.2], [3.5], ...) to the head of the component list.
Move [FOLLOWUP] to the last position.
Print a warning for any tag that is neither a known component nor a version tag.

Why are the changes needed?

Some PRs are merged without any [COMPONENT] tag (e.g. #55866 merged as [SPARK-56853] Improve PATH Tests), losing module attribution in the changelog. Others carry non-canonical tags or version tags in inconsistent positions, which makes changelog tooling unreliable.

The old standardize_jira_ref tolerated very loose title formats but did not enforce component presence, leaving the gap. This PR closes it with a strict parser, registry-based normalization, and a prompt-based fallback when a primary component is missing.

Does this PR introduce any user-facing change?

No. The change only affects the committer-facing interactive merge tool (dev/merge_spark_pr.py).

How was this patch tested?

Doctests on Title, Title.parse, and the existing version-resolution helpers. All pass via:

python3 -m doctest dev/merge_spark_pr.py

The pipeline was also dry-run against the latest 1000 commits on master and 200 open PRs — see the comments on this PR for the full report.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

cloud-fan

Nice clean addition that closes a real changelog-attribution gap. Code follows the existing standardize_jira_ref pattern (pure helpers + doctests + call site in main()), all 55 doctests pass, and the comments are accurate.

A few questions and minor suggestions inline:

One design question about the selection rule for COMPONENT_MARKERS — several active JIRA components have no allowlist entry (Declarative Pipelines, Deploy, DStreams, Examples, Optimizer, Scheduler, Shuffle, Spark Docker, Spark Shell, Spark Submit, Windows, Block Manager). Worth documenting the criterion, since the SDP case (77 recent uses of [SDP]) suggests committers are already using subsystem tags the script will now reject.
The validator accepts unknown entries when at least one canonical is provided, which lets typos slip through silently. One-line warning would close the gap.
One small coverage gap: no doctest for the [SPARK-1234]-only (empty body) case.
Minor styling: the recoverable prompt uses print_error (red), which is usually for fatal errors.

LuciferYang · 2026-05-21T15:58:13Z

It seems tags like [PROJECT INFRA] were originally supported. Will any issues arise after this fix?

the script will automatically map it to [INFRA]

Generated-by: Claude Code

zhengruifeng · 2026-05-22T12:03:35Z

`Title.parse` coverage on real-world titles

Tested against the latest 1000 commits on master and 200 open PRs:

Source	Total	Bypassed*	Parse OK	Parse Fail
Commits	1000	15	983	2
Open PRs	200	11	172	17

* Bypassed = Revert "..." / Reapply "..." / [WIP] / [DO-NOT-MERGE] titles short-circuited in main() before Title.parse is called.

Failing commits (already on `master`, slipped past the old lenient script)

revert [SPARK-43752] and add test
[DOCS] Clarify DataFrames in quickstart

Failing open PRs (all malformed — exactly what this PR intends to block)

Add recursive read for partition
fix issue MapType as a type hint for schema 55900
WIP Spark 56763 sarutak 3.5 restore additional functionality r2
[SPARK-XXXXX][CORE] Add IO_URING transport mode
Fix UNBOUND_SQL_PARAMETER regression in PySpark 4.1.1
fix: the hive thrift server's ldap authentication pr... in...
Predicate Pushdown in Spark Structured Streaming (DataSource V2).
[SPARK-XXXXX][SQL] Public Column.toJson / Column.fromJson on the V2 catalog interface
fix: remove bare excepts and clean up lint issues (quantum-local-fixer)
[SQL][DML] Resolve Path Based Tables Read in Transactional Writes
Project DSv2 leaf node to Scan's schema
[SPARK-XXXXX][SQL][TESTS] Add DSv2 table refresh infrastructure and t…
[SPARK-XXXXX][CONNECT][TESTS] Add Connect DataFrame pinning tests
[SPARK-XXXXX][SQL][TESTS] Add DSv2 CACHE TABLE and write validation t…
[SQL] Update grammar/TableChange to allow CLUSTER BY expressions
[SPARK 55879] [Web UI] copy sql to clipboard
Implement `SET` command in single-pass analyzer

Breakdown of the 17 PR failures: 5 [SPARK-XXXXX] placeholders, 2 component-first (no SPARK ref), 1 [SPARK 55879] (space instead of hyphen), 9 with no leading bracket at all.

zhengruifeng · 2026-05-22T12:06:46Z

`prompt_for_components` coverage on real-world titles

Among the parseable titles from the previous sample (983 commits, 172 open PRs), how many would lack a primary tag and trigger the prompt at merge time:

Source	Parseable	Need prompt	Rate
Commits	983	40	~4%
Open PRs	172	11	~6%

Breakdown of why the prompt would fire:

Category	Commits	PRs
No tags at all (just `[SPARK-NNNNN] ...`)	26	8
Only `[TEST]` (alone or with `[FOLLOWUP]`)	9	0
Only `[FOLLOWUP]`	2	0
Only `[UDF]`	0	2
Other (`[WEBIUI]`, `[SHS]`, `[EXAMPLE]`, `[REPL]`)	3	1
Total	40	11

Notes:

[WEBIUI] looks like a typo of [WEBUI], which is already an alias of the primary UI tag.
The "no tags at all" group (26 commits, 8 PRs) is the prompt working as intended — titles like [SPARK-NNNNN] ... with no component tag get caught.

Generated-by: Claude Code

zhengruifeng · 2026-05-22T12:34:31Z

End-to-end pipeline impact on real-world titles

Ran the full new title pipeline (parse → normalize via registry → prompt-if-no-primary → dedup → move version tags to head → move FOLLOWUP last → warn on unknown tags) against the same sample.

	Commits (983)	PRs (172)
Need prompt for primary	40	11
Title would be modified	190	37
Has unknown tag(s) (warning only)	9	7

Breakdown of "modified" titles

Kind	Commits	PRs
Tag rename (`TESTS→TEST`, `FOLLOW-UP→FOLLOWUP`, `DOCS→DOC`, `WEBUI→UI`, `EXAMPLES→EXAMPLE`, `PYSPARK→PYTHON`, `SHELL→REPL`)	187	25
Reorder only (`FOLLOWUP` → last, version tag → head)	3	4
Whitespace / case fixes (e.g. missing space, lowercase inner tags)	0	8

Examples:

Reorder: [SPARK-55897][SQL][4.0] ... → [SPARK-55897][4.0][SQL] ...
Whitespace: [SPARK-56998]Add SECURITY.md ... → [SPARK-56998] Add SECURITY.md ...
Case: [SPARK-56962][SS][RTM][StreamingShuffle][Part2] ... → [SPARK-56962][SS][RTM][STREAMINGSHUFFLE][PART2] ...

Unknown tags

Recurring (worth considering for the registry):

Tag	Commits	PRs
`DML`	3	2
`GEO`	3	0
`UDF`	0	2
`RTM`	0	2
`SHS`	1	0

One-offs / typos:

WEBIUI — typo of WEBUI (already aliased to primary UI)
REVERT, SPARK, STREAMINGSHUFFLE, PART1, PART2 — non-standard or one-off

For unknown tags the script just prints a warning and continues — the committer can decide whether to fix or proceed.

cloud-fan

LGTM. Re-review: 3 addressed, 1 remaining, 4 new — history was rewritten since the prior round, so no per-finding sub-classification. All findings below are minor (doc accuracy, dead code, a coverage gap, and one theoretical-only behavior gap that the author's dry-run shows is unhit in practice). Approving; feel free to fold these in or merge as-is.

cloud-fan · 2026-05-22T14:33:17Z

+    the same component (e.g. "DOCS", "DOCUMENTATION" -> "DOC").
+
+    ``primary`` marks components whose presence alone satisfies the merge-time
+    requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE])


Stale example: [PS] was promoted to primary=True in the latest commit (Make PS primary; ...). Replace with a still-non-primary tag so the docstring stays accurate.

Suggested change

requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE])

requirement. Non-primary JIRA components (e.g. [TEST], [SHUFFLE], [DEPLOY])

cloud-fan · 2026-05-22T14:33:18Z

+    Component("SDP", ("PIPELINES",), primary=True, jira_name="Declarative Pipelines"),
+    Component("SECURITY", primary=True, jira_name="Security"),
+    Component("SHUFFLE", jira_name="Shuffle"),
+    # Component("SPARKR", jira_name="SparkR"),  # SPARKR is treated as an alias of R above


Dead commented-out registry entry — the alias is now declared on the R row above (Component("R", ("SPARKR",), ...)), so this comment line can just be deleted.

Suggested change

# Component("SPARKR", jira_name="SparkR"), # SPARKR is treated as an alias of R above

cloud-fan · 2026-05-22T14:33:18Z

+)
+
+
+_BRACKET_TAG_RE = re.compile(r"\[\s*([A-Za-z0-9._-]+)\s*\]")


_BRACKET_TAG_RE excludes internal spaces ([A-Za-z0-9._-]+), so every alias/canonical with an embedded space is unreachable from Title.parse — only resolvable through the prompt_for_components interactive path. That affects:

canonical: "BLOCK MANAGER" (only the underscore alias is parser-reachable)

aliases: "SPARK CORE", "PROJECT INFRA", "WEB UI", "SPARK SHELL", "JAVA API", "FOLLOW UP"

Practical impact: in your reply to @LuciferYang you said [PROJECT INFRA] in a title would automatically map to [INFRA], but the parser actually leaves it in the body text. I verified by importing the module:

>>> Title.parse("[SPARK-1234][PROJECT INFRA] Fix") leading=['SPARK-1234'], components=[], text='[PROJECT INFRA] Fix' >>> str(_) '[SPARK-1234] [PROJECT INFRA] Fix'

The prompt fires (no primary), and the final title becomes [SPARK-1234][INFRA] [PROJECT INFRA] Fix.

Your dry-run on 1000 commits + 200 PRs shows this form doesn't occur in practice, so it's a theoretical regression only. Two options if you want to close it cleanly: drop the with-space variants from the registry (since they're parser-unreachable they only serve the interactive path), or extend the regex to accept internal whitespace so the canonical form is consistent everywhere. Up to you — happy to leave as-is if you prefer.

cloud-fan · 2026-05-22T14:33:18Z

+_LEADING_TAGS = frozenset({"MINOR", "TRIVIAL"})
+
+
+class Title:


Same coverage gap I flagged last round: there is no doctest for the empty-body case (Title.parse("[SPARK-1234]")). It parses fine (leading=['SPARK-1234'], components=[], text='') and round-trips through __str__ correctly, but a doctest would pin that behavior — and exercise the if not self.text: return parts branch in __str__, which is otherwise uncovered.

cloud-fan · 2026-05-22T14:33:18Z

+    component. Each entered token is normalized via Component.find
+    (e.g. "DOCS" -> "DOC", "PYSPARK" -> "PYTHON"). Unrecognized tokens are
+    passed through as-is. Re-prompts until at least one entered token resolves
+    to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in


Stale name — there is no PRIMARY_COMPONENTS constant in the module; primariness is encoded on each Component via the primary attribute. Suggested rewording:

Suggested change

to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in

to a primary Component (one with primary=True). Returns an uppercase list of tags in

zhengruifeng changed the title ~~[INFRA] Require [COMPONENT] tag in PR title at merge time~~ [INFRA] Require COMPONENT tag in PR title at merge time May 21, 2026

zhengruifeng changed the title ~~[INFRA] Require COMPONENT tag in PR title at merge time~~ [SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time May 21, 2026

zhengruifeng marked this pull request as ready for review May 21, 2026 04:32

zhengruifeng requested review from HyukjinKwon, LuciferYang, cloud-fan and dongjoon-hyun May 21, 2026 04:32

cloud-fan reviewed May 21, 2026

View reviewed changes

Comment thread dev/merge_spark_pr.py Outdated

Comment thread dev/merge_spark_pr.py Outdated

Comment thread dev/merge_spark_pr.py Outdated

Comment thread dev/merge_spark_pr.py Outdated

LuciferYang reviewed May 21, 2026

View reviewed changes

zhengruifeng marked this pull request as draft May 22, 2026 04:21

zhengruifeng force-pushed the SPARK-merge-script-require-component branch 2 times, most recently from 3b5ce78 to b2237ca Compare May 22, 2026 11:47

[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time

f5ed9a1

Generated-by: Claude Code

zhengruifeng force-pushed the SPARK-merge-script-require-component branch from 9694168 to f5ed9a1 Compare May 22, 2026 11:54

zhengruifeng marked this pull request as ready for review May 22, 2026 12:12

Make PS primary; parse and reorder version tags; warn on unknown tags

42986c0

Generated-by: Claude Code

cloud-fan approved these changes May 22, 2026

View reviewed changes

	requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE])
	requirement. Non-primary JIRA components (e.g. [TEST], [SHUFFLE], [DEPLOY])

		)


		_BRACKET_TAG_RE = re.compile(r"\[\s([A-Za-z0-9._-]+)\s\]")

	to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in
	to a primary Component (one with primary=True). Returns an uppercase list of tags in

Conversation

zhengruifeng commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LuciferYang May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhengruifeng May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhengruifeng commented May 22, 2026

Title.parse coverage on real-world titles

Failing commits (already on master, slipped past the old lenient script)

Failing open PRs (all malformed — exactly what this PR intends to block)

Uh oh!

zhengruifeng commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

prompt_for_components coverage on real-world titles

Uh oh!

zhengruifeng commented May 22, 2026

End-to-end pipeline impact on real-world titles

Breakdown of "modified" titles

Unknown tags

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 22, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 22, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 22, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 22, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zhengruifeng commented May 21, 2026 •

edited

Loading

LuciferYang May 21, 2026 •

edited

Loading

zhengruifeng May 22, 2026 •

edited

Loading

`Title.parse` coverage on real-world titles

Failing commits (already on `master`, slipped past the old lenient script)

zhengruifeng commented May 22, 2026 •

edited

Loading

`prompt_for_components` coverage on real-world titles