Skip to content

[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time#56026

Open
zhengruifeng wants to merge 2 commits into
apache:masterfrom
zhengruifeng:SPARK-merge-script-require-component
Open

[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time#56026
zhengruifeng wants to merge 2 commits into
apache:masterfrom
zhengruifeng:SPARK-merge-script-require-component

Conversation

@zhengruifeng
Copy link
Copy Markdown
Contributor

@zhengruifeng zhengruifeng commented May 21, 2026

What changes were proposed in this pull request?

This PR rewrites the PR title processing in dev/merge_spark_pr.py:

New Component registry — a typed Component class and a COMPONENTS tuple listing all recognized Spark JIRA components with their canonical tags, aliases, and a primary flag. primary=True means the tag alone satisfies the merge-time requirement; non-primary tags (e.g. [TEST], [SHUFFLE]) must be paired with a primary one. The primary set covers the main subsystems: BUILD, CONNECT, CORE, DOC, DOCKER, GRAPHX, INFRA, K8S, ML, MLLIB, PS, PYTHON, R, SDP, SECURITY, SQL, SS, STREAMING, UI, WINDOWS, YARN.

New Title parser — a Title class with leading (SPARK-NNNNN / MINOR / TRIVIAL), components, and text fields. Title.parse() is strict:

  • the title must open with a leading tag;
  • tags may appear in any order and with arbitrary surrounding whitespace;
  • bracket-tag characters include letters, digits, _, -, and . (so version tags like [4.X] and [3.5] are recognized);
  • SPARK-NNNNN IDs must all precede any component tags;
  • [MINOR] and [TRIVIAL] cannot coexist with each other or with a SPARK-ID.

Malformed titles raise ValueError.

Replaced standardize_jira_ref — the old lenient regex rewriter that tolerated bare SPARK 1234 refs, [Project Infra] multi-word tags, etc. is removed. Titles must now be well-formed before reaching the merge step.

New title pipeline in main():

  1. Hard-fail on [WIP] or [DO-NOT-MERGE] (previously just a soft prompt).
  2. Revert "..." and Reapply "..." titles are kept verbatim.
  3. Title.parse() — fail with a clear message if malformed.
  4. Normalize each component tag via the registry (e.g. PYSPARKPYTHON, DOCSDOC, FOLLOW-UPFOLLOWUP, TESTSTEST) and track whether any primary tag is present.
  5. If no primary component is present, prompt the committer to enter at least one; insert the entered tag(s) right after the leading refs.
  6. Deduplicate component tags in insertion order.
  7. Move backport version tags ([4.X], [4.2], [3.5], ...) to the head of the component list.
  8. Move [FOLLOWUP] to the last position.
  9. Print a warning for any tag that is neither a known component nor a version tag.

Why are the changes needed?

Some PRs are merged without any [COMPONENT] tag (e.g. #55866 merged as [SPARK-56853] Improve PATH Tests), losing module attribution in the changelog. Others carry non-canonical tags or version tags in inconsistent positions, which makes changelog tooling unreliable.

The old standardize_jira_ref tolerated very loose title formats but did not enforce component presence, leaving the gap. This PR closes it with a strict parser, registry-based normalization, and a prompt-based fallback when a primary component is missing.

Does this PR introduce any user-facing change?

No. The change only affects the committer-facing interactive merge tool (dev/merge_spark_pr.py).

How was this patch tested?

Doctests on Title, Title.parse, and the existing version-resolution helpers. All pass via:

python3 -m doctest dev/merge_spark_pr.py

The pipeline was also dry-run against the latest 1000 commits on master and 200 open PRs — see the comments on this PR for the full report.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@zhengruifeng zhengruifeng changed the title [INFRA] Require [COMPONENT] tag in PR title at merge time [INFRA] Require COMPONENT tag in PR title at merge time May 21, 2026
@zhengruifeng zhengruifeng changed the title [INFRA] Require COMPONENT tag in PR title at merge time [SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time May 21, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 21, 2026 04:32
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean addition that closes a real changelog-attribution gap. Code follows the existing standardize_jira_ref pattern (pure helpers + doctests + call site in main()), all 55 doctests pass, and the comments are accurate.

A few questions and minor suggestions inline:

  • One design question about the selection rule for COMPONENT_MARKERS — several active JIRA components have no allowlist entry (Declarative Pipelines, Deploy, DStreams, Examples, Optimizer, Scheduler, Shuffle, Spark Docker, Spark Shell, Spark Submit, Windows, Block Manager). Worth documenting the criterion, since the SDP case (77 recent uses of [SDP]) suggests committers are already using subsystem tags the script will now reject.
  • The validator accepts unknown entries when at least one canonical is provided, which lets typos slip through silently. One-line warning would close the gap.
  • One small coverage gap: no doctest for the [SPARK-1234]-only (empty body) case.
  • Minor styling: the recoverable prompt uses print_error (red), which is usually for fatal errors.

Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Comment thread dev/merge_spark_pr.py Outdated
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems tags like [PROJECT INFRA] were originally supported. Will any issues arise after this fix?

Copy link
Copy Markdown
Contributor Author

@zhengruifeng zhengruifeng May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the script will automatically map it to [INFRA]

@zhengruifeng zhengruifeng marked this pull request as draft May 22, 2026 04:21
@zhengruifeng zhengruifeng force-pushed the SPARK-merge-script-require-component branch 2 times, most recently from 3b5ce78 to b2237ca Compare May 22, 2026 11:47
@zhengruifeng zhengruifeng force-pushed the SPARK-merge-script-require-component branch from 9694168 to f5ed9a1 Compare May 22, 2026 11:54
@zhengruifeng
Copy link
Copy Markdown
Contributor Author

Title.parse coverage on real-world titles

Tested against the latest 1000 commits on master and 200 open PRs:

Source Total Bypassed* Parse OK Parse Fail
Commits 1000 15 983 2
Open PRs 200 11 172 17

* Bypassed = Revert "..." / Reapply "..." / [WIP] / [DO-NOT-MERGE] titles short-circuited in main() before Title.parse is called.

Failing commits (already on master, slipped past the old lenient script)

revert [SPARK-43752] and add test
[DOCS] Clarify DataFrames in quickstart

Failing open PRs (all malformed — exactly what this PR intends to block)

Add recursive read for partition
fix issue MapType as a type hint for schema 55900
WIP Spark 56763 sarutak 3.5 restore additional functionality r2
[SPARK-XXXXX][CORE] Add IO_URING transport mode
Fix UNBOUND_SQL_PARAMETER regression in PySpark 4.1.1
fix: the hive thrift server's ldap authentication pr... in...
Predicate Pushdown in Spark Structured Streaming (DataSource V2).
[SPARK-XXXXX][SQL] Public Column.toJson / Column.fromJson on the V2 catalog interface
fix: remove bare excepts and clean up lint issues (quantum-local-fixer)
[SQL][DML] Resolve Path Based Tables Read in Transactional Writes
Project DSv2 leaf node to Scan's schema
[SPARK-XXXXX][SQL][TESTS] Add DSv2 table refresh infrastructure and t…
[SPARK-XXXXX][CONNECT][TESTS] Add Connect DataFrame pinning tests
[SPARK-XXXXX][SQL][TESTS] Add DSv2 CACHE TABLE and write validation t…
[SQL] Update grammar/TableChange to allow CLUSTER BY expressions
[SPARK 55879] [Web UI] copy sql to clipboard
Implement `SET` command in single-pass analyzer

Breakdown of the 17 PR failures: 5 [SPARK-XXXXX] placeholders, 2 component-first (no SPARK ref), 1 [SPARK 55879] (space instead of hyphen), 9 with no leading bracket at all.

@zhengruifeng
Copy link
Copy Markdown
Contributor Author

zhengruifeng commented May 22, 2026

prompt_for_components coverage on real-world titles

Among the parseable titles from the previous sample (983 commits, 172 open PRs), how many would lack a primary tag and trigger the prompt at merge time:

Source Parseable Need prompt Rate
Commits 983 40 ~4%
Open PRs 172 11 ~6%

Breakdown of why the prompt would fire:

Category Commits PRs
No tags at all (just [SPARK-NNNNN] ...) 26 8
Only [TEST] (alone or with [FOLLOWUP]) 9 0
Only [FOLLOWUP] 2 0
Only [UDF] 0 2
Other ([WEBIUI], [SHS], [EXAMPLE], [REPL]) 3 1
Total 40 11

Notes:

  • [WEBIUI] looks like a typo of [WEBUI], which is already an alias of the primary UI tag.
  • The "no tags at all" group (26 commits, 8 PRs) is the prompt working as intended — titles like [SPARK-NNNNN] ... with no component tag get caught.

@zhengruifeng zhengruifeng marked this pull request as ready for review May 22, 2026 12:12
@zhengruifeng
Copy link
Copy Markdown
Contributor Author

End-to-end pipeline impact on real-world titles

Ran the full new title pipeline (parse → normalize via registry → prompt-if-no-primary → dedup → move version tags to head → move FOLLOWUP last → warn on unknown tags) against the same sample.

Commits (983) PRs (172)
Need prompt for primary 40 11
Title would be modified 190 37
Has unknown tag(s) (warning only) 9 7

Breakdown of "modified" titles

Kind Commits PRs
Tag rename (TESTS→TEST, FOLLOW-UP→FOLLOWUP, DOCS→DOC, WEBUI→UI, EXAMPLES→EXAMPLE, PYSPARK→PYTHON, SHELL→REPL) 187 25
Reorder only (FOLLOWUP → last, version tag → head) 3 4
Whitespace / case fixes (e.g. missing space, lowercase inner tags) 0 8

Examples:

  • Reorder: [SPARK-55897][SQL][4.0] ...[SPARK-55897][4.0][SQL] ...
  • Whitespace: [SPARK-56998]Add SECURITY.md ...[SPARK-56998] Add SECURITY.md ...
  • Case: [SPARK-56962][SS][RTM][StreamingShuffle][Part2] ...[SPARK-56962][SS][RTM][STREAMINGSHUFFLE][PART2] ...

Unknown tags

Recurring (worth considering for the registry):

Tag Commits PRs
DML 3 2
GEO 3 0
UDF 0 2
RTM 0 2
SHS 1 0

One-offs / typos:

  • WEBIUI — typo of WEBUI (already aliased to primary UI)
  • REVERT, SPARK, STREAMINGSHUFFLE, PART1, PART2 — non-standard or one-off

For unknown tags the script just prints a warning and continues — the committer can decide whether to fix or proceed.

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Re-review: 3 addressed, 1 remaining, 4 new — history was rewritten since the prior round, so no per-finding sub-classification. All findings below are minor (doc accuracy, dead code, a coverage gap, and one theoretical-only behavior gap that the author's dry-run shows is unhit in practice). Approving; feel free to fold these in or merge as-is.

Comment thread dev/merge_spark_pr.py
the same component (e.g. "DOCS", "DOCUMENTATION" -> "DOC").

``primary`` marks components whose presence alone satisfies the merge-time
requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale example: [PS] was promoted to primary=True in the latest commit (Make PS primary; ...). Replace with a still-non-primary tag so the docstring stays accurate.

Suggested change
requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE])
requirement. Non-primary JIRA components (e.g. [TEST], [SHUFFLE], [DEPLOY])

Comment thread dev/merge_spark_pr.py
Component("SDP", ("PIPELINES",), primary=True, jira_name="Declarative Pipelines"),
Component("SECURITY", primary=True, jira_name="Security"),
Component("SHUFFLE", jira_name="Shuffle"),
# Component("SPARKR", jira_name="SparkR"), # SPARKR is treated as an alias of R above
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead commented-out registry entry — the alias is now declared on the R row above (Component("R", ("SPARKR",), ...)), so this comment line can just be deleted.

Suggested change
# Component("SPARKR", jira_name="SparkR"), # SPARKR is treated as an alias of R above

Comment thread dev/merge_spark_pr.py
)


_BRACKET_TAG_RE = re.compile(r"\[\s*([A-Za-z0-9._-]+)\s*\]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_BRACKET_TAG_RE excludes internal spaces ([A-Za-z0-9._-]+), so every alias/canonical with an embedded space is unreachable from Title.parse — only resolvable through the prompt_for_components interactive path. That affects:

  • canonical: "BLOCK MANAGER" (only the underscore alias is parser-reachable)
  • aliases: "SPARK CORE", "PROJECT INFRA", "WEB UI", "SPARK SHELL", "JAVA API", "FOLLOW UP"

Practical impact: in your reply to @LuciferYang you said [PROJECT INFRA] in a title would automatically map to [INFRA], but the parser actually leaves it in the body text. I verified by importing the module:

>>> Title.parse("[SPARK-1234][PROJECT INFRA] Fix")
leading=['SPARK-1234'], components=[], text='[PROJECT INFRA] Fix'
>>> str(_)
'[SPARK-1234] [PROJECT INFRA] Fix'

The prompt fires (no primary), and the final title becomes [SPARK-1234][INFRA] [PROJECT INFRA] Fix.

Your dry-run on 1000 commits + 200 PRs shows this form doesn't occur in practice, so it's a theoretical regression only. Two options if you want to close it cleanly: drop the with-space variants from the registry (since they're parser-unreachable they only serve the interactive path), or extend the regex to accept internal whitespace so the canonical form is consistent everywhere. Up to you — happy to leave as-is if you prefer.

Comment thread dev/merge_spark_pr.py
_LEADING_TAGS = frozenset({"MINOR", "TRIVIAL"})


class Title:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same coverage gap I flagged last round: there is no doctest for the empty-body case (Title.parse("[SPARK-1234]")). It parses fine (leading=['SPARK-1234'], components=[], text='') and round-trips through __str__ correctly, but a doctest would pin that behavior — and exercise the if not self.text: return parts branch in __str__, which is otherwise uncovered.

Comment thread dev/merge_spark_pr.py
component. Each entered token is normalized via Component.find
(e.g. "DOCS" -> "DOC", "PYSPARK" -> "PYTHON"). Unrecognized tokens are
passed through as-is. Re-prompts until at least one entered token resolves
to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale name — there is no PRIMARY_COMPONENTS constant in the module; primariness is encoded on each Component via the primary attribute. Suggested rewording:

Suggested change
to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in
to a primary Component (one with primary=True). Returns an uppercase list of tags in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants