[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time#56026
[SPARK-56979][INFRA] Require COMPONENT tag in PR title at merge time#56026zhengruifeng wants to merge 2 commits into
Conversation
cloud-fan
left a comment
There was a problem hiding this comment.
Nice clean addition that closes a real changelog-attribution gap. Code follows the existing standardize_jira_ref pattern (pure helpers + doctests + call site in main()), all 55 doctests pass, and the comments are accurate.
A few questions and minor suggestions inline:
- One design question about the selection rule for
COMPONENT_MARKERS— several active JIRA components have no allowlist entry (Declarative Pipelines,Deploy,DStreams,Examples,Optimizer,Scheduler,Shuffle,Spark Docker,Spark Shell,Spark Submit,Windows,Block Manager). Worth documenting the criterion, since the SDP case (77 recent uses of[SDP]) suggests committers are already using subsystem tags the script will now reject. - The validator accepts unknown entries when at least one canonical is provided, which lets typos slip through silently. One-line warning would close the gap.
- One small coverage gap: no doctest for the
[SPARK-1234]-only (empty body) case. - Minor styling: the recoverable prompt uses
print_error(red), which is usually for fatal errors.
There was a problem hiding this comment.
It seems tags like [PROJECT INFRA] were originally supported. Will any issues arise after this fix?
There was a problem hiding this comment.
the script will automatically map it to [INFRA]
3b5ce78 to
b2237ca
Compare
Generated-by: Claude Code
9694168 to
f5ed9a1
Compare
|
| Source | Total | Bypassed* | Parse OK | Parse Fail |
|---|---|---|---|---|
| Commits | 1000 | 15 | 983 | 2 |
| Open PRs | 200 | 11 | 172 | 17 |
* Bypassed = Revert "..." / Reapply "..." / [WIP] / [DO-NOT-MERGE] titles short-circuited in main() before Title.parse is called.
Failing commits (already on master, slipped past the old lenient script)
revert [SPARK-43752] and add test
[DOCS] Clarify DataFrames in quickstart
Failing open PRs (all malformed — exactly what this PR intends to block)
Add recursive read for partition
fix issue MapType as a type hint for schema 55900
WIP Spark 56763 sarutak 3.5 restore additional functionality r2
[SPARK-XXXXX][CORE] Add IO_URING transport mode
Fix UNBOUND_SQL_PARAMETER regression in PySpark 4.1.1
fix: the hive thrift server's ldap authentication pr... in...
Predicate Pushdown in Spark Structured Streaming (DataSource V2).
[SPARK-XXXXX][SQL] Public Column.toJson / Column.fromJson on the V2 catalog interface
fix: remove bare excepts and clean up lint issues (quantum-local-fixer)
[SQL][DML] Resolve Path Based Tables Read in Transactional Writes
Project DSv2 leaf node to Scan's schema
[SPARK-XXXXX][SQL][TESTS] Add DSv2 table refresh infrastructure and t…
[SPARK-XXXXX][CONNECT][TESTS] Add Connect DataFrame pinning tests
[SPARK-XXXXX][SQL][TESTS] Add DSv2 CACHE TABLE and write validation t…
[SQL] Update grammar/TableChange to allow CLUSTER BY expressions
[SPARK 55879] [Web UI] copy sql to clipboard
Implement `SET` command in single-pass analyzer
Breakdown of the 17 PR failures: 5 [SPARK-XXXXX] placeholders, 2 component-first (no SPARK ref), 1 [SPARK 55879] (space instead of hyphen), 9 with no leading bracket at all.
|
| Source | Parseable | Need prompt | Rate |
|---|---|---|---|
| Commits | 983 | 40 | ~4% |
| Open PRs | 172 | 11 | ~6% |
Breakdown of why the prompt would fire:
| Category | Commits | PRs |
|---|---|---|
No tags at all (just [SPARK-NNNNN] ...) |
26 | 8 |
Only [TEST] (alone or with [FOLLOWUP]) |
9 | 0 |
Only [FOLLOWUP] |
2 | 0 |
Only [UDF] |
0 | 2 |
Other ([WEBIUI], [SHS], [EXAMPLE], [REPL]) |
3 | 1 |
| Total | 40 | 11 |
Notes:
[WEBIUI]looks like a typo of[WEBUI], which is already an alias of the primaryUItag.- The "no tags at all" group (26 commits, 8 PRs) is the prompt working as intended — titles like
[SPARK-NNNNN] ...with no component tag get caught.
Generated-by: Claude Code
End-to-end pipeline impact on real-world titlesRan the full new title pipeline (parse → normalize via registry → prompt-if-no-primary → dedup → move version tags to head → move FOLLOWUP last → warn on unknown tags) against the same sample.
Breakdown of "modified" titles
Examples:
Unknown tagsRecurring (worth considering for the registry):
One-offs / typos:
For unknown tags the script just prints a warning and continues — the committer can decide whether to fix or proceed. |
cloud-fan
left a comment
There was a problem hiding this comment.
LGTM. Re-review: 3 addressed, 1 remaining, 4 new — history was rewritten since the prior round, so no per-finding sub-classification. All findings below are minor (doc accuracy, dead code, a coverage gap, and one theoretical-only behavior gap that the author's dry-run shows is unhit in practice). Approving; feel free to fold these in or merge as-is.
| the same component (e.g. "DOCS", "DOCUMENTATION" -> "DOC"). | ||
|
|
||
| ``primary`` marks components whose presence alone satisfies the merge-time | ||
| requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE]) |
There was a problem hiding this comment.
Stale example: [PS] was promoted to primary=True in the latest commit (Make PS primary; ...). Replace with a still-non-primary tag so the docstring stays accurate.
| requirement. Non-primary JIRA components (e.g. [TEST], [PS], [SHUFFLE]) | |
| requirement. Non-primary JIRA components (e.g. [TEST], [SHUFFLE], [DEPLOY]) |
| Component("SDP", ("PIPELINES",), primary=True, jira_name="Declarative Pipelines"), | ||
| Component("SECURITY", primary=True, jira_name="Security"), | ||
| Component("SHUFFLE", jira_name="Shuffle"), | ||
| # Component("SPARKR", jira_name="SparkR"), # SPARKR is treated as an alias of R above |
There was a problem hiding this comment.
Dead commented-out registry entry — the alias is now declared on the R row above (Component("R", ("SPARKR",), ...)), so this comment line can just be deleted.
| # Component("SPARKR", jira_name="SparkR"), # SPARKR is treated as an alias of R above |
| ) | ||
|
|
||
|
|
||
| _BRACKET_TAG_RE = re.compile(r"\[\s*([A-Za-z0-9._-]+)\s*\]") |
There was a problem hiding this comment.
_BRACKET_TAG_RE excludes internal spaces ([A-Za-z0-9._-]+), so every alias/canonical with an embedded space is unreachable from Title.parse — only resolvable through the prompt_for_components interactive path. That affects:
- canonical:
"BLOCK MANAGER"(only the underscore alias is parser-reachable) - aliases:
"SPARK CORE","PROJECT INFRA","WEB UI","SPARK SHELL","JAVA API","FOLLOW UP"
Practical impact: in your reply to @LuciferYang you said [PROJECT INFRA] in a title would automatically map to [INFRA], but the parser actually leaves it in the body text. I verified by importing the module:
>>> Title.parse("[SPARK-1234][PROJECT INFRA] Fix")
leading=['SPARK-1234'], components=[], text='[PROJECT INFRA] Fix'
>>> str(_)
'[SPARK-1234] [PROJECT INFRA] Fix'
The prompt fires (no primary), and the final title becomes [SPARK-1234][INFRA] [PROJECT INFRA] Fix.
Your dry-run on 1000 commits + 200 PRs shows this form doesn't occur in practice, so it's a theoretical regression only. Two options if you want to close it cleanly: drop the with-space variants from the registry (since they're parser-unreachable they only serve the interactive path), or extend the regex to accept internal whitespace so the canonical form is consistent everywhere. Up to you — happy to leave as-is if you prefer.
| _LEADING_TAGS = frozenset({"MINOR", "TRIVIAL"}) | ||
|
|
||
|
|
||
| class Title: |
There was a problem hiding this comment.
Same coverage gap I flagged last round: there is no doctest for the empty-body case (Title.parse("[SPARK-1234]")). It parses fine (leading=['SPARK-1234'], components=[], text='') and round-trips through __str__ correctly, but a doctest would pin that behavior — and exercise the if not self.text: return parts branch in __str__, which is otherwise uncovered.
| component. Each entered token is normalized via Component.find | ||
| (e.g. "DOCS" -> "DOC", "PYSPARK" -> "PYTHON"). Unrecognized tokens are | ||
| passed through as-is. Re-prompts until at least one entered token resolves | ||
| to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in |
There was a problem hiding this comment.
Stale name — there is no PRIMARY_COMPONENTS constant in the module; primariness is encoded on each Component via the primary attribute. Suggested rewording:
| to a tag in PRIMARY_COMPONENTS. Returns an uppercase list of tags in | |
| to a primary Component (one with primary=True). Returns an uppercase list of tags in |
What changes were proposed in this pull request?
This PR rewrites the PR title processing in
dev/merge_spark_pr.py:New
Componentregistry — a typedComponentclass and aCOMPONENTStuple listing all recognized Spark JIRA components with their canonical tags, aliases, and aprimaryflag.primary=Truemeans the tag alone satisfies the merge-time requirement; non-primary tags (e.g.[TEST],[SHUFFLE]) must be paired with a primary one. The primary set covers the main subsystems:BUILD,CONNECT,CORE,DOC,DOCKER,GRAPHX,INFRA,K8S,ML,MLLIB,PS,PYTHON,R,SDP,SECURITY,SQL,SS,STREAMING,UI,WINDOWS,YARN.New
Titleparser — aTitleclass withleading(SPARK-NNNNN / MINOR / TRIVIAL),components, andtextfields.Title.parse()is strict:_,-, and.(so version tags like[4.X]and[3.5]are recognized);[MINOR]and[TRIVIAL]cannot coexist with each other or with a SPARK-ID.Malformed titles raise
ValueError.Replaced
standardize_jira_ref— the old lenient regex rewriter that tolerated bareSPARK 1234refs,[Project Infra]multi-word tags, etc. is removed. Titles must now be well-formed before reaching the merge step.New title pipeline in
main():[WIP]or[DO-NOT-MERGE](previously just a soft prompt).Revert "..."andReapply "..."titles are kept verbatim.Title.parse()— fail with a clear message if malformed.PYSPARK→PYTHON,DOCS→DOC,FOLLOW-UP→FOLLOWUP,TESTS→TEST) and track whether any primary tag is present.[4.X],[4.2],[3.5], ...) to the head of the component list.[FOLLOWUP]to the last position.Why are the changes needed?
Some PRs are merged without any
[COMPONENT]tag (e.g. #55866 merged as[SPARK-56853] Improve PATH Tests), losing module attribution in the changelog. Others carry non-canonical tags or version tags in inconsistent positions, which makes changelog tooling unreliable.The old
standardize_jira_reftolerated very loose title formats but did not enforce component presence, leaving the gap. This PR closes it with a strict parser, registry-based normalization, and a prompt-based fallback when a primary component is missing.Does this PR introduce any user-facing change?
No. The change only affects the committer-facing interactive merge tool (
dev/merge_spark_pr.py).How was this patch tested?
Doctests on
Title,Title.parse, and the existing version-resolution helpers. All pass via:The pipeline was also dry-run against the latest 1000 commits on
masterand 200 open PRs — see the comments on this PR for the full report.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code