Generate complete static case fold orbits #206

Draft
ittaigolde wants to merge 2 commits into google:master from ittaigolde:complete-static-case-fold-orbits

@ittaigolde

Fixes #168.

This builds on #170's Unicode 10 table update and makes the fix independent of the runtime JDK's Unicode version.

I reproduced the hang on JDK 21 and scanned all Unicode codepoints with a bounded simpleFold() orbit walker. On master, U+1C80..U+1C88 produce non-closing orbits: repeated simpleFold() calls never return to the starting codepoint. For example:

U+1C80 -> U+0412 -> U+0432 -> U+0412 ...
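A bounded orbit check of this kind could look roughly like the sketch below. It is not the scanner used for this PR: `OrbitWalker`, `orbitCloses`, and `toyFold` are hypothetical names, and `toyFold` is a hand-written stand-in that reproduces the non-closing chain shown above next to one closing orbit (K, k, KELVIN SIGN).

```java
import java.util.function.IntUnaryOperator;

final class OrbitWalker {
    // Returns true if repeatedly applying fold to start comes back to start
    // within maxSteps applications. The bound is what keeps the scan from
    // hanging on a broken orbit.
    static boolean orbitCloses(int start, IntUnaryOperator fold, int maxSteps) {
        int r = fold.applyAsInt(start);
        for (int i = 0; i < maxSteps; i++) {
            if (r == start) {
                return true;
            }
            r = fold.applyAsInt(r);
        }
        return false;
    }

    // Hypothetical stand-in for simpleFold(); NOT the library's tables.
    static int toyFold(int r) {
        switch (r) {
            case 0x1C80: return 0x0412; // enters the cycle below but is never revisited
            case 0x0412: return 0x0432;
            case 0x0432: return 0x0412;
            case 'K':    return 'k';
            case 'k':    return 0x212A; // KELVIN SIGN
            case 0x212A: return 'K';
            default:     return r;      // no fold: a one-element orbit
        }
    }

    public static void main(String[] args) {
        System.out.println(orbitCloses(0x1C80, OrbitWalker::toyFold, 8));
        System.out.println(orbitCloses('K', OrbitWalker::toyFold, 8));
    }
}
```

A full scan would call `orbitCloses` for every codepoint from 0 to `Character.MAX_CODE_POINT` and report the ones that return false.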

#170 fixes the currently observed failures, but simpleFold() still mixes generated ICU tables with the active JDK's Character.toLowerCase() / toUpperCase() data. A later JDK could therefore introduce another incomplete orbit.

Following @herbyderby's suggestion in #168, this change:

  • generates complete static fold orbits, including ordinary upper/lower pairs;
  • stores them in sparse 256-entry int[] pages, supporting supplementary codepoints;
  • removes the runtime JDK casing fallback from simpleFold();
  • adds bounded closure coverage and public API regressions for the Cyrillic Extended-C case, folded character classes, and a supplementary-plane mapping.
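For readers unfamiliar with the paged layout, here is a minimal sketch of a sparse 256-entry int[] page table. It is an illustration under assumptions, not the generated code in this PR: `PagedFoldTable`, `put`, and the use of 0 as the "no fold" marker are all hypothetical choices, and the entries in `main` are a hand-picked sample rather than generated data.

```java
final class PagedFoldTable {
    private static final int PAGE_BITS = 8;
    private static final int PAGE_SIZE = 1 << PAGE_BITS;   // 256 entries per page
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    // One slot per 256-codepoint block; blocks without any fold stay null.
    private final int[][] pages =
        new int[(Character.MAX_CODE_POINT >>> PAGE_BITS) + 1][];

    // Records that 'from' folds to 'to' (the next element of its orbit).
    void put(int from, int to) {
        int p = from >>> PAGE_BITS;
        if (pages[p] == null) {
            pages[p] = new int[PAGE_SIZE];
        }
        pages[p][from & PAGE_MASK] = to;
    }

    // Returns the next orbit element, or r itself when r has no fold.
    // Assumption: 0 marks "no entry", which is safe only because U+0000
    // is never the target of a case fold.
    int simpleFold(int r) {
        if (r < 0 || r > Character.MAX_CODE_POINT) {
            return r;
        }
        int[] page = pages[r >>> PAGE_BITS];
        if (page == null) {
            return r;
        }
        int next = page[r & PAGE_MASK];
        return next == 0 ? r : next;
    }

    public static void main(String[] args) {
        PagedFoldTable t = new PagedFoldTable();
        // Closed three-element orbit that includes U+1C80.
        t.put(0x0412, 0x0432);
        t.put(0x0432, 0x1C80);
        t.put(0x1C80, 0x0412);
        // Supplementary-plane pair (Deseret).
        t.put(0x10400, 0x10428);
        t.put(0x10428, 0x10400);
        System.out.println(Integer.toHexString(t.simpleFold(0x1C80)));
        System.out.println(Integer.toHexString(t.simpleFold(0x10400)));
    }
}
```

Because every lookup is two array indexings with no call into `Character`, the result does not depend on the Unicode version of the running JDK.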

Preliminary JDK 21 microbenchmark, 20 million mixed lookups:

Current hybrid fallback:    7.31 ns/op
Dense complete int[]:       4.80 ns/op
Sorted-array binary search: 14.30 ns/op
Paged sparse int[][]:       4.66 ns/op

Estimated orbit-table memory:

Current #170 char[]:     85 KB
Dense complete int[]:   501 KB
Sorted parallel arrays:  23 KB
Paged sparse int[][]:    29 KB

These are informal measurements rather than JMH results.
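To make "informal" concrete, a hand-rolled timing loop of the following shape is the kind of measurement meant here. This is a sketch, not the harness behind the numbers above: `InformalFoldTimer`, the uniformly random input mix, and the use of `Character.toLowerCase(int)` as the function under test are all assumptions for illustration.

```java
import java.util.Random;
import java.util.function.IntUnaryOperator;

final class InformalFoldTimer {
    // Written once per run so the JIT cannot discard the lookups as dead code.
    static volatile int sink;

    // Average wall-clock nanoseconds per lookup over rounds * inputs.length calls.
    static double nanosPerOp(IntUnaryOperator fold, int[] inputs, int rounds) {
        int acc = 0;
        for (int x : inputs) {            // single warm-up pass, not timed
            acc ^= fold.applyAsInt(x);
        }
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (int x : inputs) {
                acc ^= fold.applyAsInt(x);
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / ((double) rounds * inputs.length);
    }

    // Assumed input mix: uniformly random codepoints with a fixed seed.
    static int[] mixedInputs(int n, long seed) {
        Random rnd = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextInt(Character.MAX_CODE_POINT + 1);
        }
        return a;
    }

    public static void main(String[] args) {
        int[] inputs = mixedInputs(1 << 16, 42L);
        double ns = nanosPerOp(Character::toLowerCase, inputs, 50);
        System.out.printf("%.2f ns/op%n", ns);
    }
}
```

A loop like this has no forked JVMs, no controlled warm-up, and no variance reporting, which is why JMH figures would be more trustworthy.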

Verification:

  • compiled main sources directly with JDK 21;
  • passed UnicodeTest, PatternTest, and CharClassTest;
  • passed all JUnit classes except ExecTest and GWTTest through a standalone Gradle 9 runner.

The excluded classes could not be validated faithfully in that runner on Windows: ExecTest fixture parsing is affected by checkout line endings, and GWTTest expects the legacy root build's resource layout. The repository's Gradle 5.2 wrapper does not start on JDK 21.

@google-cla

google-cla Bot commented May 30, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Development

Successfully merging this pull request may close these issues.

Infinite loop in Pattern#compile on certain case-insensitive patterns