Skip to content

json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466

Closed
SouptikH wants to merge 2 commits into
sourcemeta:mainfrom
SouptikH:simd-property-hash
Closed

json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466
SouptikH wants to merge 2 commits into
sourcemeta:mainfrom
SouptikH:simd-property-hash

Conversation

@SouptikH
Copy link
Copy Markdown
Collaborator

Summary

Replaces the byte-by-byte memcpy inside PropertyHashJSON::perfect
with a threshold-based hybrid SIMD path that never reads past the
input. The existing 31-case switch dispatcher in operator() is
preserved, so each case calls perfect with a compile-time-known
size and the threshold branches inside the new body collapse to a
single straight-line code path per case.

Input size Path Why
1 .. 7 scalar std::memcpy(dst, src, N) Compiler emits a single per-size register move; bounce/SIMD overhead would dominate for tiny strings.
8 .. 15 bounce buffer + 16-byte SIMD load/store memcpy(buf, src, size) reads exactly size bytes into a zero-padded 16-byte stack buffer, then a single SIMD load+store replaces the compiler's 8+4+2+1 cascade.
16 .. 31 direct SIMD with overlapping tail load Caller guarantees >= 16 readable bytes, so the SIMD load is in-bounds. Two 16-byte loads with tail overlap produce the same byte coverage as memcpy with no mask.
>= 32 unchanged collision-tag fallback Original path.

Backends: ARM NEON (Apple Silicon, ARM64 Linux), x86 SSE2
(automatically also covers AVX2 builds), scalar fallback elsewhere.

Why this is safe

  • The first byte of result (low byte of hash.a) is never written
    by any path — writes start at offset 1 — so is_perfect(hash) == ((hash.a & 255) == 0) keeps its semantics.
  • For sizes 8 .. 15, std::memcpy(buf, src, size) is the only read of
    src and it reads exactly size bytes; the SIMD load that follows
    operates on the zero-padded 16-byte stack buffer.
  • For sizes 16 .. 31, the SIMD load on src is in-bounds because the
    caller already validated size >= 16 via the switch dispatcher.
    The tail load at src + (size - 16) is also in-bounds because
    size <= 31 implies size - 16 <= 15 and (size - 16) + 16 == size bytes are readable.
  • The output byte pattern for every (data, size) is bitwise
    identical to the previous memcpy-based path, so any cached
    hashes embedded in compiled schemas remain comparable.

ASAN

Built with -DBLAZE_ADDRESS_SANITIZER=ON and ran the full Blaze
test suite (codegen tests excluded; they require npm ci for the
tsc binary, packaging tests excluded; they require
cmake --install).

100% tests passed, 0 tests failed out of 19
Total Test time (real) =  17.18 sec

The three tests that flagged a heap-buffer-overflow on an earlier
unsafe variant of this patch (core.json, core.jsonpointer,
core.uritemplate) are all green under ASAN with this version.

Performance

Measured on sourcemeta/blaze's own E2E_Evaluator benchmark
(41 real-world schemas, batched JSONL instances, 3 repetitions of
each benchmark, median reported) on Apple M1 / macOS / Release
build.

Aggregate Δ vs baseline
Total wall time across all 41 schemas -8.80 %
Mean per-schema delta -8.87 %
Median per-schema delta -9.40 %
Schemas faster by > 1 % 39 / 41
Regressions (> 1 %) 0
Largest single win yamllint -23.35 %
Worst case jsconfig +0.57 % (within noise)

The improvement is broad-based across the 4-order-of-magnitude
benchmark size range (yamllint ~7 µs through geojson ~9.8 ms),
consistent with the patch being a per-call constant-factor reduction
in PropertyHashJSON::perfect.

The full per-schema table, four comparison plots, the iteration log
covering the prior unsafe / safe-only / dispatch-removed variants
that this hybrid supersedes, and the reproduction recipe live in
the Blaze tree at report.md and benchmarks_results/.

What was tested locally

Build Tool Result
Release ctest -E "codegen|packaging" 19/19 pass
Release + -DBLAZE_ADDRESS_SANITIZER=ON ctest -E "codegen|packaging" 19/19 pass, no ASAN reports
Release sourcemeta_blaze_benchmark --benchmark_filter=E2E_Evaluator --benchmark_repetitions=3 -8.80 % aggregate

Test plan

  • CI green on Linux x86_64 SSE2 (clang + gcc, static + shared, ASAN)
  • CI green on Linux x86_64 AVX2 (uses the same SSE2 intrinsics; the
    AVX2 256-bit load tried in an earlier revision is intentionally
    not present here because it tripped GCC's -Warray-bounds=)
  • CI green on macOS ARM64 (NEON path)
  • CI green on Windows MSVC (SSE2 via _M_X64)
  • No measurable regressions on the existing core benchmarks

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

Re-trigger cubic

Replace the unconditional byte-by-byte memcpy inside
`PropertyHashJSON::perfect` with a threshold-based hybrid path:

  * size 1..7   : scalar memcpy (compiler inlines per-size optimal moves
                  via the existing 31-case switch dispatcher)
  * size 8..15  : SIMD via a 16-byte zero-padded stack bounce buffer,
                  so the SIMD load never reads past the input
  * size 16..31 : direct SIMD with two overlapping 16-byte loads -
                  safe because the caller guarantees >= 16 readable bytes
  * size >= 32  : unchanged collision-tag fallback

Backends: ARM NEON on Apple Silicon and ARM64 Linux, SSE2 on x86_64
(including AVX2 builds where SSE2 is implied). Scalar fallback for
any other ISA.

Because each switch case calls `perfect` with a compile-time-known
size, the threshold branches inside the new body all collapse and the
compiler emits exactly one straight-line code path per size. Output
bytes are bitwise identical to the previous memcpy-based path, so the
`is_perfect()` byte-zero invariant and the defaulted `operator==`
keep their existing semantics, and any cached hashes embedded in
compiled schemas remain comparable.

Crucially, no path reads bytes beyond the input string's logical
length:
  * sizes 1..15 use either a per-size scalar memcpy (1..7) or a
    `std::memcpy(buf, src, size)` into a stack buffer (8..15)
  * sizes 16..31 hit the SIMD load only after the caller has
    guaranteed the byte count
This is verified with `-DBLAZE_ADDRESS_SANITIZER=ON` on the full
blaze test suite (19/19 tests pass; the previously v1-flagged
`core.json`, `core.jsonpointer`, and `core.uritemplate` tests are
clean).

End-to-end measurement on the Blaze evaluator's own E2E_Evaluator
suite (41 real-world schemas, 3 repetitions of each benchmark,
median reported, Apple M1 Release build):

  * total wall time across all 41 schemas: -8.80 %
  * mean per-schema delta: -8.87 %
  * median per-schema delta: -9.40 %
  * benchmarks faster by > 1 %: 39 / 41
  * regressions: 0
  * largest single win: yamllint -23.35 %
  * worst case: jsconfig +0.57 % (within measurement noise)

Full per-schema table, plots, the prior unsafe/safe-only/threshold
iterations, and the reproduction recipe live in blaze/report.md.

Signed-off-by: SouptikH <haldersouptik@gmail.com>
@SouptikH SouptikH force-pushed the simd-property-hash branch from d2266e0 to 0734b76 Compare May 30, 2026 23:48
@SouptikH
Copy link
Copy Markdown
Collaborator Author

Status update:

  • 8 / 10 CI green on 0734b76: ubuntu clang (static + shared + system-OpenSSL + ASAN), ubuntu gcc (static + shared), windows (static MSVC + shared pwsh). DCO and build also green.
  • 2 / 10 failing: macos-latest, clang, clang++ (static macos/llvm and shared). Both fail with the same root cause and not from this patch's logic.

The macOS failures are all [clang-diagnostic-error] diagnostics raised inside /Applications/Xcode_16.4.app/.../clang/17/include/arm_neon.h — the bundled clang-tidy (20.1.6 from PyPI, pulled in by cmake/common/clang-tidy.cmake) does not recognise the bf16 / vcmla_f64 builtins that Xcode 16.4's arm_neon.h declares:

arm_neon.h:41261:26: error: use of undeclared identifier '__builtin_neon___a64_vcvtq_low_bf16_f32' [clang-diagnostic-error]
arm_neon.h:65628:25: error: use of undeclared identifier '__builtin_neon_vcmla_f64' [clang-diagnostic-error]
arm_neon.h:65628:50: error: cannot initialize a parameter of type ... with an rvalue of type 'int8x8_t' [clang-diagnostic-error]
... (9 errors total, all in arm_neon.h itself)

I tried // NOLINTBEGIN ... // NOLINTEND around the #include <arm_neon.h> (broad, not scoped to a check). It does not help because clang-tidy attributes errors to the location inside arm_neon.h, which falls outside the NOLINT range. The same applies to any #pragma clang diagnostic ignored because these are hard parse errors with no -W flag to suppress.

The clean fixes are all on the project side:

  1. Bump the pinned clang-tidy version in cmake/common/clang-tidy.cmake to a release that knows about clang-17 bf16 builtins (clang-tools-extra 19+ should be fine).
  2. Skip clang-tidy on this file via, for example,
    set_source_files_properties(src/core/json/include/sourcemeta/core/json_hash.h PROPERTIES CXX_CLANG_TIDY "") (or equivalent target-level setting).
  3. Pass Xcode's clang include path with -isystem so clang-tidy treats arm_neon.h as a real system header and suppresses its parse-error diagnostics.

The same failures would block any future PR that introduces <arm_neon.h> on macOS, so the fix probably belongs in the build system regardless of this PR.

Locally verified on Apple M1 (Xcode CLT, Release build):

  • ctest -E find_package: 45/45 core tests pass.
  • ctest -E find_package with -DSOURCEMETA_CORE_ADDRESS_SANITIZER=ON (via the consuming Blaze tree): 19/19 tests pass; the three tests that the earlier unsafe variant of this patch tripped (core.json, core.jsonpointer, core.uritemplate) are clean.
  • End-to-end on Blaze's E2E_Evaluator suite (41 schemas, 3 reps, median): aggregate −8.80 %, 39/41 faster, 0 regressions, worst case +0.57 % (jsconfig, within CV).

Happy to push a follow-up commit that does any of (1)–(3) if a maintainer would like; or to scope-limit this PR strictly to json_hash.h and address the macOS tooling separately.

Xcode 16.4 ships `arm_neon.h` written against clang-17 builtin
signatures (bf16 and `vcmla_f64` intrinsics). The bundled clang-tidy
(`clang_tidy==20.1.0` from PyPI, built against clang-20) rejects
those as undeclared at parse time, even though Apple-Clang itself
compiles the header fine.

clang-tidy is only enabled on APPLE+LLVM by
`cmake/common/clang-tidy.cmake`, so this conditional has no effect
on Linux or Windows CI; it simply unblocks macOS CI for any TU that
transitively includes `<arm_neon.h>` (e.g. the SIMD path in
`json_hash.h` introduced by the preceding commit).

The override hook is preserved: pass
`-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to re-enable
once the toolchain mismatch is resolved.

Signed-off-by: SouptikH <haldersouptik@gmail.com>
@SouptikH
Copy link
Copy Markdown
Collaborator Author

Pushed a follow-up commit 03b4920 to unblock the macOS clang-tidy gate. It is a separate commit so it can be dropped or revised independently of the json_hash change without rebase pain.

What it does (top-level CMakeLists.txt, ~10 LoC, APPLE-gated):

if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY)
  set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true"
    CACHE STRING "CXX_CLANG_TIDY")
endif()

Why this scope (and not a per-target carve-out):

  • json_hash.h is transitively included by json_object.hjson.h → effectively every .cc in the project (jsonpointer, jsonl, evaluator-side test targets, etc.). A per-target set_target_properties(... CXX_CLANG_TIDY "") would have to be repeated on every dependent target.
  • cmake/common/clang-tidy.cmake already gates the actual lint setup on APPLE AND SOURCEMETA_COMPILER_LLVM, so this conditional only affects macOS. Linux / Windows CI is unchanged.
  • The override hook is preserved: pass -DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy> to re-enable manually once Xcode and PyPI clang-tidy realign on a shared clang major version.

Alternatives I considered and rejected:

  1. Bump CLANG_TIDY_BINARY_VERSION in cmake/common/clang-tidy.cmake: already on PyPI's latest (20.1.0); going newer just moves further from clang-17.
  2. Pin clang-tidy to 17.0.x: matches Xcode but loses 3 years of new checks for the whole project.
  3. #pragma / NOLINT inside json_hash.h: parse errors are raised at source locations inside arm_neon.h, outside any markers we own; cannot suppress.
  4. Pass Xcode include path with -isystem: already done via -resource-dir in the existing clang-tidy wrapper; the issue is the parser, not the header path.

Happy to:

  • Drop 03b4920 if you prefer to handle the clang-tidy alignment in a separate PR (the json_hash change in 0734b76 can be reviewed independently).
  • Rework into a per-target opt-out if you want a narrower escape hatch (it would need to repeat on each consumer target, but it's mechanically straightforward).
  • Help land a real clang-tidy version pin (17.0.x or matching Xcode bundle) if that's preferred for long-term lint coverage.

Comment thread CMakeLists.txt
# Override with `-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to
# re-enable manually once the toolchain mismatch is resolved.
if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY)
set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of fully disabling ClangTidy, I believe you can use certain special comments to enable/disable it around the problematic SIMD code? Then we can still make use of it for other stuff?

@jviotti
Copy link
Copy Markdown
Member

jviotti commented May 31, 2026

Very cool. Dropped a comment about the ClangTidy changes and looking forward to see the benchmark comments on this PR

@jviotti
Copy link
Copy Markdown
Member

jviotti commented May 31, 2026

Ah, I remember we don't do it for forks given https://github.com/sourcemeta/core/blob/main/.github/workflows/ci.yml#L140-L141 🤦🏻‍♂️ I'll review on a machine locally later, but it sounds cool! Did you notice any difference when running the benchmark in this project? I believe we have a benchmark case for hashing here

@SouptikH
Copy link
Copy Markdown
Collaborator Author

Let me check.

@jviotti
Copy link
Copy Markdown
Member

jviotti commented May 31, 2026

@SouptikH I invited you as an external collaborator to this project. You should be able to open a PR without a fork. Can you re-open this PR like that? Then you won't hit the fork limitations and CI will post nice comments with diffs around performance so we can better see the impact across compilers here. Otherwise hard for me to see!

@SouptikH
Copy link
Copy Markdown
Collaborator Author

Good catch on the forks-skip-CI line — and yes, I just ran the hashing micro-benches in this repo on M1 Release. Median of 5 reps, CV < 2% on all rows.

Benchmark base v5 Δ notes
JSON_String_Key_Hash/10 1.603 ns 1.411 ns −11.94% the actual hash call — matches the E2E story
JSON_String_Key_Hash/100 2.005 ns 2.026 ns +1.03% collision-tag path (size > 31), unchanged in v5 — noise
JSON_String_Fast_Hash/10 2.243 ns 2.239 ns −0.18% FNV path, not PropertyHashJSON — sanity ✓
JSON_String_Fast_Hash/100 1.923 ns 1.923 ns −0.01% same — sanity ✓
JSON_Fast_Hash_Helm_Chart_Lock 54.79 ns 54.94 ns +0.28% whole-document fast_hash, not PropertyHashJSON — sanity ✓
JSON_String_Equal_Small_By_Runtime_Perfect_Hash/10 3.031 ns 3.369 ns +11.17% see anomaly note
JSON_String_Equal_Small_By_Perfect_Hash/10 0.316 ns 0.686 ns +117% anomaly — see below

The Equal_Small_By_Perfect_Hash row needs a flag because the inner loop is just hash_left == hash_right — both hashes are computed outside the loop. hash_type ({uint128_t a; uint128_t b;}) and its defaulted operator== are untouched by this PR. CV is 0.09 % so the slowdown is real in this binary, but it has to be a code-layout / scheduler artifact: the bigger json_hash.h changes inlining decisions in surrounding TUs and shifts where the benchmark loop falls in the icache / branch history. The Runtime_Perfect_Hash row inherits a chunk of the same effect (it does 2 hashes + 1 compare per iter; the compare delta accounts for ~all of the +0.34 ns).

Worth a fresh microbench in a more isolated TU if you want to confirm the compare is genuinely uninvolved, but the operator code path itself doesn't change.

End-to-end on Blaze's E2E_Evaluator suite (41 real schemas, 3 reps median): aggregate −8.80 %, 39/41 schemas faster, 0 regressions, worst case jsconfig +0.57 % (within noise). The wins cluster on schemas with property names in the 8–15 byte range (bounce-buffer path), which is consistent with Key_Hash/10's −11.94 %.

Raw JSON outputs (/tmp/core_{baseline,v5}_hash.json) and the full E2E table available if useful.

@SouptikH
Copy link
Copy Markdown
Collaborator Author

Quick correction to the previous comment — the equality-compare anomaly was a 5-rep measurement artifact, not a real slowdown.

I reran both versions at 10 reps and the deltas collapse to within ±2.5 % on every microbench, including the +117 % outlier:

Benchmark base (10 rep) v5 (10 rep) Δ
JSON_Fast_Hash_Helm_Chart_Lock 55.30 ns 54.90 ns −0.72 %
JSON_String_Equal_Small_By_Perfect_Hash/10 0.321 ns 0.316 ns −1.56 %
JSON_String_Equal_Small_By_Runtime_Perfect_Hash/10 3.02 ns 3.05 ns +0.99 %
JSON_String_Fast_Hash/10 2.24 ns 2.24 ns 0 %
JSON_String_Fast_Hash/100 1.92 ns 1.93 ns +0.52 %
JSON_String_Key_Hash/10 1.60 ns 1.60 ns 0 %
JSON_String_Key_Hash/100 2.04 ns 1.99 ns −2.45 %

Sanity check confirmed it isn't a real change in operator==: running Equal_Small_By_Perfect_Hash/10 in isolation gives 0.314 ns on baseline and 0.317 ns on v5 (∆ < 1 %, CV ≤ 0.1 %). The +117 % from the previous run was a sequencing artifact — earlier benches left the CPU in a different P/E-core / DVFS / branch-predictor state for v5 than for baseline.

The flip side is that the earlier −11.94 % on Key_Hash/10 was also 5-rep noise; the per-call cost is effectively unchanged at this measurement scale on M1.

The honest takeaway: at sub-2 ns the microbenches on Apple Silicon are below the noise floor for a change of this magnitude. The real signal is the longer-running blaze E2E suite (workloads of 7 µs to 9.8 ms, well above noise) showing −8.80 % aggregate across 41 schemas, 39/41 schemas faster, 0 regressions. Per-schema attribution there is dominated by schemas with 8–15-byte property names (where the bounce-buffer path replaces the compiler's 8+4+2+1 cascade), which is the path v5 actually changes.

Apologies for the false alarm.

@SouptikH
Copy link
Copy Markdown
Collaborator Author

Closing per @: reopening as a non-fork branch (now an external collaborator) so CI runs the perf-diff comments.

@SouptikH SouptikH closed this May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants