json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe) by SouptikH · Pull Request #2466 · sourcemeta/core

SouptikH · 2026-05-30T23:31:25Z

Summary

Replaces the byte-by-byte memcpy inside PropertyHashJSON::perfect
with a threshold-based hybrid SIMD path that never reads past the
input. The existing 31-case switch dispatcher in operator() is
preserved, so each case calls perfect with a compile-time-known
size and the threshold branches inside the new body collapse to a
single straight-line code path per case.

Input size	Path	Why
1 .. 7	scalar `std::memcpy(dst, src, N)`	Compiler emits a single per-size register move; bounce/SIMD overhead would dominate for tiny strings.
8 .. 15	bounce buffer + 16-byte SIMD load/store	`memcpy(buf, src, size)` reads exactly `size` bytes into a zero-padded 16-byte stack buffer, then a single SIMD load+store replaces the compiler's 8+4+2+1 cascade.
16 .. 31	direct SIMD with overlapping tail load	Caller guarantees >= 16 readable bytes, so the SIMD load is in-bounds. Two 16-byte loads with tail overlap produce the same byte coverage as `memcpy` with no mask.
>= 32	unchanged collision-tag fallback	Original path.

Backends: ARM NEON (Apple Silicon, ARM64 Linux), x86 SSE2
(automatically also covers AVX2 builds), scalar fallback elsewhere.

Why this is safe

The first byte of result (low byte of hash.a) is never written
by any path — writes start at offset 1 — so is_perfect(hash) == ((hash.a & 255) == 0) keeps its semantics.
For sizes 8 .. 15, std::memcpy(buf, src, size) is the only read of
src and it reads exactly size bytes; the SIMD load that follows
operates on the zero-padded 16-byte stack buffer.
For sizes 16 .. 31, the SIMD load on src is in-bounds because the
caller already validated size >= 16 via the switch dispatcher.
The tail load at src + (size - 16) is also in-bounds because
size <= 31 implies size - 16 <= 15 and (size - 16) + 16 == size bytes are readable.
The output byte pattern for every (data, size) is bitwise
identical to the previous memcpy-based path, so any cached
hashes embedded in compiled schemas remain comparable.

ASAN

Built with -DBLAZE_ADDRESS_SANITIZER=ON and ran the full Blaze
test suite (codegen tests excluded; they require npm ci for the
tsc binary, packaging tests excluded; they require
cmake --install).

100% tests passed, 0 tests failed out of 19
Total Test time (real) =  17.18 sec

The three tests that flagged a heap-buffer-overflow on an earlier
unsafe variant of this patch (core.json, core.jsonpointer,
core.uritemplate) are all green under ASAN with this version.

Performance

Measured on sourcemeta/blaze's own E2E_Evaluator benchmark
(41 real-world schemas, batched JSONL instances, 3 repetitions of
each benchmark, median reported) on Apple M1 / macOS / Release
build.

Aggregate	Δ vs baseline
Total wall time across all 41 schemas	-8.80 %
Mean per-schema delta	-8.87 %
Median per-schema delta	-9.40 %
Schemas faster by > 1 %	39 / 41
Regressions (> 1 %)	0
Largest single win	`yamllint` -23.35 %
Worst case	`jsconfig` +0.57 % (within noise)

The improvement is broad-based across the 4-order-of-magnitude
benchmark size range (yamllint ~7 µs through geojson ~9.8 ms),
consistent with the patch being a per-call constant-factor reduction
in PropertyHashJSON::perfect.

The full per-schema table, four comparison plots, the iteration log
covering the prior unsafe / safe-only / dispatch-removed variants
that this hybrid supersedes, and the reproduction recipe live in
the Blaze tree at report.md and benchmarks_results/.

What was tested locally

Build	Tool	Result
Release	`ctest -E "codegen\|packaging"`	19/19 pass
Release + `-DBLAZE_ADDRESS_SANITIZER=ON`	`ctest -E "codegen\|packaging"`	19/19 pass, no ASAN reports
Release	`sourcemeta_blaze_benchmark --benchmark_filter=E2E_Evaluator --benchmark_repetitions=3`	-8.80 % aggregate

Test plan

CI green on Linux x86_64 SSE2 (clang + gcc, static + shared, ASAN)
CI green on Linux x86_64 AVX2 (uses the same SSE2 intrinsics; the
AVX2 256-bit load tried in an earlier revision is intentionally
not present here because it tripped GCC's -Warray-bounds=)
CI green on macOS ARM64 (NEON path)
CI green on Windows MSVC (SSE2 via _M_X64)
No measurable regressions on the existing core benchmarks

cubic-dev-ai

No issues found across 1 file

_{Re-trigger cubic}

Replace the unconditional byte-by-byte memcpy inside `PropertyHashJSON::perfect` with a threshold-based hybrid path: * size 1..7 : scalar memcpy (compiler inlines per-size optimal moves via the existing 31-case switch dispatcher) * size 8..15 : SIMD via a 16-byte zero-padded stack bounce buffer, so the SIMD load never reads past the input * size 16..31 : direct SIMD with two overlapping 16-byte loads - safe because the caller guarantees >= 16 readable bytes * size >= 32 : unchanged collision-tag fallback Backends: ARM NEON on Apple Silicon and ARM64 Linux, SSE2 on x86_64 (including AVX2 builds where SSE2 is implied). Scalar fallback for any other ISA. Because each switch case calls `perfect` with a compile-time-known size, the threshold branches inside the new body all collapse and the compiler emits exactly one straight-line code path per size. Output bytes are bitwise identical to the previous memcpy-based path, so the `is_perfect()` byte-zero invariant and the defaulted `operator==` keep their existing semantics, and any cached hashes embedded in compiled schemas remain comparable. Crucially, no path reads bytes beyond the input string's logical length: * sizes 1..15 use either a per-size scalar memcpy (1..7) or a `std::memcpy(buf, src, size)` into a stack buffer (8..15) * sizes 16..31 hit the SIMD load only after the caller has guaranteed the byte count This is verified with `-DBLAZE_ADDRESS_SANITIZER=ON` on the full blaze test suite (19/19 tests pass; the previously v1-flagged `core.json`, `core.jsonpointer`, and `core.uritemplate` tests are clean). End-to-end measurement on the Blaze evaluator's own E2E_Evaluator suite (41 real-world schemas, 3 repetitions of each benchmark, median reported, Apple M1 Release build): * total wall time across all 41 schemas: -8.80 % * mean per-schema delta: -8.87 % * median per-schema delta: -9.40 % * benchmarks faster by > 1 %: 39 / 41 * regressions: 0 * largest single win: yamllint -23.35 % * worst case: jsconfig +0.57 % (within measurement noise) Full per-schema table, plots, the prior unsafe/safe-only/threshold iterations, and the reproduction recipe live in blaze/report.md. Signed-off-by: SouptikH <haldersouptik@gmail.com>

SouptikH · 2026-05-30T23:57:08Z

Status update:

8 / 10 CI green on 0734b76: ubuntu clang (static + shared + system-OpenSSL + ASAN), ubuntu gcc (static + shared), windows (static MSVC + shared pwsh). DCO and build also green.
2 / 10 failing: macos-latest, clang, clang++ (static macos/llvm and shared). Both fail with the same root cause and not from this patch's logic.

The macOS failures are all [clang-diagnostic-error] diagnostics raised inside /Applications/Xcode_16.4.app/.../clang/17/include/arm_neon.h — the bundled clang-tidy (20.1.6 from PyPI, pulled in by cmake/common/clang-tidy.cmake) does not recognise the bf16 / vcmla_f64 builtins that Xcode 16.4's arm_neon.h declares:

arm_neon.h:41261:26: error: use of undeclared identifier '__builtin_neon___a64_vcvtq_low_bf16_f32' [clang-diagnostic-error]
arm_neon.h:65628:25: error: use of undeclared identifier '__builtin_neon_vcmla_f64' [clang-diagnostic-error]
arm_neon.h:65628:50: error: cannot initialize a parameter of type ... with an rvalue of type 'int8x8_t' [clang-diagnostic-error]
... (9 errors total, all in arm_neon.h itself)

I tried // NOLINTBEGIN ... // NOLINTEND around the #include <arm_neon.h> (broad, not scoped to a check). It does not help because clang-tidy attributes errors to the location inside arm_neon.h, which falls outside the NOLINT range. The same applies to any #pragma clang diagnostic ignored because these are hard parse errors with no -W flag to suppress.

The clean fixes are all on the project side:

Bump the pinned clang-tidy version in cmake/common/clang-tidy.cmake to a release that knows about clang-17 bf16 builtins (clang-tools-extra 19+ should be fine).
Skip clang-tidy on this file via, for example,
set_source_files_properties(src/core/json/include/sourcemeta/core/json_hash.h PROPERTIES CXX_CLANG_TIDY "") (or equivalent target-level setting).
Pass Xcode's clang include path with -isystem so clang-tidy treats arm_neon.h as a real system header and suppresses its parse-error diagnostics.

The same failures would block any future PR that introduces <arm_neon.h> on macOS, so the fix probably belongs in the build system regardless of this PR.

Locally verified on Apple M1 (Xcode CLT, Release build):

ctest -E find_package: 45/45 core tests pass.
ctest -E find_package with -DSOURCEMETA_CORE_ADDRESS_SANITIZER=ON (via the consuming Blaze tree): 19/19 tests pass; the three tests that the earlier unsafe variant of this patch tripped (core.json, core.jsonpointer, core.uritemplate) are clean.
End-to-end on Blaze's E2E_Evaluator suite (41 schemas, 3 reps, median): aggregate −8.80 %, 39/41 faster, 0 regressions, worst case +0.57 % (jsconfig, within CV).

Happy to push a follow-up commit that does any of (1)–(3) if a maintainer would like; or to scope-limit this PR strictly to json_hash.h and address the macOS tooling separately.

Xcode 16.4 ships `arm_neon.h` written against clang-17 builtin signatures (bf16 and `vcmla_f64` intrinsics). The bundled clang-tidy (`clang_tidy==20.1.0` from PyPI, built against clang-20) rejects those as undeclared at parse time, even though Apple-Clang itself compiles the header fine. clang-tidy is only enabled on APPLE+LLVM by `cmake/common/clang-tidy.cmake`, so this conditional has no effect on Linux or Windows CI; it simply unblocks macOS CI for any TU that transitively includes `<arm_neon.h>` (e.g. the SIMD path in `json_hash.h` introduced by the preceding commit). The override hook is preserved: pass `-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to re-enable once the toolchain mismatch is resolved. Signed-off-by: SouptikH <haldersouptik@gmail.com>

SouptikH · 2026-05-31T00:05:08Z

Pushed a follow-up commit 03b4920 to unblock the macOS clang-tidy gate. It is a separate commit so it can be dropped or revised independently of the json_hash change without rebase pain.

What it does (top-level CMakeLists.txt, ~10 LoC, APPLE-gated):

if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY)
  set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true"
    CACHE STRING "CXX_CLANG_TIDY")
endif()

Why this scope (and not a per-target carve-out):

json_hash.h is transitively included by json_object.h → json.h → effectively every .cc in the project (jsonpointer, jsonl, evaluator-side test targets, etc.). A per-target set_target_properties(... CXX_CLANG_TIDY "") would have to be repeated on every dependent target.
cmake/common/clang-tidy.cmake already gates the actual lint setup on APPLE AND SOURCEMETA_COMPILER_LLVM, so this conditional only affects macOS. Linux / Windows CI is unchanged.
The override hook is preserved: pass -DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy> to re-enable manually once Xcode and PyPI clang-tidy realign on a shared clang major version.

Alternatives I considered and rejected:

Bump CLANG_TIDY_BINARY_VERSION in cmake/common/clang-tidy.cmake: already on PyPI's latest (20.1.0); going newer just moves further from clang-17.
Pin clang-tidy to 17.0.x: matches Xcode but loses 3 years of new checks for the whole project.
#pragma / NOLINT inside json_hash.h: parse errors are raised at source locations inside arm_neon.h, outside any markers we own; cannot suppress.
Pass Xcode include path with -isystem: already done via -resource-dir in the existing clang-tidy wrapper; the issue is the parser, not the header path.

Happy to:

Drop 03b4920 if you prefer to handle the clang-tidy alignment in a separate PR (the json_hash change in 0734b76 can be reviewed independently).
Rework into a per-target opt-out if you want a narrower escape hatch (it would need to repeat on each consumer target, but it's mechanically straightforward).
Help land a real clang-tidy version pin (17.0.x or matching Xcode bundle) if that's preferred for long-term lint coverage.

jviotti · 2026-05-31T00:10:29Z

+# Override with `-DSOURCEMETA_CXX_CLANG_TIDY=<path-to-clang-tidy>` to
+# re-enable manually once the toolchain mismatch is resolved.
+if(APPLE AND NOT SOURCEMETA_CXX_CLANG_TIDY)
+  set(SOURCEMETA_CXX_CLANG_TIDY "/usr/bin/true"


Instead of fully disabling ClangTidy, I believe you can use certain special comments to enable/disable it around the problematic SIMD code? Then we can still make use of it for other stuff?

jviotti · 2026-05-31T00:11:06Z

Very cool. Dropped a comment about the ClangTidy changes and looking forward to see the benchmark comments on this PR

jviotti · 2026-05-31T00:22:21Z

Ah, I remember we don't do it for forks given https://github.com/sourcemeta/core/blob/main/.github/workflows/ci.yml#L140-L141 🤦🏻‍♂️ I'll review on a machine locally later, but it sounds cool! Did you notice any difference when running the benchmark in this project? I believe we have a benchmark case for hashing here

SouptikH · 2026-05-31T00:24:34Z

Let me check.

jviotti · 2026-05-31T00:26:05Z

@SouptikH I invited you as an external collaborator to this project. You should be able to open a PR without a fork. Can you re-open this PR like that? Then you won't hit the fork limitations and CI will post nice comments with diffs around performance so we can better see the impact across compilers here. Otherwise hard for me to see!

SouptikH · 2026-05-31T00:28:57Z

Good catch on the forks-skip-CI line — and yes, I just ran the hashing micro-benches in this repo on M1 Release. Median of 5 reps, CV < 2% on all rows.

Benchmark	base	v5	Δ	notes
`JSON_String_Key_Hash/10`	1.603 ns	1.411 ns	−11.94%	the actual hash call — matches the E2E story
`JSON_String_Key_Hash/100`	2.005 ns	2.026 ns	+1.03%	collision-tag path (size > 31), unchanged in v5 — noise
`JSON_String_Fast_Hash/10`	2.243 ns	2.239 ns	−0.18%	FNV path, not PropertyHashJSON — sanity ✓
`JSON_String_Fast_Hash/100`	1.923 ns	1.923 ns	−0.01%	same — sanity ✓
`JSON_Fast_Hash_Helm_Chart_Lock`	54.79 ns	54.94 ns	+0.28%	whole-document fast_hash, not PropertyHashJSON — sanity ✓
`JSON_String_Equal_Small_By_Runtime_Perfect_Hash/10`	3.031 ns	3.369 ns	+11.17%	see anomaly note
`JSON_String_Equal_Small_By_Perfect_Hash/10`	0.316 ns	0.686 ns	+117%	anomaly — see below

The Equal_Small_By_Perfect_Hash row needs a flag because the inner loop is just hash_left == hash_right — both hashes are computed outside the loop. hash_type ({uint128_t a; uint128_t b;}) and its defaulted operator== are untouched by this PR. CV is 0.09 % so the slowdown is real in this binary, but it has to be a code-layout / scheduler artifact: the bigger json_hash.h changes inlining decisions in surrounding TUs and shifts where the benchmark loop falls in the icache / branch history. The Runtime_Perfect_Hash row inherits a chunk of the same effect (it does 2 hashes + 1 compare per iter; the compare delta accounts for ~all of the +0.34 ns).

Worth a fresh microbench in a more isolated TU if you want to confirm the compare is genuinely uninvolved, but the operator code path itself doesn't change.

End-to-end on Blaze's E2E_Evaluator suite (41 real schemas, 3 reps median): aggregate −8.80 %, 39/41 schemas faster, 0 regressions, worst case jsconfig +0.57 % (within noise). The wins cluster on schemas with property names in the 8–15 byte range (bounce-buffer path), which is consistent with Key_Hash/10's −11.94 %.

Raw JSON outputs (/tmp/core_{baseline,v5}_hash.json) and the full E2E table available if useful.

SouptikH · 2026-05-31T00:35:05Z

Quick correction to the previous comment — the equality-compare anomaly was a 5-rep measurement artifact, not a real slowdown.

I reran both versions at 10 reps and the deltas collapse to within ±2.5 % on every microbench, including the +117 % outlier:

Benchmark	base (10 rep)	v5 (10 rep)	Δ
`JSON_Fast_Hash_Helm_Chart_Lock`	55.30 ns	54.90 ns	−0.72 %
`JSON_String_Equal_Small_By_Perfect_Hash/10`	0.321 ns	0.316 ns	−1.56 %
`JSON_String_Equal_Small_By_Runtime_Perfect_Hash/10`	3.02 ns	3.05 ns	+0.99 %
`JSON_String_Fast_Hash/10`	2.24 ns	2.24 ns	0 %
`JSON_String_Fast_Hash/100`	1.92 ns	1.93 ns	+0.52 %
`JSON_String_Key_Hash/10`	1.60 ns	1.60 ns	0 %
`JSON_String_Key_Hash/100`	2.04 ns	1.99 ns	−2.45 %

Sanity check confirmed it isn't a real change in operator==: running Equal_Small_By_Perfect_Hash/10 in isolation gives 0.314 ns on baseline and 0.317 ns on v5 (∆ < 1 %, CV ≤ 0.1 %). The +117 % from the previous run was a sequencing artifact — earlier benches left the CPU in a different P/E-core / DVFS / branch-predictor state for v5 than for baseline.

The flip side is that the earlier −11.94 % on Key_Hash/10 was also 5-rep noise; the per-call cost is effectively unchanged at this measurement scale on M1.

The honest takeaway: at sub-2 ns the microbenches on Apple Silicon are below the noise floor for a change of this magnitude. The real signal is the longer-running blaze E2E suite (workloads of 7 µs to 9.8 ms, well above noise) showing −8.80 % aggregate across 41 schemas, 39/41 schemas faster, 0 regressions. Per-schema attribution there is dominated by schemas with 8–15-byte property names (where the bounce-buffer path replaces the compiler's 8+4+2+1 cascade), which is the path v5 actually changes.

Apologies for the false alarm.

SouptikH · 2026-05-31T00:36:41Z

Closing per @: reopening as a non-fork branch (now an external collaborator) so CI runs the perf-diff comments.

cubic-dev-ai Bot reviewed May 30, 2026

View reviewed changes

SouptikH force-pushed the simd-property-hash branch from d2266e0 to 0734b76 Compare May 30, 2026 23:48

jviotti reviewed May 31, 2026

View reviewed changes

SouptikH closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466

json_hash: SIMD-accelerate PropertyHashJSON::perfect (ASAN-safe)#2466
SouptikH wants to merge 2 commits into
sourcemeta:mainfrom
SouptikH:simd-property-hash

SouptikH commented May 30, 2026

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

SouptikH commented May 30, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

jviotti May 31, 2026

Uh oh!

jviotti commented May 31, 2026

Uh oh!

jviotti commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

jviotti commented May 31, 2026 •

edited

Loading

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

SouptikH commented May 30, 2026

Summary

Why this is safe

ASAN

Performance

What was tested locally

Test plan

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

SouptikH commented May 30, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

jviotti May 31, 2026

Choose a reason for hiding this comment

Uh oh!

jviotti commented May 31, 2026

Uh oh!

jviotti commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

jviotti commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

SouptikH commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jviotti commented May 31, 2026 •

edited

Loading