perf: v0.12.19 — zero-alloc captures, 95% less memory#152

Merged
kolkov merged 10 commits into main from release/v0.12.19
Mar 24, 2026

Conversation


@kolkov kolkov commented Mar 24, 2026

Summary

Major memory optimization release — Rust-aligned architecture for DFA cache and PikeVM capture tracking.

Performance Changes

  • Zero-alloc FindSubmatch via dual SlotTable — Rust-style flat SlotTable replaces per-thread COW captures. Stack-based epsilon closure with RestoreCapture frames. FindAllSubmatch: 554MB → 26MB (-95%), 3.3x faster
  • BoundedBacktracker visited limit — 256KB for UseNFA (Rust default). LangArena: 89MB → 25MB (-72%), RSS 353MB → 41MB (-88%)
  • Byte-based DFA cache: CacheCapacityBytes (2MB default) replaces MaxStates. Matches Rust hybrid_cache_capacity
  • Remove dual transition storage — State.transitions eliminated, flatTrans only

Memory Impact (Kostya LangArena, 13 patterns, 7MB log)

| Metric | v0.12.18 | v0.12.19 | Improvement |
|---|---|---|---|
| Total alloc (FindAll) | 89 MB | 25 MB | -72% |
| RSS | 353 MB | 41 MB | -88% |
| FindAllSubmatch (50K matches) | 554 MB | 26 MB | -95% |
| Speed | 113-126 ms | 115-125 ms | No regression |

Documentation

Test plan

  • All tests pass locally (11 packages, 0 failures)
  • golangci-lint run — 0 issues
  • gofmt -l . — clean
  • Stdlib compatibility: 38/38 patterns match
  • LangArena correctness: 13/13 patterns match stdlib
  • CI: Tests + Benchmarks + Lint + 386
  • regex-bench: Go + Rust comparison on EPYC

kolkov added 10 commits March 24, 2026 15:55
Remove transitions []StateID and transitionCount from State struct.
Transitions now stored exclusively in DFACache.flatTrans flat table.

- Remove State.AddTransition(), Transition(), Stride(), TransitionCount()
- Remove Builder.move() (unused after DetectAcceleration simplification)
- Simplify DetectAcceleration/DetectAccelerationFromCached to return nil
- Add DetectAccelerationFromFlat() reading from flat table
- Simplify tryDetectAccelerationWithCache (flatTrans-only path)
- Remove 3 redundant AddTransition calls from determinize
- Update tests: add TestDetectAccelerationFromFlat, remove State transition tests

Memory: ~222MB -> ~150MB (eliminates redundant per-state transition slices)
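
The flat-table layout this commit moves to can be sketched as below. All names (flatDFA, stride, trans) are illustrative, not the library's actual types: the point is that every state's transition row lives in one contiguous slice indexed by stateID*stride + byteClass, so per-state slices (and their allocation overhead) disappear.

```go
package main

import "fmt"

// flatDFA stores all transitions in a single slice: row i holds the
// transitions for state i, one cell per byte class. -1 means "no transition".
type flatDFA struct {
	stride int     // number of byte classes per state
	trans  []int32 // len = numStates * stride
}

func newFlatDFA(numStates, stride int) *flatDFA {
	t := make([]int32, numStates*stride)
	for i := range t {
		t[i] = -1
	}
	return &flatDFA{stride: stride, trans: t}
}

func (d *flatDFA) set(from int32, class int, to int32) {
	d.trans[int(from)*d.stride+class] = to
}

func (d *flatDFA) next(from int32, class int) int32 {
	return d.trans[int(from)*d.stride+class]
}

func main() {
	d := newFlatDFA(3, 4)
	d.set(0, 1, 2)
	fmt.Println(d.next(0, 1), d.next(0, 2)) // 2 -1
}
```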
Add NewBoundedBacktrackerSmall() with 128K entries (256KB) visited
capacity, matching Rust regex's default visited_capacity.

UseNFA path now creates BT with small limit. When haystack exceeds
BT capacity, falls back to PikeVM (correct for leftmost-first).
UseBoundedBacktracker strategy retains 32M limit for POSIX longest-match.

LangArena LogParser (7MB log, 13 patterns, 10 iterations):
- Total alloc: 89MB -> 25MB (-72%)
- RSS (Sys): 353MB -> 41MB (-88%)
- errors pattern: 66MB -> 2.4MB (-96%)
- Speed: no regression (113-126ms per iter)
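
The capacity gate that decides between the bounded backtracker and the PikeVM fallback amounts to a simple product check; the function name and the exact bookkeeping below are assumptions for illustration, but the invariant (one visited entry per NFA state x haystack offset pair, bounded by a fixed budget) matches the description above.

```go
package main

import "fmt"

// fitsInBacktracker reports whether a bounded backtracker with the given
// visited capacity (in entries) can handle this search: the backtracker
// marks one visited entry per (NFA state, haystack offset) pair, so the
// search fits only if that product stays within the budget.
func fitsInBacktracker(numStates, haystackLen, capacity int) bool {
	return numStates*(haystackLen+1) <= capacity
}

func main() {
	const capacity = 128 * 1024 // 128K entries, per the commit message
	fmt.Println(fitsInBacktracker(50, 1000, capacity))   // small input: use backtracker
	fmt.Println(fitsInBacktracker(50, 100000, capacity)) // too large: fall back to PikeVM
}
```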
- Remove dual transition storage (State.transitions eliminated)
- Rust-aligned BT visited limit for UseNFA (128K entries = 256KB)
- Memory: 89MB -> 25MB alloc (-72%), RSS 353MB -> 41MB (-88%)
Replace MaxStates (count) with CacheCapacityBytes (bytes).
Default: 2MB matching Rust regex's hybrid_cache_capacity.

- Add DFACache.MemoryUsage() (mirrors Rust Cache::memory_usage)
- Insert checks MemoryUsage() >= capacityBytes instead of state count
- Config: CacheCapacityBytes (new), MaxStates (deprecated, backward compat)
- Self-adjusting: fewer states for large stride, more for small
- effectiveCapacityBytes() bridges legacy MaxStates to bytes (~100B/state)
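
The self-adjusting behavior can be illustrated with a toy model. The types, the 4-bytes-per-cell figure, and the ~32B fixed per-state overhead below are assumptions, not the library's real accounting; the sketch only shows why a byte budget admits fewer states when the stride is wide and more when it is narrow.

```go
package main

import "fmt"

// dfaCache models a lazy-DFA cache gated on estimated memory, not state count.
type dfaCache struct {
	capacityBytes int
	stride        int // transition cells per state
	states        int
}

// memoryUsage estimates bytes held: 4 bytes per transition cell plus an
// assumed ~32B of fixed per-state overhead.
func (c *dfaCache) memoryUsage() int {
	return c.states * (c.stride*4 + 32)
}

// tryInsert admits a new state only while the estimate is under budget;
// when it returns false, the caller clears the cache or falls back.
func (c *dfaCache) tryInsert() bool {
	if c.memoryUsage() >= c.capacityBytes {
		return false
	}
	c.states++
	return true
}

func main() {
	wide := &dfaCache{capacityBytes: 2 << 20, stride: 256} // 2MB budget, wide rows
	narrow := &dfaCache{capacityBytes: 2 << 20, stride: 16}
	for wide.tryInsert() {
	}
	for narrow.tryInsert() {
	}
	// Same byte budget, but the narrow-stride cache holds far more states.
	fmt.Println(wide.states < narrow.states) // true
}
```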
SearchWithSlotTableCapturesAt now uses SlotTable instead of legacy COW.
Works for simple patterns like (foo)(bar), but greedy repetitions
(a+)(b+) lose group start positions during loop iterations.

Root cause: addSearchThread CopySlots overwrites capture slots on each
loop iteration. Need stack-based epsilon closure with RestoreCapture
frames (Rust approach) to preserve capture context through loops.

TODO: Convert recursive addSearchThread to stack-based with save/restore
Status: 2 NFA unit test failures, all meta tests pass (meta still on COW)
Converted addSearchThread and addSearchThreadToNext from recursive to
stack-based with captureFrame (Explore + RestoreCapture frames).
Mirrors Rust pikevm.rs FollowEpsilon::RestoreCapture pattern.

Still failing: greedy loop captures (a+)(b+) — per-state SlotTable
overwrites group start on each loop iteration (State visited again
in next generation). Per-thread COW preserves all variants.

Root issue: per-state storage loses capture history across byte
transitions in greedy loops. Need either per-thread indexing or
generation-aware slot preservation.

Status: 2 NFA unit tests fail, all meta tests pass
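
The Explore/RestoreCapture mechanism can be sketched in miniature. The instruction set and every name below are illustrative (this is not the library's real NFA representation): a capture instruction pushes a restore frame recording the slot's old value before overwriting it, so when the stack unwinds out of that branch, sibling branches see the original slot, which is exactly the context that a naive per-state overwrite loses.

```go
package main

import "fmt"

// inst is a toy NFA instruction: a split (two epsilon branches),
// a capture (record current position into a slot), or a match.
type inst struct {
	op   string // "split", "capture", "match"
	x, y int    // split targets
	slot int    // capture slot index
}

// frame is either an Explore frame (visit state) or a RestoreCapture
// frame (undo a slot write when backtracking out of a branch).
type frame struct {
	restore bool
	state   int // Explore: state to visit
	slot    int // RestoreCapture: slot to restore
	old     int // RestoreCapture: previous slot value
}

// closure walks the epsilon closure of start with an explicit stack,
// returning the value of slot 0 observed at each reachable match state.
func closure(prog []inst, start, pos int, slots []int, visited []bool) []int {
	var matches []int
	stack := []frame{{state: start}}
	for len(stack) > 0 {
		f := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if f.restore {
			slots[f.slot] = f.old // undo the capture on the way back out
			continue
		}
		s := f.state
		if visited[s] {
			continue
		}
		visited[s] = true
		switch prog[s].op {
		case "split":
			// push y first so x (the preferred branch) is explored first
			stack = append(stack, frame{state: prog[s].y}, frame{state: prog[s].x})
		case "capture":
			// restore frame sits below the Explore frame, so it fires
			// only after the whole branch under s+1 has been explored
			stack = append(stack, frame{restore: true, slot: prog[s].slot, old: slots[prog[s].slot]})
			slots[prog[s].slot] = pos
			stack = append(stack, frame{state: s + 1})
		case "match":
			matches = append(matches, slots[0])
		}
	}
	return matches
}

func main() {
	// split -> (capture slot 0 -> match) | match
	prog := []inst{
		{op: "split", x: 1, y: 3},
		{op: "capture", slot: 0},
		{op: "match"},
		{op: "match"},
	}
	slots := []int{-1}
	visited := make([]bool, len(prog))
	// First branch sees the capture (5); second sees the restored -1.
	fmt.Println(closure(prog, 0, 5, slots, visited)) // [5 -1]
}
```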
Implement Rust-style dual SlotTable (curr/next) for capture propagation
across byte transitions. Stack-based epsilon closure with RestoreCapture
frames preserves capture context through greedy loops.

Key changes:
- Add NextSlotTable + captureStack + currSlots to PikeVMState
- addSearchThread: stack-based with captureFrame (Explore + RestoreCapture)
- addSearchThreadToNext: loads from curr SlotTable, writes to next
- Swap SlotTable/NextSlotTable after each byte (Rust mem::swap pattern)
- Don't clear Visited before seed — prevents SlotTable row overwrite
- Wire meta FindSubmatch to use SlotTable path
- Fix empty match capture groups (buildCapturesFromSlots)

FindAllSubmatch (5 patterns, 50K matches, 800KB input):
- Alloc: 554MB -> 26MB (-95%)
- Mallocs: 12.5M -> 440K (-96%)
- Time: 1.48s -> 0.45s (3.3x faster)
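
The curr/next table discipline described above can be sketched as follows; the types and names are assumptions for illustration. Each table holds one slot row per NFA state, threads read their slots from curr and write into next as they cross a byte, and the two tables are swapped afterward, so the whole search reuses two fixed buffers instead of copy-on-write per-thread captures.

```go
package main

import "fmt"

// slotTable stores capture slots for one generation of threads:
// one fixed-size row of slots per NFA state, in a single backing slice.
type slotTable struct {
	slotsPerState int
	table         []int // numStates * slotsPerState
}

func newSlotTable(numStates, slotsPerState int) *slotTable {
	return &slotTable{
		slotsPerState: slotsPerState,
		table:         make([]int, numStates*slotsPerState),
	}
}

// row returns the slot row for a state as a subslice (no allocation).
func (t *slotTable) row(state int) []int {
	start := state * t.slotsPerState
	return t.table[start : start+t.slotsPerState]
}

func main() {
	curr := newSlotTable(4, 2)
	next := newSlotTable(4, 2)

	copy(curr.row(1), []int{3, 7}) // a thread at state 1 has captured [3, 7]
	copy(next.row(2), curr.row(1)) // byte transition 1 -> 2 carries slots forward

	curr, next = next, curr // per-byte swap (the Rust mem::swap pattern)
	_ = next
	fmt.Println(curr.row(2)) // [3 7]
}
```

The swap is what makes the scheme zero-alloc: both tables are allocated once at search start and then only overwritten in place.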
- CHANGELOG: add SlotTable capture tracking entry
- OPTIMIZATIONS: add #10 Dual SlotTable (95% less memory), update version
- ARCHITECTURE.md: new file documenting engine architecture, memory model,
  thread safety, and Rust alignment
… SlotTable

- Dual SlotTable (curr/next) capture tracking (Rust approach)
- Stack-based epsilon closure with RestoreCapture frames
- FindAllSubmatch: 554MB -> 26MB (-95%), 3.3x faster
- Updated ARCHITECTURE.md, OPTIMIZATIONS.md, README.md

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 73.42105% with 101 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| nfa/pikevm.go | 70.86% | 79 Missing and 9 partials ⚠️ |
| nfa/backtrack.go | 16.66% | 5 Missing ⚠️ |
| dfa/lazy/config.go | 69.23% | 4 Missing ⚠️ |
| dfa/lazy/builder.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| dfa/lazy/lazy.go | 93.75% | 0 Missing and 1 partial ⚠️ |
| meta/findall.go | 50.00% | 1 Missing ⚠️ |


@github-actions

Benchmark Comparison

Comparing main → PR #152

Summary: geomean 118.8n 115.5n -2.78%

⚠️ Potential regressions detected:

LazyDFASimpleLiteral-4     55.68n ± ∞ ¹   56.28n ± ∞ ¹  +1.08% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_short-4                 8.112n ± ∞ ¹    8.751n ± ∞ ¹    +7.88% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_medium-4                9.986n ± ∞ ¹   10.630n ± ∞ ¹    +6.45% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_long-4                  10.02n ± ∞ ¹    10.66n ± ∞ ¹    +6.39% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_no_match-4              5.611n ± ∞ ¹    5.917n ± ∞ ¹    +5.45% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

@kolkov kolkov merged commit ab4039b into main Mar 24, 2026
8 of 9 checks passed