perf: v0.12.19 — zero-alloc captures, 95% less memory#152

Merged
kolkov merged 10 commits into main from release/v0.12.19
Mar 24, 2026

Conversation


@kolkov kolkov commented Mar 24, 2026

Summary

Major memory optimization release — Rust-aligned architecture for DFA cache and PikeVM capture tracking.

Performance Changes

  • Zero-alloc FindSubmatch via dual SlotTable — Rust-style flat SlotTable replaces per-thread COW captures. Stack-based epsilon closure with RestoreCapture frames. FindAllSubmatch: 554MB → 26MB (-95%), 3.3x faster
  • BoundedBacktracker visited limit — 256KB for UseNFA (Rust default). LangArena: 89MB → 25MB (-72%), RSS 353MB → 41MB (-88%)
  • Byte-based DFA cache: CacheCapacityBytes (2MB default) replaces MaxStates. Matches Rust hybrid_cache_capacity
  • Remove dual transition storage — State.transitions eliminated, flatTrans only

Memory Impact (Kostya LangArena, 13 patterns, 7MB log)

| Metric | v0.12.18 | v0.12.19 | Improvement |
|---|---|---|---|
| Total alloc (FindAll) | 89 MB | 25 MB | -72% |
| RSS | 353 MB | 41 MB | -88% |
| FindAllSubmatch (50K matches) | 554 MB | 26 MB | -95% |
| Speed | 113-126 ms | 115-125 ms | No regression |

Documentation

Test plan

  • All tests pass locally (11 packages, 0 failures)
  • golangci-lint run — 0 issues
  • gofmt -l . — clean
  • Stdlib compatibility: 38/38 patterns match
  • LangArena correctness: 13/13 patterns match stdlib
  • CI: Tests + Benchmarks + Lint + 386
  • regex-bench: Go + Rust comparison on EPYC

kolkov added 10 commits March 24, 2026 15:55
Remove transitions []StateID and transitionCount from State struct.
Transitions now stored exclusively in DFACache.flatTrans flat table.

- Remove State.AddTransition(), Transition(), Stride(), TransitionCount()
- Remove Builder.move() (unused after DetectAcceleration simplification)
- Simplify DetectAcceleration/DetectAccelerationFromCached to return nil
- Add DetectAccelerationFromFlat() reading from flat table
- Simplify tryDetectAccelerationWithCache (flatTrans-only path)
- Remove 3 redundant AddTransition calls from determinize
- Update tests: add TestDetectAccelerationFromFlat, remove State transition tests

Memory: ~222MB -> ~150MB (eliminates redundant per-state transition slices)
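
The flat-table layout this commit moves to can be sketched as below. All names (flatDFA, stride, trans) are illustrative, not the library's actual types: the point is that every state's transition row lives in one contiguous slice indexed by stateID*stride + byteClass, so per-state slices (and their allocation overhead) disappear.

```go
package main

import "fmt"

// flatDFA stores all transitions in a single slice: row i holds the
// transitions for state i, one cell per byte class. -1 means "no transition".
type flatDFA struct {
	stride int     // number of byte classes per state
	trans  []int32 // len = numStates * stride
}

func newFlatDFA(numStates, stride int) *flatDFA {
	t := make([]int32, numStates*stride)
	for i := range t {
		t[i] = -1
	}
	return &flatDFA{stride: stride, trans: t}
}

func (d *flatDFA) set(from int32, class int, to int32) {
	d.trans[int(from)*d.stride+class] = to
}

func (d *flatDFA) next(from int32, class int) int32 {
	return d.trans[int(from)*d.stride+class]
}

func main() {
	d := newFlatDFA(3, 4)
	d.set(0, 1, 2)
	fmt.Println(d.next(0, 1), d.next(0, 2)) // 2 -1
}
```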
Add NewBoundedBacktrackerSmall() with 128K entries (256KB) visited
capacity, matching Rust regex's default visited_capacity.

UseNFA path now creates BT with small limit. When haystack exceeds
BT capacity, falls back to PikeVM (correct for leftmost-first).
UseBoundedBacktracker strategy retains 32M limit for POSIX longest-match.

LangArena LogParser (7MB log, 13 patterns, 10 iterations):
- Total alloc: 89MB -> 25MB (-72%)
- RSS (Sys): 353MB -> 41MB (-88%)
- errors pattern: 66MB -> 2.4MB (-96%)
- Speed: no regression (113-126ms per iter)
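
The capacity gate that decides between the bounded backtracker and the PikeVM fallback amounts to a simple product check; the function name and the exact bookkeeping below are assumptions for illustration, but the invariant (one visited entry per NFA state x haystack offset pair, bounded by a fixed budget) matches the description above.

```go
package main

import "fmt"

// fitsInBacktracker reports whether a bounded backtracker with the given
// visited capacity (in entries) can handle this search: the backtracker
// marks one visited entry per (NFA state, haystack offset) pair, so the
// search fits only if that product stays within the budget.
func fitsInBacktracker(numStates, haystackLen, capacity int) bool {
	return numStates*(haystackLen+1) <= capacity
}

func main() {
	const capacity = 128 * 1024 // 128K entries, per the commit message
	fmt.Println(fitsInBacktracker(50, 1000, capacity))   // small input: use backtracker
	fmt.Println(fitsInBacktracker(50, 100000, capacity)) // too large: fall back to PikeVM
}
```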
- Remove dual transition storage (State.transitions eliminated)
- Rust-aligned BT visited limit for UseNFA (128K entries = 256KB)
- Memory: 89MB -> 25MB alloc (-72%), RSS 353MB -> 41MB (-88%)
Replace MaxStates (count) with CacheCapacityBytes (bytes).
Default: 2MB matching Rust regex's hybrid_cache_capacity.

- Add DFACache.MemoryUsage() (mirrors Rust Cache::memory_usage)
- Insert checks MemoryUsage() >= capacityBytes instead of state count
- Config: CacheCapacityBytes (new), MaxStates (deprecated, backward compat)
- Self-adjusting: fewer states for large stride, more for small
- effectiveCapacityBytes() bridges legacy MaxStates to bytes (~100B/state)
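
The self-adjusting behavior can be illustrated with a toy model. The types, the 4-bytes-per-cell figure, and the ~32B fixed per-state overhead below are assumptions, not the library's real accounting; the sketch only shows why a byte budget admits fewer states when the stride is wide and more when it is narrow.

```go
package main

import "fmt"

// dfaCache models a lazy-DFA cache gated on estimated memory, not state count.
type dfaCache struct {
	capacityBytes int
	stride        int // transition cells per state
	states        int
}

// memoryUsage estimates bytes held: 4 bytes per transition cell plus an
// assumed ~32B of fixed per-state overhead.
func (c *dfaCache) memoryUsage() int {
	return c.states * (c.stride*4 + 32)
}

// tryInsert admits a new state only while the estimate is under budget;
// when it returns false, the caller clears the cache or falls back.
func (c *dfaCache) tryInsert() bool {
	if c.memoryUsage() >= c.capacityBytes {
		return false
	}
	c.states++
	return true
}

func main() {
	wide := &dfaCache{capacityBytes: 2 << 20, stride: 256} // 2MB budget, wide rows
	narrow := &dfaCache{capacityBytes: 2 << 20, stride: 16}
	for wide.tryInsert() {
	}
	for narrow.tryInsert() {
	}
	// Same byte budget, but the narrow-stride cache holds far more states.
	fmt.Println(wide.states < narrow.states) // true
}
```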
SearchWithSlotTableCapturesAt now uses SlotTable instead of legacy COW.
Works for simple patterns like (foo)(bar), but greedy repetitions
(a+)(b+) lose group start positions during loop iterations.

Root cause: addSearchThread CopySlots overwrites capture slots on each
loop iteration. Need stack-based epsilon closure with RestoreCapture
frames (Rust approach) to preserve capture context through loops.

TODO: Convert recursive addSearchThread to stack-based with save/restore
Status: 2 NFA unit test failures, all meta tests pass (meta still on COW)
Converted addSearchThread and addSearchThreadToNext from recursive to
stack-based with captureFrame (Explore + RestoreCapture frames).
Mirrors Rust pikevm.rs FollowEpsilon::RestoreCapture pattern.

Still failing: greedy loop captures (a+)(b+) — per-state SlotTable
overwrites group start on each loop iteration (State visited again
in next generation). Per-thread COW preserves all variants.

Root issue: per-state storage loses capture history across byte
transitions in greedy loops. Need either per-thread indexing or
generation-aware slot preservation.

Status: 2 NFA unit tests fail, all meta tests pass
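
The Explore/RestoreCapture mechanism can be sketched in miniature. The instruction set and every name below are illustrative (this is not the library's real NFA representation): a capture instruction pushes a restore frame recording the slot's old value before overwriting it, so when the stack unwinds out of that branch, sibling branches see the original slot, which is exactly the context that a naive per-state overwrite loses.

```go
package main

import "fmt"

// inst is a toy NFA instruction: a split (two epsilon branches),
// a capture (record current position into a slot), or a match.
type inst struct {
	op   string // "split", "capture", "match"
	x, y int    // split targets
	slot int    // capture slot index
}

// frame is either an Explore frame (visit state) or a RestoreCapture
// frame (undo a slot write when backtracking out of a branch).
type frame struct {
	restore bool
	state   int // Explore: state to visit
	slot    int // RestoreCapture: slot to restore
	old     int // RestoreCapture: previous slot value
}

// closure walks the epsilon closure of start with an explicit stack,
// returning the value of slot 0 observed at each reachable match state.
func closure(prog []inst, start, pos int, slots []int, visited []bool) []int {
	var matches []int
	stack := []frame{{state: start}}
	for len(stack) > 0 {
		f := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if f.restore {
			slots[f.slot] = f.old // undo the capture on the way back out
			continue
		}
		s := f.state
		if visited[s] {
			continue
		}
		visited[s] = true
		switch prog[s].op {
		case "split":
			// push y first so x (the preferred branch) is explored first
			stack = append(stack, frame{state: prog[s].y}, frame{state: prog[s].x})
		case "capture":
			// restore frame sits below the Explore frame, so it fires
			// only after the whole branch under s+1 has been explored
			stack = append(stack, frame{restore: true, slot: prog[s].slot, old: slots[prog[s].slot]})
			slots[prog[s].slot] = pos
			stack = append(stack, frame{state: s + 1})
		case "match":
			matches = append(matches, slots[0])
		}
	}
	return matches
}

func main() {
	// split -> (capture slot 0 -> match) | match
	prog := []inst{
		{op: "split", x: 1, y: 3},
		{op: "capture", slot: 0},
		{op: "match"},
		{op: "match"},
	}
	slots := []int{-1}
	visited := make([]bool, len(prog))
	// First branch sees the capture (5); second sees the restored -1.
	fmt.Println(closure(prog, 0, 5, slots, visited)) // [5 -1]
}
```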
Implement Rust-style dual SlotTable (curr/next) for capture propagation
across byte transitions. Stack-based epsilon closure with RestoreCapture
frames preserves capture context through greedy loops.

Key changes:
- Add NextSlotTable + captureStack + currSlots to PikeVMState
- addSearchThread: stack-based with captureFrame (Explore + RestoreCapture)
- addSearchThreadToNext: loads from curr SlotTable, writes to next
- Swap SlotTable/NextSlotTable after each byte (Rust mem::swap pattern)
- Don't clear Visited before seed — prevents SlotTable row overwrite
- Wire meta FindSubmatch to use SlotTable path
- Fix empty match capture groups (buildCapturesFromSlots)

FindAllSubmatch (5 patterns, 50K matches, 800KB input):
- Alloc: 554MB -> 26MB (-95%)
- Mallocs: 12.5M -> 440K (-96%)
- Time: 1.48s -> 0.45s (3.3x faster)
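
The curr/next table discipline described above can be sketched as follows; the types and names are assumptions for illustration. Each table holds one slot row per NFA state, threads read their slots from curr and write into next as they cross a byte, and the two tables are swapped afterward, so the whole search reuses two fixed buffers instead of copy-on-write per-thread captures.

```go
package main

import "fmt"

// slotTable stores capture slots for one generation of threads:
// one fixed-size row of slots per NFA state, in a single backing slice.
type slotTable struct {
	slotsPerState int
	table         []int // numStates * slotsPerState
}

func newSlotTable(numStates, slotsPerState int) *slotTable {
	return &slotTable{
		slotsPerState: slotsPerState,
		table:         make([]int, numStates*slotsPerState),
	}
}

// row returns the slot row for a state as a subslice (no allocation).
func (t *slotTable) row(state int) []int {
	start := state * t.slotsPerState
	return t.table[start : start+t.slotsPerState]
}

func main() {
	curr := newSlotTable(4, 2)
	next := newSlotTable(4, 2)

	copy(curr.row(1), []int{3, 7}) // a thread at state 1 has captured [3, 7]
	copy(next.row(2), curr.row(1)) // byte transition 1 -> 2 carries slots forward

	curr, next = next, curr // per-byte swap (the Rust mem::swap pattern)
	_ = next
	fmt.Println(curr.row(2)) // [3 7]
}
```

The swap is what makes the scheme zero-alloc: both tables are allocated once at search start and then only overwritten in place.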
- CHANGELOG: add SlotTable capture tracking entry
- OPTIMIZATIONS: add #10 Dual SlotTable (95% less memory), update version
- ARCHITECTURE.md: new file documenting engine architecture, memory model,
  thread safety, and Rust alignment
… SlotTable

- Dual SlotTable (curr/next) capture tracking (Rust approach)
- Stack-based epsilon closure with RestoreCapture frames
- FindAllSubmatch: 554MB -> 26MB (-95%), 3.3x faster
- Updated ARCHITECTURE.md, OPTIMIZATIONS.md, README.md

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 73.42105% with 101 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| nfa/pikevm.go | 70.86% | 79 Missing and 9 partials ⚠️ |
| nfa/backtrack.go | 16.66% | 5 Missing ⚠️ |
| dfa/lazy/config.go | 69.23% | 4 Missing ⚠️ |
| dfa/lazy/builder.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| dfa/lazy/lazy.go | 93.75% | 0 Missing and 1 partial ⚠️ |
| meta/findall.go | 50.00% | 1 Missing ⚠️ |


@github-actions

Benchmark Comparison

Comparing main → PR #152

Summary: geomean 118.8n 115.5n -2.78%

⚠️ Potential regressions detected:

LazyDFASimpleLiteral-4     55.68n ± ∞ ¹   56.28n ± ∞ ¹  +1.08% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_short-4                 8.112n ± ∞ ¹    8.751n ± ∞ ¹    +7.88% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_medium-4                9.986n ± ∞ ¹   10.630n ± ∞ ¹    +6.45% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_long-4                  10.02n ± ∞ ¹    10.66n ± ∞ ¹    +6.39% (p=0.008 n=5)
AnchoredLiteralVsStdlib/coregex_no_match-4              5.611n ± ∞ ¹    5.917n ± ∞ ¹    +5.45% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

@kolkov kolkov merged commit ab4039b into main Mar 24, 2026
8 of 9 checks passed