
Commit ab4039b

perf: v0.12.19 — zero-alloc captures, 95% less memory (#152)
* perf: remove dual transition storage — State.transitions eliminated

  Remove transitions []StateID and transitionCount from State struct.
  Transitions now stored exclusively in DFACache.flatTrans flat table.

  - Remove State.AddTransition(), Transition(), Stride(), TransitionCount()
  - Remove Builder.move() (unused after DetectAcceleration simplification)
  - Simplify DetectAcceleration/DetectAccelerationFromCached to return nil
  - Add DetectAccelerationFromFlat() reading from flat table
  - Simplify tryDetectAccelerationWithCache (flatTrans-only path)
  - Remove 3 redundant AddTransition calls from determinize
  - Update tests: add TestDetectAccelerationFromFlat, remove State transition tests

  Memory: ~222MB -> ~150MB (eliminates redundant per-state transition slices)

* perf: Rust-aligned BT visited limit for UseNFA — 72% less memory

  Add NewBoundedBacktrackerSmall() with 128K entries (256KB) visited capacity,
  matching Rust regex's default visited_capacity. UseNFA path now creates BT
  with the small limit. When the haystack exceeds BT capacity, falls back to
  PikeVM (correct for leftmost-first). UseBoundedBacktracker strategy retains
  the 32M limit for POSIX longest-match.

  LangArena LogParser (7MB log, 13 patterns, 10 iterations):
  - Total alloc: 89MB -> 25MB (-72%)
  - RSS (Sys): 353MB -> 41MB (-88%)
  - errors pattern: 66MB -> 2.4MB (-96%)
  - Speed: no regression (113-126ms per iter)

* perf: byte-based DFA cache limit — 2MB default like Rust

  Replace MaxStates (count) with CacheCapacityBytes (bytes). Default: 2MB,
  matching Rust regex's hybrid_cache_capacity.

  - Add DFACache.MemoryUsage() (mirrors Rust Cache::memory_usage)
  - Insert checks MemoryUsage() >= capacityBytes instead of state count
  - Config: CacheCapacityBytes (new), MaxStates (deprecated, backward compat)
  - Self-adjusting: fewer states for large stride, more for small
  - effectiveCapacityBytes() bridges legacy MaxStates to bytes (~100B/state)

* wip: SlotTable-based capture search — greedy loop capture bug

  SearchWithSlotTableCapturesAt now uses SlotTable instead of legacy COW.
  Works for simple patterns like (foo)(bar), but greedy repetitions (a+)(b+)
  lose group start positions during loop iterations. Root cause:
  addSearchThread CopySlots overwrites capture slots on each loop iteration.
  Need stack-based epsilon closure with RestoreCapture frames (Rust approach)
  to preserve capture context through loops.

  TODO: Convert recursive addSearchThread to stack-based with save/restore
  Status: 2 NFA unit test failures, all meta tests pass (meta still on COW)

* wip: stack-based epsilon closure with RestoreCapture

  Converted addSearchThread and addSearchThreadToNext from recursive to
  stack-based with captureFrame (Explore + RestoreCapture frames). Mirrors
  Rust pikevm.rs FollowEpsilon::RestoreCapture pattern.

  Still failing: greedy loop captures (a+)(b+) — per-state SlotTable
  overwrites group start on each loop iteration (state visited again in the
  next generation). Per-thread COW preserves all variants. Root issue:
  per-state storage loses capture history across byte transitions in greedy
  loops. Need either per-thread indexing or generation-aware slot
  preservation.

  Status: 2 NFA unit tests fail, all meta tests pass

* feat: dual SlotTable capture tracking — zero-alloc FindSubmatch

  Implement Rust-style dual SlotTable (curr/next) for capture propagation
  across byte transitions. Stack-based epsilon closure with RestoreCapture
  frames preserves capture context through greedy loops.

  Key changes:
  - Add NextSlotTable + captureStack + currSlots to PikeVMState
  - addSearchThread: stack-based with captureFrame (Explore + RestoreCapture)
  - addSearchThreadToNext: loads from curr SlotTable, writes to next
  - Swap SlotTable/NextSlotTable after each byte (Rust mem::swap pattern)
  - Don't clear Visited before seed — prevents SlotTable row overwrite
  - Wire meta FindSubmatch to use SlotTable path
  - Fix empty match capture groups (buildCapturesFromSlots)

  FindAllSubmatch (5 patterns, 50K matches, 800KB input):
  - Alloc: 554MB -> 26MB (-95%)
  - Mallocs: 12.5M -> 440K (-96%)
  - Time: 1.48s -> 0.45s (3.3x faster)

* docs: update CHANGELOG, OPTIMIZATIONS, add ARCHITECTURE.md for v0.12.19

  - CHANGELOG: add SlotTable capture tracking entry
  - OPTIMIZATIONS: add #10 Dual SlotTable (95% less memory), update version
  - ARCHITECTURE.md: new file documenting engine architecture, memory model,
    thread safety, and Rust alignment
1 parent 921d193 commit ab4039b

18 files changed

Lines changed: 935 additions & 550 deletions

CHANGELOG.md

Lines changed: 29 additions & 0 deletions
@@ -12,6 +12,35 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
 - SIMD prefilter for CompositeSequenceDFA (#83)
 
+## [0.12.19] - 2026-03-24
+
+### Performance
+
+- **Zero-alloc FindSubmatch via dual SlotTable** (Rust approach) — replaced per-thread
+  COW capture allocation with Rust-style flat SlotTable. Two SlotTables (curr/next)
+  swap between byte generations. Stack-based epsilon closure with RestoreCapture
+  frames preserves capture context through greedy loops. FindAllSubmatch (5 patterns,
+  50K matches, 800KB input): alloc **554MB → 26MB** (-95%), mallocs **12.5M → 440K**
+  (-96%), time **1.48s → 0.45s** (3.3x faster). Reference: Rust `pikevm.rs`
+  `ActiveStates` + `SlotTable` + `FollowEpsilon::RestoreCapture`.
+
+- **Rust-aligned BoundedBacktracker visited limit for UseNFA** — reduced visited
+  table capacity from 32M entries (64MB) to 128K entries (256KB) for UseNFA paths,
+  matching Rust regex's `visited_capacity` default. On Kostya's LangArena LogParser
+  (7MB log, 13 patterns): total alloc **89MB → 25MB** (-72%), RSS **353MB → 41MB**
+  (-88%). `errors` pattern: **66MB → 2.4MB** (-96%). No speed regression.
+  `UseBoundedBacktracker` strategy retains the full 32M limit for POSIX longest-match
+  correctness (Go stdlib compatibility).
+
+- **Byte-based DFA cache limit** (Rust approach) — replaced the `MaxStates` count limit
+  with `CacheCapacityBytes` (default 2MB, matching Rust's `hybrid_cache_capacity`).
+  The cache limit is now self-adjusting: fewer states for large alphabets, more for small.
+  Added `MemoryUsage()` method for runtime cache introspection.
+
+- **Remove dual transition storage** — eliminated `transitions []StateID` and
+  `transitionCount` from the `State` struct. Transitions are now stored exclusively in
+  `DFACache.flatTrans`. Acceleration detection migrated to `DetectAccelerationFromFlat()`
+  reading directly from the flat table.
+
 ## [0.12.18] - 2026-03-24
 
 ### Performance

README.md

Lines changed: 14 additions & 11 deletions
@@ -20,7 +20,7 @@ High-performance regex engine for Go. Drop-in replacement for `regexp` with **3-
 Go's stdlib `regexp` is intentionally simple — single NFA engine, no optimizations. This guarantees O(n) time but leaves performance on the table.
 
 coregex brings Rust regex-crate architecture to Go:
-- **Multi-engine**: Lazy DFA, PikeVM, OnePass, BoundedBacktracker
+- **Multi-engine**: 17 strategies — Lazy DFA, PikeVM, OnePass, BoundedBacktracker, and more
 - **SIMD prefilters**: AVX2/SSSE3 for fast candidate rejection
 - **Reverse search**: Suffix/inner literal patterns run 1000x+ faster
 - **O(n) guarantee**: No backtracking, no ReDoS vulnerabilities
@@ -187,20 +187,23 @@ Uses Go's `regexp/syntax` parser:
 ```
 Pattern → Parse → NFA → Literal Extract → Strategy Select
 
-┌─────────────────────────────────┐
-│ Engines (17 strategies):        │
-│ LazyDFA, PikeVM, OnePass,       │
-│ BoundedBacktracker,             │
-│ ReverseInner, ReverseSuffix,    │
-│ ReverseSuffixSet, AnchoredLiteral, │
-│ CharClassSearcher, Teddy,       │
-│ DigitPrefilter, AhoCorasick,    │
-│ CompositeSearcher, BranchDispatch │
-└─────────────────────────────────┘
+┌────────────────────────────────────────────┐
+│ Engines (17 strategies):                   │
+│ LazyDFA, PikeVM, OnePass,                  │
+│ BoundedBacktracker, ReverseAnchored,       │
+│ ReverseInner, ReverseSuffix,               │
+│ ReverseSuffixSet, MultilineReverseSuffix,  │
+│ AnchoredLiteral, CharClassSearcher,        │
+│ Teddy, DigitPrefilter, AhoCorasick,        │
+│ CompositeSearcher, BranchDispatch, Both    │
+└────────────────────────────────────────────┘
 
 Input → Prefilter (SIMD) → Engine → Match Result
 ```
 
+> For detailed architecture documentation, see [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
+> For optimization details, see [docs/OPTIMIZATIONS.md](docs/OPTIMIZATIONS.md).
+
 **SIMD Primitives** (AMD64):
 - `memchr` — single byte search (AVX2)
 - `memmem` — substring search (SSSE3)

dfa/lazy/accel_test.go

Lines changed: 20 additions & 20 deletions
@@ -87,37 +87,37 @@ func TestDetectAcceleration(t *testing.T) {
 }
 
 func TestDetectAccelerationFromCached(t *testing.T) {
-	// Test the lazy detection that only uses cached transitions
+	// State no longer stores transitions — DetectAccelerationFromCached returns nil.
+	// Acceleration is now detected via DetectAccelerationFromFlat using flatTrans.
 	state := NewState(StateID(1), []nfa.StateID{0}, false)
-
-	// Initially no cached transitions - should return nil
 	exitBytes := DetectAccelerationFromCached(state)
 	if exitBytes != nil {
-		t.Errorf("Expected nil with no cached transitions, got %v", exitBytes)
+		t.Errorf("Expected nil (State has no transitions), got %v", exitBytes)
 	}
+}
+
+func TestDetectAccelerationFromFlat(t *testing.T) {
+	// Test acceleration detection via flat transition table
+	stride := 256
+	sid := StateID(1)
+	flatTrans := make([]StateID, 2*stride) // 2 states
 
-	// Add 250 self-loop transitions
+	// State 1: 250 self-loops, 3 exits to state 2, 3 dead
+	base := int(sid) * stride
 	for i := 0; i < 250; i++ {
-		state.AddTransition(byte(i), StateID(1)) // Self-loop
+		flatTrans[base+i] = sid // Self-loop
 	}
+	flatTrans[base+250] = StateID(2)
+	flatTrans[base+251] = StateID(2)
+	flatTrans[base+252] = StateID(2)
+	flatTrans[base+253] = DeadState
+	flatTrans[base+254] = DeadState
+	flatTrans[base+255] = DeadState
 
-	// Add 3 exit bytes
-	state.AddTransition(byte(250), StateID(2)) // Exit to state 2
-	state.AddTransition(byte(251), StateID(2)) // Exit to state 2
-	state.AddTransition(byte(252), StateID(2)) // Exit to state 2
-
-	// Add 3 dead transitions
-	state.AddTransition(byte(253), DeadState)
-	state.AddTransition(byte(254), DeadState)
-	state.AddTransition(byte(255), DeadState)
-
-	// Now should detect as accelerable
-	exitBytes = DetectAccelerationFromCached(state)
+	exitBytes := DetectAccelerationFromFlat(sid, flatTrans, stride, nil)
 	if len(exitBytes) != 3 {
 		t.Errorf("Expected 3 exit bytes, got %v", exitBytes)
 	}
-
-	// Verify the exit bytes are correct
 	expected := map[byte]bool{250: true, 251: true, 252: true}
 	for _, b := range exitBytes {
 		if !expected[b] {

dfa/lazy/anchored_search_prefilter_test.go

Lines changed: 10 additions & 53 deletions
@@ -73,82 +73,39 @@ func TestDetectAccelFromCachedWithClassesByteMapping(t *testing.T) {
 
 // TestDetectAccelFromCachedWithClassesNilClasses verifies the nil byteClasses fallback.
 func TestDetectAccelFromCachedWithClassesNilClasses(t *testing.T) {
-	// Create a state with known transitions (stride=256, no compression)
+	// State no longer stores transitions — DetectAccelerationFromCachedWithClasses returns nil.
+	// Use DetectAccelerationFromFlat for flat table detection.
 	state := NewState(StateID(1), []nfa.StateID{0}, false)
-
-	// Fill 253 self-loop transitions
-	for i := 0; i < 253; i++ {
-		state.AddTransition(byte(i), StateID(1))
-	}
-	// Add 3 exit transitions to a different state
-	state.AddTransition(253, StateID(2))
-	state.AddTransition(254, StateID(2))
-	state.AddTransition(255, StateID(2))
-
-	// nil byteClasses -> exit class indices ARE the bytes (identity)
 	result := DetectAccelerationFromCachedWithClasses(state, nil)
-	if len(result) != 3 {
-		t.Fatalf("expected 3 exit bytes with nil classes, got %v", result)
-	}
-	expected := map[byte]bool{253: true, 254: true, 255: true}
-	for _, b := range result {
-		if !expected[b] {
-			t.Errorf("unexpected exit byte %d", b)
-		}
+	if result != nil {
+		t.Errorf("expected nil (State has no transitions), got %v", result)
 	}
 }
 
-// TestDetectAccelFromCachedInsufficientTransitions tests that when too few
-// transitions are cached, acceleration detection returns nil.
+// TestDetectAccelFromCachedInsufficientTransitions tests that State-based detection returns nil.
 func TestDetectAccelFromCachedInsufficientTransitions(t *testing.T) {
 	state := NewState(StateID(1), []nfa.StateID{0}, false)
-	// Only add a few transitions (way below 94% threshold)
-	state.AddTransition(0, StateID(1))
-	state.AddTransition(1, StateID(2))
-
 	result := DetectAccelerationFromCachedWithClasses(state, nil)
 	if result != nil {
-		t.Errorf("expected nil for insufficient cached transitions, got %v", result)
+		t.Errorf("expected nil (State has no transitions), got %v", result)
 	}
 }
 
-// TestDetectAccelFromCachedTooManyExitClasses tests that >3 exit classes returns nil.
+// TestDetectAccelFromCachedTooManyExitClasses tests that State-based detection returns nil.
 func TestDetectAccelFromCachedTooManyExitClasses(t *testing.T) {
 	state := NewState(StateID(1), []nfa.StateID{0}, false)
-	// Fill 250 self-loops
-	for i := 0; i < 250; i++ {
-		state.AddTransition(byte(i), StateID(1))
-	}
-	// Add 4 distinct exit transitions (> 3 limit)
-	state.AddTransition(250, StateID(2))
-	state.AddTransition(251, StateID(3))
-	state.AddTransition(252, StateID(4))
-	state.AddTransition(253, StateID(5))
-	// Fill remaining with dead
-	state.AddTransition(254, DeadState)
-	state.AddTransition(255, DeadState)
-
 	result := DetectAccelerationFromCachedWithClasses(state, nil)
 	if result != nil {
-		t.Errorf("expected nil for >3 exit classes, got %v", result)
+		t.Errorf("expected nil (State has no transitions), got %v", result)
 	}
 }
 
-// TestDetectAccelFromCachedZeroExitClasses tests that 0 exit classes returns nil.
+// TestDetectAccelFromCachedZeroExitClasses tests that State-based detection returns nil.
 func TestDetectAccelFromCachedZeroExitClasses(t *testing.T) {
 	state := NewState(StateID(1), []nfa.StateID{0}, false)
-	// All transitions are self-loops or dead
-	for i := 0; i < 256; i++ {
-		if i < 200 {
-			state.AddTransition(byte(i), StateID(1)) // self-loop
-		} else {
-			state.AddTransition(byte(i), DeadState) // dead
-		}
-	}
-
 	result := DetectAccelerationFromCachedWithClasses(state, nil)
 	if result != nil {
-		t.Errorf("expected nil for 0 exit classes, got %v", result)
+		t.Errorf("expected nil (State has no transitions), got %v", result)
 	}
 }
