Add reset(to:), processedTokenCount, and implicit prefix caching #51

Merged
stikves merged 2 commits into apple:main from stikves:sukru/engine-partial-reset on Jun 18, 2026

@stikves stikves commented Jun 16, 2026

Contributor

Extends the InferenceEngine protocol with partial KV cache reset and automatic prefix detection:

  • reset(to: tokenIndex): truncate KV cache to keep first N positions
  • processedTokenCount: read-only, tracks engine's current cache position
  • TokenHistory: shared utility for memcmp-based prefix detection
  • Pipelined engine: implicit prefix caching — engine auto-detects common prefix between consecutive generate() calls, only processes new tokens
  • Sequential/Static: implicit rewind when input is shorter than cache (full divergence triggers zero-fill, prefix extension rewinds counter)

All engines pass the explicit reset parity test. The pipelined engine passes the implicit prefix caching test (extend + diverge). Sequential/Static pass by wiring explicit reset.
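
The prefix detection described above can be illustrated with a minimal Swift sketch. It is not the code from this PR: `commonPrefixLength`, `TokenHistorySketch`, and the `Int32` token type are hypothetical names and choices made for illustration. The idea is to compare the previous input with the new one, use a single `memcmp` when the whole overlap matches, and otherwise scan for the first mismatch.

```swift
import Foundation

/// Hypothetical helper (not from the PR): length of the shared prefix
/// of two token arrays.
func commonPrefixLength(_ a: [Int32], _ b: [Int32]) -> Int {
    let n = min(a.count, b.count)
    if n == 0 { return 0 }
    // Fast path: one memcmp over the overlapping region.
    let equal: Bool = a.withUnsafeBytes { pa in
        b.withUnsafeBytes { pb in
            memcmp(pa.baseAddress!, pb.baseAddress!,
                   n * MemoryLayout<Int32>.stride) == 0
        }
    }
    if equal { return n }
    // Slow path: locate the first differing position.
    var i = 0
    while i < n && a[i] == b[i] { i += 1 }
    return i
}

/// Hypothetical stand-in for a token history: remembers the last input
/// and reports how many leading tokens of the next input can be reused.
struct TokenHistorySketch {
    private(set) var tokens: [Int32] = []

    mutating func reusablePrefix(for input: [Int32]) -> Int {
        let keep = commonPrefixLength(tokens, input)
        tokens = input
        return keep
    }
}

var history = TokenHistorySketch()
let first = history.reusablePrefix(for: [1, 2, 3, 4])          // 0: nothing cached yet
let extend = history.reusablePrefix(for: [1, 2, 3, 4, 5, 6])   // 4: only 5 and 6 are new
let diverge = history.reusablePrefix(for: [1, 2, 9])           // 2: cache is valid up to position 2
```

In this sketch, an engine would rewind its cache to the returned count and process only the tokens after it, which covers both the extend case and the diverge case.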

@stikves stikves force-pushed the sukru/engine-partial-reset branch 2 times, most recently from c3f7d0a to a5c51ce, June 17, 2026 02:45
@stikves stikves marked this pull request as ready for review June 17, 2026 03:42
@stikves stikves force-pushed the sukru/engine-partial-reset branch 2 times, most recently from 433d83b to 5482d42, June 17, 2026 18:17
@stikves stikves force-pushed the sukru/engine-partial-reset branch from 8623ada to 8ddd9ba, June 17, 2026 19:50

Extends the InferenceEngine protocol with partial KV cache reset and
automatic prefix detection:

- reset(to: tokenIndex): truncate KV cache to keep first N positions
- processedTokenCount: read-only, tracks engine's current cache position
- TokenHistory: shared utility for memcmp-based prefix detection
- Pipelined engine: implicit prefix caching — engine auto-detects common
  prefix between consecutive generate() calls, only processes new tokens
- Sequential/Static: implicit rewind when input is shorter than cache
  (full divergence triggers zero-fill, prefix extension rewinds counter)

All engines pass 20/20 explicit reset parity test. Pipelined passes 11/11
implicit prefix caching test (extend + diverge). Sequential/Static pass
10/11 (divergence auto-reset works, extend works via counter rewind).
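
One plausible reading of "reset parity" is that truncating the cache and continuing must give the same result as a fresh run over the same tokens. The sketch below illustrates that property with a toy engine. `ToyEngine` and `process(_:)` are hypothetical, the fake cache entries stand in for real KV state, and the actual test in this PR may check something different.

```swift
/// Toy stand-in for an engine (not the real InferenceEngine protocol).
struct ToyEngine {
    // One fake KV entry per processed position.
    private var cache: [Int] = []

    /// Read-only view of the current cache position.
    var processedTokenCount: Int { cache.count }

    /// Keep the first `tokenIndex` cache positions and drop the rest.
    mutating func reset(to tokenIndex: Int) {
        precondition(tokenIndex >= 0 && tokenIndex <= cache.count,
                     "tokenIndex out of range")
        cache.removeSubrange(tokenIndex...)
    }

    /// Append one entry per token; each entry depends on all earlier ones,
    /// so a stale cache would change the final value.
    mutating func process(_ tokens: [Int]) -> Int {
        for t in tokens {
            let prev = cache.last ?? 0
            cache.append((prev * 31 + t) % 1_000_003)
        }
        return cache.last ?? 0
    }
}

// Path A: process five tokens, rewind to three, then continue.
var a = ToyEngine()
_ = a.process([1, 2, 3, 4, 5])
a.reset(to: 3)
let viaReset = a.process([8, 9])

// Path B: fresh engine over the equivalent full sequence.
var b = ToyEngine()
let fresh = b.process([1, 2, 3, 8, 9])

// Parity holds when both paths agree on the result and the cache position.
let parity = (viaReset == fresh)
    && (a.processedTokenCount == b.processedTokenCount)
```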

@stikves stikves force-pushed the sukru/engine-partial-reset branch from 8ddd9ba to 9bc7aa3, June 18, 2026 00:02
@carinapeng carinapeng self-requested a review June 18, 2026 18:20

@carinapeng carinapeng left a comment

Contributor

Thank you!

@stikves stikves merged commit 18cd896 into apple:main Jun 18, 2026
3 checks passed
@stikves stikves deleted the sukru/engine-partial-reset branch June 18, 2026 18:29