fix(memory-core): yield to event loop during seedEmbeddingCache (R2.A.2)#2

Merged
lesaai merged 2 commits into
cc-mini/chat-completions-upstream-20260423 from
cc-mini/memory-core-yield-during-seed
Apr 24, 2026

Conversation

@lesaai
Member

@lesaai lesaai commented Apr 24, 2026

Canary candidate. Do not npm-link to live Lēsa until canary passes.

Summary

R2.A.2. The .iterate() patch from PR #1 (a315280) prevents the V8 heap OOM, but the iterate loop still runs synchronously for ~117s on a 435K-row embedding_cache. The gateway can't service HTTP /health probes during that window, so wip-healthcheck's watchdog SIGKILLs the gateway after a single 30s probe timeout.

Live repro on 2026-04-24 ~15:31 PDT post-PR#1 deploy: HTTP probe failed: timeout (30000ms) → Restarting gateway (attempt 1/3) → SIGKILL → LaunchAgent respawn. No FATAL ERROR, no Abort trap, no StatementSync::All stack. R2.A v1 succeeded in preventing the V8 OOM; the new failure mode is event-loop blocking.

Fix

  1. seedEmbeddingCache becomes async and yields to the event loop every 1000 rows via await new Promise(resolve => setImmediate(resolve)).
  2. Caller (inside runMemoryAtomicReindex's build async arrow) gains await — one-line propagation.

The synchronous .iterate() / insert.run() pair stays the same; we just release the event loop for one tick per batch so HTTP /health (and other pending I/O work) can run in between. Memory stays bounded, streaming behavior is preserved, and /health stays responsive during the seed.

YIELD_EVERY = 1000 rows works out to tens of milliseconds of synchronous work per batch, well under the 30s probe timeout even with no additional leniency from the watchdog.
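The patched loop can be sketched as follows. This is a minimal sketch, not the actual memory-core code: the Stmt/InsertStmt shapes stand in for better-sqlite3-style synchronous statements, and the Row shape is illustrative.

```typescript
// Minimal sketch of the yield patch, assuming a better-sqlite3-style
// synchronous Statement API. Table/column shapes are illustrative.
type Row = { key: string; embedding: Uint8Array };

interface Stmt<T> {
  iterate(): IterableIterator<T>;
}

interface InsertStmt {
  run(key: string, embedding: Uint8Array): void;
}

const YIELD_EVERY = 1000;

async function seedEmbeddingCache(
  select: Stmt<Row>,
  insert: InsertStmt,
): Promise<number> {
  let n = 0;
  for (const row of select.iterate()) {
    insert.run(row.key, row.embedding); // each row is still synchronous
    if (++n % YIELD_EVERY === 0) {
      // Release the event loop for one tick so /health probes and
      // other pending I/O callbacks can run between batches.
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
  return n;
}
```

Because the yield fires only once per YIELD_EVERY rows, per-row overhead stays negligible while each uninterrupted synchronous stretch is bounded to one batch.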

Validation

  • pnpm tsgo:prod: green (core + extensions graphs)
  • pnpm test extensions/memory-core: 512 passed, 3 skipped, 0 failed

Out of scope

  • wip-healthcheck softening. Per Parker's direction, a secondary guardrail (require multiple consecutive failures or a stronger multi-signal stuck condition) belongs in wip-healthcheck-private, not here. Filed separately.
  • Secondary listChunks .all() path in manager-search.ts:246-252 (R2.A.3). Bigger surgery — caller needs full candidate set for cosine-similarity ranking, so converting to streaming requires a bounded top-K heap in the caller. Held until R2.A.2 canaries clean.
  • Upstream OpenClaw 2026.4.24 does NOT include either fix; carrying the fork patch.
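For reference, the bounded top-K heap direction mentioned for R2.A.3 could look something like this sketch. It is purely illustrative; the class name, field shapes, and K are assumptions, not the manager-search code.

```typescript
// Illustrative bounded top-K min-heap: stream candidates instead of
// materializing them with .all(), keeping only the K best cosine
// scores. Memory is O(K) regardless of candidate-set size.
class TopK {
  private heap: { id: string; score: number }[] = [];
  constructor(private readonly k: number) {}

  push(id: string, score: number): void {
    if (this.heap.length < this.k) {
      this.heap.push({ id, score });
      this.bubbleUp(this.heap.length - 1);
    } else if (score > this.heap[0].score) {
      this.heap[0] = { id, score }; // evict the current worst keeper
      this.sinkDown(0);
    }
  }

  // Results sorted best-first.
  drain(): { id: string; score: number }[] {
    return [...this.heap].sort((a, b) => b.score - a.score);
  }

  private bubbleUp(i: number): void {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.heap[parent].score <= this.heap[i].score) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }

  private sinkDown(i: number): void {
    for (;;) {
      let min = i;
      const l = 2 * i + 1;
      const r = 2 * i + 2;
      if (l < this.heap.length && this.heap[l].score < this.heap[min].score) min = l;
      if (r < this.heap.length && this.heap[r].score < this.heap[min].score) min = r;
      if (min === i) break;
      [this.heap[min], this.heap[i]] = [this.heap[i], this.heap[min]];
      i = min;
    }
  }
}
```

The caller would push each streamed candidate's cosine score and drain at the end, which is why the conversion needs caller-side surgery rather than a drop-in .iterate() swap.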

Canary plan

  1. Canary-install this branch (not live Lēsa)
  2. Repro target: Day 63-style broad memory review on real 16 GB main.sqlite
  3. Pass criteria: no Abort trap, no PID change, no V8 heap OOM, /health stays responsive throughout, gateway not SIGKILL'd by watchdog
  4. If clean → promote → re-run live N4 / Day 63 → if still clean → start R2.A.3 (listChunks)

lesaai added 2 commits April 24, 2026 15:41
R2.A.2. The .iterate()-based seed (R2.A v1, a315280) prevents the V8
heap OOM but the iterate loop still runs synchronously for ~117s on a
435K-row embedding_cache. wip-healthcheck SIGKILLs the gateway after
its 30s probe timeout fails. No FATAL ERROR, no Abort trap.

Patch: convert seedEmbeddingCache to async, yield to the event loop
every 1000 rows via setImmediate. Keeps memory bounded; preserves the
streaming behavior; restores /health responsiveness during the seed.

The only caller is inside an existing async arrow wrapping
runMemoryAtomicReindex's build callback. Adding await is a one-line
change.

Validation:
- pnpm tsgo:prod: green
- pnpm test extensions/memory-core: 512 passed, 3 skipped, 0 failed

Scope: does not soften wip-healthcheck (separate guardrail per Parker
direction). Does not address the secondary listChunks path (R2.A.3).

Revert the top-of-file lint-suppression comments accidentally landed in
the previous commit (f9e9970). They were added to work around an
oxlint resolver false positive that turned out to be transient state,
not a real lint failure. Production code shouldn't carry misleading
explanations for problems that didn't actually persist.

Net diff of this branch vs base is now just the seedEmbeddingCache
yield patch: function -> async, setImmediate every 1000 rows, caller
await. No lint comments, no file-level disables.
@lesaai
Member Author

lesaai commented Apr 24, 2026

Read-only yield canary passed against the production ~/.openclaw/memory/main.sqlite embedding cache.

Results:

{
  "yieldEvery": 1000,
  "rows": 435136,
  "embeddingBytes": 8680106189,
  "durationMs": 26383,
  "timerTicks": 25,
  "maxTimerDelayMs": 147,
  "rssMb": 144,
  "maxRssMb": 150,
  "heapUsedMb": 18,
  "maxHeapUsedMb": 77
}

Interpretation: the 1000-row setImmediate cadence keeps the event loop responsive while scanning the full production embedding cache. This directly addresses the post-R2.A failure mode where .iterate() avoided V8 heap OOM but starved /health long enough for the watchdog to restart the gateway.

Canary was read-only and did not touch the live gateway.
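The timerTicks / maxTimerDelayMs numbers above could be collected with a lag probe along these lines. The field names match the canary JSON, but the probe itself is an assumption about the harness, not its actual code.

```typescript
// Sketch of an event-loop lag probe: a repeating timer records how
// late each tick fires. A busy (blocked) event loop delays timers, so
// maxTimerDelayMs approximates the longest synchronous stretch.
function startLagProbe(intervalMs = 1000) {
  let timerTicks = 0;
  let maxTimerDelayMs = 0;
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    timerTicks += 1;
    const delayMs = Date.now() - expected; // >0 means the loop was busy
    if (delayMs > maxTimerDelayMs) maxTimerDelayMs = delayMs;
    expected = Date.now() + intervalMs;
  }, intervalMs);
  return {
    stop() {
      clearInterval(timer);
      return { timerTicks, maxTimerDelayMs };
    },
  };
}
```

With a 1s interval over the ~26s scan, 25 ticks and a 147ms worst-case delay are consistent with the JSON above: no batch held the loop anywhere near the 30s probe timeout.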

@lesaai lesaai merged commit e3f6864 into cc-mini/chat-completions-upstream-20260423 Apr 24, 2026
93 of 97 checks passed
@lesaai lesaai deleted the cc-mini/memory-core-yield-during-seed branch April 24, 2026 22:52