Fix #1929: Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on#1935
Fix #1929: Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on#1935Memtensor-AI wants to merge 5 commits into
Conversation
❌ Automated Test Results: FAILED
Failed tests:
Error detailsBranch: |
29c802f to
0ea9bc3
Compare
❌ Automated Test Results: FAILED
Failed tests:
Error detailsBranch: |
…gs/maintenance The maintenance stats endpoint used to paginate every row of `traces`, `policies`, `world_model`, and `skills` and hydrate the BLOB vector columns into JS just to inspect each vector's byte length. On a deployment with ~93K traces and ~270 MB of vectors that single `better-sqlite3` call blocked the Node event loop at 100% CPU for 4+ minutes (issue #1929), starving every concurrent `onTurnStart`. Replace the implementation with five `SELECT COUNT(*) + SUM(CASE WHEN ...)` queries — `LENGTH(blob)` reads only the BLOB header, never the payload — so the per-bucket counts now finish in single-millisecond territory regardless of database size. The public `EmbeddingMaintenanceStats` JSON shape, the HTTP route, and the JSON-RPC bridge are unchanged. The two pre-existing semantic filters from the slot-based path (`shouldTraceHaveEmbeddings` for short-text traces, lightweight-memory carveout for `vec_action`) are preserved verbatim inside the SQL WHERE clauses so per-bucket counts do not shift for already-installed users. Tests: - New `tests/unit/storage/embedding-maintenance.test.ts` pins the bucket semantics, lightweight carveout, short-text filter, dimension-mismatch detection, empty-DB safety, and the `expectedByteLen=0` fallback. - Existing memory-core suite passes unchanged (28/28), including the "repairs missing and wrong-dimension imported trace embeddings" and "does not require action vectors for lightweight memory traces" cases — proves the public contract is byte-identical.
…geMs Second half of the issue #1929 mitigation: the previous commit cut the `/api/v1/embeddings/maintenance` cost to SQL counts, but the tier-2 retrieval path (`scanAndTopK` over `traces.vec_summary` / `vec_action`) still runs a brute-force full-table scan on every `onTurnStart`. On a ~93K-row deployment that single scan blocks the Node event loop for 5-30 seconds. Introduce `algorithm.retrieval.vectorScanMaxAgeMs` (ms, default `0` = unbounded for back-compat). When > 0, the vector channels add `ts >= now() - vectorScanMaxAgeMs` to the SQL WHERE clause so only recent traces participate in the cosine scan. The keyword (FTS / pattern / structural) channels are left unbounded so ancient traces remain reachable via exact-text recall. Schema validation: - `NumberInRange(0, 0, 31_536_000_000)` — Typebox rejects negative, out-of-range, and non-number patches at `resolveConfig` time, so a "dirty" `PATCH /api/v1/config` never reaches `writer.ts`'s atomic rename. A subsequent `GET /api/v1/config` always returns a value in [0, 1 year]. - Hard cap of one year matches the contract pinned by the autodev rerun harness: anything larger is indistinguishable from "unbounded" at the corpus sizes where the bound starts to matter, and accepting absurdly large values would let misconfigured deployments silently revert to the legacy starvation path. Tests: `tests/unit/config/load.test.ts` adds 22 parametrised cases covering the accepted range (default, 1d, 30d, max, 0), the out-of-range rejection set the harness exercises (-1, -60s, -86_400_000, max+1, max+86_400_000, 100×max), and the invalid-type set (string number, string text, null, dict, list, NaN, Inf).
`MemosError("config_invalid", ...)` raised by `core/config/index.ts::resolveConfig`
on a bad PATCH body used to bubble up to `server/http.ts`'s catch-all and surface
as a 500 `internal` error. The rerun harness for issue #1929 explicitly asserts
`status_code < 500` on every malformed `PATCH /api/v1/config` body — concurrent
search calls must not be poisoned by a misbehaving viewer or admin script — so
the route now catches `MemosError` whose code is `config_invalid` or
`config_write_failed` and translates it to a 400 `invalid_argument` with the
schema validator's message. Any other error keeps propagating to the global
handler so unexpected bugs still page operators.
Also adds `bool_true` / `bool_false` to the parametrized
`retrieval.vectorScanMaxAgeMs` invalid-type tests so the schema-level guard is
pinned for both booleans (Typebox `Type.Number` rejects them but coercion
behaviour is worth pinning explicitly).
Test plan: 102/102 in `tests/unit/config/load.test.ts` +
`tests/unit/server/http.test.ts` pass, including three new integration tests
that exercise the 400-mapping path (`schema validation errors → 400`,
`writer failures → 400`, `unexpected errors → 500`). `npx tsc -p tsconfig.json
--noEmit` and `npx tsc -p tsconfig.build.json` both clean.
Refs: #1929
0ea9bc3 to
7111b4f
Compare
The previous fix mapped both `config_invalid` and `config_write_failed` from `PATCH /api/v1/config` to HTTP 400. But `config_write_failed` is raised only when the atomic config rename fails (disk full / permission denied) — a server-side I/O fault, not bad client input. Returning 400 `invalid_argument` for it misleads clients into thinking their (valid) payload was rejected and hides a real operational problem from the 500 pager path. Narrow the client-error set to `config_invalid` only (the Typebox schema-validation failure raised by `resolveConfig` on a malformed PATCH body). `config_write_failed` and every other error keep propagating to the global handler as 500. The #1929 rerun harness contract tests exercise only malformed input (`config_invalid`), so all 32 schema-contract cases stay green. Refs: #1929 Co-authored-by: Cursor <cursoragent@cursor.com>
Align the http.test.ts expectation with the corrected route behaviour: a `config_write_failed` (atomic rename failed) is a server-side fault, so `PATCH /api/v1/config` must return 500 `internal`, not 400. The schema-validation (`config_invalid`) → 400 case is covered by the adjacent test. Refs: #1929 Co-authored-by: Cursor <cursoragent@cursor.com>
✅ Manual re-validation: config PATCH 4xx fix verifiedThe last automated FAILED result above was produced against commit Fix now on this branch:
Re-validated manually on the test node (rebuilt the plugin from branch HEAD
The 15 previously-failing harness cases ( |
Description
Fixes the event-loop starvation on GET /api/v1/embeddings/maintenance reported in issue #1929. The endpoint used to paginate every row of
traces/policies/world_model/skillsinapps/memos-local-plugin/core/pipeline/memory-core.ts::collectEmbeddingSlots()and hydrate the BLOB vector columns into JS purely to inspect each vector's length, blocking better-sqlite3 on the main thread for 4+ minutes on a ~93K-row deployment with ~270 MB of vectors.This change adds a new SQL-only counter
embeddingMaintenanceCounts(db, { expectedByteLen })inapps/memos-local-plugin/core/storage/repos/index.tsthat issues fiveSELECT COUNT(*) + SUM(CASE WHEN ...)queries —LENGTH(blob)reads only the BLOB header, never the payload — and rewirescomputeEmbeddingMaintenanceStats()to use it. The publicEmbeddingMaintenanceStatsJSON shape, the HTTP route, the JSON-RPC bridge, and the agent contract are unchanged. The two pre-existing semantic filters (short-text trace skip viashouldTraceHaveEmbeddings, lightweight-memory carveout forvec_action) are preserved verbatim inside the SQL WHERE clauses so per-bucket counts stay identical for installed users.rebuildEmbeddings()keeps using slot enumeration for the actual row updates but now also pulls its before/after stats from the fast path.Tests: new
tests/unit/storage/embedding-maintenance.test.ts(4 cases) pins the bucket semantics, lightweight carveout, short-text filter, dimension-mismatch detection, empty-DB safety, and theexpectedByteLen=0fallback. The pre-existing memory-core suite (28/28 tests) passes unchanged — including the "repairs missing and wrong-dimension imported trace embeddings" and "does not require action vectors for lightweight memory traces" cases — confirming the public contract is byte-identical.tsc -p tsconfig.json --noEmitandtsc -p tsconfig.build.jsonboth exit 0. The 3 wider-suite failures (tests/e2e/v7-full-chain.e2e.test.ts,tests/unit/storage/migrator.test.ts::namespace-visibility…regression #1787,tests/unit/storage/traces-count.test.ts::count() should return accurate count > 500) all reproduce on bare main aftergit stashand are pre-existing failures unrelated to this change.Branch
bugfix/autodev-1929-rerun-20260616pushed to origin at commit29c802fc. Spec artifacts (proposal.md / design.md / spec.md / task.md / verification-report.md) archived to the sibling specs repo under2026-06-16-1929-bug-apiv1embeddingsmaintenance-causes-100-cpu-and-event-loop/and pushed tomemos-autodev-specsmain.Note: the issue's "Additional Fix" mentioning bounding
scanAndTopKincore/storage/vector.tsis intentionally left for a follow-up — the production hang the title and event-loop logs describe is fully explained by the maintenance endpoint path, and keeping this PR scoped tightens the review surface (see proposal.md).Related Issue (Required): Fixes #1929
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Automated tests are pending.
Checklist
@MatthewZhuang, @CarltonXiang, @syzsunshine219, @World-controller please review this PR.
Reviewer Checklist