Bug: GET /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on large corpora
Summary
On a deployment with ~93K rows in the traces table (270MB of vector data), every request to GET /api/v1/embeddings/maintenance blocks the Node.js main thread for 4+ minutes at 100% CPU, making the entire OpenClaw gateway unresponsive.
Root Cause
computeEmbeddingMaintenanceStats() (in core/pipeline/memory-core.ts) calls collectEmbeddingSlots() which:
- Paginates through every row of traces, policies, world_model, and skills using repos.*.list()
- Each list() call returns full rows including BLOB vector columns (vec_summary, vec_action)
- Decodes every vector from Buffer → Float32Array in JS
- Builds a sorted array of all slots
For stats-only queries this is massively wasteful — the endpoint only needs COUNT(*) grouped by null/not-null/dimension-mismatch, not the actual vector data.
At 93K traces × 2 vector columns × 1536 bytes each ≈ 270MB of vectors loaded into JS heap on every request, all synchronously on the main thread via better-sqlite3.
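The figure above can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes exactly 93,000 rows and that every trace has both vector columns populated at 1536 bytes each, so the real number on a given deployment will differ somewhat. The function name is illustrative and not part of the codebase.

```typescript
// Rough estimate of the vector bytes pulled into the JS heap per request.
// Assumption: every row has all vector columns populated at a fixed size.
function estimateVectorBytes(
  rows: number,
  vectorColumns: number,
  bytesPerVector: number,
): number {
  return rows * vectorColumns * bytesPerVector;
}

const bytes = estimateVectorBytes(93_000, 2, 1536);
const mib = bytes / (1024 * 1024);

console.log(`${bytes} bytes is about ${mib.toFixed(1)} MiB`);
```

This counts raw BLOB payload only; the decoded Float32Array copies and per-row object overhead come on top of it.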
Evidence
# strace during hang — 99.96% pread64 on memos.db
99.96 0.071932 3 22685 pread64
# Log — onTurnStart blocked for 292 seconds
memos.onTurnStart returned hits=7 durationMs=292731
# Event loop delay
liveness warning: eventLoopDelayMaxMs=285883.8 eventLoopUtilization=1
Proposed Fix
Add embeddingMaintenanceStats() to Repos (core/storage/repos/index.ts) that uses pure SQL COUNT queries instead of loading vectors:
SELECT COUNT(*) AS n FROM traces WHERE vec_summary IS NOT NULL;
SELECT COUNT(*) AS n FROM traces WHERE vec_summary IS NOT NULL AND LENGTH(vec_summary) <> @expectedLen;
-- same for vec_action, policies.vec, world_model.vec, skills.vec
Then computeEmbeddingMaintenanceStats() calls this SQL path for the common case (configured dimension matches stored dimension), falling back to the full collectEmbeddingSlots() only when a dimension mismatch is detected.
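A minimal sketch of what the SQL path could look like, not the actual patch. It folds the separate COUNT queries into one aggregate per vector column, which is a variation on the queries listed above. `DatabaseLike` is a hypothetical structural type covering only the `prepare(...).get(...)` subset of better-sqlite3, and `SlotStats` and `hasDimensionMismatch` are illustrative names. `expectedLen` is in bytes, because SQLite's `LENGTH()` returns the byte length of a BLOB.

```typescript
// Hypothetical structural types: the subset of better-sqlite3 used here.
interface StatementLike {
  get(params?: Record<string, unknown>): unknown;
}
interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

interface SlotStats {
  table: string;
  column: string;
  populated: number;  // vector IS NOT NULL
  empty: number;      // vector IS NULL
  mismatched: number; // populated, but byte length differs from expectedLen
}

// Tables and columns as named in this issue. Hardcoded identifiers only,
// so interpolating them into the SQL text is safe.
const VECTOR_COLUMNS = [
  { table: "traces", column: "vec_summary" },
  { table: "traces", column: "vec_action" },
  { table: "policies", column: "vec" },
  { table: "world_model", column: "vec" },
  { table: "skills", column: "vec" },
] as const;

type AggregateRow = { total: number; populated: number; mismatched: number | null };

function embeddingMaintenanceStats(db: DatabaseLike, expectedLen: number): SlotStats[] {
  return VECTOR_COLUMNS.map(({ table, column }) => {
    const row = db
      .prepare(
        `SELECT COUNT(*) AS total,
                COUNT(${column}) AS populated,
                SUM(CASE WHEN ${column} IS NOT NULL
                          AND LENGTH(${column}) <> @expectedLen
                         THEN 1 ELSE 0 END) AS mismatched
         FROM ${table}`,
      )
      .get({ expectedLen }) as AggregateRow;
    return {
      table,
      column,
      populated: row.populated,
      empty: row.total - row.populated,
      mismatched: row.mismatched ?? 0, // SUM() is NULL on an empty table
    };
  });
}

// Decides whether the caller should fall back to the full slot scan.
function hasDimensionMismatch(stats: SlotStats[]): boolean {
  return stats.some((s) => s.mismatched > 0);
}
```

Under this sketch, computeEmbeddingMaintenanceStats() would call collectEmbeddingSlots() only when `hasDimensionMismatch()` returns true, so the common case never touches the vector payloads.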
Performance Comparison
| Method | Time | Memory |
|---|---|---|
| collectEmbeddingSlots() (current) | 4+ min, 100% CPU | +270MB heap |
| SQL COUNT (proposed) | ~900ms, <10% CPU | negligible |
Additional Fix: Vector Scan Bounding
The tier-2 retrieval path (scanAndTopK in core/storage/vector.ts) also does a brute-force full-table scan on every onTurnStart. With 93K rows × 1536 dims, this alone causes 5-30 second event loop blocks. Proposed: FTS pre-seed + configurable time-window (vectorScanMaxAgeMs) to bound the scan.
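The bounding idea can be sketched as follows. This is an illustration under assumptions, not the real scanAndTopK: the `VectorRow` shape, the `seedIds` set (standing in for ids returned by an FTS pre-seed), the `boundedScanTopK` name, and the use of cosine similarity are all hypothetical. Only `vectorScanMaxAgeMs` comes from the proposal.

```typescript
// Hypothetical row shape; the real schema may differ.
interface VectorRow {
  id: number;
  ts: number; // creation time, ms since epoch
  vec: Float32Array;
}

interface ScanOptions {
  now: number;
  vectorScanMaxAgeMs: number;  // rows older than this are skipped...
  seedIds: ReadonlySet<number>; // ...unless the FTS pre-seed selected them
}

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores only rows inside the time window or in the seed set, then keeps the top k.
function boundedScanTopK(
  rows: Iterable<VectorRow>,
  query: Float32Array,
  k: number,
  opts: ScanOptions,
): Array<{ id: number; score: number }> {
  const cutoff = opts.now - opts.vectorScanMaxAgeMs;
  const hits: Array<{ id: number; score: number }> = [];
  for (const row of rows) {
    if (row.ts < cutoff && !opts.seedIds.has(row.id)) continue;
    hits.push({ id: row.id, score: cosine(query, row.vec) });
  }
  hits.sort((a, b) => b.score - a.score);
  return hits.slice(0, k);
}
```

In an actual implementation the window and seed filter would need to live in the SQL WHERE clause, so that out-of-window rows are never read from disk or decoded; filtering in JS as shown here only saves the scoring work.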
Environment
- memos-local-plugin v2.0.5
- OpenClaw gateway (latest)
- memos.db: ~93K traces, 853MB on disk
- Node 26.0.0, better-sqlite3
- Linux x86_64, 16 cores
I have a working patch ready and can submit a PR if the approach is approved.