
Bug: /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on large corpora #1929

@tianxin8206


Bug: GET /api/v1/embeddings/maintenance causes 100% CPU and event loop starvation on large corpora

Summary

On a deployment with ~93K rows in the traces table (270MB of vector data), every request to GET /api/v1/embeddings/maintenance blocks the Node.js main thread for 4+ minutes at 100% CPU, making the entire OpenClaw gateway unresponsive.

Root Cause

computeEmbeddingMaintenanceStats() (in core/pipeline/memory-core.ts) calls collectEmbeddingSlots() which:

  1. Paginates through every row of traces, policies, world_model, and skills using repos.*.list()
  2. Each list() call returns full rows including BLOB vector columns (vec_summary, vec_action)
  3. Decodes every vector from Buffer → Float32Array in JS
  4. Builds a sorted array of all slots

For stats-only queries this is massively wasteful — the endpoint only needs COUNT(*) grouped by null/not-null/dimension-mismatch, not the actual vector data.

At 93K traces × 2 vector columns × 1536 bytes each ≈ 270MB of vectors loaded into JS heap on every request, all synchronously on the main thread via better-sqlite3.
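To make the per-request cost concrete, here is a minimal sketch of the decode step and the memory arithmetic above. `decodeVector` and `estimateVectorBytes` are hypothetical names for illustration, not the plugin's actual functions, and the decode assumes the BLOBs hold raw float32 values in the platform's byte order.

```typescript
// Hypothetical helper, not the plugin's actual decoder.
// A Node Buffer is a Uint8Array, so a BLOB returned by better-sqlite3 fits this parameter.
function decodeVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new RangeError(`BLOB length ${blob.byteLength} is not a multiple of 4`);
  }
  // Copy into a fresh, 4-byte-aligned buffer; the input may start at an odd byte offset.
  const aligned = new Uint8Array(blob);
  return new Float32Array(aligned.buffer);
}

// Bytes held in the JS heap if every vector in every column is decoded at once.
function estimateVectorBytes(rows: number, vectorColumns: number, bytesPerVector: number): number {
  return rows * vectorColumns * bytesPerVector;
}

const totalBytes = estimateVectorBytes(93000, 2, 1536);
console.log(`${(totalBytes / 1e6).toFixed(0)} MB`); // 286 MB, about 272 MiB
```

Every decoded vector costs one allocation and one copy, and a stats-only endpoint never uses the result.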

Evidence

# strace summary during hang: pread64 on memos.db accounts for 99.96% of syscall time
# % time  seconds  usecs/call  calls  syscall
 99.96    0.071932           3     22685           pread64

# Log — onTurnStart blocked for 292 seconds
memos.onTurnStart returned hits=7 durationMs=292731

# Event loop delay
liveness warning: eventLoopDelayMaxMs=285883.8 eventLoopUtilization=1

Proposed Fix

Add embeddingMaintenanceStats() to Repos (core/storage/repos/index.ts) that uses pure SQL COUNT queries instead of loading vectors:

SELECT COUNT(*) AS n FROM traces WHERE vec_summary IS NOT NULL;
SELECT COUNT(*) AS n FROM traces WHERE vec_summary IS NOT NULL AND LENGTH(vec_summary) <> @expectedLen;
-- same for vec_action, policies.vec, world_model.vec, skills.vec

Then computeEmbeddingMaintenanceStats() calls this SQL path for the common case (configured dimension matches stored dimension), falling back to the full collectEmbeddingSlots() only when a dimension mismatch is detected.
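A rough sketch of how the SQL path and the fallback decision could fit together follows. It is not the patch itself. It folds the separate COUNT queries into one statement per column, and it assumes vectors are stored as float32 so that the expected BLOB length is the dimension times 4. `SqlDb`, `SlotStats`, `SLOTS`, and `needsFullScan` are hypothetical names. `SqlDb` mirrors only the `prepare(...).get(...)` shape of a better-sqlite3 database, so the sketch runs without the dependency.

```typescript
// Assumed minimal subset of a better-sqlite3 Database (prepare + get with named params).
interface SqlStatement { get(params?: Record<string, unknown>): unknown; }
interface SqlDb { prepare(sql: string): SqlStatement; }

interface SlotStats {
  slot: string;       // "table.column"
  total: number;      // all rows in the table
  present: number;    // rows with a non-NULL vector
  missing: number;    // rows with a NULL vector
  mismatched: number; // non-NULL vectors whose byte length differs from the expected one
}

// Table and column names as listed in this issue; hard-coded, never user input.
const SLOTS = [
  { table: "traces", column: "vec_summary" },
  { table: "traces", column: "vec_action" },
  { table: "policies", column: "vec" },
  { table: "world_model", column: "vec" },
  { table: "skills", column: "vec" },
] as const;

function embeddingMaintenanceStats(db: SqlDb, expectedDims: number): SlotStats[] {
  const expectedLen = expectedDims * 4; // assumption: 4 bytes per float32 component
  return SLOTS.map(({ table, column }) => {
    const row = db.prepare(
      `SELECT COUNT(*) AS total,
              COUNT(${column}) AS present,
              COALESCE(SUM(CASE WHEN ${column} IS NOT NULL
                                 AND LENGTH(${column}) <> @expectedLen
                                THEN 1 ELSE 0 END), 0) AS mismatched
         FROM ${table}`
    ).get({ expectedLen }) as { total: number; present: number; mismatched: number };
    return {
      slot: `${table}.${column}`,
      total: row.total,
      present: row.present,
      missing: row.total - row.present,
      mismatched: row.mismatched,
    };
  });
}

// Only a detected dimension mismatch justifies the expensive collectEmbeddingSlots() path.
function needsFullScan(stats: SlotStats[]): boolean {
  return stats.some((s) => s.mismatched > 0);
}
```

The caller would return these counts directly in the common case and call the existing full scan only when `needsFullScan` is true.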

Performance Comparison

| Method | Time | Memory |
| --- | --- | --- |
| collectEmbeddingSlots() (current) | 4+ min, 100% CPU | +270MB heap |
| SQL COUNT (proposed) | ~900ms, <10% CPU | negligible |

Additional Fix: Vector Scan Bounding

The tier-2 retrieval path (scanAndTopK in core/storage/vector.ts) also does a brute-force full-table scan on every onTurnStart. With 93K rows × 1536 dims, this alone causes 5-30 second event loop blocks. Proposed: FTS pre-seed + configurable time-window (vectorScanMaxAgeMs) to bound the scan.
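One way to read the bounding proposal is sketched below. It is an interpretation, not the actual scanAndTopK change. Rows returned by an FTS pre-seed are always scored, and all other rows are scored only if they fall inside the vectorScanMaxAgeMs window. `Candidate`, `boundedTopK`, `seedIds`, and the `ts` timestamp field are hypothetical. In a real implementation the time window would more likely be a WHERE clause, so old rows are never read from SQLite at all.

```typescript
interface Candidate { id: string; ts: number; vec: Float32Array; }
interface Scored { id: string; score: number; }

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

function boundedTopK(
  query: Float32Array,
  rows: Iterable<Candidate>,
  opts: { k: number; now: number; vectorScanMaxAgeMs: number; seedIds?: ReadonlySet<string> },
): Scored[] {
  const cutoff = opts.now - opts.vectorScanMaxAgeMs;
  const scored: Scored[] = [];
  for (const row of rows) {
    const seeded = opts.seedIds?.has(row.id) ?? false;
    if (!seeded && row.ts < cutoff) continue;           // outside the window and not an FTS hit
    if (row.vec.length !== query.length) continue;      // skip dimension mismatches
    scored.push({ id: row.id, score: cosine(query, row.vec) });
  }
  // The candidate set is bounded, so a plain sort is enough here.
  return scored.sort((x, y) => y.score - x.score).slice(0, opts.k);
}
```

The scan cost then scales with the window size and the number of FTS hits rather than with the full table.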

Environment

  • memos-local-plugin v2.0.5
  • OpenClaw gateway (latest)
  • memos.db: ~93K traces, 853MB on disk
  • Node 26.0.0, better-sqlite3
  • Linux x86_64, 16 cores

I have a working patch ready and can submit a PR if the approach is approved.

Labels

  • ai-task: AutoDev task dispatched to AI coding agent
  • ai-testing: AI agent is running automated tests
  • bug: Something isn't working
  • help wanted: Extra attention is needed
  • plugin: Plugin/adapter/bridge layer (apps/ directory)
