Skip to content

main <-staging#182

Merged
ducnmm merged 57 commits into
mainfrom
staging
May 21, 2026
Merged

main <-staging#182
ducnmm merged 57 commits into
mainfrom
staging

Conversation

@ducnmm
Copy link
Copy Markdown
Collaborator

@ducnmm ducnmm commented May 21, 2026

No description provided.

hien-p and others added 30 commits April 7, 2026 16:19
PR #121 made remember/analyze async on the server: client gets HTTP 202
+ job_id back in ~500ms instead of waiting ~18s for the full Walrus
upload + chain commit. Update the Python SDK to match the new TypeScript
SDK contract so all four downstream targets (Python apps, FastAPI, AI
middleware, OpenClaw plugin) get the same UX win.

Surface (mirrors TS):
- remember() / remember_async() now return RememberAcceptedResult
- wait_for_remember_job() polls /api/remember/{id} with jittered exp
  backoff + transient retry (matches TS pollingDelayMs +
  isTransientPollingStatus)
- remember_and_wait() convenience wraps both
- Bulk family: remember_bulk[_async] / get_remember_bulk_status /
  wait_for_remember_jobs / remember_bulk_and_wait
- analyze() returns job_ids + fact_count; analyze_and_wait() polls
  every fact's remember job to completion
- embed() exposes /api/embed for raw vectors
- New typed exceptions: MemWalRememberJobNotFound / Failed / Timeout
- MemWalSync wrapper exposes every new async method

Signing:
- build_signature_message() now includes the nonce + account_id segments
  the server requires (MED-1 replay protection + LOW-23 account-hint
  binding). Client generates a UUID nonce per request and sends
  x-nonce header; without this the server rejects with HTTP 426
  "unsupported legacy SDK".

Verified end-to-end against a local server (testnet):
- remember() returned 202 in 58ms (the PR #121 win)
- wait_for_remember_job() settled blob_id after ~30s of upload+commit
- recall() returned decrypted plaintext via SEAL
- remember_bulk_and_wait() handled 3 items across 3 wallets, all done
- See packages/python-sdk-memwal/examples/interactive_demo.py for the
  reproducible demo and the full server-log evidence.
- Add docs/mcp/how-it-works.md covering auth-required vs bridged mode,
  first-run flow, local credentials, and the stdio bridge
- Add docs/mcp/changelog.mdx for the @mysten-incubation/memwal-mcp package
- overview: supported clients, client-machine behavior, why the package
  over raw HTTP, and How It Works / Changelog cards
- quick-start: login-path choice, config locations, first-run behavior,
  direct HTTP setup, and local development
- reference: first-run behavior, credential file, client config paths,
  HTTP vs stdio guidance, runtime safety notes, logout semantics
- Migrate sdk/openclaw changelogs from .md to .mdx; add SDK 0.0.3 and
  0.0.4 release entries
- run-docs-locally: note Node 20 LTS requirement for Mintlify
- docs.json: add mcp/how-it-works and mcp/changelog to nav
docs(mcp): restructure MCP docs into landing + quick-start + how-it-works + reference
…servability-tracing-monitoring-and-apm

feat: add relayer observability and metrics
Add a server-level default namespace so users set it once in MCP client
config instead of passing namespace on every tool call.

- New --namespace <name> CLI flag (alias --ns, plus --namespace= form)
- New MEMWAL_NAMESPACE env var; precedence: per-call > CLI > env > unset
  (unset = forwarded without namespace; relayer applies its own default)
- Thread resolved namespace through BridgeConfig + AuthRequiredConfig
- Pure applyDefaultNamespace() injects the default into memwal_remember,
  memwal_recall, memwal_analyze, memwal_restore tool calls when the agent
  omits namespace; explicit per-call value always wins
- memwal_restore keeps namespace required in its schema; configured default
  is only a fallback if the agent calls it without one
- Help text + config snippet; README Default Namespace section with Cursor
  and Claude Desktop examples and a manual verification note
- docs/mcp overview + reference: default-namespace behavior and examples
- Add prod/dev/staging/local relayer presets via env= on MemWalConfig,
  MemWal.create, MemWalSync.create, with_memwal_langchain,
  with_memwal_openai; export ENV_PRESETS. Precedence: explicit
  server_url > env > default; unknown preset raises ValueError
- tests/test_env_presets.py (10 cases) covering resolution + precedence
- SDK README: Environment Presets section
- docs/python-sdk: full doc set mirroring the TS SDK nav
  (quick-start, usage, usage/{memwal,memwal-manual,with-memwal},
  api-reference, changelog.mdx) written from the actual Python API
- docs.json: add Python SDK tab
feat(python-sdk): add Python SDK with async client, signing, and AI m…
feat(mcp): configurable default memory namespace
Replace the `services/ranker` placeholder with a real `CompositeRanker`
behind a `Ranker` trait. Blends semantic similarity with an optional
recency decay; default weights short-circuit to today's pgvector cosine
order so existing clients see byte-identical responses.

Score formula:

    score = semantic * (1 - distance)
          + recency  * 2^(-age_days / half_life_days)

Implementation uses `exp(-age * ln(2) / half_life)` for a true half-life
decay (a memory at the half-life mark scores exactly 0.5 in the recency
term, not 1/e ≈ 0.368 as a naive `exp(-age/half_life)` would give).

Wire changes:

- Optional `scoring_weights` on `RecallRequest` + `AskRequest`.
- `RecallResult.score: Option<f64>` with `skip_serializing_if = "Option::is_none"`
  so the field only appears when the ranker actually ran.
- `db.search_similar` now selects `created_at` alongside the cosine
  distance, threaded through `SearchHit` → `HydratedMemory` via the
  shared `zip_created_at_onto_hydrated` helper used by both `/api/recall`
  and `/api/ask`.

Validation:

- `ScoringWeights::validate()` returns 400 at the top of each handler
  (fail-fast before any embed / fetch spend) for NaN, Inf, out-of-range
  weights, or sub-floor `recency_half_life_days < MIN_HALF_LIFE_DAYS`
  (1e-6 ≈ 86 ms) — closes the subnormal-half-life gap where the recency
  term silently collapsed to zero.
- `ScoringWeights::is_ranker_active()` — single source of truth for the
  `recency.abs() >= f64::EPSILON` predicate (was duplicated in 4 sites).

Defensive math: future timestamps clamp via `.max(0)`, non-positive
half-life zeros the recency term, NaN sorts as `Equal`. `CompositeRanker`
is stateless, instantiated once in `main.rs` and shared via `Arc<dyn Ranker>`
on `AppState` — matches the existing `Embedder` / `Extractor` shape and
leaves room for a future cross-encoder reranker (Cohere / BGE) behind
the same call site.

Tests: 187 pass (was 172 on dev — 15 new). Coverage includes the
half-life formula at the half-life mark, no-op short-circuit invariant,
score-field presence/absence, full `validate()` boundary matrix
(NaN/Inf/negative/>100/subnormal half-life/recency-zero carve-out), and
a refactor guard pinning `ScoringWeights::default().recency == 0.0` so a
future change to the default can't silently activate the ranker on every
existing client.

Benchmarks (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer, OpenRouter):

- LOCOMO 3-preset: baseline 53.88 / default 53.62 / recency_heavy 53.96
- LongMemEval 2-preset: baseline 72.15 / recency_heavy 71.85

All overall deltas within judge noise (SEM ±0.72 LOCOMO, ±1.36 LongMemEval).
This is **behavior-preserving infrastructure**, not a quality lift — the
trait + opt-in plumbing unblocks MEM-54 (importance signal) and any
future reranker without re-shipping the same wire/storage threading.
Per-category data lives in the archived results: small lifts on
`multi_hop` / `adversarial` from recency-heavy weights, small dips on
`temporal` / `preference`. The `preference` −3.16 on LongMemEval is a
concrete reason not to ship `recency = 0.4` as a server default — and
the shipped default (`recency = 0.0`) sidesteps it entirely.

No-op invariant verified end-to-end: LOCOMO baseline 53.88 sits within
±1 J of the May 14 pre-ranker (54.5) and May 13 ENG-1747 (54.5 / 54.8)
baselines. The composite reranker code path exists but never alters
retrieval order under default weights.

Full archive + per-category breakdown + methodology in
`services/server/review/assessment/benchmark-runs/2026-05-18-ranker-composite-recency/`.

Closes MEM-53. Part of MEM-52 (RAG quality, cycle 13).
# Conflicts:
#	packages/python-sdk-memwal/README.md
#	packages/python-sdk-memwal/examples/async_remember_demo.py
#	packages/python-sdk-memwal/memwal/__init__.py
#	packages/python-sdk-memwal/memwal/client.py
#	packages/python-sdk-memwal/memwal/middleware.py
#	packages/python-sdk-memwal/memwal/types.py
#	packages/python-sdk-memwal/pyproject.toml
#	packages/python-sdk-memwal/tests/test_client.py
#	packages/python-sdk-memwal/tests/test_integration.py
#	packages/python-sdk-memwal/tests/test_middleware.py
#	packages/python-sdk-memwal/tests/test_signing.py
feat(python-sdk): add PyPI release workflow
* feat(server): expose prompt versions on /health (MEM-56)

Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and
ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the
benchmark harness can pin them into result-artifact metadata at run
start. Closes the attribution gap where two LOCOMO runs with different
extractor prompts produced indistinguishable JSON on disk.

HealthResponse gains a prompt_versions: PromptVersions block with
extract + ask fields. Both fields are always populated — there is no
"version unknown" state for a running server.
ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now
load-bearing.

Pinned by health_response_serializes_prompt_versions_block so a future
rename can't silently break the harness pipeline.

188/188 tests pass (was 187).

MEM-56

* feat(benchmarks): pin prompt_versions into run artifacts (MEM-56)

Read prompt_versions from GET /health at run start, fail fast if the
server doesn't expose them (no silent fallback to empty metadata), and
thread the dict onto RunArtifact so every result JSON records which
extract.v* / ask.v* produced it. Comparison table renders a
'prompt versions' row so a future 'score jumped in week N' delta is
attributable to the prompt change vs the weights change rather than
guessed at from git history.

Changes:

- core/types.py: RunArtifact gains prompt_versions: dict[str, str]
  with empty-dict default (legacy artifacts loaded by 'compare' still
  parse). Fresh runs always populate because the harness fails fast
  at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with
  a clear error if health.prompt_versions doesn't carry both extract
  and ask. On success, log the versions and stash them on config
  under _server_prompt_versions so stage_eval picks them up without a
  signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions'
  row showing extract.vN/ask.vM per preset. Empty cells for legacy
  artifacts so cross-cycle comparisons stay readable.

Manually verified end-to-end against the running server:
- /health returns {extract:extract.v1, ask:ask.v1}
- Harness fail-fast triggered against a pre-MEM-56 server (no
  prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact
  JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data

Not benchmarked end-to-end because this PR doesn't touch scoring,
extraction, or retrieval — pure metadata plumbing. A full LOCOMO +
LongMemEval re-run would reproduce yesterday's ranker numbers within
judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 /
MEM-57 where benchmarks actually buy signal.

MEM-56
…ee-deployment-template-for-memwal-relayer

docs: add Nautilus TEE relayer deployment template
…s-forced-to-register-new-delegate-key-on-every

MEM-46: refine delegate key import setup
…l-performance

perf(server): race Walrus aggregator reads
hungtranphamminh and others added 24 commits May 20, 2026 08:39
* feat(server): extract.v2 — relax fact-extraction scope to both parties

Relax the extractor prompt's user-only scope to cover memorable facts
from either party in the conversation. The v1 prompt scoped extraction
to "facts about the user", which systematically under-counted
assistant-side content (recommendations, conclusions, summaries,
plans). LongMemEval's `single_session_assistant` category sat at
~29.91 J because of this — the LLM was capable of distinguishing
user-said from assistant-said facts when asked, but the prompt was
preventing it from extracting the latter at all.

Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to
"extract.v2". The const is surfaced on `GET /health` (via MEM-56)
so every benchmark run-artifact JSON carries the version it was
produced under.

Prompt-injection guard, NONE-on-no-facts behaviour, and the
one-fact-per-line output shape are all preserved verbatim from v1.
New rules cover what the assistant says vs what to skip (acks,
restatements, formatting meta-talk).

Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer):

- LongMemEval overall: 72.15 → 76.6  (+4.45 J)
- LongMemEval `single_session_assistant`: 29.91 → 74.2  (+44.3 J,
  the cycle's first significant single-category lift)
- LongMemEval other categories: within judge noise on every other
  category (smallest move −2.1 on `temporal`)
- LOCOMO overall: 53.88 → 53.7  (flat)
- LOCOMO `single_hop`: 53.40 → 43.5  (−9.9 J, ~6 SEMs — real,
  not noise)

The LOCOMO `single_hop` regression is a dilution effect at the
recall `limit=10` cut: extract.v2 extracts +33% more facts per
conversation, so the relevant user-side fact gets pushed below
position 10 more often when synthetic single-fact-lookup queries
hit. Fix path is MEM-54 (importance signal weighting user-said
personal facts higher at ranker time) — landing next, this cycle.

Three v2 prompt variants were explored during MEM-55 development
to see if the regression could be addressed at the prompt layer.
It can't — per-turn ingestion (one /api/analyze call per speaker
turn) makes "dedup against context" impossible to implement
reliably because the LLM doesn't see the other turns. The fix
belongs at the ranker layer.

Pre-commit validation gate was per-category: LOCOMO `single_hop`
within ±2 J failed by a wide margin. Shipping anyway because the
averaged-across-benchmarks delta is +2.13 J net and the fix path
for the per-category regression is concrete and immediate
(MEM-54). Documented loudly in the benchmark archive README
rather than dressed up.

MEM-55

* chore(server): archive 2026-05-19 MEM-55 extract.v2 benchmark

Archive the LongMemEval + LOCOMO baselines that validated the
extract.v2 prompt change. All on commit 47a1f6f (current dev tip
with MEM-53 ranker + MEM-56 prompt-version pinning merged).

Headlines documented in the README:

- LongMemEval overall 76.6 (+4.45 vs v1)
- LongMemEval `single_session_assistant` 74.2 (+44.3)
- LOCOMO overall 53.7 (flat)
- LOCOMO `single_hop` 43.5 (−9.9, real not noise)

README is explicit about the validation-gate accounting: gate (3)
"LOCOMO within ±2 J" failed on the `single_hop` per-category
delta. Documents the dilution-at-recall-limit root cause, why we
ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue
and directly addresses the dilution), and what we'd do if MEM-54
doesn't deliver (revisit or revert).

This is the first benchmark archive that exercises MEM-56's
prompt-version attribution pipeline end-to-end: every artifact
JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}`
in its metadata block, so future cross-run comparisons can
attribute J-Score deltas to the prompt change without guessing.

MEM-55
…-recovery

# Conflicts:
#	apps/app/src/pages/SetupWizard.tsx
#	services/server/scripts/sidecar-server.ts
#	services/server/src/routes/remember.rs
#	services/server/src/storage/walrus.rs
#	services/server/src/types.rs
…ee-deployment-template-for-memwal-relayer

docs: clarify TEE deployment pattern
Improve sidecar upload recovery diagnostics
Feat: MEM-54 — per-fact importance signal end-to-end + extract.v3 (#177)

Surfaces the extractor LLM's implicit importance assessment as an
opt-in ranker dimension. Extractor tags each fact with a vital /
standard / trivial bucket → persisted on vector_entries.importance
(migration 009) → consumed by CompositeRanker via a weighted term
on ScoringWeights.importance. Default weights (importance: 0.0)
preserve byte-identical-to-today recall ordering.

Why categorical (vital/standard/trivial → 0.9/0.5/0.2): LLMs are
reliably good at categorical classification and unreliable at
continuous quantification (continuous scores bunch around 0.5
under uncertainty). 3-bucket gives the ranker meaningful headroom
(vital is 1.8× standard, trivial is 0.4× standard) without spreading
so wide that one mis-classification dominates the score.

Why opt-in via scoring_weights.importance (default 0.0): backward-
compatible by construction, per-request tunability without forcing a
server-wide default change, and experimental hygiene — we can A/B
the signal via the harness preset system before promoting it to a
default.

LOCOMO — recovers the MEM-55 single_hop regression and adds +4.3
overall:
- single_hop: 53.40 (v1) → 43.5 (MEM-55 v2, ❌) → 53.6 (this PR, ✅)
- Overall:   53.88 (v1) → 53.7 (MEM-55 v2)     → 58.2 (+4.3) ✅

LongMemEval — known regression on single_session_assistant
(74.2 → 62.7, −11.5 vs MEM-55 v2). Tracked as the immediate next
ticket MEM-57 (pre-extraction dedup context, Mem0 v3 pattern), which
is expected to compensate by giving the extractor stronger signal for
what's new vs already-known. If MEM-57 doesn't move the LME number, a
small prompt-softening follow-up PR is tested locally as fallback.

Infrastructure landed:
- migration 009: vector_entries.importance REAL NOT NULL DEFAULT 0.5
  (non-destructive ADD COLUMN; legacy rows degrade to neutral standard)
- ExtractedFact { text, importance } with case-insensitive bucket
  parser + legacy no-TAB fallback (handles extract.v1/v2 output during
  ingestion-format transition)
- MemoryEngine.store_blob takes importance, threaded through
  routes/analyze.rs (benchmark + production), routes/remember.rs
  (single + bulk), routes/admin.rs restore, and jobs.rs
  WalletOperation::{UploadAndTransfer, SetMetadataAndTransfer,
  FinalizeUploadedBlob} + BulkRememberItem (all with
  #[serde(default = "default_importance")] for in-flight legacy
  payload compatibility)
- HydratedMemory.importance: Option<f32> zipped from SearchHit via
  the renamed zip_search_hit_fields_onto_hydrated helper
  (consolidates created_at + importance zip into one DB→ranker pass)
- CompositeRanker gains an importance term; is_ranker_active()
  includes the new weight; tracing breadcrumbs include the weight
- ScoringWeights.importance: f64 (default 0.0); validate() bounds
  match semantic/recency [0.0, 100.0]

Prompt change (extract.v2 → extract.v3):
- 3-bucket rubric appended with concrete category definitions
- BUCKET<TAB>FACT_TEXT output format with worked examples
- FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v3 (surfaced on
  /health and in benchmark artifacts via MEM-56)

Tests: 208/208 pass (17 ranker including 6 new importance tests; 14
extractor including 4 new bucket-parser tests). Bucket distribution
verified end-to-end on the real DB after benchmark runs.

Also includes a chore commit that stops tracking benchmark result
archives + working analysis notes in the repo (team decision —
archived internally for monitoring). Historical commits that added
those files are untouched.

Closes MEM-54.
…ittee-aggregator

feat(seal): default testnet SEAL to Mysten committee aggregator
…ct.v4 (#178)

Adds the Mem0 v3 saliency-aware extraction pattern: before the extractor
LLM call, retrieve the top-K nearest existing memories for the input and
prepend them as a <related_memories> context block. The extractor uses
the context to skip duplicates and anchor borderline facts, without
merging or superseding (extraction stays ADD-only).

This is the architectural fix for the MEM-54 v3 LME single_session_
assistant regression and the LOCOMO single_hop dilution. Net result on
both benchmarks vs the pre-cycle-13 baseline (extract.v1, May 18):

LOCOMO — every category improved:
  - single_hop:   53.40 → 67.3  (+13.9)
  - multi_hop:    47.08 → 56.7  (+9.6)
  - open_domain:  52.22 → 71.5  (+19.3)
  - adversarial:  71.33 → 82.4  (+11.1)
  - temporal:     36.42 → 45.8  (+9.4)
  - Overall:      53.88 → 68.5  (+14.6, ~36 SEMs)
  First time on this codebase that LOCOMO single_hop and multi_hop match
  or beat Mem0 v2's published numbers (arXiv:2504.19413).

LongMemEval — 5 of 6 categories improved:
  - single_session_assistant:  29.91 → 57.6  (+27.7)
  - multi_session:             78.57 → 82.5  (+3.9)
  - preference:                77.83 → 80.2  (+2.4)
  - knowledge_update:          86.10 → 86.5  (+0.4)
  - single_session_user:       95.21 → 96.1  (+0.9)
  - temporal:                  62.03 → 59.5  (−2.5)
  - Overall:                   72.15 → 76.0  (+3.9)

  Known regression vs the historical-best (MEM-55 v2's 74.2 on
  single_session_assistant): MEM-57's broad dedup occasionally conflates
  a summary memory in context with the input's atomic list items,
  dropping the items as 'paraphrases'. Net is still +27.7 vs v1, but
  −16.6 vs the historical best. The deep-review root cause + paired
  prompt fix are scoped as MEM-59 (granularity-aware dedup) for the
  immediate follow-up.

## Implementation

Trait extension (extractor.rs)
- Extractor::extract_with_context(text, &[&str]) — default impl falls
  through to extract(text) so test mocks + non-analyze callers don't
  need to change.
- LlmExtractor override: short-circuits to extract() on empty slice,
  otherwise sends a 3-message payload (system + <related_memories> +
  input). The static system prompt stays cacheable; per-request context
  varies in the user-role message.
- extract() and extract_with_context() share a private
  call_chat_completion(messages) helper — single HTTP path, single
  observability point.

Prompt change (extract.v3 → extract.v4)
- Adds the <related_memories> instruction block (skip exact-paraphrase
  duplicates, anchor borderline content, ADD-only) plus a worked dedup
  example. Output format unchanged from v3 — same parser.
- FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v4 (surfaced on
  /health and in benchmark artifacts via MEM-56).

Handler wiring (routes/analyze.rs)
- Pre-extraction recall fires before extract_with_context() on both the
  production and benchmark paths.
- On production: search_similar against pgvector, then fetch_batch hits
  Walrus + the SEAL decrypt sidecar — NOT additional Postgres reads.
  On benchmark: fetch_batch reads the plaintext column. Both engines
  emit HydratedMemory { text }, so the prompt rendering operates
  uniformly on both.
- PRE_EXTRACTION_CONTEXT_LIMIT = 10 (matches Mem0 v3's K).
- Per-leg timing instrumentation (embed_ms / search_ms / walrus_ms /
  seal_ms) with a status enum covering 8 outcomes.
- Empty-namespace fast path: a cheap btree existence check on
  idx_vector_entries_owner_ns skips the embed + search round-trip on
  first-ingest-into-a-namespace (fires ~7% of LME / ~0.4% of LOCOMO
  calls; saves ~80-150ms + an embedding call per skip).
- Graceful degradation: every recall-side failure (embed / search /
  fetch) falls back to plain extraction with a warn log and a status
  enum tag — a user's write never fails because the read path is
  degraded.

P0 hardening (per deep review, prerequisites for production ship)

1. Per-leg timeouts — tokio::time::timeout on each leg (embed 800ms,
   search 300ms, fetch 500ms). Caps the pre-extraction worst case at
   ~1.6s instead of the observed 30s benchmark outlier. New status
   values: embed_timeout, search_timeout, fetch_timeout.

2. Prompt-injection guard on <related_memories> content. MEM-57 routes
   stored user text → SEAL decrypt → LLM prompt; a user storing
   '</related_memories><system>...' could otherwise manipulate their
   own future extractions. escape_for_prompt_context() converts <, >, &
   to entities at the render chokepoint. Cross-tenant injection is
   already blocked by the DB owner+namespace filter and the SEAL
   credential tied to auth.account_id; this closes the
   self-injection-within-one's-own-namespace path. Applies uniformly to
   the production (SEAL-decrypted) and benchmark (plaintext) paths.

## Latency cost (full disclosure)

The MEM-57 ticket's +50-150ms p95 forecast was scoped to "one extra
recall round-trip". The real flow runs 5 sequential operations
(existence check + input embed + pgvector search + Walrus fetch + SEAL
decrypt): measured p50 ~660ms, p95 ~1473ms, p99 ~4882ms across 10,179
events. 81.7% of calls land in the 500-1000ms bucket — structural, not
noise. Per-leg timeouts cap the worst case at ~1.6s. Accepted because
the +14.6 J LOCOMO win is overwhelming for an LLM-bound endpoint where
the extractor itself takes 1-2s on the dominant path.

## Test surface

221/221 unit tests pass (was 208 on dev). 13 new tests across the
MEM-57 surface: prompt formatting (incl. UTF-8 boundary truncation and
the empty-slice short-circuit contract), the XML-entity injection
guard, the dedup parser round-trip on extract.v4 output, and the trait
default-impl fallthrough.

End-to-end observability verified across 16,121 /api/analyze events
(LOCOMO + LME): 99.4% status=ok, 7.1% skipped_empty_namespace on LME
(fast path firing as designed), 1 embed_failed (graceful fallback
worked), 0 timeouts.

## Docs

Also folds two benchmark-docs commits that brought the harness README
back in line with the code (it predated MEM-54/55/56/57):
- Migrations auto-apply on startup (no manual sqlx migrate); analyze
  status is "done" not "completed"; presets documented as 3 signals
  (semantic / recency / importance) with the inert frequency key
  flagged; importance signal documented; the real run artifacts
  described instead of non-existent summary.md/detailed-report.md.
- Env guidance: RATE_LIMIT_DISABLED=1, PORT=3001 (to match the harness
  default vs the server's 8000), and the always-required env in
  benchmark mode (DATABASE_URL / MEMWAL_PACKAGE_ID / MEMWAL_REGISTRY_ID
  / a reachable SUI_RPC_URL — SEAL + Walrus are bypassed, auth is not).
- Added a TL;DR first-run quickstart.
- services/server/.env.example: added the benchmark section the README
  referenced (was missing).
- benchmarks/pyproject.toml: declared huggingface_hub directly (was
  only transitive via datasets).

## Migration safety

No DB schema changes in this PR. The pre-extraction flow uses the
existing idx_vector_entries_owner_ns index for both the existence check
and search_similar.

## Follow-ups

- MEM-59 (granularity-aware dedup, extract.v5) — paired prompt-only fix
  for the single_session_assistant regression. Root cause + suggested
  text already scoped from this PR's deep review.
- K=5 vs K=10 context-limit experiment (potential ~50-150ms p95 saving).
- Pre-extraction status/latency metrics (currently structured logs only).
- 100k-namespace capacity test before large-customer onboarding
  (pgvector ≤0.7 HNSW post-filter tail).

Closes MEM-57.
The SEAL session migration in this PR requires every relayer-mode signed
request to first fetch GET /config and verify the SEAL package version
via sui_getObject. test_middleware.py only mocked /api/recall and
/api/analyze, so its 4 happy-path tests broke once a real recall was
attempted (respx.AllMockedAssertionError on GET /config).

Mirror the mock_seal_session_prereqs() helper already used in
test_client.py and call it at the start of every @respx.mock test in
test_middleware.py. All 26 middleware tests now pass; full suite 107
passing.
…ueue-failure

fix(server): mark single remember jobs failed on recovery enqueue errors
…rity

fix(python-sdk): align GET signing and use seal sessions
Copy link
Copy Markdown
Collaborator

@jasong-03 jasong-03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

staging → main promotion of the same content as PR #181 (verified there: Python SDK 107/107 pytest, server jobs::tests 12/12, async/sync smoke + protocol checks all green against dev relayer). CI on this PR is also fully green — Mintlify Deployment passed this time, and Railway successfully deployed to staging.memwal.ai. Approving.

@ducnmm ducnmm merged commit c7c374d into main May 21, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants