main <-staging by ducnmm · Pull Request #182 · MystenLabs/MemWal

ducnmm · 2026-05-21T08:05:11Z

No description provided.

…iddleware

PR #121 made remember/analyze async on the server: the client gets HTTP 202 + job_id back in ~500ms instead of waiting ~18s for the full Walrus upload + chain commit. Update the Python SDK to match the new TypeScript SDK contract so all four downstream targets (Python apps, FastAPI, AI middleware, OpenClaw plugin) get the same UX win.

Surface (mirrors TS):
- remember() / remember_async() now return RememberAcceptedResult
- wait_for_remember_job() polls /api/remember/{id} with jittered exp backoff + transient retry (matches TS pollingDelayMs + isTransientPollingStatus)
- remember_and_wait() convenience wraps both
- Bulk family: remember_bulk[_async] / get_remember_bulk_status / wait_for_remember_jobs / remember_bulk_and_wait
- analyze() returns job_ids + fact_count; analyze_and_wait() polls every fact's remember job to completion
- embed() exposes /api/embed for raw vectors
- New typed exceptions: MemWalRememberJobNotFound / Failed / Timeout
- MemWalSync wrapper exposes every new async method

Signing:
- build_signature_message() now includes the nonce + account_id segments the server requires (MED-1 replay protection + LOW-23 account-hint binding). Client generates a UUID nonce per request and sends x-nonce header; without this the server rejects with HTTP 426 "unsupported legacy SDK".

Verified end-to-end against a local server (testnet):
- remember() returned 202 in 58ms (the PR #121 win)
- wait_for_remember_job() settled blob_id after ~30s of upload+commit
- recall() returned decrypted plaintext via SEAL
- remember_bulk_and_wait() handled 3 items across 3 wallets, all done
- See packages/python-sdk-memwal/examples/interactive_demo.py for the reproducible demo and the full server-log evidence.
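The polling behaviour described above (jittered exponential backoff plus retry on transient statuses) can be sketched roughly as follows. This is an illustrative sketch, not the SDK's code: `polling_delay_s`, `wait_for_job`, the `fetch_status` callback, the set of transient HTTP codes, the `"failed"` status string, and all default values are assumptions, and the real `wait_for_remember_job()` raises the typed MemWal exceptions rather than the built-in ones used here.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Assumed set of HTTP codes worth retrying; the SDK's real list may differ.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def polling_delay_s(attempt: int, base: float = 0.5, cap: float = 8.0,
                    rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def wait_for_job(fetch_status: Callable[[], Tuple[int, dict]],
                 timeout_s: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reports a terminal status or the deadline passes.

    fetch_status is a hypothetical callback returning (http_status, json_body).
    """
    deadline = time.monotonic() + timeout_s
    attempt = 0
    while time.monotonic() < deadline:
        http_status, body = fetch_status()
        if http_status == 200 and body.get("status") in ("done", "failed"):
            return body  # terminal state reached
        if http_status != 200 and http_status not in TRANSIENT_STATUSES:
            # Non-retryable error (for example 404 for an unknown job id).
            raise RuntimeError(f"polling failed with HTTP {http_status}")
        sleep(polling_delay_s(attempt))  # back off before the next poll
        attempt += 1
    raise TimeoutError("job did not settle before the deadline")
```

Injecting `sleep` keeps the loop testable without real waiting; capping the delay keeps long uploads from being polled too rarely.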

- Add docs/mcp/how-it-works.md covering auth-required vs bridged mode, first-run flow, local credentials, and the stdio bridge
- Add docs/mcp/changelog.mdx for the @mysten-incubation/memwal-mcp package
- overview: supported clients, client-machine behavior, why the package over raw HTTP, and How It Works / Changelog cards
- quick-start: login-path choice, config locations, first-run behavior, direct HTTP setup, and local development
- reference: first-run behavior, credential file, client config paths, HTTP vs stdio guidance, runtime safety notes, logout semantics
- Migrate sdk/openclaw changelogs from .md to .mdx; add SDK 0.0.3 and 0.0.4 release entries
- run-docs-locally: note Node 20 LTS requirement for Mintlify
- docs.json: add mcp/how-it-works and mcp/changelog to nav

dev <- main

docs(mcp): restructure MCP docs into landing + quick-start + how-it-works + reference

…servability-tracing-monitoring-and-apm feat: add relayer observability and metrics

Add a server-level default namespace so users set it once in MCP client config instead of passing namespace on every tool call.

- New --namespace <name> CLI flag (alias --ns, plus --namespace= form)
- New MEMWAL_NAMESPACE env var; precedence: per-call > CLI > env > unset (unset = forwarded without namespace; relayer applies its own default)
- Thread resolved namespace through BridgeConfig + AuthRequiredConfig
- Pure applyDefaultNamespace() injects the default into memwal_remember, memwal_recall, memwal_analyze, memwal_restore tool calls when the agent omits namespace; explicit per-call value always wins
- memwal_restore keeps namespace required in its schema; configured default is only a fallback if the agent calls it without one
- Help text + config snippet; README Default Namespace section with Cursor and Claude Desktop examples and a manual verification note
- docs/mcp overview + reference: default-namespace behavior and examples
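The precedence rule above (per-call > CLI > env > unset) can be illustrated with a small sketch. The MCP package itself is TypeScript; this Python re-expression is only for illustration, and `resolve_default_namespace`, `apply_default_namespace`, and their signatures are assumptions rather than the package's actual code.

```python
from typing import Optional

# Tools that receive the configured default, as listed in the commit message.
NAMESPACED_TOOLS = {
    "memwal_remember", "memwal_recall", "memwal_analyze", "memwal_restore",
}


def resolve_default_namespace(cli_value: Optional[str], env: dict) -> Optional[str]:
    """CLI flag wins over MEMWAL_NAMESPACE; None means 'unset'."""
    return cli_value or env.get("MEMWAL_NAMESPACE") or None


def apply_default_namespace(tool: str, args: dict, default: Optional[str]) -> dict:
    """Pure function: returns a new args dict and never mutates the input."""
    if tool not in NAMESPACED_TOOLS or default is None or args.get("namespace"):
        # Unknown tool, nothing configured, or the agent passed an explicit value.
        return dict(args)
    return {**args, "namespace": default}
```

When nothing is configured the arguments are forwarded unchanged, which matches the "unset = forwarded without namespace" behaviour described above.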

- Add prod/dev/staging/local relayer presets via env= on MemWalConfig, MemWal.create, MemWalSync.create, with_memwal_langchain, with_memwal_openai; export ENV_PRESETS. Precedence: explicit server_url > env > default; unknown preset raises ValueError
- tests/test_env_presets.py (10 cases) covering resolution + precedence
- SDK README: Environment Presets section
- docs/python-sdk: full doc set mirroring the TS SDK nav (quick-start, usage, usage/{memwal,memwal-manual,with-memwal}, api-reference, changelog.mdx) written from the actual Python API
- docs.json: add Python SDK tab
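A minimal sketch of the preset resolution order described above (explicit server_url > env > default). The URLs are placeholders, the choice of `prod` as the default preset is an assumption, and `resolve_server_url` is a hypothetical helper name; the SDK's real `ENV_PRESETS` values and internals are not shown in this PR text.

```python
from typing import Optional

# Placeholder URLs for illustration only; the real values live in the SDK.
ENV_PRESETS = {
    "prod": "https://relayer.prod.example.invalid",
    "dev": "https://relayer.dev.example.invalid",
    "staging": "https://relayer.staging.example.invalid",
    "local": "http://localhost:8000",
}
DEFAULT_ENV = "prod"  # assumption: which preset acts as the default


def resolve_server_url(server_url: Optional[str] = None,
                       env: Optional[str] = None) -> str:
    """Resolve the relayer URL: explicit server_url > env preset > default."""
    if server_url:
        return server_url  # an explicit URL always wins
    if env is not None:
        if env not in ENV_PRESETS:
            raise ValueError(
                f"unknown env preset {env!r}; expected one of {sorted(ENV_PRESETS)}"
            )
        return ENV_PRESETS[env]
    return ENV_PRESETS[DEFAULT_ENV]
```

In this sketch an explicit `server_url` short-circuits before the preset name is validated; whether the SDK also validates `env` in that case is not stated in the commit message.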

feat(python-sdk): add Python SDK with async client, signing, and AI m…

feat(mcp): configurable default memory namespace

Replace the `services/ranker` placeholder with a real `CompositeRanker` behind a `Ranker` trait. Blends semantic similarity with an optional recency decay; default weights short-circuit to today's pgvector cosine order so existing clients see byte-identical responses.

Score formula:

    score = semantic * (1 - distance) + recency * 2^(-age_days / half_life_days)

Implementation uses `exp(-age * ln(2) / half_life)` for a true half-life decay (a memory at the half-life mark scores exactly 0.5 in the recency term, not 1/e ≈ 0.368 as a naive `exp(-age/half_life)` would give).

Wire changes:
- Optional `scoring_weights` on `RecallRequest` + `AskRequest`.
- `RecallResult.score: Option<f64>` with `skip_serializing_if = "Option::is_none"` so the field only appears when the ranker actually ran.
- `db.search_similar` now selects `created_at` alongside the cosine distance, threaded through `SearchHit` → `HydratedMemory` via the shared `zip_created_at_onto_hydrated` helper used by both `/api/recall` and `/api/ask`.

Validation:
- `ScoringWeights::validate()` returns 400 at the top of each handler (fail-fast before any embed / fetch spend) for NaN, Inf, out-of-range weights, or sub-floor `recency_half_life_days < MIN_HALF_LIFE_DAYS` (1e-6 ≈ 86 ms). This closes the subnormal-half-life gap where the recency term silently collapsed to zero.
- `ScoringWeights::is_ranker_active()`: single source of truth for the `recency.abs() >= f64::EPSILON` predicate (was duplicated in 4 sites).

Defensive math: future timestamps clamp via `.max(0)`, non-positive half-life zeros the recency term, NaN sorts as `Equal`.

`CompositeRanker` is stateless, instantiated once in `main.rs` and shared via `Arc<dyn Ranker>` on `AppState`. This matches the existing `Embedder` / `Extractor` shape and leaves room for a future cross-encoder reranker (Cohere / BGE) behind the same call site.

Tests: 187 pass (was 172 on dev; 15 new). Coverage includes the half-life formula at the half-life mark, no-op short-circuit invariant, score-field presence/absence, full `validate()` boundary matrix (NaN/Inf/negative/>100/subnormal half-life/recency-zero carve-out), and a refactor guard pinning `ScoringWeights::default().recency == 0.0` so a future change to the default can't silently activate the ranker on every existing client.

Benchmarks (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer, OpenRouter):
- LOCOMO 3-preset: baseline 53.88 / default 53.62 / recency_heavy 53.96
- LongMemEval 2-preset: baseline 72.15 / recency_heavy 71.85

All overall deltas within judge noise (SEM ±0.72 LOCOMO, ±1.36 LongMemEval). This is **behavior-preserving infrastructure**, not a quality lift: the trait + opt-in plumbing unblocks MEM-54 (importance signal) and any future reranker without re-shipping the same wire/storage threading.

Per-category data lives in the archived results: small lifts on `multi_hop` / `adversarial` from recency-heavy weights, small dips on `temporal` / `preference`. The `preference` −3.16 on LongMemEval is a concrete reason not to ship `recency = 0.4` as a server default, and the shipped default (`recency = 0.0`) sidesteps it entirely.

No-op invariant verified end-to-end: LOCOMO baseline 53.88 sits within ±1 J of the May 14 pre-ranker (54.5) and May 13 ENG-1747 (54.5 / 54.8) baselines. The composite reranker code path exists but never alters retrieval order under default weights.

Full archive + per-category breakdown + methodology in `services/server/review/assessment/benchmark-runs/2026-05-18-ranker-composite-recency/`.

Closes MEM-53. Part of MEM-52 (RAG quality, cycle 13).
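The score formula and its defensive clamps can be checked numerically with a short sketch. The server implementation is Rust; this Python version only mirrors the arithmetic stated in the message, and the function names and the 30-day default half-life are assumptions made for illustration.

```python
import math


def recency_term(age_days: float, half_life_days: float) -> float:
    """True half-life decay: 2 ** (-age / half_life), computed via exp."""
    if half_life_days <= 0:
        return 0.0  # non-positive half-life zeros the recency term
    age = max(age_days, 0.0)  # future timestamps clamp to age 0
    return math.exp(-age * math.log(2) / half_life_days)


def composite_score(distance: float, age_days: float,
                    semantic: float = 1.0, recency: float = 0.0,
                    half_life_days: float = 30.0) -> float:
    """score = semantic * (1 - distance) + recency * recency_term(...)."""
    return semantic * (1.0 - distance) + recency * recency_term(age_days, half_life_days)
```

With the default `recency = 0.0` the second term vanishes, so ordering by score is the same as ordering by cosine distance, which is the no-op invariant the message describes.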

# Conflicts:
#   packages/python-sdk-memwal/README.md
#   packages/python-sdk-memwal/examples/async_remember_demo.py
#   packages/python-sdk-memwal/memwal/__init__.py
#   packages/python-sdk-memwal/memwal/client.py
#   packages/python-sdk-memwal/memwal/middleware.py
#   packages/python-sdk-memwal/memwal/types.py
#   packages/python-sdk-memwal/pyproject.toml
#   packages/python-sdk-memwal/tests/test_client.py
#   packages/python-sdk-memwal/tests/test_integration.py
#   packages/python-sdk-memwal/tests/test_middleware.py
#   packages/python-sdk-memwal/tests/test_signing.py

feat(python-sdk): add PyPI release workflow

* feat(server): expose prompt versions on /health (MEM-56)

Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the benchmark harness can pin them into result-artifact metadata at run start. Closes the attribution gap where two LOCOMO runs with different extractor prompts produced indistinguishable JSON on disk.

HealthResponse gains a prompt_versions: PromptVersions block with extract + ask fields. Both fields are always populated; there is no "version unknown" state for a running server. ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now load-bearing.

Pinned by health_response_serializes_prompt_versions_block so a future rename can't silently break the harness pipeline. 188/188 tests pass (was 187).

MEM-56

* feat(benchmarks): pin prompt_versions into run artifacts (MEM-56)

Read prompt_versions from GET /health at run start, fail fast if the server doesn't expose them (no silent fallback to empty metadata), and thread the dict onto RunArtifact so every result JSON records which extract.v* / ask.v* produced it. Comparison table renders a 'prompt versions' row so a future 'score jumped in week N' delta is attributable to the prompt change vs the weights change rather than guessed at from git history.

Changes:
- core/types.py: RunArtifact gains prompt_versions: dict[str, str] with empty-dict default (legacy artifacts loaded by 'compare' still parse). Fresh runs always populate because the harness fails fast at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with a clear error if health.prompt_versions doesn't carry both extract and ask. On success, log the versions and stash them on config under _server_prompt_versions so stage_eval picks them up without a signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions' row showing extract.vN/ask.vM per preset. Empty cells for legacy artifacts so cross-cycle comparisons stay readable.

Manually verified end-to-end against the running server:
- /health returns {extract:extract.v1, ask:ask.v1}
- Harness fail-fast triggered against a pre-MEM-56 server (no prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data

Not benchmarked end-to-end because this PR doesn't touch scoring, extraction, or retrieval; it is pure metadata plumbing. A full LOCOMO + LongMemEval re-run would reproduce yesterday's ranker numbers within judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 / MEM-57 where benchmarks actually buy signal.

MEM-56
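The fail-fast check on `/health` described above might look roughly like this. It is a sketch under assumptions: `require_prompt_versions` is a hypothetical helper name, the error wording is invented, and the harness's actual code in run.py is not reproduced in this PR text.

```python
def require_prompt_versions(health: dict) -> dict:
    """Return the extract/ask versions or abort; no silent empty fallback."""
    versions = health.get("prompt_versions")
    if not isinstance(versions, dict):
        raise RuntimeError(
            "server /health has no prompt_versions block; "
            "is it running a pre-MEM-56 build?"
        )
    missing = [key for key in ("extract", "ask") if not versions.get(key)]
    if missing:
        raise RuntimeError(f"prompt_versions is missing: {', '.join(missing)}")
    # Copy only the two keys the artifact metadata is expected to carry.
    return {"extract": versions["extract"], "ask": versions["ask"]}
```

Failing at startup rather than writing empty metadata means every fresh artifact can be attributed to a specific prompt version.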

…ee-deployment-template-for-memwal-relayer docs: add Nautilus TEE relayer deployment template

…s-forced-to-register-new-delegate-key-on-every MEM-46: refine delegate key import setup

…l-performance perf(server): race Walrus aggregator reads

* feat(server): extract.v2 — relax fact-extraction scope to both parties Relax the extractor prompt's user-only scope to cover memorable facts from either party in the conversation. The v1 prompt scoped extraction to "facts about the user", which systematically under-counted assistant-side content (recommendations, conclusions, summaries, plans). LongMemEval's `single_session_assistant` category sat at ~29.91 J because of this — the LLM was capable of distinguishing user-said from assistant-said facts when asked, but the prompt was preventing it from extracting the latter at all. Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to "extract.v2". The const is surfaced on `GET /health` (via MEM-56) so every benchmark run-artifact JSON carries the version it was produced under. Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape are all preserved verbatim from v1. New rules cover what the assistant says vs what to skip (acks, restatements, formatting meta-talk). Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer): - LongMemEval overall: 72.15 → 76.6 (+4.45 J) - LongMemEval `single_session_assistant`: 29.91 → 74.2 (+44.3 J, the cycle's first significant single-category lift) - LongMemEval other categories: within judge noise on every other category (smallest move −2.1 on `temporal`) - LOCOMO overall: 53.88 → 53.7 (flat) - LOCOMO `single_hop`: 53.40 → 43.5 (−9.9 J, ~6 SEMs — real, not noise) The LOCOMO `single_hop` regression is a dilution effect at the recall `limit=10` cut: extract.v2 extracts +33% more facts per conversation, so the relevant user-side fact gets pushed below position 10 more often when synthetic single-fact-lookup queries hit. Fix path is MEM-54 (importance signal weighting user-said personal facts higher at ranker time) — landing next, this cycle. Three v2 prompt variants were explored during MEM-55 development to see if the regression could be addressed at the prompt layer. 
It can't — per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably because the LLM doesn't see the other turns. The fix belongs at the ranker layer. Pre-commit validation gate was per-category: LOCOMO `single_hop` within ±2 J failed by a wide margin. Shipping anyway because the averaged-across-benchmarks delta is +2.13 J net and the fix path for the per-category regression is concrete and immediate (MEM-54). Documented loudly in the benchmark archive README rather than dressed up. MEM-55 * chore(server): archive 2026-05-19 MEM-55 extract.v2 benchmark Archive the LongMemEval + LOCOMO baselines that validated the extract.v2 prompt change. All on commit 47a1f6f (current dev tip with MEM-53 ranker + MEM-56 prompt-version pinning merged). Headlines documented in the README: - LongMemEval overall 76.6 (+4.45 vs v1) - LongMemEval `single_session_assistant` 74.2 (+44.3) - LOCOMO overall 53.7 (flat) - LOCOMO `single_hop` 43.5 (−9.9, real not noise) README is explicit about the validation-gate accounting: gate (3) "LOCOMO within ±2 J" failed on the `single_hop` per-category delta. Documents the dilution-at-recall-limit root cause, why we ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue and directly addresses the dilution), and what we'd do if MEM-54 doesn't deliver (revisit or revert). This is the first benchmark archive that exercises MEM-56's prompt-version attribution pipeline end-to-end: every artifact JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}` in its metadata block, so future cross-run comparisons can attribute J-Score deltas to the prompt change without guessing. MEM-55

…-recovery

# Conflicts:
#   apps/app/src/pages/SetupWizard.tsx
#   services/server/scripts/sidecar-server.ts
#   services/server/src/routes/remember.rs
#   services/server/src/storage/walrus.rs
#   services/server/src/types.rs

…ee-deployment-template-for-memwal-relayer docs: clarify TEE deployment pattern

Improve sidecar upload recovery diagnostics

Feat: MEM-54 — per-fact importance signal end-to-end + extract.v3 (#177) Surfaces the extractor LLM's implicit importance assessment as an opt-in ranker dimension. Extractor tags each fact with a vital / standard / trivial bucket → persisted on vector_entries.importance (migration 009) → consumed by CompositeRanker via a weighted term on ScoringWeights.importance. Default weights (importance: 0.0) preserve byte-identical-to-today recall ordering. Why categorical (vital/standard/trivial → 0.9/0.5/0.2): LLMs are reliably good at categorical classification and unreliable at continuous quantification (continuous scores bunch around 0.5 under uncertainty). 3-bucket gives the ranker meaningful headroom (vital is 1.8× standard, trivial is 0.4× standard) without spreading so wide that one mis-classification dominates the score. Why opt-in via scoring_weights.importance (default 0.0): backward- compatible by construction, per-request tunability without forcing a server-wide default change, and experimental hygiene — we can A/B the signal via the harness preset system before promoting it to a default. LOCOMO — recovers the MEM-55 single_hop regression and adds +4.3 overall: - single_hop: 53.40 (v1) → 43.5 (MEM-55 v2, ❌) → 53.6 (this PR, ✅) - Overall: 53.88 (v1) → 53.7 (MEM-55 v2) → 58.2 (+4.3) ✅ LongMemEval — known regression on single_session_assistant (74.2 → 62.7, −11.5 vs MEM-55 v2). Tracked as the immediate next ticket MEM-57 (pre-extraction dedup context, Mem0 v3 pattern), which is expected to compensate by giving the extractor stronger signal for what's new vs already-known. If MEM-57 doesn't move the LME number, a small prompt-softening follow-up PR is tested locally as fallback. 
Infrastructure landed: - migration 009: vector_entries.importance REAL NOT NULL DEFAULT 0.5 (non-destructive ADD COLUMN; legacy rows degrade to neutral standard) - ExtractedFact { text, importance } with case-insensitive bucket parser + legacy no-TAB fallback (handles extract.v1/v2 output during ingestion-format transition) - MemoryEngine.store_blob takes importance, threaded through routes/analyze.rs (benchmark + production), routes/remember.rs (single + bulk), routes/admin.rs restore, and jobs.rs WalletOperation::{UploadAndTransfer, SetMetadataAndTransfer, FinalizeUploadedBlob} + BulkRememberItem (all with #[serde(default = "default_importance")] for in-flight legacy payload compatibility) - HydratedMemory.importance: Option<f32> zipped from SearchHit via the renamed zip_search_hit_fields_onto_hydrated helper (consolidates created_at + importance zip into one DB→ranker pass) - CompositeRanker gains an importance term; is_ranker_active() includes the new weight; tracing breadcrumbs include the weight - ScoringWeights.importance: f64 (default 0.0); validate() bounds match semantic/recency [0.0, 100.0] Prompt change (extract.v2 → extract.v3): - 3-bucket rubric appended with concrete category definitions - BUCKET<TAB>FACT_TEXT output format with worked examples - FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v3 (surfaced on /health and in benchmark artifacts via MEM-56) Tests: 208/208 pass (17 ranker including 6 new importance tests; 14 extractor including 4 new bucket-parser tests). Bucket distribution verified end-to-end on the real DB after benchmark runs. Also includes a chore commit that stops tracking benchmark result archives + working analysis notes in the repo (team decision — archived internally for monitoring). Historical commits that added those files are untouched. Closes MEM-54.
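The `BUCKET<TAB>FACT_TEXT` output format, the bucket weights (0.9 / 0.5 / 0.2), and the legacy no-TAB fallback can be sketched as a small parser. The server's parser is Rust; this Python version is illustrative only, and `parse_fact_line` plus its handling of an unrecognised bucket name (treated like a legacy line) are assumptions.

```python
from typing import NamedTuple, Optional

# Bucket weights as stated in the commit message.
BUCKET_WEIGHTS = {"vital": 0.9, "standard": 0.5, "trivial": 0.2}
DEFAULT_IMPORTANCE = 0.5  # neutral "standard", also the migration 009 column default


class ExtractedFact(NamedTuple):
    text: str
    importance: float


def parse_fact_line(line: str) -> Optional[ExtractedFact]:
    """Parse one extractor output line; returns None for blank or NONE lines."""
    line = line.strip()
    if not line or line == "NONE":
        return None
    bucket, sep, text = line.partition("\t")
    if sep:
        weight = BUCKET_WEIGHTS.get(bucket.strip().lower())  # case-insensitive
        if weight is not None and text.strip():
            return ExtractedFact(text.strip(), weight)
    # Legacy extract.v1/v2 output (no TAB), or an unrecognised bucket:
    # keep the whole line as the fact with neutral importance.
    return ExtractedFact(line, DEFAULT_IMPORTANCE)
```

Falling back to 0.5 keeps older extractor output usable during the format transition without skewing the ranker toward or away from it.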

…ittee-aggregator feat(seal): default testnet SEAL to Mysten committee aggregator

…ct.v4 (#178) Adds the Mem0 v3 saliency-aware extraction pattern: before the extractor LLM call, retrieve the top-K nearest existing memories for the input and prepend them as a <related_memories> context block. The extractor uses the context to skip duplicates and anchor borderline facts, without merging or superseding (extraction stays ADD-only). This is the architectural fix for the MEM-54 v3 LME single_session_ assistant regression and the LOCOMO single_hop dilution. Net result on both benchmarks vs the pre-cycle-13 baseline (extract.v1, May 18): LOCOMO — every category improved: - single_hop: 53.40 → 67.3 (+13.9) - multi_hop: 47.08 → 56.7 (+9.6) - open_domain: 52.22 → 71.5 (+19.3) - adversarial: 71.33 → 82.4 (+11.1) - temporal: 36.42 → 45.8 (+9.4) - Overall: 53.88 → 68.5 (+14.6, ~36 SEMs) First time on this codebase that LOCOMO single_hop and multi_hop match or beat Mem0 v2's published numbers (arXiv:2504.19413). LongMemEval — 5 of 6 categories improved: - single_session_assistant: 29.91 → 57.6 (+27.7) - multi_session: 78.57 → 82.5 (+3.9) - preference: 77.83 → 80.2 (+2.4) - knowledge_update: 86.10 → 86.5 (+0.4) - single_session_user: 95.21 → 96.1 (+0.9) - temporal: 62.03 → 59.5 (−2.5) - Overall: 72.15 → 76.0 (+3.9) Known regression vs the historical-best (MEM-55 v2's 74.2 on single_session_assistant): MEM-57's broad dedup occasionally conflates a summary memory in context with the input's atomic list items, dropping the items as 'paraphrases'. Net is still +27.7 vs v1, but −16.6 vs the historical best. The deep-review root cause + paired prompt fix are scoped as MEM-59 (granularity-aware dedup) for the immediate follow-up. ## Implementation Trait extension (extractor.rs) - Extractor::extract_with_context(text, &[&str]) — default impl falls through to extract(text) so test mocks + non-analyze callers don't need to change. 
- LlmExtractor override: short-circuits to extract() on empty slice, otherwise sends a 3-message payload (system + <related_memories> + input). The static system prompt stays cacheable; per-request context varies in the user-role message. - extract() and extract_with_context() share a private call_chat_completion(messages) helper — single HTTP path, single observability point. Prompt change (extract.v3 → extract.v4) - Adds the <related_memories> instruction block (skip exact-paraphrase duplicates, anchor borderline content, ADD-only) plus a worked dedup example. Output format unchanged from v3 — same parser. - FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v4 (surfaced on /health and in benchmark artifacts via MEM-56). Handler wiring (routes/analyze.rs) - Pre-extraction recall fires before extract_with_context() on both the production and benchmark paths. - On production: search_similar against pgvector, then fetch_batch hits Walrus + the SEAL decrypt sidecar — NOT additional Postgres reads. On benchmark: fetch_batch reads the plaintext column. Both engines emit HydratedMemory { text }, so the prompt rendering operates uniformly on both. - PRE_EXTRACTION_CONTEXT_LIMIT = 10 (matches Mem0 v3's K). - Per-leg timing instrumentation (embed_ms / search_ms / walrus_ms / seal_ms) with a status enum covering 8 outcomes. - Empty-namespace fast path: a cheap btree existence check on idx_vector_entries_owner_ns skips the embed + search round-trip on first-ingest-into-a-namespace (fires ~7% of LME / ~0.4% of LOCOMO calls; saves ~80-150ms + an embedding call per skip). - Graceful degradation: every recall-side failure (embed / search / fetch) falls back to plain extraction with a warn log and a status enum tag — a user's write never fails because the read path is degraded. P0 hardening (per deep review, prerequisites for production ship) 1. Per-leg timeouts — tokio::time::timeout on each leg (embed 800ms, search 300ms, fetch 500ms). 
Caps the pre-extraction worst case at ~1.6s instead of the observed 30s benchmark outlier. New status values: embed_timeout, search_timeout, fetch_timeout. 2. Prompt-injection guard on <related_memories> content. MEM-57 routes stored user text → SEAL decrypt → LLM prompt; a user storing '</related_memories><system>...' could otherwise manipulate their own future extractions. escape_for_prompt_context() converts <, >, & to entities at the render chokepoint. Cross-tenant injection is already blocked by the DB owner+namespace filter and the SEAL credential tied to auth.account_id; this closes the self-injection-within-one's-own-namespace path. Applies uniformly to the production (SEAL-decrypted) and benchmark (plaintext) paths. ## Latency cost (full disclosure) The MEM-57 ticket's +50-150ms p95 forecast was scoped to "one extra recall round-trip". The real flow runs 5 sequential operations (existence check + input embed + pgvector search + Walrus fetch + SEAL decrypt): measured p50 ~660ms, p95 ~1473ms, p99 ~4882ms across 10,179 events. 81.7% of calls land in the 500-1000ms bucket — structural, not noise. Per-leg timeouts cap the worst case at ~1.6s. Accepted because the +14.6 J LOCOMO win is overwhelming for an LLM-bound endpoint where the extractor itself takes 1-2s on the dominant path. ## Test surface 221/221 unit tests pass (was 208 on dev). 13 new tests across the MEM-57 surface: prompt formatting (incl. UTF-8 boundary truncation and the empty-slice short-circuit contract), the XML-entity injection guard, the dedup parser round-trip on extract.v4 output, and the trait default-impl fallthrough. End-to-end observability verified across 16,121 /api/analyze events (LOCOMO + LME): 99.4% status=ok, 7.1% skipped_empty_namespace on LME (fast path firing as designed), 1 embed_failed (graceful fallback worked), 0 timeouts. 
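The per-leg timeout and graceful-degradation behaviour described above can be sketched with `asyncio.wait_for`. The server uses `tokio::time::timeout` in Rust; this Python analogue is illustrative only. The timeout values and the `*_timeout` / `embed_failed` status names come from the message, while `run_leg`, `pre_extraction_context`, the callback signatures, and the `search_failed` / `fetch_failed` names are assumptions.

```python
import asyncio

# Per-leg budgets from the commit message: 800 ms + 300 ms + 500 ms = ~1.6 s.
LEG_TIMEOUTS_S = {"embed": 0.8, "search": 0.3, "fetch": 0.5}


async def run_leg(name, coro):
    """Run one leg under its timeout; returns (result, status)."""
    try:
        return await asyncio.wait_for(coro, timeout=LEG_TIMEOUTS_S[name]), "ok"
    except asyncio.TimeoutError:
        return None, f"{name}_timeout"
    except Exception:
        return None, f"{name}_failed"


async def pre_extraction_context(embed, search, fetch, text):
    """Return (memories, status); any failure degrades to an empty context."""
    vector, status = await run_leg("embed", embed(text))
    if status != "ok":
        return [], status
    hits, status = await run_leg("search", search(vector))
    if status != "ok":
        return [], status
    memories, status = await run_leg("fetch", fetch(hits))
    if status != "ok":
        return [], status
    return memories, "ok"
```

An empty context list simply means the caller falls back to plain extraction, so a degraded read path never fails the user's write.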
## Docs Also folds two benchmark-docs commits that brought the harness README back in line with the code (it predated MEM-54/55/56/57): - Migrations auto-apply on startup (no manual sqlx migrate); analyze status is "done" not "completed"; presets documented as 3 signals (semantic / recency / importance) with the inert frequency key flagged; importance signal documented; the real run artifacts described instead of non-existent summary.md/detailed-report.md. - Env guidance: RATE_LIMIT_DISABLED=1, PORT=3001 (to match the harness default vs the server's 8000), and the always-required env in benchmark mode (DATABASE_URL / MEMWAL_PACKAGE_ID / MEMWAL_REGISTRY_ID / a reachable SUI_RPC_URL — SEAL + Walrus are bypassed, auth is not). - Added a TL;DR first-run quickstart. - services/server/.env.example: added the benchmark section the README referenced (was missing). - benchmarks/pyproject.toml: declared huggingface_hub directly (was only transitive via datasets). ## Migration safety No DB schema changes in this PR. The pre-extraction flow uses the existing idx_vector_entries_owner_ns index for both the existence check and search_similar. ## Follow-ups - MEM-59 (granularity-aware dedup, extract.v5) — paired prompt-only fix for the single_session_assistant regression. Root cause + suggested text already scoped from this PR's deep review. - K=5 vs K=10 context-limit experiment (potential ~50-150ms p95 saving). - Pre-extraction status/latency metrics (currently structured logs only). - 100k-namespace capacity test before large-customer onboarding (pgvector ≤0.7 HNSW post-filter tail). Closes MEM-57.
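The prompt-injection guard on `<related_memories>` content (converting `<`, `>`, `&` to entities at the render chokepoint) can be illustrated as follows. The function name `escape_for_prompt_context` is taken from the message, but this Python body is a sketch rather than the Rust implementation, and `render_related_memories` with its one-bullet-per-memory layout is an assumption.

```python
from typing import Iterable


def escape_for_prompt_context(text: str) -> str:
    """Escape &, <, > so stored text cannot close or open prompt tags."""
    # Replace & first so the entities emitted below are not double-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def render_related_memories(memories: Iterable[str]) -> str:
    """Render the context block; every memory passes through the escaper."""
    body = "\n".join(f"- {escape_for_prompt_context(m)}" for m in memories)
    return f"<related_memories>\n{body}\n</related_memories>"
```

Because escaping happens in the single render function, both the SEAL-decrypted and plaintext paths get the same protection.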

The SEAL session migration in this PR requires every relayer-mode signed request to first fetch GET /config and verify the SEAL package version via sui_getObject. test_middleware.py only mocked /api/recall and /api/analyze, so its 4 happy-path tests broke once a real recall was attempted (respx.AllMockedAssertionError on GET /config). Mirror the mock_seal_session_prereqs() helper already used in test_client.py and call it at the start of every @respx.mock test in test_middleware.py. All 26 middleware tests now pass; full suite 107 passing.

…ueue-failure fix(server): mark single remember jobs failed on recovery enqueue errors

…rity fix(python-sdk): align GET signing and use seal sessions

staging <- dev

jasong-03

staging → main promotion of the same content as PR #181 (verified there: Python SDK 107/107 pytest, server jobs::tests 12/12, async/sync smoke + protocol checks all green against dev relayer). CI on this PR is also fully green — Mintlify Deployment passed this time, and Railway successfully deployed to staging.memwal.ai. Approving.

hien-p and others added 30 commits April 7, 2026 16:19

feat(python-sdk): add Python SDK with async client, signing, and AI m…

c88dda1

…iddleware

Merge remote-tracking branch 'origin/dev' into feat/python-sdk-memwal

7a91c6c

Merge remote-tracking branch 'origin/dev' into feat/python-sdk-memwal

a35778e

feat: add sidecar upload diagnostics

2a8a5cc

chore: log sidecar runtime state

2c825ac

Fix Walrus sidecar stale state handling

2a90cdc

Add existing delegate key import to setup

cfbbca3

feat: add relayer observability and metrics

9471689

docs(openclaw): add Changelog card to overview

6667168

Merge pull request #163 from MystenLabs/main

79a5dc9

dev <- main

Merge pull request #162 from MystenLabs/docs/restructure-mcp-page

1a1d5a2

docs(mcp): restructure MCP docs into landing + quick-start + how-it-works + reference

Merge pull request #161 from MystenLabs/feature/mem-31-add-relayer-ob…

efcc1b9

…servability-tracing-monitoring-and-apm feat: add relayer observability and metrics

feat(python-sdk): add PyPI release workflow

944f00a

Merge pull request #80 from MystenLabs/feat/python-sdk-memwal

2ec55b3

feat(python-sdk): add Python SDK with async client, signing, and AI m…

Merge pull request #165 from MystenLabs/feat/mcp-default-namespace

db42aaf

feat(mcp): configurable default memory namespace

perf(server): race walrus aggregator reads

bbf787b

chore(ci): wait after benchmark remember phase

5fcc977

fix(app): refine delegate key import setup

17ea770

docs: add Nautilus TEE relayer deployment template

f863069

Merge pull request #164 from MystenLabs/chore/python-sdk-release

90e59b5

feat(python-sdk): add PyPI release workflow

Merge pull request #171 from MystenLabs/feature/mem-51-add-nautilus-t…

6946981

…ee-deployment-template-for-memwal-relayer docs: add Nautilus TEE relayer deployment template

Merge pull request #170 from MystenLabs/feature/mem-46-dashboard-user…

6170bed

…s-forced-to-register-new-delegate-key-on-every MEM-46: refine delegate key import setup

Merge pull request #166 from MystenLabs/feature/eng-1768-walrus-recal…

8db8aee

…l-performance perf(server): race Walrus aggregator reads

hungtranphamminh and others added 24 commits May 20, 2026 08:39

Improve sidecar upload recovery diagnostics

80e6cd9

Remove unrelated setup wizard change

d97142a

Fix upload recovery retry handling

2b45f9d

Merge pull request #175 from MystenLabs/feature/mem-51-add-nautilus-t…

1176c40

…ee-deployment-template-for-memwal-relayer docs: clarify TEE deployment pattern

Merge pull request #174 from MystenLabs/improve-sidecar-upload-recovery

e5b98ca

Improve sidecar upload recovery diagnostics

feat(seal): default to Mysten committee aggregator

05f2568

test(seal): cover legacy independent threshold

5db484d

chore(seal): keep SDK manual defaults unchanged

ae28740

docs(seal): clarify default config scope

5bf97ff

Merge pull request #176 from MystenLabs/feat/default-seal-mysten-comm…

933a0b2

…ittee-aggregator feat(seal): default testnet SEAL to Mysten committee aggregator

fix(python-sdk): align GET signing and use seal sessions

1fd4d91

test(python-sdk): align integration tests with async remember

904f512

fix(server): mark single remember jobs failed on recovery enqueue errors

ab9fa73

test(server): add live remember terminal-state check

acec445

fix(server): persist remember handoff failures

10b7fc4

test(server): isolate remember job db setup

cda4cea

Merge pull request #180 from MystenLabs/fix/remember-job-terminal-enq…

5b1398d

…ueue-failure fix(server): mark single remember jobs failed on recovery enqueue errors

Merge pull request #179 from MystenLabs/fix/python-sdk-session-get-pa…

b2d4b12

…rity fix(python-sdk): align GET signing and use seal sessions

Merge pull request #181 from MystenLabs/dev

e7dee7f

staging <- dev

ducnmm requested review from daniellam258, harrymove-ctrl, hungtranphamminh and jasong-03 May 21, 2026 08:05

jasong-03 approved these changes May 21, 2026

ducnmm merged commit c7c374d into main May 21, 2026
21 checks passed

ducnmm merged 57 commits into main from staging

5 participants
