PR #121 made remember/analyze async on the server: the client gets HTTP 202 + job_id back in ~500ms instead of waiting ~18s for the full Walrus upload + chain commit. Update the Python SDK to match the new TypeScript SDK contract so all four downstream targets (Python apps, FastAPI, AI middleware, OpenClaw plugin) get the same UX win.
Surface (mirrors TS):
- remember() / remember_async() now return RememberAcceptedResult
- wait_for_remember_job() polls /api/remember/{id} with jittered exp backoff + transient retry (matches TS pollingDelayMs + isTransientPollingStatus)
- remember_and_wait() convenience wraps both
- Bulk family: remember_bulk[_async] / get_remember_bulk_status / wait_for_remember_jobs / remember_bulk_and_wait
- analyze() returns job_ids + fact_count; analyze_and_wait() polls every fact's remember job to completion
- embed() exposes /api/embed for raw vectors
- New typed exceptions: MemWalRememberJobNotFound / Failed / Timeout
- MemWalSync wrapper exposes every new async method
Signing:
- build_signature_message() now includes the nonce + account_id segments the server requires (MED-1 replay protection + LOW-23 account-hint binding). Client generates a UUID nonce per request and sends x-nonce header; without this the server rejects with HTTP 426 "unsupported legacy SDK".
Verified end-to-end against a local server (testnet):
- remember() returned 202 in 58ms (the PR #121 win)
- wait_for_remember_job() settled blob_id after ~30s of upload+commit
- recall() returned decrypted plaintext via SEAL
- remember_bulk_and_wait() handled 3 items across 3 wallets, all done
- See packages/python-sdk-memwal/examples/interactive_demo.py for the reproducible demo and the full server-log evidence.
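For readers unfamiliar with the polling pattern named above, here is a minimal sketch of jittered exponential backoff with transient-status retry. It is not the SDK's implementation: `polling_delay_s`, `wait_for_job`, the `fetch_status` callable, the set of transient status codes, and the `"done"` / `"failed"` state names are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Dict, Tuple

# Assumption: which HTTP statuses count as transient is illustrative only.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def polling_delay_s(attempt: int, base: float = 0.5, cap: float = 5.0) -> float:
    """Exponential backoff capped at `cap`, with full jitter in [0, delay]."""
    delay = min(cap, base * (2 ** min(attempt, 16)))
    return random.uniform(0.0, delay)


def wait_for_job(
    fetch_status: Callable[[], Tuple[int, Dict]],
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal state or the deadline passes.

    `fetch_status` is a hypothetical callable returning (http_status, body).
    """
    deadline = time.monotonic() + timeout_s
    attempt = 0
    while time.monotonic() < deadline:
        status, body = fetch_status()
        if status == 404:
            raise LookupError("remember job not found")
        if status not in TRANSIENT_STATUSES:
            if status != 200:
                raise RuntimeError(f"unexpected HTTP status {status}")
            if body.get("status") == "done":
                return body
            if body.get("status") == "failed":
                raise RuntimeError("remember job failed")
        # Transient error or job still running: back off, then poll again.
        sleep(polling_delay_s(attempt))
        attempt += 1
    raise TimeoutError("remember job did not settle before the deadline")
```

The jitter keeps many clients that submitted jobs at the same moment from polling in lockstep, and the cap bounds the extra latency added once the job has actually settled.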
- Add docs/mcp/how-it-works.md covering auth-required vs bridged mode, first-run flow, local credentials, and the stdio bridge
- Add docs/mcp/changelog.mdx for the @mysten-incubation/memwal-mcp package
- overview: supported clients, client-machine behavior, why the package over raw HTTP, and How It Works / Changelog cards
- quick-start: login-path choice, config locations, first-run behavior, direct HTTP setup, and local development
- reference: first-run behavior, credential file, client config paths, HTTP vs stdio guidance, runtime safety notes, logout semantics
- Migrate sdk/openclaw changelogs from .md to .mdx; add SDK 0.0.3 and 0.0.4 release entries
- run-docs-locally: note Node 20 LTS requirement for Mintlify
- docs.json: add mcp/how-it-works and mcp/changelog to nav
dev <- main
docs(mcp): restructure MCP docs into landing + quick-start + how-it-works + reference
…servability-tracing-monitoring-and-apm feat: add relayer observability and metrics
Add a server-level default namespace so users set it once in MCP client config instead of passing namespace on every tool call.
- New --namespace <name> CLI flag (alias --ns, plus --namespace= form)
- New MEMWAL_NAMESPACE env var; precedence: per-call > CLI > env > unset (unset = forwarded without namespace; relayer applies its own default)
- Thread resolved namespace through BridgeConfig + AuthRequiredConfig
- Pure applyDefaultNamespace() injects the default into memwal_remember, memwal_recall, memwal_analyze, memwal_restore tool calls when the agent omits namespace; explicit per-call value always wins
- memwal_restore keeps namespace required in its schema; configured default is only a fallback if the agent calls it without one
- Help text + config snippet; README Default Namespace section with Cursor and Claude Desktop examples and a manual verification note
- docs/mcp overview + reference: default-namespace behavior and examples
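The precedence rule above (per-call > CLI > env > unset) can be sketched in a few lines. This is an illustration in Python, not the MCP package's actual code: `resolve_default_namespace` and `apply_default_namespace` are hypothetical names that only mirror the behavior the commit message describes.

```python
from typing import Dict, Mapping, Optional

# Tools listed in the commit message as receiving the default namespace.
NAMESPACED_TOOLS = {
    "memwal_remember",
    "memwal_recall",
    "memwal_analyze",
    "memwal_restore",
}


def resolve_default_namespace(
    cli_value: Optional[str], env: Mapping[str, str]
) -> Optional[str]:
    """CLI flag beats MEMWAL_NAMESPACE; None means 'unset'."""
    if cli_value:
        return cli_value
    return env.get("MEMWAL_NAMESPACE") or None


def apply_default_namespace(
    tool: str, args: Dict, default: Optional[str]
) -> Dict:
    """Return a copy of `args`, injecting the default only when it is omitted."""
    if tool not in NAMESPACED_TOOLS or default is None or args.get("namespace"):
        return dict(args)  # explicit per-call value (or no default) wins
    return {**args, "namespace": default}
```

Keeping the injection step pure (it returns a new dict and touches no global state) makes the precedence easy to unit-test without starting a server.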
- Add prod/dev/staging/local relayer presets via env= on MemWalConfig,
MemWal.create, MemWalSync.create, with_memwal_langchain,
with_memwal_openai; export ENV_PRESETS. Precedence: explicit
server_url > env > default; unknown preset raises ValueError
- tests/test_env_presets.py (10 cases) covering resolution + precedence
- SDK README: Environment Presets section
- docs/python-sdk: full doc set mirroring the TS SDK nav
(quick-start, usage, usage/{memwal,memwal-manual,with-memwal},
api-reference, changelog.mdx) written from the actual Python API
- docs.json: add Python SDK tab
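The preset precedence described in this commit (explicit server_url > env > default, unknown preset raises ValueError) can be sketched as follows. The URLs are placeholders because the real preset values are not given here, and both `resolve_server_url` and the choice of `prod` as the default are assumptions for illustration.

```python
from typing import Optional

# Placeholder URLs: the real preset values are not stated in the commit message.
ENV_PRESETS = {
    "prod": "https://relayer.prod.example.invalid",
    "dev": "https://relayer.dev.example.invalid",
    "staging": "https://relayer.staging.example.invalid",
    "local": "http://localhost:8000",
}
# Assumption: which preset acts as the default is illustrative only.
DEFAULT_SERVER_URL = ENV_PRESETS["prod"]


def resolve_server_url(
    server_url: Optional[str] = None, env: Optional[str] = None
) -> str:
    """Explicit server_url wins, then the env preset, then the default."""
    if server_url:
        return server_url
    if env is not None:
        if env not in ENV_PRESETS:
            known = ", ".join(sorted(ENV_PRESETS))
            raise ValueError(f"unknown env preset {env!r}; expected one of: {known}")
        return ENV_PRESETS[env]
    return DEFAULT_SERVER_URL
```

Raising on an unknown preset, rather than silently falling back to the default, avoids pointing a misconfigured client at the wrong relayer.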
feat(python-sdk): add Python SDK with async client, signing, and AI m…
feat(mcp): configurable default memory namespace
Replace the `services/ranker` placeholder with a real `CompositeRanker`
behind a `Ranker` trait. Blends semantic similarity with an optional
recency decay; default weights short-circuit to today's pgvector cosine
order so existing clients see byte-identical responses.
Score formula:
score = semantic * (1 - distance)
+ recency * 2^(-age_days / half_life_days)
Implementation uses `exp(-age * ln(2) / half_life)` for a true half-life
decay (a memory at the half-life mark scores exactly 0.5 in the recency
term, not 1/e ≈ 0.368 as a naive `exp(-age/half_life)` would give).
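A small numeric check of the half-life claim, written in Python rather than the server's Rust. The function names, the default `semantic` weight of 1.0, and the 30-day half-life are assumptions for illustration; only the formula and the default `recency = 0.0` come from the text above.

```python
import math


def recency_term(age_days: float, half_life_days: float) -> float:
    """True half-life decay: 2^(-age / half_life), computed via exp()."""
    if half_life_days <= 0:
        return 0.0  # non-positive half-life zeros the recency term
    age = max(age_days, 0.0)  # future timestamps clamp to age 0
    return math.exp(-age * math.log(2) / half_life_days)


def composite_score(
    distance: float,
    age_days: float,
    semantic: float = 1.0,         # assumed default weight
    recency: float = 0.0,          # default weight per the commit message
    half_life_days: float = 30.0,  # assumed value, for illustration
) -> float:
    """score = semantic * (1 - distance) + recency * 2^(-age / half_life)."""
    return semantic * (1.0 - distance) + recency * recency_term(age_days, half_life_days)


# At the half-life mark the recency term is 0.5; the naive form gives 1/e.
at_half_life = recency_term(30.0, 30.0)
naive = math.exp(-30.0 / 30.0)
```

With `recency = 0.0` the second term vanishes, so the score is a monotone function of the cosine distance alone, which is why the default weights cannot reorder results.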
Wire changes:
- Optional `scoring_weights` on `RecallRequest` + `AskRequest`.
- `RecallResult.score: Option<f64>` with `skip_serializing_if = "Option::is_none"`
so the field only appears when the ranker actually ran.
- `db.search_similar` now selects `created_at` alongside the cosine
distance, threaded through `SearchHit` → `HydratedMemory` via the
shared `zip_created_at_onto_hydrated` helper used by both `/api/recall`
and `/api/ask`.
Validation:
- `ScoringWeights::validate()` returns 400 at the top of each handler
(fail-fast before any embed / fetch spend) for NaN, Inf, out-of-range
weights, or sub-floor `recency_half_life_days < MIN_HALF_LIFE_DAYS`
(1e-6 ≈ 86 ms) — closes the subnormal-half-life gap where the recency
term silently collapsed to zero.
- `ScoringWeights::is_ranker_active()` — single source of truth for the
`recency.abs() >= f64::EPSILON` predicate (was duplicated in 4 sites).
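The validation rules can be sketched as below, in Python rather than Rust. This is an approximation under stated assumptions: `ValueError` stands in for the HTTP 400, the default `semantic = 1.0` and 30-day half-life are made up for the example, and checking the half-life floor only when the ranker is active is one reading of the "recency-zero carve-out" mentioned in the tests.

```python
import math
import sys
from dataclasses import dataclass

MIN_HALF_LIFE_DAYS = 1e-6  # about 86 ms
MAX_WEIGHT = 100.0


@dataclass
class ScoringWeights:
    semantic: float = 1.0                  # assumed default
    recency: float = 0.0                   # default per the commit message
    recency_half_life_days: float = 30.0   # assumed default

    def is_ranker_active(self) -> bool:
        """Single predicate for 'does the recency term matter at all'."""
        return abs(self.recency) >= sys.float_info.epsilon

    def validate(self) -> None:
        """Raise ValueError (standing in for HTTP 400) on bad weights."""
        for name in ("semantic", "recency"):
            weight = getattr(self, name)
            if not math.isfinite(weight) or not 0.0 <= weight <= MAX_WEIGHT:
                raise ValueError(f"{name} must be a finite value in [0, {MAX_WEIGHT}]")
        half_life = self.recency_half_life_days
        if not math.isfinite(half_life):
            raise ValueError("recency_half_life_days must be finite")
        # Assumption: the floor only applies when the recency term is in use.
        if self.is_ranker_active() and half_life < MIN_HALF_LIFE_DAYS:
            raise ValueError("recency_half_life_days is below the minimum")
```

Running this check before any embedding or fetch work is what makes the rejection cheap: a malformed request costs one comparison pass instead of a model call.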
Defensive math: future timestamps clamp via `.max(0)`, non-positive
half-life zeros the recency term, NaN sorts as `Equal`. `CompositeRanker`
is stateless, instantiated once in `main.rs` and shared via `Arc<dyn Ranker>`
on `AppState` — matches the existing `Embedder` / `Extractor` shape and
leaves room for a future cross-encoder reranker (Cohere / BGE) behind
the same call site.
Tests: 187 pass (was 172 on dev — 15 new). Coverage includes the
half-life formula at the half-life mark, no-op short-circuit invariant,
score-field presence/absence, full `validate()` boundary matrix
(NaN/Inf/negative/>100/subnormal half-life/recency-zero carve-out), and
a refactor guard pinning `ScoringWeights::default().recency == 0.0` so a
future change to the default can't silently activate the ranker on every
existing client.
Benchmarks (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer, OpenRouter):
- LOCOMO 3-preset: baseline 53.88 / default 53.62 / recency_heavy 53.96
- LongMemEval 2-preset: baseline 72.15 / recency_heavy 71.85
All overall deltas within judge noise (SEM ±0.72 LOCOMO, ±1.36 LongMemEval).
This is **behavior-preserving infrastructure**, not a quality lift — the
trait + opt-in plumbing unblocks MEM-54 (importance signal) and any
future reranker without re-shipping the same wire/storage threading.
Per-category data lives in the archived results: small lifts on
`multi_hop` / `adversarial` from recency-heavy weights, small dips on
`temporal` / `preference`. The `preference` −3.16 on LongMemEval is a
concrete reason not to ship `recency = 0.4` as a server default — and
the shipped default (`recency = 0.0`) sidesteps it entirely.
No-op invariant verified end-to-end: LOCOMO baseline 53.88 sits within
±1 J of the May 14 pre-ranker (54.5) and May 13 ENG-1747 (54.5 / 54.8)
baselines. The composite reranker code path exists but never alters
retrieval order under default weights.
Full archive + per-category breakdown + methodology in
`services/server/review/assessment/benchmark-runs/2026-05-18-ranker-composite-recency/`.
Closes MEM-53. Part of MEM-52 (RAG quality, cycle 13).
# Conflicts:
# packages/python-sdk-memwal/README.md
# packages/python-sdk-memwal/examples/async_remember_demo.py
# packages/python-sdk-memwal/memwal/__init__.py
# packages/python-sdk-memwal/memwal/client.py
# packages/python-sdk-memwal/memwal/middleware.py
# packages/python-sdk-memwal/memwal/types.py
# packages/python-sdk-memwal/pyproject.toml
# packages/python-sdk-memwal/tests/test_client.py
# packages/python-sdk-memwal/tests/test_integration.py
# packages/python-sdk-memwal/tests/test_middleware.py
# packages/python-sdk-memwal/tests/test_signing.py
feat(python-sdk): add PyPI release workflow
* feat(server): expose prompt versions on /health (MEM-56)
Surface FACT_EXTRACTION_PROMPT_VERSION (extractor.rs) and
ASK_SYSTEM_PROMPT_VERSION (admin.rs) on the /health response so the
benchmark harness can pin them into result-artifact metadata at run
start. Closes the attribution gap where two LOCOMO runs with different
extractor prompts produced indistinguishable JSON on disk.
HealthResponse gains a prompt_versions: PromptVersions block with
extract + ask fields. Both fields are always populated — there is no
"version unknown" state for a running server.
ASK_SYSTEM_PROMPT_VERSION loses its #[allow(dead_code)] since it's now
load-bearing.
Pinned by health_response_serializes_prompt_versions_block so a future
rename can't silently break the harness pipeline.
188/188 tests pass (was 187).
MEM-56
* feat(benchmarks): pin prompt_versions into run artifacts (MEM-56)
Read prompt_versions from GET /health at run start, fail fast if the
server doesn't expose them (no silent fallback to empty metadata), and
thread the dict onto RunArtifact so every result JSON records which
extract.v* / ask.v* produced it. Comparison table renders a
'prompt versions' row so a future 'score jumped in week N' delta is
attributable to the prompt change vs the weights change rather than
guessed at from git history.
Changes:
- core/types.py: RunArtifact gains prompt_versions: dict[str, str]
with empty-dict default (legacy artifacts loaded by 'compare' still
parse). Fresh runs always populate because the harness fails fast
at startup when the field is missing.
- run.py: at server boot check, after the mode validation, abort with
a clear error if health.prompt_versions doesn't carry both extract
and ask. On success, log the versions and stash them on config
under _server_prompt_versions so stage_eval picks them up without a
signature change.
- core/report.py: generate_comparison_table renders a 'prompt versions'
row showing extract.vN/ask.vM per preset. Empty cells for legacy
artifacts so cross-cycle comparisons stay readable.
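The fail-fast check described for run.py might look roughly like this. It is a sketch only: `require_prompt_versions` is a hypothetical name, the error wording is invented, and the `health` argument is assumed to be the already-parsed JSON body of GET /health.

```python
from typing import Dict


def require_prompt_versions(health: Dict) -> Dict[str, str]:
    """Return the extract/ask versions, or abort if the server lacks them.

    `health` is assumed to be the parsed JSON body of GET /health.
    """
    versions = health.get("prompt_versions")
    if not isinstance(versions, dict):
        raise RuntimeError(
            "server /health has no prompt_versions block; "
            "refusing to run without attributable prompt metadata"
        )
    missing = [key for key in ("extract", "ask") if not versions.get(key)]
    if missing:
        raise RuntimeError(f"prompt_versions is missing: {', '.join(missing)}")
    return {"extract": str(versions["extract"]), "ask": str(versions["ask"])}
```

Aborting here, instead of defaulting to an empty dict, guarantees that every fresh artifact carries both versions, while the empty-dict default on RunArtifact remains only for loading legacy files.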
Manually verified end-to-end against the running server:
- /health returns {extract:extract.v1, ask:ask.v1}
- Harness fail-fast triggered against a pre-MEM-56 server (no
prompt_versions field) with the documented error message
- Stand-alone Python script confirmed: server -> harness -> artifact
JSON contains the prompt_versions block
- Comparison table renders the row with synthetic data
Not benchmarked end-to-end because this PR doesn't touch scoring,
extraction, or retrieval — pure metadata plumbing. A full LOCOMO +
LongMemEval re-run would reproduce yesterday's ranker numbers within
judge noise at ~$8 spend. Saving that budget for MEM-54 / MEM-55 /
MEM-57 where benchmarks actually buy signal.
MEM-56
…ee-deployment-template-for-memwal-relayer docs: add Nautilus TEE relayer deployment template
…s-forced-to-register-new-delegate-key-on-every MEM-46: refine delegate key import setup
…l-performance perf(server): race Walrus aggregator reads
* feat(server): extract.v2 - relax fact-extraction scope to both parties
Relax the extractor prompt's user-only scope to cover memorable facts from either party in the conversation. The v1 prompt scoped extraction to "facts about the user", which systematically under-counted assistant-side content (recommendations, conclusions, summaries, plans). LongMemEval's `single_session_assistant` category sat at ~29.91 J because of this: the LLM was capable of distinguishing user-said from assistant-said facts when asked, but the prompt was preventing it from extracting the latter at all.
Bumps `FACT_EXTRACTION_PROMPT_VERSION` from "extract.v1" to "extract.v2". The const is surfaced on `GET /health` (via MEM-56) so every benchmark run-artifact JSON carries the version it was produced under.
Prompt-injection guard, NONE-on-no-facts behaviour, and the one-fact-per-line output shape are all preserved verbatim from v1. New rules cover what the assistant says vs what to skip (acks, restatements, formatting meta-talk).
Benchmark headline (PlaintextEngine, gpt-4o judge, gpt-4o-mini answer):
- LongMemEval overall: 72.15 → 76.6 (+4.45 J)
- LongMemEval `single_session_assistant`: 29.91 → 74.2 (+44.3 J, the cycle's first significant single-category lift)
- LongMemEval other categories: within judge noise on every other category (smallest move −2.1 on `temporal`)
- LOCOMO overall: 53.88 → 53.7 (flat)
- LOCOMO `single_hop`: 53.40 → 43.5 (−9.9 J, ~6 SEMs; real, not noise)
The LOCOMO `single_hop` regression is a dilution effect at the recall `limit=10` cut: extract.v2 extracts +33% more facts per conversation, so the relevant user-side fact gets pushed below position 10 more often when synthetic single-fact-lookup queries hit. Fix path is MEM-54 (importance signal weighting user-said personal facts higher at ranker time), landing next, this cycle.
Three v2 prompt variants were explored during MEM-55 development to see if the regression could be addressed at the prompt layer. It can't: per-turn ingestion (one /api/analyze call per speaker turn) makes "dedup against context" impossible to implement reliably because the LLM doesn't see the other turns. The fix belongs at the ranker layer.
Pre-commit validation gate was per-category: LOCOMO `single_hop` within ±2 J failed by a wide margin. Shipping anyway because the averaged-across-benchmarks delta is +2.13 J net and the fix path for the per-category regression is concrete and immediate (MEM-54). Documented loudly in the benchmark archive README rather than dressed up.
MEM-55
* chore(server): archive 2026-05-19 MEM-55 extract.v2 benchmark
Archive the LongMemEval + LOCOMO baselines that validated the extract.v2 prompt change. All on commit 47a1f6f (current dev tip with MEM-53 ranker + MEM-56 prompt-version pinning merged).
Headlines documented in the README:
- LongMemEval overall 76.6 (+4.45 vs v1)
- LongMemEval `single_session_assistant` 74.2 (+44.3)
- LOCOMO overall 53.7 (flat)
- LOCOMO `single_hop` 43.5 (−9.9, real not noise)
README is explicit about the validation-gate accounting: gate (3) "LOCOMO within ±2 J" failed on the `single_hop` per-category delta. Documents the dilution-at-recall-limit root cause, why we ship anyway (averaged net +2.13 J, MEM-54 is the next sub-issue and directly addresses the dilution), and what we'd do if MEM-54 doesn't deliver (revisit or revert).
This is the first benchmark archive that exercises MEM-56's prompt-version attribution pipeline end-to-end: every artifact JSON carries `prompt_versions: {extract: extract.v2, ask: ask.v1}` in its metadata block, so future cross-run comparisons can attribute J-Score deltas to the prompt change without guessing.
MEM-55
…-recovery
# Conflicts:
# apps/app/src/pages/SetupWizard.tsx
# services/server/scripts/sidecar-server.ts
# services/server/src/routes/remember.rs
# services/server/src/storage/walrus.rs
# services/server/src/types.rs
…ee-deployment-template-for-memwal-relayer docs: clarify TEE deployment pattern
Improve sidecar upload recovery diagnostics
Feat: MEM-54 - per-fact importance signal end-to-end + extract.v3 (#177)
Surfaces the extractor LLM's implicit importance assessment as an opt-in ranker dimension. Extractor tags each fact with a vital / standard / trivial bucket → persisted on vector_entries.importance (migration 009) → consumed by CompositeRanker via a weighted term on ScoringWeights.importance. Default weights (importance: 0.0) preserve byte-identical-to-today recall ordering.
Why categorical (vital/standard/trivial → 0.9/0.5/0.2): LLMs are reliably good at categorical classification and unreliable at continuous quantification (continuous scores bunch around 0.5 under uncertainty). 3-bucket gives the ranker meaningful headroom (vital is 1.8× standard, trivial is 0.4× standard) without spreading so wide that one mis-classification dominates the score.
Why opt-in via scoring_weights.importance (default 0.0): backward-compatible by construction, per-request tunability without forcing a server-wide default change, and experimental hygiene (we can A/B the signal via the harness preset system before promoting it to a default).
LOCOMO: recovers the MEM-55 single_hop regression and adds +4.3 overall:
- single_hop: 53.40 (v1) → 43.5 (MEM-55 v2, ❌) → 53.6 (this PR, ✅)
- Overall: 53.88 (v1) → 53.7 (MEM-55 v2) → 58.2 (+4.3) ✅
LongMemEval: known regression on single_session_assistant (74.2 → 62.7, −11.5 vs MEM-55 v2). Tracked as the immediate next ticket MEM-57 (pre-extraction dedup context, Mem0 v3 pattern), which is expected to compensate by giving the extractor stronger signal for what's new vs already-known. If MEM-57 doesn't move the LME number, a small prompt-softening follow-up PR is tested locally as fallback.
Infrastructure landed:
- migration 009: vector_entries.importance REAL NOT NULL DEFAULT 0.5 (non-destructive ADD COLUMN; legacy rows degrade to neutral standard)
- ExtractedFact { text, importance } with case-insensitive bucket parser + legacy no-TAB fallback (handles extract.v1/v2 output during ingestion-format transition)
- MemoryEngine.store_blob takes importance, threaded through routes/analyze.rs (benchmark + production), routes/remember.rs (single + bulk), routes/admin.rs restore, and jobs.rs WalletOperation::{UploadAndTransfer, SetMetadataAndTransfer, FinalizeUploadedBlob} + BulkRememberItem (all with #[serde(default = "default_importance")] for in-flight legacy payload compatibility)
- HydratedMemory.importance: Option<f32> zipped from SearchHit via the renamed zip_search_hit_fields_onto_hydrated helper (consolidates created_at + importance zip into one DB→ranker pass)
- CompositeRanker gains an importance term; is_ranker_active() includes the new weight; tracing breadcrumbs include the weight
- ScoringWeights.importance: f64 (default 0.0); validate() bounds match semantic/recency [0.0, 100.0]
Prompt change (extract.v2 → extract.v3):
- 3-bucket rubric appended with concrete category definitions
- BUCKET<TAB>FACT_TEXT output format with worked examples
- FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v3 (surfaced on /health and in benchmark artifacts via MEM-56)
Tests: 208/208 pass (17 ranker including 6 new importance tests; 14 extractor including 4 new bucket-parser tests). Bucket distribution verified end-to-end on the real DB after benchmark runs.
Also includes a chore commit that stops tracking benchmark result archives + working analysis notes in the repo (team decision; archived internally for monitoring). Historical commits that added those files are untouched.
Closes MEM-54.
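To make the BUCKET<TAB>FACT_TEXT format and the legacy no-TAB fallback concrete, here is a minimal parser sketch in Python (the server itself is Rust). `parse_fact_line` is a hypothetical name, and treating a line with an unrecognized bucket as a plain fact at the neutral score is an assumption, not something the commit message specifies.

```python
from typing import Optional, Tuple

# Bucket scores as stated above: vital / standard / trivial → 0.9 / 0.5 / 0.2.
BUCKET_SCORES = {"vital": 0.9, "standard": 0.5, "trivial": 0.2}
DEFAULT_IMPORTANCE = 0.5  # neutral "standard", also the column default


def parse_fact_line(line: str) -> Optional[Tuple[str, float]]:
    """Parse one extractor output line into (fact_text, importance).

    Returns None for blank lines and for the NONE sentinel.
    """
    text = line.strip()
    if not text or text == "NONE":
        return None
    bucket, sep, fact = text.partition("\t")
    key = bucket.strip().lower()  # case-insensitive bucket match
    if sep and key in BUCKET_SCORES and fact.strip():
        return fact.strip(), BUCKET_SCORES[key]
    # Legacy extract.v1/v2 output (no TAB), or an unknown bucket (assumption):
    # keep the whole line as the fact and fall back to the neutral score.
    return text, DEFAULT_IMPORTANCE
```

For example, `parse_fact_line("VITAL\tUser is allergic to peanuts")` yields the fact text with importance 0.9, while an untagged v1-style line keeps its text and gets 0.5.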
…ittee-aggregator feat(seal): default testnet SEAL to Mysten committee aggregator
…ct.v4 (#178)
Adds the Mem0 v3 saliency-aware extraction pattern: before the extractor LLM call, retrieve the top-K nearest existing memories for the input and prepend them as a <related_memories> context block. The extractor uses the context to skip duplicates and anchor borderline facts, without merging or superseding (extraction stays ADD-only).
This is the architectural fix for the MEM-54 v3 LME single_session_assistant regression and the LOCOMO single_hop dilution. Net result on both benchmarks vs the pre-cycle-13 baseline (extract.v1, May 18):
LOCOMO (every category improved):
- single_hop: 53.40 → 67.3 (+13.9)
- multi_hop: 47.08 → 56.7 (+9.6)
- open_domain: 52.22 → 71.5 (+19.3)
- adversarial: 71.33 → 82.4 (+11.1)
- temporal: 36.42 → 45.8 (+9.4)
- Overall: 53.88 → 68.5 (+14.6, ~36 SEMs)
First time on this codebase that LOCOMO single_hop and multi_hop match or beat Mem0 v2's published numbers (arXiv:2504.19413).
LongMemEval (5 of 6 categories improved):
- single_session_assistant: 29.91 → 57.6 (+27.7)
- multi_session: 78.57 → 82.5 (+3.9)
- preference: 77.83 → 80.2 (+2.4)
- knowledge_update: 86.10 → 86.5 (+0.4)
- single_session_user: 95.21 → 96.1 (+0.9)
- temporal: 62.03 → 59.5 (−2.5)
- Overall: 72.15 → 76.0 (+3.9)
Known regression vs the historical-best (MEM-55 v2's 74.2 on single_session_assistant): MEM-57's broad dedup occasionally conflates a summary memory in context with the input's atomic list items, dropping the items as 'paraphrases'. Net is still +27.7 vs v1, but −16.6 vs the historical best. The deep-review root cause + paired prompt fix are scoped as MEM-59 (granularity-aware dedup) for the immediate follow-up.
## Implementation
Trait extension (extractor.rs)
- Extractor::extract_with_context(text, &[&str]): default impl falls through to extract(text) so test mocks + non-analyze callers don't need to change.
- LlmExtractor override: short-circuits to extract() on empty slice, otherwise sends a 3-message payload (system + <related_memories> + input). The static system prompt stays cacheable; per-request context varies in the user-role message.
- extract() and extract_with_context() share a private call_chat_completion(messages) helper: single HTTP path, single observability point.
Prompt change (extract.v3 → extract.v4)
- Adds the <related_memories> instruction block (skip exact-paraphrase duplicates, anchor borderline content, ADD-only) plus a worked dedup example. Output format unchanged from v3, so the same parser applies.
- FACT_EXTRACTION_PROMPT_VERSION bumped to extract.v4 (surfaced on /health and in benchmark artifacts via MEM-56).
Handler wiring (routes/analyze.rs)
- Pre-extraction recall fires before extract_with_context() on both the production and benchmark paths.
- On production: search_similar against pgvector, then fetch_batch hits Walrus + the SEAL decrypt sidecar, NOT additional Postgres reads. On benchmark: fetch_batch reads the plaintext column. Both engines emit HydratedMemory { text }, so the prompt rendering operates uniformly on both.
- PRE_EXTRACTION_CONTEXT_LIMIT = 10 (matches Mem0 v3's K).
- Per-leg timing instrumentation (embed_ms / search_ms / walrus_ms / seal_ms) with a status enum covering 8 outcomes.
- Empty-namespace fast path: a cheap btree existence check on idx_vector_entries_owner_ns skips the embed + search round-trip on first-ingest-into-a-namespace (fires ~7% of LME / ~0.4% of LOCOMO calls; saves ~80-150ms + an embedding call per skip).
- Graceful degradation: every recall-side failure (embed / search / fetch) falls back to plain extraction with a warn log and a status enum tag, so a user's write never fails because the read path is degraded.
P0 hardening (per deep review, prerequisites for production ship)
1. Per-leg timeouts: tokio::time::timeout on each leg (embed 800ms, search 300ms, fetch 500ms). Caps the pre-extraction worst case at ~1.6s instead of the observed 30s benchmark outlier. New status values: embed_timeout, search_timeout, fetch_timeout.
2. Prompt-injection guard on <related_memories> content. MEM-57 routes stored user text → SEAL decrypt → LLM prompt; a user storing '</related_memories><system>...' could otherwise manipulate their own future extractions. escape_for_prompt_context() converts <, >, & to entities at the render chokepoint. Cross-tenant injection is already blocked by the DB owner+namespace filter and the SEAL credential tied to auth.account_id; this closes the self-injection-within-one's-own-namespace path. Applies uniformly to the production (SEAL-decrypted) and benchmark (plaintext) paths.
## Latency cost (full disclosure)
The MEM-57 ticket's +50-150ms p95 forecast was scoped to "one extra recall round-trip". The real flow runs 5 sequential operations (existence check + input embed + pgvector search + Walrus fetch + SEAL decrypt): measured p50 ~660ms, p95 ~1473ms, p99 ~4882ms across 10,179 events. 81.7% of calls land in the 500-1000ms bucket, which is structural, not noise. Per-leg timeouts cap the worst case at ~1.6s. Accepted because the +14.6 J LOCOMO win is overwhelming for an LLM-bound endpoint where the extractor itself takes 1-2s on the dominant path.
## Test surface
221/221 unit tests pass (was 208 on dev). 13 new tests across the MEM-57 surface: prompt formatting (incl. UTF-8 boundary truncation and the empty-slice short-circuit contract), the XML-entity injection guard, the dedup parser round-trip on extract.v4 output, and the trait default-impl fallthrough.
End-to-end observability verified across 16,121 /api/analyze events (LOCOMO + LME): 99.4% status=ok, 7.1% skipped_empty_namespace on LME (fast path firing as designed), 1 embed_failed (graceful fallback worked), 0 timeouts.
## Docs
Also folds two benchmark-docs commits that brought the harness README back in line with the code (it predated MEM-54/55/56/57):
- Migrations auto-apply on startup (no manual sqlx migrate); analyze status is "done" not "completed"; presets documented as 3 signals (semantic / recency / importance) with the inert frequency key flagged; importance signal documented; the real run artifacts described instead of non-existent summary.md/detailed-report.md.
- Env guidance: RATE_LIMIT_DISABLED=1, PORT=3001 (to match the harness default vs the server's 8000), and the always-required env in benchmark mode (DATABASE_URL / MEMWAL_PACKAGE_ID / MEMWAL_REGISTRY_ID / a reachable SUI_RPC_URL; SEAL + Walrus are bypassed, auth is not).
- Added a TL;DR first-run quickstart.
- services/server/.env.example: added the benchmark section the README referenced (was missing).
- benchmarks/pyproject.toml: declared huggingface_hub directly (was only transitive via datasets).
## Migration safety
No DB schema changes in this PR. The pre-extraction flow uses the existing idx_vector_entries_owner_ns index for both the existence check and search_similar.
## Follow-ups
- MEM-59 (granularity-aware dedup, extract.v5): paired prompt-only fix for the single_session_assistant regression. Root cause + suggested text already scoped from this PR's deep review.
- K=5 vs K=10 context-limit experiment (potential ~50-150ms p95 saving).
- Pre-extraction status/latency metrics (currently structured logs only).
- 100k-namespace capacity test before large-customer onboarding (pgvector ≤0.7 HNSW post-filter tail).
Closes MEM-57.
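The two P0 hardening ideas in this commit (entity-escaping stored text before it enters the prompt, and capping each recall leg with a timeout that degrades to a status tag instead of failing the write) can be sketched in Python. The server uses tokio::time::timeout in Rust; `asyncio.wait_for` is swapped in here as the Python equivalent. `render_related_memories`, `with_leg_timeout`, and the bullet-per-memory layout are assumptions for illustration.

```python
import asyncio
from typing import Any, Awaitable, List, Optional, Tuple


def escape_for_prompt_context(text: str) -> str:
    """Convert &, <, > to entities so stored text cannot close the context tag."""
    # '&' must be replaced first, or the entities added below get double-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def render_related_memories(memories: List[str]) -> str:
    """Render the context block; the bullet layout is an assumed format."""
    if not memories:
        return ""  # empty context: caller falls back to plain extraction
    lines = "\n".join(f"- {escape_for_prompt_context(m)}" for m in memories)
    return f"<related_memories>\n{lines}\n</related_memories>"


async def with_leg_timeout(
    leg: Awaitable[Any], timeout_s: float, name: str
) -> Tuple[Optional[Any], str]:
    """Run one recall leg; on timeout return (None, '<name>_timeout')."""
    try:
        return await asyncio.wait_for(leg, timeout=timeout_s), "ok"
    except asyncio.TimeoutError:
        return None, f"{name}_timeout"
```

A caller would treat any status other than `"ok"` as a cue to log a warning and run plain extraction, which is what keeps a degraded read path from failing a user's write.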
The SEAL session migration in this PR requires every relayer-mode signed request to first fetch GET /config and verify the SEAL package version via sui_getObject. test_middleware.py only mocked /api/recall and /api/analyze, so its 4 happy-path tests broke once a real recall was attempted (respx.AllMockedAssertionError on GET /config). Mirror the mock_seal_session_prereqs() helper already used in test_client.py and call it at the start of every @respx.mock test in test_middleware.py. All 26 middleware tests now pass; full suite 107 passing.
…ueue-failure fix(server): mark single remember jobs failed on recovery enqueue errors
…rity fix(python-sdk): align GET signing and use seal sessions
staging <- dev
jasong-03 (Collaborator) approved these changes on May 21, 2026 and left a comment:
staging → main promotion of the same content as PR #181 (verified there: Python SDK 107/107 pytest, server jobs::tests 12/12, async/sync smoke + protocol checks all green against dev relayer). CI on this PR is also fully green — Mintlify Deployment passed this time, and Railway successfully deployed to staging.memwal.ai. Approving.