staging <- dev#190

Open
ducnmm wants to merge 16 commits into staging from dev

Conversation

@ducnmm
Collaborator

@ducnmm ducnmm commented May 24, 2026

No description provided.

Nguyen Mau Minh Duc and others added 16 commits May 21, 2026 17:29
* Feat: MEM-59 — extract.v5 granularity-aware dedup (recover LME single_session_assistant)

Paired follow-up to MEM-57. MEM-57's pre-extraction dedup context gave a
large LOCOMO gain (+10.3) but introduced a granularity-blindness regression on
LME single_session_assistant (74.2 peak → 57.6): when <related_memories>
held a SUMMARY of a list and the input held the atomic items, the
extractor dropped the items as paraphrases of the summary.

v5 adds a granularity carve-out to the <related_memories> dedup rules:
specific atomic facts (names, numbers, list items, quotes, dates, titles)
are extracted even when the context holds only a summary/generalisation
of the same topic — plus a worked summary-vs-atomic example. MEM-57's
exact-paraphrase dedup (the mechanism behind the LOCOMO win) is preserved
explicitly.

Pure prompt change. No infra, no latency, parser unchanged. Bumps
FACT_EXTRACTION_PROMPT_VERSION extract.v4 → extract.v5.

## Benchmark validation (baseline preset, vs v4 / MEM-57)

LongMemEval — the recovery target:
  - single_session_assistant:  57.6 → 80.9  (+23.3) — above MEM-55 v2's 74.2 peak
  - multi_session:             82.5 → 80.8  (−1.7)
  - preference:                80.2 → 80.8  (+0.6)
  - knowledge_update:          86.5 → 84.3  (−2.2)
  - single_session_user:       96.1 → 95.5  (−0.6)
  - temporal:                  59.5 → 60.1  (+0.6)
  - Overall:                   76.0 → 77.9  (+1.9)

LOCOMO — no-regression check (held flat, all within ±2-3 J noise):
  - single_hop:    67.3 → 64.8  (−2.5)
  - multi_hop:     56.7 → 57.9  (+1.2)
  - open_domain:   71.5 → 70.2  (−1.3)
  - adversarial:   82.4 → 81.9  (−0.5)
  - temporal:      45.8 → 45.2  (−0.6)
  - Overall:       68.5 → 67.4  (−1.1)

Closes the cycle-13 RAG work: MEM-57 + MEM-59 deliver both wins as a
pair — LOCOMO +10.3 (MEM-57) and LME single_session_assistant fully
recovered + overall up (MEM-59).

## Tests

227/227 pass. New test parse_extracted_facts_handles_v5_granularity_extraction
pins that multi-atomic-item output round-trips through the parser.

Closes MEM-59.

* test(extractor): pin extract.v5 granularity carve-out in the prompt asset

Deep-review follow-up. The existing
parse_extracted_facts_handles_v5_granularity_extraction test only exercises
the (unchanged) parser, so it would still pass if a future edit silently
deleted the granularity rule or worked example from prompts/extract.txt,
re-introducing the LME single_session_assistant regression with no test signal.

Add extract_prompt_asset_contains_v5_granularity_carveout, which asserts
the embedded prompt asset still contains: the granularity rule, the
worked summary-vs-atomic example (incl. its TAB-separated output line,
doubling as a tab-integrity guard), v4's preserved exact-paraphrase
dedup rule, and that the version const still reads extract.v5.
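
A minimal sketch of what such a prompt-asset guard could look like. Everything here is an assumption for illustration: the prompt text is a placeholder (the real wording of prompts/extract.txt is not shown in this PR), the marker strings are hypothetical, and `missing_markers` is a made-up helper, not a function from the codebase. In the real crate the asset would presumably be embedded rather than written inline.

```rust
/// Mirrors the version const named in this PR; the value is taken from the text above.
const FACT_EXTRACTION_PROMPT_VERSION: &str = "extract.v5";

/// Placeholder prompt text (hypothetical wording, NOT the real asset).
/// The last line uses a literal TAB between its two fields.
const EXTRACT_PROMPT: &str = "\
<related_memories> dedup rules:\n\
- Skip exact paraphrases of an existing memory.\n\
- Granularity: extract atomic facts even when only a summary is present.\n\
Example output:\n\
fact\tUser's favourite author is Ursula K. Le Guin\n";

/// Hypothetical substrings the guard insists on: the granularity rule,
/// the preserved exact-paraphrase rule, the worked example header, and
/// the TAB-separated output line (which doubles as a tab-integrity check).
const REQUIRED_MARKERS: [&str; 4] = [
    "Granularity:",
    "exact paraphrases",
    "Example output:",
    "fact\tUser's",
];

/// Returns every required marker that is absent from `prompt`.
fn missing_markers<'a>(prompt: &str, markers: &[&'a str]) -> Vec<&'a str> {
    markers
        .iter()
        .copied()
        .filter(|m| !prompt.contains(*m))
        .collect()
}

/// True when the prompt still carries every marker and the version const matches.
fn prompt_asset_is_intact() -> bool {
    missing_markers(EXTRACT_PROMPT, &REQUIRED_MARKERS).is_empty()
        && FACT_EXTRACTION_PROMPT_VERSION == "extract.v5"
}
```

Returning the list of missing markers, rather than a bare boolean, lets a failing test report exactly which rule or example was deleted. For instance, if an editor converts the TAB in the example line to a space, only the `"fact\tUser's"` marker is reported.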

Pure test addition — no behavior change, prompt unchanged (still the
exact text that produced the validated MEM-59 benchmark numbers).
228/228 pass.
…patibility

MEM-60 define relayer compatibility policy
…aults

Keep testnet Seal defaults on legacy key servers
…non-manual) (#185)

`/api/recall/manual` returned raw pgvector cosine order while `/api/recall`
and `/api/ask` applied the CompositeRanker (recency + importance, opt-in via
scoring_weights), so the same query + weights gave different orderings across
endpoints. Manual recall also validated scoring_weights and then ignored them.

Manual recall now applies the same ranker, keeping its lightweight contract:
it ranks on the SearchHit fields directly (distance / created_at / importance,
all present pre-decrypt) and still returns blob ids + distances WITHOUT a
Walrus fetch or SEAL decrypt. All three recall paths now share one ordering
logic and agree for the same query + weights.

- New `rank_search_hits` reuses the exact `Ranker::rank` the hydrating paths
  use (no re-implementation of scoring on SearchHit — that would risk drift).
- Reorder is index-based, not blob_id-keyed: blob_id is not unique
  (search_similar has no DISTINCT; restore can produce duplicate-blob_id rows),
  so a blob_id-keyed round-trip would collapse duplicates and drop hits.
- recall_manual validates scoring_weights up front (400 on malformed) like recall.
- Default weights short-circuit → cosine order unchanged → existing callers
  unaffected. Wire shape unchanged (Vec<SearchHit>); only order changes.
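
The index-based reorder described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the `SearchHit` fields and their types, the `ScoringWeights` struct, and the `score` formula are all hypothetical stand-ins (the real path reuses `Ranker::rank`, whose formula is not shown here), and treating all-zero weights as the default is also an assumption.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for `SearchHit`; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct SearchHit {
    blob_id: String,
    distance: f64,   // cosine distance, lower is closer
    created_at: i64, // unix seconds (assumed representation)
    importance: f64, // assumed to lie in 0.0..=1.0
}

/// Hypothetical weights; both zero is treated as "default" in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ScoringWeights {
    recency: f64,
    importance: f64,
}

/// Toy composite score. This is NOT the CompositeRanker formula.
fn score(hit: &SearchHit, w: ScoringWeights, now: i64) -> f64 {
    let similarity = 1.0 - hit.distance;
    let age_days = (now - hit.created_at).max(0) as f64 / 86_400.0;
    let recency = 1.0 / (1.0 + age_days);
    similarity + w.recency * recency + w.importance * hit.importance
}

/// Reorders hits by score using their positions, never their blob ids,
/// so rows that share a blob_id all survive the round-trip.
fn rank_search_hits(hits: Vec<SearchHit>, w: ScoringWeights, now: i64) -> Vec<SearchHit> {
    // Default weights short-circuit: the incoming cosine order is returned as-is.
    if w.recency == 0.0 && w.importance == 0.0 {
        return hits;
    }
    // Pair each original index with its score.
    let mut order: Vec<(usize, f64)> = hits
        .iter()
        .enumerate()
        .map(|(i, h)| (i, score(h, w, now)))
        .collect();
    // Stable sort, highest score first; ties keep their original order.
    order.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    // Move each hit out of its original slot in ranked order.
    let mut slots: Vec<Option<SearchHit>> = hits.into_iter().map(Some).collect();
    order
        .into_iter()
        .filter_map(|(i, _)| slots[i].take())
        .collect()
}
```

Had the reorder gone through a map keyed by `blob_id`, two rows with the same id would collapse into one entry and the output would be shorter than the input. Carrying the original index through the sort keeps the output a permutation of the input by construction.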

Tests: 236/236. New recall tests cover manual≡non-manual parity (importance /
recency / combined weights), default no-op, duplicate-blob_id no-drop, an
8-item permutation round-trip, and empty/single/field-preservation cases.

Closes ENG-1785.
…-upstream-memwal-friction-fixes-before-monday

MEM-62: workshop MemWal friction fixes
…-upstream-memwal-friction-fixes-before-monday

MEM-62 remove MEMWAL_KEY fallback
…-upstream-memwal-friction-fixes-before-monday

MEM-62 update SDK changelogs
4 participants