fix: extend envelope stripping to broader channel/system markers by jlin53882 · Pull Request #481 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-04-03T06:57:21Z

Summary

Extends envelope metadata stripping (from PR #444) to cover 8 additional channel/system envelope patterns that were still leaking into memory extraction.

Changes

1. src/smart-extractor.ts — stripEnvelopeMetadata()

Added per-line stripping for 8 new patterns:
- <<<EXTERNAL_UNTRUSTED_CONTENT / <<<END_EXTERNAL_UNTRUSTED_CONTENT (Discord channel forwarded message wrappers)
- Sender (untrusted metadata): + JSON block
- Conversation info (untrusted metadata): + JSON block
- Thread starter (untrusted, for context): block
- Forwarded message context (untrusted metadata): block
- [Queued messages while agent was busy]
Fixed a pre-existing bug in PR fix: filter subagent runtime wrappers from auto-capture #444: subagent wrapper boilerplate on the same line as the wrapper prefix was not stripped. Changed to entire-line matching for reliability.

2. src/noise-filter.ts — ENVELOPE_NOISE_PATTERNS

New exported array of 8 RegExp patterns
Pre-filter in filterNoiseByEmbedding() skips envelope-pattern texts before embedding-based noise check

3. src/tools.ts — registerMemoryStoreTool

Added envelope metadata guard: rejects input matching any ENVELOPE_NOISE_PATTERNS with a clear error

4. test/strip-envelope-metadata.test.mjs

8 new regression test cases (one per new pattern)
1 mixed integration test covering all Phase 2 patterns together

Testing

node --test test/strip-envelope-metadata.test.mjs
22/22 pass

References

Closes #394 — Envelope Metadata Leak into Memory #446
Builds on PR fix: filter subagent runtime wrappers from auto-capture #444 (merged 2026-04-03)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa7f6d8f4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T07:00:33Z

+
+    // Pre-filter: skip known envelope patterns before embedding check
+    if (texts.length > 0) {
+      const nonEnvelopeTexts = texts.filter(t => !ENVELOPE_NOISE_PATTERNS.some(p => p.test(t)));


Do not discard mixed-content messages as envelope noise

SmartExtractor.filterNoiseByEmbedding now drops any text where ENVELOPE_NOISE_PATTERNS matches, but this runs before stripEnvelopeMetadata in the auto-capture path, so messages that contain both envelope lines and real user content are removed entirely instead of cleaned. For example, a single text containing System: [...] plus a user fact will be filtered out, which can reduce cleanTexts below minMessages and skip extraction, causing valid memories to be lost.

Useful? React with 👍 / 👎.

P1 addressed — pushed. The pre-filter has been removed from filterNoiseByEmbedding. Reasoning: in the extraction path, stripEnvelopeMetadata runs before filterNoiseByEmbedding, so the pre-filter would fire on already-cleaned text. The ENVELOPE_NOISE_PATTERNS guard is now only used in the memory_store tool path (see below).

chatgpt-codex-connector · 2026-04-03T07:00:34Z


        try {
+          // Guard: reject envelope metadata before any processing
+          if (ENVELOPE_NOISE_PATTERNS.some(p => p.test(text))) {


Avoid rejecting memory_store input on partial envelope match

The new memory_store guard rejects the whole write if any envelope regex matches (ENVELOPE_NOISE_PATTERNS.some(...)), but these patterns are line-level markers and can appear alongside legitimate memory content in the same payload. In that mixed-content case, the tool returns envelope_metadata_rejected and stores nothing, which drops explicit user-provided facts instead of preserving content after metadata stripping.

Useful? React with 👍 / 👎.

P2 addressed — pushed. The memory_store guard now calls stripEnvelopeMetadata first, then checks if the result is empty. Mixed-content messages (envelope markers + real facts) will have their envelope stripped and the facts preserved. Only purely-envelope text gets rejected.

app3apps · 2026-04-03T10:27:38Z

Current head cb636e2 is still broken: src/smart-extractor.ts has a stray } around filterNoiseByEmbedding(), which makes the module fail to parse.

I can reproduce it with:

jiti("./src/smart-extractor.ts")
node --test test/strip-envelope-metadata.test.mjs

This needs to be fixed before merge.

AliceLJY

Review: PR #481 — fix: extend envelope stripping to broader channel/system markers

The envelope stripping patterns and test coverage are good, but there are two code-level issues that need to be fixed before merge.

Codex P1 — "Do not discard mixed-content messages as envelope noise"

Valid concern, and the author has addressed it (confirmed by jlin53882's comment). The pre-filter was removed from filterNoiseByEmbedding, and the memory_store guard now strips first and only rejects if nothing remains. The current diff reflects the fix.

Codex P2 — "Avoid rejecting memory_store input on partial envelope match"

Valid concern, and the author has addressed it. The memory_store guard in tools.ts now calls stripEnvelopeMetadata(text) first, then checks if the stripped result is empty. Mixed-content messages are preserved after stripping. Correct.

Blocking issues

1. Stray closing brace breaks filterNoiseByEmbedding

In src/smart-extractor.ts, the diff shows:

  async filterNoiseByEmbedding(texts: string[]): Promise<string[]> {
    const noiseBank = this.config.noiseBank;
+
+    }
+
    if (!noiseBank || !noiseBank.initialized) return texts;

There is a stray } on its own line that prematurely closes the filterNoiseByEmbedding method. The if (!noiseBank ...) line and everything after it would then be outside the method body, causing a compile error. This appears to be a leftover from removing the envelope pre-filter block — the block's closing brace was not removed along with its opening.

2. Character encoding corruption throughout smart-extractor.ts

Multiple comment lines have their em-dash characters (—) replaced with ??. Examples:

"Smart Memory Extractor ??LLM-powered extraction pipeline" (was — LLM-powered)
"Pipeline: conversation ??LLM extract ??candidates ??dedup ??persist" (was →)
"LLM returned zero candidates ??strongest noise signal" (was —)
"Same brand, different item ??should not be deduped" (was →)
Chinese characters also corrupted: "?�欢麦�??��??�烧鸡腿??" (was 喜欢麦当劳的板烧鸡腿堡)

This is a systematic encoding issue — likely the file was saved or committed with an incorrect encoding. All corrupted characters need to be restored to their originals.

Non-blocking notes

ENVELOPE_NOISE_PATTERNS export in noise-filter.ts is well-structured and the patterns are specific enough to avoid false positives.
The 8 new test cases in strip-envelope-metadata.test.mjs provide good regression coverage.
tools.ts lost its trailing newline (\ No newline at end of file). Minor.

Please fix the stray brace and the encoding corruption, then this is ready for another look.

jlin53882 · 2026-04-03T19:14:27Z

All blocking issues have been fixed:

P1: Removed stray } in ilterNoiseByEmbedding that caused the compile error
P2: Restored corrupted characters throughout smart-extractor.ts — em-dashes (—), arrows (→), and Chinese text all fixed
T1: Added trailing newline to ools.ts

Please re-review when you have a chance 🙏

AliceLJY

Re-reviewed after latest commit. Both blocking issues from my previous review are fixed:

Stray closing brace — removed, filterNoiseByEmbedding method body is clean ✅
Encoding corruption — CJK characters (喜欢麦当劳的板烧鸡腿堡) and em-dashes (—) are now correct ✅

Code quality:

ENVELOPE_NOISE_PATTERNS export is well-structured
stripEnvelopeMetadata extensions handle Discord/channel markers properly (per-line, no blanket rejection)
tools.ts guard correctly strips first → rejects only if empty (P2 fix)
8 new test cases provide solid regression coverage

LGTM. @rwmjhb 请 review 后合并。

…se 2) - Extend stripEnvelopeMetadata() with 8 new patterns: <<<EXTERNAL_UNTRUSTED_CONTENT, <<<END_EXTERNAL_UNTRUSTED_CONTENT, Sender/Conversation info (untrusted metadata), Thread starter, Forwarded message context, [Queued messages while agent was busy] - Add ENVELOPE_NOISE_PATTERNS to noise-filter.ts for pre-embedding guard - Add memory_store tool guard in tools.ts (strip-then-check approach) - Add 8 regression test cases in strip-envelope-metadata.test.mjs - Fix PR CortexReach#444 regex bug: subagent wrapper lines now stripped via entire-line matching (/^[Subagent Context|Subagent Task].*$/gm) - P1 fix: remove pre-filter from filterNoiseByEmbedding (runs before stripEnvelopeMetadata in extraction path, would cause false positives) - P2 fix: memory_store guard now strips first then checks if empty, preserving mixed-content messages Fixes CortexReach#446

jlin53882 · 2026-04-08T18:55:02Z

CI 修復記錄（本次 PR）

@AliceLJY 麻煩請你在review 確認一下
本次 PR 經歷 4 輪 CI 迭代，主要問題並非功能邏輯錯誤，而是 rebase 過程中殘留的 artifact。以下為完整修復說明：

🔴 第一輪 CI 失敗（Jobs: packaging-and-workflow / storage-and-schema / cli-smoke / core-regression）

根本原因：CLI smoke test 的 jiti 預編譯快取在第一次失敗後未清除，後續 CI run 吃到殘留的錯誤版本。屬於 CI 環境問題，非 PR code 問題。

修復方式：等待 CI cache 自然過期，無需修改任何 code。

🔴 第二輪 CI 失敗（Jobs 同上）

錯誤症狀：

Error: ParseError: Unexpected token, expected ","
src/tools.ts:790:61

根本原因：git rebase -i 執行過程中，rebase session 的 conflict 標記格式異常，在 tools.ts:790 留下 orphan artifact：

// ❌ 錯誤（殘留 artifact）
const validUntil = inferExpiry(text); (fix: extend envelope stripping to broader channel/system markers (Phase 2))

// ✅ 正確
const validUntil = inferExpiry(text);

修復 commit：77f2eef — fix: remove orphaned Phase 2 artifact from tools.ts

🔴 第三輪 CI 失敗（Jobs 同上）

錯誤症狀：

Error: ParseError: Unexpected token, expected ","
src/smart-extractor.ts:168:18

根本原因：同一 rebase session 在 smart-extractor.ts:168 也留下相同的 orphan artifact：

// ❌ 錯誤（殘留 artifact）
  );
 (fix: extend envelope stripping to broader channel/system markers (Phase 2))

// ✅ 正確
  );

修復 commit：128944e — fix: remove orphaned Phase 2 artifact from smart-extractor.ts

✅ 第四輪 CI — ALL GREEN

所有 artifact 清除完畢後，CI 全部通過。

📋 涉及檔案

檔案	Commit	變更
`src/tools.ts`	`77f2eef`	移除 line 790 orphan artifact
`src/smart-extractor.ts`	`128944e`	移除 line 168 orphan artifact

📌 PR 內容摘要

本 PR（Phase 2）為 envelope stripping 功能擴展：

擴展 stripEnvelopeMetadata()：新增 8 種 channel/system metadata 格式
- <<<EXTERNAL_UNTRUSTED_CONTENT / <<<END_EXTERNAL_UNTRUSTED_CONTENT
- Sender/Conversation info (untrusted metadata)
- Thread starter
- Forwarded message context
- [Queued messages while agent was busy]
新增 ENVELOPE_NOISE_PATTERNS：noise-filter.ts 預過濾，增強 embedding 前保護
新增 memory_store guard：strip-then-check 策略，保留 mixed content
新增迴歸測試：8 個 test cases 覆蓋新 pattern
修復 PR fix: filter subagent runtime wrappers from auto-capture #444 regex bug：subagent wrapper lines 現在以整行匹配 /^[Subagent Context|Subagent Task].*$/gm 正確移除

所有功能變更均已通過 CI ✅

…pattern, use stripped for embed - smart-extractor.ts: restore stripLeadingRuntimeWrappers() call instead of global regex that strips all wrapper lines at any position (PR#444 regression) - noise-filter.ts: narrow ENVELOPE_NOISE_PATTERNS System pattern from /^System:\\s*\\[/im to /^System:\\s*\\[[\\d\\-: +GMT]+\\][\/im to avoid false-positives on legitimate user content like 'System: [小明] said...' - tools.ts: use 'stripped' instead of 'text' for embedPassage/classifyTemporal/ inferExpiry to avoid embedding/classifying envelope metadata PR#481 intent: extend PR#444 with 8 new envelope patterns, NOT replace it. This fix ensures PR#444's leading-only wrapper stripping is preserved.

jlin53882 · 2026-04-10T08:23:32Z

Fix Summary — PR #481 Bug Fixes

Fixed 3 bugs in PR #481 that either broke PR #444's existing behavior or introduced regressions:

Bug 1: `smart-extractor.ts` — PR #444 Regression (Critical)

Problem: stripEnvelopeMetadata() replaced PR #444's stripLeadingRuntimeWrappers(text) call with a global gm regex:

// ❌ WRONG: strips ALL [Subagent Context] lines at any position
let cleaned = text.replace(/^\[(?:Subagent Context|Subagent Task)\].*$/gm, "");

This means if a wrapper header appears mid-message (not just at the beginning), it would be stripped and potentially corrupt legitimate content.

Fix: Restore stripLeadingRuntimeWrappers(text) — PR #444's leading-only design:

// ✅ CORRECT: strips leading wrappers only, preserves content after first real line
let cleaned = stripLeadingRuntimeWrappers(text);

Bug 2: `noise-filter.ts` — System Pattern Too Broad

Problem: ENVELOPE_NOISE_PATTERNS used /^System:\s*\[/im which matches any "System: [" line, including legitimate user content like:

System: [小明] 說這個不行

Fix: Narrow to precise timestamp pattern:

// ✅ Precise: only matches "System: [2026-03-18 14:21:36 GMT+8]"
/^System:\s*\[[\d\-: +GMT]+\]/im

Bug 3: `tools.ts` — Embedding Used Raw `text` Instead of `stripped`

Problem: After stripping, the guard checked stripped but embedPassage(), classifyTemporal(), and inferExpiry() still used the raw text:

// ❌ Used raw text for embedding
const vector = await runtimeContext.embedder.embedPassage(text);
const temporalType = classifyTemporal(text);
const validUntil = inferExpiry(text);

Fix: Use stripped for all downstream processing:

// ✅ Uses stripped text for embedding and classification
const vector = await runtimeContext.embedder.embedPassage(stripped);
const temporalType = classifyTemporal(stripped);
const validUntil = inferExpiry(stripped);

Design Principle

PR #444 is the foundation. PR #481 should extend it, not replace it.

PR #481's job is to add 8 new envelope patterns on top of PR #444's existing stripping logic. The fix ensures PR #444's stripLeadingRuntimeWrappers() leading-only behavior is preserved.

Test Results

All 14 existing envelope stripping tests pass ✅

✔ stripEnvelopeMetadata — 14/14 pass

Commit

05d6ff1 — fix: preserve PR#444 stripLeadingRuntimeWrappers(), fix broad System pattern, use stripped for embed

jlin53882 · 2026-04-10T08:52:06Z

@AliceLJY 我剛剛已經有經過 codex 對抗，將一些隱藏bug 抓取出來重新修正，已推上的最新的 commit 05d6ff1 ，再麻煩您有空的的時候，幫我重新review 一次，看看有沒有其他忽略的點。

app3apps · 2026-04-11T10:18:46Z

Re-checked the latest head. The main regression is fixed, and I no longer consider this PR blocked.

There is still a small follow-up in the manual memory_store path where wrapped input is not handled consistently end-to-end, but I don’t think that should block this PR. I’m okay with merging as-is, and would handle the remaining cleanup in a follow-up with a focused regression test.

jlin53882 mentioned this pull request Apr 3, 2026

#394 — Envelope Metadata Leak into Memory #446

Closed

chatgpt-codex-connector bot reviewed Apr 3, 2026

View reviewed changes

jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from aa7f6d8 to cb636e2 Compare April 3, 2026 07:10

AliceLJY requested changes Apr 3, 2026

View reviewed changes

AliceLJY approved these changes Apr 4, 2026

View reviewed changes

AliceLJY assigned rwmjhb Apr 4, 2026

jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from cdd5087 to d86bdb4 Compare April 8, 2026 17:40

jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from d86bdb4 to 8226c7a Compare April 8, 2026 17:45

jlin53882 added 4 commits April 9, 2026 01:49

fix: restore arrow and fix Chinese example in preference-slot guard

7e53ac3

fix: replace em-dash with arrow in step-sequence comments

22493f1

fix: remove orphaned artifact in tools.ts execute function

128944e

fix: remove orphaned Phase 2 artifact from smart-extractor.ts

77f2eef

rwmjhb merged commit a265f64 into CortexReach:master Apr 11, 2026
7 checks passed

Conversation

jlin53882 commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

References

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

jlin53882 Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

jlin53882 Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

app3apps commented Apr 3, 2026

Uh oh!

AliceLJY left a comment

Choose a reason for hiding this comment

Review: PR #481 — fix: extend envelope stripping to broader channel/system markers

Codex P1 — "Do not discard mixed-content messages as envelope noise"

Codex P2 — "Avoid rejecting memory_store input on partial envelope match"

Blocking issues

Non-blocking notes

Uh oh!

jlin53882 commented Apr 3, 2026

Uh oh!

AliceLJY left a comment

Choose a reason for hiding this comment

Uh oh!

jlin53882 commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI 修復記錄（本次 PR）

🔴 第一輪 CI 失敗（Jobs: packaging-and-workflow / storage-and-schema / cli-smoke / core-regression）

🔴 第二輪 CI 失敗（Jobs 同上）

🔴 第三輪 CI 失敗（Jobs 同上）

✅ 第四輪 CI — ALL GREEN

📋 涉及檔案

📌 PR 內容摘要

Uh oh!

jlin53882 commented Apr 10, 2026

Fix Summary — PR #481 Bug Fixes

Bug 1: smart-extractor.ts — PR #444 Regression (Critical)

Bug 2: noise-filter.ts — System Pattern Too Broad

Bug 3: tools.ts — Embedding Used Raw text Instead of stripped

Design Principle

Test Results

Commit

Uh oh!

jlin53882 commented Apr 10, 2026

Uh oh!

app3apps commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jlin53882 commented Apr 3, 2026 •

edited

Loading

jlin53882 commented Apr 8, 2026 •

edited

Loading

Bug 1: `smart-extractor.ts` — PR #444 Regression (Critical)

Bug 2: `noise-filter.ts` — System Pattern Too Broad

Bug 3: `tools.ts` — Embedding Used Raw `text` Instead of `stripped`