Skip to content

fix: extend envelope stripping to broader channel/system markers#481

Merged
rwmjhb merged 6 commits intoCortexReach:masterfrom
jlin53882:fix/envelope-stripping-phase2
Apr 11, 2026
Merged

fix: extend envelope stripping to broader channel/system markers#481
rwmjhb merged 6 commits intoCortexReach:masterfrom
jlin53882:fix/envelope-stripping-phase2

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented Apr 3, 2026

Summary

Extends envelope metadata stripping (from PR #444) to cover 8 additional channel/system envelope patterns that were still leaking into memory extraction.

Changes

1. src/smart-extractor.ts — stripEnvelopeMetadata()

  • Added per-line stripping for 8 new patterns:
    • <<<EXTERNAL_UNTRUSTED_CONTENT / <<<END_EXTERNAL_UNTRUSTED_CONTENT (Discord channel forwarded message wrappers)
    • Sender (untrusted metadata): + JSON block
    • Conversation info (untrusted metadata): + JSON block
    • Thread starter (untrusted, for context): block
    • Forwarded message context (untrusted metadata): block
    • [Queued messages while agent was busy]
  • Fixed a pre-existing bug in PR fix: filter subagent runtime wrappers from auto-capture #444: subagent wrapper boilerplate on the same line as the wrapper prefix was not stripped. Changed to entire-line matching for reliability.

2. src/noise-filter.ts — ENVELOPE_NOISE_PATTERNS

  • New exported array of 8 RegExp patterns
  • Pre-filter in filterNoiseByEmbedding() skips envelope-pattern texts before embedding-based noise check

3. src/tools.ts — registerMemoryStoreTool

  • Added envelope metadata guard: rejects input matching any ENVELOPE_NOISE_PATTERNS with a clear error

4. test/strip-envelope-metadata.test.mjs

  • 8 new regression test cases (one per new pattern)
  • 1 mixed integration test covering all Phase 2 patterns together

Testing

node --test test/strip-envelope-metadata.test.mjs
22/22 pass

References

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa7f6d8f4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/smart-extractor.ts Outdated

// Pre-filter: skip known envelope patterns before embedding check
if (texts.length > 0) {
const nonEnvelopeTexts = texts.filter(t => !ENVELOPE_NOISE_PATTERNS.some(p => p.test(t)));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not discard mixed-content messages as envelope noise

SmartExtractor.filterNoiseByEmbedding now drops any text where ENVELOPE_NOISE_PATTERNS matches, but this runs before stripEnvelopeMetadata in the auto-capture path, so messages that contain both envelope lines and real user content are removed entirely instead of cleaned. For example, a single text containing System: [...] plus a user fact will be filtered out, which can reduce cleanTexts below minMessages and skip extraction, causing valid memories to be lost.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 addressed — pushed. The pre-filter has been removed from filterNoiseByEmbedding. Reasoning: in the extraction path, stripEnvelopeMetadata runs before filterNoiseByEmbedding, so the pre-filter would fire on already-cleaned text. The ENVELOPE_NOISE_PATTERNS guard is now only used in the memory_store tool path (see below).

Comment thread src/tools.ts Outdated

try {
// Guard: reject envelope metadata before any processing
if (ENVELOPE_NOISE_PATTERNS.some(p => p.test(text))) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid rejecting memory_store input on partial envelope match

The new memory_store guard rejects the whole write if any envelope regex matches (ENVELOPE_NOISE_PATTERNS.some(...)), but these patterns are line-level markers and can appear alongside legitimate memory content in the same payload. In that mixed-content case, the tool returns envelope_metadata_rejected and stores nothing, which drops explicit user-provided facts instead of preserving content after metadata stripping.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 addressed — pushed. The memory_store guard now calls stripEnvelopeMetadata first, then checks if the result is empty. Mixed-content messages (envelope markers + real facts) will have their envelope stripped and the facts preserved. Only purely-envelope text gets rejected.

@jlin53882 jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from aa7f6d8 to cb636e2 Compare April 3, 2026 07:10
Copy link
Copy Markdown

app3apps commented Apr 3, 2026

Current head cb636e2 is still broken: src/smart-extractor.ts has a stray } around filterNoiseByEmbedding(), which makes the module fail to parse.

I can reproduce it with:

  • jiti("./src/smart-extractor.ts")
  • node --test test/strip-envelope-metadata.test.mjs

This needs to be fixed before merge.

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #481 — fix: extend envelope stripping to broader channel/system markers

The envelope stripping patterns and test coverage are good, but there are two code-level issues that need to be fixed before merge.

Codex P1 — "Do not discard mixed-content messages as envelope noise"

Valid concern, and the author has addressed it (confirmed by jlin53882's comment). The pre-filter was removed from filterNoiseByEmbedding, and the memory_store guard now strips first and only rejects if nothing remains. The current diff reflects the fix.

Codex P2 — "Avoid rejecting memory_store input on partial envelope match"

Valid concern, and the author has addressed it. The memory_store guard in tools.ts now calls stripEnvelopeMetadata(text) first, then checks if the stripped result is empty. Mixed-content messages are preserved after stripping. Correct.

Blocking issues

1. Stray closing brace breaks filterNoiseByEmbedding

In src/smart-extractor.ts, the diff shows:

  async filterNoiseByEmbedding(texts: string[]): Promise<string[]> {
    const noiseBank = this.config.noiseBank;
+
+    }
+
    if (!noiseBank || !noiseBank.initialized) return texts;

There is a stray } on its own line that prematurely closes the filterNoiseByEmbedding method. The if (!noiseBank ...) line and everything after it would then be outside the method body, causing a compile error. This appears to be a leftover from removing the envelope pre-filter block — the block's closing brace was not removed along with its opening.

2. Character encoding corruption throughout smart-extractor.ts

Multiple comment lines have their em-dash characters () replaced with ??. Examples:

  • "Smart Memory Extractor ??LLM-powered extraction pipeline" (was — LLM-powered)
  • "Pipeline: conversation ??LLM extract ??candidates ??dedup ??persist" (was )
  • "LLM returned zero candidates ??strongest noise signal" (was )
  • "Same brand, different item ??should not be deduped" (was )
  • Chinese characters also corrupted: "?�欢麦�??��??�烧鸡腿??" (was 喜欢麦当劳的板烧鸡腿堡)

This is a systematic encoding issue — likely the file was saved or committed with an incorrect encoding. All corrupted characters need to be restored to their originals.

Non-blocking notes

  • ENVELOPE_NOISE_PATTERNS export in noise-filter.ts is well-structured and the patterns are specific enough to avoid false positives.
  • The 8 new test cases in strip-envelope-metadata.test.mjs provide good regression coverage.
  • tools.ts lost its trailing newline (\ No newline at end of file). Minor.

Please fix the stray brace and the encoding corruption, then this is ready for another look.

@jlin53882
Copy link
Copy Markdown
Contributor Author

All blocking issues have been fixed:

  • P1: Removed stray } in ilterNoiseByEmbedding that caused the compile error
  • P2: Restored corrupted characters throughout smart-extractor.ts — em-dashes (—), arrows (→), and Chinese text all fixed
  • T1: Added trailing newline to ools.ts

Please re-review when you have a chance 🙏

Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after latest commit. Both blocking issues from my previous review are fixed:

  1. Stray closing brace — removed, filterNoiseByEmbedding method body is clean ✅
  2. Encoding corruption — CJK characters (喜欢麦当劳的板烧鸡腿堡) and em-dashes () are now correct ✅

Code quality:

  • ENVELOPE_NOISE_PATTERNS export is well-structured
  • stripEnvelopeMetadata extensions handle Discord/channel markers properly (per-line, no blanket rejection)
  • tools.ts guard correctly strips first → rejects only if empty (P2 fix)
  • 8 new test cases provide solid regression coverage

LGTM. @rwmjhb 请 review 后合并。

@jlin53882 jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from cdd5087 to d86bdb4 Compare April 8, 2026 17:40
…se 2)

- Extend stripEnvelopeMetadata() with 8 new patterns:
  <<<EXTERNAL_UNTRUSTED_CONTENT, <<<END_EXTERNAL_UNTRUSTED_CONTENT,
  Sender/Conversation info (untrusted metadata), Thread starter,
  Forwarded message context, [Queued messages while agent was busy]
- Add ENVELOPE_NOISE_PATTERNS to noise-filter.ts for pre-embedding guard
- Add memory_store tool guard in tools.ts (strip-then-check approach)
- Add 8 regression test cases in strip-envelope-metadata.test.mjs
- Fix PR CortexReach#444 regex bug: subagent wrapper lines now stripped via
  entire-line matching (/^[Subagent Context|Subagent Task].*$/gm)
- P1 fix: remove pre-filter from filterNoiseByEmbedding (runs before
  stripEnvelopeMetadata in extraction path, would cause false positives)
- P2 fix: memory_store guard now strips first then checks if empty,
  preserving mixed-content messages

Fixes CortexReach#446
@jlin53882 jlin53882 force-pushed the fix/envelope-stripping-phase2 branch from d86bdb4 to 8226c7a Compare April 8, 2026 17:45
@jlin53882
Copy link
Copy Markdown
Contributor Author

jlin53882 commented Apr 8, 2026

CI 修復記錄(本次 PR)

@AliceLJY 麻煩請你在review 確認一下
本次 PR 經歷 4 輪 CI 迭代,主要問題並非功能邏輯錯誤,而是 rebase 過程中殘留的 artifact。以下為完整修復說明:


🔴 第一輪 CI 失敗(Jobs: packaging-and-workflow / storage-and-schema / cli-smoke / core-regression)

根本原因:CLI smoke test 的 jiti 預編譯快取在第一次失敗後未清除,後續 CI run 吃到殘留的錯誤版本。屬於 CI 環境問題,非 PR code 問題。

修復方式:等待 CI cache 自然過期,無需修改任何 code。


🔴 第二輪 CI 失敗(Jobs 同上)

錯誤症狀

Error: ParseError: Unexpected token, expected ","
src/tools.ts:790:61

根本原因git rebase -i 執行過程中,rebase session 的 conflict 標記格式異常,在 tools.ts:790 留下 orphan artifact:

// ❌ 錯誤(殘留 artifact)
const validUntil = inferExpiry(text); (fix: extend envelope stripping to broader channel/system markers (Phase 2))

// ✅ 正確
const validUntil = inferExpiry(text);

修復 commit77f2eef — fix: remove orphaned Phase 2 artifact from tools.ts


🔴 第三輪 CI 失敗(Jobs 同上)

錯誤症狀

Error: ParseError: Unexpected token, expected ","
src/smart-extractor.ts:168:18

根本原因:同一 rebase session 在 smart-extractor.ts:168 也留下相同的 orphan artifact:

// ❌ 錯誤(殘留 artifact)
  );
 (fix: extend envelope stripping to broader channel/system markers (Phase 2))

// ✅ 正確
  );

修復 commit128944e — fix: remove orphaned Phase 2 artifact from smart-extractor.ts


✅ 第四輪 CI — ALL GREEN

所有 artifact 清除完畢後,CI 全部通過。


📋 涉及檔案

檔案 Commit 變更
src/tools.ts 77f2eef 移除 line 790 orphan artifact
src/smart-extractor.ts 128944e 移除 line 168 orphan artifact

📌 PR 內容摘要

本 PR(Phase 2)為 envelope stripping 功能擴展:

  • 擴展 stripEnvelopeMetadata():新增 8 種 channel/system metadata 格式
    • <<<EXTERNAL_UNTRUSTED_CONTENT / <<<END_EXTERNAL_UNTRUSTED_CONTENT
    • Sender/Conversation info (untrusted metadata)
    • Thread starter
    • Forwarded message context
    • [Queued messages while agent was busy]
  • 新增 ENVELOPE_NOISE_PATTERNS:noise-filter.ts 預過濾,增強 embedding 前保護
  • 新增 memory_store guard:strip-then-check 策略,保留 mixed content
  • 新增迴歸測試:8 個 test cases 覆蓋新 pattern
  • 修復 PR fix: filter subagent runtime wrappers from auto-capture #444 regex bug:subagent wrapper lines 現在以整行匹配 /^[Subagent Context|Subagent Task].*$/gm 正確移除

所有功能變更均已通過 CI ✅

…pattern, use stripped for embed

- smart-extractor.ts: restore stripLeadingRuntimeWrappers() call instead of
  global regex that strips all wrapper lines at any position (PR#444 regression)
- noise-filter.ts: narrow ENVELOPE_NOISE_PATTERNS System pattern from
  /^System:\\s*\\[/im to /^System:\\s*\\[[\\d\\-: +GMT]+\\][\/im to avoid
  false-positives on legitimate user content like 'System: [小明] said...'
- tools.ts: use 'stripped' instead of 'text' for embedPassage/classifyTemporal/
  inferExpiry to avoid embedding/classifying envelope metadata

PR#481 intent: extend PR#444 with 8 new envelope patterns, NOT replace it.
This fix ensures PR#444's leading-only wrapper stripping is preserved.
@jlin53882
Copy link
Copy Markdown
Contributor Author

Fix Summary — PR #481 Bug Fixes

Fixed 3 bugs in PR #481 that either broke PR #444's existing behavior or introduced regressions:


Bug 1: smart-extractor.ts — PR #444 Regression (Critical)

Problem: stripEnvelopeMetadata() replaced PR #444's stripLeadingRuntimeWrappers(text) call with a global gm regex:

// ❌ WRONG: strips ALL [Subagent Context] lines at any position
let cleaned = text.replace(/^\[(?:Subagent Context|Subagent Task)\].*$/gm, "");

This means if a wrapper header appears mid-message (not just at the beginning), it would be stripped and potentially corrupt legitimate content.

Fix: Restore stripLeadingRuntimeWrappers(text) — PR #444's leading-only design:

// ✅ CORRECT: strips leading wrappers only, preserves content after first real line
let cleaned = stripLeadingRuntimeWrappers(text);

Bug 2: noise-filter.ts — System Pattern Too Broad

Problem: ENVELOPE_NOISE_PATTERNS used /^System:\s*\[/im which matches any "System: [" line, including legitimate user content like:

System: [小明] 說這個不行

Fix: Narrow to precise timestamp pattern:

// ✅ Precise: only matches "System: [2026-03-18 14:21:36 GMT+8]"
/^System:\s*\[[\d\-: +GMT]+\]/im

Bug 3: tools.ts — Embedding Used Raw text Instead of stripped

Problem: After stripping, the guard checked stripped but embedPassage(), classifyTemporal(), and inferExpiry() still used the raw text:

// ❌ Used raw text for embedding
const vector = await runtimeContext.embedder.embedPassage(text);
const temporalType = classifyTemporal(text);
const validUntil = inferExpiry(text);

Fix: Use stripped for all downstream processing:

// ✅ Uses stripped text for embedding and classification
const vector = await runtimeContext.embedder.embedPassage(stripped);
const temporalType = classifyTemporal(stripped);
const validUntil = inferExpiry(stripped);

Design Principle

PR #444 is the foundation. PR #481 should extend it, not replace it.

PR #481's job is to add 8 new envelope patterns on top of PR #444's existing stripping logic. The fix ensures PR #444's stripLeadingRuntimeWrappers() leading-only behavior is preserved.


Test Results

All 14 existing envelope stripping tests pass ✅

✔ stripEnvelopeMetadata — 14/14 pass

Commit

05d6ff1 — fix: preserve PR#444 stripLeadingRuntimeWrappers(), fix broad System pattern, use stripped for embed

@jlin53882
Copy link
Copy Markdown
Contributor Author

@AliceLJY 我剛剛已經有經過 codex 對抗,將一些隱藏bug 抓取出來重新修正,已推上的最新的 commit 05d6ff1 ,再麻煩您有空的的時候 ,幫我重新review 一次,看看有沒有其他忽略的點。

Copy link
Copy Markdown

Re-checked the latest head. The main regression is fixed, and I no longer consider this PR blocked.

There is still a small follow-up in the manual memory_store path where wrapped input is not handled consistently end-to-end, but I don’t think that should block this PR. I’m okay with merging as-is, and would handle the remaining cleanup in a follow-up with a focused regression test.

@rwmjhb rwmjhb merged commit a265f64 into CortexReach:master Apr 11, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

#394 — Envelope Metadata Leak into Memory

4 participants