fix: extend envelope stripping to broader channel/system markers#481
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa7f6d8f4b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| // Pre-filter: skip known envelope patterns before embedding check | ||
| if (texts.length > 0) { | ||
| const nonEnvelopeTexts = texts.filter(t => !ENVELOPE_NOISE_PATTERNS.some(p => p.test(t))); |
There was a problem hiding this comment.
Do not discard mixed-content messages as envelope noise
SmartExtractor.filterNoiseByEmbedding now drops any text where ENVELOPE_NOISE_PATTERNS matches, but this runs before stripEnvelopeMetadata in the auto-capture path, so messages that contain both envelope lines and real user content are removed entirely instead of cleaned. For example, a single text containing System: [...] plus a user fact will be filtered out, which can reduce cleanTexts below minMessages and skip extraction, causing valid memories to be lost.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
P1 addressed — pushed. The pre-filter has been removed from filterNoiseByEmbedding. Reasoning: in the extraction path, stripEnvelopeMetadata runs before filterNoiseByEmbedding, so the pre-filter would fire on already-cleaned text. The ENVELOPE_NOISE_PATTERNS guard is now only used in the memory_store tool path (see below).
|
|
||
| try { | ||
| // Guard: reject envelope metadata before any processing | ||
| if (ENVELOPE_NOISE_PATTERNS.some(p => p.test(text))) { |
There was a problem hiding this comment.
Avoid rejecting memory_store input on partial envelope match
The new memory_store guard rejects the whole write if any envelope regex matches (ENVELOPE_NOISE_PATTERNS.some(...)), but these patterns are line-level markers and can appear alongside legitimate memory content in the same payload. In that mixed-content case, the tool returns envelope_metadata_rejected and stores nothing, which drops explicit user-provided facts instead of preserving content after metadata stripping.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
P2 addressed — pushed. The memory_store guard now calls stripEnvelopeMetadata first, then checks if the result is empty. Mixed-content messages (envelope markers + real facts) will have their envelope stripped and the facts preserved. Only purely-envelope text gets rejected.
aa7f6d8 to
cb636e2
Compare
|
Current head I can reproduce it with:
This needs to be fixed before merge. |
AliceLJY
left a comment
There was a problem hiding this comment.
Review: PR #481 — fix: extend envelope stripping to broader channel/system markers
The envelope stripping patterns and test coverage are good, but there are two code-level issues that need to be fixed before merge.
Codex P1 — "Do not discard mixed-content messages as envelope noise"
Valid concern, and the author has addressed it (confirmed by jlin53882's comment). The pre-filter was removed from filterNoiseByEmbedding, and the memory_store guard now strips first and only rejects if nothing remains. The current diff reflects the fix.
Codex P2 — "Avoid rejecting memory_store input on partial envelope match"
Valid concern, and the author has addressed it. The memory_store guard in tools.ts now calls stripEnvelopeMetadata(text) first, then checks if the stripped result is empty. Mixed-content messages are preserved after stripping. Correct.
Blocking issues
1. Stray closing brace breaks filterNoiseByEmbedding
In src/smart-extractor.ts, the diff shows:
async filterNoiseByEmbedding(texts: string[]): Promise<string[]> {
const noiseBank = this.config.noiseBank;
+
+ }
+
if (!noiseBank || !noiseBank.initialized) return texts;There is a stray } on its own line that prematurely closes the filterNoiseByEmbedding method. The if (!noiseBank ...) line and everything after it would then be outside the method body, causing a compile error. This appears to be a leftover from removing the envelope pre-filter block — the block's closing brace was not removed along with its opening.
2. Character encoding corruption throughout smart-extractor.ts
Multiple comment lines have their em-dash characters (—) replaced with ??. Examples:
"Smart Memory Extractor ??LLM-powered extraction pipeline"(was— LLM-powered)"Pipeline: conversation ??LLM extract ??candidates ??dedup ??persist"(was→)"LLM returned zero candidates ??strongest noise signal"(was—)"Same brand, different item ??should not be deduped"(was→)- Chinese characters also corrupted:
"?�欢麦�??��??�烧鸡腿??"(was喜欢麦当劳的板烧鸡腿堡)
This is a systematic encoding issue — likely the file was saved or committed with an incorrect encoding. All corrupted characters need to be restored to their originals.
Non-blocking notes
ENVELOPE_NOISE_PATTERNSexport innoise-filter.tsis well-structured and the patterns are specific enough to avoid false positives.- The 8 new test cases in
strip-envelope-metadata.test.mjsprovide good regression coverage. tools.tslost its trailing newline (\ No newline at end of file). Minor.
Please fix the stray brace and the encoding corruption, then this is ready for another look.
|
All blocking issues have been fixed:
Please re-review when you have a chance 🙏 |
AliceLJY
left a comment
There was a problem hiding this comment.
Re-reviewed after latest commit. Both blocking issues from my previous review are fixed:
- Stray closing brace — removed,
filterNoiseByEmbeddingmethod body is clean ✅ - Encoding corruption — CJK characters (
喜欢麦当劳的板烧鸡腿堡) and em-dashes (—) are now correct ✅
Code quality:
ENVELOPE_NOISE_PATTERNSexport is well-structuredstripEnvelopeMetadataextensions handle Discord/channel markers properly (per-line, no blanket rejection)tools.tsguard correctly strips first → rejects only if empty (P2 fix)- 8 new test cases provide solid regression coverage
LGTM. @rwmjhb 请 review 后合并。
cdd5087 to
d86bdb4
Compare
…se 2) - Extend stripEnvelopeMetadata() with 8 new patterns: <<<EXTERNAL_UNTRUSTED_CONTENT, <<<END_EXTERNAL_UNTRUSTED_CONTENT, Sender/Conversation info (untrusted metadata), Thread starter, Forwarded message context, [Queued messages while agent was busy] - Add ENVELOPE_NOISE_PATTERNS to noise-filter.ts for pre-embedding guard - Add memory_store tool guard in tools.ts (strip-then-check approach) - Add 8 regression test cases in strip-envelope-metadata.test.mjs - Fix PR CortexReach#444 regex bug: subagent wrapper lines now stripped via entire-line matching (/^[Subagent Context|Subagent Task].*$/gm) - P1 fix: remove pre-filter from filterNoiseByEmbedding (runs before stripEnvelopeMetadata in extraction path, would cause false positives) - P2 fix: memory_store guard now strips first then checks if empty, preserving mixed-content messages Fixes CortexReach#446
d86bdb4 to
8226c7a
Compare
CI 修復記錄(本次 PR)@AliceLJY 麻煩請你在review 確認一下 🔴 第一輪 CI 失敗(Jobs: packaging-and-workflow / storage-and-schema / cli-smoke / core-regression)根本原因:CLI smoke test 的 jiti 預編譯快取在第一次失敗後未清除,後續 CI run 吃到殘留的錯誤版本。屬於 CI 環境問題,非 PR code 問題。 修復方式:等待 CI cache 自然過期,無需修改任何 code。 🔴 第二輪 CI 失敗(Jobs 同上)錯誤症狀: 根本原因: // ❌ 錯誤(殘留 artifact)
const validUntil = inferExpiry(text); (fix: extend envelope stripping to broader channel/system markers (Phase 2))
// ✅ 正確
const validUntil = inferExpiry(text);修復 commit: 🔴 第三輪 CI 失敗(Jobs 同上)錯誤症狀: 根本原因:同一 rebase session 在 // ❌ 錯誤(殘留 artifact)
);
(fix: extend envelope stripping to broader channel/system markers (Phase 2))
// ✅ 正確
);修復 commit: ✅ 第四輪 CI — ALL GREEN所有 artifact 清除完畢後,CI 全部通過。 📋 涉及檔案
📌 PR 內容摘要本 PR(Phase 2)為 envelope stripping 功能擴展:
所有功能變更均已通過 CI ✅ |
…pattern, use stripped for embed - smart-extractor.ts: restore stripLeadingRuntimeWrappers() call instead of global regex that strips all wrapper lines at any position (PR#444 regression) - noise-filter.ts: narrow ENVELOPE_NOISE_PATTERNS System pattern from /^System:\\s*\\[/im to /^System:\\s*\\[[\\d\\-: +GMT]+\\][\/im to avoid false-positives on legitimate user content like 'System: [小明] said...' - tools.ts: use 'stripped' instead of 'text' for embedPassage/classifyTemporal/ inferExpiry to avoid embedding/classifying envelope metadata PR#481 intent: extend PR#444 with 8 new envelope patterns, NOT replace it. This fix ensures PR#444's leading-only wrapper stripping is preserved.
Fix Summary — PR #481 Bug FixesFixed 3 bugs in PR #481 that either broke PR #444's existing behavior or introduced regressions: Bug 1:
|
|
Re-checked the latest head. The main regression is fixed, and I no longer consider this PR blocked. There is still a small follow-up in the manual |
Summary
Extends envelope metadata stripping (from PR #444) to cover 8 additional channel/system envelope patterns that were still leaking into memory extraction.
Changes
1. src/smart-extractor.ts — stripEnvelopeMetadata()
2. src/noise-filter.ts — ENVELOPE_NOISE_PATTERNS
3. src/tools.ts — registerMemoryStoreTool
4. test/strip-envelope-metadata.test.mjs
Testing
node --test test/strip-envelope-metadata.test.mjs
22/22 pass
References