Fix session restore invariants: fail-closed on corrupted tool rows, redact config secrets, and type getSession errors by JasonBuildAI · Pull Request #79 · FireRedTeam/FireRed-OpenStoryline

JasonBuildAI · 2026-04-07T11:08:15Z

Summary

我重新提交并收敛了 PR #75「会话状态落盘与重启恢复」相关能力，并严格按维护者 review 的三点建议补齐边界与兜底策略。该 PR 只包含 2 个文件的修改：agent_fastapi.py、web/static/app.js。

Why PR #75 was closed

原 PR #75 后期为了对齐上游与 UI/配置结构，混入了几次与“会话持久化主线”无关的提交（例如 merge upstream、补配置 schema、UI adopt）。这些混入提交随后在关闭前集中 revert，导致 PR 变更范围失焦、难以 review/merge，因此 PR #75 选择关闭并建议后续以更清晰的范围重提。

What changed

1) 会话状态落盘与重启恢复（延续 PR#75 主线）

支持会话状态落盘与懒恢复，保障重启后同一 session_id 可继续使用。
保持 503 vs 404 语义边界：状态文件不可读/损坏 → 503；会话不存在/不匹配 → 404。

2) 我如何落实维护者提出的三点建议

建议 1：脱敏不应只依赖 regex；应在配置结构体层面避免序列化 raw key

我把“主边界”前移到 持久化结构体层面：落盘时对 custom_llm_config/custom_vlm_config 进行结构化裁剪，保证 raw key 不会被序列化进状态文件。
同时保留原有的递归脱敏（regex/关键词匹配）作为兜底，用于覆盖 history/tool payload 等非结构化字符串路径。

建议 2：坏掉的 `session_state.json` 处理策略（503 vs 404）

我维持并强化了该边界：读/解析失败 → 503；缺失/不匹配 → 404，使客户端能正确区分“可重试”与“确实不存在”。

建议 3：ToolMessage 丢 `tool_call_id` 时不要只 warning；应 fail-closed，并显式 degraded

我实现了 fail-closed：一旦恢复时发现 ToolMessage 缺失 tool_call_id 或 tool 协议不一致，会从其所属的 tool-use turn 起截断后续 lc_messages，避免把语义可能错配的 tool 结果继续喂给 LLM。
同时我把降级状态 显式暴露给上层：在 session snapshot 中返回 restore_degraded / restore_degraded_reason，便于前端或调用方提示用户“部分历史未恢复”。

Test plan（全链路）

我做了全链路验证并覆盖了以下场景：

custom_llm_config 脱敏（通过）
TTS 配置脱敏（通过）
递归脱敏兜底能力（通过）
状态文件读写（通过）
损坏状态文件（503/404 边界正确）（通过）
ToolMessage（有效/损坏/缺失）处理（通过）
边界情况（None/空字典/部分配置）（通过）
完整工作流（含损坏场景检测）（通过）

Persist session state to disk with safe recovery and secret redaction so chat can continue after server restart. Made-with: Cursor

Ensure restored AIMessage tool_calls are reconstructed and sanitize lc_messages to avoid provider 400 errors. Made-with: Cursor

Persist explicit AIMessage tool_calls and enforce block-scoped tool message closure/sanitization so restored sessions avoid orphan-tool protocol violations across providers. Made-with: Cursor

…ring secret masking Made-with: Cursor

…ch; clarify vs review Made-with: Cursor

…tstrap retry Made-with: Cursor

Made-with: Cursor

Anlittledy · 2026-04-07T12:24:58Z

再次感谢你的贡献！我大致看了一下，之前我提出的几点担忧已经基本解决了。我再看一下细节实现，确认下有没有遗漏的地方。

Anlittledy

整体方向没问题，我这边主要关注了几处“恢复一致性”问题，包括瞬时错误下的启动恢复、lc_messages 裁剪边界、切语言后的 prompt 规范化，以及 sent_media_total 没有持久化。建议把这几处补齐，这样整条恢复链路会更稳定。如果你有其他意见，我们也可以再讨论下。

Anlittledy · 2026-04-08T03:04:21Z

web/static/app.js

+   * overwrite SESSION_ID_KEY with a new id).
+   * @returns {Promise<boolean>} true if session was restored
+   */
+  async _bootstrapRestoreSavedSession(saved) {


启动恢复时， 429 其实也应该纳入可重试类型。另外，网络抖动在 3 次重试失败后，仍然会自动新建 session。这样会把原本合法的 SESSION_ID_KEY 替换掉。这个问题本质上是布尔值不够表达多种状态。
建议：把恢复结果从布尔值改成三态：restored、missing(404)、unavailable(429/503/网络错误)；只有明确 404 时才清理本地记录并创建新 session，如果是 unavailable，则保留旧 id，并提示“稍后重试”。

已处理：429 与 5xx/网络一样走可重试逻辑；恢复结果改为 restored / missing / unavailable 三态。仅 missing（404）会清本地会话并 newSession()；unavailable 保留原 SESSION_ID_KEY，Toast 提示稍后重试，并用空快照 + recovery poll 继续拉服务端快照。

Anlittledy · 2026-04-08T03:06:26Z

agent_fastapi.py

+            out = out[-SESSION_STATE_MAX_HISTORY:]
+        return out
+
+    def _serialize_lc_messages(self) -> List[Dict[str, Any]]:


_serialize_lc_messages() 目前按条数硬裁剪，可能在 AIMessage(tool_calls) -> ToolMessage 中间截断，恢复时会触发 fail-closed，导致长会话被意外大幅截短。
建议：裁剪时按 tool turn 边界处理：如果要删掉一个带 tool_calls 的 AIMessage，应把它后面对应的 ToolMessage 一起删掉，确保裁剪结果不会从孤儿 ToolMessage 开头。

已处理：在保留前两条固定 system 头后，对正文用 _safe_body_suffix_for_state 按预算取后缀，避免截取结果以孤立的 ToolMessage 开头，从而减少恢复时因协议不整段 fail-closed、历史被大幅截短的问题。

Anlittledy · 2026-04-08T03:10:34Z

agent_fastapi.py

+
    def _ensure_system_prompt(self) -> None:
+        # Keep the instruction system prompt and upload-status placeholder stable.
+        self._ensure_lc_messages_invariants()


切换语言时，self._ensure_lc_messages_invariants() 在发现第 0 条不是当前语言 prompt 时，会在头部插入新的 prompt，但不会删除旧的 prompt。随后又会把 upload-status 挪到第 1 条，结果旧语言 prompt 会被挤到 index 2，最终和新 prompt 一起保留下来。建议把 _normalized_lc_messages_for_state() 和 _ensure_lc_messages_invariants() 里重复的规范化逻辑抽成一个公用 helper，统一输出固定结构：第 0 条是当前语言 prompt，第 1 条是 upload-status。

已处理：抽出 _apply_instruction_and_upload_headers，统一保证第 0 条为当前语言 instruction、第 1 条为 upload-status，并剔除旧的中/英 instruction system 行，避免切语言后旧 prompt 仍留在列表里。

Anlittledy · 2026-04-08T03:12:01Z

agent_fastapi.py

+                continue
+        return out
+
+    def dump_state(self) -> Dict[str, Any]:


sent_media_total 目前只存在于内存中，但它参与后续媒体统计，属于应持久化的 session 状态。建议在 dump_state() 中写入，并在 load_from_state() 中恢复。

已处理：sent_media_total 写入 dump_state，在 load_from_state 恢复；并在 snapshot.stats 中带上 sent_media_total，与跨重启的媒体统计一致。

…_media_total, 3-state bootstrap Made-with: Cursor

JasonBuildAI · 2026-04-08T05:15:57Z

感谢 review。我已按意见改了一版：前端启动恢复改为三态并区分 404 与可重试错误；lc_messages 落盘裁剪避免在 tool turn 中间断开；切语言后的 system 头用统一 helper 固定为「当前语言 instruction + upload-status」；sent_media_total 已纳入落盘与恢复。烦请再帮看一眼是否还有边界情况需要收紧。

Anlittledy

感谢更新。有一个关于_merged_messages的小问题建议优先修复一下，完成后我认为可以merge.
另外记录一个 non-blocking concern：
当前 unavailable 分支会先 useSession(saved, { history: [], ... })，然后异步 _scheduleRecoveryPoll()；也就是说，前端会先进入一个绑定旧 session_id 的占位会话，而不是真正已经恢复完成的状态。与此同时，useSession() 只会发起 connectWs()，不会等待 WebSocket ready，也不会等待首个 session.snapshot 到达并应用。发送入口也没有把这些条件纳入禁用判断，_updateComposerDisabledState() 目前只看 uploading/canceling。因此这里有两层一致性风险：

如果用户在 socket 真正连上前（WS尚未 ready时）立即发送，sendPrompt() 会先把 user message 乐观插进 UI，然后调用 this.ws.send(...)；而 WsClient.send() 在 readyState !== 1 时会直接return，消息可能根本没出浏览器。
即使用户在 socket 已经 ready时发送消息，只要首个 session.snapshot 还没被应用，用户此时看到的仍然可能是一个空白占位界面，容易误以为自己在“新对话”里发送；但这条消息实际会继续追加到原来的旧会话上。后面一旦恢复成功并收到 session.snapshot，前端又会 clearAll() 并按服务端历史重放，用户看到的效果就可能是界面上下文突然跳变，甚至刚发出的消息短暂消失。

严格说，这个窗口不只存在于 unavailable 分支，而是所有 useSession() 路径在“初始同步完成前”理论上都存在；只是unavailable会显著放大这个窗口，因此更容易被用户撞到。我认为这是个真实的一致性风险，但更合适的修法应该是统一把所有 useSession() 路径都纳入“initial session sync ready before send”的状态管理，而不只是给 unavailable 分支打补丁。

考虑到改动面相对大，远远超出本 PR 的范围，我不会因为这一点单独挡 merge，这里仅作为 follow-up 记录。

Anlittledy · 2026-04-09T07:44:41Z

agent_fastapi.py

                            try:
+                                # Guard: ensure the messages list won't violate tool protocol after restore.
+                                # This prevents provider-side 400 like: "Messages with role 'tool' must be a response..."
+                                sess._sanitize_tool_protocol_in_lc_messages()


这边代码有个新增的逻辑：sess.agent.astream拿的是_merged_messages，而不是sess.lc_messages以确保兼容性（见 #77），所以sess._sanitize_tool_protocol_in_lc_messages()对后面的 stream = sess.agent.astream其实没有起效。建议把sess._sanitize_tool_protocol_in_lc_messages()挪到_system_parts: List[str] = []前面，这样_merged_messages就是基于修正后的sess.lc_messages构造的。

感谢审核！我刚刚直接在GitHub中进行了修改，把sess._sanitize_tool_protocol_in_lc_messages()挪到_system_parts: List[str] = []前面。

Added sanitization of protocol on sess.lc_messages before building the merged request list for astream to ensure the structure is always sanitized.

JasonBuildAI · 2026-04-09T09:02:01Z

感谢更新。有一个关于_merged_messages的小问题建议优先修复一下，完成后我认为可以merge. 另外记录一个 non-blocking concern：当前 unavailable 分支会先 useSession(saved, { history: [], ... })，然后异步 _scheduleRecoveryPoll()；也就是说，前端会先进入一个绑定旧 session_id 的占位会话，而不是真正已经恢复完成的状态。与此同时，useSession() 只会发起 connectWs()，不会等待 WebSocket ready，也不会等待首个 session.snapshot 到达并应用。发送入口也没有把这些条件纳入禁用判断，_updateComposerDisabledState() 目前只看 uploading/canceling。因此这里有两层一致性风险：

如果用户在 socket 真正连上前（WS尚未 ready时）立即发送，sendPrompt() 会先把 user message 乐观插进 UI，然后调用 this.ws.send(...)；而 WsClient.send() 在 readyState !== 1 时会直接return，消息可能根本没出浏览器。

即使用户在 socket 已经 ready时发送消息，只要首个 session.snapshot 还没被应用，用户此时看到的仍然可能是一个空白占位界面，容易误以为自己在“新对话”里发送；但这条消息实际会继续追加到原来的旧会话上。后面一旦恢复成功并收到 session.snapshot，前端又会 clearAll() 并按服务端历史重放，用户看到的效果就可能是界面上下文突然跳变，甚至刚发出的消息短暂消失。

严格说，这个窗口不只存在于 unavailable 分支，而是所有 useSession() 路径在“初始同步完成前”理论上都存在；只是unavailable会显著放大这个窗口，因此更容易被用户撞到。我认为这是个真实的一致性风险，但更合适的修法应该是统一把所有 useSession() 路径都纳入“initial session sync ready before send”的状态管理，而不只是给 unavailable 分支打补丁。

考虑到改动面相对大，远远超出本 PR 的范围，我不会因为这一点单独挡 merge，这里仅作为 follow-up 记录。

感谢您非常细致的 review 和提醒，这个问题您说得很准确。

原本的设计思路是希望在 unavailable 路径下先进入占位会话，再异步拉起恢复，优先保证用户侧“可见且可操作”的启动体验，避免页面长时间无响应；同时配合 typed error / recovery poll 做渐进恢复，这显然并不够完善。

您指出的这两类一致性窗口（WS 未 ready 即发送、以及首个 session.snapshot 未应用前发送）确实成立。我这边在实际测试中也复现到了类似现象：在特定时序下，前端显示状态与真实会话状态会短暂不一致，出现消息“看起来发了但未真正送出”或恢复后上下文跳变的问题。

所以我认同您的判断：这不是只在 unavailable 分支打补丁就能彻底解决的问题，而是需要统一到所有 useSession() 路径上的 “initial session sync ready before send” 状态管理。对于一个严谨的项目，这是必须解决的一致性问题。

当前这个 PR 先收敛会话恢复主线与 review 中的 blocking 点；关于“初始同步完成前禁发/排队发送”的统一治理，确实需要独立改进尽快补齐。

再次感谢您的严格把关和建设性建议。

Anlittledy · 2026-04-09T10:42:47Z

感谢你对这个PR的耐心迭代，我认为这个PR目前可以合并了。

JasonBuildAI added 7 commits April 7, 2026 18:34

feat: persist chat sessions securely across restarts

ce78a29

Persist session state to disk with safe recovery and secret redaction so chat can continue after server restart. Made-with: Cursor

fix: prevent orphan tool messages after session restore

c548966

Ensure restored AIMessage tool_calls are reconstructed and sanitize lc_messages to avoid provider 400 errors. Made-with: Cursor

fix: harden tool_calls persistence and restore sequencing

0e0cf93

Persist explicit AIMessage tool_calls and enforce block-scoped tool message closure/sanitization so restored sessions avoid orphan-tool protocol violations across providers. Made-with: Cursor

fix(session): address review — warn on empty tool_call_id; tighten st…

575d440

…ring secret masking Made-with: Cursor

fix(session): return 503 for unreadable state; 404 for missing/mismat…

2d1e14c

…ch; clarify vs review Made-with: Cursor

fix(web): typed HttpError for getSession; 404-only local cleanup; boo…

3c3042b

…tstrap retry Made-with: Cursor

fix(session): fail-closed restore + redact secrets in persisted config

1829b50

Made-with: Cursor

Anlittledy requested changes Apr 8, 2026

View reviewed changes

fix(session): address review — tool-safe lc trim, header helper, sent…

a65c4a4

…_media_total, 3-state bootstrap Made-with: Cursor

JasonBuildAI requested a review from Anlittledy April 8, 2026 07:22

Anlittledy requested changes Apr 9, 2026

View reviewed changes

Sanitize protocol on lc_messages before merging requests

1dab33f

Added sanitization of protocol on sess.lc_messages before building the merged request list for astream to ensure the structure is always sanitized.

JasonBuildAI requested a review from Anlittledy April 9, 2026 08:47

Anlittledy merged commit f00971f into FireRedTeam:main Apr 9, 2026

Conversation

JasonBuildAI commented Apr 7, 2026

Summary

Why PR #75 was closed

What changed

1) 会话状态落盘与重启恢复（延续 PR#75 主线）

2) 我如何落实维护者提出的三点建议

建议 1：脱敏不应只依赖 regex；应在配置结构体层面避免序列化 raw key

建议 2：坏掉的 session_state.json 处理策略（503 vs 404）

建议 3：ToolMessage 丢 tool_call_id 时不要只 warning；应 fail-closed，并显式 degraded

Test plan（全链路）

Uh oh!

Anlittledy commented Apr 7, 2026

Uh oh!

Anlittledy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JasonBuildAI commented Apr 8, 2026

Uh oh!

Anlittledy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JasonBuildAI commented Apr 9, 2026

Uh oh!

Anlittledy commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

建议 2：坏掉的 `session_state.json` 处理策略（503 vs 404）

建议 3：ToolMessage 丢 `tool_call_id` 时不要只 warning；应 fail-closed，并显式 degraded