Why this is interesting
Memory in Phase D is naive on purpose: a key/value table the agent writes to via the update-memory skill, with the platform injecting all of it into role_context at dispatch. Correct for hackathon scale and forward-compatible — but it accumulates linearly with the agent's life. The interesting engineering problem is what to do once an agent has hundreds of keys, many of them stale, redundant, or contradictory.
Same shape as MemGPT, Letta, and the generative-agents reflection pattern. Real product moat — "your AI employee actually learns and consolidates how it works for you" — not just maintenance.
The forcing function
Compaction starts mattering when injected memory exceeds the role-context budget. Rough thresholds:
- ~1–2k tokens of role_context (~50–100 keys at typical sizes) → noticeable signal dilution.
- Above ~4k tokens → the agent's actual instruction starts losing salience against accumulated trivia.
Below that, all-in injection (Phase D default) is fine.
Three tiers of approach (build order)
-
Mechanical eviction. LRU on agent_memory (touch updated_at on every read, evict bottom of LRU above N keys). Cheap, no LLM. Ship first as a safety net.
-
Heuristic clustering. Group keys by namespace prefix (style.*, repos.*, people.*); show counts; let the user prune manually via a UI. Mid-effort, no LLM, gives the user a knob.
-
LLM reflection (the real prize). A scheduled job (or budget-triggered) loads the agent's full memory + recent slice of agent_action_log, asks the LLM:
"Consolidate this into the smallest set of facts/preferences that still capture how this user wants you to work. Drop stale/contradicted preferences. Surface contradictions."
Rewrites memory in place. The agent next session reads the consolidated form.
What makes (3) genuinely hard
- Race conditions. The agent may be writing memory while reflection runs. Need pessimistic locking, an
agent_memory_version snapshot, or a "dirty" flag.
- Versioning + rollback. If reflection produces worse memory than before, you need a way to revert. Store the pre-reflection snapshot for N days.
- Trust & visibility. The user should be able to see what reflection did — diff view of before vs after, with the LLM's reasoning. Without that, the agent feels like it's silently forgetting things.
- Triggering policy. Cron (
every Sunday at 3am)? Token-budget triggered (when injected memory > X)? Manual user button? Probably all three with different defaults.
- Action-log distillation. The cousin problem: turning thousands of
agent_action_log rows into a "what has my agent been doing" narrative for the work-log surface. Same reflection-style approach.
Forward-compatible defaults Phase D will set
Both lock-in-free; compaction can be added without schema changes:
agent_memory rows carry updated_at for last-write-wins and future LRU.
- Memory injection at dispatch is "all keys for this agent" — swap the function later for a filtered/compacted version. The dispatch contract is stable.
Open questions
- Reflection model: Kimi (in-container) or Claude on the platform? Probably Claude — the platform has the full memory and shouldn't depend on the agent's container being up.
- Should the user be able to edit memory directly, bypassing the agent? (Probably yes; surfaces as a settings panel post-Phase-D.)
- Is there a public eval for "did this reflection make memory better"? (Probably not without ground truth — but a held-out task can A/B "task quality before vs after reflection.")
Depends on #4 (Phase D).
Why this is interesting
Memory in Phase D is naive on purpose: a key/value table the agent writes to via the
update-memoryskill, with the platform injecting all of it intorole_contextat dispatch. Correct for hackathon scale and forward-compatible — but it accumulates linearly with the agent's life. The interesting engineering problem is what to do once an agent has hundreds of keys, many of them stale, redundant, or contradictory.Same shape as MemGPT, Letta, and the generative-agents reflection pattern. Real product moat — "your AI employee actually learns and consolidates how it works for you" — not just maintenance.
The forcing function
Compaction starts mattering when injected memory exceeds the role-context budget. Rough thresholds:
Below that, all-in injection (Phase D default) is fine.
Three tiers of approach (build order)
Mechanical eviction. LRU on
agent_memory(touchupdated_aton every read, evict bottom of LRU above N keys). Cheap, no LLM. Ship first as a safety net.Heuristic clustering. Group keys by namespace prefix (
style.*,repos.*,people.*); show counts; let the user prune manually via a UI. Mid-effort, no LLM, gives the user a knob.LLM reflection (the real prize). A scheduled job (or budget-triggered) loads the agent's full memory + recent slice of
agent_action_log, asks the LLM:Rewrites memory in place. The agent next session reads the consolidated form.
What makes (3) genuinely hard
agent_memory_versionsnapshot, or a "dirty" flag.every Sunday at 3am)? Token-budget triggered (when injected memory > X)? Manual user button? Probably all three with different defaults.agent_action_logrows into a "what has my agent been doing" narrative for the work-log surface. Same reflection-style approach.Forward-compatible defaults Phase D will set
Both lock-in-free; compaction can be added without schema changes:
agent_memoryrows carryupdated_atfor last-write-wins and future LRU.Open questions
Depends on #4 (Phase D).