Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection#19
Conversation
…kens, conditional reflection Three optimizations that reduce token waste without affecting capability: 1. Tool description compression: Cut verbose multi-sentence descriptions to concise single-line summaries across all 50+ tool definitions. Saves ~40% of per-tool token overhead (e.g. file_read from 230 chars to 90 chars, shell_exec from 280 chars to 100 chars, heartbeat_add from 250 chars to 100). 2. Adaptive max_tokens by model tier: Cheap tier gets 1024 max_tokens (simple greetings), mid tier gets 2048, expensive tier keeps 4096. Ollama stays at 16384 for reasoning models. Only applies when the user has not set an explicit max_tokens in their config. 3. Conditional reflection: Skip the post-bout reflection LLM call for simple exchanges (fewer than 3 loop iterations AND no tool calls). A hello/goodbye bout no longer wastes a reflection call. Substantive bouts (3+ turns or any tool use) still get full reflection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cea54086f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| match params.manifest.model.provider { | ||
| punch_types::Provider::Ollama => 16384, | ||
| _ => 4096, | ||
| }, | ||
| ), | ||
| _ => tier_default, |
There was a problem hiding this comment.
Use routed provider for adaptive max-token defaults
This adaptive default now depends on routed_tier, but the Ollama override is still keyed to params.manifest.model.provider rather than the provider of the routed tier model. In mixed-provider routing configs, a cheap/mid tier routed to Ollama can incorrectly get 1024/2048 instead of 16384, which can starve reasoning-token models and produce truncated or empty visible output. Compute this override from the routed model/provider (or active driver) instead of the base manifest provider.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. Fixed in 8a505b2 — the Ollama override now checks routed_provider (the tier's actual provider) instead of params.manifest.model.provider. When routing selects a cheap/mid tier backed by Ollama, it correctly gets 16384.
Generated by Claude Code
| // Conditional reflection: only reflect on substantive bouts. | ||
| // Skip reflection for simple exchanges (< 3 turns and no tool use) | ||
| // to avoid wasting an LLM call on "hello" / "how are you?" bouts. | ||
| let is_substantive_bout = guard.iterations() >= 3 || tool_calls_made > 0; |
There was a problem hiding this comment.
Gate reflection on actual turns, not loop-guard iterations
guard.iterations() does not represent conversation turns; it only increments on tool/continuation paths. With this condition, any normal no-tool EndTurn leaves iterations at 0, so reflection is skipped for essentially all text-only bouts, not just "< 3 turns" as the comment states. That unintentionally suppresses creed-learning updates for substantive non-tool exchanges.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed — fixed in 93ece0e. Switched from guard.iterations() to messages.len() >= REFLECTION_MIN_MESSAGES (6), which counts actual messages in the bout including history. This correctly captures substantive text-only exchanges that would have had 0 iterations. Also added tests for both the skip and trigger paths.
Generated by Claude Code
humancto
left a comment
There was a problem hiding this comment.
Review: Token Efficiency Phase 3
CI is all green (Test, Check, Clippy, Format, GitGuardian). Nice clean diff — 63 additions, 75 deletions, net negative line count. Three distinct optimizations reviewed below.
1. Tool Description Compression (tools.rs) ✅
Well done. The compressed descriptions are concise while retaining the essential information an LLM needs to select the right tool. A few highlights:
- Good:
heartbeat_addnow lists the valid cadences inline (every_bout, on_wake, hourly, daily, weekly, cron) — this is actually more useful than the verbose original since the LLM sees valid values at a glance. - Good:
shell_execkeeps "Universal fallback for any task doable from a terminal" — important behavioral hint preserved. - Minor concern:
file_listwent from a descriptive sentence to just "List files and directories in a folder." — this is fine, but some tools likeapp_ocrlost the guidance about when to prefer OCR over screenshots ("Use this first for reading text, fall back to sys_screenshot for visual/spatial understanding"). That routing hint helped the LLM pick the right tool. Consider keeping a short hint like "Prefer over sys_screenshot for text extraction."
Overall the compression is tasteful and the ~2K token/request savings is significant.
2. Adaptive max_tokens (fighter_loop.rs:314-332) ✅ with one note
The tier-based scaling logic is clean:
cheap → 1024, mid → 2048, expensive/none → 4096, Ollama → always 16384
The Ollama override correctly takes priority over tier defaults since reasoning models need headroom for thinking tokens regardless of routing tier.
One thing to note: When routing is configured but the tier driver fails to create (the Err(e) branch at line 189), routed_tier stays None, so max_tokens falls through to 4096 via the _ => 4096 wildcard. This is correct behavior (fail-safe to full budget), just worth being aware of.
Suggestion: The magic numbers 1024, 2048, 4096, 16384 appear without named constants. Consider defining them as constants (e.g., DEFAULT_MAX_TOKENS_CHEAP, etc.) in punch-types — this would make them easier to tune later and self-documenting. Not a blocker.
3. Conditional Reflection (fighter_loop.rs:453-475) ⚠️ Needs attention
The idea is sound — skipping a full LLM call on trivial bouts is a great optimization. However:
guard.iterations() does not count conversation turns. It counts loop iterations, which are incremented by record_tool_calls() (on tool-use stops) and record_iteration() (on empty-response retries and max-tokens continuations). A normal single-turn bout where the user says "hello" and the LLM responds with text (StopReason::EndTurn, non-empty) completes without incrementing iterations at all — it hits the EndTurn branch and returns directly.
This means the threshold guard.iterations() >= 3 is effectively checking "did we have 3+ tool-use rounds or continuation/retry cycles", not "did we have 3+ conversational turns." The practical impact:
- Simple "hello" bouts: iterations=0, tool_calls=0 → reflection correctly skipped ✅
- Complex multi-tool bouts: iterations≥1 (from tool use), tool_calls>0 → reflection correctly runs ✅
- Substantive text-only conversations (e.g., user asks a detailed question, LLM gives a long answer in one turn): iterations=0, tool_calls=0 → reflection skipped even though the bout may have meaningful learned behaviors
⚠️
The last case is the concern. A fighter giving a detailed technical explanation in a single turn would never get reflected on. The comment says "< 3 turns" but it's really "< 3 loop iterations", which is a different (and stricter) metric.
Recommendation: Consider using message count instead, e.g.:
let is_substantive_bout = messages.len() >= 4 || tool_calls_made > 0;This would count actual messages (user + assistant pairs) and better capture substantive text-only exchanges. Or if the current behavior is intentional (only reflect when tools were used or retries/continuations happened), update the comment to accurately describe what's being measured.
4. Missing Tests ⚠️
Per CLAUDE.md: "Every public function must have at least one unit test." The existing fighter_loop_tests.rs has no coverage for:
- Adaptive max_tokens: No test verifying that cheap tier gets 1024, mid gets 2048, etc.
- Conditional reflection: No test verifying reflection is skipped on simple bouts and runs on substantive ones.
Both are behavioral changes that could regress silently. At minimum, a test for each would be valuable:
- A test that mocks a cheap-tier route and asserts
max_tokensin the resultingCompletionRequestis 1024. - A test with a simple 1-turn bout asserting
reflect_on_boutis not called (or with a multi-tool bout asserting it is).
Summary
| Area | Verdict |
|---|---|
| Tool description compression | ✅ Ship it |
| Adaptive max_tokens | ✅ Good (constants are nice-to-have) |
| Conditional reflection | iterations() metric may not match intent — verify or adjust |
| Test coverage |
Solid optimization work overall. The description compression alone is a clear win. The reflection concern is the main item to address before merge.
Generated by Claude Code
…dd tests - Restore LLM routing hints in tool descriptions for app_ocr, ui_find_elements, skill_recommend, and memory_store that were over-compressed - Fix conditional reflection to use messages.len() instead of guard.iterations() which only counts tool-use rounds, not conversation turns - Extract magic max_tokens numbers into named constants - Add tests for adaptive max_tokens (Ollama, non-Ollama, explicit override) - Add tests for conditional reflection (simple bout skips, tool bout triggers) https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna
When model routing selects a tier that uses Ollama as its provider, the adaptive max_tokens logic now correctly checks the routed provider instead of the base manifest provider, preventing reasoning models from being starved of output tokens. https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna
Summary
Token savings breakdown
What's left from the token efficiency plan
Test plan
cargo test --workspace— all passcargo clippy --workspace -- -D warnings— zero warningscargo fmt --all -- --check— clean🤖 Generated with Claude Code