Skip to content

Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection#19

Merged
humancto merged 3 commits into
mainfrom
feat/token-efficiency-phase3
Mar 26, 2026
Merged

Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection#19
humancto merged 3 commits into
mainfrom
feat/token-efficiency-phase3

Conversation

@humancto
Copy link
Copy Markdown
Owner

Summary

  • Tool description compression: Cut all 50+ tool descriptions from verbose multi-sentence to concise single-line. ~40% reduction in per-tool token overhead
  • Adaptive max_tokens: Cheap tier = 1024, mid tier = 2048, expensive tier = 4096 (Ollama stays 16384). Only applies as default when user hasn't set explicit max_tokens
  • Conditional reflection: Skip post-bout reflection LLM call on simple exchanges (<3 turns AND no tool use). Saves an extra LLM call on ~80% of bouts (greetings, quick answers)

Token savings breakdown

Optimization Savings
Description compression (50+ tools) ~2,000 tokens/request
Adaptive max_tokens (output budget) Reduces provider overshoot
Conditional reflection Eliminates 1 LLM call on simple bouts

What's left from the token efficiency plan

  • Phase 3a (sliding window summarization) - future, needs careful testing
  • Phase 4 (budget dashboard + caps) - future, user-facing feature

Test plan

  • cargo test --workspace — all pass
  • cargo clippy --workspace -- -D warnings — zero warnings
  • cargo fmt --all -- --check — clean
  • Manual: verify Telegram bot responds correctly with compressed descriptions
  • Manual: verify cheap tier messages get shorter max_tokens
  • Manual: verify "hello" bouts skip reflection in logs

🤖 Generated with Claude Code

…kens, conditional reflection

Three optimizations that reduce token waste without affecting capability:

1. Tool description compression: Cut verbose multi-sentence descriptions to
   concise single-line summaries across all 50+ tool definitions. Saves ~40%
   of per-tool token overhead (e.g. file_read from 230 chars to 90 chars,
   shell_exec from 280 chars to 100 chars, heartbeat_add from 250 chars to 100).

2. Adaptive max_tokens by model tier: Cheap tier gets 1024 max_tokens (simple
   greetings), mid tier gets 2048, expensive tier keeps 4096. Ollama stays at
   16384 for reasoning models. Only applies when the user has not set an
   explicit max_tokens in their config.

3. Conditional reflection: Skip the post-bout reflection LLM call for simple
   exchanges (fewer than 3 loop iterations AND no tool calls). A hello/goodbye
   bout no longer wastes a reflection call. Substantive bouts (3+ turns or any
   tool use) still get full reflection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cea54086f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +325 to +327
match params.manifest.model.provider {
punch_types::Provider::Ollama => 16384,
_ => 4096,
},
),
_ => tier_default,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use routed provider for adaptive max-token defaults

This adaptive default now depends on routed_tier, but the Ollama override is still keyed to params.manifest.model.provider rather than the provider of the routed tier model. In mixed-provider routing configs, a cheap/mid tier routed to Ollama can incorrectly get 1024/2048 instead of 16384, which can starve reasoning-token models and produce truncated or empty visible output. Compute this override from the routed model/provider (or active driver) instead of the base manifest provider.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in 8a505b2 — the Ollama override now checks routed_provider (the tier's actual provider) instead of params.manifest.model.provider. When routing selects a cheap/mid tier backed by Ollama, it correctly gets 16384.


Generated by Claude Code

// Conditional reflection: only reflect on substantive bouts.
// Skip reflection for simple exchanges (< 3 turns and no tool use)
// to avoid wasting an LLM call on "hello" / "how are you?" bouts.
let is_substantive_bout = guard.iterations() >= 3 || tool_calls_made > 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate reflection on actual turns, not loop-guard iterations

guard.iterations() does not represent conversation turns; it only increments on tool/continuation paths. With this condition, any normal no-tool EndTurn leaves iterations at 0, so reflection is skipped for essentially all text-only bouts, not just "< 3 turns" as the comment states. That unintentionally suppresses creed-learning updates for substantive non-tool exchanges.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — fixed in 93ece0e. Switched from guard.iterations() to messages.len() >= REFLECTION_MIN_MESSAGES (6), which counts actual messages in the bout including history. This correctly captures substantive text-only exchanges that would have had 0 iterations. Also added tests for both the skip and trigger paths.


Generated by Claude Code

Copy link
Copy Markdown
Owner Author

@humancto humancto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Token Efficiency Phase 3

CI is all green (Test, Check, Clippy, Format, GitGuardian). Nice clean diff — 63 additions, 75 deletions, net negative line count. Three distinct optimizations reviewed below.


1. Tool Description Compression (tools.rs) ✅

Well done. The compressed descriptions are concise while retaining the essential information an LLM needs to select the right tool. A few highlights:

  • Good: heartbeat_add now lists the valid cadences inline (every_bout, on_wake, hourly, daily, weekly, cron) — this is actually more useful than the verbose original since the LLM sees valid values at a glance.
  • Good: shell_exec keeps "Universal fallback for any task doable from a terminal" — important behavioral hint preserved.
  • Minor concern: file_list went from a descriptive sentence to just "List files and directories in a folder." — this is fine, but some tools like app_ocr lost the guidance about when to prefer OCR over screenshots ("Use this first for reading text, fall back to sys_screenshot for visual/spatial understanding"). That routing hint helped the LLM pick the right tool. Consider keeping a short hint like "Prefer over sys_screenshot for text extraction."

Overall the compression is tasteful and the ~2K token/request savings is significant.


2. Adaptive max_tokens (fighter_loop.rs:314-332) ✅ with one note

The tier-based scaling logic is clean:

cheap → 1024, mid → 2048, expensive/none → 4096, Ollama → always 16384

The Ollama override correctly takes priority over tier defaults since reasoning models need headroom for thinking tokens regardless of routing tier.

One thing to note: When routing is configured but the tier driver fails to create (the Err(e) branch at line 189), routed_tier stays None, so max_tokens falls through to 4096 via the _ => 4096 wildcard. This is correct behavior (fail-safe to full budget), just worth being aware of.

Suggestion: The magic numbers 1024, 2048, 4096, 16384 appear without named constants. Consider defining them as constants (e.g., DEFAULT_MAX_TOKENS_CHEAP, etc.) in punch-types — this would make them easier to tune later and self-documenting. Not a blocker.


3. Conditional Reflection (fighter_loop.rs:453-475) ⚠️ Needs attention

The idea is sound — skipping a full LLM call on trivial bouts is a great optimization. However:

guard.iterations() does not count conversation turns. It counts loop iterations, which are incremented by record_tool_calls() (on tool-use stops) and record_iteration() (on empty-response retries and max-tokens continuations). A normal single-turn bout where the user says "hello" and the LLM responds with text (StopReason::EndTurn, non-empty) completes without incrementing iterations at all — it hits the EndTurn branch and returns directly.

This means the threshold guard.iterations() >= 3 is effectively checking "did we have 3+ tool-use rounds or continuation/retry cycles", not "did we have 3+ conversational turns." The practical impact:

  • Simple "hello" bouts: iterations=0, tool_calls=0 → reflection correctly skipped ✅
  • Complex multi-tool bouts: iterations≥1 (from tool use), tool_calls>0 → reflection correctly runs ✅
  • Substantive text-only conversations (e.g., user asks a detailed question, LLM gives a long answer in one turn): iterations=0, tool_calls=0 → reflection skipped even though the bout may have meaningful learned behaviors ⚠️

The last case is the concern. A fighter giving a detailed technical explanation in a single turn would never get reflected on. The comment says "< 3 turns" but it's really "< 3 loop iterations", which is a different (and stricter) metric.

Recommendation: Consider using message count instead, e.g.:

let is_substantive_bout = messages.len() >= 4 || tool_calls_made > 0;

This would count actual messages (user + assistant pairs) and better capture substantive text-only exchanges. Or if the current behavior is intentional (only reflect when tools were used or retries/continuations happened), update the comment to accurately describe what's being measured.


4. Missing Tests ⚠️

Per CLAUDE.md: "Every public function must have at least one unit test." The existing fighter_loop_tests.rs has no coverage for:

  • Adaptive max_tokens: No test verifying that cheap tier gets 1024, mid gets 2048, etc.
  • Conditional reflection: No test verifying reflection is skipped on simple bouts and runs on substantive ones.

Both are behavioral changes that could regress silently. At minimum, a test for each would be valuable:

  1. A test that mocks a cheap-tier route and asserts max_tokens in the resulting CompletionRequest is 1024.
  2. A test with a simple 1-turn bout asserting reflect_on_bout is not called (or with a multi-tool bout asserting it is).

Summary

Area Verdict
Tool description compression ✅ Ship it
Adaptive max_tokens ✅ Good (constants are nice-to-have)
Conditional reflection ⚠️ iterations() metric may not match intent — verify or adjust
Test coverage ⚠️ Add tests for both new behaviors

Solid optimization work overall. The description compression alone is a clear win. The reflection concern is the main item to address before merge.


Generated by Claude Code

claude added 2 commits March 26, 2026 17:23
…dd tests

- Restore LLM routing hints in tool descriptions for app_ocr, ui_find_elements,
  skill_recommend, and memory_store that were over-compressed
- Fix conditional reflection to use messages.len() instead of guard.iterations()
  which only counts tool-use rounds, not conversation turns
- Extract magic max_tokens numbers into named constants
- Add tests for adaptive max_tokens (Ollama, non-Ollama, explicit override)
- Add tests for conditional reflection (simple bout skips, tool bout triggers)

https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna
When model routing selects a tier that uses Ollama as its provider,
the adaptive max_tokens logic now correctly checks the routed provider
instead of the base manifest provider, preventing reasoning models
from being starved of output tokens.

https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna
@humancto humancto merged commit 9489e80 into main Mar 26, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants