Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection by humancto · Pull Request #19 · humancto/punch

humancto · 2026-03-26T16:58:55Z

Summary

Tool description compression: Cut all 50+ tool descriptions from verbose multi-sentence to concise single-line. ~40% reduction in per-tool token overhead
Adaptive max_tokens: Cheap tier = 1024, mid tier = 2048, expensive tier = 4096 (Ollama stays 16384). Only applies as default when user hasn't set explicit max_tokens
Conditional reflection: Skip post-bout reflection LLM call on simple exchanges (<3 turns AND no tool use). Saves an extra LLM call on ~80% of bouts (greetings, quick answers)

Token savings breakdown

Optimization	Savings
Description compression (50+ tools)	~2,000 tokens/request
Adaptive max_tokens (output budget)	Reduces provider overshoot
Conditional reflection	Eliminates 1 LLM call on simple bouts

What's left from the token efficiency plan

Phase 3a (sliding window summarization) - future, needs careful testing
Phase 4 (budget dashboard + caps) - future, user-facing feature

Test plan

cargo test --workspace — all pass
cargo clippy --workspace -- -D warnings — zero warnings
cargo fmt --all -- --check — clean
Manual: verify Telegram bot responds correctly with compressed descriptions
Manual: verify cheap tier messages get shorter max_tokens
Manual: verify "hello" bouts skip reflection in logs

🤖 Generated with Claude Code

…kens, conditional reflection Three optimizations that reduce token waste without affecting capability: 1. Tool description compression: Cut verbose multi-sentence descriptions to concise single-line summaries across all 50+ tool definitions. Saves ~40% of per-tool token overhead (e.g. file_read from 230 chars to 90 chars, shell_exec from 280 chars to 100 chars, heartbeat_add from 250 chars to 100). 2. Adaptive max_tokens by model tier: Cheap tier gets 1024 max_tokens (simple greetings), mid tier gets 2048, expensive tier keeps 4096. Ollama stays at 16384 for reasoning models. Only applies when the user has not set an explicit max_tokens in their config. 3. Conditional reflection: Skip the post-bout reflection LLM call for simple exchanges (fewer than 3 loop iterations AND no tool calls). A hello/goodbye bout no longer wastes a reflection call. Substantive bouts (3+ turns or any tool use) still get full reflection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cea54086f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-26T17:04:57Z

                match params.manifest.model.provider {
                    punch_types::Provider::Ollama => 16384,
-                    _ => 4096,
-                },
-            ),
+                    _ => tier_default,


Use routed provider for adaptive max-token defaults

This adaptive default now depends on routed_tier, but the Ollama override is still keyed to params.manifest.model.provider rather than the provider of the routed tier model. In mixed-provider routing configs, a cheap/mid tier routed to Ollama can incorrectly get 1024/2048 instead of 16384, which can starve reasoning-token models and produce truncated or empty visible output. Compute this override from the routed model/provider (or active driver) instead of the base manifest provider.

Useful? React with 👍 / 👎.

Good catch. Fixed in 8a505b2 — the Ollama override now checks routed_provider (the tier's actual provider) instead of params.manifest.model.provider. When routing selects a cheap/mid tier backed by Ollama, it correctly gets 16384.

Generated by Claude Code

chatgpt-codex-connector · 2026-03-26T17:04:57Z

+                // Conditional reflection: only reflect on substantive bouts.
+                // Skip reflection for simple exchanges (< 3 turns and no tool use)
+                // to avoid wasting an LLM call on "hello" / "how are you?" bouts.
+                let is_substantive_bout = guard.iterations() >= 3 || tool_calls_made > 0;


Gate reflection on actual turns, not loop-guard iterations

guard.iterations() does not represent conversation turns; it only increments on tool/continuation paths. With this condition, any normal no-tool EndTurn leaves iterations at 0, so reflection is skipped for essentially all text-only bouts, not just "< 3 turns" as the comment states. That unintentionally suppresses creed-learning updates for substantive non-tool exchanges.

Useful? React with 👍 / 👎.

Agreed — fixed in 93ece0e. Switched from guard.iterations() to messages.len() >= REFLECTION_MIN_MESSAGES (6), which counts actual messages in the bout including history. This correctly captures substantive text-only exchanges that would have had 0 iterations. Also added tests for both the skip and trigger paths.

Generated by Claude Code

humancto

Review: Token Efficiency Phase 3

CI is all green (Test, Check, Clippy, Format, GitGuardian). Nice clean diff — 63 additions, 75 deletions, net negative line count. Three distinct optimizations reviewed below.

1. Tool Description Compression (`tools.rs`) ✅

Well done. The compressed descriptions are concise while retaining the essential information an LLM needs to select the right tool. A few highlights:

Good: heartbeat_add now lists the valid cadences inline (every_bout, on_wake, hourly, daily, weekly, cron) — this is actually more useful than the verbose original since the LLM sees valid values at a glance.
Good: shell_exec keeps "Universal fallback for any task doable from a terminal" — important behavioral hint preserved.
Minor concern: file_list went from a descriptive sentence to just "List files and directories in a folder." — this is fine, but some tools like app_ocr lost the guidance about when to prefer OCR over screenshots ("Use this first for reading text, fall back to sys_screenshot for visual/spatial understanding"). That routing hint helped the LLM pick the right tool. Consider keeping a short hint like "Prefer over sys_screenshot for text extraction."

Overall the compression is tasteful and the ~2K token/request savings is significant.

2. Adaptive `max_tokens` (`fighter_loop.rs:314-332`) ✅ with one note

The tier-based scaling logic is clean:

cheap → 1024, mid → 2048, expensive/none → 4096, Ollama → always 16384

The Ollama override correctly takes priority over tier defaults since reasoning models need headroom for thinking tokens regardless of routing tier.

One thing to note: When routing is configured but the tier driver fails to create (the Err(e) branch at line 189), routed_tier stays None, so max_tokens falls through to 4096 via the _ => 4096 wildcard. This is correct behavior (fail-safe to full budget), just worth being aware of.

Suggestion: The magic numbers 1024, 2048, 4096, 16384 appear without named constants. Consider defining them as constants (e.g., DEFAULT_MAX_TOKENS_CHEAP, etc.) in punch-types — this would make them easier to tune later and self-documenting. Not a blocker.

3. Conditional Reflection (`fighter_loop.rs:453-475`) ⚠️ Needs attention

The idea is sound — skipping a full LLM call on trivial bouts is a great optimization. However:

guard.iterations() does not count conversation turns. It counts loop iterations, which are incremented by record_tool_calls() (on tool-use stops) and record_iteration() (on empty-response retries and max-tokens continuations). A normal single-turn bout where the user says "hello" and the LLM responds with text (StopReason::EndTurn, non-empty) completes without incrementing iterations at all — it hits the EndTurn branch and returns directly.

This means the threshold guard.iterations() >= 3 is effectively checking "did we have 3+ tool-use rounds or continuation/retry cycles", not "did we have 3+ conversational turns." The practical impact:

Simple "hello" bouts: iterations=0, tool_calls=0 → reflection correctly skipped ✅
Complex multi-tool bouts: iterations≥1 (from tool use), tool_calls>0 → reflection correctly runs ✅
Substantive text-only conversations (e.g., user asks a detailed question, LLM gives a long answer in one turn): iterations=0, tool_calls=0 → reflection skipped even though the bout may have meaningful learned behaviors ⚠️

The last case is the concern. A fighter giving a detailed technical explanation in a single turn would never get reflected on. The comment says "< 3 turns" but it's really "< 3 loop iterations", which is a different (and stricter) metric.

Recommendation: Consider using message count instead, e.g.:

let is_substantive_bout = messages.len() >= 4 || tool_calls_made > 0;

This would count actual messages (user + assistant pairs) and better capture substantive text-only exchanges. Or if the current behavior is intentional (only reflect when tools were used or retries/continuations happened), update the comment to accurately describe what's being measured.

4. Missing Tests ⚠️

Per CLAUDE.md: "Every public function must have at least one unit test." The existing fighter_loop_tests.rs has no coverage for:

Adaptive max_tokens: No test verifying that cheap tier gets 1024, mid gets 2048, etc.
Conditional reflection: No test verifying reflection is skipped on simple bouts and runs on substantive ones.

Both are behavioral changes that could regress silently. At minimum, a test for each would be valuable:

A test that mocks a cheap-tier route and asserts max_tokens in the resulting CompletionRequest is 1024.
A test with a simple 1-turn bout asserting reflect_on_bout is not called (or with a multi-tool bout asserting it is).

Summary

Area	Verdict
Tool description compression	✅ Ship it
Adaptive max_tokens	✅ Good (constants are nice-to-have)
Conditional reflection	⚠️ `iterations()` metric may not match intent — verify or adjust
Test coverage	⚠️ Add tests for both new behaviors

Solid optimization work overall. The description compression alone is a clear win. The reflection concern is the main item to address before merge.

Generated by Claude Code

…dd tests - Restore LLM routing hints in tool descriptions for app_ocr, ui_find_elements, skill_recommend, and memory_store that were over-compressed - Fix conditional reflection to use messages.len() instead of guard.iterations() which only counts tool-use rounds, not conversation turns - Extract magic max_tokens numbers into named constants - Add tests for adaptive max_tokens (Ollama, non-Ollama, explicit override) - Add tests for conditional reflection (simple bout skips, tool bout triggers) https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna

When model routing selects a tier that uses Ollama as its provider, the adaptive max_tokens logic now correctly checks the routed provider instead of the base manifest provider, preventing reasoning models from being starved of output tokens. https://claude.ai/code/session_0135FeAb3vUvn6Yw29JdiSna

chatgpt-codex-connector Bot reviewed Mar 26, 2026

View reviewed changes

humancto commented Mar 26, 2026

View reviewed changes

claude added 2 commits March 26, 2026 17:23

humancto merged commit 9489e80 into main Mar 26, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection#19

Token efficiency Phase 3: description compression, adaptive max_tokens, conditional reflection#19
humancto merged 3 commits into
mainfrom
feat/token-efficiency-phase3

humancto commented Mar 26, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Uh oh!

humancto Mar 26, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Uh oh!

humancto Mar 26, 2026

Uh oh!

humancto left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

humancto commented Mar 26, 2026

Summary

Token savings breakdown

What's left from the token efficiency plan

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

humancto Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

humancto Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

humancto left a comment

Choose a reason for hiding this comment

Review: Token Efficiency Phase 3

1. Tool Description Compression (tools.rs) ✅

2. Adaptive max_tokens (fighter_loop.rs:314-332) ✅ with one note

3. Conditional Reflection (fighter_loop.rs:453-475) ⚠️ Needs attention

4. Missing Tests ⚠️

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Tool Description Compression (`tools.rs`) ✅

2. Adaptive `max_tokens` (`fighter_loop.rs:314-332`) ✅ with one note

3. Conditional Reflection (`fighter_loop.rs:453-475`) ⚠️ Needs attention