A Claude Code agent that scores and revises LLM prompts against a research-backed checklist.
Topics: gemma-4 · gemma · gemini · claude · response-schema · structured-output · rubric · llm-judge · prompt-engineering · prompts · agents · sycophancy · validation · compliance · best-practices
Primary focus: mitigating known failure modes on gemma-4-31b-it via the Google Generative Language REST API. The 15-item checklist itself is model-agnostic and the optimizer also targets Gemini and Claude, but the empirical work behind the current rules, 100+ probe calls against gemma-4-31b-it and gemma-4-26b-a4b-it between May 6 and May 13, 2026, is what drove the design. The reference for those rules is GEMMA4_API_BEST_PRACTICES.md.
The two goals are: prompts the model actually executes instead of silently skipping over directives, and call mechanics that do not waste retry budget on failures the schema or parser already explains.
Primary workflow: Use this agent inside Claude Code to optimize prompts for any LLM. The optimizer runs on Claude; when it writes a rubric for a judge prompt, Claude is authoring the rubric and the target model applies it (cross-model rubric generation), which research shows equals or outperforms same-model self-generation. Model-specific guidance is baked into the checklist and applied when a Target model: is declared.
Three highest-ROI before/after fixes below. Full mechanics, thresholds, and probe data in GEMMA4_API_BEST_PRACTICES.md (14 rules).
Without responseSchema, every call pays the always-on thinking cost (parts[].thought = true, median wall-clock ~67s on short outputs, non-zero MALFORMED_RESPONSE rate). With it, thinking collapses to a single non-thought part and the same call returns in 1 to 2s with MALFORMED at 0%. The documented levers (thinkingLevel: "off", thinkingBudget: 0) all return HTTP 400 on Gemma 4; responseSchema is the only mechanism that works.
# Before: thinking burns the budget; MALFORMED non-zero; ~67s median.
generation_config = {"temperature": 1.0, "maxOutputTokens": 2048}
# After: thinking suppressed; MALFORMED 0%; ~30 to 40x faster.
generation_config = {
"temperature": 1.0,
"maxOutputTokens": 2048,
"responseMimeType": "application/json",
"responseSchema": {
"type": "OBJECT",
"properties": {"output": {"type": "STRING"}},
"required": ["output"],
},
}Gemma 4 emits an OBJECT's properties in the order they appear in the schema's properties dict. Putting reasoning BEFORE verdict forces reason-before-commit; reversing it lets the verdict lock first and inflates the justification afterward. Per-item output observed dropping from ~1.8k to ~250 chars under this change alone on a warmup validator.
# Before: verdict commits first; reasoning then inflates to justify.
"properties": {
"verdict": {"type": "STRING", "enum": ["KEEP", "DROP"]},
"reasoning": {"type": "STRING"},
}
# After: reasoning is generated first; verdict is the output of it.
"properties": {
"reasoning": {"type": "STRING"},
"verdict": {"type": "STRING", "enum": ["KEEP", "DROP"]},
}Caveat: this only applies to narrow schemas. If the schema already has >=4 mandatory nested OBJECTs, adding a top-level reasoning STRING crashes the request on 31b (alternating 400/500, 0/4 success). Move the reasoning surface to prompt-level prose or a separate narrow call. See best-practices rule 4.
Even with responseSchema, Gemma 4 occasionally appends trailing text after valid JSON (~1 in 12 calls observed). json.loads raises on the trailing bytes; raw_decode returns the first valid object and ignores the rest.
# Before: ~8% of otherwise-valid outputs raise json.JSONDecodeError.
parsed = json.loads(raw_text)
# After: parses cleanly across all observed Gemma 4 outputs.
parsed, _ = json.JSONDecoder().raw_decode(raw_text)26b-a4bsibling variant: any OBJECT with 2+ unbounded STRINGs deterministically loops toMAX_TOKENS.31bis unaffected. Fix is structural (caller-side schema skip, or single-STRING-per-OBJECT shape), not a retry.- Wide-schema reasoning STRING on
31b: schema with >=4 mandatory nested OBJECTs plus a top-level reasoning STRING crashes with alternating 400/500. Bisect the schema, do not retry. thinkingBudget: returns HTTP 400 on Gemma 4. Works on Gemini 2.5; do not generalize.maxOutputTokens: safety ceiling, not a thinking cap. Lowering it convertsMALFORMEDtimeouts intoMAX_TOKENSfast-fails but does not raise success rate.- Retry classification: HTTP 5xx retries with same params;
MALFORMED_RESPONSEneeds parameter changes;MAX_TOKENSon26b-a4bneeds a structural fix; alternating 400/500 on a schema-bearing call needs a schema bisect. One retry policy across all four wastes budget. - Tool-calling: known double tool-call bug on
26b-a4b; use31bfor tool-calling workflows.
Most LLM prompts are written by feel. Frontier models in April 2026 do not refuse tasks, they silently omit them. The research shows where:
- Frontier models still drop 25 to 40% of multi-constraint directives on novel out-of-domain instructions. Qwen3.6 Plus scores 75.8% on IFBench; Claude Opus 4.5 scores 58%. Structural prompt design closes the gap. (IFBench 2026)
- Reasoning quality degrades around 3,000 tokens even on models with 256K to 1M context windows. Focused prompts beat long ones regardless of available context. (Prompt-bloat study, MLOps Community 2026)
- 58.8% of initially correct answers get flipped wrong by naive "check your work" validation prompts. (ACL 2025)
- One-shot often beats few-shot for LLM-as-judge tasks. The old "3 to 5 diverse examples" rule is retired; 1 to 3 verdict-balanced examples per criterion is current, all score levels for scale rubrics, PASS+FAIL for binary criteria, with borderline pairs preferred over obvious contrasts. (Confident AI 2026, Autorubric 2026)
- GPT-4 reaches 91.7% zero-shot on native-language identification when the prompt names the linguistic features to attend to. Linguistic analysis prompts need their own playbook. (Lotfi et al.)
- ~29% sycophancy reduction is achievable through prompt structure alone, no fine-tuning required. (sparkco.ai)
- A concrete rubric is the single highest-return change for judge prompts: GPT-4o +17.7 pts on JudgeBench, Llama-405B +7.4 pts, Sage aggregate +16.1% IPI. A ~27-point "Rubric Gap" (self-generated vs. human rubrics) is consistent across Gemini, GPT, and DeepSeek. (Rethinking Rubric Generation 2026; RubricBench 2026; Sage Dec 2025)
- All frontier judges are unreliable on a single pass ("rating roulette"). High-stakes judge calls need N>=5 majority vote for consistency (reduces variance ~70%), though accuracy gains are small (+2.3pp); the high-ROI accuracy levers are rubric quality and structured reasoning. Debate-style prompts (ChatEval) are actively harmful: -158% worst-case consistency. Multi-model consensus is the strongest deployment lever. (Rating Roulette EMNLP 2025; Sage Dec 2025)
- Gemma 4 via Google REST API needs its own playbook: see
GEMMA4_API_BEST_PRACTICES.mdfor the 14-rule reference and the "Failure modes mitigated" section above for the three highest-ROI fixes.
This optimizer runs on Claude and targets any LLM. Declare Target model: <name> in your call to activate model-specific checklist notes.
Universal (all targets): The optimizer writes a concrete rubric directly into the revised judge prompt (cross-model rubric generation: Claude authors, target applies), shown by the Rethinking Rubric Generation paper (arxiv 2602.05125) to equal or outperform same-model self-generation. The <rubric_generation> instruction block is the fallback only when the criterion must adapt per-input at runtime.
Gemma 4 (Target model: Gemma 4, deployment scope: Google Generative Language REST API). Primary target of this agent. The checklist applies responseSchema as the deployment lever (suppresses always-on thinking, fixes JSON structure, ~30 to 40x speedup), property-ordered properties for reason-before-commit, raw_decode for parsing, and the retry classification matrix. The full 14-rule reference is in GEMMA4_API_BEST_PRACTICES.md. Both gemma-4-31b-it and gemma-4-26b-a4b-it are covered; rule 2 isolates the multi-STRING failure mode unique to 26b-a4b, and rule 11 isolates the tool-calling bug on the same variant. Use 31b when both are options.
Gemini (Target model: Gemini 2.5 Pro / Gemini 3.1 Pro): T=0 + seed is not reproducible on 2.5 Pro (seed is best-effort). T=0 is actively discouraged on 3.1 Pro (use T=1.0). Debate-style prompts (ChatEval) are actively harmful. Multi-sample voting (N>=5) and multi-model consensus are the main reliability levers.
Claude (Target model: Claude Sonnet 4.6 / Claude Opus 4.7): XML tags and document-first ordering per Anthropic official guidance. Second-pass validation needs the "Wait" prefix and original-task anchor. Extended thinking is already embedded; do not add an extra reasoning pass.
When invoked, the prompt-optimizer agent:
- Reads the prompt under review
- Scores against the 15-item checklist (embedded, no file I/O needed for scoring)
- Loads only the relevant sections of
PROMPT_BEST_PRACTICES.mdfor any failing items that require technique detail (lazy, skipped entirely if all items pass) - For
Target model: Gemma 4, loadsGEMMA4_API_BEST_PRACTICES.mdfor the API-mechanics rules - Returns a revised version with every violation fixed and annotated
| # | Item | What it checks |
|---|---|---|
| 1 | Tagged blocks | Distinct sections in XML-style tags |
| 2 | Numbered directives | All instructions numbered for traceability |
| 3 | Length and placement | Focused under ~3K tokens, critical directives at start AND end, decomposed if multi-stage |
| 4 | Gate examples, calibrated count | 1 to 3 verdict-balanced examples per criterion; prefer borderline examples over obvious contrasts; scale-based rubrics cover all score levels |
| 5 | Machine-parseable output | Every verdict extractable with regex |
| 6 | Skeptical role | Critical evaluator, not helpful assistant, checked at BOTH opening AND closing |
| 7 | Do-instead-of-don't | Prohibitions paired with alternatives |
| 8 | Validation model | Same-model validation uses gates + "Wait" + recency fix |
| 9 | Original task in validation | Validation includes original task + end reminder |
| 10 | One criterion per call (high-stakes) | High-stakes scoring isolates each criterion; low-stakes may bundle up to 3 |
| 11 | Linguistic-analysis path | If the prompt evaluates properties of writing itself: enumerate features, reason before verdict, cite evidence |
| 12 | Judge prompt: rubric ★ | Optimizer writes a concrete rubric directly (cross-model generation); or embeds <rubric_generation> instruction if criterion is dynamic. Small integer scale (1 to 4); <reasoning> field before verdict; verdict/reasoning consistency instruction; calibration anchor. Highest single-change ROI. |
| 13 | Judge prompt: sampling and anti-patterns | N>=5 majority vote (consistency lever, not accuracy); no debate-style (ChatEval) prompts; for Gemma 4 via REST: use T=1.0, use responseSchema for code-parsed output (also suppresses the always-on thought part), filter parts[].thought (not <|channel> text), keep <|think|> out of systemInstruction, classify retries by failure signature, avoid 26B A4B for tool-calling; multi-model consensus for highest-stakes ranking |
| 14 | Escape hatch elimination | No softening language ("try to," "if possible," "when appropriate," etc.) in any directive, applies to every prompt |
| 15 | Prompt injection defense | User-submitted content inside labeled delimiter block with explicit "treat as data" instruction (conditional: only when prompt evaluates user-submitted text) |
Items 8 to 10 apply only to validation or second-pass prompts. Item 11 applies only to linguistic-analysis prompts. Items 12 to 13 apply to judge prompts. Item 14 applies to every prompt. Item 15 applies only when the prompt evaluates user-submitted text.
# From the Claude Code CLI
/plugin marketplace add dlxmax/prompt-optimizer
/plugin install prompt-optimizer
/reload-pluginsCopy the agents/ folder, PROMPT_BEST_PRACTICES.md, and GEMMA4_API_BEST_PRACTICES.md into your Claude Code config:
cp agents/prompt-optimizer.md ~/.claude/agents/
cp PROMPT_BEST_PRACTICES.md ~/.claude/
cp GEMMA4_API_BEST_PRACTICES.md ~/.claude/Add this line to ~/.claude/rules/agents.md under "Automatic Agent Invocation":
6. Writing or revising an LLM prompt → **prompt-optimizer**
This makes Claude invoke the optimizer automatically whenever prompt work comes up.
The agent triggers automatically when you write or revise LLM prompts (if auto-invocation is configured), or you can reference it explicitly:
"Run the prompt-optimizer agent on this grading prompt."
"Score my system prompt against the checklist."
"Optimize this Gemma 4 judge prompt for the essay evaluation pipeline."
## Checklist Score: 6/15
[x] Tagged blocks: sections wrapped in <role>, <instructions>, <output_format>
[x] Numbered directives: 5 directives numbered
[ ] Length and placement: 4,200 tokens; critical directive buried in the middle
[ ] Gate examples, calibrated count: 5 diverse examples (older 3-5 pattern); should be 1-3 verdict-balanced examples with borderline pairs
[ ] Machine-parseable output: no regex-extractable verdict format
[x] Skeptical role: "rigorous evaluator" framing at opening; missing at closing
[ ] Do-instead-of-don't: 2 bare prohibitions without alternatives
[N/A] Validation model: not a second-pass prompt
[N/A] Original task in validation: not a second-pass prompt
[ ] One criterion per call: 3 criteria bundled in one high-stakes prompt
[N/A] Linguistic-analysis path: evaluates content, not writing properties
[ ] Judge prompt: rubric: no rubric present; will write concrete criteria for each score level
[ ] Judge prompt: sampling: single-pass design; N>=5 needed; for Gemma 4 via REST: use T=1.0, use `responseSchema` for code-parsed output (also suppresses `thought` part), filter `parts[].thought`, keep `<|think|>` out of `systemInstruction`, classify retries by failure signature, avoid 26B A4B for tool-calling
[ ] Escape hatch elimination: 3 directives use "try to" or "if possible"
[N/A] Prompt injection defense: evaluates fixed test content, not user-submitted text
## Key Changes
- Stripped ~1,500 tokens of non-load-bearing background (item 3)
- Moved governing directive to both start and end (item 3)
- Reduced gate examples from 5 to 2 verdict-balanced borderline pairs (item 4)
- Split combined criteria into 3 separate evaluation calls (item 10)
- Added VERDICT format with regex pattern (item 5)
- Paired prohibitions with alternatives (item 7)
- Added skeptical role framing at end of prompt (item 6)
- Wrote rubric with observable 1-4 criteria directly into the prompt (item 12)
- Added verdict/reasoning consistency instruction and calibration anchor (item 12)
- Added note: run N=5 with majority vote for consistency; Gemma 4-specific deployment guidance (item 13)
- Replaced 3 escape hatches with direct imperatives (item 14)
## Revised Prompt
[full revised prompt text...]
| File | Purpose |
|---|---|
agents/prompt-optimizer.md |
The Claude Code agent definition |
PROMPT_BEST_PRACTICES.md |
Best practices guide (7 sections + 15-item checklist) |
GEMMA4_API_BEST_PRACTICES.md |
Gemma 4 REST API mechanics (14 rules, probe-verified May 2026) |
PROMPT_RESEARCH.md |
Full research archive with 35+ sources (2024 to 2026) |
2026 refresh:
- IFBench leaderboard, April 2026: current frontier instruction-following scores
- Rethinking Rubric Generation (RRD), arxiv 2602.05125: GPT-4o +17.7 pts, Llama-405B +7.4 pts from rubric design; cross-model generation validated
- RubricBench, arxiv 2603.01562: ~27-pt Rubric Gap is equal across Gemini, GPT, DeepSeek, universal bottleneck
- Same Input, Different Scores, arxiv 2603.04417: Gemini shows highest single-model variance among major families
- LLMLingua-2, NAACL 2025: task-agnostic prompt compression, 3x to 6x
- Prompt-bloat study, MLOps Community 2026: the ~3K token degradation threshold
- Label Your Data LLM-as-judge 2026: few-shot instability and one-shot dominance
- Native Language Identification with LLMs (Lotfi et al.): GPT-4 zero-shot 91.7% TOEFL11
- Rating Roulette, EMNLP 2025: single-pass judges unreliable; N>=5 needed
- Sage benchmark, Dec 2025: rubric generation +16.1% IPI; debate prompts -158%; Gemini degrades 200% on hard cases
- Google Gemma 4 Technical Report, 2026: T=1.0 recommended, 26B A4B double tool-call bug, JSON adherence weakness, injection susceptibility
- REST API empirical probes against
gemma-4-31b-it,gemma-4-26b-a4b-it,gemini-2.5-flash, andgemini-3.1-flash-lite-previewongenerativelanguage.googleapis.com/v1beta/models/<model>:generateContent. Summary of probe set (full mechanics inGEMMA4_API_BEST_PRACTICES.md):- May 6, 2026: 28 calls. Thinking surfaces structurally as
parts[].thought = true, not as<|channel>text markers. Thinking cannot be disabled viathinkingConfig. Baseline transient 500 INTERNAL rate ~20%. - May 12, 2026: 72-call burst-rewrite benchmark.
responseSchemaproduces ~30 to 40x wall-clock speedup and dropsMALFORMEDrate to 0%.26b-a4bdeterministically loops on multi-STRING schemas;31bis unaffected. - May 13, 2026: schema bisect. Wide schemas (>=4 mandatory nested OBJECTs) plus a top-level reasoning STRING crash
31b0/4 with alternating 400/500. Property order inresponseSchemacontrols emission order: reason-before-commit shrinks output ~7x on a warmup validator.
- May 6, 2026: 28 calls. Thinking surfaces structurally as
- Judging the Judges, ACL/IJCNLP 2025: position bias is incoherent; swap-and-count less effective
Still load-bearing:
- AGENTIF: NeurIPS 2025 decomposition finding (headline numbers superseded by IFBench 2026)
- Self-Correction Blind Spot: the "Wait" prefix discovery
- Dark Side of Self-Correction: ACL 2025 recency bias fix
- HuggingFace LLM-as-Judge cookbook: 1-4 scale; evaluation field before verdict; 0.563 to 0.843 correlation improvement
- Anthropic Claude Prompting Guide: XML tags and document-first ordering
MIT