harness: interactive proxy robustness + analytics fixes #26
Conversation
Major improvements to the interactive agent runtime to reduce spurious failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools don't race the Lake build (eliminates the olean-not-found flake).
- Add an environment_error failure class with one-shot self-heal: when Lean can't find a dependency olean, auto-run `lake build` and retry before reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole, theorem_statement_mismatch, hidden_proof_import, etc.) so the model sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back through extra_body.fallback_models, and silently grow max_completion_tokens (up to 12000) on a length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 -> 4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden the hidden-import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive attempts by falling back to a stable_digest of candidate_file_contents when the trace is missing.
- Add a light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 503a0654fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)
```
Cache only successful lake builds
prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.
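The suggested fix can be sketched as follows (a minimal sketch; `prebuild_module` and the injected `build_fn` are hypothetical stand-ins for the harness's `prebuild_task_modules` and its lake invocation): a module is cached only after `lake build` reports success, so a later task in the same process retries a failed build instead of skipping it.

```python
# Hypothetical sketch: cache only successful lake builds so transient
# failures can be retried by later tasks sharing this process.
_LAKE_BUILD_CACHE = set()

def prebuild_module(module_name, build_fn):
    """build_fn runs `lake build <module>` and returns True on success."""
    if module_name in _LAKE_BUILD_CACHE:
        return True  # previously built successfully in this process
    ok = build_fn(module_name)
    if ok:
        _LAKE_BUILD_CACHE.add(module_name)  # mark as built only on success
    return ok
```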
```python
if candidate_hash not in distinct_candidate_hashes:
    distinct_candidate_hashes.add(candidate_hash)
    if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
        if candidate_text != previous_candidate:
            candidate_change_count += 1
```
Count candidate changes on every transition
In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.
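The corrected counting can be sketched like this (hypothetical helper; the real fallback digests `candidate_file_contents` when the trace is missing): every transition bumps the change count, while distinctness is tracked separately by digest.

```python
import hashlib

def candidate_analytics(candidate_texts):
    # Count every transition (including reverts such as A -> B -> A)
    # and track distinct candidates independently via a stable digest.
    change_count = 0
    distinct = set()
    previous = None
    for text in candidate_texts:
        distinct.add(hashlib.sha256(text.encode()).hexdigest())
        if previous is not None and text != previous:
            change_count += 1
        previous = text
    return change_count, len(distinct)
```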
- P1: only cache *successful* lake builds so failed builds can be retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked modules as attempted before invoking lake, which prevented the self-heal path from ever recovering from a transient build failure in batch/suite runs that share a process.
- P2: count every candidate transition in the interactive analytics fallback, including reverts (A -> B -> A now counts 2 changes, not 1). Previously the increment was gated by "hash not yet seen", which undercounted non-monotonic candidate sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bugbot (Medium): `_attempt_lake_build` returned False when the module was already cached as built, so after `prebuild_task_modules` populated the cache, every subsequent environment_error auto-heal attempt short-circuited without invoking lake and without retrying the check — while the model was told "The harness is attempting to rebuild it". The auto-heal path only fires when the runtime has observed a missing .olean at check time, which means the cached 'success' entry is stale. Always invoke lake build in this path and refresh the cache with the latest result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Low: merge the two back-to-back `if config.mode == "interactive":` blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as `inspect_lean_goals`, so the two tools agree on what counts as a hole (previously the tightened lookbehind in HOLE_PATTERN was not mirrored in `try_tactic_at_hole`, which could let it replace a match that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 950d7e3145
Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the function returned None and `_backoff_delay` fell back to a short exponential retry instead of honouring the server-requested wait. That can repeatedly hit rate limits and prolong failures. Now parse both forms; a date in the past clamps to 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
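A hedged sketch of a parser handling both RFC 7231 `Retry-After` forms (names illustrative, not the harness's actual `_parse_retry_after`), using the stdlib HTTP-date parser; a date in the past clamps to 0:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse Retry-After as delta-seconds or an HTTP-date; None if invalid."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())  # past dates clamp to 0
```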
💡 Codex Review
Reviewed commit: 3ddda642d4
```python
if retry_after is not None:
    return min(retry_after, 60.0)
```
Respect full Retry-After delay from provider
_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.
```python
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),
```
Narrow environment-error matching to real infra failures
The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.
Codex P2 #1: narrow `environment_error` classification to real infra failures. Previously any missing .olean matched, but that string is also produced when the model imports a non-existent module. In the latter case we want the normal stagnation/temperature logic to kick in so the model can correct itself. Now only classify as environment_error when the missing module is under `Benchmark.*` (our dependency tree, which should have been pre-built) or when lean itself is missing. The generic "failed to load environment" substring no longer short-circuits.

Codex P2 #2: raise the `_backoff_delay` Retry-After ceiling from 60s to 600s. Providers routinely request several-minute waits on 429/503; the old 60s clamp caused retries to fire while the rate limit was still in force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
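The raised ceiling fits a backoff helper along these lines (a sketch, not the harness's `_backoff_delay`; the exponential base and jitter shape are assumptions): honour the server-requested wait up to the 600s cap, otherwise fall back to jittered exponential backoff.

```python
import random

def backoff_delay(attempt, retry_after=None, retry_after_cap=600.0):
    # Honour the server-requested Retry-After up to a 600s ceiling
    # (previously clamped at 60s); otherwise exponential backoff + jitter.
    if retry_after is not None:
        return min(retry_after, retry_after_cap)
    base = min(2.0 ** attempt, 30.0)
    return base + random.uniform(0.0, base)
```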
💡 Codex Review
Reviewed commit: df1004c256
```python
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")
```
Handle the 'of module' form in the missing-olean regex
Lean commonly reports missing artifacts as `object file '...olean' of module <Name> does not exist`, but `_MISSING_OLEAN_RE` only matches the shorter `'...olean' does not exist` form. Fresh evidence in this revision is that the new regex at this line requires `does not exist` immediately after the path, so `_missing_olean_module` returns None for the standard message shape; that prevents `classify_failure` from returning `environment_error` and skips the new auto-heal lake build retry path in `run_lean_check`.
Bugbot (Medium): the temperature schedule condition was checked at the top of every loop iteration, so once two consecutive same-class failures triggered it, temperature was bumped on every subsequent turn -- including pure search/write turns -- until the 0.7 cap was reached. Track the history length we have already acted on and only bump once per new failure entry.

Codex (P1): Lean reports missing artifacts in two forms depending on context:

- object file '...olean' does not exist
- object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the more common "of module" diagnostic `_missing_olean_module` returned None, `classify_failure` did not return `environment_error`, and the auto-heal retry path was skipped. Accept arbitrary text between the path and the "does not exist" tail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
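The widened regex described above can be sketched like this (a reconstruction under the stated fix, not the exact harness pattern): arbitrary text is allowed between the .olean path and the "does not exist" tail, so both diagnostic shapes match.

```python
import re

# Covers both shapes Lean emits:
#   object file '...olean' does not exist
#   object file '...olean' of module <Name> does not exist
_MISSING_OLEAN_RE = re.compile(
    r"object file ['\"]([^'\"]+\.olean)['\"]?.*does not exist"
)

def missing_olean_module(details):
    m = _MISSING_OLEAN_RE.search(details)
    return m.group(1) if m else None
```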
💡 Codex Review
Reviewed commit: 894a285f92
```python
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
    return "environment_error"
```
Verify missing Benchmark module exists before env classification
Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.
Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar` where no such module exists would still produce an "object file ...olean does not exist" diagnostic, and under the previous check that matched the `Benchmark.*` prefix it would be misclassified as infrastructure failure -- skipping stagnation/temperature correction instead of letting the model discover and fix the bogus import. Require the corresponding `.lean` source file to actually exist in the tree before treating a missing .olean as environment_error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
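The source-existence check can be sketched as below (hypothetical function name; the real guard lives in the harness's failure classifier): a missing .olean is infrastructure only if the module is ours and its `.lean` source actually exists on disk.

```python
from pathlib import Path

def is_environment_error(missing_module, root):
    # Infra only if the module is under Benchmark.* AND its .lean source
    # exists in the tree; otherwise the model imported a bogus module and
    # should be allowed to discover and fix it.
    if not missing_module.startswith("Benchmark."):
        return False
    source = Path(root) / (missing_module.replace(".", "/") + ".lean")
    return source.is_file()
```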
💡 Codex Review
Reviewed commit: 42efe75153
```python
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
    break
```
Retry alternate models on all transient HTTP statuses
send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.
Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
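Routing on a single shared set keeps the two layers in agreement; a minimal sketch (the set contents come from the commit message above, the function name is illustrative):

```python
# One source of truth for transient HTTP statuses, shared by the retry
# loop and the fallback-model gate so they can never drift apart.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback_model(status):
    # status 0 models a connection-level failure with no HTTP response.
    return status == 0 or status in RETRY_STATUS_CODES
```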
💡 Codex Review
Reviewed commit: f8d43bbfae
```python
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), 12000)
```
Cap length-retry token bumps at model-safe limit
The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap the length-retry token bump at config.max_completion_tokens. The previous hard-coded cap of 12000 could exceed the provider's per-response output-token limit (e.g. models with a 4096 hard cap), turning a recoverable truncation into a non-transient HTTP 400 hard failure.
2. (bugbot Low) Also reset token_budget when recovering from a length streak, not only when consecutive_length_stops >= 5. Previously a 1-4 stop streak followed by recovery left the elevated budget in place for the rest of the run.
3. (bugbot Low) Skip environment_error entries when appending to failure_class_history. They are infra noise that would break the sliding-window same-class check (e.g. [type_error, env_error, type_error]) even though the filter at the trigger site rejects them -- consistent with the PR's stated intent that env_error be invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 07c3880b8d
```python
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)
```
Make length-stop retries raise max_tokens
execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.
Useful? React with 👍 / 👎.
Python 3.12 surfaces socket read timeouts during SSL as bare TimeoutError, which slipped past the urllib.error.URLError handler and killed the task with a traceback. Catch it alongside URLError and apply the same backoff retry policy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
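A minimal sketch of the widened exception handling (the retry wrapper is illustrative, not the harness's actual HTTP layer): `TimeoutError` is caught alongside `urllib.error.URLError` under the same retry policy.

```python
import urllib.error

def request_with_retry(fn, attempts=3, on_retry=lambda exc: None):
    # Python 3.12 can surface SSL socket read timeouts as bare TimeoutError,
    # so it must be caught alongside URLError under the same backoff policy.
    for attempt in range(attempts):
        try:
            return fn()
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == attempts - 1:
                raise
            on_retry(exc)  # e.g. sleep for backoff_delay(attempt)
```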
Previous commit capped the silent length-retry bump at config.max_completion_tokens, which -- as bugbot and codex both pointed out -- turned the feature into a no-op because token_budget starts at exactly that value. Add a config.extra_body.length_retry_token_cap knob (stripped from the provider payload in send_chat_completion). Default stays at max_completion_tokens so models with a hard per-response cap don't get surprised by HTTP 400; interactive.json opts into a 12000 ceiling, which is the original intended behavior for gpt-class models that accept larger single-response budgets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
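The bump logic with the configurable ceiling can be sketched as (hypothetical function; in the harness the cap is read from extra_body.length_retry_token_cap and defaults to max_completion_tokens):

```python
def bump_token_budget(budget, consecutive_length_stops, cap):
    # Up to 3 silent 1.5x bumps on finish_reason == "length"; the ceiling
    # is the length_retry_token_cap knob so capped models are not surprised
    # by an over-limit request.
    if consecutive_length_stops <= 3:
        return min(int(budget * 1.5), cap)
    return budget
```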
💡 Codex Review
Reviewed commit: c5f118c37b
```python
length_retry_token_cap = int(
    config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)
```
Handle invalid length_retry_token_cap without crashing
Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.
Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated character-by-character, producing fake model ids like "g", "p", "t" during failover. Normalize a bare string to a one-element list first.
- length_retry_token_cap=null / "12k" / nested object would crash the int() cast before the first model turn, aborting the task. Fall back to max_completion_tokens silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
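Both defensive reads can be sketched together (function names illustrative; the real parsing lives in send_chat_completion and the interactive runtime):

```python
def normalize_fallback_models(value):
    # A bare string is shorthand for a one-element list; iterating a string
    # directly would yield fake single-character model ids ("g", "p", "t").
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(m) for m in value]

def parse_length_retry_token_cap(extra_body, default):
    # extra_body is schema-free and operator-edited; fall back silently
    # instead of crashing on null / "12k" / nested-object values.
    try:
        return int(extra_body.get("length_retry_token_cap", default))
    except (TypeError, ValueError):
        return default
```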
Codex Review: Didn't find any major issues. Already looking forward to the next diff.
- `harness/agents/interactive-gpt.json`: OpenRouter openai/gpt-5.4
- `harness/agents/interactive-opus.json`: OpenRouter anthropic/claude-opus-4.6
- `scripts/run_resumable_matrix.py`: runs every (profile, task) combo and skips already-produced results, so the script can be re-run after a rate limit, crash, or interruption to continue where it left off; writes an incremental summary after every task so partial runs still leave analyzable output. Progress and summary live under `results/matrix_runs/<run_id>/`.
💡 Codex Review
Reviewed commit: e4c4cf50f4
Introduce a grind-first template generator for Benchmark/Generated/** task skeletons so new/regenerated proof templates import Benchmark.Grindset and start with `unfold <spec>; grind [ContractName.fn, storage fields]`.

Deliverables:

- scripts/generate_task_skeletons.py: parses the existing `Tasks/*.lean` templates, extracts the contract function call from the theorem signature, looks up the companion `Benchmark/Cases/.../Contract.lean` to pull every storage field declared inside `verity_contract`, and emits a rewritten skeleton with `import Benchmark.Grindset` and a grind-first proof body. Supports --preview (writes under Benchmark/GeneratedPreview/), --in-place (rewrites live files when you are sure nothing is consuming them), and --patch (unified diff on stdout). Idempotent when re-run on its own output.
- Benchmark/Grindset.lean: minimal empty stub so generator output is buildable standalone before the real `@[grind]` lemma bundle lands on grindset/s1-verity-grindset. Content-free on purpose — the s1 branch replaces it with the real bundle.
- harness/PROOF_PATTERNS.md: rewritten to lead with a "grind-first" pattern (unfold spec, grind with ContractName.fn + all storage fields, grind? for lemma discovery on stuck goals). The pre-grindset simp+by_cases recipe is preserved under a "Fallback" section.
- harness/PROMPT.md: matching agent-facing rewrite. Enumerates the grind-first strategy as steps 1-4 and demotes simp/simp_all/native_decide to the step-5 fallback.
- Benchmark/GeneratedPreview/: 88 preview task skeletons so reviewers can diff the before/after shape without touching live Benchmark/Generated/** that a running benchmark could be reading.

Assumptions about the grindset interface: `@[grind]`-tagged lemmas on getStorage, setStorage, setMapping, setMappingUint, Verity.require, Verity.bind, Bind.bind, Verity.pure, Pure.pure, Contract.run, ContractResult.snd (i.e. the operational bundle `simp` already needs today).

The generator therefore does NOT hint any of these in its `grind [...]` list — it only hints `ContractName.fn` and the storage fields declared in the companion `verity_contract` block.
…[grind]
Introduces Benchmark/Grindset/Invariants.lean which re-exports 118 domain-level
invariant lemmas with `attribute [grind …]` so the grind tactic can use them
during proof search. Complementary to S1's operational-primitives tagging.
Coverage (see Benchmark/Grindset/INVARIANTS_AUDIT.md for per-entry rationale):
- 49 × [grind =] : store/load identities, zero/self identities, sum lemmas
whose LHS captures every bound parameter.
- 48 × [grind →] : forward-only implications (monotonicity, safeAdd/Sub/Mul
bounds, wad gated identities, positivity).
- 21 × [grind] : case-local `def` unfolds (17 across 5 cases) plus 4
mulDivDown inequality lemmas whose conclusions are
`≤`/`<` rather than `=`.
Modules covered: Verity.Core.Uint256, Verity.Proofs.Stdlib.Math / ListSum /
MappingAutomation, Verity.Specs.Common (+.Sum), and case Specs.lean files for
Kleros/SortitionTrees, Lido/VaulthubLocked, PaladinVotes/StreamRecoveryClaimUsdc,
Safe/OwnerManagerReach, Zama/ERC7984ConfidentialToken.
Deliberately NOT tagged:
- Commutativity rewrites (E-match loop traps).
- `map_sum_point_update/decrease/transfer_eq` and 3 of the 5
`sumBalances_*` — their equation LHS doesn't mention every bound parameter,
so grind refuses to register them; callers should pass them manually.
- Case-local `reachable`/`acyclic`/`freshInList`/`calculateBuyReserve` etc.
(unbounded branching).
- Already-@[simp] Uint256 algebraic identities.
Also adds a minimal Benchmark/Grindset.lean stub that imports only
Benchmark.Grindset.Invariants — to be replaced/extended by sibling worker S1.
Builds clean under `lake build Benchmark.Grindset.Invariants` and
`lake build Benchmark.Grindset`.
No file under `.lake/packages/verity/**` or `Benchmark/Cases/**/{Specs,Proofs}.lean`
was modified.
…grind)

Adds Benchmark/Grindset/Reach.lean with @[grind]-tagged reach closure lemmas for two reach flavours found in (or likely useful for) the reach-heavy benchmark cases:

- Inductive, via mathlib Relation.ReflTransGen: reach_refl, reach_of_step, reach_tail, reach_head, reach_trans, reach_preserves_invariant.
- Witness-based (matching the Safe.OwnerManagerReach.reachable shape): IsChain + Reachable defs, reachable_refl / _step / _of_step (@[grind]), plus reachable_snoc / _trans / _preserves_invariant (intentionally not @[grind] — they loop the E-matcher).

Includes a verity_reach_grind tactic macro that first applies reachable_preserves_invariant / reach_preserves_invariant before falling back to plain grind, and ReachTests.lean with two independent demo proofs authored without consulting Proofs.lean. Ships the Benchmark/Grindset.lean stub and REACH_NOTES.md documenting the design, the fact that only Safe/OwnerManagerReach among the four flagged cases actually uses reach, and applicability estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updates run_slug and model for the interactive-opus agent profile from claude-opus-4.6 to claude-opus-4.7. Matches the model used by the matrix_runs/matrix-opus run (pass rate 52/77 = 67.5%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…grind) # Conflicts: # Benchmark/Grindset.lean
…ompts) # Conflicts: # Benchmark/Grindset.lean
… WIP) # Conflicts: # Benchmark/Grindset.lean
Combines: - grindset/a1-invariant-tags (118 @[grind] invariants) - grindset/a3-reach-grind-ext (Reach lemma pack + verity_reach_grind) - grindset/s1-verity-grindset + s1-fix (Attr/Monad/Core/Tests, all 3 demos pass) - grindset/s3-skeleton-gen (skeleton generator + grind-first prompts) - harness/interactive-opus-4-7-config (profile bump to opus-4.7) Not included: - grindset/s2-grind-tool (worker refused 3x; try_grind tool to be implemented in a follow-up pass) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
verity-benchmark/harness/interactive_runtime.py
Lines 452 to 455 in 332bce4
Adding hidden_proof_import_detected/hidden_case_import_detected introduces new failure_mode values, but the no-tool fallback in execute_interactive_agent_task only retries lean_check_failed, placeholder_detected, and theorem_statement_mismatch (it returns immediately for other modes). That means a plain-text model response containing a blocked import now hard-fails the task after one attempt instead of feeding back a correction prompt and continuing, which regresses interactive recovery in non-tool turns.
P1 (runtime reliability):

- interactive_runtime.py: defensive read-side guard against serving an `environment_error` result from the run_lean_check fast-path cache. The write side already skipped caching env errors; this closes the remaining loophole in case a future refactor leaves one in the cache, keeping `_attempt_lake_build` heal retries reachable on stale infra.
- run_resumable_matrix.py: treat unreadable/corrupt result artifacts as missing. `read_result()` returning None previously marked the task SKIP with `unknown` status, so a resumed matrix could silently finish with gaps that resume was meant to fill. Now we delete the corrupt file, log a rerun event, and fall through to the RUN branch.

Codex P2:

- default_agent.py: route hidden_proof_import_detected and hidden_case_import_detected through the retry loop alongside placeholder_detected / theorem_statement_mismatch. Previously these fell through to the generic `return`, aborting the task on the first hidden-import mistake even though the model can trivially recover by dropping the bad import.

Medium:

- interactive_runtime.py: invalidate `_last_eval_cache` at the top of `write_editable_proof`. Repeat writes of identical content (common during stagnation loops) would otherwise hit the cache and surface a misleading `cached: true` / "redundant run_lean_check" note on what is actually a fresh write turn.

Low:

- default_agent.py: re-export `_PREFLIGHT_FAILURE_MODES` from `interactive_runtime` instead of maintaining a second copy. An earlier refactor had already dropped `empty_response` from the duplicate; importing removes that whole class of drift bug.
- default_agent.py: never let the escalation schedule drop the temperature below `config.temperature`. With a configured base of 1.0, the old `min(0.7, ...)` cap would DECREASE to 0.7 on the first stagnation trigger, the opposite of the escalation intent.
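The temperature-floor fix can be sketched as follows (a sketch only; the step size and ceiling are hypothetical, not the harness's actual schedule): escalation may only raise temperature, never pull it below the configured base.

```python
def schedule_temperature(base, escalation_steps, step=0.15, ceiling=1.0):
    # Clamp to [base, ceiling]: escalation never decreases temperature,
    # even when the configured base already exceeds the old 0.7 cap.
    return max(base, min(base + escalation_steps * step, ceiling))
```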
💡 Codex Review
Reviewed commit: 6538119cbc
Add `Benchmark/Grindset/Arith.lean` with 13 `@[grind_norm, simp]`-tagged lemmas that help `grind` and `omega` close the three arithmetic obligations in the lido/vaulthub_locked case:
- Uint256 → Nat wrappers: mul, sub, div, add val-level reductions
- ceilDiv unfolding: `ceilDiv_val_eq` (Nat-level identity)
- ceilDiv bounds: `ceilDiv_le_numerator`, `ceilDiv_mul_ge` (sandwich), `ceilDiv_monotone`
- Spec-level: `ceildiv_sandwich_spec_holds`, `shares_conversion_monotone_spec_holds`
- 3 demo theorems closing specs without `sorry`

Zero `sorry`, zero new axioms. Builds clean on Lean 4.22.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds `Benchmark/Grindset/Arith.lean` (13 lemmas) covering:
- Uint256 <-> Nat wrappers (mul/sub/div/add under overflow guards)
- ceilDiv core (val, ≤-numerator, sandwich, monotonicity)
- Spec-level closers: `ceildiv_sandwich_spec`, `shares_conversion_monotone_spec`
- 3 demo theorems exercising the pack via `simp only [grind_norm]`

No new axioms, zero `sorry`, pure Nat arithmetic (omega-reachable). The umbrella `Benchmark.Grindset.lean` imports the new module.
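The ceilDiv bounds the pack encodes can be illustrated with a small Nat-level Python model. This mirrors the shape of the lemma statements (≤-numerator, sandwich upper bound, monotonicity) over a finite grid; it is an illustration, not the Lean code.

```python
def ceil_div(a: int, b: int) -> int:
    """Nat-level ceiling division: (a + b - 1) / b for b >= 1."""
    return (a + b - 1) // b

# Check the three bound shapes over a small grid of naturals.
for a in range(0, 50):
    for b in range(1, 10):
        q = ceil_div(a, b)
        assert q <= a                                # mirrors ceilDiv_le_numerator
        assert a <= q * b                            # mirrors ceilDiv_mul_ge (sandwich)
        assert ceil_div(a, b) <= ceil_div(a + 1, b)  # mirrors ceilDiv_monotone
print("ok")
```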
The self-hosted Blacksmith sticky disk for bucket `pr-26-*` has been stuck with a corrupt `.lake/packages/aesop` (missing HEAD) for the entire lifetime of this PR. Since the sticky-disk key is derived purely from the `lean-toolchain` / `lakefile.lean` / `lake-manifest.json` hashes, plain reruns keep remounting the broken state.

Add a `.github/cache-salt` file hashed into the elan / packages / build keys. Bumping the salt (currently `1`) rotates every sticky bucket, forcing a fresh checkout of all `.lake/packages/*` on the next run. The salt only affects cache-key derivation; no build inputs change.
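The key-derivation idea can be sketched in Python. The helper names and the 16-character digest length are illustrative (the real workflow does this in shell); the point is that the salt file is just one more hashed input, so bumping it rotates every key at once.

```python
import hashlib
from pathlib import Path

def short_hash(text: str) -> str:
    """Stable short digest used as a cache-key component (length illustrative)."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def packages_key(bucket: str, runner_os: str) -> str:
    # Same inputs as before, plus the new salt component: a salt bump changes
    # the final key even when toolchain and Lake manifests are untouched.
    parts = [
        short_hash(Path("lean-toolchain").read_text()),
        short_hash(Path("lakefile.lean").read_text()),
        short_hash(Path("lake-manifest.json").read_text()),
        short_hash(Path(".github/cache-salt").read_text()),  # new
    ]
    return f"lake-packages-benchmark-{bucket}-{runner_os}-" + "-".join(parts)
```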
Three review findings from PR #26:
- Cursor Medium: `countOcc_cons_ne` / `countOccU_cons_ne` were tagged `[grind =]` but the audit specifies `[grind →]` (conditional `a ≠ t →` equalities are forward-only). Move them to the forward block and update the section docstring.
- Cursor Low: `mulDivDown_nat_eq` was tagged `[grind =]` while its three siblings (`mulDivUp_nat_eq`, `wMulDown_nat_eq`, `wDivUp_nat_eq`) are `[grind →]`. The audit specifies all four as `[grind →]` (fits-within-modulus antecedent). Consolidate into the forward block.
- Codex P2: `_append_failure_class` dedupes on `(candidate_hash, failure_class)` across the whole task. When the model stalls and re-submits the same unchanged candidate across turns, legitimate cross-turn repeats were silently dropped, preventing the repeated-class temperature escalation from firing. Reset `_last_history_key` at the top of each turn so dedupe is strictly intra-turn (its actual purpose: coalescing `write_editable_proof` + `run_lean_check` on the same candidate within one model turn).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
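The intra-turn dedupe fix can be sketched with a simplified history model. Class and method names here are illustrative stand-ins for the harness internals; only the reset-per-turn behaviour is taken from the change above.

```python
class FailureHistory:
    """Simplified model of the per-task failure-class history."""

    def __init__(self):
        self.entries = []
        self._last_key = None  # dedupe key, now reset every turn

    def begin_turn(self):
        # The fix: dedupe is strictly intra-turn, so an unchanged candidate
        # re-submitted on a later turn still appends and can still trigger
        # the repeated-class temperature escalation.
        self._last_key = None

    def append(self, candidate_hash: str, failure_class: str):
        key = (candidate_hash, failure_class)
        if key == self._last_key:  # coalesce write + check within one turn
            return
        self._last_key = key
        self.entries.append(key)
```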
💡 Codex Review (reviewed commit b98f70f179).
Remove unused/legacy profiles (builtin-smart, combined-lean-tools, interactive-candidate, interactive, leanstral, openrouter-gemini-3.1-flash-lite-preview) and keep only the three interactive profiles we actually use, with larger budgets (max_attempts=32, max_tool_calls=80).
- scripts/run_benchmark_matrix.py: update TARGET_CONFIGS + CLI flags
(--interactive-{gpt,opus,smart}-repeats) accordingly.
- scripts/repeat_benchmark_compare.py: refresh usage example.
- harness/README.md: refresh bundled-profile listing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses two Codex P2 findings on scripts/run_resumable_matrix.py:
- Use $HOME/.elan/bin (falling back to expanduser) when prepending the elan toolchain to PATH, so non-root environments (local dev, CI) pick up lake/lean instead of silently missing them.
- Gate the unlink of unreadable artifacts behind "not args.dry_run": dry-run mode now just reports the would-be deletion and continues, keeping the workspace untouched so users can inspect the corrupt file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
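Both fixes can be sketched together. Function names are illustrative, not the script's actual helpers; the grounded behaviour is the `$HOME`-with-`expanduser` fallback and the dry-run gate around the unlink.

```python
import os

def elan_bin_dir() -> str:
    """Prefer $HOME when set; fall back to expanduser for shells without it."""
    home = os.environ.get("HOME") or os.path.expanduser("~")
    return os.path.join(home, ".elan", "bin")

def prepend_elan_to_path(env: dict) -> dict:
    # Put the elan toolchain first so lake/lean resolve even when the
    # environment (local dev, CI, non-root) never exported them.
    env = dict(env)
    env["PATH"] = elan_bin_dir() + os.pathsep + env.get("PATH", "")
    return env

def handle_unreadable_artifact(path: str, dry_run: bool) -> str:
    # Dry-run only reports the would-be deletion so the corrupt file can
    # still be inspected; a real run unlinks it and falls through to rerun.
    if dry_run:
        return f"would delete corrupt artifact: {path}"
    os.unlink(path)
    return f"deleted corrupt artifact: {path}"
```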
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e1544ab.
    packages_key="lake-packages-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    packages_main_key="lake-packages-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    build_key="lake-build-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    build_main_key="lake-build-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
Cache salt missing from GitHub Actions fallback keys
Medium Severity
The new CACHE_SALT_HASH is appended to all sticky-disk cache keys (lines 35–39) but is not added to the GitHub Actions cache fallback keys at line 65 (elan-${{ runner.os }}-...) and line 170 (lake-${{ runner.os }}-...). Bumping .github/cache-salt will bust caches only for workflows using sticky disks; workflows on the non-sticky path will continue hitting stale caches, defeating the cache-busting mechanism.


Summary
Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.
Changes
Pre-build + self-heal (biggest impact)
`prebuild_task_modules()` runs `lake build` once per task up front for all `Benchmark.*` modules needed by the case, so `inspect_lean_goals` / `try_tactic_at_hole` / `run_lean_check` never race an incomplete build.
`environment_error` failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke `lake build` for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.
Tool feedback
`write_editable_proof` now returns immediate warnings for: placeholder detected, hidden `Proofs` import, unfilled `?_` hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
`HOLE_PATTERN` tightened to `(?<!\w)\?_(?!\w)` so identifiers like `?_foo` don't falsely trigger.
`HIDDEN_PROOF_IMPORT_PATTERN` now covers `import`/`open`/`export` of `Benchmark.Cases.*.Proofs` paths.
`TOOLS.md` is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.
HTTP resilience
`send_chat_completion` rewritten with retry/backoff + jitter (up to 6 attempts), honours `Retry-After`, and falls back through `extra_body.fallback_models` on persistent failure.
On `finish_reason: length`, bump `max_completion_tokens` (up to 12000, 3 silent bumps) before counting the attempt as a failure.
Defaults
`agents/interactive.json`: `max_completion_tokens` 2000 → 4096, `max_tool_calls` 24 → 40.
Analytics
`build_run_analysis` now has a fallback path for interactive attempts (which don't populate `trace`) that derives `stable_digest` from `candidate_file_contents`, so `distinct_candidate_count` / `candidate_change_count` are correct.
Schema
`schemas/agent-run.schema.json`: add the `prebuild_reports` property.
Verification
Ran the 3 quick tasks on both
`gpt` and `builtin/smart`: 6/6 pass, analytics report `distinct=1 change_count=1` (previously 0).
Test plan
`python -m py_compile` on all edited modules
`gpt` (in flight)
🤖 Generated with Claude Code
Note
Medium Risk
Medium risk because it changes CI cache key derivation (may cause cache misses/longer builds) and adds a large new proof-automation layer whose rewrite/tagging choices could impact proof stability and performance.
Overview
Adds a new
`Benchmark.Grindset` umbrella module plus supporting packs (`Attr`, `Monad`, `Core`, `Invariants`, `Arith`) to let downstream benchmark proofs discharge common monad/storage/mapping/arithmetic obligations via `grind` and related simp attributes.
Introduces a large set of
`Benchmark/GeneratedPreview/**/Tasks/*.lean` theorems that import `Benchmark.Grindset` and provide `grind`-first proof skeletons for multiple benchmark cases (e.g. SideEntrance, DepositContractMinimal, SortitionTrees, VaultHubLocked, OwnerManagerReach, ERC7984).
Updates the
`setup-lean` GitHub Action to include a `.github/cache-salt` hash in sticky-disk cache keys, enabling explicit cache invalidation without changing toolchain or Lake manifests.
Reviewed by Cursor Bugbot for commit e1544ab.