harness: interactive proxy robustness + analytics fixes #26
Conversation
Major improvements to the interactive agent runtime to reduce spurious failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools don't race the Lake build (eliminates the olean-not-found flake).
- Add an environment_error failure class with one-shot self-heal: when Lean can't find a dependency olean, auto-run `lake build` and retry before reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole, theorem_statement_mismatch, hidden_proof_import, etc.) so the model sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back through extra_body.fallback_models, and silently grow max_completion_tokens (up to 12000) on a length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 -> 4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden the hidden-import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive attempts by falling back to a stable_digest of candidate_file_contents when the trace is missing.
- Add a light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 503a0654fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)
```
Cache only successful lake builds
prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.
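The suggested fix can be sketched as follows (a minimal sketch; `prebuild_module` and the injected `build_fn` are hypothetical stand-ins for the harness's `prebuild_task_modules` and its lake invocation): a module is cached only after `lake build` reports success, so a later task in the same process retries a failed build instead of skipping it.

```python
# Hypothetical sketch: cache only successful lake builds so transient
# failures can be retried by later tasks sharing this process.
_LAKE_BUILD_CACHE = set()

def prebuild_module(module_name, build_fn):
    """build_fn runs `lake build <module>` and returns True on success."""
    if module_name in _LAKE_BUILD_CACHE:
        return True  # previously built successfully in this process
    ok = build_fn(module_name)
    if ok:
        _LAKE_BUILD_CACHE.add(module_name)  # mark as built only on success
    return ok
```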
```python
if candidate_hash not in distinct_candidate_hashes:
    distinct_candidate_hashes.add(candidate_hash)
    if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
        if candidate_text != previous_candidate:
            candidate_change_count += 1
```
Count candidate changes on every transition
In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.
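The corrected counting can be sketched like this (hypothetical helper; the real fallback digests `candidate_file_contents` when the trace is missing): every transition bumps the change count, while distinctness is tracked separately by digest.

```python
import hashlib

def candidate_analytics(candidate_texts):
    # Count every transition (including reverts such as A -> B -> A)
    # and track distinct candidates independently via a stable digest.
    change_count = 0
    distinct = set()
    previous = None
    for text in candidate_texts:
        distinct.add(hashlib.sha256(text.encode()).hexdigest())
        if previous is not None and text != previous:
            change_count += 1
        previous = text
    return change_count, len(distinct)
```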
- P1: only cache *successful* lake builds so failed builds can be retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked modules as attempted before invoking lake, which prevented the self-heal path from ever recovering from a transient build failure in batch/suite runs that share a process.
- P2: count every candidate transition in the interactive analytics fallback, including reverts (A -> B -> A now counts 2 changes, not 1). Previously the increment was gated by "hash not yet seen", which undercounted non-monotonic candidate sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bugbot (Medium): `_attempt_lake_build` returned False when the module was already cached as built, so after `prebuild_task_modules` populated the cache, every subsequent environment_error auto-heal attempt short-circuited without invoking lake and without retrying the check — while the model was told "The harness is attempting to rebuild it". The auto-heal path only fires when the runtime has observed a missing .olean at check time, which means the cached 'success' entry is stale. Always invoke lake build in this path and refresh the cache with the latest result. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Low: merge the two back-to-back `if config.mode == "interactive":` blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as `inspect_lean_goals`, so the two tools agree on what counts as a hole (previously the tightened lookbehind in HOLE_PATTERN was not mirrored in `try_tactic_at_hole`, which could let it replace a match that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 950d7e3145
Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the function returned None and `_backoff_delay` fell back to a short exponential retry instead of honouring the server-requested wait. That can repeatedly hit rate limits and prolong failures. Now parse both forms; a date in the past clamps to 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
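A hedged sketch of a parser handling both RFC 7231 `Retry-After` forms (names illustrative, not the harness's actual `_parse_retry_after`), using the stdlib HTTP-date parser; a date in the past clamps to 0:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse Retry-After as delta-seconds or an HTTP-date; None if invalid."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())  # past dates clamp to 0
```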
💡 Codex Review
Reviewed commit: 3ddda642d4
```python
if retry_after is not None:
    return min(retry_after, 60.0)
```
Respect full Retry-After delay from provider
_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.
```python
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),
```
Narrow environment-error matching to real infra failures
The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.
Codex P2 #1: narrow `environment_error` classification to real infra failures. Previously any missing .olean matched, but that string is also produced when the model imports a non-existent module. In the latter case we want the normal stagnation/temperature logic to kick in so the model can correct itself. Now only classify as environment_error when the missing module is under `Benchmark.*` (our dependency tree, which should have been pre-built) or when lean itself is missing. The generic "failed to load environment" substring no longer short-circuits.

Codex P2 #2: raise the `_backoff_delay` Retry-After ceiling from 60s to 600s. Providers routinely request several-minute waits on 429/503; the old 60s clamp caused retries to fire while the rate limit was still in force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
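The raised ceiling fits a backoff helper along these lines (a sketch, not the harness's `_backoff_delay`; the exponential base and jitter shape are assumptions): honour the server-requested wait up to the 600s cap, otherwise fall back to jittered exponential backoff.

```python
import random

def backoff_delay(attempt, retry_after=None, retry_after_cap=600.0):
    # Honour the server-requested Retry-After up to a 600s ceiling
    # (previously clamped at 60s); otherwise exponential backoff + jitter.
    if retry_after is not None:
        return min(retry_after, retry_after_cap)
    base = min(2.0 ** attempt, 30.0)
    return base + random.uniform(0.0, base)
```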
💡 Codex Review
Reviewed commit: df1004c256
```python
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")
```
Handle the 'of module' form in the missing-olean regex
Lean commonly reports missing artifacts as `object file '...olean' of module <Name> does not exist`, but `_MISSING_OLEAN_RE` only matches the shorter `'...olean' does not exist` form. Fresh evidence in this revision is that the new regex at this line requires `does not exist` immediately after the path, so `_missing_olean_module` returns None for the standard message shape; that prevents `classify_failure` from returning `environment_error` and skips the new auto-heal lake build retry path in `run_lean_check`.
Bugbot (Medium): the temperature schedule condition was checked at the top of every loop iteration, so once two consecutive same-class failures triggered it, temperature was bumped on every subsequent turn -- including pure search/write turns -- until the 0.7 cap was reached. Track the history length we have already acted on and only bump once per new failure entry.

Codex (P1): Lean reports missing artifacts in two forms depending on context:

- object file '...olean' does not exist
- object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the more common "of module" diagnostic `_missing_olean_module` returned None, `classify_failure` did not return `environment_error`, and the auto-heal retry path was skipped. Accept arbitrary text between the path and the "does not exist" tail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
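The widened regex described above can be sketched like this (a reconstruction under the stated fix, not the exact harness pattern): arbitrary text is allowed between the .olean path and the "does not exist" tail, so both diagnostic shapes match.

```python
import re

# Covers both shapes Lean emits:
#   object file '...olean' does not exist
#   object file '...olean' of module <Name> does not exist
_MISSING_OLEAN_RE = re.compile(
    r"object file ['\"]([^'\"]+\.olean)['\"]?.*does not exist"
)

def missing_olean_module(details):
    m = _MISSING_OLEAN_RE.search(details)
    return m.group(1) if m else None
```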
💡 Codex Review
Reviewed commit: 894a285f92
```python
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
    return "environment_error"
```
Verify missing Benchmark module exists before env classification
Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.
Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar` where no such module exists would still produce an "object file ...olean does not exist" diagnostic, and under the previous check that matched the `Benchmark.*` prefix it would be misclassified as infrastructure failure -- skipping stagnation/temperature correction instead of letting the model discover and fix the bogus import. Require the corresponding `.lean` source file to actually exist in the tree before treating a missing .olean as environment_error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
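The source-existence check can be sketched as below (hypothetical function name; the real guard lives in the harness's failure classifier): a missing .olean is infrastructure only if the module is ours and its `.lean` source actually exists on disk.

```python
from pathlib import Path

def is_environment_error(missing_module, root):
    # Infra only if the module is under Benchmark.* AND its .lean source
    # exists in the tree; otherwise the model imported a bogus module and
    # should be allowed to discover and fix it.
    if not missing_module.startswith("Benchmark."):
        return False
    source = Path(root) / (missing_module.replace(".", "/") + ".lean")
    return source.is_file()
```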
💡 Codex Review
Reviewed commit: 42efe75153
```python
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
    break
```
Retry alternate models on all transient HTTP statuses
send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.
Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
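Routing on a single shared set keeps the two layers in agreement; a minimal sketch (the set contents come from the commit message above, the function name is illustrative):

```python
# One source of truth for transient HTTP statuses, shared by the retry
# loop and the fallback-model gate so they can never drift apart.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback_model(status):
    # status 0 models a connection-level failure with no HTTP response.
    return status == 0 or status in RETRY_STATUS_CODES
```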
💡 Codex Review
Reviewed commit: f8d43bbfae
```python
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), 12000)
```
Cap length-retry token bumps at model-safe limit
The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap the length-retry token bump at config.max_completion_tokens. The previous hard-coded cap of 12000 could exceed the provider's per-response output-token limit (e.g. models with a 4096 hard cap), turning a recoverable truncation into a non-transient HTTP 400 hard failure.
2. (bugbot Low) Also reset token_budget when recovering from a length streak, not only when consecutive_length_stops >= 5. Previously a 1-4 stop streak followed by recovery left the elevated budget in place for the rest of the run.
3. (bugbot Low) Skip environment_error entries when appending to failure_class_history. They are infra noise that would break the sliding-window same-class check (e.g. [type_error, env_error, type_error]) even though the filter at the trigger site rejects them -- consistent with the PR's stated intent that env_error be invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 07c3880b8d
```python
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)
```
Make length-stop retries raise max_tokens
execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.
Useful? React with 👍 / 👎.
Python 3.12 surfaces socket read timeouts during SSL as bare TimeoutError, which slipped past the urllib.error.URLError handler and killed the task with a traceback. Catch it alongside URLError and apply the same backoff retry policy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
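A minimal sketch of the widened exception handling (the retry wrapper is illustrative, not the harness's actual HTTP layer): `TimeoutError` is caught alongside `urllib.error.URLError` under the same retry policy.

```python
import urllib.error

def request_with_retry(fn, attempts=3, on_retry=lambda exc: None):
    # Python 3.12 can surface SSL socket read timeouts as bare TimeoutError,
    # so it must be caught alongside URLError under the same backoff policy.
    for attempt in range(attempts):
        try:
            return fn()
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == attempts - 1:
                raise
            on_retry(exc)  # e.g. sleep for backoff_delay(attempt)
```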
Previous commit capped the silent length-retry bump at config.max_completion_tokens, which -- as bugbot and codex both pointed out -- turned the feature into a no-op because token_budget starts at exactly that value. Add a config.extra_body.length_retry_token_cap knob (stripped from the provider payload in send_chat_completion). Default stays at max_completion_tokens so models with a hard per-response cap don't get surprised by HTTP 400; interactive.json opts into a 12000 ceiling, which is the original intended behavior for gpt-class models that accept larger single-response budgets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
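The bump logic with the configurable ceiling can be sketched as (hypothetical function; in the harness the cap is read from extra_body.length_retry_token_cap and defaults to max_completion_tokens):

```python
def bump_token_budget(budget, consecutive_length_stops, cap):
    # Up to 3 silent 1.5x bumps on finish_reason == "length"; the ceiling
    # is the length_retry_token_cap knob so capped models are not surprised
    # by an over-limit request.
    if consecutive_length_stops <= 3:
        return min(int(budget * 1.5), cap)
    return budget
```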
💡 Codex Review
Reviewed commit: c5f118c37b
```python
length_retry_token_cap = int(
    config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)
```
Handle invalid length_retry_token_cap without crashing
Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.
Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated character-by-character, producing fake model ids like "g", "p", "t" during failover. Normalize a bare string to a one-element list first.
- length_retry_token_cap=null / "12k" / nested object would crash the int() cast before the first model turn, aborting the task. Fall back to max_completion_tokens silently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
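Both defensive reads can be sketched together (function names illustrative; the real parsing lives in send_chat_completion and the interactive runtime):

```python
def normalize_fallback_models(value):
    # A bare string is shorthand for a one-element list; iterating a string
    # directly would yield fake single-character model ids ("g", "p", "t").
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(m) for m in value]

def parse_length_retry_token_cap(extra_body, default):
    # extra_body is schema-free and operator-edited; fall back silently
    # instead of crashing on null / "12k" / nested-object values.
    try:
        return int(extra_body.get("length_retry_token_cap", default))
    except (TypeError, ValueError):
        return default
```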
Codex Review: Didn't find any major issues. Already looking forward to the next diff.
- `harness/agents/interactive-gpt.json`: OpenRouter openai/gpt-5.4
- `harness/agents/interactive-opus.json`: OpenRouter anthropic/claude-opus-4.6
- `scripts/run_resumable_matrix.py`: runs every (profile, task) combo and skips already-produced results, so the script can be re-run after a rate limit, crash, or interruption to continue where it left off; writes an incremental summary after every task so partial runs still leave analyzable output. Progress and summary live under `results/matrix_runs/<run_id>/`.
💡 Codex Review
Reviewed commit: e4c4cf50f4
Introduce a grind-first template generator for Benchmark/Generated/** task skeletons so new/regenerated proof templates import Benchmark.Grindset and start with `unfold <spec>; grind [ContractName.fn, storage fields]`.

Deliverables:

- scripts/generate_task_skeletons.py: parses the existing `Tasks/*.lean` templates, extracts the contract function call from the theorem signature, looks up the companion `Benchmark/Cases/.../Contract.lean` to pull every storage field declared inside `verity_contract`, and emits a rewritten skeleton with `import Benchmark.Grindset` and a grind-first proof body. Supports --preview (writes under Benchmark/GeneratedPreview/), --in-place (rewrites live files when you are sure nothing is consuming them), and --patch (unified diff on stdout). Idempotent when re-run on its own output.
- Benchmark/Grindset.lean: minimal empty stub so generator output is buildable standalone before the real `@[grind]` lemma bundle lands on grindset/s1-verity-grindset. Content-free on purpose — the s1 branch replaces it with the real bundle.
- harness/PROOF_PATTERNS.md: rewritten to lead with a "grind-first" pattern (unfold spec, grind with ContractName.fn + all storage fields, grind? for lemma discovery on stuck goals). The pre-grindset simp+by_cases recipe is preserved under a "Fallback" section.
- harness/PROMPT.md: matching agent-facing rewrite. Enumerates the grind-first strategy as steps 1-4 and demotes simp/simp_all/native_decide to the step-5 fallback.
- Benchmark/GeneratedPreview/: 88 preview task skeletons so reviewers can diff the before/after shape without touching live Benchmark/Generated/** that a running benchmark could be reading.

Assumptions about the grindset interface: `@[grind]`-tagged lemmas on getStorage, setStorage, setMapping, setMappingUint, Verity.require, Verity.bind, Bind.bind, Verity.pure, Pure.pure, Contract.run, ContractResult.snd (i.e. the operational bundle `simp` already needs today).

The generator therefore does NOT hint any of these in its `grind [...]` list — it only hints `ContractName.fn` and the storage fields declared in the companion `verity_contract` block.
…[grind]
Introduces Benchmark/Grindset/Invariants.lean which re-exports 118 domain-level
invariant lemmas with `attribute [grind …]` so the grind tactic can use them
during proof search. Complementary to S1's operational-primitives tagging.
Coverage (see Benchmark/Grindset/INVARIANTS_AUDIT.md for per-entry rationale):
- 49 × [grind =] : store/load identities, zero/self identities, sum lemmas
whose LHS captures every bound parameter.
- 48 × [grind →] : forward-only implications (monotonicity, safeAdd/Sub/Mul
bounds, wad gated identities, positivity).
- 21 × [grind] : case-local `def` unfolds (17 across 5 cases) plus 4
mulDivDown inequality lemmas whose conclusions are
`≤`/`<` rather than `=`.
Modules covered: Verity.Core.Uint256, Verity.Proofs.Stdlib.Math / ListSum /
MappingAutomation, Verity.Specs.Common (+.Sum), and case Specs.lean files for
Kleros/SortitionTrees, Lido/VaulthubLocked, PaladinVotes/StreamRecoveryClaimUsdc,
Safe/OwnerManagerReach, Zama/ERC7984ConfidentialToken.
Deliberately NOT tagged:
- Commutativity rewrites (E-match loop traps).
- `map_sum_point_update/decrease/transfer_eq` and 3 of the 5
`sumBalances_*` — their equation LHS doesn't mention every bound parameter,
so grind refuses to register them; callers should pass them manually.
- Case-local `reachable`/`acyclic`/`freshInList`/`calculateBuyReserve` etc.
(unbounded branching).
- Already-@[simp] Uint256 algebraic identities.
Also adds a minimal Benchmark/Grindset.lean stub that imports only
Benchmark.Grindset.Invariants — to be replaced/extended by sibling worker S1.
Builds clean under `lake build Benchmark.Grindset.Invariants` and
`lake build Benchmark.Grindset`.
No file under `.lake/packages/verity/**` or `Benchmark/Cases/**/{Specs,Proofs}.lean`
was modified.
…grind)

Adds Benchmark/Grindset/Reach.lean with @[grind]-tagged reach closure lemmas for two reach flavours found in (or likely useful for) the reach-heavy benchmark cases:

- Inductive, via mathlib Relation.ReflTransGen: reach_refl, reach_of_step, reach_tail, reach_head, reach_trans, reach_preserves_invariant.
- Witness-based (matching the Safe.OwnerManagerReach.reachable shape): IsChain + Reachable defs, reachable_refl / _step / _of_step (@[grind]), plus reachable_snoc / _trans / _preserves_invariant (intentionally not @[grind] — they loop the E-matcher).

Includes a verity_reach_grind tactic macro that first applies reachable_preserves_invariant / reach_preserves_invariant before falling back to plain grind, and ReachTests.lean with two independent demo proofs authored without consulting Proofs.lean. Ships the Benchmark/Grindset.lean stub and REACH_NOTES.md documenting the design, the fact that only Safe/OwnerManagerReach among the four flagged cases actually uses reach, and applicability estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updates run_slug and model for the interactive-opus agent profile from claude-opus-4.6 to claude-opus-4.7. Matches the model used by the matrix_runs/matrix-opus run (pass rate 52/77 = 67.5%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…grind) # Conflicts: # Benchmark/Grindset.lean
…ompts) # Conflicts: # Benchmark/Grindset.lean
… WIP) # Conflicts: # Benchmark/Grindset.lean
Combines: - grindset/a1-invariant-tags (118 @[grind] invariants) - grindset/a3-reach-grind-ext (Reach lemma pack + verity_reach_grind) - grindset/s1-verity-grindset + s1-fix (Attr/Monad/Core/Tests, all 3 demos pass) - grindset/s3-skeleton-gen (skeleton generator + grind-first prompts) - harness/interactive-opus-4-7-config (profile bump to opus-4.7) Not included: - grindset/s2-grind-tool (worker refused 3x; try_grind tool to be implemented in a follow-up pass) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
verity-benchmark/harness/interactive_runtime.py
Lines 452 to 455 in 332bce4
Adding hidden_proof_import_detected/hidden_case_import_detected introduces new failure_mode values, but the no-tool fallback in execute_interactive_agent_task only retries lean_check_failed, placeholder_detected, and theorem_statement_mismatch (it returns immediately for other modes). That means a plain-text model response containing a blocked import now hard-fails the task after one attempt instead of feeding back a correction prompt and continuing, which regresses interactive recovery in non-tool turns.
P1 (runtime reliability):

- interactive_runtime.py: defensive read-side guard against serving an `environment_error` result from the run_lean_check fast-path cache. The write side already skipped caching env errors; this closes the remaining loophole in case a future refactor leaves one in the cache, keeping `_attempt_lake_build` heal retries reachable on stale infra.
- run_resumable_matrix.py: treat unreadable/corrupt result artifacts as missing. `read_result()` returning None previously marked the task SKIP with `unknown` status, so a resumed matrix could silently finish with gaps that resume was meant to fill. Now we delete the corrupt file, log a rerun event, and fall through to the RUN branch.

Codex P2:

- default_agent.py: route hidden_proof_import_detected and hidden_case_import_detected through the retry loop alongside placeholder_detected / theorem_statement_mismatch. Previously these fell through to the generic `return`, aborting the task on the first hidden-import mistake even though the model can trivially recover by dropping the bad import.

Medium:

- interactive_runtime.py: invalidate `_last_eval_cache` at the top of `write_editable_proof`. Repeat writes of identical content (common during stagnation loops) would otherwise hit the cache and surface a misleading `cached: true` / "redundant run_lean_check" note on what is actually a fresh write turn.

Low:

- default_agent.py: re-export `_PREFLIGHT_FAILURE_MODES` from `interactive_runtime` instead of maintaining a second copy. An earlier refactor had already dropped `empty_response` from the duplicate; importing removes that whole class of drift bug.
- default_agent.py: never let the escalation schedule drop the temperature below `config.temperature`. With a configured base of 1.0, the old `min(0.7, ...)` cap would DECREASE to 0.7 on the first stagnation trigger, the opposite of the escalation intent.
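The temperature-floor fix can be sketched as follows (a sketch only; the step size and ceiling are hypothetical, not the harness's actual schedule): escalation may only raise temperature, never pull it below the configured base.

```python
def schedule_temperature(base, escalation_steps, step=0.15, ceiling=1.0):
    # Clamp to [base, ceiling]: escalation never decreases temperature,
    # even when the configured base already exceeds the old 0.7 cap.
    return max(base, min(base + escalation_steps * step, ceiling))
```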
💡 Codex Review
Reviewed commit: 6538119cbc
Add `Benchmark/Grindset/Arith.lean` with 13 `@[grind_norm, simp]`-tagged lemmas that help `grind` and `omega` close the three arithmetic obligations in the lido/vaulthub_locked case:
- Uint256 → Nat wrappers: mul, sub, div, add val-level reductions
- ceilDiv unfolding: `ceilDiv_val_eq` (Nat-level identity)
- ceilDiv bounds: `ceilDiv_le_numerator`, `ceilDiv_mul_ge` (sandwich), `ceilDiv_monotone`
- Spec-level: `ceildiv_sandwich_spec_holds`, `shares_conversion_monotone_spec_holds`
- 3 demo theorems closing specs without `sorry`

Zero `sorry`, zero new axioms. Builds clean on Lean 4.22.0.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds `Benchmark/Grindset/Arith.lean` (13 lemmas) covering:
- Uint256 <-> Nat wrappers (mul/sub/div/add under overflow guards)
- ceilDiv core (val, ≤-numerator, sandwich, monotonicity)
- Spec-level closers: `ceildiv_sandwich_spec`, `shares_conversion_monotone_spec`
- 3 demo theorems exercising the pack via `simp only [grind_norm]`

No new axioms, zero `sorry`, pure Nat arithmetic (omega-reachable). The umbrella `Benchmark.Grindset.lean` imports the new module.
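The ceilDiv bounds the pack encodes can be illustrated with a small Nat-level Python model. This mirrors the shape of the lemma statements (≤-numerator, sandwich upper bound, monotonicity) over a finite grid; it is an illustration, not the Lean code.

```python
def ceil_div(a: int, b: int) -> int:
    """Nat-level ceiling division: (a + b - 1) / b for b >= 1."""
    return (a + b - 1) // b

# Check the three bound shapes over a small grid of naturals.
for a in range(0, 50):
    for b in range(1, 10):
        q = ceil_div(a, b)
        assert q <= a                                # mirrors ceilDiv_le_numerator
        assert a <= q * b                            # mirrors ceilDiv_mul_ge (sandwich)
        assert ceil_div(a, b) <= ceil_div(a + 1, b)  # mirrors ceilDiv_monotone
print("ok")
```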
The self-hosted Blacksmith sticky disk for bucket `pr-26-*` has been stuck with a corrupt `.lake/packages/aesop` (missing HEAD) for the entire lifetime of this PR. Since the sticky-disk key is derived purely from the `lean-toolchain` / `lakefile.lean` / `lake-manifest.json` hashes, plain reruns keep remounting the broken state.

Add a `.github/cache-salt` file hashed into the elan / packages / build keys. Bumping the salt (currently `1`) rotates every sticky bucket, forcing a fresh checkout of all `.lake/packages/*` on the next run. The salt only affects cache-key derivation; no build inputs change.
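The key-derivation idea can be sketched in Python. The helper names and the 16-character digest length are illustrative (the real workflow does this in shell); the point is that the salt file is just one more hashed input, so bumping it rotates every key at once.

```python
import hashlib
from pathlib import Path

def short_hash(text: str) -> str:
    """Stable short digest used as a cache-key component (length illustrative)."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def packages_key(bucket: str, runner_os: str) -> str:
    # Same inputs as before, plus the new salt component: a salt bump changes
    # the final key even when toolchain and Lake manifests are untouched.
    parts = [
        short_hash(Path("lean-toolchain").read_text()),
        short_hash(Path("lakefile.lean").read_text()),
        short_hash(Path("lake-manifest.json").read_text()),
        short_hash(Path(".github/cache-salt").read_text()),  # new
    ]
    return f"lake-packages-benchmark-{bucket}-{runner_os}-" + "-".join(parts)
```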
Three review findings from PR #26:
- Cursor Medium: `countOcc_cons_ne` / `countOccU_cons_ne` were tagged `[grind =]` but the audit specifies `[grind →]` (conditional `a ≠ t →` equalities are forward-only). Move them to the forward block and update the section docstring.
- Cursor Low: `mulDivDown_nat_eq` was tagged `[grind =]` while its three siblings (`mulDivUp_nat_eq`, `wMulDown_nat_eq`, `wDivUp_nat_eq`) are `[grind →]`. The audit specifies all four as `[grind →]` (fits-within-modulus antecedent). Consolidate into the forward block.
- Codex P2: `_append_failure_class` dedupes on `(candidate_hash, failure_class)` across the whole task. When the model stalls and re-submits the same unchanged candidate across turns, legitimate cross-turn repeats were silently dropped, preventing the repeated-class temperature escalation from firing. Reset `_last_history_key` at the top of each turn so dedupe is strictly intra-turn (its actual purpose: coalescing `write_editable_proof` + `run_lean_check` on the same candidate within one model turn).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
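The intra-turn dedupe fix can be sketched with a simplified history model. Class and method names here are illustrative stand-ins for the harness internals; only the reset-per-turn behaviour is taken from the change above.

```python
class FailureHistory:
    """Simplified model of the per-task failure-class history."""

    def __init__(self):
        self.entries = []
        self._last_key = None  # dedupe key, now reset every turn

    def begin_turn(self):
        # The fix: dedupe is strictly intra-turn, so an unchanged candidate
        # re-submitted on a later turn still appends and can still trigger
        # the repeated-class temperature escalation.
        self._last_key = None

    def append(self, candidate_hash: str, failure_class: str):
        key = (candidate_hash, failure_class)
        if key == self._last_key:  # coalesce write + check within one turn
            return
        self._last_key = key
        self.entries.append(key)
```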
💡 Codex Review (reviewed commit b98f70f179).
Remove unused/legacy profiles (builtin-smart, combined-lean-tools, interactive-candidate, interactive, leanstral, openrouter-gemini-3.1-flash-lite-preview) and keep only the three interactive profiles we actually use, with larger budgets (max_attempts=32, max_tool_calls=80).
- scripts/run_benchmark_matrix.py: update TARGET_CONFIGS + CLI flags
(--interactive-{gpt,opus,smart}-repeats) accordingly.
- scripts/repeat_benchmark_compare.py: refresh usage example.
- harness/README.md: refresh bundled-profile listing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses two Codex P2 findings on scripts/run_resumable_matrix.py:
- Use $HOME/.elan/bin (falling back to expanduser) when prepending the elan toolchain to PATH, so non-root environments (local dev, CI) pick up lake/lean instead of silently missing them.
- Gate the unlink of unreadable artifacts behind "not args.dry_run": dry-run mode now just reports the would-be deletion and continues, keeping the workspace untouched so users can inspect the corrupt file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
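Both fixes can be sketched together. Function names are illustrative, not the script's actual helpers; the grounded behaviour is the `$HOME`-with-`expanduser` fallback and the dry-run gate around the unlink.

```python
import os

def elan_bin_dir() -> str:
    """Prefer $HOME when set; fall back to expanduser for shells without it."""
    home = os.environ.get("HOME") or os.path.expanduser("~")
    return os.path.join(home, ".elan", "bin")

def prepend_elan_to_path(env: dict) -> dict:
    # Put the elan toolchain first so lake/lean resolve even when the
    # environment (local dev, CI, non-root) never exported them.
    env = dict(env)
    env["PATH"] = elan_bin_dir() + os.pathsep + env.get("PATH", "")
    return env

def handle_unreadable_artifact(path: str, dry_run: bool) -> str:
    # Dry-run only reports the would-be deletion so the corrupt file can
    # still be inspected; a real run unlinks it and falls through to rerun.
    if dry_run:
        return f"would delete corrupt artifact: {path}"
    os.unlink(path)
    return f"deleted corrupt artifact: {path}"
```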
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e1544ab.
    packages_key="lake-packages-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    packages_main_key="lake-packages-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    build_key="lake-build-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
    build_main_key="lake-build-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
Cache salt missing from GitHub Actions fallback keys
Medium Severity
The new CACHE_SALT_HASH is appended to all sticky-disk cache keys (lines 35–39) but is not added to the GitHub Actions cache fallback keys at line 65 (elan-${{ runner.os }}-...) and line 170 (lake-${{ runner.os }}-...). Bumping .github/cache-salt will bust caches only for workflows using sticky disks; workflows on the non-sticky path will continue hitting stale caches, defeating the cache-busting mechanism.


Summary
Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.
Changes
Pre-build + self-heal (biggest impact)
`prebuild_task_modules()` runs `lake build` once per task up front for all `Benchmark.*` modules needed by the case, so `inspect_lean_goals` / `try_tactic_at_hole` / `run_lean_check` never race an incomplete build.
`environment_error` failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke `lake build` for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.
Tool feedback
`write_editable_proof` now returns immediate warnings for: placeholder detected, hidden `Proofs` import, unfilled `?_` hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
`HOLE_PATTERN` tightened to `(?<!\w)\?_(?!\w)` so identifiers like `?_foo` don't falsely trigger.
`HIDDEN_PROOF_IMPORT_PATTERN` now covers `import`/`open`/`export` of `Benchmark.Cases.*.Proofs` paths.
`TOOLS.md` is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.
HTTP resilience
`send_chat_completion` rewritten with retry/backoff + jitter (up to 6 attempts), honours `Retry-After`, and falls back through `extra_body.fallback_models` on persistent failure.
On `finish_reason: length`, bump `max_completion_tokens` (up to 12000, 3 silent bumps) before counting the attempt as a failure.
Defaults
`agents/interactive.json`: `max_completion_tokens` 2000 → 4096, `max_tool_calls` 24 → 40.
Analytics
`build_run_analysis` now has a fallback path for interactive attempts (which don't populate `trace`) that derives `stable_digest` from `candidate_file_contents`, so `distinct_candidate_count` / `candidate_change_count` are correct.
Schema
`schemas/agent-run.schema.json`: add the `prebuild_reports` property.
Verification
Ran the 3 quick tasks on both
`gpt` and `builtin/smart`: 6/6 pass, analytics report `distinct=1 change_count=1` (previously 0).
Test plan
`python -m py_compile` on all edited modules
`gpt` (in flight)
🤖 Generated with Claude Code
Note
Medium Risk
Medium risk because it changes CI cache key derivation (may cause cache misses/longer builds) and adds a large new proof-automation layer whose rewrite/tagging choices could impact proof stability and performance.
Overview
Adds a new
`Benchmark.Grindset` umbrella module plus supporting packs (`Attr`, `Monad`, `Core`, `Invariants`, `Arith`) to let downstream benchmark proofs discharge common monad/storage/mapping/arithmetic obligations via `grind` and related simp attributes.
Introduces a large set of
`Benchmark/GeneratedPreview/**/Tasks/*.lean` theorems that import `Benchmark.Grindset` and provide `grind`-first proof skeletons for multiple benchmark cases (e.g. SideEntrance, DepositContractMinimal, SortitionTrees, VaultHubLocked, OwnerManagerReach, ERC7984).
Updates the
`setup-lean` GitHub Action to include a `.github/cache-salt` hash in sticky-disk cache keys, enabling explicit cache invalidation without changing toolchain or Lake manifests.
Reviewed by Cursor Bugbot for commit e1544ab.