
harness: interactive proxy robustness + analytics fixes #26

Open
Th0rgal wants to merge 99 commits into main from harness/interactive-robustness-fixes

Conversation

@Th0rgal (Member) commented Apr 22, 2026

Summary

Robustness and analytics fixes for the interactive benchmark agent runtime. Addresses spurious failures caused by Lake build races, transient 429/5xx responses, truncated responses, and missing or misleading tool feedback. Also corrects stats that were always reporting 0 distinct candidates.

Changes

Pre-build + self-heal (biggest impact)

  • prebuild_task_modules() runs lake build once per task up front for all Benchmark.* modules needed by the case, so inspect_lean_goals / try_tactic_at_hole / run_lean_check never race an incomplete build.
  • New environment_error failure class: detect "object file ...olean does not exist" / "failed to load environment", auto-invoke lake build for the missing module, and retry the check once before surfacing to the model. Stagnation tracking and temperature schedules ignore this class.
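The pre-build and cache flow can be sketched as follows. This is a hypothetical reconstruction: `prebuild_task_modules`, `_LAKE_BUILD_CACHE`, and the injected `run_lake_build` callable echo names from this PR, but the signature and report shape are illustrative, not the harness's real API. Only successful builds enter the cache, matching the review fix applied later in this thread:

```python
# Illustrative sketch of prebuild_task_modules with success-only caching.
_LAKE_BUILD_CACHE: set[str] = set()

def prebuild_task_modules(modules, run_lake_build):
    """Build each Benchmark.* module once up front; cache only successes."""
    reports = []
    for module in modules:
        if module in _LAKE_BUILD_CACHE:
            reports.append((module, "cached"))
            continue
        code, output = run_lake_build(module)
        if code == 0:
            # Only a successful build is cached, so a transient failure
            # can be retried by a later call in the same process.
            _LAKE_BUILD_CACHE.add(module)
            reports.append((module, "built"))
        else:
            reports.append((module, "failed"))
    return reports
```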

Tool feedback

  • write_editable_proof now returns immediate warnings for: placeholder detected, hidden Proofs import, unfilled ?_ hole, theorem-statement mismatch, empty content. The model sees these before spending build budget.
  • HOLE_PATTERN tightened to (?<!\w)\?_(?!\w) so identifiers like ?_foo don't falsely trigger.
  • HIDDEN_PROOF_IMPORT_PATTERN now covers import/open/export of Benchmark.Cases.*.Proofs paths.
  • Interactive-mode TOOLS.md is synthesized from the real tool specs (6 tools) rather than relying on the static shell-oriented prompt.
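The tightened hole pattern can be exercised directly. The regex itself is quoted from the bullet above; the surrounding snippets are illustrative:

```python
import re

# `?_` counts as a hole only when not embedded in an identifier:
# the lookbehind/lookahead reject word characters on either side.
HOLE_PATTERN = re.compile(r"(?<!\w)\?_(?!\w)")

assert HOLE_PATTERN.search("exact ?_")         # a real hole
assert HOLE_PATTERN.search("(?_ : Nat)")       # hole inside parens
assert not HOLE_PATTERN.search("exact ?_foo")  # named metavariable, not a hole
```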

HTTP resilience

  • send_chat_completion rewritten with retry/backoff + jitter (up to 6 attempts), honours Retry-After, and falls back through extra_body.fallback_models on persistent failure.
  • Silent length retry: on finish_reason: length, bump max_completion_tokens (up to 12000, 3 silent bumps) before counting the attempt as a failure.
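A minimal sketch of the backoff logic described above, with hypothetical names and constants (the 600-second Retry-After ceiling is the value this PR's review cycle converges on, raised from an initial 60-second clamp):

```python
import random

RETRY_AFTER_CEILING = 600.0  # generous ceiling for server-requested waits

def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Exponential backoff with full jitter; a server-provided
    Retry-After takes precedence, clamped only by the ceiling."""
    if retry_after is not None:
        return min(retry_after, RETRY_AFTER_CEILING)
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```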

Defaults

  • agents/interactive.json: max_completion_tokens 2000 → 4096, max_tool_calls 24 → 40.

Analytics

  • build_run_analysis now has a fallback path for interactive attempts (which don't populate trace) that derives stable_digest from candidate_file_contents, so distinct_candidate_count / candidate_change_count are correct.
  • Light temperature schedule on repeated non-environment failure classes.

Schema

  • schemas/agent-run.schema.json: add prebuild_reports property.

Verification

Ran the 3 quick tasks on both gpt and builtin/smart: 6/6 pass, analytics report distinct=1 change_count=1 (previously 0).

| task | gpt tool_calls | smart tool_calls |
| --- | --- | --- |
| ethereum/deposit_contract_minimal/deposit_count | 2 | 4 |
| damn_vulnerable_defi/side_entrance/deposit_sets_sender_credit | 2 | 8 |
| damn_vulnerable_defi/side_entrance/deposit_sets_pool_balance | 2 | 2 |

Test plan

  • Syntax check (python -m py_compile) on all edited modules
  • 3 quick tasks × 2 models all pass with correct analytics
  • Full benchmark sweep with gpt (in flight)
  • Address any bugbot / codex findings on this PR

🤖 Generated with Claude Code


Note

Medium Risk
Medium risk because it changes CI cache key derivation (may cause cache misses/longer builds) and adds a large new proof-automation layer whose rewrite/tagging choices could impact proof stability and performance.

Overview
Adds a new Benchmark.Grindset umbrella module plus supporting packs (Attr, Monad, Core, Invariants, Arith) to let downstream benchmark proofs discharge common monad/storage/mapping/arithmetic obligations via grind and related simp attributes.

Introduces a large set of Benchmark/GeneratedPreview/**/Tasks/*.lean theorems that import Benchmark.Grindset and provide grind-first proof skeletons for multiple benchmark cases (e.g. SideEntrance, DepositContractMinimal, SortitionTrees, VaultHubLocked, OwnerManagerReach, ERC7984).

Updates the setup-lean GitHub Action to include a .github/cache-salt hash in sticky-disk cache keys, enabling explicit cache invalidation without changing toolchain or Lake manifests.

Reviewed by Cursor Bugbot for commit e1544ab.

Major improvements to the interactive agent runtime to reduce spurious
failures and improve tool feedback quality:

- Pre-build Benchmark case modules once per task so the interactive tools
  don't race the Lake build (eliminates olean-not-found flake).
- Add environment_error failure class with one-shot self-heal: when Lean
  can't find a dependency olean, auto-run lake build and retry before
  reporting to the model. Stagnation tracking skips these classes.
- write_editable_proof now returns immediate warnings (placeholder, hole,
  theorem_statement_mismatch, hidden_proof_import, etc.) so the model
  sees actionable feedback before spending a build budget.
- Synthesize TOOLS.md for interactive mode from the real tool specs
  instead of relying on static prompts that reference shell commands.
- HTTP layer: retry/backoff with jitter, honour Retry-After, fall back
  through extra_body.fallback_models, and silently grow
  max_completion_tokens (up to 12000) on length finish_reason.
- Raise max_tool_calls (24 -> 40) and max_completion_tokens (2000 ->
  4096) for the default interactive profile.
- Tighten HOLE_PATTERN (avoid matching `?_foo`) and broaden hidden
  import regex to cover `open`/`export` of proof modules.
- Fix candidate_change_count / distinct_candidate_count for interactive
  attempts by falling back to stable_digest of candidate_file_contents
  when trace is missing.
- Add light temperature schedule on repeated failure_class history.

Schema: extend agent-run.schema.json with prebuild_reports.

Verified: 3 quick tasks pass cleanly on both gpt and builtin/smart with
correct analytics (distinct=1, change_count=1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 503a0654fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread harness/interactive_runtime.py Outdated
Comment on lines +645 to +646
_LAKE_BUILD_CACHE.add(module_name)
code, output = lean_run_command(["lake", "build", module_name], cwd=ROOT)

P1: Cache only successful lake builds

prebuild_task_modules adds each module to _LAKE_BUILD_CACHE before running lake build, so a failed build is still marked as cached. In agent_runner.run_many (case/suite runs in one process), that causes later tasks sharing the same module to skip rebuilds and keep reporting cached, which prevents recovery from transient build failures and undermines the new self-healing behavior.


Comment thread harness/default_agent.py Outdated
Comment on lines +1016 to +1020
if candidate_hash not in distinct_candidate_hashes:
    distinct_candidate_hashes.add(candidate_hash)
    if not isinstance(trace, dict) or not trace.get("candidate_sha256"):
        if candidate_text != previous_candidate:
            candidate_change_count += 1

P2: Count candidate changes on every transition

In the interactive fallback path (attempts without trace), candidate_change_count is incremented only when the hash has not been seen before. That undercounts real transitions when the model revisits an earlier candidate (e.g., A -> B -> A should count two changes after the first attempt, but the second A transition is dropped), so analytics remain inaccurate for non-monotonic candidate sequences.


- P1: only cache *successful* lake builds so failed builds can be
  retried on subsequent calls. Previously `_LAKE_BUILD_CACHE` marked
  modules as attempted before invoking lake, which prevented the
  self-heal path from ever recovering from a transient build failure in
  batch/suite runs that share a process.

- P2: count every candidate transition in the interactive analytics
  fallback, including reverts (A -> B -> A now counts 2 changes, not 1).
  Previously the increment was gated by "hash not yet seen", which
  undercounted non-monotonic candidate sequences.
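The fallback analytics path, with both review fixes applied, can be sketched as follows. `stable_digest` and `count_candidates` are hypothetical stand-ins for the harness helpers, and the attempt dicts are illustrative:

```python
import hashlib

def stable_digest(text):
    """Stand-in for the harness's stable content hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def count_candidates(attempts):
    """Derive distinct/change counts from candidate_file_contents
    for attempts that did not populate a trace digest."""
    distinct = set()
    changes = 0
    previous = None
    for attempt in attempts:
        digest = stable_digest(attempt.get("candidate_file_contents", ""))
        distinct.add(digest)
        # Count every transition, including reverts: A -> B -> A is 2 changes.
        if previous is not None and digest != previous:
            changes += 1
        previous = digest
    return len(distinct), changes
```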

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/interactive_runtime.py Outdated
Bugbot (Medium): `_attempt_lake_build` returned False when the module
was already cached as built, so after `prebuild_task_modules` populated
the cache, every subsequent environment_error auto-heal attempt
short-circuited without invoking lake and without retrying the check —
while the model was told "The harness is attempting to rebuild it".

The auto-heal path only fires when the runtime has observed a missing
.olean at check time, which means the cached 'success' entry is stale.
Always invoke lake build in this path and refresh the cache with the
latest result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/default_agent.py
Comment thread harness/interactive_runtime.py
- Low: merge the two back-to-back `if config.mode == "interactive":`
  blocks in `execute_agent_task` into a single branch.
- Low: `try_tactic_at_hole` now uses the same `HOLE_PATTERN` regex as
  `inspect_lean_goals`, so the two tools agree on what counts as a
  hole (previously the tightened lookbehind in HOLE_PATTERN was not
  mirrored in `try_tactic_at_hole`, which could let it replace a match
  that `inspect_lean_goals` had reported as absent).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 950d7e3145


Comment thread harness/default_agent.py
Codex (P2): `_parse_retry_after` only handled delta-seconds, so when a
provider sent an HTTP-date on 429/503 (which RFC 7231 permits), the
function returned None and `_backoff_delay` fell back to a short
exponential retry instead of honouring the server-requested wait. That
can repeatedly hit rate limits and prolong failures.

Now parse both forms; a date in the past clamps to 0.
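A minimal version of such a dual-form parser, using only the standard library; the function name and signature are hypothetical:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, now=None):
    """Parse both Retry-After forms RFC 7231 permits: delta-seconds
    and an HTTP-date. A date in the past clamps to 0; unparseable
    input yields None so the caller falls back to exponential backoff."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```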

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ddda642d4


Comment thread harness/default_agent.py Outdated
Comment on lines +1095 to +1096
if retry_after is not None:
    return min(retry_after, 60.0)

P2: Respect full Retry-After delay from provider

_backoff_delay clamps Retry-After to 60 seconds, so 429/503 responses that legitimately request longer waits (for example several minutes) will be retried too early and can burn through all retries while the provider is still enforcing the limit. This directly undermines the new “honour Retry-After” behavior and can turn recoverable throttling into hard failures.


Comment thread harness/interactive_runtime.py Outdated
Comment on lines +707 to +708
re.compile(r"object file ['\"][^'\"]+\.olean['\"]? does not exist"),
re.compile(r"failed to load environment"),

P2: Narrow environment-error matching to real infra failures

The new ENVIRONMENT_ERROR_PATTERNS treats any "object file ... .olean does not exist" / "failed to load environment" message as infrastructure, but those strings also occur when the candidate introduces a bad import/module name. Because _annotate_check_result short-circuits on environment_error, these model-caused mistakes are labeled “not your fault” and excluded from stagnation/temperature logic, which can prevent corrective behavior in interactive runs.


Comment thread harness/default_agent.py Outdated
Codex P2 #1: narrow `environment_error` classification to real infra
failures. Previously any missing .olean matched, but that string is
also produced when the model imports a non-existent module. In the
latter case we want the normal stagnation/temperature logic to kick in
so the model can correct itself. Now only classify as environment_error
when the missing module is under `Benchmark.*` (our dependency tree
which should have been pre-built) or when lean itself is missing. The
generic "failed to load environment" substring no longer
short-circuits.

Codex P2 #2: raise `_backoff_delay` Retry-After ceiling from 60s to
600s. Providers routinely request several-minute waits on 429/503; the
old 60s clamp caused retries to fire while the rate limit was still in
force, undermining the "honour Retry-After" behaviour.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df1004c256


Comment thread harness/interactive_runtime.py Outdated
# pre-built) or the model's fault (imported a module that doesn't exist). We
# only classify the former as environment_error so stagnation/temperature
# logic still applies to model-caused import mistakes.
_MISSING_OLEAN_RE = re.compile(r"object file ['\"]([^'\"]+\.olean)['\"]? does not exist")

P1: Handle the "of module" form in the missing-olean regex

Lean commonly reports missing artifacts as object file '...olean' of module <Name> does not exist, but _MISSING_OLEAN_RE only matches the shorter '...olean' does not exist form. Fresh evidence in this revision is that the new regex at this line requires does not exist immediately after the path, so _missing_olean_module returns None for the standard message shape; that prevents classify_failure from returning environment_error and skips the new auto-heal lake build retry path in run_lean_check.


Bugbot (Medium): the temperature schedule condition was checked at the
top of every loop iteration, so once two consecutive same-class
failures triggered it, temperature was bumped on every subsequent turn
-- including pure search/write turns -- until the 0.7 cap was reached.
Track the history length we have already acted on and only bump once
per new failure entry.
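The once-per-new-failure schedule can be sketched by tracking how much history has already been acted on; names, step, and cap here are illustrative:

```python
def maybe_bump_temperature(temperature, failure_class_history, acted_on,
                           step=0.1, cap=0.7):
    """Bump temperature at most once per newly recorded failure: only
    act when the history grew since we last looked, and only when the
    two most recent failures share a class."""
    if len(failure_class_history) <= acted_on:
        return temperature, acted_on  # no new failure this turn: no bump
    acted_on = len(failure_class_history)
    if (len(failure_class_history) >= 2
            and failure_class_history[-1] == failure_class_history[-2]):
        temperature = min(cap, temperature + step)
    return temperature, acted_on
```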

Codex (P1): Lean reports missing artifacts in two forms depending on
context:

  object file '...olean' does not exist
  object file '...olean' of module <Name> does not exist

The `_MISSING_OLEAN_RE` regex only matched the shorter form, so on the
more common "of module" diagnostic `_missing_olean_module` returned
None, `classify_failure` did not return `environment_error`, and the
auto-heal retry path was skipped. Accept arbitrary text between the
path and the "does not exist" tail.
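An illustrative widened pattern, checked against both diagnostic shapes; the regex here is a reconstruction of the described fix, not necessarily the exact one merged:

```python
import re

# Allow arbitrary text (e.g. "of module <Name>") between the olean
# path and the "does not exist" tail.
MISSING_OLEAN_RE = re.compile(
    r"object file ['\"]([^'\"]+\.olean)['\"]?.* does not exist"
)

short_form = "object file './.lake/build/lib/Benchmark/Foo.olean' does not exist"
long_form = ("object file './.lake/build/lib/Benchmark/Foo.olean' "
             "of module Benchmark.Foo does not exist")
```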

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 894a285f92


Comment thread harness/interactive_runtime.py Outdated
Comment on lines +753 to +755
missing_module = _missing_olean_module(details)
if missing_module and missing_module.startswith("Benchmark."):
    return "environment_error"

P2: Verify missing Benchmark module exists before env classification

Fresh evidence in this revision is that classify_failure now labels any missing .olean as environment_error whenever the parsed module name starts with Benchmark.. A model-caused bad import like import Benchmark.Foo.Bar (which is not a real module) will therefore be treated as infrastructure failure, causing _annotate_check_result to mark it as environment_error and skip stagnation/temperature correction, so interactive runs can keep retrying instead of fixing the import.


Codex (P2): a model-authored bad import like `import Benchmark.Foo.Bar`
where no such module exists would still produce an "object file ...olean
does not exist" diagnostic, and under the previous check that matched
the `Benchmark.*` prefix it would be misclassified as infrastructure
failure -- skipping stagnation/temperature correction instead of letting
the model discover and fix the bogus import.

Require the corresponding `.lean` source file to actually exist in the
tree before treating a missing .olean as environment_error.
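A sketch of that classification rule; `is_environment_error` is a hypothetical helper, not the harness's actual function:

```python
from pathlib import Path

def is_environment_error(missing_module, root):
    """Treat a missing .olean as infrastructure only when the module is
    in our pre-built Benchmark.* tree AND its .lean source actually
    exists; otherwise the model invented the import and should be left
    to discover and correct it."""
    if not missing_module.startswith("Benchmark."):
        return False
    source = Path(root) / (missing_module.replace(".", "/") + ".lean")
    return source.is_file()
```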

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42efe75153


Comment thread harness/default_agent.py Outdated
Comment on lines +1195 to +1197
# Fall back only on rate-limit / service-unavailable style errors.
if exc.status not in (429, 500, 502, 503, 504) and exc.status != 0:
break

P2: Retry alternate models on all transient HTTP statuses

send_chat_completion only falls through to fallback_models for 429/5xx (or status 0), but _post_chat_completion explicitly treats 408, 409, and 425 as transient retryable failures too. If the primary model keeps returning one of those statuses, the code exits after exhausting retries on that model instead of trying configured fallback models, so the new fallback chain is skipped in a recoverable outage scenario.


Codex (P2): `send_chat_completion`'s fallback-model gate used a
hard-coded subset {429, 500, 502, 503, 504}, but `_post_chat_completion`
retries the broader `RETRY_STATUS_CODES` set (408, 409, 425, 429, 500,
502, 503, 504). A primary model that kept returning 408/409/425 would
exhaust retries and then skip the configured fallback chain entirely.
Route on the same set used by the underlying retry loop.
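The fix amounts to sharing one transient-status set between the retry loop and the fallback gate; a minimal sketch with hypothetical names (status 0 stands for network-level errors, as in the review comment):

```python
# One shared set of transient statuses, used by both the per-model
# retry loop and the fallback-model gate.
RETRY_STATUS_CODES = frozenset({408, 409, 425, 429, 500, 502, 503, 504})

def should_try_fallback_model(status):
    """Fall through to the next configured model on any transient
    status, not a hard-coded subset of it."""
    return status == 0 or status in RETRY_STATUS_CODES
```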

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/interactive_runtime.py
Comment thread harness/default_agent.py Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8d43bbfae


Comment thread harness/default_agent.py Outdated
token_budget = min(int(token_budget * 1.5), 4500)
# Up to 3 silent budget bumps before nudging the model to simplify.
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), 12000)

P1: Cap length-retry token bumps at model-safe limit

The new length-retry path raises token_budget to min(int(token_budget * 1.5), 12000), which can exceed the model/provider output-token cap after a finish_reason == "length" response. With the updated interactive default already at 4096, the first bump becomes 6144; providers that enforce a 4096 max will return HTTP 400, and send_chat_completion treats 400 as non-transient and exits instead of recovering. This turns a recoverable truncation into a hard task failure for capped models.


Comment thread harness/default_agent.py
Comment thread harness/default_agent.py Outdated
Three related fixes from bugbot/codex on f8d43bb:

1. (codex P1) Cap length-retry token bump at config.max_completion_tokens.
   The previous hard-coded cap of 12000 could exceed the provider's per-
   response output-token limit (e.g. models with a 4096 hard cap), turning
   a recoverable truncation into a non-transient HTTP 400 hard failure.

2. (bugbot Low) Also reset token_budget when recovering from a length
   streak, not only when consecutive_length_stops >= 5. Previously a
   1-4 stop streak followed by recovery left the elevated budget in
   place for the rest of the run.

3. (bugbot Low) Skip environment_error entries when appending to
   failure_class_history. They are infra noise that would break the
   sliding window same-class check (e.g. [type_error, env_error,
   type_error]) even though the filter at the trigger site rejects
   them -- consistent with the PR's stated intent that env_error be
   invisible to stagnation tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07c3880b8d


Comment thread harness/default_agent.py Outdated
# provider-enforced per-response limit (some models hard-cap at the
# configured value and would return HTTP 400 on anything larger).
if consecutive_length_stops <= 3:
    token_budget = min(int(token_budget * 1.5), config.max_completion_tokens)

P2: Make length-stop retries raise max_tokens

execute_interactive_agent_task initializes token_budget to config.max_completion_tokens, but the length-retry branch sets it to min(int(token_budget * 1.5), config.max_completion_tokens), which can never increase the value. As a result, repeated finish_reason == "length" responses are retried with the same token limit instead of a larger one, so the new silent recovery path is effectively disabled and long responses can still exhaust attempts unnecessarily.


Comment thread harness/default_agent.py Outdated
claude added 2 commits April 22, 2026 12:18
Python 3.12 surfaces socket read timeouts during SSL as bare
TimeoutError, which slipped past the urllib.error.URLError handler
and killed the task with a traceback. Catch it alongside URLError and
apply the same backoff retry policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous commit capped the silent length-retry bump at
config.max_completion_tokens, which -- as bugbot and codex both
pointed out -- turned the feature into a no-op because token_budget
starts at exactly that value.

Add a config.extra_body.length_retry_token_cap knob (stripped from
the provider payload in send_chat_completion). Default stays at
max_completion_tokens so models with a hard per-response cap don't
get surprised by HTTP 400; interactive.json opts into a 12000
ceiling, which is the original intended behavior for gpt-class
models that accept larger single-response budgets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5f118c37b


Comment thread harness/default_agent.py
Comment thread harness/default_agent.py Outdated
Comment on lines +1794 to +1796
length_retry_token_cap = int(
    config.extra_body.get("length_retry_token_cap", config.max_completion_tokens)
)

P2: Handle invalid length_retry_token_cap without crashing

Interactive runs cast extra_body.length_retry_token_cap with int(...) unconditionally, but extra_body is schema-free and often operator-edited; non-integer values (e.g., null, "12k", or nested objects) will raise before the first model turn and abort the task. This knob should be parsed defensively (or validated up front) and fall back to max_completion_tokens instead of terminating execution.


Both knobs are read from schema-free extra_body:

- fallback_models="gpt-4o-mini" (string shorthand) previously iterated
  character-by-character, producing fake model ids like "g", "p", "t"
  during failover. Normalize a bare string to a one-element list first.

- length_retry_token_cap=null / "12k" / nested object would crash the
  int() cast before the first model turn, aborting the task. Fall back
  to max_completion_tokens silently.
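Both defensive parses can be sketched in one hypothetical helper; the knob names come from this PR, the function itself is illustrative:

```python
def read_extra_body_knobs(extra_body, max_completion_tokens):
    """Defensively parse schema-free extra_body knobs."""
    raw = extra_body.get("fallback_models", [])
    if isinstance(raw, str):
        # A bare string would otherwise be iterated character-by-character.
        fallback_models = [raw]
    elif isinstance(raw, list):
        fallback_models = [m for m in raw if isinstance(m, str)]
    else:
        fallback_models = []
    try:
        cap = int(extra_body.get("length_retry_token_cap"))
    except (TypeError, ValueError):
        # null / "12k" / nested object / absent: fall back silently.
        cap = max_completion_tokens
    return fallback_models, cap
```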

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread harness/default_agent.py Outdated
Comment thread harness/default_agent.py Outdated
Comment thread harness/default_agent.py Outdated
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.


- `harness/agents/interactive-gpt.json`: OpenRouter openai/gpt-5.4
- `harness/agents/interactive-opus.json`: OpenRouter anthropic/claude-opus-4.6
- `scripts/run_resumable_matrix.py`: runs every (profile, task) combo,
  skips already-produced results so the script can be re-run after
  rate-limit / crash / interruption to continue where it left off;
  writes an incremental summary after every task so partial runs
  still leave analyzable output.

Progress and summary live under `results/matrix_runs/<run_id>/`.
Comment thread harness/interactive_runtime.py
Comment thread harness/interactive_runtime.py
Comment thread harness/interactive_runtime.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4c4cf50f4


Comment thread scripts/run_resumable_matrix.py Outdated
grindset-s3-worker and others added 13 commits April 23, 2026 16:41
Introduce a grind-first template generator for Benchmark/Generated/** task
skeletons so new/regenerated proof templates import Benchmark.Grindset and
start with `unfold <spec>; grind [ContractName.fn, storage fields]`.

Deliverables:

- scripts/generate_task_skeletons.py: parses the existing `Tasks/*.lean`
  templates, extracts the contract function call from the theorem signature,
  looks up the companion `Benchmark/Cases/.../Contract.lean` to pull every
  storage field declared inside `verity_contract`, and emits a rewritten
  skeleton with `import Benchmark.Grindset` and a grind-first proof body.
  Supports --preview (writes under Benchmark/GeneratedPreview/), --in-place
  (rewrites live files when you are sure nothing is consuming them), and
  --patch (unified diff on stdout). Idempotent when re-run on its own
  output.

- Benchmark/Grindset.lean: minimal empty stub so generator output is
  buildable standalone before the real `@[grind]` lemma bundle lands on
  grindset/s1-verity-grindset. Content-free on purpose — the s1 branch
  replaces it with the real bundle.

- harness/PROOF_PATTERNS.md: rewritten to lead with a "grind-first" pattern
  (unfold spec, grind with ContractName.fn + all storage fields, grind? for
  lemma discovery on stuck goals). The pre-grindset simp+by_cases recipe is
  preserved under a "Fallback" section.

- harness/PROMPT.md: matching agent-facing rewrite. Enumerates the
  grind-first strategy as steps 1-4 and demotes simp/simp_all/native_decide
  to the step-5 fallback.

- Benchmark/GeneratedPreview/: 88 preview task skeletons so reviewers can
  diff the before/after shape without touching live Benchmark/Generated/**
  that a running benchmark could be reading.

Assumptions about the grindset interface:

- `@[grind]`-tagged lemmas on getStorage, setStorage, setMapping,
  setMappingUint, Verity.require, Verity.bind, Bind.bind, Verity.pure,
  Pure.pure, Contract.run, ContractResult.snd (i.e. the operational bundle
  `simp` already needs today). The generator therefore does NOT hint any of
  these in its `grind [...]` list — it only hints `ContractName.fn` and the
  storage fields declared in the companion `verity_contract` block.
…[grind]

Introduces Benchmark/Grindset/Invariants.lean which re-exports 118 domain-level
invariant lemmas with `attribute [grind …]` so the grind tactic can use them
during proof search. Complementary to S1's operational-primitives tagging.

Coverage (see Benchmark/Grindset/INVARIANTS_AUDIT.md for per-entry rationale):
  - 49 × [grind =] : store/load identities, zero/self identities, sum lemmas
                     whose LHS captures every bound parameter.
  - 48 × [grind →] : forward-only implications (monotonicity, safeAdd/Sub/Mul
                     bounds, wad gated identities, positivity).
  - 21 × [grind]   : case-local `def` unfolds (17 across 5 cases) plus 4
                     mulDivDown inequality lemmas whose conclusions are
                     `≤`/`<` rather than `=`.

Modules covered: Verity.Core.Uint256, Verity.Proofs.Stdlib.Math / ListSum /
MappingAutomation, Verity.Specs.Common (+.Sum), and case Specs.lean files for
Kleros/SortitionTrees, Lido/VaulthubLocked, PaladinVotes/StreamRecoveryClaimUsdc,
Safe/OwnerManagerReach, Zama/ERC7984ConfidentialToken.

Deliberately NOT tagged:
  - Commutativity rewrites (E-match loop traps).
  - `map_sum_point_update/decrease/transfer_eq` and 3 of the 5
    `sumBalances_*` — their equation LHS doesn't mention every bound parameter,
    so grind refuses to register them; callers should pass them manually.
  - Case-local `reachable`/`acyclic`/`freshInList`/`calculateBuyReserve` etc.
    (unbounded branching).
  - Already-@[simp] Uint256 algebraic identities.

Also adds a minimal Benchmark/Grindset.lean stub that imports only
Benchmark.Grindset.Invariants — to be replaced/extended by sibling worker S1.

Builds clean under `lake build Benchmark.Grindset.Invariants` and
`lake build Benchmark.Grindset`.

No file under `.lake/packages/verity/**` or `Benchmark/Cases/**/{Specs,Proofs}.lean`
was modified.
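The three attribute flavours above can be illustrated on hypothetical lemma names (illustrative only — the real tags live in Benchmark/Grindset/Invariants.lean):

```lean
-- Hypothetical lemma names; only the attribute flavours mirror the audit.
attribute [grind =] store_load_same      -- equation whose LHS binds every parameter
attribute [grind →] safeAdd_le_bound     -- forward-only implication (no backward E-matching)
attribute [grind] calculateFee.eq_def    -- plain unfold hint for a case-local def
```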
…grind)

Adds Benchmark/Grindset/Reach.lean with @[grind]-tagged reach closure
lemmas for two reach flavours found in (or likely useful for) the
reach-heavy benchmark cases:

- Inductive via mathlib Relation.ReflTransGen: reach_refl,
  reach_of_step, reach_tail, reach_head, reach_trans,
  reach_preserves_invariant.
- Witness-based (matching Safe.OwnerManagerReach.reachable shape):
  IsChain + Reachable defs, reachable_refl / _step / _of_step
  (@[grind]), plus reachable_snoc / _trans / _preserves_invariant
  (intentionally not @[grind] — they loop the E-matcher).

Includes a verity_reach_grind tactic macro that first applies
reachable_preserves_invariant / reach_preserves_invariant before
falling back to plain grind, and ReachTests.lean with two independent
demo proofs authored without consulting Proofs.lean. Ships
Benchmark/Grindset.lean stub and REACH_NOTES.md documenting the
design, the fact that only Safe/OwnerManagerReach among the four
flagged cases actually uses reach, and applicability estimates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
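The witness-based flavour mentioned above might look like the following sketch (names taken from the commit message; definitions are assumed — the real ones live in Benchmark/Grindset/Reach.lean):

```lean
-- Sketch only: assumed shapes for the witness-based reach flavour.
def IsChain {α : Type} (step : α → α → Prop) : List α → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => step a b ∧ IsChain step (b :: rest)

-- b is reachable from a if some step-chain starts at a and ends at b;
-- reflexivity holds via the singleton chain [a].
def Reachable {α : Type} (step : α → α → Prop) (a b : α) : Prop :=
  ∃ l : List α, IsChain step l ∧ l.head? = some a ∧ l.getLast? = some b
```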
Updates run_slug and model for the interactive-opus agent profile from
claude-opus-4.6 to claude-opus-4.7. Matches the model used by the
matrix_runs/matrix-opus run (pass rate 52/77 = 67.5%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…grind)

# Conflicts:
#	Benchmark/Grindset.lean
…ompts)

# Conflicts:
#	Benchmark/Grindset.lean
… WIP)

# Conflicts:
#	Benchmark/Grindset.lean
Combines:
- grindset/a1-invariant-tags (118 @[grind] invariants)
- grindset/a3-reach-grind-ext (Reach lemma pack + verity_reach_grind)
- grindset/s1-verity-grindset + s1-fix (Attr/Monad/Core/Tests, all 3 demos pass)
- grindset/s3-skeleton-gen (skeleton generator + grind-first prompts)
- harness/interactive-opus-4-7-config (profile bump to opus-4.7)

Not included:
- grindset/s2-grind-tool (worker refused 3x; try_grind tool to be
  implemented in a follow-up pass)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

return {
    "status": "failed",
    "failure_mode": "hidden_proof_import_detected",
    "details": "candidate proof imports hidden Benchmark.Cases.*.Proofs modules",
}

P2: Route new preflight failures through retry loop

Adding hidden_proof_import_detected/hidden_case_import_detected introduces new failure_mode values, but the no-tool fallback in execute_interactive_agent_task only retries lean_check_failed, placeholder_detected, and theorem_statement_mismatch (it returns immediately for other modes). That means a plain-text model response containing a blocked import now hard-fails the task after one attempt instead of feeding back a correction prompt and continuing, which regresses interactive recovery in non-tool turns.
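A minimal sketch of the suggested fix, assuming a retryable-mode set (names hypothetical — the actual dispatch lives in execute_interactive_agent_task):

```python
# Hypothetical names; sketch of routing preflight failures through retries.
RETRYABLE_FAILURE_MODES = {
    "lean_check_failed",
    "placeholder_detected",
    "theorem_statement_mismatch",
    # New preflight modes the model can trivially recover from:
    "hidden_proof_import_detected",
    "hidden_case_import_detected",
}

def should_retry(failure_mode: str) -> bool:
    """Retry with a correction prompt instead of hard-failing the task."""
    return failure_mode in RETRYABLE_FAILURE_MODES
```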


P1 (runtime reliability):
- interactive_runtime.py: defensive read-side guard against serving an
  `environment_error` result from the run_lean_check fast-path cache.
  The write-side already skipped caching env errors; this closes the
  remaining loophole in case a future refactor leaves one in the cache,
  keeping `_attempt_lake_build` heal retries reachable on stale infra.
- run_resumable_matrix.py: treat unreadable/corrupt result artifacts as
  missing. `read_result()` returning None previously marked the task
  SKIP with `unknown` status, so a resumed matrix could silently finish
  with gaps that resume was meant to fill. Now we delete the corrupt
  file, log a rerun event, and fall through to the RUN branch.

Codex P2:
- default_agent.py: route hidden_proof_import_detected and
  hidden_case_import_detected through the retry loop alongside
  placeholder_detected / theorem_statement_mismatch. Previously these
  fell through to the generic `return`, aborting the task on the first
  hidden-import mistake even though the model can trivially recover by
  dropping the bad import.

Medium:
- interactive_runtime.py: invalidate `_last_eval_cache` at the top of
  `write_editable_proof`. Repeat writes of identical content (common
  during stagnation loops) would otherwise hit the cache and surface a
  misleading `cached: true` / "redundant run_lean_check" note on what
  is actually a fresh write turn.

Low:
- default_agent.py: re-export `_PREFLIGHT_FAILURE_MODES` from
  `interactive_runtime` instead of maintaining a second copy. An
  earlier refactor had already dropped `empty_response` from the
  duplicate; importing removes that whole class of drift bug.
- default_agent.py: never let the escalation schedule drop the
  temperature below `config.temperature`. With a configured base of
  1.0 the old `min(0.7, ...)` cap would DECREASE to 0.7 on the first
  stagnation trigger, the opposite of the escalation intent.
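The temperature-floor fix can be sketched as follows (function and parameter names hypothetical):

```python
def escalation_temperature(base: float, escalation_bump: float) -> float:
    # Clamp upward only: escalation may raise the temperature above the
    # configured base but must never drop below it (the old min(0.7, ...)
    # cap decreased a base of 1.0 to 0.7 on the first stagnation trigger).
    return max(base, min(base + escalation_bump, 2.0))
```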

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6538119cbc


grindset-s3-worker and others added 2 commits April 23, 2026 20:29
Add Benchmark/Grindset/Arith.lean with 13 @[grind_norm, simp]-tagged
lemmas that help grind and omega close the three arithmetic obligations
in the lido/vaulthub_locked case:

  - Uint256→Nat wrappers: mul, sub, div, add val-level reductions
  - ceilDiv unfolding: ceilDiv_val_eq (Nat-level identity)
  - ceilDiv bounds: ceilDiv_le_numerator, ceilDiv_mul_ge (sandwich),
    ceilDiv_monotone
  - Spec-level: ceildiv_sandwich_spec_holds,
    shares_conversion_monotone_spec_holds
  - 3 demo theorems closing specs without sorry

Zero sorry, zero new axioms. Builds clean on Lean 4.22.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Benchmark/Grindset/Arith.lean (13 lemmas) covering:
- Uint256 <-> Nat wrappers (mul/sub/div/add under overflow guards)
- ceilDiv core (val, ≤-numerator, sandwich, monotonicity)
- Spec-level closers: ceildiv_sandwich_spec, shares_conversion_monotone_spec
- 3 demo theorems exercising the pack via simp only [grind_norm]

No new axioms, zero sorry, pure Nat arithmetic (omega-reachable).
Umbrella Benchmark.Grindset.lean imports the new module.
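The ceilDiv identities above follow the usual Nat rounding-up pattern; a sketch of the assumed definition (the real one lives in the verity package, so treat this as illustration only):

```lean
-- Assumed definition for illustration; the standard ⌈a / b⌉ encoding on Nat.
def ceilDiv (a b : Nat) : Nat := (a + b - 1) / b

example : ceilDiv 7 2 = 4 := rfl  -- rounds up on a nonzero remainder
example : ceilDiv 6 2 = 3 := rfl  -- exact division is unchanged
```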
Claude and others added 2 commits April 23, 2026 21:36
The self-hosted Blacksmith sticky disk for bucket `pr-26-*` has been
stuck with a corrupt `.lake/packages/aesop` (missing HEAD) for the
entire lifetime of this PR. Since the sticky-disk key is derived
purely from `lean-toolchain / lakefile.lean / lake-manifest.json`
hashes, plain reruns keep remounting the broken state.

Add a `.github/cache-salt` file hashed into the elan / packages /
build keys. Bumping the salt (currently `1`) rotates every sticky
bucket, forcing a fresh checkout of all `.lake/packages/*` next run.

The salt only affects cache-key derivation; no build inputs change.
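The salt's role in key derivation can be sketched in a few lines (helper names hypothetical; the real derivation is the workflow shell quoted later in this thread):

```python
import hashlib

def cache_salt_hash(salt_path: str = ".github/cache-salt") -> str:
    # Bumping the file's contents changes this digest, and hence every
    # derived sticky-disk key, forcing a fresh checkout next run.
    with open(salt_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:16]

def sticky_key(prefix: str, *parts: str) -> str:
    # e.g. sticky_key("lake-build-benchmark", bucket, os_name, salt_hash)
    return "-".join((prefix, *parts))
```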
Three review findings from PR #26:

- Cursor Medium: `countOcc_cons_ne` / `countOccU_cons_ne` were tagged
  `[grind =]` but the audit specifies `[grind →]` (conditional `a ≠ t →`
  equalities are forward-only). Move them to the forward block and
  update the section docstring.

- Cursor Low: `mulDivDown_nat_eq` was tagged `[grind =]` while its
  three siblings (`mulDivUp_nat_eq`, `wMulDown_nat_eq`, `wDivUp_nat_eq`)
  are `[grind →]`. The audit specifies all four as `[grind →]`
  (fits-within-modulus antecedent). Consolidate into the forward block.

- Codex P2: `_append_failure_class` dedupes on `(candidate_hash,
  failure_class)` across the whole task. When the model stalls and
  re-submits the same unchanged candidate across turns, legitimate
  cross-turn repeats were silently dropped, preventing the repeated-
  class temperature escalation from firing. Reset `_last_history_key`
  at the top of each turn so dedupe is strictly intra-turn (its actual
  purpose: coalescing `write_editable_proof` + `run_lean_check` on the
  same candidate within one model turn).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
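A minimal sketch of the intra-turn dedupe described above (class and attribute names assumed from the commit text):

```python
class FailureHistory:
    """Coalesces repeated (candidate, class) entries within one model turn."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []
        self._last_history_key: tuple[str, str] | None = None

    def begin_turn(self) -> None:
        # Reset at the top of each turn: dedupe is strictly intra-turn, so
        # cross-turn repeats of an unchanged candidate still count toward
        # repeated-class temperature escalation.
        self._last_history_key = None

    def append(self, candidate_hash: str, failure_class: str) -> None:
        key = (candidate_hash, failure_class)
        if key == self._last_history_key:
            return  # write_editable_proof + run_lean_check, same candidate
        self._last_history_key = key
        self.entries.append(key)
```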

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b98f70f179


Claude and others added 2 commits April 24, 2026 08:07
Remove unused/legacy profiles (builtin-smart, combined-lean-tools,
interactive-candidate, interactive, leanstral, openrouter-gemini-3.1-
flash-lite-preview) and keep only the three interactive profiles we
actually use, with larger budgets (max_attempts=32, max_tool_calls=80).

- scripts/run_benchmark_matrix.py: update TARGET_CONFIGS + CLI flags
  (--interactive-{gpt,opus,smart}-repeats) accordingly.
- scripts/repeat_benchmark_compare.py: refresh usage example.
- harness/README.md: refresh bundled-profile listing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses two Codex P2 findings on scripts/run_resumable_matrix.py:

- Use $HOME/.elan/bin (falling back to expanduser) when prepending the
  elan toolchain to PATH, so non-root environments (local dev, CI) pick
  up lake/lean instead of silently missing them.
- Gate the unlink of unreadable artifacts behind "not args.dry_run":
  dry-run mode now just reports the would-be deletion and continues,
  keeping the workspace untouched so users can inspect the corrupt file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
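The PATH fix amounts to resolving the elan bin directory robustly (helper names hypothetical):

```python
import os

def elan_bin_dir() -> str:
    # Prefer $HOME when set (local dev, CI); fall back to expanduser so
    # non-root environments still resolve the right toolchain path.
    home = os.environ.get("HOME") or os.path.expanduser("~")
    return os.path.join(home, ".elan", "bin")

def prepend_to_path(directory: str) -> str:
    # Returns the new PATH value rather than mutating the environment.
    return directory + os.pathsep + os.environ.get("PATH", "")
```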

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit e1544ab.

packages_key="lake-packages-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
packages_main_key="lake-packages-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
build_key="lake-build-benchmark-${CACHE_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"
build_main_key="lake-build-benchmark-${MAIN_BUCKET}-${RUNNER_OS_NAME}-${LEAN_TOOLCHAIN_HASH}-${LAKEFILE_HASH}-${LAKE_MANIFEST_HASH}-${CACHE_SALT_HASH}"

Cache salt missing from GitHub Actions fallback keys

Medium Severity

The new CACHE_SALT_HASH is appended to all sticky-disk cache keys (lines 35–39) but is not added to the GitHub Actions cache fallback keys at line 65 (elan-${{ runner.os }}-...) and line 170 (lake-${{ runner.os }}-...). Bumping .github/cache-salt will bust caches only for workflows using sticky disks; workflows on the non-sticky path will continue hitting stale caches, defeating the cache-busting mechanism.

