feat(autofix): aider harness behind AUTOFIX_HARNESS flag (Phase 2a)#7

Open
dnplkndll wants to merge 8 commits into feature/explorer-endpoints from feat/aider-harness

dnplkndll commented May 20, 2026

What changed

Adds aider as an alternate autofix orchestrator behind the new AUTOFIX_HARNESS feature flag. Default remains builtin (existing AutofixAgent), so live behavior is unchanged until an operator opts in.

This is Phase 2a of the coding-harness rollout — see docs/coding-harnesses.md (merged in #6) for the comparison matrix and constraint summary that drove the design.

Scope (8 commits, in order)

  1. feat(harness): scaffold harness package + AUTOFIX_HARNESS flag (478f0c2) — adds seer.automation.harness package with select_orchestrator() registry + factory and the AUTOFIX_HARNESS / AUTOFIX_HARNESS_STRICT / AUTOFIX_HARNESS_MODEL config flags.
  2. feat(harness): AiderHarness subprocess wrapper + diff adapter (53ede97) — new harness/aider.py with sandboxed subprocess invocation, plus a minimal diff_adapter.py.
  3. build(docker): bake aider-chat into Lightweight.Dockerfile (43a5bbf) — aider-chat==0.65.0 installed into an isolated /opt/aider-venv so its sprawling dep tree (litellm, prompt-toolkit, gitpython) can't fight seer's pinned versions.
  4. feat(autofix): wire orchestrator selection in components (3a8b834) — root_cause / solution / coding components each resolve the orchestrator via select_orchestrator(config) instead of constructing AutofixAgent directly.
  5. test(autofix): gated dogfood integration test (d305c73) — opt-in (SEER_AIDER_DOGFOOD=1) live test runnable inside the container after deploy.
  6. fix(harness): aider 0.65.0 flag set + vertex deps in venv (5a36248) — fallout from live VM verification: --chat-mode ask (not --ask), --no-check-update to stop aider clobbering seer's tokenizers via mid-session pip-upgrade, google-auth + google-cloud-aiplatform in /opt/aider-venv for litellm Vertex routing.
  7. fix(harness): align AiderHarness with AutofixAgent contract surface (01c9625) — bugbot review caught that the standalone smoke test bypassed real component integration. Adds usage (so cur.usage += agent.usage works), add_user_message (coding/solution push prompt this way), _resolve_prompt fallback to last user message, tools as public attribute (settable to [] mid-flow), constructor-seeded memory, and synthesized assistant Message after each run so the formatter LLMs have aider's output to extract from. Drops the broken Pydantic attribute assignment in _record_diff (BaseStep has no extra="allow"). New tests cover each contract method + regression guards for the flag fixes.
  8. refactor(harness): idempotent re-registration (11b3bb8) — same name → cls is a no-op, only different class under same name still raises. Survives pytest module reloads.
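
A minimal sketch of how a registry lookup with a strict-mode fallback could look. The classes `BuiltinAgent` and `AiderAgent` and the `(name, strict)` signature are stand-ins inferred from the `select_orchestrator("aider", strict=True)` call quoted in this PR; the real factory lives in `seer.automation.harness` and may differ.

```python
from typing import Dict


class HarnessNotAvailableError(RuntimeError):
    """Raised in strict mode when the requested harness is not registered."""


# Hypothetical stand-ins for the real orchestrator classes.
class BuiltinAgent:
    pass


class AiderAgent:
    pass


_REGISTRY: Dict[str, type] = {"builtin": BuiltinAgent, "aider": AiderAgent}


def select_orchestrator(name: str, strict: bool = False) -> type:
    """Return the orchestrator class registered under `name`.

    Unknown names fall back to the builtin orchestrator unless `strict`
    is set, in which case they raise HarnessNotAvailableError.
    """
    cls = _REGISTRY.get(name)
    if cls is not None:
        return cls
    if strict:
        raise HarnessNotAvailableError(f"unknown harness: {name!r}")
    return _REGISTRY["builtin"]
```

With this shape, a typo in `AUTOFIX_HARNESS` degrades to the builtin path by default and only fails loudly when an operator has asked for strictness.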

Design highlights

  • Sandbox: each invocation runs in /tmp/aider-<run_id>/ with a shallow git clone, cleaned up in a finally: block. Walltime capped at 600s.
  • Hardcoded repo (Phase 2a): ledoent/seer on feature/explorer-endpoints since the benchmark issues are all seer-side. Phase 2b switches to Sentry code-mapping RPC lookup.
  • Mode selection: root_cause_analysis and solution steps run aider in --chat-mode ask (no commits). coding step gets --auto-commits; the resulting git diff HEAD~1 HEAD is logged at INFO (UI persistence deferred to Phase 2b — see below).
  • Vertex Gemini only: aider model defaults to vertex_ai/gemini-2.5-flash; uses the existing seer-vertex ADC at /etc/sentry-extra/seer-vertex-key.json. No new env vars.
  • Strict mode: AUTOFIX_HARNESS_STRICT=True raises HarnessNotAvailableError on unknown harness; default false falls back to builtin.
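
The sandbox and mode-selection bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `harness/aider.py`: the function names are hypothetical, the git clone step is omitted, `tempfile.mkdtemp` stands in for the `/tmp/aider-<run_id>/` directory, and passing the model via `--model` is assumed. The other flags are the ones named in this PR.

```python
import shutil
import subprocess
import tempfile
from typing import Callable, List

AIDER_TIMEOUT_SECONDS = 600  # walltime cap stated in the description


def build_aider_argv(
    step: str, prompt: str, model: str = "vertex_ai/gemini-2.5-flash"
) -> List[str]:
    """Build the aider command line for one autofix step (hypothetical helper)."""
    argv = [
        "aider",
        "--model", model,
        "--no-check-update",
        "--no-show-release-notes",
        "--no-pretty",
        "--message", prompt,
    ]
    if step == "coding":
        # The coding step lets aider commit so the diff can be captured afterwards.
        argv.append("--auto-commits")
    else:
        # root_cause_analysis and solution only ask questions; no commits.
        argv += ["--chat-mode", "ask"]
    return argv


def run_in_sandbox(
    run_id: str, argv: List[str], runner: Callable = subprocess.run
) -> str:
    """Run `argv` in a throwaway directory that is always removed afterwards."""
    workdir = tempfile.mkdtemp(prefix=f"aider-{run_id}-")
    try:
        result = runner(
            argv,
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=AIDER_TIMEOUT_SECONDS,
        )
        return result.stdout
    finally:
        # Cleanup runs on success, failure, and timeout alike.
        shutil.rmtree(workdir, ignore_errors=True)
```

The `runner` parameter exists only so the sketch can be exercised without aider installed; the unit tests described below patch `subprocess.run` instead.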

Tests

  • test_aider.py — 19 unit tests patching subprocess.run + tempfile + shutil. Covers ask vs coding mode, diff capture, clone failure, clone timeout, aider timeout, aider non-zero, missing binary, empty prompt, should_continue, registry lookup, AutofixAgent contract surface (usage, add_user_message, tools, memory pass-through, assistant-message synthesis), regression guards for --chat-mode ask and --no-check-update.
  • test_select_orchestrator.py — 6 tests covering registry, fallback, strict mode, duplicate-class registration error, idempotent same-class re-registration.
  • test_diff_adapter.py — golden tests for the file-count + path-extraction helpers.
  • test_component_wiring.py — source-level checks that all three autofix components still route through select_orchestrator and forward AUTOFIX_HARNESS_STRICT.
  • test_aider_dogfood.py — gated live test (SEER_AIDER_DOGFOOD=1) for in-container verification.
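
For context on what the file-count and path-extraction helpers cover, here is a small sketch of that kind of parsing over `git diff` output. The function names are hypothetical and the real `diff_adapter.py` may be structured differently; paths that git quotes (for example, names containing spaces) are not handled here.

```python
import re
from typing import List

# Matches the per-file header line that git emits, e.g.
#   diff --git a/src/foo.py b/src/foo.py
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def extract_paths(diff_text: str) -> List[str]:
    """Return the post-change path of every file touched by a git diff."""
    paths: List[str] = []
    for line in diff_text.splitlines():
        match = _DIFF_HEADER.match(line)
        if match and match.group(2) not in paths:
            paths.append(match.group(2))
    return paths


def file_count(diff_text: str) -> int:
    """Number of distinct files touched by the diff."""
    return len(extract_paths(diff_text))
```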

Run in seer container: 35 passed, 2 skipped in 7.14s.

Live verification on sentry-seer-1

Aider v0.65.0
Model: vertex_ai/gemini-2.5-flash with ask edit format
Git repo: .git with 471 files
Repo-map: using 1024 tokens, auto refresh
[... Gemini response ...]
Tokens: 2.5k sent, 31 received. Cost: $0.00082 message, $0.00082 session.

Clone + Vertex Gemini round-trip in ~9s, $0.0008 per invocation. ADC auth via /etc/sentry-extra/seer-vertex-key.json works. select_orchestrator("aider", strict=True) resolves correctly.

Known Phase 2a gaps (Phase 2b backlog)

  • Diff UI persistence: captured diff is logged at INFO only. Real persistence into ChangesStep.changes needs a unified-diff → FilePatch + Hunk parser.
  • Dynamic repo resolution: hardcoded to ledoent/seer on feature/explorer-endpoints. Phase 2b resolves from Sentry code-mappings.
  • Diagnostic context: in --chat-mode ask, aider doesn't auto-load files from the repo map. For real autofix root-cause/solution, pass relevant files via --read <path> extracted from the autofix request's relevant_code_files.
  • Usage attribution: Usage() is zero-init; aider's token-count output (Tokens: 2.5k sent, 31 received) is currently discarded. Phase 2b parses this for accurate per-step token totals.
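
As a rough idea of what the usage-attribution item might involve, the sketch below parses the `Tokens: ... sent, ... received` line from aider's stdout. It is not part of this PR; the function name is hypothetical, and the `k` suffix handling is based only on the sample output shown above, so the recovered counts are approximate whenever aider has already rounded them.

```python
import re
from typing import Optional, Tuple

_TOKENS = re.compile(
    r"Tokens:\s*(\d+(?:\.\d+)?)(k?)\s+sent,\s*(\d+(?:\.\d+)?)(k?)\s+received"
)


def parse_token_line(stdout: str) -> Optional[Tuple[int, int]]:
    """Return (sent, received) from the last token summary in `stdout`, if any."""
    last = None
    for match in _TOKENS.finditer(stdout):
        last = match  # keep the final summary when aider prints several
    if last is None:
        return None

    def scale(number: str, suffix: str) -> int:
        return round(float(number) * (1000 if suffix == "k" else 1))

    return scale(last.group(1), last.group(2)), scale(last.group(3), last.group(4))
```

Returning `None` rather than zeros lets a caller tell "no summary found" apart from a genuinely empty exchange.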

Rollback

Set / leave AUTOFIX_HARNESS=builtin (default) — the AutofixAgent path is untouched. If a wider revert is needed, git revert the 8 commits in reverse order; no schema or migration changes.

dnplkndll added 6 commits May 19, 2026 17:53
… coding components

Each component now resolves AUTOFIX_HARNESS via select_orchestrator
instead of constructing AutofixAgent directly. Default behavior
unchanged (builtin → AutofixAgent); aider path activates only when
AUTOFIX_HARNESS=aider.

Skipped by default; opt in with SEER_AIDER_DOGFOOD=1 to run inside the
seer container after rebuilding :lightweight. Confirms aider can clone
the repo and answer in --ask mode against Vertex Gemini.

Three issues found during VM live test against ghcr.io/ledoent/seer:7-merge:

  * `--ask` is not a flag in aider 0.65.0 — use `--chat-mode ask` instead.
  * Add `--no-check-update --no-show-release-notes` so aider doesn't
    auto-pip-install upgrades mid-run (the auto-upgrade clobbered seer's
    pinned tokenizers when it ran against the system Python).
  * Install `google-auth` and `google-cloud-aiplatform` into
    /opt/aider-venv so litellm's vertex_ai/* model routing can
    authenticate via the existing seer-vertex ADC. Without these,
    aider hits `ModuleNotFoundError: No module named 'google'` from
    litellm.llms.vertex_ai_and_google_ai_studio.

Verified end-to-end on sentry-seer-1: clone + ask-mode aider call
against vertex_ai/gemini-2.5-flash completes in ~9s, returns real
Gemini output, costs ~$0.0008 per invocation.
@dnplkndll

Live verification on sentry-seer-1 ✅

Image pulled and tested against real Vertex Gemini:

Aider v0.65.0
Model: vertex_ai/gemini-2.5-flash with ask edit format
Git repo: .git with 471 files
Repo-map: using 1024 tokens, auto refresh
Initial repo scan can be slow in larger repos, but only happens once.
I cannot answer that question without seeing the contents of
`src/seer/automation/harness/__init__.py`. Please add it to the chat.

Tokens: 2.5k sent, 31 received. Cost: $0.00082 message, $0.00082 session.
  • Clone + Gemini round-trip: ~9 seconds
  • Per-invocation cost: ~$0.0008
  • ADC auth via /etc/sentry-extra/seer-vertex-key.json works
  • select_orchestrator("aider", strict=True) resolved correctly under AUTOFIX_HARNESS_STRICT=1

Three issues found and fixed in 5a36248

  1. --ask is not a real flag in aider 0.65.0 — must use --chat-mode ask.
  2. Aider auto-upgrade clobbered the system Python's tokenizers when it ran pip install --upgrade --upgrade-strategy only-if-needed aider-chat mid-session. Suppressed via --no-check-update --no-show-release-notes.
  3. google-auth was missing from /opt/aider-venv — litellm needs it for vertex_ai/* model routing. Added to the Dockerfile pip install line.

Note on diagnostic mode (deferred to Phase 2b)

The "I need the file" response is correct aider behavior — in --chat-mode ask it doesn't auto-load files from the repo map for diagnostic questions. For real autofix root-cause/solution steps, we'll need to pass relevant files via --read <path> (extracted from the autofix request's relevant_code_files). This refinement belongs in Phase 2b alongside the dynamic-repo-resolution work.
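
One possible shape for the `--read` refinement, shown only as a sketch since it is deferred to Phase 2b. The helper name is hypothetical; `relevant_code_files` is the request field named above, assumed here to be an iterable of repo-relative paths.

```python
from typing import Iterable, List


def with_read_files(argv: List[str], relevant_code_files: Iterable[str]) -> List[str]:
    """Return a copy of `argv` with one `--read <path>` pair per relevant file."""
    extended = list(argv)  # do not mutate the caller's list
    for path in relevant_code_files:
        extended += ["--read", path]
    return extended
```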

This PR delivers Phase 2a — orchestrator wiring + sandboxed invocation + Vertex auth — and is ready to merge behind the default AUTOFIX_HARNESS=builtin flag.

dnplkndll added 2 commits May 20, 2026 09:44
Bugbot review found the standalone smoke test bypassed real component
integration. The autofix components in root_cause / solution / coding
touch the agent in five ways that AiderHarness was missing or breaking:

  * `agent.usage` summed into per-step totals via `cur.usage += agent.usage`
    → would AttributeError. Now zero-inits `Usage()`.
  * `agent.add_user_message(prompt)` called *before* `agent.run(...)` in
    coding + solution → no method existed. Added with same signature as
    `AutofixAgent.add_user_message` and a `_resolve_prompt` fallback
    that picks up the last user message when `run_config.prompt` is empty.
  * `agent.tools = []` set mid-flow in root_cause before the reasoning
    pass → `_unused_tools` private name made it un-settable. Now a public
    `self.tools` attribute.
  * `agent.memory` fed to `llm_client.generate_structured` as the
    formatter LLM's context → empty memory means the formatter has nothing
    to extract from. Now synthesizes an assistant `Message` with aider's
    stdout after each run.
  * Constructor accepts a pre-existing memory list (CodingComponent's
    `_prefill_initial_memory` seeds expand_document tool calls).

Also drops the broken Pydantic attribute assignment in `_record_diff`:
`BaseStep` has no `extra="allow"` config, so the runtime assignment
would fail in Pydantic v2. Phase 2a logs the diff at INFO; Phase 2b adds
proper `ChangesStep.changes` persistence via FilePatch parsing.

New tests cover each contract method plus regression guards for the
flag fixes that landed in 5a36248 (`--chat-mode ask`, `--no-check-update`).
A new test_component_wiring.py source-checks each of the three components
still routes through `select_orchestrator` and forwards
`AUTOFIX_HARNESS_STRICT`.

Pytest test isolation and module reloads can re-import harness modules,
which would trip the existing duplicate-registration guard and crash at
import time even though the registration is a no-op (same name, same
class). Now register_harness:

  * Returns early when name → cls is already in the registry (same cls).
  * Still raises ValueError when name → *different* cls, which is the
    real bug it was guarding against (two modules silently clobbering
    each other's registration).

This is the only path the duplicate-error guard was protecting; the
existing test_duplicate_registration_raises still holds because that
test registers two distinct classes.
@dnplkndll

Self-review pass: bugbot findings + fixes

Re-reading the branch as if it were someone else's PR surfaced real integration bugs the standalone smoke test missed. Addressed in 2 follow-up commits:

🔴 Critical (would fail at first real autofix run)

agent.usage missing on AiderHarness — all three components do cur.usage += agent.usage. Would AttributeError on every autofix invocation. → Fixed in 01c9625 (zero-init Usage()).

agent.add_user_message(...) missing: coding/component.py:161 and solution/component.py:152 push their formatted prompts this way before calling run(). Without this method, the prompt was lost and aider got an empty --message. → Fixed: added the method, plus a _resolve_prompt fallback that picks up the last user message when run_config.prompt is empty.

agent.memory was empty for the formatter LLM — root_cause and solution both call llm_client.generate_structured(messages=agent.memory, ...) after agent.run(). With our empty memory list, the formatter had nothing to extract from. → Fixed: synthesize an assistant Message with aider's stdout after each run.

agent.tools = [] set mid-flow in root_cause/component.py:119 — was stored as _unused_tools, can't be reassigned cleanly. → Fixed: public self.tools attribute.

_record_diff assignment to Pydantic BaseStep would fail: BaseStep has no extra="allow" config, so setting an unknown attribute fails in Pydantic v2. Test mocks hid this because they mocked state.update() entirely. → Fixed: log the diff at INFO; Phase 2b adds ChangesStep.changes persistence via FilePatch parsing.
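
Taken together, the contract surface the components rely on looks roughly like the sketch below. `Usage`, `Message`, and `HarnessSketch` are simplified stand-ins written for illustration; the real seer types carry more fields, and the injected `runner` replaces the aider subprocess call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def __add__(self, other: "Usage") -> "Usage":
        # Lets callers write `cur.usage += agent.usage`.
        return Usage(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
        )


@dataclass
class Message:
    role: str
    content: str


class HarnessSketch:
    def __init__(
        self,
        memory: Optional[List[Message]] = None,
        runner: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.usage = Usage()  # zero-initialised
        self.tools: list = []  # public, so components may reassign it mid-flow
        self.memory: List[Message] = list(memory or [])  # defensive copy
        self._runner = runner or (lambda prompt: "")

    def add_user_message(self, content: str) -> None:
        self.memory.append(Message("user", content))

    def _resolve_prompt(self, prompt: Optional[str]) -> str:
        """Use the explicit prompt, else fall back to the last user message."""
        if prompt:
            return prompt
        for message in reversed(self.memory):
            if message.role == "user":
                return message.content
        return ""

    def run(self, prompt: Optional[str] = None) -> str:
        output = self._runner(self._resolve_prompt(prompt))
        # Give the formatter LLM something to extract from.
        self.memory.append(Message("assistant", output))
        return output
```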

🟡 Medium

register_harness raised on idempotent re-import — pytest module reload would trip the duplicate guard even for the same class. → Fixed in 11b3bb8 (same name → cls is a no-op; only different class under same name still raises).

Docstring inaccuracy — claimed --ask mode; the actual flag set has been --chat-mode ask since 5a36248. → Fixed.

Redundant AIDER_NO_PRETTY=1 env var — already passing --no-pretty flag. → Removed.

🟢 Low (left as-is, documented)

  • _resolve_model_name reads os.environ directly rather than from the AppConfig instance. The harness only gets the AgentConfig (not AppConfig) at construction time, and Pydantic populates AUTOFIX_HARNESS_MODEL from env anyway, so this is fine. Documented.
  • Diff capture timeout exceptions propagate instead of returning empty string. Acceptable — 30s for git diff HEAD~1 HEAD on a single-commit shallow clone would indicate something seriously wrong.
  • Test imports private underscored constants (_AIDER_TIMEOUT_SECONDS). Convention violation but the constants are intentionally importable; tests prefer it over magic numbers.

Test coverage added

8 new tests (test_aider.py now 19 total) covering the contract surface: usage default, tools settable, add_user_message appends, constructor memory seeding (with defensive-copy check), _resolve_prompt fallback, assistant-message synthesis, --chat-mode ask regression, --no-check-update regression.

New test_component_wiring.py source-checks each of the three components still routes through select_orchestrator and forwards AUTOFIX_HARNESS_STRICT — guards against a future refactor silently reverting to direct AutofixAgent(...).

All tests pass in the seer container: 35 passed, 2 skipped in 7.14s.

No UI screenshots

This PR is server-side only (Celery worker code path + a feature flag). No frontend changes; no Sentry UI changes. The "UI surface" affected is the autofix step log in the worker, which is verified via the inline aider stdout shown in the live-verification comment above.

PR description refreshed to reflect the 8-commit shape + Phase 2b backlog.
