Backport agent e2e suite infrastructure and first backport test#510
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces end-to-end (E2E) tests for the backport agent, featuring a new LLM-as-judge evaluation framework and artifact capture utilities. The backport agent's workflow has been refactored for improved modularity, and the triage skill documentation was significantly expanded to provide a more detailed workflow description. Infrastructure updates include new Makefile targets, Docker Compose services, and support for global git configurations in mock environments. Feedback on the new E2E tests identifies critical issues regarding the use of asyncio.run() within fixtures and test functions while pytest-asyncio is active, which can lead to runtime errors; the reviewer recommends converting these to async def and using await directly.
opohorel
left a comment
There was a problem hiding this comment.
LGTM, nice job!
I've verified that DRY_RUN backport works and I've run the test locally, which also passed
TomasTomecek
left a comment
There was a problem hiding this comment.
Opus is truly impressed:
Verdict
The core refactoring is solid and achieves its goal of making the backport workflow testable. The e2e test infrastructure is thoughtfully designed with the mock repo setup, artifact capture, and optional LLM judge layers. The mutable default field and type annotation issues are minor but worth fixing. The SKILL.md regeneration adds noise to the diff but doesn't change behavior. Overall this is good work — address the mutable default and it's ready.
I think these two issues it found are worth addressing:
- Mutable default in BackportState (backport_agent.py around line 839 in the PR):
backport_log: list[str] = Field(default=[])
- Pydantic's Field(default=[]) shares the same list object across instances. This should be Field(default_factory=list). The original code had the same bug inside main() but it was less likely to trigger since
only one State was created per process. With the class now at module level and used in tests, this is more risky.
- local_tool_options initialized outside the mcp_tools context (backport_agent.py line ~862): local_tool_options = {"working_directory": None}
silent_run = os.getenv("SILENT_RUN", "false").lower() == "true"
- local_tool_options is now created at the top of run_workflow(), but silent_run reads from env — this is fine for production but in tests SILENT_RUN is never set explicitly so it defaults to false. Not a bug
but worth documenting.
| VERDICT: FAIL | ||
|
|
||
| Before the verdict, provide a brief explanation for each criterion. | ||
| """ |
There was a problem hiding this comment.
truly fascinating that the evaluation is this short, 200 lines of python, no need for a dedicated framework - brilliant!
| {diff_section} | ||
|
|
||
| {spec_section} | ||
|
|
||
| {patches_section} | ||
|
|
||
| {result_section} | ||
|
|
||
| {reference_section} |
There was a problem hiding this comment.
shouldn't all of these have headers so the model knows what that output is?
There was a problem hiding this comment.
The headers are in the definion of these strings.
683e71a
No description provided.