Skip to content

Backport agent e2e suite infrastructure and first backport test#510

Merged
TomasKorbar merged 9 commits into
packit:mainfrom
TomasKorbar:backport-first-fix
May 21, 2026
Merged

Backport agent e2e suite infrastructure and first backport test#510
TomasKorbar merged 9 commits into
packit:mainfrom
TomasKorbar:backport-first-fix

Conversation

@TomasKorbar
Copy link
Copy Markdown
Collaborator

No description provided.

@TomasKorbar TomasKorbar requested review from nforro and opohorel May 21, 2026 07:29
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces end-to-end (E2E) tests for the backport agent, featuring a new LLM-as-judge evaluation framework and artifact capture utilities. The backport agent's workflow has been refactored for improved modularity, and the triage skill documentation was significantly expanded to provide a more detailed workflow description. Infrastructure updates include new Makefile targets, Docker Compose services, and support for global git configurations in mock environments. Feedback on the new E2E tests identifies critical issues regarding the use of asyncio.run() within fixtures and test functions while pytest-asyncio is active, which can lead to runtime errors; the reviewer recommends converting these to async def and using await directly.

Comment thread ymir/agents/tests/e2e/backport_agent/test_backport.py
Comment thread ymir/agents/tests/e2e/backport_agent/test_backport.py
Comment thread ymir/agents/tests/e2e/backport_agent/test_backport.py
opohorel
opohorel previously approved these changes May 21, 2026
Copy link
Copy Markdown
Collaborator

@opohorel opohorel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice job!
I've verified that DRY_RUN backport works and I've run the test locally, which also passed

nforro
nforro previously approved these changes May 21, 2026
Comment thread ymir/agents/tests/e2e/backport_agent/artifact_capture.py Outdated
Comment thread ymir/agents/tests/e2e/backport_agent/evaluation.py
Comment thread ymir/agents/tests/e2e/backport_agent/test_backport.py Outdated
Comment thread ymir/agents/tests/e2e/backport_agent/test_backport.py Outdated
TomasTomecek
TomasTomecek previously approved these changes May 21, 2026
Copy link
Copy Markdown
Member

@TomasTomecek TomasTomecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opus is truly impressed:

Verdict

The core refactoring is solid and achieves its goal of making the backport workflow testable. The e2e test infrastructure is thoughtfully designed with the mock repo setup, artifact capture, and optional LLM judge layers. The mutable default field and type annotation issues are minor but worth fixing. The SKILL.md regeneration adds noise to the diff but doesn't change behavior. Overall this is good work — address the mutable default and it's ready.

I think these two issues it found are worth addressing:

  1. Mutable default in BackportState (backport_agent.py around line 839 in the PR):
    backport_log: list[str] = Field(default=[])
  • Pydantic's Field(default=[]) shares the same list object across instances. This should be Field(default_factory=list). The original code had the same bug inside main() but it was less likely to trigger since
    only one State was created per process. With the class now at module level and used in tests, this is more risky.
  1. local_tool_options initialized outside the mcp_tools context (backport_agent.py line ~862): local_tool_options = {"working_directory": None}
    silent_run = os.getenv("SILENT_RUN", "false").lower() == "true"
  • local_tool_options is now created at the top of run_workflow(), but silent_run reads from env — this is fine for production but in tests SILENT_RUN is never set explicitly so it defaults to false. Not a bug
    but worth documenting.

VERDICT: FAIL

Before the verdict, provide a brief explanation for each criterion.
"""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truly fascinating that the evaluation is this short, 200 lines of python, no need for a dedicated framework - brilliant!

Comment on lines +170 to +178
{diff_section}

{spec_section}

{patches_section}

{result_section}

{reference_section}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't all of these have headers so the model knows what that output is?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The headers are in the definion of these strings.

lbarcziova
lbarcziova previously approved these changes May 21, 2026
Copy link
Copy Markdown
Member

@lbarcziova lbarcziova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great! 🏅

@TomasKorbar TomasKorbar dismissed stale reviews from lbarcziova, TomasTomecek, nforro, and opohorel via 683e71a May 21, 2026 14:57
Copy link
Copy Markdown
Member

@nforro nforro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TomasKorbar TomasKorbar merged commit 91fe1b1 into packit:main May 21, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants