Integrate autotune agent into the auto agent by shangkunwang01 · Pull Request #42 · AI-Hypercomputer/accelerator-agents

shangkunwang01 · 2026-05-26T17:20:41Z

Upgrade the eval server with asynchronous task polling and proper timeout handling
Add an apply_best_config subagent to the autotune agent
Adapt the prompt and callbacks for integration with the rest of the agents
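
As a rough illustration of the asynchronous task polling mentioned in the first item, here is a minimal sketch. The names `poll_task`, `fetch_status`, `make_fake_server` and the `"pending"`/`"done"`/`"timeout"` status strings are hypothetical and not taken from the PR; the real client talks to the eval server over HTTP, which is replaced here by an in-process stand-in.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_task(
    fetch_status: Callable[[], Awaitable[dict]],
    total_timeout: float,
    interval: float = 0.01,
) -> dict:
  """Polls a (hypothetical) status callable until it finishes or times out."""
  deadline = time.monotonic() + total_timeout
  while True:
    status = await fetch_status()
    if status.get("status") != "pending":
      return status
    if time.monotonic() >= deadline:
      # Give up on the client side; the server task is not touched here.
      return {"status": "timeout"}
    await asyncio.sleep(interval)


def make_fake_server(pending_polls: int) -> Callable[[], Awaitable[dict]]:
  """Stand-in for the server: reports 'pending' N times, then 'done'."""
  remaining = {"n": pending_polls}

  async def fetch_status() -> dict:
    if remaining["n"] > 0:
      remaining["n"] -= 1
      return {"status": "pending"}
    return {"status": "done", "result": 42}

  return fetch_status
```

Calling `asyncio.run(poll_task(make_fake_server(3), total_timeout=5.0))` returns the finished status after three pending polls. A server that never finishes yields `{"status": "timeout"}` once the deadline passes.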

NinaCai · 2026-05-29T03:57:58Z

 TPU_SERVER_PORT = 5463
 CPU_SERVER_PORT = 5464
 EVAL_SERVER_PORT = 1245
-PERF_THRESHOLD = 1.1


Is it not needed anymore?

I didn't see it used anywhere in the code.

I believe this is for evaluation? When perf improves by more than 10%, we consider it an improvement?

The evaluation code is independent of the agent code, and the threshold there was set to 1.05.
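
For readers unfamiliar with this kind of threshold, a minimal sketch of how a value like 1.1 or 1.05 might be applied follows. It assumes the threshold is compared against a speedup ratio (baseline time divided by candidate time); that interpretation, the function name `is_improvement`, and its parameters are assumptions, not something the PR or the evaluation code confirms.

```python
def is_improvement(
    baseline_ms: float, candidate_ms: float, threshold: float = 1.05
) -> bool:
  """Returns True when baseline/candidate speedup exceeds the threshold.

  Assumption: lower time is better and the threshold is a speedup ratio.
  """
  if baseline_ms <= 0 or candidate_ms <= 0:
    raise ValueError("timings must be positive")
  return baseline_ms / candidate_ms > threshold
```

Under this reading, a kernel that drops from 100 ms to 95 ms counts as an improvement at 1.05 but not at 1.1.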

NinaCai · 2026-05-29T04:04:02Z

      logging.info(f"[{self.name}] Running code")
      async with aiohttp.ClientSession(
-        timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT)
+        timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT + 10)


Do we need a 3-hour timeout for compilation?

Same as above.

Why do you add 10 (seconds) here?

The buffer makes the timeout in the outer layer slightly longer than the timeouts in the inner layers.
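
A small sketch of why the outer layer gets a buffer: with nested timeouts, the inner layer then fires first and can report the failure itself. The constants and function names below are illustrative stand-ins (tiny values so the sketch runs quickly), and plain `asyncio.wait_for` is used instead of the `aiohttp` session from the diff.

```python
import asyncio

INNER_TIMEOUT = 0.05  # stands in for REQUEST_TIMEOUT
BUFFER = 0.2  # stands in for the 10-second buffer


async def inner_call(work_seconds: float) -> str:
  """Inner layer: enforces its own timeout and reports it as a result."""
  try:
    await asyncio.wait_for(asyncio.sleep(work_seconds), timeout=INNER_TIMEOUT)
    return "ok"
  except asyncio.TimeoutError:
    return "inner timeout"


async def outer_call(work_seconds: float) -> str:
  """Outer layer: waits slightly longer so the inner layer can answer first."""
  try:
    return await asyncio.wait_for(
      inner_call(work_seconds), timeout=INNER_TIMEOUT + BUFFER
    )
  except asyncio.TimeoutError:
    return "outer timeout"
```

If both layers used the same value, either could fire first and the caller might see the less informative outer error.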

NinaCai · 2026-05-29T04:06:41Z

+from auto_agent.constants import EVAL_SERVER_PORT, REQUEST_TIMEOUT
+
+AUTOTUNE_INDIVIDUAL_TIMEOUT = 300
+AUTOTUNE_TOTAL_TIMEOUT = 5400


I am confused about AUTOTUNE_TOTAL_TIMEOUT vs. REQUEST_TIMEOUT. Can we use REQUEST_TIMEOUT?

REQUEST_TIMEOUT is the client timeout, and I made all the clients (compilation, test, etc.) share the same timeout.
AUTOTUNE_TOTAL_TIMEOUT is the backend timeout (backend timeouts differ between tasks).

So when it reaches 5400 seconds, the backend stops, but the subprocesses still run? I am confused.

When the backend stops (the autotune endpoint times out), subprocesses for combos that have not yet run will no longer be started.

NinaCai · 2026-05-29T04:08:26Z

+        f"Could not connect to server at {url}. Make sure it is running."
+      ),
+    }
+


There is no timeout handling here. When autotune times out, will it kill the eval server and the TPU server?

The timeouts are specified so that eval client timeout > eval server timeout > backend timeout. As a result, if the client (the autotune tool) times out, the autotune backend has most likely already timed out.
In the auto agent, the backend (eval server, TPU server) and the agent are decoupled: no matter what happens in the agent, the backend will not be killed unless you kill it manually.
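
The ordering described above could be checked at startup with something like the following sketch. The function `check_timeout_chain` and the server and client values are hypothetical; only 5400 corresponds to a constant shown in the diff (AUTOTUNE_TOTAL_TIMEOUT).

```python
def check_timeout_chain(client: float, server: float, backend: float) -> None:
  """Raises ValueError unless client > server > backend > 0 (seconds)."""
  if not client > server > backend > 0:
    raise ValueError(
      "expected client > server > backend > 0, got "
      f"client={client}, server={server}, backend={backend}"
    )


# Illustrative values only: 5400 matches AUTOTUNE_TOTAL_TIMEOUT in the diff,
# the other two numbers are made up for this sketch.
BACKEND_TIMEOUT = 5400
SERVER_TIMEOUT = BACKEND_TIMEOUT + 60
CLIENT_TIMEOUT = SERVER_TIMEOUT + 10

check_timeout_chain(CLIENT_TIMEOUT, SERVER_TIMEOUT, BACKEND_TIMEOUT)
```

A check like this does not release the TPU or kill anything; it only guards against a misconfiguration in which an outer layer gives up before an inner one.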

But then the semaphore will be 0 and the next task can run on the server, but it would then fail because the TPU is occupied by other processes. Right?

To verify the timeout, can you run an experiment: make the timeout very small, try 5 ops, and monitor the TPU server log?

NinaCai · 2026-05-29T04:15:21Z

@@ -0,0 +1,25 @@
+"""Prompt for AutotuneSummarizerAgent."""


This is very different from the prompt in hitl_agent. Could we use the hitl prompt? My concern is that this benchmark result will be very different from the hitl result.

I changed it because:

autotune_results is available as state, so there is no need to use the read_file tool.

I need the summary agent to verify that the apply_best_config agent actually applied the change.

I actually asked jetski to make the prompt clearer, and personally I think this prompt is indeed more concise.

Functionally, I don't believe this summary agent will have a big influence on the hitl agent.

NinaCai · 2026-05-29T04:17:21Z

@@ -0,0 +1,13 @@
+"""Prompt for ApplyBestConfigAgent."""


Are we able to use autotune agent to apply the best config?

Yes, this is what this subagent does.

In hitl, there is no apply_best_config_prompt. Can this be merged with autotune agent?

The subagent is added so that it is fully autonomous. It can be a natural 3rd step after we get AutotuneRunner. If we want to merge it, we have to merge it with the summary agent, which seems confusing. I would prefer to keep it as a separate subagent.

I think the goal of autoagent is to benchmark the same system as hitl-agent. My concern is that these two systems will diverge so much that autoagent will no longer be benchmarking hitl. If this prompt separation is necessary and achieves better performance, we should make hitl the same.

NinaCai · 2026-05-29T04:21:41Z

+      for combo in combinations:
+        if (
+          request.total_timeout
+          and (time.time() - start_time) > request.total_timeout


This for loop only generates the combos all at once. I don't think there will be a timeout here?

This timeout is also a guard against what you worry about: the 5400 s autotune timeout has elapsed but the combo cases keep running. With this check, once autotune times out, no new combo will be executed, except perhaps the last one.

But appending all possible configs is done at the beginning, and then all subprocesses start running, so there will be no timeout here.

The subprocesses run one at a time, sequentially (they cannot run together on the TPU), and once the timeout is hit, subprocesses that have not yet started will not run.
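
The behaviour described in this reply can be sketched as follows. This is not the code from the PR: `run_combos`, `run_one`, and the injectable `clock` are hypothetical names, and the real implementation launches a subprocess per combo rather than calling a Python function. The elapsed-time check mirrors the condition in the diff above.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def run_combos(
    combinations: Iterable[dict],
    run_one: Callable[[dict], float],
    total_timeout: Optional[float],
    clock: Callable[[], float] = time.time,
) -> Tuple[List[Tuple[dict, float]], List[dict]]:
  """Runs combos one at a time; skips the rest once total_timeout elapsed."""
  start_time = clock()
  results, skipped = [], []
  for combo in combinations:
    # Checked before each combo starts, so a combo already running when the
    # deadline passes still finishes (the "except perhaps the last one" case).
    if total_timeout and (clock() - start_time) > total_timeout:
      skipped.append(combo)
      continue
    results.append((combo, run_one(combo)))
  return results, skipped
```

With five combos that each take 4 time units and `total_timeout=10`, the first three run (the third starts at 8 and ends at 12) and the last two are skipped.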

NinaCai · 2026-05-29T04:29:05Z

+              }
+            )
+
+        except asyncio.TimeoutError:


This timeout will only capture one process and kill it, and the rest of the subprocesses are still scheduled. My concern is with the rest of the subprocesses: when the next op's autotuning starts running, all of them would fail because the TPU is occupied. It would actually skip autotuning.

We have two timeouts: timeout (for a single combo) and total_timeout (for the full set of combinations). A single-combo timeout does not prevent the other combos in the set from executing. If total_timeout is reached, the remaining combos will not be executed.

I am not sure about the timeout in general. Could you test with total_timeout=single_timeout=300 on 5 ops and see whether the timeout works as expected?
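
How the two timeouts interact can be illustrated with a small asyncio sketch. It is an assumption-laden stand-in, not the PR's code: `autotune` is a hypothetical name, `asyncio.sleep` replaces the real per-combo subprocess, and the outcome strings are made up for illustration.

```python
import asyncio
import time
from typing import List, Optional


async def autotune(
    durations: List[float], timeout: float, total_timeout: Optional[float]
) -> List[str]:
  """Simulates combos that take `durations` seconds each, run sequentially."""
  start = time.monotonic()
  outcomes = []
  for duration in durations:
    if total_timeout and (time.monotonic() - start) > total_timeout:
      outcomes.append("skipped")  # total_timeout reached: do not start it
      continue
    try:
      # Per-combo timeout: cancels only this combo, the loop then continues.
      await asyncio.wait_for(asyncio.sleep(duration), timeout=timeout)
      outcomes.append("ok")
    except asyncio.TimeoutError:
      outcomes.append("timeout")
  return outcomes
```

With a generous `total_timeout`, a slow combo is recorded as `"timeout"` and later combos still run. With a `total_timeout` shorter than the per-combo `timeout`, the combos after the slow one are recorded as `"skipped"`.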

shangkunwang01 added 7 commits May 21, 2026 22:53

feat: implement AutotuneAgent with planner, runner, and summarizer subagents to automate Pallas kernel optimization.

cd5409d

refactor: migrate autotune_kernel to use asynchronous aiohttp requests

1ca5e07

feat: introduce asynchronous task polling for eval server client and implement total timeout logic for TPU autotuning

6e52d68

fix: fix import path for eval client

cff421f

feat: add a apply_best_config subagent into autotune agent and integrate with pipeline agent

eeb3584

fix: update regex to include 'ms' in RESULT_TIME, standardize best configuration naming in prompts and improve summary and autotune prompt.

980244d

feat: incorporate autotuning summary into kernel planning prompts and agent output

cb365a4

shangkunwang01 assigned NinaCai May 26, 2026

NinaCai reviewed May 29, 2026

fix: propagate total_timeout to eval server payload

98defd6

NinaCai self-requested a review June 1, 2026 20:54

NinaCai approved these changes Jun 1, 2026

shangkunwang01 merged commit 2b59ac0 into main Jun 1, 2026
6 checks passed
