Integrate autotune agent into the auto agent #42
shangkunwang01
commented
May 26, 2026
- Upgrade the eval server with asynchronous task polling and proper timeout
- Add an apply_best_config subagent to the autotune agent
- Adapt prompt and callbacks for integration with the rest of the agents
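For readers unfamiliar with the "asynchronous task polling" pattern mentioned above, here is a minimal in-memory sketch, not the eval server's actual code. `TaskRegistry`, `poll_until_done`, and `fake_eval` are hypothetical names invented for illustration; the real server presumably exposes this over HTTP.

```python
import asyncio
import uuid


class TaskRegistry:
    """Hypothetical in-memory registry: submit work, then poll by task id."""

    def __init__(self):
        self._tasks = {}

    def submit(self, coro):
        # Must be called from inside a running event loop.
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = asyncio.create_task(coro)
        return task_id

    def status(self, task_id):
        task = self._tasks[task_id]
        if not task.done():
            return {"state": "running"}
        if task.cancelled():
            return {"state": "cancelled"}
        if task.exception() is not None:
            return {"state": "failed", "error": repr(task.exception())}
        return {"state": "done", "result": task.result()}


async def poll_until_done(registry, task_id, interval=0.01, timeout=1.0):
    """Poll until the task leaves 'running' or the polling deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        status = registry.status(task_id)
        if status["state"] != "running":
            return status
        await asyncio.sleep(interval)
    return {"state": "timeout"}


async def demo():
    async def fake_eval():  # stand-in for a long-running evaluation
        await asyncio.sleep(0.05)
        return {"speedup": 1.2}

    registry = TaskRegistry()
    task_id = registry.submit(fake_eval())
    return await poll_until_done(registry, task_id)
```

The client never holds one long request open: it gets an id back immediately and decides for itself how long to keep polling.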
…bagents to automate Pallas kernel optimization.
…implement total timeout logic for TPU autotuning
…ate with pipeline agent
…nfiguration naming in prompts and improve summary and autotune prompt.
| TPU_SERVER_PORT = 5463 | ||
| CPU_SERVER_PORT = 5464 | ||
| EVAL_SERVER_PORT = 1245 | ||
| PERF_THRESHOLD = 1.1 |
There was a problem hiding this comment.
Is it not needed anymore?
There was a problem hiding this comment.
I didn't see it used anywhere in the code.
There was a problem hiding this comment.
I believe this is for evaluation? When perf improves by more than 10%, we count it as an improvement?
There was a problem hiding this comment.
The evaluation code is independent of any agent code, and its threshold is set to 1.05.
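As a sketch of what such a threshold check could look like, assuming the threshold is a speedup ratio of baseline latency over candidate latency (the thread does not confirm this). `is_improvement` and its parameters are hypothetical names.

```python
def is_improvement(baseline_s: float, candidate_s: float, threshold: float = 1.05) -> bool:
    """True if the candidate is at least `threshold` times faster than baseline.

    Assumption: both arguments are latencies in seconds, so lower is better.
    """
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("timings must be positive")
    return baseline_s / candidate_s >= threshold
```

With the default of 1.05, a kernel that goes from 1.00 s to 0.90 s counts as an improvement, while 1.00 s to 0.98 s does not.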
| logging.info(f"[{self.name}] Running code") | ||
| async with aiohttp.ClientSession( | ||
| timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT) | ||
| timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT + 10) |
There was a problem hiding this comment.
Do we need a 3-hour timeout for compilation?
There was a problem hiding this comment.
Same as above.
There was a problem hiding this comment.
Why do you add 10 (seconds) here?
There was a problem hiding this comment.
The buffer makes the outer layer's timeout slightly longer than the inner layers', so the inner layer times out first.
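A small sketch of that layering, using plain `asyncio.wait_for` rather than the aiohttp client from the diff. `inner_call`, `outer_call`, and `BUFFER` are hypothetical names, and the durations are scaled down from the real values.

```python
import asyncio

BUFFER = 0.2  # stand-in for the 10-second buffer, scaled down for illustration


async def inner_call(work_seconds, inner_timeout):
    """Inner layer: enforces its own timeout and returns a structured result."""
    try:
        await asyncio.wait_for(asyncio.sleep(work_seconds), timeout=inner_timeout)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return {"status": "inner_timeout"}


async def outer_call(work_seconds, inner_timeout, buffer=BUFFER):
    """Outer layer: waits slightly longer so the inner layer reports first."""
    try:
        return await asyncio.wait_for(
            inner_call(work_seconds, inner_timeout),
            timeout=inner_timeout + buffer,
        )
    except asyncio.TimeoutError:
        return {"status": "outer_timeout"}
```

Because the inner timeout fires first, the caller sees the inner layer's structured error instead of a bare client-side timeout.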
| from auto_agent.constants import EVAL_SERVER_PORT, REQUEST_TIMEOUT | ||
|
|
||
| AUTOTUNE_INDIVIDUAL_TIMEOUT = 300 | ||
| AUTOTUNE_TOTAL_TIMEOUT = 5400 |
There was a problem hiding this comment.
I am confused about AUTOTUNE_TOTAL_TIMEOUT vs REQUEST_TIMEOUT. Can we use REQUEST_TIMEOUT?
There was a problem hiding this comment.
REQUEST_TIMEOUT is the client timeout, and I made all clients (compilation, test, etc.) share the same timeout.
AUTOTUNE_TOTAL_TIMEOUT is the backend timeout (backend timeouts differ per task).
There was a problem hiding this comment.
So when it is 5400 seconds, the backend stops, but the subprocesses still run? I am confused.
There was a problem hiding this comment.
When the backend stops (the autotune endpoint times out), subprocesses for combos that have not yet run will no longer be started.
| f"Could not connect to server at {url}. Make sure it is running." | ||
| ), | ||
| } | ||
|
|
There was a problem hiding this comment.
There is no timeout handling here. When autotune times out, will it kill the eval server and the TPU server?
There was a problem hiding this comment.
The timeouts are specified so that eval client timeout > eval server timeout > backend timeout. As a result, if the client (the autotune tool) times out, the autotune backend has most likely already timed out.
In the auto agent, the backend (eval server, TPU server) and the agent are decoupled: no matter what happens in the agent, the backend will not be killed unless you kill it manually.
There was a problem hiding this comment.
But then the semaphore will be 0, and the next task can run on the server, but it would then fail because the TPU is occupied by other processes. Right?
To verify the timeout, can you run an experiment: make the timeout very small, try 5 ops, and monitor the TPU server log?
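One way to address the "TPU is occupied" concern is to kill and reap the timed-out child before the semaphore is released. The sketch below assumes the server launches each run as a subprocess guarded by an `asyncio.Semaphore`; `run_exclusive` is a hypothetical helper, not code from this PR, and it only kills the direct child, not any grandchildren.

```python
import asyncio
import sys


async def run_exclusive(semaphore, argv, timeout):
    """Run argv under the semaphore; on timeout, kill and reap before releasing."""
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(*argv)
        try:
            returncode = await asyncio.wait_for(proc.wait(), timeout=timeout)
            return {"status": "ok", "returncode": returncode}
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()  # reap the child so the device is actually free
            return {"status": "timeout", "returncode": proc.returncode}


async def demo():
    semaphore = asyncio.Semaphore(1)
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    fast = [sys.executable, "-c", "pass"]
    first = await run_exclusive(semaphore, slow, timeout=0.5)
    second = await run_exclusive(semaphore, fast, timeout=30)
    return first, second
```

Here the second task only acquires the semaphore after the first child has exited, which is the property the small-timeout experiment would need to confirm on the real TPU server.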
| @@ -0,0 +1,25 @@ | |||
| """Prompt for AutotuneSummarizerAgent.""" | |||
There was a problem hiding this comment.
This is very different from the prompt in hitl_agent. Could we use the hitl prompt? My concern is this benchmark result will be very different from hitl result.
There was a problem hiding this comment.
I changed it because:
- autotune_results is available as state, so there is no need to use the read_file tool.
- I need the summary agent to verify that the apply_best_config agent actually applied the change.
- I asked jetski to make the prompt clearer, and personally I think this prompt is indeed more concise.
Functionally, I don't believe this summary agent will have a big influence on the hitl agent.
| @@ -0,0 +1,13 @@ | |||
| """Prompt for ApplyBestConfigAgent.""" | |||
There was a problem hiding this comment.
Are we able to use autotune agent to apply the best config?
There was a problem hiding this comment.
Yes, this is what this subagent does.
There was a problem hiding this comment.
In hitl, there is no apply_best_config_prompt. Can this be merged with autotune agent?
There was a problem hiding this comment.
The subagent is added so that it is fully autonomous. It is a natural third step after AutotuneRunner. If we wanted to merge it, we would have to merge it with the summary agent, which seems confusing. I would prefer to keep it a separate subagent.
There was a problem hiding this comment.
I think the goal of autoagent is to benchmark the same system as hitl-agent. My concern is that these two systems will diverge so much that autoagent will no longer be benchmarking hitl. If this prompt separation is necessary and achieves better performance, we should make hitl the same.
| for combo in combinations: | ||
| if ( | ||
| request.total_timeout | ||
| and (time.time() - start_time) > request.total_timeout |
There was a problem hiding this comment.
This for loop only generates the combos, all at once. I don't think there will be a timeout here?
There was a problem hiding this comment.
This timeout also guards against what you are worried about: the 5400s autotune timeout has passed but the combo cases keep running. With this check, once the autotune times out, no new combo will be executed, except perhaps the last one.
There was a problem hiding this comment.
But appending all possible configs is done at the beginning, and then all subprocesses start running, so there will be no timeout here.
There was a problem hiding this comment.
The subprocesses run sequentially, one by one (they cannot run together on the TPU), and once the timeout is hit, any subprocess that has not yet started will not run.
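A minimal sketch of that sequential guard, assuming each combo is run synchronously inside the loop. `run_combos` and `run_one` are hypothetical names, and the sketch uses `time.monotonic` where the diff uses `time.time()`; the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


def run_combos(combinations, run_one, total_timeout=None, clock=time.monotonic):
    """Run combos one at a time; stop starting new ones once the budget is spent."""
    start = clock()
    results, skipped = [], []
    for combo in combinations:
        # Checked before each run, so a combo that is already running may
        # finish after the budget has elapsed.
        if total_timeout and (clock() - start) > total_timeout:
            skipped.append(combo)
            continue
        results.append((combo, run_one(combo)))
    return results, skipped
```

With a budget of 100 s and combos that take 40 s each, the third combo starts at 80 s and finishes at 120 s, and the remaining combos are skipped. That overshoot is the "except perhaps the last one" case.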
| } | ||
| ) | ||
|
|
||
| except asyncio.TimeoutError: |
There was a problem hiding this comment.
This timeout will only catch one process and kill it; the rest of the subprocesses are still scheduled. My concern is that, with those remaining subprocesses, when the next op's autotuning starts running, all of its runs would fail because the TPU is occupied. It would effectively skip autotuning.
There was a problem hiding this comment.
We have two timeouts: timeout (for a single combo) and total_timeout (for the full set of combinations). A single-combo timeout will not prevent the other combos in the set from executing. If total_timeout is reached, the remaining combos will not be executed.
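The interaction of the two timeouts can be sketched as follows, with `asyncio.sleep` standing in for a combo's subprocess and all durations scaled down. `autotune` here is a hypothetical function written for illustration, not the endpoint in this PR.

```python
import asyncio


async def autotune(durations, timeout, total_timeout):
    """Apply a per-combo timeout and an overall budget to sequential combos."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    outcomes = []
    for seconds in durations:
        if loop.time() - start > total_timeout:
            outcomes.append("skipped")  # budget spent: do not start this combo
            continue
        try:
            await asyncio.wait_for(asyncio.sleep(seconds), timeout=timeout)
            outcomes.append("ok")
        except asyncio.TimeoutError:
            outcomes.append("timeout")  # only this combo is abandoned
    return outcomes
```

A single slow combo yields one `"timeout"` while its neighbours still run; once the total budget is exceeded, every remaining combo is reported as `"skipped"`.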
There was a problem hiding this comment.
I am not sure about the timeout in general. Could you test with total_timeout=single_timeout=300 on 5 ops and see if the timeout works as expected?