
Integrate autotune agent into the auto agent #42

Merged
shangkunwang01 merged 8 commits into main from shangkun-auto-agent-adapt-autotune on Jun 1, 2026

@shangkunwang01
Collaborator

  1. Upgrade the eval server with asynchronous task polling and proper timeouts
  2. Add an apply_best_config subagent to the autotune agent
  3. Adapt prompts and callbacks for integration with the rest of the agents
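
The asynchronous task polling mentioned in item 1 is not shown in this excerpt. As a rough sketch of the general pattern only, assuming a hypothetical in-memory task store in place of the eval server and hypothetical names (`submit_task`, `get_status`, `poll_until_done`), it might look like this:

```python
import asyncio
import time
import uuid

# Hypothetical in-memory store standing in for the eval server's task table.
_TASKS = {}          # task_id -> {"status": ..., "result": ...}
_BACKGROUND = set()  # holds references so background tasks are not garbage-collected


async def _do_work(task_id, work_seconds):
    """Simulate server-side work, then record the result."""
    await asyncio.sleep(work_seconds)
    _TASKS[task_id] = {"status": "done", "result": "ok"}


def submit_task(work_seconds):
    """Start the work in the background and return a task id immediately.

    Must be called while an event loop is running.
    """
    task_id = uuid.uuid4().hex
    _TASKS[task_id] = {"status": "running", "result": None}
    task = asyncio.get_running_loop().create_task(_do_work(task_id, work_seconds))
    _BACKGROUND.add(task)
    task.add_done_callback(_BACKGROUND.discard)
    return task_id


def get_status(task_id):
    """Return the current state of a task (what a status endpoint would return)."""
    return _TASKS[task_id]


async def poll_until_done(task_id, timeout, interval=0.01):
    """Poll the task state until it finishes or the client-side deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status(task_id)
        if state["status"] != "running":
            return state
        await asyncio.sleep(interval)
    return {"status": "timeout", "result": None}
```

The point of the pattern is that the submit call returns at once, so no single request has to stay open for the full duration of a long compilation or autotune run.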

TPU_SERVER_PORT = 5463
CPU_SERVER_PORT = 5464
EVAL_SERVER_PORT = 1245
PERF_THRESHOLD = 1.1

Collaborator

Is PERF_THRESHOLD no longer needed?

Collaborator Author

I didn't see it used anywhere in the code.

Collaborator

I believe this is for evaluation? When perf improves by more than 10%, we count it as an improvement?

Collaborator Author

The evaluation code is independent of any agent code, and the threshold there was set to 1.05.

Comment thread MaxKernel/auto_agent/constants.py
  logging.info(f"[{self.name}] Running code")
  async with aiohttp.ClientSession(
-     timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT)
+     timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT + 10)

Collaborator

Do we need a 3-hour timeout for compilation?

Collaborator Author

Same as above.

Collaborator

Why do you add 10 (seconds) here?

Collaborator Author

The buffer makes the timeout in the outer layer slightly longer than the timeouts in the inner layers.
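
To illustrate why an outer timeout gets a small buffer, here is a sketch using plain `asyncio` rather than the PR's aiohttp code. `INNER_TIMEOUT`, `BUFFER`, `inner_layer`, and `outer_layer` are hypothetical names, and the values are scaled down so the example runs quickly. With the buffer, the inner layer times out first and can return its own structured error instead of being cut off from outside:

```python
import asyncio

INNER_TIMEOUT = 0.05  # hypothetical; stands in for the inner layer's limit
BUFFER = 0.5          # hypothetical; plays the role of the "+ 10" seconds


async def inner_layer(work_seconds):
    """Inner layer: enforces its own limit and reports a structured error."""
    try:
        await asyncio.wait_for(asyncio.sleep(work_seconds), timeout=INNER_TIMEOUT)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return {"status": "error", "reason": "inner timeout"}


async def outer_layer(work_seconds):
    """Outer layer: waits slightly longer, so the inner error normally arrives first."""
    try:
        return await asyncio.wait_for(
            inner_layer(work_seconds), timeout=INNER_TIMEOUT + BUFFER
        )
    except asyncio.TimeoutError:
        # Only reached if the inner layer fails to respond at all.
        return {"status": "error", "reason": "outer timeout"}
```

If both layers used the same limit, which one fires first would depend on scheduling, and the more informative inner error could be lost.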

from auto_agent.constants import EVAL_SERVER_PORT, REQUEST_TIMEOUT

AUTOTUNE_INDIVIDUAL_TIMEOUT = 300
AUTOTUNE_TOTAL_TIMEOUT = 5400

Collaborator

I am confused by AUTOTUNE_TOTAL_TIMEOUT vs REQUEST_TIMEOUT. Can we just use REQUEST_TIMEOUT?

Collaborator Author

REQUEST_TIMEOUT is the client timeout; I made all clients (compilation, test, etc.) share the same timeout.
AUTOTUNE_TOTAL_TIMEOUT is the backend timeout (backend timeouts differ from task to task).

Collaborator

So when it reaches 5400 seconds, the backend stops, but the subprocesses still run? I am confused.

Collaborator Author

When the backend stops (the autotune endpoint times out), subprocesses for combos that have not yet run will not be started.

f"Could not connect to server at {url}. Make sure it is running."
),
}

Collaborator

There is no timeout handling here. When autotune times out, will it kill the eval server and the TPU server?

Collaborator Author

The timeouts are specified so that eval client timeout > eval server timeout > backend timeout. As a result, if the client (the autotune tool) times out, the autotune backend has most likely already timed out.
In the auto agent, the backend (eval server, TPU server) and the agent are decoupled: no matter what happens in the agent, the backend will not be killed unless you kill it manually.
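
The ordering above can be written down as a simple check. In this sketch, `AUTOTUNE_TOTAL_TIMEOUT = 5400` and the `+ 10` client buffer come from the diff; the value of `REQUEST_TIMEOUT` (taken from the "3 hours" mentioned in review) and the mapping of constants to layers are assumptions, and `timeout_chain_violations` is a hypothetical helper:

```python
AUTOTUNE_TOTAL_TIMEOUT = 5400  # from the diff
REQUEST_TIMEOUT = 3 * 60 * 60  # assumption: the "3 hours" mentioned in review
CLIENT_BUFFER = 10             # the "+ 10" from the diff


def timeout_chain_violations(layers):
    """Check that each layer's timeout strictly exceeds the next inner layer's.

    `layers` is a list of (name, seconds) pairs ordered from outermost to
    innermost. Returns one message per adjacent pair that breaks the ordering.
    """
    violations = []
    for (outer_name, outer), (inner_name, inner) in zip(layers, layers[1:]):
        if outer <= inner:
            violations.append(
                f"{outer_name} ({outer}s) must exceed {inner_name} ({inner}s)"
            )
    return violations


# Assumed mapping of constants to layers, for illustration only.
LAYERS = [
    ("eval client", REQUEST_TIMEOUT + CLIENT_BUFFER),
    ("eval server", REQUEST_TIMEOUT),
    ("autotune backend", AUTOTUNE_TOTAL_TIMEOUT),
]
```

Running such a check at import time or in a unit test would catch a later change that accidentally inverts the ordering.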

Collaborator

But then the semaphore will be 0, and the next task can run on the server, but it would then fail because the TPU is occupied by other processes. Right?

To verify the timeout, can you run an experiment: make the timeout very small, try 5 ops, and monitor the TPU server log?

@@ -0,0 +1,25 @@
"""Prompt for AutotuneSummarizerAgent."""

Collaborator

This is very different from the prompt in hitl_agent. Could we use the hitl prompt? My concern is that the benchmark result will be very different from the hitl result.

Collaborator Author

I changed it because:

  1. autotune_results is available as state, so there is no need to use the read_file tool.
  2. I need the summary agent to verify that the apply_best_config agent actually applied the change.
  3. I asked jetski to make the prompt clearer, and personally I think this prompt is indeed more concise.

Functionally, I don't believe this summary agent will have a big influence on the hitl agent.

@@ -0,0 +1,13 @@
"""Prompt for ApplyBestConfigAgent."""

Collaborator

Are we able to use the autotune agent to apply the best config?

Collaborator Author

Yes, this is what this subagent does.

Collaborator

In hitl, there is no apply_best_config_prompt. Can this be merged with the autotune agent?

Collaborator Author

The subagent is added so that the autotune agent is fully autonomous. It is a natural 3rd step after AutotuneRunner. If we want to merge it, we have to merge it with the summary agent, which seems confusing. I would prefer to keep it a separate subagent.

Collaborator

I think the goal of autoagent is to benchmark the same system as hitl-agent. My concern is that these two systems will diverge so much that autoagent no longer benchmarks hitl. If this prompt separation is necessary and achieves better performance, we should make hitl the same.

for combo in combinations:
if (
request.total_timeout
and (time.time() - start_time) > request.total_timeout

Collaborator

This for loop only generates the combos all at once; I don't think there will be a timeout here?

Collaborator Author

This timeout is also a guard against what you are worried about: the 5400s autotune timeout has passed but the combo cases keep running. With this check, once autotune times out, no new combo will be executed, except perhaps the last one.

Collaborator

But appending all possible configs is done at the beginning, and then all subprocesses start running, so there will be no timeout here.

Collaborator Author

The subprocesses run one by one, sequentially (they cannot run together on the TPU), and once the timeout is hit, subprocesses that have not yet run will not be started.

}
)

except asyncio.TimeoutError:

Collaborator

This timeout will only catch one process and kill it; the rest of the subprocesses are still scheduled. My concern is with those remaining subprocesses: when the next op's autotuning starts running, all of them would fail because the TPU is occupied. It would effectively skip autotuning.

Collaborator Author

We have two timeouts: timeout (for a single combo) and total_timeout (for the full set of combinations). A single-combo timeout will not prevent the other combos in the set from executing. If total_timeout is reached, the remaining combos will not be executed.
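
A minimal sketch of the two timeouts described above, assuming combos run sequentially. `run_combo` and `run_all_combos` are hypothetical names and not the PR's functions, and the sketch uses `time.monotonic()` where the diff uses `time.time()`. The per-combo limit is enforced with `asyncio.wait_for`, and the total limit is checked before each new combo starts:

```python
import asyncio
import time


async def run_combo(seconds):
    """Hypothetical stand-in for running one config on the TPU."""
    await asyncio.sleep(seconds)
    return "ok"


async def run_all_combos(combos, timeout, total_timeout):
    """Run combos one at a time with a per-combo and a total time limit."""
    results = []
    start_time = time.monotonic()
    for combo in combos:
        # Total limit: checked only before starting a combo, so a combo that
        # is already running is not interrupted by it.
        if total_timeout and (time.monotonic() - start_time) > total_timeout:
            results.append((combo, "skipped"))
            continue
        try:
            # Per-combo limit: a timeout here affects only this combo.
            status = await asyncio.wait_for(run_combo(combo), timeout=timeout)
        except asyncio.TimeoutError:
            status = "timeout"
        results.append((combo, status))
    return results
```

Because the total limit is only checked between combos, the combo in flight when the limit passes still finishes (or hits its own per-combo timeout), which matches the "except perhaps the last one" caveat earlier in the thread.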

Collaborator

I am not sure about the timeout in general. Could you test with total_timeout=single_timeout=300 on 5 ops and see if the timeout works as expected?

@NinaCai NinaCai self-requested a review June 1, 2026 20:54
@shangkunwang01 shangkunwang01 merged commit 2b59ac0 into main Jun 1, 2026
6 checks passed
2 participants