[bug] Switch to reload_weights API for loading weights in legacy inference codepath #1685

Open
SumanthRH wants to merge 1 commit into main from fix-load-weights-moe

Conversation

@SumanthRH
Member

What does this PR do?

Fixes #1680

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH marked this pull request as ready for review on May 18, 2026 05:33
Contributor

gemini-code-assist (Bot) left a comment

Code Review

This pull request updates the weight loading logic in vllm_worker.py by introducing the set_current_vllm_config context manager and switching to model_runner.reload_weights. Feedback suggests optimizing memory efficiency by passing the weight generator directly to the loading function instead of accumulating tensors in an intermediate list.

Comment on lines 92 to +96
 for name, tensor in self._weight_receiver.receive_weights(request):
     weight_list.append((name, tensor))

-self.model_runner.model.load_weights(weights=weight_list)
+with torch.device(self.device), set_current_vllm_config(self.vllm_config):
+    self.model_runner.reload_weights(weights_iterator=iter(weight_list))

medium

Instead of collecting all weights into an intermediate list, you can pass the generator from receive_weights directly to reload_weights. This reduces memory overhead by avoiding storing all tensors in a list simultaneously, which is particularly important for large models. It also allows vLLM to pipeline the weight loading process as tensors are received.

Note that this change makes the subsequent weight_list cleanup loop (lines 98-99) redundant, because the list stays empty. Those lines are unchanged context in this diff, so they can be left as-is or removed in a separate cleanup.

Suggested change
-for name, tensor in self._weight_receiver.receive_weights(request):
-    weight_list.append((name, tensor))
-self.model_runner.model.load_weights(weights=weight_list)
-with torch.device(self.device), set_current_vllm_config(self.vllm_config):
-    self.model_runner.reload_weights(weights_iterator=iter(weight_list))
+with torch.device(self.device), set_current_vllm_config(self.vllm_config):
+    self.model_runner.reload_weights(
+        weights_iterator=self._weight_receiver.receive_weights(request))
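
To illustrate the memory argument in the comment above, here is a minimal, self-contained sketch that compares buffering every received tensor in a list against streaming the generator straight into the loader. Everything in it is a hypothetical stand-in: `Tracker`, this toy `receive_weights`, and this toy `reload_weights` only simulate allocation accounting and are not the vLLM or worker APIs. It also assumes the loader consumes the iterator lazily, one item at a time, which the PR does not confirm for `model_runner.reload_weights`.

```python
from typing import Iterator, List, Tuple


class Tracker:
    """Simulated accounting of how many received buffers are alive at once."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def allocate(self, size: int) -> bytearray:
        self.live += 1
        self.peak = max(self.peak, self.live)
        return bytearray(size)

    def release(self) -> None:
        self.live -= 1


def receive_weights(tracker: Tracker, n: int) -> Iterator[Tuple[str, bytearray]]:
    """Hypothetical receiver: yields one (name, buffer) pair at a time."""
    for i in range(n):
        yield f"layer.{i}.weight", tracker.allocate(1024)


def reload_weights(weights_iterator: Iterator[Tuple[str, bytearray]]) -> List[str]:
    """Hypothetical loader: consumes the iterator lazily, item by item."""
    return [name for name, _buffer in weights_iterator]


def load_buffered(tracker: Tracker, n: int) -> List[str]:
    """Collect everything first, load, then clean up (the pre-suggestion shape)."""
    weight_list = list(receive_weights(tracker, n))
    names = reload_weights(iter(weight_list))
    for _ in weight_list:  # models the separate cleanup loop
        tracker.release()
    weight_list.clear()
    return names


def load_streaming(tracker: Tracker, n: int) -> List[str]:
    """Pass the generator through; each buffer is released once consumed."""

    def release_after_use() -> Iterator[Tuple[str, bytearray]]:
        for item in receive_weights(tracker, n):
            yield item
            tracker.release()  # runs when the loader asks for the next item

    return reload_weights(release_after_use())


buffered, streaming = Tracker(), Tracker()
buffered_names = load_buffered(buffered, 4)
streaming_names = load_streaming(streaming, 4)
print("peak live buffers (buffered): ", buffered.peak)
print("peak live buffers (streaming):", streaming.peak)
```

In this simulation both paths load the same names, but the buffered path holds all buffers at its peak while the streaming path holds one. A generator can be consumed only once, so any later code that expects to iterate over the received weights again (for example a cleanup loop) would see nothing.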

Development

Successfully merging this pull request may close these issues.

WorkerWrap.load_weights silently corrupts unquantized MoE rollouts after the first weight sync since vllm>=0.20.0