WorkerWrap.load_weights calls vLLM's raw model.load_weights(...) directly. Per upstream vllm-project/vllm#42821, that entrypoint has been broken for unquantized MoE on FlashInfer backends (e.g. Qwen/Qwen3.6-35B-A3B) since vllm==0.20.0 was pulled in via #1628.
SkyRL should move to self.model_runner.reload_weights(weights_iterator=...), which is idempotent across repeated weight syncs.
+ from vllm.config import set_current_vllm_config
  ...
  def load_weights(self, request: bytes) -> None:
      ...
      weight_list = []
      for name, tensor in self._weight_receiver.receive_weights(request):
          weight_list.append((name, tensor))
-     self.model_runner.model.load_weights(weights=weight_list)
+     with set_current_vllm_config(self.vllm_config):
+         self.model_runner.reload_weights(weights_iterator=iter(weight_list))
      for weight in weight_list:
          del weight
This will also match vLLM's own reload_weights RPC.
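To make the idempotence point concrete, here is a toy sketch in plain Python. It is not vLLM code: ToyRunner, raw_load, the list-reversal "kernel layout", and the choice of failure mode are all hypothetical, and nothing here is confirmed to be the actual cause in vllm-project/vllm#42821. It only illustrates one plausible way a raw load can diverge from a reload entrypoint when a backend keeps weights in a post-processed layout.

```python
from typing import Dict, Iterator, List, Tuple

WeightIter = Iterator[Tuple[str, List[float]]]


class ToyRunner:
    """Hypothetical stand-in for a model runner; NOT vLLM's implementation."""

    def __init__(self) -> None:
        # Weights as the (imaginary) kernel expects them, i.e. post-processed.
        self.kernel_weights: Dict[str, List[float]] = {}

    @staticmethod
    def _to_kernel_layout(values: List[float]) -> List[float]:
        # Assumption for illustration: the backend wants a reordered layout.
        return list(reversed(values))

    def raw_load(self, weights: WeightIter) -> None:
        # Copies checkpoint-layout values straight in and skips post-processing,
        # so the kernel would read them in the wrong layout.
        for name, values in weights:
            self.kernel_weights[name] = list(values)

    def reload_weights(self, weights_iterator: WeightIter) -> None:
        # Rebuilds the state from checkpoint layout on every call, so calling
        # it once or many times with the same weights gives the same result.
        fresh: Dict[str, List[float]] = {}
        for name, values in weights_iterator:
            fresh[name] = self._to_kernel_layout(values)
        self.kernel_weights = fresh


if __name__ == "__main__":
    checkpoint = [("expert.w", [1.0, 2.0, 3.0])]
    runner = ToyRunner()

    runner.reload_weights(iter(checkpoint))
    after_first_sync = dict(runner.kernel_weights)
    runner.reload_weights(iter(checkpoint))
    print(runner.kernel_weights == after_first_sync)  # repeated sync is stable

    runner.raw_load(iter(checkpoint))
    print(runner.kernel_weights == after_first_sync)  # raw load left a different layout
```

In this toy, the reload path owns the full "checkpoint layout in, kernel layout out" step, which is the property the proposed change relies on across repeated weight syncs.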