Describe the bug
When running FeTS-Challenge Task 1 on a multi-GPU instance, the run crashes in the `send_model_to_device` function.
To Reproduce
Steps to reproduce the behavior:
1. Expose four GPUs to the process: `export CUDA_VISIBLE_DEVICES=0,1,2,3`
2. Run FeTS-Challenge from FeTS-AI/Challenge#204 ("Migrating TaskRunner based FeTS Task_1 Challenge to Workflow API").
3. The run crashes in `send_model_to_device` with the following traceback:
```
  File "/home/azureuser/Work/Challenge/Task_1/fets_challenge/fets_challenge_model.py", line 110, in validate
    epoch_valid_loss, epoch_valid_metric = validate_network(
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/forward_pass.py", line 284, in validate_network
    result = step(model, image, label, params, train=True)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/GANDLF/compute/step.py", line 78, in step
    output = model(image)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/azureuser/Work/fets-venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
```
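For triage, a small helper along these lines might help identify which parameter or buffer is still on the CPU when `DataParallel` raises the error above. This is only a sketch and not part of GaNDLF or FeTS-Challenge: `find_misplaced_tensors` and the stand-in `fake_model` are hypothetical names, and the helper assumes the object it receives behaves like a `torch.nn.Module` (it uses only `named_parameters()`, `named_buffers()`, and each tensor's `.device`).

```python
from types import SimpleNamespace


def find_misplaced_tensors(module, expected_device="cuda:0"):
    """Return (name, device) pairs for parameters and buffers not on expected_device.

    `module` is assumed to behave like a torch.nn.Module: only
    named_parameters(), named_buffers() and each tensor's .device are used.
    """
    misplaced = []
    for kind, items in (("param", module.named_parameters()),
                        ("buffer", module.named_buffers())):
        for name, tensor in items:
            device = str(tensor.device)  # e.g. "cpu" or "cuda:0"
            if device != expected_device:
                misplaced.append((f"{kind}:{name}", device))
    return misplaced


# Hypothetical stand-in for a model whose weights are on the GPU while one
# buffer was left on the CPU. With PyTorch installed, pass the real model.
fake_model = SimpleNamespace(
    named_parameters=lambda: [("conv.weight", SimpleNamespace(device="cuda:0"))],
    named_buffers=lambda: [("norm.running_mean", SimpleNamespace(device="cpu"))],
)
print(find_misplaced_tensors(fake_model))
```

Calling the helper on the real model just before the failing `model(image)` call in `step` should list any tensors that were not moved to `cuda:0`; an empty list would suggest the problem lies elsewhere.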
Expected behavior
The run should complete on multiple GPUs without crashing.
Environment information
GANDLF version: 0.1.0
Git hash: 4d614fe
Platform: Linux-6.11.0-1012-azure-x86_64-with-glibc2.39
Machine: x86_64
Processor: x86_64
Architecture: 64bit ELF
Python environment:
Version: 3.10.1
Implementation: CPython
Compiler: GCC 13.3.0
Build: main Apr 7 2025 07:01:16