On my system nvx works as expected after booting:
- nvidia dGPU is disabled
- nvx on works: turns dGPU on
- nvx off works: turns dGPU off
- nvx start works: launches program, turns on dGPU, and then turns off dGPU when program finishes
But after some time the daemon becomes unresponsive:
File "/usr/bin/nvx", line 299, in <module>
sock.recv(1024).decode("utf-8")
~~~~~~~~~^^^^^^
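The traceback shows the client blocked in sock.recv while the daemon is frozen. As a rough sketch (this is not how nvx is implemented; recv_reply and timeout_s are hypothetical names), a timeout on the client socket would turn the hang into a detectable error:

```python
import socket


def recv_reply(sock, timeout_s=5.0):
    """Wait up to timeout_s seconds for a reply; return None if nothing arrives.

    Hypothetical helper: 'sock' is any already-connected socket object.
    """
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(1024)
    except socket.timeout:
        # The peer did not answer in time, e.g. because the daemon is stuck.
        return None
    return data.decode("utf-8")
```

With this, a client could print a hint such as "daemon not responding" instead of blocking until it is interrupted.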
I looked at the source code, and the log points toward the "remove PCI device" step.
This is what I can see after the daemon freezes:
- the PCI device of the Nvidia dGPU is still available
- the "remove" interface of the Nvidia dGPU is missing
- the PCI bridge is still available
- writing to "[bridge path]/power/control" (e.g. "auto") freezes the terminal
- the "nvidia_drm", "nvidia_modeset", and "nvidia" kernel modules are still loaded
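The checks in the list above can be collected in one read-only snapshot. This is a minimal sketch under my own assumptions: the function names are hypothetical, and the PCI addresses in the usage note are the ones from the commands in this report. It deliberately does not touch power/control, since writing to it is what hangs.

```python
from pathlib import Path

# The three modules named in this report.
NVIDIA_MODULES = ("nvidia_drm", "nvidia_modeset", "nvidia")


def loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def snapshot(sysfs_pci, gpu, bridge, proc_modules_text):
    """Collect the observations from the list above without writing anything."""
    gpu_dir = Path(sysfs_pci) / gpu
    bridge_dir = Path(sysfs_pci) / bridge
    mods = loaded_modules(proc_modules_text)
    return {
        "gpu_present": gpu_dir.exists(),
        "gpu_remove_present": (gpu_dir / "remove").exists(),
        "gpu_reset_present": (gpu_dir / "reset").exists(),
        "bridge_present": bridge_dir.exists(),
        "loaded_nvidia_modules": [m for m in NVIDIA_MODULES if m in mods],
    }
```

On a real system it could be called as snapshot("/sys/bus/pci/devices", "0000:01:00.0", "0000:00:1c.0", Path("/proc/modules").read_text()), once right after boot and once after the freeze, to compare the two states.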
What I tried:
Resetting the Nvidia dGPU via its PCI sysfs interface:
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/reset
This fails because the reset interface is missing (already reset?)
Powering down the PCI bridge:
echo auto | sudo tee /sys/bus/pci/devices/0000:00:1c.0/power/control
This hangs / freezes the console.
Unloading the kernel modules unfreezes the terminals and allows restarting nvx:
sudo modprobe --remove --remove-holders nvidia_drm
sudo modprobe --remove --remove-holders nvidia_modeset
My guess is that nvx sometimes fails to unload the kernel modules before it turns off the dGPU via the PCI interface, and the daemon then freezes.
Restarting nvx leads to the daemon freezing up again because it does not unload the kernel modules on start.
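To illustrate what "unload the modules on start" could look like, here is a hedged sketch. It is not taken from the nvx source: unload_nvidia_modules and UNLOAD_ORDER are hypothetical names, and the module list is limited to the three modules mentioned in this report. It unloads dependents first and stops at the first failure, so the PCI power-off would not be attempted while a module is still loaded.

```python
import subprocess

# Dependents first: nvidia_drm uses nvidia_modeset, which uses nvidia.
UNLOAD_ORDER = ("nvidia_drm", "nvidia_modeset", "nvidia")


def loaded_modules(proc_modules_text):
    """Return the module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def unload_nvidia_modules(proc_modules_text, run=subprocess.run):
    """Unload whichever of the NVIDIA modules are loaded (needs root).

    'run' is injectable so the logic can be exercised without modprobe.
    With check=True a failed modprobe raises CalledProcessError, so the
    caller can abort before any PCI call is made.
    """
    loaded = loaded_modules(proc_modules_text)
    unloaded = []
    for mod in UNLOAD_ORDER:
        if mod not in loaded:
            continue
        run(["modprobe", "--remove", mod], check=True)
        unloaded.append(mod)
    return unloaded
```

Running the same routine both on start and before each power-off would also cover the case where a previous run left the modules behind.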
Questions:
- Should the daemon unload the modules on start (to avoid the deadlock)?
- Why could the daemon fail to unload the modules on exit? Is it a race condition?
- Is this caused by my setup? To me it looks like the modules are simply not unloaded (e.g. on start). nvx also works normally for some time, so a module missing from the nvx config seems unlikely.