GPU utilization for one of four CUDA devices drops to 0% occasionally during huge batch dockings #303

@bmp192529

Description

Describe the bug
In the huge batch dockings we run (~1-2 million ligands), utilization of each of our 4 NVIDIA GPUs starts at roughly 100%, then drops so that three GPUs stay at ~100% while one sits at 0%. Which GPU is idle changes each time this occurs and often shifts over time: GPU 2 might be at 0% utilization at 2 pm, but by 4 pm GPU 4 is at 0% and GPU 2 is back at ~100%. As far as I can tell this isn't thermal throttling, since every GPU stays below 80 °C at no more than ~50% fan speed for the duration of the docking. A CPU bottleneck also seems unlikely: our Xeon CPU never exceeds 25% utilization during docking, and CPU temperatures hold at a constant 40-45 °C. The issue usually first becomes noticeable after around 200,000-300,000 compounds have been docked. No increase in failed dockings and no error messages were observed.
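One way to pin down exactly when a GPU goes idle is to log per-GPU utilization alongside the run and correlate timestamps with the docking log afterwards. A minimal sketch, assuming `nvidia-smi` is available (the log path and 60 s interval are arbitrary choices):

```shell
#!/usr/bin/env bash
# Append one CSV line per GPU (timestamp, index, utilization %, temperature)
# to the given log file.
log_gpu_util() {
  nvidia-smi --query-gpu=timestamp,index,utilization.gpu,temperature.gpu \
             --format=csv,noheader >> "${1:-gpu_util.log}"
}

# Poll once a minute when invoked with --watch; stop with Ctrl-C.
if [ "${1:-}" = "--watch" ] && command -v nvidia-smi >/dev/null 2>&1; then
  while true; do log_gpu_util gpu_util.log; sleep 60; done
fi
```

Matching the timestamp at which one GPU's utilization first hits 0% against the docking log would show whether the drop coincides with a particular ligand batch.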

To Reproduce
Run a docking of any receptor with a large number of ligands (>250,000?) on a multi-GPU machine.

Expected behavior
All 4 GPUs should run at nearly 100% for the duration of the docking. This issue did not occur when I previously ran docking projects as 4 separate processes that each used 1 GPU. I first noticed it after switching to a single docking process with "-D 2,3,4,5" added (our GPU #1 is not CUDA-enabled).
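The earlier scheme described above, four single-GPU processes, can be scripted as a workaround until the root cause is found. A sketch, assuming a master ligand list is split with GNU `split` and using AutoDock-GPU's `--filelist`/`-D` options; the binary name `autodock_gpu_128wi` and file names are placeholders:

```shell
#!/usr/bin/env bash
# Fallback sketch: four independent single-GPU AutoDock-GPU processes
# instead of one process launched with "-D 2,3,4,5".
BIN=./autodock_gpu_128wi          # placeholder binary name
LIST=ligands_all.txt              # placeholder master batch list

if [ -x "$BIN" ] && [ -f "$LIST" ]; then
  # Split the master list into four roughly equal parts (GNU split):
  # ligands_part_00 .. ligands_part_03
  split -n l/4 -d "$LIST" ligands_part_
  dev=2                           # first CUDA-capable device in this setup
  for part in ligands_part_0{0..3}; do
    "$BIN" --filelist "$part" -D "$dev" > "log_dev${dev}.txt" 2>&1 &
    dev=$((dev + 1))
  done
  wait                            # block until all four processes finish
else
  echo "binary or ligand list missing; nothing launched"
fi
```

Pinning one process per device sidesteps whatever scheduling imbalance the single multi-GPU process runs into, at the cost of manual load balancing across the four lists.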

Information to help narrow down the bug

  • Which version of AutoDock-GPU are you using?
    v1.6 (develop)
  • Which operating system are you on?
    Rocky Linux 9.5
  • Which compiler, compiler version, and make compile options did you use?
    gcc 11.5.0, make DEVICE=CUDA NUMWI=128
  • Which GPU(s) are you running on and is Cuda or OpenCL used?
    4x NVIDIA RTX A5000 GPUs, CUDA
  • Which driver version and if applicable, which Cuda version are you using?
    CUDA 12.8
  • When compiling AutoDock-GPU, are GPU_INCLUDE_PATH and GPU_LIBRARY_PATH set? Are both environment variables set to the correct directories, i.e. corresponding to the correct Cuda version or OpenCL library?
    Yes, this was manually confirmed in the bin directories as well.
  • Did this bug only show up recently? Which version of AutoDock-GPU, compiler, settings, etc. were you using that worked?
    This bug only showed up after I started running large dockings as one batch with "-D 2,3,4,5" added to our command. It didn't occur noticeably when we ran the docking as 4 separate simultaneous batches, each using one GPU. Again, the inactive GPU rotates frequently, which suggests the issue isn't tied to one particular piece of GPU hardware. Unclear if this is related, but we now store docking files on a mechanical HDD because of their large size. Although we initially ran dockings off an SSD, I had previously tested the 4 separate batches on the HDD without noticeable throttling.
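Since the docking files now live on a mechanical HDD, it may be worth checking whether the idle-GPU periods coincide with disk saturation. A quick sketch using sysstat's `iostat`; the device name `sda` is a placeholder and should be replaced with the HDD holding the docking files:

```shell
#!/usr/bin/env bash
# Sample extended disk stats three times at 60 s intervals; high %util or
# long await times during an idle-GPU window would point at disk I/O.
if command -v iostat >/dev/null 2>&1; then
  iostat -x sda 60 3
else
  echo "iostat not found (install the sysstat package)"
fi
```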

Even with this bug, AutoDock-GPU is still an insanely fast docking tool, and we love using it. Just hoping to contribute something useful back to the project with this report! Thank you.
