It would be great if some information could be added to the SHT performance comparison section (IV-B2) to make it easier to assess the relative performance of the various implementations.
- How was `ducc` installed on the testing machine? Did you use the recommended method of compiling from source (using `pip3 install --no-binary ducc0 --user ducc0`) or did you use a precompiled binary wheel? Since your target CPU supports AVX-512, while the portable binary wheel can only support SSE2, this makes a large performance difference.
- What was the relation between `lmax` and `nside` in the SHT performance tests?
- Is there a strong reason to use only 20 cores of the CPU for `ducc`? A node with AMD EPYC 9454 should have at least 48 CPU cores, and utilizing it to its full capability seems fair, given that the employed GPU is in a whole different ballpark price-wise.
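
To make the request concrete, here is a minimal sketch of how these details could be collected on the test node. It uses only the Python standard library. The helper names (`common_lmax_choices`, `has_cpu_flag`, `benchmark_environment`) are hypothetical, and the two `lmax`/`nside` conventions listed are just common HEALPix choices, not necessarily the ones used in the paper.

```python
import os
import platform
from importlib import metadata


def common_lmax_choices(nside: int) -> dict:
    """Two frequently used band limits for a HEALPix grid (assumed conventions)."""
    return {
        "lmax = 2*nside": 2 * nside,
        "lmax = 3*nside - 1": 3 * nside - 1,
    }


def has_cpu_flag(flag: str, cpuinfo_text: str) -> bool:
    """Check for a CPU feature flag in /proc/cpuinfo-style text (Linux only)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return flag in line.partition(":")[2].split()
    return False


def benchmark_environment() -> dict:
    """Collect the facts a reader needs to interpret CPU timings."""
    try:
        ducc_version = metadata.version("ducc0")
    except metadata.PackageNotFoundError:
        ducc_version = None  # ducc0 is not installed in this environment
    try:
        with open("/proc/cpuinfo") as f:
            avx512 = has_cpu_flag("avx512f", f.read())
    except OSError:
        avx512 = None  # not Linux, or /proc is unavailable
    return {
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "ducc0_version": ducc_version,
        "cpu_supports_avx512f": avx512,
    }


if __name__ == "__main__":
    print(benchmark_environment())
    for nside in (512, 1024, 2048):
        print(nside, common_lmax_choices(nside))
```

Note that this reports what the CPU supports, not which instruction set the installed `ducc0` build was compiled for, so the installation method would still need to be stated explicitly.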