Inconsistent energy and grad/forces for different GPU4PySCF versions #754

@TimotheMelin

Hello,

Over the past few weeks, I have been testing different versions of GPU4PySCF against the CPU version of PySCF. During those tests, I noticed that some GPU4PySCF versions give results that are inconsistent with the CPU version.

All tests were run on a national cluster using:
NVIDIA Ampere100 custom, 64GiB HBM2e NVLink 3.0 (200 GB/s)
Intel Ice Lake Intel Xeon Platinum 8358 processors

PySCF was built in a Python venv using cuda12-cutensor 2.2.0 or 2.0.2.

I tested different types of dimer interactions, and I see inconsistent energies when water molecules interact with each other (Water-Water examples) or with a positively charged molecule (Tetramethylammonium-Water examples). All outputs for the different versions are included in the tar file (res_outputs.tar.gz).

The energy differences can be very large. For example, for Tetramethylammonium-Water:

  • CPU 2.7.0 E = -56.78424235 Hartree
  • GPU 1.3.0 E = -56.7842423 Hartree
  • GPU 1.4.1 E = -54.3729462 Hartree
  • GPU 1.4.3 E = -54.37294628 Hartree
  • GPU 1.6.0 E = -56.7842405 Hartree
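
To make the size of these differences explicit, here is a minimal sketch that computes each GPU energy's deviation from the CPU value. The energies are copied from the list above; the 1e-5 Hartree tolerance and the helper name `energy_deviations` are only illustrative assumptions, not anything taken from PySCF or GPU4PySCF.

```python
# Energies in Hartree, copied from the list above.
energies = {
    "CPU 2.7.0": -56.78424235,
    "GPU 1.3.0": -56.7842423,
    "GPU 1.4.1": -54.3729462,
    "GPU 1.4.3": -54.37294628,
    "GPU 1.6.0": -56.7842405,
}

REFERENCE = "CPU 2.7.0"
ENERGY_TOL = 1e-5  # Hartree; hypothetical threshold chosen for illustration


def energy_deviations(values, reference):
    """Return E(version) - E(reference) for every non-reference entry."""
    ref = values[reference]
    return {label: e - ref for label, e in values.items() if label != reference}


deviations = energy_deviations(energies, REFERENCE)

# Versions whose energy differs from the CPU value by more than the tolerance.
flagged_energies = [
    label for label, delta in deviations.items() if abs(delta) > ENERGY_TOL
]

for label, delta in deviations.items():
    status = "MISMATCH" if label in flagged_energies else "ok"
    print(f"{label}: dE = {delta:+.8f} Eh  [{status}]")
```

With this threshold, only GPU 1.4.1 and 1.4.3 are flagged: both are about 2.41 Hartree above the CPU energy, while 1.3.0 and 1.6.0 agree with it to within about 2e-6 Hartree.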

For these dimers, the CPU interaction energy matches the one I obtained with a different program (Psi4), so I trust the CPU results.
Based on that, the results from GPU 1.4.1 and 1.4.3 appear to be wrong, while those from 1.3.0 and 1.6.0 appear to be correct.

However, when checking larger molecules, I noticed that the forces from GPU 1.6.0 are very different from those of CPU 2.7.0 and the other GPU versions.

I also added the files for one drug molecule, cypd_lig40.
For this particular case, the x component of the force on the first atom is, in Eh/Bohr:

  • CPU 2.7.0: -9.44605401e-03
  • GPU 1.3.0: -9.44609184e-03
  • GPU 1.4.1: -9.44613860e-03
  • GPU 1.4.3: -9.44599386e-03
  • GPU 1.6.0: 1.12483895
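
The same kind of check can be sketched for the force components. The values are copied from the list above; the 1e-6 Eh/Bohr tolerance is a hypothetical threshold picked only to separate numerical noise from a real disagreement.

```python
# x component of the force on the first atom of cypd_lig40, in Eh/Bohr,
# copied from the list above.
forces_x = {
    "CPU 2.7.0": -9.44605401e-03,
    "GPU 1.3.0": -9.44609184e-03,
    "GPU 1.4.1": -9.44613860e-03,
    "GPU 1.4.3": -9.44599386e-03,
    "GPU 1.6.0": 1.12483895,
}

FORCE_REFERENCE = "CPU 2.7.0"
FORCE_TOL = 1e-6  # Eh/Bohr; hypothetical threshold chosen for illustration

ref_force = forces_x[FORCE_REFERENCE]

# Signed deviation of each GPU value from the CPU value.
force_deviations = {
    label: f - ref_force
    for label, f in forces_x.items()
    if label != FORCE_REFERENCE
}

# Versions whose force component differs by more than the tolerance.
flagged_forces = [
    label for label, delta in force_deviations.items() if abs(delta) > FORCE_TOL
]

for label, delta in force_deviations.items():
    status = "MISMATCH" if label in flagged_forces else "ok"
    print(f"{label}: dF = {delta:+.3e} Eh/Bohr  [{status}]")
```

Here GPU 1.3.0, 1.4.1 and 1.4.3 agree with the CPU value to within about 1e-7 Eh/Bohr, whereas GPU 1.6.0 differs by about 1.13 Eh/Bohr and has the opposite sign.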

Based on these results, I will stick with GPU4PySCF 1.3.0 for my calculations.

I know that the setup is not exactly the same between the different runs, but since the package versions overlap between runs, I am confident that the results would be consistent even if I changed the cuTENSOR or PySCF version used with GPU 1.4.1, 1.4.3 and 1.6.0.

Do you have any insight into why I see such large discrepancies between the versions?

Let me know if you need the exact spec of each python environment.
