Running with mpirun hangs #46

@anbenali

When running with:
python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100
I get the following within 10 seconds:

Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775

However, when running with mpirun, the run appears to hang (for at least 5 minutes):
mpirun -n 8 python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100
The CPUs are busy, though:


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  10633 abenali   20   0 1750856   1.3g  36500 R 100.0   4.1   7:12.84 python3
  10635 abenali   20   0 1751112   1.3g  36224 R 100.0   4.3   7:12.86 python3
  10636 abenali   20   0 1751112   1.3g  36252 R 100.0   4.2   7:12.38 python3
  10637 abenali   20   0 1750856   1.3g  36308 R 100.0   4.2   7:12.77 python3
  10638 abenali   20   0 1605704   1.3g  36244 R 100.0   4.1   7:12.83 python3
  10639 abenali   20   0 1750856   1.3g  36508 R 100.0   4.2   7:12.82 python3
  10640 abenali   20   0 1605448   1.3g  36432 R 100.0   4.1   7:12.49 python3
  10634 abenali   20   0 1604936   1.2g  36432 R  99.3   3.7   7:12.84 python3

Then, once the calculation is done, we get the following:

abenali@abenali:~/Work/src/QuantumEnvelope/data/test$ mpirun -n 8 python3 ../../main.py nh3.1det.fcidump nh3.1det.wf 100
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865
Davidson Failed, fallback to numpy eigh
N_det: 2, E -56.170139747833076
Davidson Failed, fallback to numpy eigh
N_det: 4, E -56.17802529415852
N_det: 8, E -56.18834028429127
N_det: 16, E -56.203293304313775
N_det: 32, E -56.222182516458645
N_det: 64, E -56.24186291203908
N_det: 128, E -56.25848774667865

As you can see, the output is jumbled: instead of the master rank printing one line per iteration, every rank prints every line, and everything appears at the same time.

This obviously comes from here (main.py):

 
    while len(psi_det) < N_det_target:
        E, psi_coef, psi_det = selection_step(comm, lewis, n_ord, psi_coef, psi_det, len(psi_det))
        # Update Hamiltonian engine
        lewis = Hamiltonian_generator(
            comm, E0, d_one_e_integral, d_two_e_integral, psi_det, driven_by=driven_by
        )
        print(f"N_det: {len(psi_det)}, E {E}")
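
One possible way to quiet the duplicated output is to guard the print so that only one rank emits it. The sketch below is an assumption, not the project's actual fix: `print_master`, `FakeComm`, and `simulate` are hypothetical names introduced here for illustration. It relies only on the communicator exposing `Get_rank()`, as mpi4py communicators do, and uses a stand-in communicator so it runs without MPI. Passing `flush=True` may also help lines show up as each iteration finishes rather than all at the end, if buffering is part of the problem.

```python
import io
from contextlib import redirect_stdout


class FakeComm:
    """Hypothetical stand-in for an MPI communicator, so the sketch runs without MPI."""

    def __init__(self, rank):
        self._rank = rank

    def Get_rank(self):
        return self._rank


def print_master(comm, message, root=0):
    """Print `message` only on rank `root`; return True if this rank printed."""
    if comm.Get_rank() == root:
        # flush=True pushes the line out immediately instead of leaving it buffered
        print(message, flush=True)
        return True
    return False


def simulate(n_ranks, message):
    """Emulate n_ranks processes calling print_master and collect what gets printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        for rank in range(n_ranks):
            print_master(FakeComm(rank), message)
    return buffer.getvalue()


# Eight emulated ranks produce a single line instead of eight copies.
output = simulate(8, "N_det: 2, E -56.170139747833076")
```

In the loop above, this would amount to replacing the bare `print(...)` with something like `print_master(comm, f"N_det: {len(psi_det)}, E {E}")`, assuming `comm` there is an mpi4py communicator. This only addresses the duplicated output; it does not by itself explain why the run under mpirun takes so much longer.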

