Distributed training fails for large system size #14

@braceal

The failing line: https://github.com/DeepDriveMD/DeepDriveMD-pipeline/blob/dbg/integration/deepdrivemd/models/aae/train.py#L181

The traceback:

Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 180, in main
    dist.init_process_group(backend="nccl", init_method="env://")
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
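
A possible way to narrow this down, sketched under assumptions: with `init_method="env://"`, PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from the environment, and the traceback shows each process then connecting to a `TCPStore` at that address. The helper names below (`missing_rendezvous_vars`, `can_reach_master`, `init_distributed`) and the 60-minute default are hypothetical and not part of `train.py`. Whether a longer timeout actually resolves this failure is untested here.

```python
import os
import socket
from datetime import timedelta

# Environment variables read by the env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def missing_rendezvous_vars(env=None):
    """Return the required rendezvous variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def can_reach_master(addr, port, timeout_s=5.0):
    """Diagnostic only: try a plain TCP connection to the master address.

    This is only meaningful from a non-zero rank after rank 0 has started,
    because rank 0 is the process that hosts the store.
    """
    try:
        with socket.create_connection((addr, int(port)), timeout=timeout_s):
            return True
    except OSError:
        return False


def init_distributed(timeout_minutes=60):
    """Hypothetical wrapper around the failing call, with a longer timeout."""
    import torch.distributed as dist  # imported lazily so the checks above work without torch

    missing = missing_rendezvous_vars()
    if missing:
        raise RuntimeError("unset rendezvous variables: " + ", ".join(missing))

    # Assumption: a longer timeout gives slow-starting ranks time to join.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=timedelta(minutes=timeout_minutes),
    )
```

Printing `missing_rendezvous_vars()` and, on a non-zero rank, `can_reach_master(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])` just before the `init_process_group` call would help separate a misconfigured environment from an unreachable master node.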

Useful references:

Labels: bug
