The line failing: https://github.com/DeepDriveMD/DeepDriveMD-pipeline/blob/dbg/integration/deepdrivemd/models/aae/train.py#L181
The Traceback:
Traceback (most recent call last):
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 390, in <module>
    main(cfg, args.encoder_gpu, args.generator_gpu, args.decoder_gpu, args.distributed)
  File "/p/gpfs1/brace3/src/DeepDriveMD-pipeline/deepdrivemd/models/aae/train.py", line 180, in main
    dist.init_process_group(backend="nccl", init_method="env://")
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/g/g15/brace3/.conda/envs/conda-pytorch/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
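A possible way to narrow this down: with `init_method="env://"`, `init_process_group` reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment, and the `TCPStore` call in the traceback then connects to that address and port. The sketch below is not part of DeepDriveMD; the helper names `missing_env_vars` and `can_connect` are hypothetical. It uses only the standard library to check that the variables are set and that the master port accepts TCP connections from the current node. It assumes the rank 0 process has already started and is listening, so it is only meaningful when run from a non-zero rank's node.

```python
import os
import socket

# Environment variables read by the env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Unset environment variables:", ", ".join(missing))
    else:
        host = os.environ["MASTER_ADDR"]
        port = int(os.environ["MASTER_PORT"])
        reachable = can_connect(host, port)
        print(f"{host}:{port} reachable: {reachable}")
```

If the port is not reachable, likely candidates include a `MASTER_ADDR` that does not resolve from the compute node, a port blocked between nodes, or rank 0 not having started yet. These are guesses, not a confirmed diagnosis of this failure.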
Useful references: