Skip to content

[BUG] Crashed when scaling out to 4 workers 4 servers. #27

@Zhangmj0621

Description

@Zhangmj0621

We successfully tested the StepMesh configuration with M=2 and N=2 workers and servers, but encountered a crash when configuring with M=4 and N=4.
Below is the error log. What steps should I take to resolve this issue?

RuntimeError: [02:11:04] /infrawaves/StepMesh/src/./rdma_van.h:206: Check failed: (getaddrinfo(node.hostname.c_str(), std::to_string(node.port).c_str(), nullptr, &remote_addr)) == (0) 

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions