[Bug] Crash when changing batch size in bmk_comm_latency_multiserver.py #28

@Zhangmj0621

When we change the batch size (bsz) in bmk_comm_latency_multiserver.py, the run crashes with an error like the one below.
RuntimeError: [15:17:41] /infrawaves/StepMesh/fserver/csrc/./public.hpp:103: Check failed: (tensors.size()) == (reqmeta.pull_tensors.size())
After digging into the code, we found that the hard-coded line below uses only the first tensor in input_tensors.

elif is_server:
    ret_buffer = torch.rand([65535, dim], dtype=torch.bfloat16, device='cuda')
    count = 0
    f.barrier(True, False)
    def server():
        global count
        iter_count = 0
        while True:
            batches = f.get_batch()
            if len(batches) != 0:
                iter_count += 1
                # hard-coded: only the first tensor of each batch is used
                - recv_tensor_list = [batches[i][1][0] for i in range(worker_count)]
                + recv_tensor_list = [batches[i][1] for i in range(worker_count)]
                comm_id_list = [batches[i][0] for i in range(worker_count)]

                f.respond_vec(ret_buffer, recv_tensor_list, comm_id_list)
                if iter_count == num_iters:
                    break
    server()

f.stop()
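
To make the effect of the one-line change concrete, here is a minimal sketch with mock data. It assumes, based only on the indexing used above, that each entry returned by f.get_batch() is a (comm_id, [tensor, ...]) pair; the strings are hypothetical stand-ins for tensors and no StepMesh API is called.

```python
# Mock of the structure implied by the indexing in the server loop.
# Assumption: each batch is a (comm_id, [tensor, ...]) pair.
# Strings stand in for tensors; the ids are made up for illustration.
worker_count = 2
batches = [
    (101, ["w0_t0", "w0_t1", "w0_t2"]),
    (102, ["w1_t0", "w1_t1", "w1_t2"]),
]

# Original indexing: keeps only the first tensor of each worker's batch.
first_only = [batches[i][1][0] for i in range(worker_count)]

# Changed indexing: keeps the full tensor list of each worker's batch.
all_tensors = [batches[i][1] for i in range(worker_count)]

comm_id_list = [batches[i][0] for i in range(worker_count)]

print(first_only)
print(all_tensors)
print(comm_id_list)
```

With the original indexing the server sees one tensor per worker regardless of how many were sent; with the changed indexing it sees a list of lists, which is why respond_vec also has to accept a nested vector.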

Besides changing the hard-coded line above, we also changed the related code in fserver/csrc/public.hpp as follows:

void respond_vec(torch::Tensor& ret_buffer,
                 std::vector<std::vector<torch::Tensor>>& tensors_vec,
                 std::vector<uint64_t>& handler_vec) {
  PS_CHECK_EQ(tensors_vec.size(), handler_vec.size());
  for (size_t i = 0; i < handler_vec.size(); i++) {
    std::vector<torch::Tensor> sliced_buffer_list;
    int64_t tensor_shape_0 = tensors_vec[i][0].size(0);
    for (int j = 0; j < tensors_vec[i].size() - 1; j++) {
      sliced_buffer_list.push_back(
          ret_buffer.slice(0, j * tensor_shape_0, tensor_shape_0));
    }
    //std::vector<torch::Tensor> sliced_buffer_list = {
    //    ret_buffer.slice(0, 0, tensor_shape_0)
    //};
    respond(sliced_buffer_list, handler_vec[i], i == 0);
  }
}
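
One thing we are not sure about in the loop above: as far as we know, torch::Tensor::slice(dim, start, end) takes an exclusive end index rather than a length, so passing tensor_shape_0 as the third argument may select no rows for j >= 1. The sketch below only does the index arithmetic in plain Python (no torch); the helper names are hypothetical, and the "intended" ranges assume the goal is one consecutive block of rows per tensor.

```python
def row_ranges(num_slices, rows):
    """(start, end) pairs for consecutive blocks of `rows` rows each."""
    return [(j * rows, (j + 1) * rows) for j in range(num_slices)]


def rows_selected(start, end):
    """Rows covered by a [start, end) slice; empty when end <= start."""
    return max(0, end - start)


rows = 4  # stands in for tensor_shape_0

# Bounds as written in the C++ loop: end is always `rows`.
as_written = [(j * rows, rows) for j in range(3)]

# Bounds if each slice should cover its own block (our assumption).
intended = row_ranges(3, rows)

print([rows_selected(s, e) for s, e in as_written])  # [4, 0, 0]
print([rows_selected(s, e) for s, e in intended])    # [4, 4, 4]
```

If that reading of slice is right, the end argument would need to be (j + 1) * tensor_shape_0. We have not confirmed whether this is related to the error that follows.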

Unfortunately, with these changes we hit a further error:

[01:55:14] server /gpfs/Stepmesh/include/dmlc/logging.h:301: [01:55:14] /infrawaves/StepMesh/src/./rdma_van.h:875: Check failed: (temp_mr) != (mem_mr_.end()) 
Stack trace returned 6 entries:
[bt] (0) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x5422f) [0x7f897146822f]
[bt] (1) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(+0x54563) [0x7f8971468563]
[bt] (2) /gpfs/Stepmesh/fserver_lib.cpython-310-x86_64-linux-gnu.so(ps::RDMAVan::PollCQ()+0x1a81) [0x7f89714d8951]
[bt] (3) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8a8b6b0253]
[bt] (4) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8aeb334ac3]
[bt] (5) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8aeb3c6850]

Have you encountered the problem above, and is there a known solution? Any feedback would be highly appreciated.
