Summary
While porting StepMesh to a non-CUDA, Ascend NPU (ARM64) environment, Push operations crash consistently.
Initially the run failed inside ibv_post_send(). After further instrumentation, the crash was traced to the following protocol-layer assertion:
rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)
This confirms the failure is not caused by RDMA verbs, but by an incorrect PushRequest data layout on the sender side.
How I Identified the Real Root Cause
The initial crash happened inside ibv_post_send(). To rule out inline-send issues (EINVAL is common when the inline size exceeds the device limit), I modified RDMAWriteWithImm to force-disable inline:
    if (inline_write) {
      wr.send_flags |= IBV_SEND_INLINE;
    }
    // Force-disable inline (important for ruling out EINVAL)
    wr.send_flags &= ~IBV_SEND_INLINE;
    if (prev_wr == nullptr) {
      PS_CHECK_EQ(ibv_post_send(qp, &wr, &bad_wr), 0);
    } else {
      prev_wr->next = &wr;
      PS_CHECK_EQ(ibv_post_send(qp, prev_wr, &bad_wr), 0);
    }
After disabling inline writes, the failure moved deterministically to:
rdma_transport.h: PS_CHECK_EQ(buffer_ctx->data_num, 3)
This shows:
- RDMA verbs and QP are functional.
- Inline was not the real cause.
- The true bug is that the PushRequest sender is constructing an incorrect number of data segments.
Thus the crash is purely a protocol mismatch, not a hardware issue.
Expected Behavior
Per StepMesh's internal RDMA protocol:
PushRequest (worker → server)
data_num = 3
segment[0] = keys
segment[1] = vals
segment[2] = lens
RecvPushRequest (in rdma_transport.h) enforces this strictly:
    PS_CHECK_EQ(buffer_ctx->data_num, 3);
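To make the expected layout explicit, here is a minimal sketch of the segment order together with a check that reports which segment is missing, rather than only asserting. The names PushSegment and DescribePushLayoutError are hypothetical and not part of StepMesh. The sketch also assumes the sender fills segments in order (keys, then vals, then lens), so the first missing segment is the one at index data_num.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; not StepMesh identifiers.
enum PushSegment : std::size_t {
  kKeys = 0,
  kVals = 1,
  kLens = 2,
  kPushSegmentCount = 3
};

// Returns an empty string when data_num matches the PushRequest layout,
// otherwise a message describing the mismatch. Assumes segments are
// filled in order, so index data_num is the first one missing.
inline std::string DescribePushLayoutError(std::size_t data_num) {
  if (data_num == kPushSegmentCount) return "";
  static const char* const kNames[] = {"keys", "vals", "lens"};
  std::string msg =
      "PushRequest expects 3 segments (keys, vals, lens), got " +
      std::to_string(data_num);
  if (data_num < kPushSegmentCount) {
    msg += "; first missing segment: ";
    msg += kNames[data_num];
  }
  return msg;
}
```

Logging this message on the receiver just before the existing PS_CHECK_EQ would show directly that the NPU sender stops after vals (data_num == 2) or after keys (data_num == 1).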
Actual Behavior on NPU Port
Sender constructs 1 or 2 segments, not 3.
As a result:
- msg_buf->data.size() is wrong
- SendRendezvousBegin sends a wrong data_num to peer
- Receiver fails the assertion (data_num != 3)
- Before disabling inline, this propagated as ibv_post_send errors (invalid WR state)
Root Cause
StepMesh currently relies on implicit, scattered assumptions for how many data segments each message type should contain.
The real contract is:
Message Type | Expected data_num | Required Segments
-- | -- | --
PushRequest | 3 | keys / vals / lens
PullRequest | 2 | keys / empty vals
PullResponse | 3 | keys / vals / lens
PushResponse | 0 | none
For PushRequest (non-GDR), the NPU sender must construct exactly 3 segments.
Because this convention is not centralized or validated, it is easy for non-GPU backends to violate it and cause fatal RDMA failures.
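One way to centralize the contract in the table above is a single lookup function that both sender and receiver consult. This is only a sketch: MsgType, ExpectedDataNum, and HasValidLayout are hypothetical names, and the real StepMesh code identifies message types through its own metadata rather than an enum like this.

```cpp
#include <cstddef>

// Hypothetical enum for illustration; not how StepMesh tags messages.
enum class MsgType { kPushRequest, kPullRequest, kPushResponse, kPullResponse };

// Single place that encodes the expected data_num per message type,
// mirroring the table in this report.
constexpr std::size_t ExpectedDataNum(MsgType type) {
  switch (type) {
    case MsgType::kPushRequest:  return 3;  // keys / vals / lens
    case MsgType::kPullRequest:  return 2;  // keys / empty vals
    case MsgType::kPullResponse: return 3;  // keys / vals / lens
    case MsgType::kPushResponse: return 0;  // no data segments
  }
  return 0;  // not reached for valid enum values
}

// True when a message of the given type carries the expected segment count.
constexpr bool HasValidLayout(MsgType type, std::size_t data_num) {
  return data_num == ExpectedDataNum(type);
}

// Compile-time guard so the PushRequest contract cannot drift silently.
static_assert(ExpectedDataNum(MsgType::kPushRequest) == 3,
              "PushRequest must carry keys, vals, and lens");
```

With something like this in place, the sender could call HasValidLayout before SendRendezvousBegin and fail early with a clear message, instead of the mismatch surfacing on the receiver.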
Attachments
Relevant source locations:
rdma_transport.h – receives PushRequest and asserts data_num == 3
rdma_van.h – converts msg.data into MessageBuffer
rdma_utils.h – Rendezvous structures
van.cc – SendMsg() code path
Closing
This issue is not RDMA-hardware related.
After disabling inline writes, the crash clearly originates from:
PushRequest sender not constructing 3 data segments.
A centralized protocol definition or normalization step would fix the issue and make StepMesh portable beyond CUDA/GDR environments.
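As a rough illustration of the normalization step, the sketch below pads a PushRequest segment list to three entries before it is handed to the transport. Segment and NormalizePushSegments are hypothetical stand-ins, not StepMesh types, and I have not confirmed that the receiver accepts an empty lens (or vals) segment; that would need to be checked against RecvPushRequest.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a data segment; the real code uses its own buffer type.
using Segment = std::vector<char>;

// Pads a PushRequest segment list with empty segments up to 3 entries
// (keys, vals, lens). Returns false and leaves the list untouched when
// it cannot be repaired: no segments at all, or more than 3.
inline bool NormalizePushSegments(std::vector<Segment>& data) {
  constexpr std::size_t kExpected = 3;
  if (data.empty() || data.size() > kExpected) return false;
  data.resize(kExpected);  // any appended segments are empty
  return true;
}
```

Padding is a workaround rather than a fix for the sender. The cleaner option is for the NPU backend to always add keys, vals, and lens explicitly, with a step like this acting only as a safety net.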
Environment
Hardware: Ascend NPU (ARM64)
Architecture: aarch64
StepMesh version: master (2025-01)
RDMA: RoCE
No CUDA / no GDR support