Question
AFAIK, the QPs used by GDAKI are ultimately created through DEVX, and the counters for DEVX-created QPs do not show up in /sys/class/infiniband/mlx5_xxx/ports/1/hw_counters/ (for example out_of_sequence, req_cqe_error and local_ack_timeout_err). To get counters for DEVX QPs, a QP counter set ID has to be bound to the QPs created by DEVX. Does NCCL have any plan to expose QP counter metrics for GDAKI's QPs, so that network issues can be observed easily when using GDAKI?
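
For context, here is a minimal sketch of how the port-level counters mentioned above can be read from sysfs today. The function name `read_hw_counters`, its parameters, and the device name in the usage note are assumptions for illustration and are not part of NCCL or GDAKI. As described above, these values do not appear to cover the DEVX-created QPs.

```python
from pathlib import Path

# Counter names taken from the question; other counters may exist on a given system.
COUNTERS = ("out_of_sequence", "req_cqe_error", "local_ack_timeout_err")


def read_hw_counters(device, port=1, names=COUNTERS, root="/sys/class/infiniband"):
    """Return {counter_name: int or None} for one port's hw_counters directory.

    `device` is the RDMA device directory name under `root` (hypothetical
    helper; the caller supplies the actual device name).
    """
    base = Path(root) / device / "ports" / str(port) / "hw_counters"
    values = {}
    for name in names:
        try:
            values[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            # Counter file is missing, unreadable, or not an integer.
            values[name] = None
    return values
```

Calling `read_hw_counters("<device>")` with the local device name returns a dictionary keyed by counter name, with `None` for any counter that is not available on that system.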