When using `mpich/opt/develop-git.6037a7a` on Aurora, I notice that the following test crashes for large message sizes (~64MB) when GPU pipelining is turned on. For smaller message sizes, there is a performance regression.
`mpich/opt/4.2.3-intel` does not seem to have this issue.
This is the performance up to 4MB using the two versions:
| message size (bytes) | GPU PIPLN / develop-git.6037a7a (MB/s) | GPU PIPLN / 4.2.3-intel (MB/s) |
|---|---|---|
| 1 | 1 | 1.61 |
| 2 | 1.59 | 3.23 |
| 4 | 3.19 | 6.45 |
| 8 | 6.38 | 12.96 |
| 16 | 12.78 | 25.89 |
| 32 | 25.57 | 51.75 |
| 64 | 50.78 | 100.14 |
| 128 | 36.33 | 123.29 |
| 256 | 94.03 | 121.6 |
| 512 | 98.78 | 134.72 |
| 1024 | 106.68 | 143.59 |
| 2048 | 111.51 | 732.25 |
| 4096 | 113.87 | 1472.69 |
| 8192 | 109.69 | 2957.42 |
| 16384 | 110.34 | 5861.81 |
| 32768 | 107.82 | 11590.16 |
| 65536 | 21586.53 | 21417.23 |
| 131072 | 28338.11 | 34983.17 |
| 262144 | 30891.85 | 43363.4 |
| 524288 | 32059.07 | 46077.84 |
| 1048576 | 28726.47 | 46744.3 |
| 2097152 | 29452.7 | 47071.55 |
| 4194304 | 29858.49 | 47247.48 |
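
To put a number on the regression, the ratio of the two bandwidth columns can be computed per message size. This is a minimal sketch, assuming the rows are pasted as whitespace-separated columns (message size, develop-git MB/s, 4.2.3-intel MB/s). The `slowdown` helper name is hypothetical, and the rows fed to it are a subset copied from the numbers above.

```shell
#!/bin/bash
# Hypothetical helper: for each input row print the message size and the
# ratio (4.2.3-intel bandwidth) / (develop-git bandwidth).
# A ratio above 1 means develop-git is slower for that message size.
slowdown() {
    awk 'NF == 3 && $2 > 0 { printf "%s %.2f\n", $1, $3 / $2 }'
}

# Columns: message size, develop-git MB/s, 4.2.3-intel MB/s
slowdown <<'EOF'
64 50.78 100.14
128 36.33 123.29
32768 107.82 11590.16
65536 21586.53 21417.23
4194304 29858.49 47247.48
EOF
```

On these rows the gap is largest at 32768 bytes, where develop-git is roughly two orders of magnitude slower, while at 65536 bytes the two versions are about even.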
This is the test:
```shell
export FI_CXI_RDZV_THRESHOLD=131072
export EnableImplicitScaling=0
export NEOReadDebugKeys=1
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
export MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1

# Enable GPU pipelining
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_THRESHOLD=0
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_NUM_BUFFERS_PER_CHUNK=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_MAX_NUM_BUFFERS=4
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_D2H_ENGINE_TYPE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_H2D_ENGINE_TYPE=1

mpiexec -np 4 -ppn 2 --cpu-bind list:2:15 ~/gpu_wrappers/2-2.sh $PATH_TO_OSU/pt2pt/osu_mbw_mr -m 1:67108864 -i 100 -x 20 -d ze D D
```
The wrapper script used here is:

```shell
#!/bin/bash
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE

# Map each local rank to its own GPU and NIC
if [ "$PALS_LOCAL_RANKID" -eq 0 ]
then
    AFFINITY_MASK=0.0
    NIC_NUM=cxi0
elif [ "$PALS_LOCAL_RANKID" -eq 1 ]
then
    AFFINITY_MASK=1.0
    NIC_NUM=cxi1
fi

echo "[I am rank $PALS_RANKID] Localrank=$PALS_LOCAL_RANKID : Affinity mask = $AFFINITY_MASK, PREFERRED_NIC = $NIC_NUM"
export ZE_AFFINITY_MASK=$AFFINITY_MASK
export FI_CXI_DEVICE_NAME=$NIC_NUM

# Invoke the main program, preserving argument boundaries
"$@"
```