Summary
shmem_internal_team_choose_psync() in src/shmem_team.c contains an incorrect
index formula that causes pSync array collisions between different teams.
Specifically, SHMEM_TEAM_WORLD using slot i=1 and SHMEM_TEAM_SHARED using
slot i=0 both resolve to the same symmetric memory address.
Root Cause
Bug 1 — Fast path (line 553)
return &shmem_internal_psync_pool[(team->psync_idx + i) * PSYNC_CHUNK_SIZE];
PSYNC_CHUNK_SIZE = N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE = 2 * SHMEM_SYNC_SIZE
The formula (psync_idx + i) adds the team index and the slot index before
multiplying, so slot 1 of one team lands on slot 0 of the next team. In the
table below, S denotes SHMEM_SYNC_SIZE:
| Team | psync_idx | i | Computed offset | Pool index |
|---|---|---|---|---|
| WORLD | 0 | 0 | (0+0)×2S | 0 |
| WORLD | 0 | 1 | (0+1)×2S | 2S ← |
| SHARED | 1 | 0 | (1+0)×2S | 2S ← COLLISION |
| SHARED | 1 | 1 | (1+1)×2S | 4S |
The correct formula separates the two dimensions:
return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE];
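The difference between the two formulas can be checked in isolation. The following is a minimal sketch, not the library code: the value 8 for SHMEM_SYNC_SIZE is a stand-in chosen for illustration, and the helper names fast_path_index_buggy and fast_path_index_fixed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in constants for illustration; the real values come from the
 * library's headers. Only the 2:1 ratio matters here. */
#define SHMEM_SYNC_SIZE   ((size_t)8)
#define N_PSYNCS_PER_TEAM ((size_t)2)
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)

/* Index as currently computed: team and slot are summed, then scaled. */
static size_t fast_path_index_buggy(size_t psync_idx, size_t i)
{
    assert(i < N_PSYNCS_PER_TEAM);
    return (psync_idx + i) * PSYNC_CHUNK_SIZE;
}

/* Proposed index: the team selects a chunk, the slot selects a pSync
 * inside that chunk. */
static size_t fast_path_index_fixed(size_t psync_idx, size_t i)
{
    assert(i < N_PSYNCS_PER_TEAM);
    return psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE;
}
```

With these definitions, fast_path_index_buggy(0, 1) and fast_path_index_buggy(1, 0) both evaluate to PSYNC_CHUNK_SIZE, while the fixed variant yields SHMEM_SYNC_SIZE and PSYNC_CHUNK_SIZE respectively.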
Bug 2 — Quiesce path (lines 561/570)
size_t psync = team->psync_idx * SHMEM_SYNC_SIZE; // wrong multiplier
...
return &shmem_internal_psync_pool[psync];
The returned collective pSync pointer uses SHMEM_SYNC_SIZE as the stride instead
of PSYNC_CHUNK_SIZE, so for any team with psync_idx > 0 it lands at too low an
offset. The fix:
return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE];
Note: the existing psync variable is correctly used for shmem_internal_psync_barrier_pool
indexing on line 562 and should be left unchanged for that purpose.
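To see where the quiesce-path pointer ends up, here is a small sketch that assumes the pool is laid out as contiguous per-team chunks of PSYNC_CHUNK_SIZE elements, which is the layout the proposed fix implies. The value 8 for SHMEM_SYNC_SIZE and the helper names (quiesce_index_buggy, quiesce_index_fixed, owning_team) are hypothetical.

```c
#include <stddef.h>

/* Stand-in constants for illustration only. */
#define SHMEM_SYNC_SIZE   ((size_t)8)
#define N_PSYNCS_PER_TEAM ((size_t)2)
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)

/* Index returned today: strides by one pSync per team. */
static size_t quiesce_index_buggy(size_t psync_idx)
{
    return psync_idx * SHMEM_SYNC_SIZE;
}

/* Proposed index: strides by one full chunk per team. */
static size_t quiesce_index_fixed(size_t psync_idx)
{
    return psync_idx * PSYNC_CHUNK_SIZE;
}

/* Under the assumed chunked layout, the team whose chunk contains
 * the given pool index. */
static size_t owning_team(size_t pool_index)
{
    return pool_index / PSYNC_CHUNK_SIZE;
}
```

Under that assumed layout, the buggy index for psync_idx = 1 is SHMEM_SYNC_SIZE, which falls inside the chunk belonging to psync_idx = 0; the fixed index falls at the start of the team's own chunk. For psync_idx = 0 both variants agree, which may be why the problem is easy to miss when only SHMEM_TEAM_WORLD is exercised.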
Impact
Any workload that concurrently issues two or more team collectives where one team's
psync_idx + 1 == another team's psync_idx (e.g., WORLD+SHARED, SHARED+NODE, or any
adjacent dynamically-created teams) will silently corrupt the pSync state of both
collectives, leading to hangs or incorrect results.
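The claim that every pair of adjacent teams is affected can be checked exhaustively. The sketch below counts cross-team overlaps for a given index formula; N_TEAMS = 4, the value 8 for SHMEM_SYNC_SIZE, and all function names are assumptions made for illustration, not taken from the library.

```c
#include <stddef.h>

/* Stand-in constants for illustration only. */
#define SHMEM_SYNC_SIZE   ((size_t)8)
#define N_PSYNCS_PER_TEAM ((size_t)2)
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)
#define N_TEAMS           ((size_t)4)   /* hypothetical team count */

typedef size_t (*index_fn)(size_t psync_idx, size_t i);

static size_t index_buggy(size_t psync_idx, size_t i)
{
    return (psync_idx + i) * PSYNC_CHUNK_SIZE;
}

static size_t index_fixed(size_t psync_idx, size_t i)
{
    return psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE;
}

/* Count (team, slot) pairs from two different teams whose
 * SHMEM_SYNC_SIZE-element ranges overlap under the formula f. */
static size_t count_cross_team_overlaps(index_fn f)
{
    size_t overlaps = 0;
    for (size_t ta = 0; ta < N_TEAMS; ta++) {
        for (size_t tb = ta + 1; tb < N_TEAMS; tb++) {
            for (size_t ia = 0; ia < N_PSYNCS_PER_TEAM; ia++) {
                for (size_t ib = 0; ib < N_PSYNCS_PER_TEAM; ib++) {
                    size_t a = f(ta, ia);
                    size_t b = f(tb, ib);
                    /* Half-open ranges [a, a+S) and [b, b+S) intersect. */
                    if (a < b + SHMEM_SYNC_SIZE && b < a + SHMEM_SYNC_SIZE)
                        overlaps++;
                }
            }
        }
    }
    return overlaps;
}
```

With four teams, count_cross_team_overlaps(index_buggy) reports one overlap per adjacent pair of teams (three in total), and count_cross_team_overlaps(index_fixed) reports none.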
Suggested Fix
// src/shmem_team.c, line 553 — fast path
- return &shmem_internal_psync_pool[(team->psync_idx + i) * PSYNC_CHUNK_SIZE];
+ return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE];
// src/shmem_team.c, line 570 — quiesce path
- return &shmem_internal_psync_pool[psync];
+ return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE];