Bug: pSync collision between teams in shmem_internal_team_choose_psync() #1225

@bcmIntc

Summary

shmem_internal_team_choose_psync() in src/shmem_team.c contains an incorrect
index formula that causes pSync array collisions between different teams.
Specifically, SHMEM_TEAM_WORLD using slot i=1 and SHMEM_TEAM_SHARED using
slot i=0 both resolve to the same symmetric memory address.

Root Cause

Bug 1 — Fast path (line 553)

return &shmem_internal_psync_pool[(team->psync_idx + i) * PSYNC_CHUNK_SIZE];

PSYNC_CHUNK_SIZE = N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE = 2 * SHMEM_SYNC_SIZE

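The formula (psync_idx + i) adds the slot index to the team index before
multiplying, so each slot advances by a full PSYNC_CHUNK_SIZE instead of by
SHMEM_SYNC_SIZE. Slot 1 of one team therefore lands on slot 0 of the next team
(S = SHMEM_SYNC_SIZE in the table below):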

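| Team   | psync_idx | i | Computed offset | Pool index     |
|--------|-----------|---|-----------------|----------------|
| WORLD  | 0         | 0 | (0+0)×2S        | 0              |
| WORLD  | 0         | 1 | (0+1)×2S        | 2S             |
| SHARED | 1         | 0 | (1+0)×2S        | 2S ← COLLISION |
| SHARED | 1         | 1 | (1+1)×2S        | 4S             |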

The correct formula separates the two dimensions:

return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE];
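The difference between the two formulas can be checked outside the library with a small standalone sketch. The value of SHMEM_SYNC_SIZE below is a placeholder chosen for illustration (the real value is implementation-defined), and the helper names are hypothetical, not functions from src/shmem_team.c.

```c
#include <stddef.h>
#include <stdio.h>

/* Placeholder value for illustration only; the real SHMEM_SYNC_SIZE differs. */
#define SHMEM_SYNC_SIZE   8
#define N_PSYNCS_PER_TEAM 2
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)

/* Pool index as computed by the current fast path (hypothetical helper). */
static size_t fast_path_index_buggy(size_t psync_idx, size_t i)
{
    return (psync_idx + i) * PSYNC_CHUNK_SIZE;
}

/* Pool index with the team and slot dimensions kept separate. */
static size_t fast_path_index_fixed(size_t psync_idx, size_t i)
{
    return psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE;
}

/* Prints both indices for WORLD (psync_idx 0) and SHARED (psync_idx 1). */
static void print_fast_path_indices(void)
{
    for (size_t team = 0; team < 2; team++) {
        for (size_t i = 0; i < N_PSYNCS_PER_TEAM; i++) {
            printf("psync_idx=%zu i=%zu buggy=%zu fixed=%zu\n", team, i,
                   fast_path_index_buggy(team, i),
                   fast_path_index_fixed(team, i));
        }
    }
}
```

Calling print_fast_path_indices() from a main() shows the buggy formula returning the same index for (psync_idx=0, i=1) and (psync_idx=1, i=0), while the fixed formula gives four distinct indices spaced SHMEM_SYNC_SIZE apart.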

Bug 2 — Quiesce path (lines 561/570)

size_t psync = team->psync_idx * SHMEM_SYNC_SIZE;  // wrong multiplier
...
return &shmem_internal_psync_pool[psync];

The returned collective psync pointer uses SHMEM_SYNC_SIZE as the stride instead
of PSYNC_CHUNK_SIZE, placing it at an incorrect (too-low) offset. The fix:

return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE];

Note: the existing psync variable is correctly used for shmem_internal_psync_barrier_pool
indexing on line 562 and should be left unchanged for that purpose.
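To illustrate where the quiesce-path pointer ends up, the sketch below maps a pool index back to the team whose chunk contains it. As before, the SHMEM_SYNC_SIZE value is a placeholder and all helper names are hypothetical; only the two index expressions are taken from this report.

```c
#include <stddef.h>

/* Placeholder value for illustration only; the real SHMEM_SYNC_SIZE differs. */
#define SHMEM_SYNC_SIZE   8
#define N_PSYNCS_PER_TEAM 2
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)

/* Index into shmem_internal_psync_pool as returned by the current quiesce path. */
static size_t quiesce_index_buggy(size_t psync_idx)
{
    return psync_idx * SHMEM_SYNC_SIZE;
}

/* Index into shmem_internal_psync_pool with the chunk-sized stride. */
static size_t quiesce_index_fixed(size_t psync_idx)
{
    return psync_idx * PSYNC_CHUNK_SIZE;
}

/* psync_idx of the team whose chunk contains the given pool index. */
static size_t owning_team(size_t pool_index)
{
    return pool_index / PSYNC_CHUNK_SIZE;
}
```

With N_PSYNCS_PER_TEAM = 2, owning_team(quiesce_index_buggy(1)) evaluates to 0: the pointer handed to the team with psync_idx 1 falls inside the chunk that belongs to psync_idx 0. With the chunk-sized stride, owning_team(quiesce_index_fixed(k)) is k for every k.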

Impact

Any workload that concurrently issues two or more team collectives where one team's
psync_idx + 1 == another team's psync_idx (e.g., WORLD+SHARED, SHARED+NODE, or any
adjacent dynamically-created teams) will silently corrupt the pSync state of both
collectives, leading to hangs or incorrect results.

Suggested Fix

// src/shmem_team.c, line 553 — fast path
- return &shmem_internal_psync_pool[(team->psync_idx + i) * PSYNC_CHUNK_SIZE];
+ return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE];

// src/shmem_team.c, line 570 — quiesce path
- return &shmem_internal_psync_pool[psync];
+ return &shmem_internal_psync_pool[team->psync_idx * PSYNC_CHUNK_SIZE];
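One possible way to guard the fast-path change against regressions is a standalone check that counts how many distinct (team, slot) pairs map to the same pool index. This is only a sketch under assumed constants: the SHMEM_SYNC_SIZE value is a placeholder, and index_fn, index_before_fix, index_after_fix, and count_collisions are hypothetical names, not part of the code base.

```c
#include <stddef.h>

/* Placeholder value for illustration only; the real SHMEM_SYNC_SIZE differs. */
#define SHMEM_SYNC_SIZE   8
#define N_PSYNCS_PER_TEAM 2
#define PSYNC_CHUNK_SIZE  (N_PSYNCS_PER_TEAM * SHMEM_SYNC_SIZE)

typedef size_t (*index_fn)(size_t psync_idx, size_t i);

/* Fast-path index before the suggested fix. */
static size_t index_before_fix(size_t psync_idx, size_t i)
{
    return (psync_idx + i) * PSYNC_CHUNK_SIZE;
}

/* Fast-path index after the suggested fix. */
static size_t index_after_fix(size_t psync_idx, size_t i)
{
    return psync_idx * PSYNC_CHUNK_SIZE + i * SHMEM_SYNC_SIZE;
}

/* Counts pairs of distinct (team, slot) combinations that share a pool index. */
static size_t count_collisions(index_fn fn, size_t n_teams)
{
    size_t total = n_teams * N_PSYNCS_PER_TEAM;
    size_t collisions = 0;

    for (size_t a = 0; a < total; a++) {
        for (size_t b = a + 1; b < total; b++) {
            size_t ia = fn(a / N_PSYNCS_PER_TEAM, a % N_PSYNCS_PER_TEAM);
            size_t ib = fn(b / N_PSYNCS_PER_TEAM, b % N_PSYNCS_PER_TEAM);
            if (ia == ib)
                collisions++;
        }
    }
    return collisions;
}
```

For three teams, count_collisions(index_before_fix, 3) reports one collision per adjacent team pair, while count_collisions(index_after_fix, 3) reports none.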
