Fix NPE in ShardReplicationTask under high shard concurrency (#1660) by monusingh-1 · Pull Request #1663 · opensearch-project/cross-cluster-replication

monusingh-1 · 2026-04-28T08:19:41Z

Description

FollowerClusterStats.stats was backed by a non-thread-safe LinkedHashMap (mutableMapOf()). The singleton is shared across all ShardReplicationTasks (one per shard), and under concurrent puts from ~1024 shards, internal rehashing caused stats[shardId] to return a transient null, which the '!!' assertion in ShardReplicationTask.replicate() and TranslogSequencer turned into a NullPointerException, auto-pausing replication.

Replace mutableMapOf() with ConcurrentHashMap() so concurrent put/get/remove never observe a spurious null. Change var to val since the reference is never reassigned.
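The change described above can be pictured with a minimal sketch. It is not the plugin's actual code: `ShardKey`, `ShardMetric`, `SketchFollowerStats`, and `recordOps` are hypothetical stand-ins for illustration. The only points taken from the description are the swap from `mutableMapOf()` to `ConcurrentHashMap()`, the `var` to `val` change, and the put / `!!` lookup / remove pattern.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-ins; the plugin's real key and metric types are not shown in this PR text.
data class ShardKey(val index: String, val shard: Int)

class ShardMetric {
    val opsWritten = AtomicLong()
}

// A singleton shared by every per-shard task, as the description states.
object SketchFollowerStats {
    // Before the fix this would have looked like:
    //   var stats = mutableMapOf<ShardKey, ShardMetric>()   // LinkedHashMap, not thread-safe
    val stats = ConcurrentHashMap<ShardKey, ShardMetric>()
}

// Mirrors the lookup-with-!! pattern mentioned in the description.
fun recordOps(key: ShardKey, ops: Long): Long {
    // The !! is safe only while this key has been put and not yet removed.
    return SketchFollowerStats.stats[key]!!.opsWritten.addAndGet(ops)
}
```

With a `ConcurrentHashMap`, a lookup for a key that has been put and not removed returns its value even while other threads insert unrelated keys and trigger a resize. A lookup after the key's own removal still returns null, so the `!!` continues to rely on the put-before-use ordering within each task.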

Related Issues

Resolves #1660

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…rch-project#1660)

FollowerClusterStats.stats was backed by a non-thread-safe LinkedHashMap (mutableMapOf()). The singleton is shared across all ShardReplicationTasks (one per shard), and under concurrent puts from ~1024 shards, internal rehashing caused stats[shardId] to return a transient null, which the '!!' assertion in ShardReplicationTask.replicate() and TranslogSequencer turned into a NullPointerException, auto-pausing replication.

Replace mutableMapOf() with ConcurrentHashMap() so concurrent put/get/remove never observe a spurious null. Change var to val since the reference is never reassigned.

Add FollowerClusterStatsTests with a regression test that stresses the shared map under concurrent put/get/remove patterns mirroring the production call sites.

Signed-off-by: Monu Singh <msnghgw@amazon.com>
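The regression test itself is not shown in this conversation. As a rough idea of what a concurrent put/get/remove stress check can look like, here is a sketch under stated assumptions: `countSpuriousNulls`, its parameters, and the pool size of 16 are all hypothetical and do not come from FollowerClusterStatsTests.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: counts how often a worker fails to read back a key it has just put.
fun countSpuriousNulls(map: MutableMap<Int, String>, workers: Int, rounds: Int): Int {
    val pool = Executors.newFixedThreadPool(16)
    val start = CountDownLatch(1)
    val nulls = AtomicInteger()
    try {
        val futures = (0 until workers).map { id ->
            pool.submit(Runnable {
                start.await() // release the running workers together to increase contention
                repeat(rounds) {
                    map[id] = "shard-$id"                        // put, as on task start
                    if (map[id] == null) nulls.incrementAndGet() // get, as in the '!!' lookup
                    map.remove(id)                               // remove, as on task stop
                }
            })
        }
        start.countDown()
        // Propagates any exception thrown inside a worker.
        futures.forEach { it.get(60, TimeUnit.SECONDS) }
    } finally {
        pool.shutdownNow()
    }
    return nulls.get()
}
```

Each worker touches only its own key, so any null it observes between its own put and remove would be a spurious one. Passing a `ConcurrentHashMap<Int, String>()` is expected to yield a count of zero; the sketch is not meant to be run against a plain `mutableMapOf()`, whose behavior under concurrent writes is undefined and may include exceptions as well as nulls.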

csatl · 2026-04-30T07:05:16Z

It should have been mentioned in the issue but it seems like RemoteClusterStats has the same issue.

monusingh-1 wants to merge 1 commit into opensearch-project:main from monusingh-1:fix/1660-follower-cluster-stats-concurrent-map
