Fix NPE in ShardReplicationTask under high shard concurrency (#1660) #1663

Draft
monusingh-1 wants to merge 1 commit into opensearch-project:main from monusingh-1:fix/1660-follower-cluster-stats-concurrent-map
monusingh-1 (Collaborator) commented Apr 28, 2026

Description

FollowerClusterStats.stats was backed by a LinkedHashMap (the result of mutableMapOf()), which is not thread-safe. The FollowerClusterStats singleton is shared by all ShardReplicationTasks (one per shard). Under concurrent puts from ~1024 shards, internal rehashing could make stats[shardId] transiently return null. The '!!' assertions in ShardReplicationTask.replicate() and TranslogSequencer turned that null into a NullPointerException, which auto-paused replication.

Replace mutableMapOf() with ConcurrentHashMap() so concurrent put/get/remove never observe a spurious null. Change var to val since the reference is never reassigned.
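
For illustration, a minimal sketch of the shape of this change. ShardId and FollowerShardMetric here are simplified stand-ins rather than the plugin's real classes, and the helpers registerShard and recordOps are hypothetical names that only mimic the put-then-read pattern described above.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-ins for the plugin's types; field names are assumptions.
data class ShardId(val index: String, val id: Int)

class FollowerShardMetric {
    val opsWritten = AtomicLong()
}

class FollowerClusterStats {
    // Before the change (not thread-safe, mutableMapOf() returns a LinkedHashMap):
    //   var stats: MutableMap<ShardId, FollowerShardMetric> = mutableMapOf()
    // After the change: a concurrent map held in a val, since it is never reassigned.
    val stats: MutableMap<ShardId, FollowerShardMetric> = ConcurrentHashMap()
}

// Hypothetical helper: each shard task registers its own metric entry.
fun registerShard(followerStats: FollowerClusterStats, shardId: ShardId) {
    followerStats.stats[shardId] = FollowerShardMetric()
}

// Hypothetical helper: reads the entry back with the '!!' assertion.
fun recordOps(followerStats: FollowerClusterStats, shardId: ShardId, ops: Long): Long =
    followerStats.stats[shardId]!!.opsWritten.addAndGet(ops)
```

Note that '!!' still throws if the entry has actually been removed. ConcurrentHashMap only guarantees that a key which is present is not reported as missing while other threads modify the map.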

Related Issues

Resolves #1660

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…rch-project#1660)

FollowerClusterStats.stats was backed by a non-thread-safe LinkedHashMap
(mutableMapOf()). The singleton is shared across all ShardReplicationTasks
(one per shard), and under concurrent puts from ~1024 shards, internal
rehashing caused stats[shardId] to return a transient null, which the '!!'
assertion in ShardReplicationTask.replicate() and TranslogSequencer turned
into a NullPointerException, auto-pausing replication.

Replace mutableMapOf() with ConcurrentHashMap() so concurrent put/get/remove
never observe a spurious null. Change var to val since the reference is
never reassigned.

Add FollowerClusterStatsTests with a regression test that stresses the
shared map under concurrent put/get/remove patterns mirroring the
production call sites.
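
A rough sketch of what such a stress check could look like. This is not the FollowerClusterStatsTests code from the commit; countSpuriousNulls, its parameters, and the Int keys standing in for shard ids are all assumptions made for illustration.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper: each task owns one key and repeatedly does put/get/remove
// on the shared map. Returns how often a get of a just-written key saw null.
fun countSpuriousNulls(
    stats: MutableMap<Int, AtomicLong>,
    shards: Int = 1024,
    rounds: Int = 200,
    threads: Int = 16
): Int {
    val pool = Executors.newFixedThreadPool(threads)
    val start = CountDownLatch(1)
    val nulls = AtomicInteger(0)
    try {
        val futures = (0 until shards).map { shard ->
            pool.submit(Runnable {
                start.await() // release all workers together to maximize contention
                repeat(rounds) {
                    stats[shard] = AtomicLong()
                    val metric = stats[shard]
                    if (metric == null) nulls.incrementAndGet() else metric.incrementAndGet()
                    stats.remove(shard)
                }
            })
        }
        start.countDown()
        futures.forEach { it.get(60, TimeUnit.SECONDS) }
    } finally {
        pool.shutdownNow()
    }
    return nulls.get()
}
```

With a ConcurrentHashMap the count should be zero, because no other task touches a given key between its put and get. Passing mutableMapOf() instead is not a deterministic reproduction: the unsynchronized map may return nulls, throw, or appear to work on any given run.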

Signed-off-by: Monu Singh <msnghgw@amazon.com>

csatl commented Apr 30, 2026

This should have been mentioned in the issue, but RemoteClusterStats seems to have the same problem.

Development

Successfully merging this pull request may close these issues.

[BUG] NullPointerException thrown when there are large number of shards

2 participants