IT: assert lastCommitFileName invariant under concurrent flush/refresh#21868
IT: assert lastCommitFileName invariant under concurrent flush/refresh#21868ask-kamal-nayan wants to merge 2 commits into
Conversation
PR Reviewer Guide 🔍(Review updated until commit 9bf6bf4)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 64673c8 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 00094d0
Suggestions up to commit 48a1a59
|
|
❌ Gradle check result for 48a1a59: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
48a1a59 to
00094d0
Compare
|
Persistent review updated to latest commit 00094d0 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #21868 +/- ##
============================================
+ Coverage 73.49% 73.54% +0.05%
- Complexity 75557 75602 +45
============================================
Files 6034 6034
Lines 342604 342604
Branches 49279 49279
============================================
+ Hits 251805 251980 +175
+ Misses 70739 70593 -146
+ Partials 20060 20031 -29 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Kamal Nayan <askkamal@amazon.com>
00094d0 to
64673c8
Compare
|
Persistent review updated to latest commit 64673c8 |
Signed-off-by: Kamal Nayan <askkamal@amazon.com>
|
Persistent review updated to latest commit 9bf6bf4 |
|
❌ Gradle check result for 9bf6bf4: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
What
This PR adds two changes:
CompositeConcurrentIndexingIT#testConcurrentFlushAndRefreshHoldsRefreshLock) for therefreshLockfix inDataFormatAwareEngine.flush().indexSortmismatch inLuceneCommitter.discoverAndTrimUnsafeCommitsthat prevents engine reinit when the safe commit has many Lucene segments.Change 1 — IT for
refreshLockfix inDataFormatAwareEngine.flush()Why
Without holding
refreshLockacross the full flush body, a refresh that fires betweencommitter.commit()andupdateLastCommitInfo()advanceslatestCatalogSnapshotso thejust-committed snapshot keeps a stale
lastCommitFileName. The stale name eventually points to asegments_<N>the deletion sweep has already removed, and surfaces as peer-recoveryNoSuchFileException.The existing IT covered concurrent index/refresh/flush correctness but did not assert the on-disk invariant that the fix protects.
What the test does
CyclicBarriergetLastCommitFileName()is non-null and starts withsegments_store().directory().listAll()verifyIndexconfirms doc counts and segment-generation invariantsValidation
refreshLockfixrefreshLockremovedCommitted snapshot's lastCommitFileName must always match an existing segments_<N> on disk(also surfacesafterRefresh called but not beforeRefresh)|
Test fails deterministically without the fix and passes with it.
Change 2 — Fix
indexSortmismatch inLuceneCommitter.discoverAndTrimUnsafeCommitsWhy
LuceneCommitter.discoverAndTrimUnsafeCommitsopens a tempIndexWriterinAPPENDmode at the safe-commit point with nosetMergePolicy(defaults toTieredMergePolicy) and nosetIndexSort. When the safe commit references ≥ ~10 Lucene segments, the temp writer fires a merge during open/close and produces a segment withsource=mergeandindexSort=null.The new (sorted)
MergeIndexWriterthen fails to open at the polluted commit:IllegalArgumentException: cannot change previous indexSort=null
(from segment=_X(...source=merge,...))
to new indexSort=<sortednumeric: "row_id"> ...
at IndexWriter.validateIndexSort:1225
at LuceneCommitter.:126
The shard cannot start; the index goes RED.
Fix
Pin
NoMergePolicy.INSTANCEon the temp writer — matches the existing guard vanillaStore.trimUnsafeCommitsuses (Store.java:2255). The temp writer's job is to re-anchor thecommit point; it must never merge.
Test
DataFormatAwareReplicaResilienceIT#testEngineReinitDoesNotFailWithIndexSortMismatchfails deterministically without the fix (RED) and passes with it (~60s GREEN).