Add consensus check before selecting a segment for compaction or dele… #17352
Conversation
Codecov Report
@@ Coverage Diff @@
## master #17352 +/- ##
=========================================
Coverage 63.22% 63.22%
- Complexity 1475 1477 +2
=========================================
Files 3147 3147
Lines 187575 187597 +22
Branches 28712 28717 +5
=========================================
+ Hits 118590 118614 +24
+ Misses 59794 59792 -2
Partials 9191 9191
List<ValidDocIdsMetadataInfo> replicaMetadataList = validDocIdsMetadataInfoMap.get(segmentName);

// Check consensus across all replicas before proceeding with any operations
if (!hasValidDocConsensus(segmentName, replicaMetadataList)) {
This will reduce the chances, but is it still possible that a subsequent restart/replace and doSnapshot leads to a segment with 0 validDocIds becoming non-zero again?
We do this check in the controller and immediately fire the delete call for the selected segments. Since this happens inside the same method, the chance of a server restart bringing a segment back up with non-zero validDocIds in that window is very low.
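For context, here is a minimal sketch of what a consensus helper like hasValidDocConsensus could look like; the getTotalValidDocs() accessor on ValidDocIdsMetadataInfo is assumed for illustration, and the actual implementation in this PR may differ.

// Sketch only: assumes ValidDocIdsMetadataInfo exposes getTotalValidDocs();
// the real helper in UpsertCompactMergeTaskGenerator may differ.
private boolean hasValidDocConsensus(String segmentName,
    List<ValidDocIdsMetadataInfo> replicaMetadataList) {
  if (replicaMetadataList == null || replicaMetadataList.isEmpty()) {
    // No replica metadata available, so consensus cannot be established.
    return false;
  }
  long expectedValidDocs = replicaMetadataList.get(0).getTotalValidDocs();
  for (ValidDocIdsMetadataInfo replicaMetadata : replicaMetadataList) {
    if (replicaMetadata.getTotalValidDocs() != expectedValidDocs) {
      // Replicas disagree on the valid doc count; the segment is not safe to
      // compact or delete in this run.
      return false;
    }
  }
  return true;
}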
Force-pushed from abce88c to ee87a38
@tibrewalpratik17 @xiangfu0 can you please take a look at this?
Problem
As discussed in #17337, a segment is incorrectly deleted when any single replica reports zero valid documents, causing permanent data loss during server restarts and async state transitions where replicas have inconsistent validDocIds.
Solution
Added a consensus requirement: a segment is only selected for delete/compact when all replicas agree on its validDoc count (see the sketch below).
This prevents race conditions during OFFLINE→ONLINE transitions and divergence in tie-breaking logic across replicas.
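A rough sketch of the skip-and-warn flow described above; the loop structure and the LOGGER message are illustrative rather than the exact code in UpsertCompactMergeTaskGenerator.

// Illustrative only: segments without replica consensus are skipped before
// any delete/compact decision is made for them.
for (Map.Entry<String, List<ValidDocIdsMetadataInfo>> entry
    : validDocIdsMetadataInfoMap.entrySet()) {
  String segmentName = entry.getKey();
  List<ValidDocIdsMetadataInfo> replicaMetadataList = entry.getValue();
  if (!hasValidDocConsensus(segmentName, replicaMetadataList)) {
    LOGGER.warn("Skipping segment: {} as replicas disagree on validDocIds", segmentName);
    continue;
  }
  // Only segments where every replica reports the same validDoc count reach
  // the delete / compact selection logic.
}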
Changes
Tests
Deployed this change in a test cluster and observed the expected warning messages about skipping a few segments due to validDocId mismatch.