[chore](compaction) remove single replica compaction#63771
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
61db3dd to
d97a724
Compare
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 31818 ms |
TPC-DS: Total hot run time: 172532 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
5a02820 to
65422ce
Compare
|
run buildall |
TPC-H: Total hot run time: 31501 ms |
TPC-DS: Total hot run time: 171768 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
There was a problem hiding this comment.
Review opinion: no blocking issues found.
Critical checkpoint conclusions:
- Goal and proof: The PR removes single replica compaction support and its table/schema/task plumbing, while cleaning affected regression cases. The code changes consistently stop accepting/propagating the removed property and remove BE execution paths. Existing unrelated tests were updated to remove the obsolete property; I did not run the suite in this runner.
- Scope: The implementation is focused on deleting the feature and its references. The broad regression-test edits are mechanical removals of the same property.
- Concurrency: The removed BE background thread and manual compaction path reduce concurrency surface. Remaining compaction locks and manual base/cumulative/full compaction paths keep the existing lock model; I did not find a new lock-order or shared-state hazard.
- Lifecycle/static initialization: No new static/global lifecycle dependency was introduced. Removed single-replica compaction thread lifecycle appears fully detached from
StorageEngine/OlapServerstartup and shutdown. - Configuration: Removed BE configs and FE table property handling for the deleted feature. No new config was added.
- Compatibility: Thrift field ids are left unused and protobuf field 22 is reserved, so old serialized fields can be ignored without reusing ids. FE replay of table property maps can still rebuild other properties; the removed property is no longer materialized or emitted by SHOW CREATE.
- Parallel paths: Create replica, restore/recover replica, rollup/schema-change tasks, partition property updates, report handling, tablet meta, and cloud schema PB conversion were all updated consistently for the removed field.
- Special checks: The MOW-specific rejection for
enable_single_replica_compactionwas removed together with the property parser, so the property is now treated as unsupported instead of as a supported-but-invalid combination. - Test coverage: Dedicated single-replica compaction tests are removed with the feature; remaining affected regression files were adjusted. No additional user focus points were provided.
- Test results: Not run locally in this review environment.
- Observability: Removed metrics and HTTP remote-compaction behavior are consistent with feature removal; remaining compaction status paths retain existing observability.
- Transaction/persistence/data correctness: No transaction visibility, delete-bitmap, or rowset versioning behavior is added. Removing the compaction mode does not alter normal base/cumulative/full compaction correctness paths.
- FE-BE variables: Removed field propagation is reflected across FE task construction, thrift schema, BE task handling, and tablet schema conversion.
- Performance: Removing the single-replica scheduler and config reduces background work; I did not find a new hot-path regression in the remaining code.
User focus response: No additional review focus was provided.
|
skip check_coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
Write down the reason for deletion, and delete the corresponding doc as well. |
Remove the single replica compaction (SRC) feature end-to-end across BE, FE, protocol definitions, and regression tests. SRC let one replica run compaction and others fetch the result; this drops the code path along with its tablet property, thread pool, metrics, HTTP `remote=true` knob, and FE plumbing. - BE: delete `SingleReplicaCompaction` class and unit tests; drop the SRC thread pool, replica-info refresher, tablet hooks, schema field, metrics, and config (`max_single_replica_compaction_threads`, `update_replica_infos_interval_seconds`). - FE: drop `enable_single_replica_compaction` property handling (PropertyAnalyzer, TableProperty, OlapTable, Env, InternalCatalog, SchemaChangeHandler, ModifyTablePropertiesOp) and the corresponding CreateReplicaTask / UpdateTabletMetaInfoTask fields. - Proto/Thrift: mark the removed fields as `reserved` / `// deprecated` to preserve wire compatibility. - Tests: delete SRC-only suites and strip the property from ~130 regression-test groovy/.out files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
65422ce to
dfc334d
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29656 ms |
TPC-DS: Total hot run time: 170047 ms |
|
/review |
There was a problem hiding this comment.
I found one behavior regression that should be fixed before merging.
Critical checkpoint conclusions:
- Goal/test: The PR removes single-replica compaction and associated table/schema/protocol plumbing. The removal is broadly implemented and many obsolete regression cases/properties are updated, but the manual HTTP compaction endpoint still accepts the old remote request semantics silently, and there is no replacement negative test for that.
- Scope/focus: Most changes are direct removals and are focused on the feature deletion.
- Concurrency: No new compaction/thread lifecycle path was added; removed background thread and thread pool cleanup are consistent. No new lock-order issue found.
- Lifecycle/static initialization: No new static/global lifecycle dependency found.
- Configuration: Removed BE configs are no longer consumed; Doris BE config loading ignores unknown file keys, so I did not find a startup blocker from stale config entries.
- Compatibility/storage/protocol: Thrift/protobuf field numbers are left deprecated/reserved and old values are ignored, which is consistent with removing the feature.
- Parallel paths: FE create/alter/create-replica/update-meta paths and BE tablet schema/task paths were checked; no other active single-replica-compaction path found.
- Tests: Obsolete positive tests were removed and expected outputs updated, but the endpoint compatibility/negative behavior for remote=true is missing.
- Observability/performance: No new logging/metric/performance issue found beyond the endpoint behavior below.
- Transaction/persistence/data correctness: No transaction or visible-version correctness issue found in the reviewed removal. Existing persisted table/proto metadata for the removed flag is ignored.
- Other issues: See inline comment.
User focus points: No additional user-provided review focus was present.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests. Doc: apache/doris-website#3870 ### Why remove it The main reason is **correctness risk in peer selection**. A follower replica had to pick a peer holding a "proper" version (`_find_rowset_to_fetch`) and fetch its compacted result, based on replica info that was only refreshed periodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests. Doc: apache/doris-website#3870 ### Why remove it The main reason is **correctness risk in peer selection**. A follower replica had to pick a peer holding a "proper" version (`_find_rowset_to_fetch`) and fetch its compacted result, based on replica info that was only refreshed periodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
Remove the single replica compaction (SRC) feature end-to-end across BE, FE and regression tests.
Doc: apache/doris-website#3870
Why remove it
The main reason is correctness risk in peer selection. A follower replica had to pick a peer holding a "proper" version (
_find_rowset_to_fetch) and fetch its compacted result, based on replica info that was only refreshedperiodically. Because replicas progress through versions independently and this "leader" selection ran against a stale, time-sensitive view of the cluster, the hoice of which peer to fetch from — and which version — was racy and could select a peer whose state no longer matched, leading to subtle inconsistencies.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)