Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RefitLevel using last level as the base level even when these two can be different #11041

Closed
wants to merge 2 commits into from

Conversation

hx235
Copy link
Contributor

@hx235 hx235 commented Dec 17, 2022

Context/Summary: Credit to @cbi42 for finding the bug in https://github.com/facebook/rocksdb/pull/10988/files#r1048004670. Basically as titled.

Test:
Make check

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -1177,7 +1177,9 @@ Status DBImpl::CompactRangeInternal(const CompactRangeOptions& options,
break;
}
if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
final_output_level = cfd->NumberLevels() - 1;
assert(cfd->current() && cfd->current()->storage_info());
VersionStorageInfo* vstorage = cfd->current()->storage_info();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need mutex_ or referenced super version for this? I'm not sure if this is thread safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - added the mutex. (I overlooked this while UT didn't signal anything. Let me do some stress test on this to see if there is anything else popping up)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other concern I realize now is if some other compaction that changed the based level that ran after RunManualCompaction() but before we get base_level here.

@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -13,6 +13,7 @@
* Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and there is another valid backup available.
* Fix L0 file misorder corruption caused by ingesting files of overlapping seqnos with memtable entries' through introducing `epoch_number`. Before the fix, `force_consistency_checks=true` may catch the corruption before it's exposed to readers, in which case writes returning `Status::Corruption` would be expected. Also replace the previous incomplete fix (#5958) to the same corruption with this new and more complete fix.
* Fixed a bug in LockWAL() leading to re-locking mutex (#11020).
* Fixed a bug that `RefitLevel()` can move files from wrong level when base level is different from the last level under universal compaction and `Options::level_compaction_dynamic_level_bytes > 0`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"universal compaction and Options::level_compaction_dynamic_level_bytes > 0" looks like a contradiction. I didn't see universal compaction mentioned in the description, title, or linked discussion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

universal compaction was a typo from me, supposed to be level compaction

@@ -1177,7 +1177,11 @@ Status DBImpl::CompactRangeInternal(const CompactRangeOptions& options,
break;
}
if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
final_output_level = cfd->NumberLevels() - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code assumed the final output level of a CompactRange() with level_compaction_dynamic_level_bytes==true would be the last level. That assumption seems reasonable because dynamic leveled compaction is supposed to fill in levels starting with the last level, and CompactRange() is supposed to compact down to the last level containing any data (or last level if all of L1+ are empty).

However as you found there can be edge cases where the final output level is not the last level. In the unit test it looks like the cause was a DB that already had data in L1+ reopened with num_levels_new > num_levels_old (num_levels_old is the pre-reopen value of num_levels, and num_levels_new is the post-reopen value). Then the target level of the post-reopen CompactRange() is num_levels_old-1, which is not the last level anymore. Another scenario it happens, often surprising/inconveniencing users, is after enabling level_compaction_dynamic_level_bytes. That makes the migration to enable level_compaction_dynamic_level_bytes difficult as even compacting the whole DB through CompactRange() with nullptr endpoints is not enough.

In my opinion, CompactRange() should move the data to the ideal level according to the current config, i.e., not stop at empty level when level_compaction_dynamic_level_bytes==true. I think the type of change in this PR adds complexity and has race conditions (base level can change after RunManualCompaction()). If there's a way to do it without the race conditions it might be worthwhile as a stop-gap solution. The conceptual simplicity of the old code will be missed (by me), though.

Copy link
Member

@cbi42 cbi42 Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompactRange() is supposed to compact down to the last level containing any data

Trying to address this problem in #11375. I think the current behavior is to compact down to the lowest level which contains any data that overlaps with the key range passed into CompactRange(). So there are more cases where the old code would move wrong the wrong level during refit level.

@hx235
Copy link
Contributor Author

hx235 commented Dec 20, 2022

Since we have found out more complexity than I originally anticipated as a quick fix, I will postpone fixing this. Will reopen when I have time and idea on how to fix this.

@hx235 hx235 closed this Dec 20, 2022
@ajkr
Copy link
Contributor

ajkr commented Dec 20, 2022

If we do T117522570 to ensure last level is nonempty during open when there is data in L1+, and possibly another change to make it stay nonempty when there is data in L1+, then the current code would become correct. I think we'd have less LSM shape operational challenges with that approach. OTOH, there's not much time to do that, so quick fixes are nice too if they happen to be simple/correct

facebook-github-bot pushed a commit that referenced this pull request Apr 20, 2023
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in #7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in #11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: #11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
…1375)

Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in facebook#7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in facebook#11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: facebook#11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Signed-off-by: tabokie <[email protected]>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 15, 2023
…1375)

Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in facebook#7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in facebook#11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: facebook#11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Signed-off-by: tabokie <[email protected]>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 16, 2023
…1375)

Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in facebook#7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in facebook#11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: facebook#11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Signed-off-by: tabokie <[email protected]>
tabokie added a commit to tikv/rocksdb that referenced this pull request Aug 23, 2023
* Always allow L0->L1 trivial move during manual compaction (facebook#11375)

Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in facebook#7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in facebook#11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: facebook#11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Signed-off-by: tabokie <[email protected]>

* `CompactRange()` always compacts to bottommost level for leveled compaction (facebook#11468)

Summary:
currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in comment here: https://github.com/cbi42/rocksdb/blob/796f58f42ad1bdbf49e5fcf480763f11583b790e/db/db_impl/db_impl_compaction_flush.cc#L1180C29-L1184).

This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior is that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range. The only exception when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It’s awkward to maintain the same behavior after this PR since we do not compute max_overlapped_level anymore. So the new behavior is similar to kForceOptimized: always does intra-level compaction at the bottommost level, but not including new files generated during this manual compaction.

Several unit tests are updated to work with this new manual compaction behavior.

Pull Request resolved: facebook#11468

Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*`

Reviewed By: ajkr

Differential Revision: D46079619

Pulled By: cbi42

fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f
Signed-off-by: tabokie <[email protected]>

* add check in range interface

Signed-off-by: tabokie <[email protected]>

---------

Signed-off-by: tabokie <[email protected]>
Co-authored-by: Changyu Bi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants