This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger#26651

Merged
yhchiang-sol merged 1 commit into solana-labs:master from yhchiang-sol:delete_file_in_range
Aug 8, 2022

Conversation

@yhchiang-sol
Contributor

@yhchiang-sol yhchiang-sol commented Jul 16, 2022

Problem

LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

Summary of Changes

This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot range is
older than the lowest_cleanup_slot. This allows us to reclaim disk space more often with
fewer IOps because 1) we don't need to wait for RocksDB compactions, and 2) the purge
is done by deleting old files outright instead of iterating over and checking every single
key inside the file via the compaction filter.

To make the above cleanup consistent with our range-deletion and reclaim disk space
more effectively, the cleanup range in LedgerCleanupService::cleanup_ledger has been
changed from (the-oldest-slot-in-db, lowest_cleanup_slot) to (0, lowest_cleanup_slot).
The detailed reason for this is also commented in the code.
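The file-level purge condition can be sketched as a toy model in Rust (`SstFile` and `files_safe_to_delete` are illustrative names for this sketch, not the actual blockstore types):

```rust
// Toy model of the purge decision: an SST file can be deleted outright
// only if its entire slot range is at or below the lowest_cleanup_slot.
#[derive(Debug, PartialEq)]
struct SstFile {
    min_slot: u64,
    max_slot: u64,
}

/// Returns the files that are safe to delete wholesale: every key they
/// contain falls inside [0, lowest_cleanup_slot].
fn files_safe_to_delete(files: &[SstFile], lowest_cleanup_slot: u64) -> Vec<&SstFile> {
    files
        .iter()
        .filter(|f| f.max_slot <= lowest_cleanup_slot)
        .collect()
}

fn main() {
    let files = vec![
        SstFile { min_slot: 0, max_slot: 100 },
        SstFile { min_slot: 90, max_slot: 250 },
        SstFile { min_slot: 260, max_slot: 400 },
    ];
    // With lowest_cleanup_slot = 200, only the first file is entirely old,
    // so only that one can be dropped without a compaction pass.
    let deletable = files_safe_to_delete(&files, 200);
    println!("{} file(s) purgeable", deletable.len());
}
```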

@steviez
Contributor

steviez commented Jul 16, 2022

I had actually noticed this method and been thinking about it too 😁 ; however, the following bit of documentation makes me unsure whether it will work well for our case:

Delete sst files whose keys are entirely in the given range.

We call delete_range(start, end) from ledger_cleanup_service; those deletes are typically for small ranges. As such, I'm not sure that all the keys in a single SST will be contained in the range that delete_range() is called with, so I'm not sure this will actually reclaim space.

The idea that I had been mulling over was to call delete_file_in_range(0, n) immediately after calling delete_range() where n would be the new lowest slot that we cleaned up. ledger_cleanup_service deletes the oldest stuff first, so calling with 0 as the lower bound is consistent with previous behavior.

The only problem case I can think of here is the one you described on one of our calls that follows a sequence like this:

  • Slot n inserted into blockstore
  • Slot n keys are compacted down
    • Let's say enough time passes that this ends up in L2
  • Slot n is included in a delete_range(x, y) call where x < n < y
    • Let's say enough time passes that this ends up in L1
  • Slot n is then included in a delete_file_in_range() call
    • delete_file_in_range() knocks out the tombstone in L1, which makes the items in L2 "available" again (potentially only for a short period) should someone query them

However, recall that delete_file_in_range() does not operate on L0. So, my idea was that the following sequence might work:

  • Call delete_range(0, end) instead of delete_range(start, end); this will issue duplicate deletes, but I don't think that is a problem
  • Call delete_file_in_range(0, end) immediately after the above call; we shouldn't run into consistency issues because the just-issued delete_range(0, end) call should still be in L0, and L0 is skipped.

If there is some subtlety here where the delete_file_in_range(0, end) may not take effect immediately, the other idea was to have delete_file_in_range() run in reverse. That is, do [end_level, ..., 1] instead of [1, ..., end_level] here:
https://github.com/facebook/rocksdb/blob/95ef007adc9365fbefc0f957722a191c1fd7dcd3/db/db_impl/db_impl.cc#L4019
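The hazard sequence above can be modeled with a toy multi-level store (a sketch under stated assumptions: `ToyLsm`, its level layout, and the visibility rule are all illustrative, not RocksDB's actual internals). It shows both the transient-exposure window when a tombstone file at L1 is deleted before the data file at L2, and why a fresh tombstone still sitting in L0 (which delete_file_in_range skips) avoids the problem:

```rust
use std::collections::BTreeMap;

// Toy LSM model: levels[0] is L0 (freshest), deeper levels hold older
// data. tombstones record (level, start, end) range deletes.
struct ToyLsm {
    levels: Vec<BTreeMap<u64, &'static str>>,
    tombstones: Vec<(usize, u64, u64)>,
}

impl ToyLsm {
    fn new(num_levels: usize) -> Self {
        ToyLsm { levels: vec![BTreeMap::new(); num_levels], tombstones: Vec::new() }
    }

    // Mimics delete_file_in_range working one level at a time; each step
    // drops both data and tombstone files at that level within the range.
    fn delete_files_at_level(&mut self, lvl: usize, start: u64, end: u64) {
        self.levels[lvl].retain(|k, _| !(start <= *k && *k < end));
        self.tombstones
            .retain(|&(tl, s, e)| !(tl == lvl && s >= start && e <= end));
    }

    // A key is visible only if present at some level and not masked by a
    // tombstone at the same or a shallower (fresher) level.
    fn get(&self, key: u64) -> Option<&'static str> {
        for (lvl, data) in self.levels.iter().enumerate() {
            if let Some(v) = data.get(&key) {
                let masked = self
                    .tombstones
                    .iter()
                    .any(|&(tl, s, e)| tl <= lvl && s <= key && key < e);
                return if masked { None } else { Some(*v) };
            }
        }
        None
    }
}

fn main() {
    // Hazard: slot 5's data compacted down to L2, its tombstone aged to L1.
    let mut db = ToyLsm::new(3);
    db.levels[2].insert(5, "slot-5");
    db.tombstones.push((1, 0, 10));
    assert_eq!(db.get(5), None); // masked by the L1 tombstone

    db.delete_files_at_level(1, 0, 10); // tombstone file dropped first...
    assert_eq!(db.get(5), Some("slot-5")); // ...data transiently visible!
    db.delete_files_at_level(2, 0, 10);
    assert_eq!(db.get(5), None);

    // Safe variant: the tombstone is fresh and still in L0, which
    // delete_file_in_range never touches, so there is no exposure window.
    let mut db = ToyLsm::new(3);
    db.levels[2].insert(5, "slot-5");
    db.tombstones.push((0, 0, 10));
    db.delete_files_at_level(1, 0, 10);
    db.delete_files_at_level(2, 0, 10);
    assert_eq!(db.get(5), None);
}
```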

@yhchiang-sol
Contributor Author

yhchiang-sol commented Jul 16, 2022

Oh wow, that's a quick response! I'm still actually baking the PR and just updated it.

The updated PR does the range delete first, then the delete_file_in_range, as we want to make sure the write batch went through first -- I should probably check the result of the write_batch before issuing delete_file_in_range. I will update the PR.

However, recall that delete_file_in_range() does not operate on L0. So, my idea was that the following sequence might work:
Call delete_range(0, end) instead of delete_range(start, end), this will have duplicate deletes but don't think that is a problem
Call delete_file_in_range(0, end) immediately after above call; we shouldn't run into consistency issue because the issued delete_range(0, end) call should be in L0, and L0 is skipped.

Yes, delete_file_in_range works best when we pass (nullptr, end_slot), and you're also right about the L0 assumption: we won't miss any tombstones because they are either in the memtable (most likely) or in L0.

My follow-up question is: is it always safe to do delete_file_in_range(nullptr, end_slot) after every range_delete(start_slot, end_slot)? If the answer is no, then we probably want to find a better place where we can safely do delete_file_in_range(nullptr, end_slot).

@steviez
Contributor

steviez commented Jul 17, 2022

Oh wow, that's a quick response!

Ha, just happened to see it near the top of the PRs tab

My follow-up question is: is it always safe to do delete_file_in_range(nullptr, end_slot) after every range_delete(start_slot, end_slot)? If the answer is no, then we probably want to find a better place where we can safely do delete_file_in_range(nullptr, end_slot).

No, I don't think so. One example that comes to mind immediately is when we purge a slot from replay_stage; in this case, we're typically calling range_delete(s, s) for some slot s, where we keep some number of slots before s.

I think ledger_cleanup_service is the only place in validator code where we should be calling delete_file_in_range(nullptr, end_slot); delete_file_in_range() might provide a speedup if the user runs compaction as part of solana-ledger-tool purge too. However, it wouldn't be safe to assume a lower bound of 0 in solana-ledger-tool

@yhchiang-sol yhchiang-sol changed the title (LedgerStore) Delete sst files whose key-range is completely within a delete_range_cf Free-up disk space from files older than the lowest_cleanup_slot in LedgerCleanupService Jul 25, 2022
@yhchiang-sol yhchiang-sol changed the title Free-up disk space from files older than the lowest_cleanup_slot in LedgerCleanupService Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger Jul 25, 2022
Contributor Author

@yhchiang-sol yhchiang-sol left a comment


Made the following changes:

  • LedgerCleanupService::cleanup_ledger becomes the only place that will issue purge_files_in_range.
  • The cleanup range in LedgerCleanupService::cleanup_ledger is changed from
    (the_oldest_slot_in_db, lowest_slot_to_purge) to (0, lowest_slot_to_purge).
    (i.e., anything older than or equal to lowest_slot_to_purge will be purged.)
  • blockstore::run_purge will run delete_file_in_range if the starting slot is 0.
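The dispatch described in the last bullet can be sketched as follows (the helper names `delete_range` and `purge_files_in_range` mirror the PR description; the real blockstore code differs in detail). The file-level purge is issued only when the purge starts at slot 0, and only after the range-delete write batch succeeds:

```rust
// Stubs that record calls so the ordering and the slot-0 guard are visible.
fn delete_range(log: &mut Vec<String>, from: u64, to: u64) -> Result<(), String> {
    log.push(format!("delete_range({from}, {to})"));
    Ok(())
}

fn purge_files_in_range(log: &mut Vec<String>, from: u64, to: u64) -> Result<(), String> {
    log.push(format!("purge_files_in_range({from}, {to})"));
    Ok(())
}

fn run_purge(log: &mut Vec<String>, from_slot: u64, to_slot: u64) -> Result<(), String> {
    // Always write the range-delete tombstones first and check the result.
    delete_range(log, from_slot, to_slot)?;
    // Whole-file deletion is only safe when nothing below to_slot must be
    // kept, i.e. when the purge range starts at slot 0.
    if from_slot == 0 {
        purge_files_in_range(log, from_slot, to_slot)?;
    }
    Ok(())
}

fn main() {
    let mut log = Vec::new();
    run_purge(&mut log, 0, 500).unwrap();
    assert_eq!(log, vec!["delete_range(0, 500)", "purge_files_in_range(0, 500)"]);

    let mut log = Vec::new();
    run_purge(&mut log, 100, 500).unwrap();
    assert_eq!(log, vec!["delete_range(100, 500)"]); // no file purge mid-range
}
```

This keeps single-slot purges (e.g. from replay_stage) on the tombstone-only path, matching the safety concern discussed earlier in the thread.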

@steviez
Contributor

steviez commented Jul 25, 2022

Will circle back for a proper review, but it might be nice to get some runtime on a validator to see if this functions as expected. Two things I'd expect:

  • Actual ledger disk space is closer to our estimate, as shreds that are below the lowest cleanup slot actually get reclaimed
  • Hypothetically, this should be much more efficient than background compactions, so maybe a drop in I/O and/or CPU
    • This one might be harder to quantify looking at the overall validator; maybe there is something in the RocksDB logs though?

@yhchiang-sol
Contributor Author

Currently testing. Will share the result once I have enough data points.

@yhchiang-sol
Contributor Author

Re-running the experiments with the minimum --limit-ledger-size 50000000 to speed up the iteration.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Jul 30, 2022

The experiment with --limit-ledger-size 50000000 shows that this PR can effectively cap the ledger disk usage. I will later repeat the experiment with the default ledger size to obtain the new suggested ledger size.

Below are two different recent data points showing that the ledger size (the green one) consistently stays at ~230GB with this PR, while it continues growing on the master branch (the purple one):

Screen Shot 2022-07-30 at 5 19 32 PM

Screen Shot 2022-07-30 at 5 19 58 PM

The validator log that includes the purge and disk utilization information also shows the same thing: the one with this PR has disk utilization capped at the same level while the disk utilization on the master branch keeps growing.

master branch

validator.log:[2022-07-30T05:45:33.042293526Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143713895, last_purge_slot=143713382, purge_interval=512, disk_utilization=Ok(238217440078)
...
validator.log:[2022-07-30T08:23:43.153597380Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143727840, last_purge_slot=143727327, purge_interval=512, disk_utilization=Ok(269657424050)
...
validator.log:[2022-07-30T09:12:38.505896309Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143731947, last_purge_slot=143731434, purge_interval=512, disk_utilization=Ok(279697104792)

with this PR (the ERROR in the log is my debug log):

validator.log.1:[2022-07-30T03:51:37.399736445Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143707215, last_purge_slot=143706702, purge_interval=512, disk_utilization=Ok(196294712338)
validator.log.1:[2022-07-30T03:51:38.383047511Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143653371
validator.log.1:[2022-07-30T03:51:38.668689868Z INFO  solana_core::ledger_cleanup_service] purge_slots took 285ms
...
validator.log.1:[2022-07-30T04:32:07.071533348Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143711325, last_purge_slot=143710806, purge_interval=512, disk_utilization=Ok(192974163896)
validator.log.1:[2022-07-30T04:32:08.036080146Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143657660
validator.log.1:[2022-07-30T04:32:08.236217942Z INFO  solana_core::ledger_cleanup_service] purge_slots took 200ms
...
last_root=143713382, last_purge_slot=143712869, purge_interval=512, disk_utilization=Ok(195494184009)
validator.log.1:[2022-07-30T04:53:01.287423797Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143659607
validator.log.1:[2022-07-30T04:53:01.419216524Z INFO  solana_core::ledger_cleanup_service] purge_slots took 131ms

@yhchiang-sol
Contributor Author

Repeated the experiment with --limit-ledger-size 50000000 for an extended period of time. The build with this PR stabilizes its ledger disk space usage at ~210GB, while the master branch stabilizes at ~320GB.

In other words, this PR reduces ledger disk space by ~33%.

Below is the disk space usage graph of the two builds (sorry that I failed to start two instances simultaneously, but the experiment should still be valid.)
Screen Shot 2022-08-02 at 3 09 22 AM

I will go ahead and run this PR with the default --limit-ledger-size to obtain the suggested ledger disk space.

@steviez
Contributor

steviez commented Aug 1, 2022

Repeated the experiment with --limit-ledger-size 50000000 for an extended period of time. The one with this PR stabilizes its ledger disk space usage at ~210GB, while the one with the master branch is stabilized at ~320GB.

In other words, this PR is able to reduce ~33% of the ledger disk space.

Nice, definitely a meaningful improvement!

Below is the disk space usage graph of the two builds (sorry that I failed to start two instances simultaneously, but the experiment should still be valid.)

I agree, results valid despite offset start times.

I will go ahead and run this PR with the default --limit-ledger-size to obtain the suggested ledger disk space.

Dumping some DMs: the numbers (210/320 GB) seemed a bit large. Upon further inspection, we realized that this number includes the accounts index. The accounts index is currently ~55 GB on mainnet and, by default, lives at .../ledger/accounts_index/. So, it is being included in the Google dashboard numbers, but not in the metrics that only report on the rocksdb subdirectory.

With this in mind, the above numbers are roughly:
210 GB - 55 GB = 155 GB of rocksdb for this PR
320 GB - 55 GB = 265 GB of rocksdb for master

That works out to > 40%, so even better than the 33% number that was previously mentioned 😄
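A quick sanity check of that arithmetic (the sizes are the approximate figures quoted above):

```rust
fn main() {
    let accounts_index_gb = 55.0;
    let pr_rocksdb = 210.0 - accounts_index_gb;     // 155 GB
    let master_rocksdb = 320.0 - accounts_index_gb; // 265 GB
    // Reduction of the rocksdb-only footprint, as a percentage.
    let reduction_pct = 100.0 * (master_rocksdb - pr_rocksdb) / master_rocksdb;
    println!("rocksdb-only reduction: {reduction_pct:.1}%"); // ~41.5%
    assert!(reduction_pct > 40.0);
}
```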

@yhchiang-sol
Contributor Author

yhchiang-sol commented Aug 5, 2022

Here are the numbers for the default --limit-ledger-size.

The ledger size is capped at ~565GB (607402138084 bytes), which is slightly smaller than what we previously predicted.

[2022-08-05T14:36:18.200163844Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144594041, last_purge_slot=144593528, purge_interval=512, disk_utilization=Ok(607209149245)
[2022-08-05T14:36:21.816699128Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144374054
[2022-08-05T14:36:21.916079447Z INFO  solana_core::ledger_cleanup_service] purge_slots took 99ms
[2022-08-05T14:41:34.767795051Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144594554, last_purge_slot=144594041, purge_interval=512, disk_utilization=Ok(605534535878)
[2022-08-05T14:41:38.278674488Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144374551
[2022-08-05T14:41:38.337789842Z INFO  solana_core::ledger_cleanup_service] purge_slots took 59ms
[2022-08-05T14:47:24.315633976Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144595067, last_purge_slot=144594554, purge_interval=512, disk_utilization=Ok(607402138084)
[2022-08-05T14:47:27.756434726Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144375033
[2022-08-05T14:47:27.857260277Z INFO  solana_core::ledger_cleanup_service] purge_slots took 100ms

Together with the accounts index, it is ~618GB.

Screen Shot 2022-08-05 at 10 54 37 PM

As for the suggested ledger disk size under the default settings, I previously discussed it with @steviez, and we think 750GB would be a good number, as it gives us ~20% additional buffer to handle spikes and some room for the growth of the accounts index.

@yhchiang-sol yhchiang-sol requested a review from steviez August 5, 2022 15:03
steviez
steviez previously approved these changes Aug 8, 2022
Contributor

@steviez steviez left a comment


Looks good to me once you fix the typo. I think this was already the plan, but we can update docs in a different PR.

Additionally, I think this might be a candidate for backport, definitely for v1.11, and maybe even for v1.10. Arguably, this is a bug fix, as we allow the ledger to expand to a much larger footprint than what the docs state. I saw some chatter in Discord about backports / audits; I don't think there is any official policy change, but it might be good to post a question in Discord and check

@mergify mergify Bot dismissed steviez’s stale review August 8, 2022 13:33

Pull request has been modified.

@yhchiang-sol yhchiang-sol merged commit 99ef218 into solana-labs:master Aug 8, 2022
xiangzhu70 pushed a commit to xiangzhu70/solana that referenced this pull request Aug 17, 2022
…ce::cleanup_ledger (solana-labs#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.
@jeffwashington
Contributor

fwiw, I ran bench-tps on this and did not see any significant difference.

@yhchiang-sol
Contributor Author

fwiw, I ran bench-tps on this and did not see any significant difference.

Thanks for the information, @jeffwashington. The change requires the blockstore to hit --limit-ledger-size before any difference shows up, as the PR improves how the blockstore is cleaned up once it reaches the configured size limit.

I will experiment with bench-tps a bit.

mergify Bot pushed a commit that referenced this pull request Aug 22, 2022
…ce::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)
mergify Bot added a commit that referenced this pull request Aug 22, 2022
…ce::cleanup_ledger (backport #26651) (#27304)

Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)

Co-authored-by: Yueh-Hsuan Chiang <93241502+yhchiang-sol@users.noreply.github.com>
yhchiang-sol added a commit that referenced this pull request Sep 16, 2022
#### Problem
Previously, before #26651, our LedgerCleanupService needed RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which files to pick,
we relied on the 1-day compaction period so that each file would be forced to be
compacted once a day, allowing us to reclaim ledger disk space in time. The downside
is that each ledger file was rewritten once per day.

#### Summary of Changes
As #26651 makes LedgerCleanupService actively delete those files whose entire slot range
is older than both --limit-ledger-size and the current root, we can remove the 1-day
compaction period and get rid of the daily ledger file rewrite.

The results on mainnet-beta shows that this PR reduces ~20% write-bytes-per-second
and reduces ~50% read-bytes-per-second on ledger disk.
@steviez steviez added the v1.13 label Nov 2, 2022
mergify Bot pushed a commit that referenced this pull request Nov 2, 2022
…ce::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)
steviez pushed a commit that referenced this pull request Nov 10, 2022
…ce::cleanup_ledger (backport #26651) (#28721)

* Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)

* Fixup

Co-authored-by: Yueh-Hsuan Chiang <93241502+yhchiang-sol@users.noreply.github.com>
Co-authored-by: Steven Czabaniuk <steven@solana.com>
steviez pushed a commit to steviez/solana that referenced this pull request Jul 19, 2023
Periodic compaction was previously disabled on all columns in solana-labs#27571 in
favor of the delete_file_in_range() approach that solana-labs#26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage, namely the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space was being reclaimed from them. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
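The per-column decision described in this commit can be sketched as follows (the column names come from the text above; the enum and helper are illustrative, not the real blockstore column types):

```rust
// Columns whose key leads with the slot can be reclaimed by whole-file
// deletion; columns where the slot is a non-primary index still need
// periodic compaction to reclaim space.
#[derive(Debug, PartialEq)]
enum Column {
    SlotMeta,           // slot is the primary index
    ShredData,          // slot is the primary index
    TransactionStatus,  // slot present, but not the primary index
    AddressSignatures,  // slot present, but not the primary index
}

fn needs_periodic_compaction(col: &Column) -> bool {
    matches!(col, Column::TransactionStatus | Column::AddressSignatures)
}

fn main() {
    assert!(needs_periodic_compaction(&Column::TransactionStatus));
    assert!(needs_periodic_compaction(&Column::AddressSignatures));
    assert!(!needs_periodic_compaction(&Column::SlotMeta));
    assert!(!needs_periodic_compaction(&Column::ShredData));
}
```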
steviez pushed a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
mergify Bot pushed a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.

(cherry picked from commit d73fa1b)