This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger#26651

Merged
yhchiang-sol merged 1 commit into solana-labs:master from yhchiang-sol:delete_file_in_range
Aug 8, 2022

Conversation

@yhchiang-sol
Contributor

@yhchiang-sol yhchiang-sol commented Jul 16, 2022

Problem

LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

Summary of Changes

This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot range is
older than the lowest_cleanup_slot. This allows us to reclaim disk space more often with
fewer IOps because 1) we don't need to wait for RocksDB compactions, and 2) the purge
is done by deleting old files outright instead of iterating over and checking every single
key inside the file via the compaction filter.

To make the above cleanup consistent with our range-deletion and reclaim disk space
more effectively, the cleanup range in LedgerCleanupService::cleanup_ledger has been
changed from (the-oldest-slot-in-db, lowest_cleanup_slot) to (0, lowest_cleanup_slot).
The detailed reason for this is also commented in the code.
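The file-level purge condition can be sketched as a toy model in Rust (`SstFile` and `files_safe_to_delete` are illustrative names for this sketch, not the actual blockstore types):

```rust
// Toy model of the purge decision: an SST file can be deleted outright
// only if its entire slot range is at or below the lowest_cleanup_slot.
#[derive(Debug, PartialEq)]
struct SstFile {
    min_slot: u64,
    max_slot: u64,
}

/// Returns the files that are safe to delete wholesale: every key they
/// contain falls inside [0, lowest_cleanup_slot].
fn files_safe_to_delete(files: &[SstFile], lowest_cleanup_slot: u64) -> Vec<&SstFile> {
    files
        .iter()
        .filter(|f| f.max_slot <= lowest_cleanup_slot)
        .collect()
}

fn main() {
    let files = vec![
        SstFile { min_slot: 0, max_slot: 100 },
        SstFile { min_slot: 90, max_slot: 250 },
        SstFile { min_slot: 260, max_slot: 400 },
    ];
    // With lowest_cleanup_slot = 200, only the first file is entirely old,
    // so only that one can be dropped without a compaction pass.
    let deletable = files_safe_to_delete(&files, 200);
    println!("{} file(s) purgeable", deletable.len());
}
```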

@steviez
Contributor

steviez commented Jul 16, 2022

I had actually noticed this method and been thinking about it too 😁 ; however, the following bit of documentation makes me unsure whether it will work well for our case:

Delete sst files whose keys are entirely in the given range.

We call delete_range(start, end) from ledger_cleanup_service; those deletes are typically for small ranges. As such, I'm not sure that all the keys in a single SST will be contained in the range that delete_range() is called with, so I'm not sure this will actually reclaim space.

The idea that I had been mulling over was to call delete_file_in_range(0, n) immediately after calling delete_range() where n would be the new lowest slot that we cleaned up. ledger_cleanup_service deletes the oldest stuff first, so calling with 0 as the lower bound is consistent with previous behavior.

The only problem case I can think of here is the one you described on one of our calls that follows a sequence like this:

  • Slot n inserted into blockstore
  • Slot n keys are compacted down
    • Let's say enough time passes that this ends up in L2
  • Slot n is included in a delete_range(x, y) call where x < n < y
    • Let's say enough time passes that this ends up in L1
  • Slot n is then included in a delete_file_in_range() call
    • delete_file_in_range() knocks out the tombstone in L1, which makes the items in L2 "available" again (potentially only for a short period) should someone query them

However, recall that delete_file_in_range() does not operate on L0. So, my idea was that the following sequence might work:

  • Call delete_range(0, end) instead of delete_range(start, end); this will issue duplicate deletes, but I don't think that is a problem
  • Call delete_file_in_range(0, end) immediately after the above call; we shouldn't run into consistency issues because the just-issued delete_range(0, end) call should still be in L0, and L0 is skipped.

If there is some subtlety here where the delete_file_in_range(0, end) may not take effect immediately, the other idea was to have delete_file_in_range() run in reverse. That is, do [end_level, ..., 1] instead of [1, ..., end_level] here:
https://github.com/facebook/rocksdb/blob/95ef007adc9365fbefc0f957722a191c1fd7dcd3/db/db_impl/db_impl.cc#L4019
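The hazard sequence above can be modeled with a toy multi-level store (a sketch under stated assumptions: `ToyLsm`, its level layout, and the visibility rule are all illustrative, not RocksDB's actual internals). It shows both the transient-exposure window when a tombstone file at L1 is deleted before the data file at L2, and why a fresh tombstone still sitting in L0 (which delete_file_in_range skips) avoids the problem:

```rust
use std::collections::BTreeMap;

// Toy LSM model: levels[0] is L0 (freshest), deeper levels hold older
// data. tombstones record (level, start, end) range deletes.
struct ToyLsm {
    levels: Vec<BTreeMap<u64, &'static str>>,
    tombstones: Vec<(usize, u64, u64)>,
}

impl ToyLsm {
    fn new(num_levels: usize) -> Self {
        ToyLsm { levels: vec![BTreeMap::new(); num_levels], tombstones: Vec::new() }
    }

    // Mimics delete_file_in_range working one level at a time; each step
    // drops both data and tombstone files at that level within the range.
    fn delete_files_at_level(&mut self, lvl: usize, start: u64, end: u64) {
        self.levels[lvl].retain(|k, _| !(start <= *k && *k < end));
        self.tombstones
            .retain(|&(tl, s, e)| !(tl == lvl && s >= start && e <= end));
    }

    // A key is visible only if present at some level and not masked by a
    // tombstone at the same or a shallower (fresher) level.
    fn get(&self, key: u64) -> Option<&'static str> {
        for (lvl, data) in self.levels.iter().enumerate() {
            if let Some(v) = data.get(&key) {
                let masked = self
                    .tombstones
                    .iter()
                    .any(|&(tl, s, e)| tl <= lvl && s <= key && key < e);
                return if masked { None } else { Some(*v) };
            }
        }
        None
    }
}

fn main() {
    // Hazard: slot 5's data compacted down to L2, its tombstone aged to L1.
    let mut db = ToyLsm::new(3);
    db.levels[2].insert(5, "slot-5");
    db.tombstones.push((1, 0, 10));
    assert_eq!(db.get(5), None); // masked by the L1 tombstone

    db.delete_files_at_level(1, 0, 10); // tombstone file dropped first...
    assert_eq!(db.get(5), Some("slot-5")); // ...data transiently visible!
    db.delete_files_at_level(2, 0, 10);
    assert_eq!(db.get(5), None);

    // Safe variant: the tombstone is fresh and still in L0, which
    // delete_file_in_range never touches, so there is no exposure window.
    let mut db = ToyLsm::new(3);
    db.levels[2].insert(5, "slot-5");
    db.tombstones.push((0, 0, 10));
    db.delete_files_at_level(1, 0, 10);
    db.delete_files_at_level(2, 0, 10);
    assert_eq!(db.get(5), None);
}
```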

@yhchiang-sol
Contributor Author

yhchiang-sol commented Jul 16, 2022

Oh wow, that's a quick response! I'm still actually baking the PR and just updated it.

The updated PR does the range delete first, then the delete_file_in_range, as we want to make sure the write batch went through first -- I should probably check the result of the write_batch before issuing delete_file_in_range. I will update the PR.

However, recall that delete_file_in_range() does not operate on L0. So, my idea was that the following sequence might work:
Call delete_range(0, end) instead of delete_range(start, end), this will have duplicate deletes but don't think that is a problem
Call delete_file_in_range(0, end) immediately after above call; we shouldn't run into consistency issue because the issued delete_range(0, end) call should be in L0, and L0 is skipped.

Yes, delete_file_in_range works best when we pass (nullptr, end_slot), and you're also right about the L0 assumption: we won't miss any tombstones because they are either in the memtable (most likely) or in L0.

My follow-up question is: is it always safe to do delete_file_in_range(nullptr, end_slot) after every range_delete(start_slot, end_slot)? If the answer is no, then we probably want to find a better place where we can safely do delete_file_in_range(nullptr, end_slot).

@steviez
Contributor

steviez commented Jul 17, 2022

Oh wow, that's a quick response!

Ha, just happened to see it near the top of the PRs tab

My follow-up question is: is it always safe to do delete_file_in_range(nullptr, end_slot) after every range_delete(start_slot, end_slot)? If the answer is no, then we probably want to find a better place where we can safely do delete_file_in_range(nullptr, end_slot).

No, I don't think so. One example that comes to mind immediately is when we purge a slot from replay_stage; in this case, we're typically calling range_delete(s, s) for some slot s, where we keep some number of slots before s.

I think ledger_cleanup_service is the only place in validator code where we should be calling delete_file_in_range(nullptr, end_slot); delete_file_in_range() might provide a speedup if the user runs compaction as part of solana-ledger-tool purge too. However, it wouldn't be safe to assume a lower bound of 0 in solana-ledger-tool

@yhchiang-sol yhchiang-sol changed the title (LedgerStore) Delete sst files whose key-range is completely within a delete_range_cf Free-up disk space from files older than the lowest_cleanup_slot in LedgerCleanupService Jul 25, 2022
@yhchiang-sol yhchiang-sol changed the title Free-up disk space from files older than the lowest_cleanup_slot in LedgerCleanupService Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger Jul 25, 2022
Contributor Author

@yhchiang-sol yhchiang-sol left a comment


Made the following changes:

  • LedgerCleanupService::cleanup_ledger becomes the only place that will issue purge_files_in_range.
  • The cleanup range in LedgerCleanupService::cleanup_ledger is changed from
    (the_oldest_slot_in_db, lowest_slot_to_purge) to (0, lowest_slot_to_purge).
    (i.e., anything older than or equal to lowest_slot_to_purge will be purged.)
  • blockstore::run_purge will run delete_file_in_range if the starting slot is 0.
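The dispatch described in the last bullet can be sketched as follows (the helper names `delete_range` and `purge_files_in_range` mirror the PR description; the real blockstore code differs in detail). The file-level purge is issued only when the purge starts at slot 0, and only after the range-delete write batch succeeds:

```rust
// Stubs that record calls so the ordering and the slot-0 guard are visible.
fn delete_range(log: &mut Vec<String>, from: u64, to: u64) -> Result<(), String> {
    log.push(format!("delete_range({from}, {to})"));
    Ok(())
}

fn purge_files_in_range(log: &mut Vec<String>, from: u64, to: u64) -> Result<(), String> {
    log.push(format!("purge_files_in_range({from}, {to})"));
    Ok(())
}

fn run_purge(log: &mut Vec<String>, from_slot: u64, to_slot: u64) -> Result<(), String> {
    // Always write the range-delete tombstones first and check the result.
    delete_range(log, from_slot, to_slot)?;
    // Whole-file deletion is only safe when nothing below to_slot must be
    // kept, i.e. when the purge range starts at slot 0.
    if from_slot == 0 {
        purge_files_in_range(log, from_slot, to_slot)?;
    }
    Ok(())
}

fn main() {
    let mut log = Vec::new();
    run_purge(&mut log, 0, 500).unwrap();
    assert_eq!(log, vec!["delete_range(0, 500)", "purge_files_in_range(0, 500)"]);

    let mut log = Vec::new();
    run_purge(&mut log, 100, 500).unwrap();
    assert_eq!(log, vec!["delete_range(100, 500)"]); // no file purge mid-range
}
```

This keeps single-slot purges (e.g. from replay_stage) on the tombstone-only path, matching the safety concern discussed earlier in the thread.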

@steviez
Contributor

steviez commented Jul 25, 2022

Will circle back for a proper review, but it might be nice to get some runtime on a validator to see if this functions as expected. Two things I'd expect:

  • Actual ledger disk space is closer to our estimate, as shreds that are below the lowest cleanup slot actually get reclaimed
  • Hypothetically, this should be much more efficient than background compactions, so maybe a drop in I/O and/or CPU
    • This one might be harder to quantify looking at the overall validator; maybe there is something in the RocksDB logs though?

@yhchiang-sol
Contributor Author

Currently testing. Will share the result once I have enough data points.

@yhchiang-sol
Contributor Author

Re-running the experiments with the minimum --limit-ledger-size 50000000 to speed up the iteration.

@yhchiang-sol
Contributor Author

yhchiang-sol commented Jul 30, 2022

The experiment with --limit-ledger-size 50000000 shows that this PR can effectively cap the ledger disk usage. I will later repeat the experiment with the default ledger size to obtain the new suggested ledger size.

Below are two different recent data points showing that the ledger size (the green one) consistently stays at ~230GB with this PR, while it continues growing on the master branch (the purple one):

Screen Shot 2022-07-30 at 5 19 32 PM

Screen Shot 2022-07-30 at 5 19 58 PM

The validator log that includes the purge and disk utilization information also shows the same thing: the one with this PR has disk utilization capped at the same level while the disk utilization on the master branch keeps growing.

master branch

validator.log:[2022-07-30T05:45:33.042293526Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143713895, last_purge_slot=143713382, purge_interval=512, disk_utilization=Ok(238217440078)
...
validator.log:[2022-07-30T08:23:43.153597380Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143727840, last_purge_slot=143727327, purge_interval=512, disk_utilization=Ok(269657424050)
...
validator.log:[2022-07-30T09:12:38.505896309Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143731947, last_purge_slot=143731434, purge_interval=512, disk_utilization=Ok(279697104792)

with this PR (the ERROR in the log is my debug log):

validator.log.1:[2022-07-30T03:51:37.399736445Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143707215, last_purge_slot=143706702, purge_interval=512, disk_utilization=Ok(196294712338)
validator.log.1:[2022-07-30T03:51:38.383047511Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143653371
validator.log.1:[2022-07-30T03:51:38.668689868Z INFO  solana_core::ledger_cleanup_service] purge_slots took 285ms
...
validator.log.1:[2022-07-30T04:32:07.071533348Z INFO  solana_core::ledger_cleanup_service] purge: last_root=143711325, last_purge_slot=143710806, purge_interval=512, disk_utilization=Ok(192974163896)
validator.log.1:[2022-07-30T04:32:08.036080146Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143657660
validator.log.1:[2022-07-30T04:32:08.236217942Z INFO  solana_core::ledger_cleanup_service] purge_slots took 200ms
...
last_root=143713382, last_purge_slot=143712869, purge_interval=512, disk_utilization=Ok(195494184009)
validator.log.1:[2022-07-30T04:53:01.287423797Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 143659607
validator.log.1:[2022-07-30T04:53:01.419216524Z INFO  solana_core::ledger_cleanup_service] purge_slots took 131ms

@yhchiang-sol
Contributor Author

Repeated the experiment with --limit-ledger-size 50000000 for an extended period of time. The build with this PR stabilizes its ledger disk space usage at ~210GB, while the master branch stabilizes at ~320GB.

In other words, this PR reduces ledger disk space by ~33%.

Below is the disk space usage graph of the two builds (sorry that I failed to start two instances simultaneously, but the experiment should still be valid.)
Screen Shot 2022-08-02 at 3 09 22 AM

I will go ahead and run this PR with the default --limit-ledger-size to obtain the suggested ledger disk space.

@steviez
Contributor

steviez commented Aug 1, 2022

Repeated the experiment with --limit-ledger-size 50000000 for an extended period of time. The one with this PR stabilizes its ledger disk space usage at ~210GB, while the one with the master branch is stabilized at ~320GB.

In other words, this PR is able to reduce ~33% of the ledger disk space.

Nice, definitely a meaningful improvement!

Below is the disk space usage graph of the two builds (sorry that I failed to start two instances simultaneously, but the experiment should still be valid.)

I agree, results valid despite offset start times.

I will go ahead and run this PR with the default --limit-ledger-size to obtain the suggested ledger disk space.

Dumping some DMs: the numbers (210/320 GB) seemed a bit large. Upon further inspection, we realized that this number includes the accounts index. The accounts index is currently ~55 GB on mainnet and, by default, lives at .../ledger/accounts_index/. So, it is being included in the Google dashboard numbers, but not in the metrics that only report on the rocksdb subdirectory.

With this in mind, the above numbers are roughly:
210 GB - 55 GB = 155 GB of rocksdb for this PR
320 GB - 55 GB = 265 GB of rocksdb for master

That works out to > 40%, so even better than the 33% number that was previously mentioned 😄
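A quick sanity check of that arithmetic (the sizes are the approximate figures quoted above):

```rust
fn main() {
    let accounts_index_gb = 55.0;
    let pr_rocksdb = 210.0 - accounts_index_gb;     // 155 GB
    let master_rocksdb = 320.0 - accounts_index_gb; // 265 GB
    // Reduction of the rocksdb-only footprint, as a percentage.
    let reduction_pct = 100.0 * (master_rocksdb - pr_rocksdb) / master_rocksdb;
    println!("rocksdb-only reduction: {reduction_pct:.1}%"); // ~41.5%
    assert!(reduction_pct > 40.0);
}
```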

@yhchiang-sol
Contributor Author

yhchiang-sol commented Aug 5, 2022

Here are the numbers for the default --limit-ledger-size.

The ledger size is capped at ~565GB (607402138084 bytes), which is slightly smaller than what we previously predicted.

[2022-08-05T14:36:18.200163844Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144594041, last_purge_slot=144593528, purge_interval=512, disk_utilization=Ok(607209149245)
[2022-08-05T14:36:21.816699128Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144374054
[2022-08-05T14:36:21.916079447Z INFO  solana_core::ledger_cleanup_service] purge_slots took 99ms
[2022-08-05T14:41:34.767795051Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144594554, last_purge_slot=144594041, purge_interval=512, disk_utilization=Ok(605534535878)
[2022-08-05T14:41:38.278674488Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144374551
[2022-08-05T14:41:38.337789842Z INFO  solana_core::ledger_cleanup_service] purge_slots took 59ms
[2022-08-05T14:47:24.315633976Z INFO  solana_core::ledger_cleanup_service] purge: last_root=144595067, last_purge_slot=144594554, purge_interval=512, disk_utilization=Ok(607402138084)
[2022-08-05T14:47:27.756434726Z ERROR solana_ledger::blockstore::blockstore_purge] purge_files_in_range: 0, 144375033
[2022-08-05T14:47:27.857260277Z INFO  solana_core::ledger_cleanup_service] purge_slots took 100ms

Together with the accounts index, it is ~618GB.

Screen Shot 2022-08-05 at 10 54 37 PM

As for the suggested ledger disk size under the default settings, I previously discussed it with @steviez, and we think 750GB would be a good number, as it gives us ~20% additional buffer to handle spikes and some room for the growth of the accounts index.

@yhchiang-sol yhchiang-sol requested a review from steviez August 5, 2022 15:03
steviez
steviez previously approved these changes Aug 8, 2022
Contributor

@steviez steviez left a comment


Looks good to me once you fix the typo. I think this was already the plan, but we can update docs in a different PR.

Additionally, I think this might be a candidate for backport, definitely for v1.11, and maybe even for v1.10. Arguably, this is a bug fix, as we allow the ledger to expand to a much larger footprint than what the docs state. I saw some chatter in Discord about backports / audits; I don't think there is any official policy change, but it might be good to post a question in Discord and check

@mergify mergify Bot dismissed steviez’s stale review August 8, 2022 13:33

Pull request has been modified.

@yhchiang-sol yhchiang-sol merged commit 99ef218 into solana-labs:master Aug 8, 2022
xiangzhu70 pushed a commit to xiangzhu70/solana that referenced this pull request Aug 17, 2022
…ce::cleanup_ledger (solana-labs#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.
@jeffwashington
Contributor

fwiw, I ran bench-tps on this and did not see any significant difference.

@yhchiang-sol
Contributor Author

fwiw, I ran bench-tps on this and did not see any significant difference.

Thanks for the information, @jeffwashington. The change requires the blockstore to hit --limit-ledger-size before any difference shows up, as the PR improves how the blockstore is cleaned up once it reaches the configured size limit.

I will experiment with bench-tps a bit.

mergify Bot pushed a commit that referenced this pull request Aug 22, 2022
…ce::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)
mergify Bot added a commit that referenced this pull request Aug 22, 2022
…ce::cleanup_ledger (backport #26651) (#27304)

Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)

Co-authored-by: Yueh-Hsuan Chiang <93241502+yhchiang-sol@users.noreply.github.com>
yhchiang-sol added a commit that referenced this pull request Sep 16, 2022
#### Problem
Previously, before #26651, our LedgerCleanupService needed RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which files to pick,
we relied on the 1-day compaction period so that each file would be forced to be
compacted once a day, allowing us to reclaim ledger disk space in time. The downside
is that each ledger file was rewritten once per day.

#### Summary of Changes
As #26651 makes LedgerCleanupService actively delete those files whose entire slot range
is older than both --limit-ledger-size and the current root, we can remove the 1-day
compaction period and get rid of the daily ledger file rewrite.

The results on mainnet-beta shows that this PR reduces ~20% write-bytes-per-second
and reduces ~50% read-bytes-per-second on ledger disk.
@steviez steviez added the v1.13 label Nov 2, 2022
mergify Bot pushed a commit that referenced this pull request Nov 2, 2022
…ce::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)
steviez pushed a commit that referenced this pull request Nov 10, 2022
…ce::cleanup_ledger (backport #26651) (#28721)

* Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (#26651)

#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.

#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot.  This allows us to reclaim disk space more often with
fewer IOps.  Experimental results on mainnet validators show that the PR can effectively
reduce 33% to 40% ledger disk size.

(cherry picked from commit 99ef218)

* Fixup

Co-authored-by: Yueh-Hsuan Chiang <93241502+yhchiang-sol@users.noreply.github.com>
Co-authored-by: Steven Czabaniuk <steven@solana.com>
steviez pushed a commit to steviez/solana that referenced this pull request Jul 19, 2023
Periodic compaction was previously disabled on all columns in solana-labs#27571 in
favor of the delete_file_in_range() approach that solana-labs#26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage, namely the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space was being reclaimed from them. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
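The per-column decision described in this commit can be sketched as follows (the column names come from the text above; the enum and helper are illustrative, not the real blockstore column types):

```rust
// Columns whose key leads with the slot can be reclaimed by whole-file
// deletion; columns where the slot is a non-primary index still need
// periodic compaction to reclaim space.
#[derive(Debug, PartialEq)]
enum Column {
    SlotMeta,           // slot is the primary index
    ShredData,          // slot is the primary index
    TransactionStatus,  // slot present, but not the primary index
    AddressSignatures,  // slot present, but not the primary index
}

fn needs_periodic_compaction(col: &Column) -> bool {
    matches!(col, Column::TransactionStatus | Column::AddressSignatures)
}

fn main() {
    assert!(needs_periodic_compaction(&Column::TransactionStatus));
    assert!(needs_periodic_compaction(&Column::AddressSignatures));
    assert!(!needs_periodic_compaction(&Column::SlotMeta));
    assert!(!needs_periodic_compaction(&Column::ShredData));
}
```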
steviez pushed a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.
mergify Bot pushed a commit that referenced this pull request Jul 20, 2023
Periodic compaction was previously disabled on all columns in #27571 in
favor of the delete_file_in_range() approach that #26651 introduced.
However, several columns still rely on periodic compaction to reclaim
storage. Namely, the TransactionStatus and AddressSignatures columns, as
these columns contain a slot in their key, but as a non-primary index.

The result of periodic compaction not running on these columns is that
no storage space is being reclaimed from these columns. This is obviously bad
and would lead to a node eventually running out of storage space and
crashing.

This PR reintroduces periodic compaction, but only for the columns that
need it.

(cherry picked from commit d73fa1b)