Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix live WALs purged while file deletions disabled #3341

Closed
wants to merge 2 commits into from

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Jan 9, 2018

When calling DisableFileDeletions followed by GetSortedWalFiles, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, GetSortedWalFiles didn't omit files already planned for deletion via PurgeObsoleteFiles, so the guarantee could be broken.

We fix it by making GetSortedWalFiles wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since PurgeObsoleteFiles is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, GetSortedWalFiles simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.

Test Plan:

  • new unit test

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr requested review from maysamyabandeh and siying January 9, 2018 03:30
@ajkr
Copy link
Contributor Author

ajkr commented Jan 9, 2018

btw the diff looks much simpler on internal phabricator.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. View: changes, changes since last import

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr
Copy link
Contributor Author

ajkr commented Jan 9, 2018

there's a problem where files scheduled for archiving will be skipped, then there'd be a gap in the returned WALs between the live WAL files and the archived WAL files :(

@ajkr
Copy link
Contributor Author

ajkr commented Jan 9, 2018

have another idea: how about we track unmatched calls to FindObsoleteFiles (i.e., ones without corresponding calls to PurgeObsoleteFiles). We can wait for this counter to hit zero in GetSortedWalFiles. It should fix the race condition for both WAL and archive and let us clean up a lot of confusing code in WalManager

sent out #3350 to make this possible

@ajkr ajkr force-pushed the live-obsolete-wal branch from 2af2e46 to 4c60749 Compare January 12, 2018 21:57
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request.

@ajkr
Copy link
Contributor Author

ajkr commented Jan 12, 2018

I updated the PR and description with a new approach now that #3350 has landed.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@miasantreble miasantreble self-assigned this Jan 16, 2018
// files are going to be purged. Additional purges won't be scheduled as
// long as deletions are disabled (so the below loop must terminate).
InstrumentedMutexLock l(&mutex_);
while (disable_delete_obsolete_files_ > 0 &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to also make sure bg_cv_ is signaled when disable_delete_obsolete_files_ goes to zero.

Copy link
Contributor

@miasantreble miasantreble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. View: changes, changes since last import

@ajkr ajkr force-pushed the live-obsolete-wal branch from 09ac69b to 5722768 Compare January 17, 2018 23:51
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajkr is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

amytai pushed a commit to amytai/rocksdb that referenced this pull request Mar 9, 2018
Summary:
When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.

We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
Closes facebook#3341

Differential Revision: D6681131

Pulled By: ajkr

fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Sep 10, 2020
Summary: facebook#3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in #XXXX opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Test Plan: apparent fix to backups in crash test observed. Will work
on a unit test for another commit
facebook-github-bot pushed a commit that referenced this pull request Sep 11, 2020
…7369)

Summary:
#3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in #5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Pull Request resolved: #7369

Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit

Reviewed By: ajkr

Differential Revision: D23620258

Pulled By: pdillinger

fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
codingrhythm pushed a commit to SafetyCulture/rocksdb that referenced this pull request Mar 5, 2021
…acebook#7369)

Summary:
facebook#3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in facebook#5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Pull Request resolved: facebook#7369

Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit

Reviewed By: ajkr

Differential Revision: D23620258

Pulled By: pdillinger

fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants