HDDS-8209. [SNAPSHOT] Synchronize tarball creation with background processes. #4680
Conversation
@hemantk-12 @prashantpogde @aswinshakil @smengcl Please take a look.
smengcl left a comment:
Thanks @GeorgeJahad for the patch and detailed PR description. The approach looks straightforward enough to me.
Regarding compaction logs (and backup SSTs), we probably need to lock pruneOlderSnapshotsWithCompactionHistory and pruneSstFiles as well:
Lines 226 to 237 in 08cb520:

```java
this.executor.scheduleWithFixedDelay(
    this::pruneOlderSnapshotsWithCompactionHistory,
    pruneCompactionDagDaemonRunIntervalInMs,
    pruneCompactionDagDaemonRunIntervalInMs,
    TimeUnit.MILLISECONDS);
this.executor.scheduleWithFixedDelay(
    this::pruneSstFiles,
    pruneCompactionDagDaemonRunIntervalInMs,
    pruneCompactionDagDaemonRunIntervalInMs,
    TimeUnit.MILLISECONDS);
```
With that, it could still append new compaction log entries (and hardlink new SST files) in CompactionBegin/CompletedListener.
In this case, do you think it's reasonable to call pauseBackgroundWork() at the beginning of bootstrapping, so that all RDB background work, including compaction, would be paused? It would then be resumed by calling continueBackgroundWork() once the tarball is generated or when the bootstrapping process errors out.
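To make the locking suggestion concrete, here is a minimal sketch of a scheduled pruning task that takes a shared bootstrap lock on each run, so pruning simply waits while tarball creation holds the lock. All names here are illustrative; this is not the actual Ozone code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockedPruneSketch {
  public static void main(String[] args) throws Exception {
    // Stand-in for the BootstrapStateHandler.Lock discussed in the thread.
    ReentrantLock bootstrapLock = new ReentrantLock();

    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // Each pruning run takes the lock first, so it blocks for the whole
    // duration that the tarball creator holds it.
    Runnable pruneSstFiles = () -> {
      bootstrapLock.lock();
      try {
        System.out.println("pruned");
      } finally {
        bootstrapLock.unlock();
      }
    };

    executor.scheduleWithFixedDelay(pruneSstFiles, 0, 50,
        TimeUnit.MILLISECONDS);
    Thread.sleep(120);
    executor.shutdownNow();
    executor.awaitTermination(1, TimeUnit.SECONDS);
  }
}
```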
I don't know much about compaction so I'm not sure. It depends on how long compactions last, and how bad it is to turn them off for a while. Will that cause problems for the active fs? My gut feeling is that it is best to add it so we can do more testing and see whether the side effects are acceptable.
They do take locks internally. Do you think that is insufficient?
I don't think the new sst files are a serious problem. The question is whether the new compaction log entries are a problem; that is what I need help determining. If they are not a problem, then we don't need to pause compactions.

The issue is that after the bootstrap process takes the checkpoint of the active fs, any compactions that happen before the tarball is finished will add entries to the compaction logs. Those entries don't correspond to the active fs in the tarball. Once the follower loads the tarball, it will start doing its own compactions, which may differ from the compaction log entries. That could cause conflicting entries in the compaction log, which might break the dag. That is the problem we may need to avoid by invoking pauseBackgroundWork().

For example, the leader could compact files a and b into file c while the follower could compact them into file d. What is the compaction dag supposed to do with conflicting entries like that? That is why I'm leaning towards stopping compactions.
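To illustrate the conflict concretely with toy code (these are not Ozone's actual DAG structures): if two logs are merged and the same input set maps to two different compaction outputs, the merged view is ambiguous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CompactionConflictSketch {
  // A compaction log entry: a set of input SST files compacted into one output.
  record Entry(Set<String> inputs, String output) {}

  public static void main(String[] args) {
    // Leader compacts {a, b} -> c; follower independently compacts {a, b} -> d.
    List<Entry> leaderLog = List.of(new Entry(Set.of("a", "b"), "c"));
    List<Entry> followerLog = List.of(new Entry(Set.of("a", "b"), "d"));

    Map<Set<String>, String> dag = new HashMap<>();
    List<String> conflicts = new ArrayList<>();
    for (List<Entry> log : List.of(leaderLog, followerLog)) {
      for (Entry e : log) {
        String prev = dag.putIfAbsent(e.inputs(), e.output());
        if (prev != null && !prev.equals(e.output())) {
          // Same inputs, two different outputs: the merged DAG is ambiguous.
          conflicts.add(e.inputs() + " -> {" + prev + ", " + e.output() + "}");
        }
      }
    }
    System.out.println("conflicts: " + conflicts);
  }
}
```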
Ah, that is good enough. Thanks. They count as: ozone/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java, lines 304 to 309 in 08cb520.
I agree that with sufficient handling in DAG traversal (e.g. when used in SnapDiff) this should not be a big problem. cc @hemantk-12 I think an alternative approach to
I think that is a great solution if it is not too difficult. Does each compaction log entry have a corresponding sequence number? How do we know which entries are newer than a certain number?
Here is an example of a compaction log that is used in a UT: lines 281 to 302 in 08cb520.
So it looks like at the moment we only have a sequence number for [snippet not rendered]. By that I mean this line (line 285 in 08cb520) should become something like [snippet not rendered]. This could be achieved by adding this line [snippet not rendered] right after this (lines 545 to 546 in 08cb520).
And we will also need to tune the compaction log reading/parsing to correctly handle this during OM startup. You could leave a TODO in this PR for this, and file a new jira for the compaction entry sequence number addition. (up to you)
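As a sketch of what discarding newer entries could look like, assuming a hypothetical log-line format with a leading sequence number (the real Ozone compaction log format may differ; "C &lt;seq&gt; &lt;inputs&gt;:&lt;outputs&gt;" is invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SeqFilterSketch {
  public static void main(String[] args) {
    // Sequence number captured when the checkpoint was taken.
    long checkpointSeq = 1500;

    // Hypothetical log lines: "C <seq> <input files>:<output file>".
    List<String> log = List.of(
        "C 1200 a,b:c",
        "C 1600 c,e:f");  // newer than the checkpoint -> should be dropped

    // Keep only entries at or below the checkpoint's sequence number.
    List<String> kept = log.stream()
        .filter(l -> Long.parseLong(l.split(" ")[1]) <= checkpointSeq)
        .collect(Collectors.toList());

    System.out.println(kept);  // prints: [C 1200 a,b:c]
  }
}
```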
Thank you @GeorgeJahad for this PR. Also thank you for the detailed abstract. Yes, we should call pauseBackgroundWork() during tarball creation and be sure to resume it once the tarball is done. Other changes look good to me.
That would be my preference. I think it is a much better solution than pausing compactions. Is that ok with you @prashantpogde?
Co-authored-by: Siyao Meng <[email protected]>
@smengcl I created a new ticket for the compaction log work: https://issues.apache.org/jira/browse/HDDS-8652
Thanks @GeorgeJahad for the patch and filing HDDS-8652. Thanks @prashantpogde for the review.
It's a better solution, but it's much simpler to just pause and resume the background compaction. I would favor simplicity over the other solution because syncing up with the leader would be an infrequent activity. I would leave it to you to decide on this.
Another downside of
What changes were proposed in this pull request?
The BootstrapStateHandler
This PR creates a new interface, BootstrapStateHandler. Each process that manages state that needs to be copied into the bootstrap tarball must implement this interface.
The interface has a single method, getBootstrapStateLock(). The processes managing bootstrap state implement this method, which returns the lock used to protect state that must be changed atomically with respect to the tarball creation process.
The tarball creation process takes each of these locks before generating the tarball. That prevents any of the processes from changing the state while the tarball is being created.
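The pattern above can be sketched as follows. This is a hedged illustration only: apart from the names BootstrapStateHandler and getBootstrapStateLock() taken from this description, every name and detail here is invented and does not reflect the actual Ozone implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class BootstrapLockSketch {

  // Each process owning bootstrap state exposes its lock via this interface.
  interface BootstrapStateHandler {
    Lock getBootstrapStateLock();
  }

  // Illustrative lock wrapper; one instance per background service.
  static class Lock {
    private final ReentrantLock inner = new ReentrantLock();
    Lock acquire() throws InterruptedException {
      inner.lockInterruptibly();
      return this;
    }
    void release() {
      inner.unlock();
    }
  }

  // The tarball creator takes every handler's lock before writing the
  // tarball, so no service can mutate bootstrap state mid-creation.
  static void createTarball(List<BootstrapStateHandler> handlers)
      throws InterruptedException {
    List<Lock> held = new ArrayList<>();
    try {
      for (BootstrapStateHandler h : handlers) {
        held.add(h.getBootstrapStateLock().acquire());
      }
      // ... wait for double buffer flush, then write the tarball ...
    } finally {
      for (Lock l : held) {
        l.release();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Lock shared = new Lock();
    BootstrapStateHandler svc = () -> shared;
    createTarball(List.of(svc, svc));  // reentrant: same lock twice is fine
    System.out.println("locks released: " + !shared.inner.isLocked());
  }
}
```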
Then the tarball creator waits for the double buffer to flush, so that any remaining operations that may affect the bootstrap state have completed before the tarball is created.
The state to protect/synchronize
Outside of the active rocksdb itself, the omSnapshot subsystem uses many types of persistent state data. I've listed the types below along with an indication of whether they are guarded by a BootstrapStateHandler.Lock.
If there is any type I've forgotten please let me know, so that we can account for it.
delete/rename key entries
The sdt, and eventually the kdt, move delete/rename key entries between snapshots. These are guarded by a BootstrapStateHandler.Lock (and by waiting for the double buffer flush).
deleted sst files
The rocksdb differ and sst filtering service delete sst files when they are no longer needed. These are guarded by the BootstrapStateHandler.Lock.
The reason these need to be guarded is that the tarball creation process calculates the hard links prior to tarball creation. That calculation requires a stable set of sst files. If some get deleted during the process, the hard links calculated may be invalid.
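A toy illustration of that failure mode, assuming a plan-then-link approach (none of these names or steps come from the actual servlet code):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class HardLinkPlanSketch {
  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sst");
    Path sst = Files.createFile(dir.resolve("000123.sst"));

    // Step 1: plan the links while the set of SST files is (assumed) stable.
    List<Path> plan = new ArrayList<>();
    try (DirectoryStream<Path> s = Files.newDirectoryStream(dir, "*.sst")) {
      s.forEach(plan::add);
    }

    // A background pruner deletes the file after the plan was computed...
    Files.delete(sst);

    // Step 2: executing the now-stale plan fails.
    Path linkDir = Files.createTempDirectory("tarball");
    try {
      for (Path p : plan) {
        Files.createLink(linkDir.resolve(p.getFileName()), p);
      }
      System.out.println("plan executed");
    } catch (IOException e) {
      System.out.println("stale plan: a planned source file vanished");
    }
  }
}
```

Holding the BootstrapStateHandler.Lock across both steps prevents the deletion from racing with the plan.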
compaction logs
These get created by the active rocksdb as sst files are compacted.
These are not currently guarded by a BootstrapStateHandler.Lock.
My reasoning for this is that it would require turning off compactions on the active rocksdb for the duration of the tarball creation. Before the tarball is created, the BootstrapStateHandlers are locked and a checkpoint of the active fs is taken. Any compactions that happen after that checkpoint is taken, but before the tarball is finished, will have compaction log entries added to the tarball that don't correspond to the checkpoint.
This may be a problem. If so, we'll have to pause compactions on the active fs long enough to make a copy of the compaction logs. So please include your thoughts when reviewing.
intermediate rocksdb snapdiff files
These are just used during the calculation of the snapdiff. They will not be a part of the tarball.
They are not guarded by a BootstrapStateHandler.Lock.
SST filter service history file
This file keeps track of all the snapshots that have been filtered.
It is guarded by a BootstrapStateHandler.Lock.
Other changes in this PR
Flush Snapshot WALs after moving deleted keys
When deleted keys are moved from one snapshot to another, the double buffer opens and writes to both of the corresponding rocksdb images, in addition to the rocksdb for the active fs. I've modified that operation to flush the wal for the snapshot rocksdb images. (The active rocksdb doesn't need to be flushed because the tarball creator takes a rocksdb checkpoint of that image after the double buffer is flushed.)
Moved some methods to the AbstractKeyDeletingService class
The AbstractKeyDeletingService class is the parent class of both the SnapshotDeletingService and the KeyDeletingService. I moved the submitSnapshotMoveDeletedKeys() and submitRequest() methods from the SnapshotDeletingService class to the AbstractKeyDeletingService class. This is because the snapshot delete design includes an optimization that has the KeyDeletingService also moving deleted keys.
Since that operation needs to be protected by a BootstrapStateHandler.Lock, I moved that code in anticipation of when the optimization is implemented.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8209
How was this patch tested?
added unit tests
I will add a test for the SnapshotDeletingService once this is merged: #4571
TODO
As mentioned above, we need to decide if the compaction logs need to be synchronized with the active fs checkpoint.