HDDS-8209. [SNAPSHOT] Synchronize tarball creation with background processes. #4680
Conversation
@hemantk-12 @prashantpogde @aswinshakil @smengcl Please take a look.
smengcl left a comment:
Thanks @GeorgeJahad for the patch and detailed PR description. The approach looks straightforward enough to me.
Regarding compaction logs (and backup SSTs), we probably need to lock pruneOlderSnapshotsWithCompactionHistory and pruneSstFiles as well:
Lines 226 to 237 in 08cb520:

```java
this.executor.scheduleWithFixedDelay(
    this::pruneOlderSnapshotsWithCompactionHistory,
    pruneCompactionDagDaemonRunIntervalInMs,
    pruneCompactionDagDaemonRunIntervalInMs,
    TimeUnit.MILLISECONDS);
this.executor.scheduleWithFixedDelay(
    this::pruneSstFiles,
    pruneCompactionDagDaemonRunIntervalInMs,
    pruneCompactionDagDaemonRunIntervalInMs,
    TimeUnit.MILLISECONDS);
```
With that, it could still append new compaction log entries (and hardlink new SST files) in CompactionBegin/CompletedListener.
In this case, do you think it's reasonable to call pauseBackgroundWork() at the beginning of bootstrapping, so that all RDB background work, including compaction, would be paused? It would then be resumed by calling continueBackgroundWork() once the tarball is generated or when the bootstrapping process errors out.
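To make the locking suggestion concrete, here is a minimal sketch of a scheduled pruning task that takes a shared bootstrap lock on each run, so pruning simply waits while tarball creation holds the lock. All names here are illustrative; this is not the actual Ozone code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockedPruneSketch {
  public static void main(String[] args) throws Exception {
    // Stand-in for the BootstrapStateHandler.Lock discussed in the thread.
    ReentrantLock bootstrapLock = new ReentrantLock();

    ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // Each pruning run takes the lock first, so it blocks for the whole
    // duration that the tarball creator holds it.
    Runnable pruneSstFiles = () -> {
      bootstrapLock.lock();
      try {
        System.out.println("pruned");
      } finally {
        bootstrapLock.unlock();
      }
    };

    executor.scheduleWithFixedDelay(pruneSstFiles, 0, 50,
        TimeUnit.MILLISECONDS);
    Thread.sleep(120);
    executor.shutdownNow();
    executor.awaitTermination(1, TimeUnit.SECONDS);
  }
}
```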
I don't know much about compaction so I'm not sure. It depends on how long compactions last, and how bad it is to turn them off for a while. Will that cause problems for the active fs? My gut feeling is that it is best to add it so we can do more testing and see whether the side effects are acceptable.
They do take locks internally. Do you think that is insufficient?
I don't think the new sst files are a serious problem. The question is whether the new compaction log entries are a problem; that is what I need help determining. If they are not a problem, then we don't need to pause compactions.

The issue is that after the bootstrap process takes the checkpoint of the active fs, any compactions that happen before the tarball is finished will add entries to the compaction logs. Those entries don't correspond to the active fs in the tarball. Once the follower loads the tarball, it will start doing its own compactions, which may differ from the compaction log entries. That could cause conflicting entries in the compaction log, which might break the dag. That is the problem we may need to avoid by invoking pauseBackgroundWork().

For example, the leader could compact files a and b into file c while the follower could compact them into file d. What is the compaction dag supposed to do with conflicting entries like that? That is why I'm leaning towards stopping compactions.
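To illustrate the conflict concretely with toy code (these are not Ozone's actual DAG structures): if two logs are merged and the same input set maps to two different compaction outputs, the merged view is ambiguous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CompactionConflictSketch {
  // A compaction log entry: a set of input SST files compacted into one output.
  record Entry(Set<String> inputs, String output) {}

  public static void main(String[] args) {
    // Leader compacts {a, b} -> c; follower independently compacts {a, b} -> d.
    List<Entry> leaderLog = List.of(new Entry(Set.of("a", "b"), "c"));
    List<Entry> followerLog = List.of(new Entry(Set.of("a", "b"), "d"));

    Map<Set<String>, String> dag = new HashMap<>();
    List<String> conflicts = new ArrayList<>();
    for (List<Entry> log : List.of(leaderLog, followerLog)) {
      for (Entry e : log) {
        String prev = dag.putIfAbsent(e.inputs(), e.output());
        if (prev != null && !prev.equals(e.output())) {
          // Same inputs, two different outputs: the merged DAG is ambiguous.
          conflicts.add(e.inputs() + " -> {" + prev + ", " + e.output() + "}");
        }
      }
    }
    System.out.println("conflicts: " + conflicts);
  }
}
```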
Ah, that is good enough. Thanks. They count as: ozone/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java, lines 304 to 309 in 08cb520.
I agree that with sufficient handling in DAG traversal (e.g. when used in SnapDiff) this should not be a big problem. cc @hemantk-12 I think an alternative approach to
I think that is a great solution if it is not too difficult. Does each compaction log entry have a corresponding sequence number? How do we know which entries are newer than a certain number?
Here is an example of a compaction log that is used in a UT: lines 281 to 302 in 08cb520.
So it looks like at the moment we only have a sequence number for [snippet not rendered]. By that I mean this line (line 285 in 08cb520) should become something like [snippet not rendered]. This could be achieved by adding this line [snippet not rendered] right after this (lines 545 to 546 in 08cb520).
And we will also need to tune the compaction log reading/parsing to correctly handle this during OM startup. You could leave a TODO in this PR for this, and file a new jira for the compaction entry sequence number addition. (up to you)
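As a sketch of what discarding newer entries could look like, assuming a hypothetical log-line format with a leading sequence number (the real Ozone compaction log format may differ; "C &lt;seq&gt; &lt;inputs&gt;:&lt;outputs&gt;" is invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SeqFilterSketch {
  public static void main(String[] args) {
    // Sequence number captured when the checkpoint was taken.
    long checkpointSeq = 1500;

    // Hypothetical log lines: "C <seq> <input files>:<output file>".
    List<String> log = List.of(
        "C 1200 a,b:c",
        "C 1600 c,e:f");  // newer than the checkpoint -> should be dropped

    // Keep only entries at or below the checkpoint's sequence number.
    List<String> kept = log.stream()
        .filter(l -> Long.parseLong(l.split(" ")[1]) <= checkpointSeq)
        .collect(Collectors.toList());

    System.out.println(kept);  // prints: [C 1200 a,b:c]
  }
}
```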
Thank you @GeorgeJahad for this PR. Also thank you for the detailed abstract. Yes, we should call pauseBackgroundWork() during tarball creation and be sure to resume it once the tarball is done. Other changes look good to me.
That would be my preference. I think it is a much better solution than pausing compactions. Is that ok with you @prashantpogde?
Co-authored-by: Siyao Meng <[email protected]>
@smengcl I created a new ticket for the compaction log work: https://issues.apache.org/jira/browse/HDDS-8652
Thanks @GeorgeJahad for the patch and filing HDDS-8652. Thanks @prashantpogde for the review.
It's a better solution, but it's much simpler to just pause and resume the background compaction. I would favor simplicity over the other solution because syncing up with the leader would be an infrequent activity. I would leave it to you to decide on this.
Another downside of
What changes were proposed in this pull request?
The BootstrapStateHandler
This PR creates a new interface, BootstrapStateHandler. Each process that manages state that needs to be copied into the bootstrap tarball must implement this interface.
The interface has a single method, getBootstrapStateLock(). The processes managing bootstrap state implement this method, which returns the lock used to protect state that must be changed atomically with respect to the tarball creation process.
The tarball creation process takes each of these locks before generating the tarball. That prevents any of the processes from changing the state while the tarball is being created.
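The pattern above can be sketched as follows. This is a hedged illustration only: apart from the names BootstrapStateHandler and getBootstrapStateLock() taken from this description, every name and detail here is invented and does not reflect the actual Ozone implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class BootstrapLockSketch {

  // Each process owning bootstrap state exposes its lock via this interface.
  interface BootstrapStateHandler {
    Lock getBootstrapStateLock();
  }

  // Illustrative lock wrapper; one instance per background service.
  static class Lock {
    private final ReentrantLock inner = new ReentrantLock();
    Lock acquire() throws InterruptedException {
      inner.lockInterruptibly();
      return this;
    }
    void release() {
      inner.unlock();
    }
  }

  // The tarball creator takes every handler's lock before writing the
  // tarball, so no service can mutate bootstrap state mid-creation.
  static void createTarball(List<BootstrapStateHandler> handlers)
      throws InterruptedException {
    List<Lock> held = new ArrayList<>();
    try {
      for (BootstrapStateHandler h : handlers) {
        held.add(h.getBootstrapStateLock().acquire());
      }
      // ... wait for double buffer flush, then write the tarball ...
    } finally {
      for (Lock l : held) {
        l.release();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Lock shared = new Lock();
    BootstrapStateHandler svc = () -> shared;
    createTarball(List.of(svc, svc));  // reentrant: same lock twice is fine
    System.out.println("locks released: " + !shared.inner.isLocked());
  }
}
```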
Then the tarball creator waits for the double buffer to flush, so that any remaining operations that may affect the bootstrap state have completed before the tarball is created.
The state to protect/synchronize
Outside of the active rocksdb itself, the omSnapshot subsystem uses many types of persistent state data. I've listed the types below along with an indication of whether they are guarded by a BootstrapStateHandler.Lock.
If there is any type I've forgotten please let me know, so that we can account for it.
delete/rename key entries
The sdt, and eventually the kdt, move delete/rename key entries between snapshots. These are guarded by a BootstrapStateHandler.Lock (and by waiting for the double buffer flush).
deleted sst files
The rocksdb differ and sst filtering service delete sst files when they are no longer needed. These are guarded by the BootstrapStateHandler.Lock.
The reason these need to be guarded is that the tarball creation process calculates the hard links prior to tarball creation. That calculation requires a stable set of sst files. If some get deleted during the process, the hard links calculated may be invalid.
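A toy illustration of that failure mode, assuming a plan-then-link approach (none of these names or steps come from the actual servlet code):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class HardLinkPlanSketch {
  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sst");
    Path sst = Files.createFile(dir.resolve("000123.sst"));

    // Step 1: plan the links while the set of SST files is (assumed) stable.
    List<Path> plan = new ArrayList<>();
    try (DirectoryStream<Path> s = Files.newDirectoryStream(dir, "*.sst")) {
      s.forEach(plan::add);
    }

    // A background pruner deletes the file after the plan was computed...
    Files.delete(sst);

    // Step 2: executing the now-stale plan fails.
    Path linkDir = Files.createTempDirectory("tarball");
    try {
      for (Path p : plan) {
        Files.createLink(linkDir.resolve(p.getFileName()), p);
      }
      System.out.println("plan executed");
    } catch (IOException e) {
      System.out.println("stale plan: a planned source file vanished");
    }
  }
}
```

Holding the BootstrapStateHandler.Lock across both steps prevents the deletion from racing with the plan.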
compaction logs
These get created by the active rocksdb as sst files are compacted.
These are not currently guarded by a BootstrapStateHandler.Lock.
My reasoning for this is that it would require turning off compactions on the active rocksdb for the duration of the tarball creation. Before the tarball is created, the BootstrapStateHandlers are locked and a checkpoint of the active fs is taken. Any compactions that happen after that checkpoint is taken, but before the tarball is finished, will have compaction log entries added to the tarball that don't correspond to the checkpoint.
This may be a problem. If so, we'll have to pause compactions on the active fs long enough to make a copy of the compaction logs. So please include your thoughts when reviewing.
intermediate rocksdb snapdiff files
These are just used during the calculation of the snapdiff. They will not be a part of the tarball.
They are not guarded by a BootstrapStateHandler.Lock.
SST filter service history file
This file keeps track of all the snapshots that have been filtered.
It is guarded by a BootstrapStateHandler.Lock.
Other changes in this PR
Flush Snapshot WALs after moving deleted keys
When deleted keys are moved from one snapshot to another, the double buffer opens and writes to both of the corresponding rocksdb images, in addition to the rocksdb for the active fs. I've modified that operation to flush the wal for the snapshot rocksdb images. (The active rocksdb doesn't need to be flushed because the tarball creator takes a rocksdb checkpoint of that image after the double buffer is flushed.)
Moved some methods to the AbstractKeyDeletingService class
The AbstractKeyDeletingService class is the parent class of both the SnapshotDeletingService and the KeyDeletingService. I moved the submitSnapshotMoveDeletedKeys() and submitRequest() methods from the SnapshotDeletingService class to the AbstractKeyDeletingService class. This is because the snapshot delete design includes an optimization that has the KeyDeletingService also moving deleted keys.
Since that operation needs to be protected by a BootstrapStateHandler.Lock, I moved that code in anticipation of when the optimization is implemented.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8209
How was this patch tested?
added unit tests
I will add a test for the SnapshotDeletingService once this is merged: #4571
TODO
As mentioned above, we need to decide if the compaction logs need to be synchronized with the active fs checkpoint.