HDDS-3208. Implement Ratis snapshot on SCM #1725
Conversation
I found one of the failed tests is
cc @nandakumar131 @ChenSammi @GlenGeng |
GlenGeng-awx left a comment
LGTM, only some inline comments.
Could we rename the title to "Implement Ratis takeSnapshot on SCM"?
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Can we remove the 'else {}' branch? Once we migrate to batch operations, there will be no other way to do DB writes.
This is left for the MockHAManager code path, in which the buffer is NULL :)
If we update MockHAManager to have a sort of MockBuffer implementation, then we can remove this else branch, because at that point I think all existing code will access this buffer.
I can create a JIRA to track adding a buffer to MockHAManager and removing this else :-)
https://issues.apache.org/jira/browse/HDDS-4634
Created the JIRA to add a buffer to MockHAManager so we can remove the else.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMDBTransactionBuffer.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
- Will it be better to put SCMTransactionInfo near OMTransactionInfo, under interface-storage?
- Can we reuse OMTransactionInfo? For example, rename OMTransactionInfo to TransactionInfo, and use it in both OM and SCM?
To reuse OMTransactionInfo, my current answer is probably no.
Though OMTransactionInfo and SCMTransactionInfo seem to do the same thing for now, decoupling is good in case either one needs to add or drop fields while the other does not, especially now that the SCM snapshot is still under development.
I think we can consider this after the SCM snapshot is stable. I can create a JIRA, though, to track the merge of the two transaction infos.
Sorry, I'm not clear why we cannot reuse OMTransactionInfo. They have the same fields.
I think we are better off reusing OMTransactionInfo after merging HDDS-2823 back to master.
Not reusing OMTransactionInfo is an attempt to reduce the risk of conflicts when syncing master into HDDS-2823. Also, to reuse OMTransactionInfo we would be better off renaming it to TransactionInfo, which would itself be a source of conflicts, unless we can change it in the master branch first. I have also added some more functions to SCMTransactionInfo; if we reuse OMTransactionInfo, those new functions would become another source of conflicts.
To conclude: we can reuse it after merging SCM HA back to master; the goal is to reduce potential conflicts, if there could be any.
SCM HA is in a development branch anyway, and I am sure we will need to clean up lots of stuff eventually. As long as we have JIRAs, we won't lose track of such work.
OMTransactionInfo was written by Bharat. Since SCM and OM have quite similar requirements, we may have a talk with him to determine whether OM and SCM can share the same TransactionInfo and Codec. @amaliujia What do you think?
Sure, we can do that. I will need to add some changes in the main branch, then sync them to HDDS-2823, and then reuse them in SCM.
I think we should do that separately, as it will take time. I have created https://issues.apache.org/jira/browse/HDDS-4660 to track this effort.
Could we avoid using this constructor? We may need to change the related test cases.
Same as https://issues.apache.org/jira/browse/HDDS-4634.
This is because MockHAManager does not have a mock buffer implementation.
Could we implement the mock buffer in this PR? Missing this change will pollute the production code and will also burden the ongoing PR for the deleted block log.
...ds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMTransactionInfoCodec.java
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManagerV2Impl.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMDBTransactionBuffer.java
With #1733 we will be able to write tests for snapshots based on a single-Ratis-server SCM setup in MiniOzoneCluster.
Shall we add some checks here to guarantee the new info is greater than the current lastTrxInfo?
Good point. I will add a check or a LOG.error for that case.
Theoretically, the Ratis log is applied with a monotonically increasing index, so the new info should be greater than the current lastTrxInfo. But a check or a log will better handle rare cases or bugs.
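The guard being discussed could be sketched roughly as below. This is an illustrative stand-in, not the actual SCMStateMachine or SCMTransactionInfo code; the class names (TrxInfo, TrxBuffer) and the default values (term 0, index -1) are taken only from this conversation.

```java
// Stand-in for SCMTransactionInfo: ordered by term first, then index,
// mirroring how Ratis TermIndex values are compared.
final class TrxInfo {
  final long term;
  final long transactionIndex;

  TrxInfo(long term, long transactionIndex) {
    this.term = term;
    this.transactionIndex = transactionIndex;
  }

  int compareWith(TrxInfo other) {
    return term != other.term
        ? Long.compare(term, other.term)
        : Long.compare(transactionIndex, other.transactionIndex);
  }
}

// Stand-in for the transaction buffer holding the latest applied info.
final class TrxBuffer {
  private TrxInfo latestTrxInfo = new TrxInfo(0, -1);

  void updateLatestTrxInfo(TrxInfo info) {
    if (info.compareWith(latestTrxInfo) <= 0) {
      // Ratis applies log entries with monotonically increasing indexes,
      // so a non-increasing update indicates a bug.
      throw new IllegalStateException("non-increasing transaction info");
    }
    latestTrxInfo = info;
  }
}
```

Whether such a violation should fail fast (as here) or only be logged is exactly the question raised below.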
Done
In my opinion, we should stop the Ratis server here if such a case arises. It is a fatal case from which we cannot move ahead.
Such a case should be considered a bug and be fixed.
I created a JIRA to further discuss how to deal with such a case: https://issues.apache.org/jira/browse/HDDS-4723
Since we add a new table to the SCM DB, we need a JIRA to track how to handle backward compatibility when upgrading a non-HA SCM to an HA SCM.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMMetadataStore.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMDBDefinition.java
@amaliujia, thanks for working on this, I left a few comments.
@ChenSammi thank you! I will start addressing the comments, since there are a bunch of actionable ones already.
Could we add javadoc for this? That will make it easier for others to understand. At least, we should document the purpose of this transaction buffer.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMTransactionInfo.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Exposing currentBatchOperation has a risk: if someone commits this batch operation without calling flush(), then the applyIndex is not written into RocksDB.
Hmm, I think the goal here is that committing batch operations won't need to flush the DB. The flushing is controlled by takeSnapshot().
Is there a case where committing is followed by a flush?
I'm afraid that if someone does not know they must call flush() to sync, but syncs currentBatchOperation directly, an inconsistency will happen.
Do you know of a better mechanism to achieve the buffering?
I studied what OM does, and it is different from SCM. OM just buffers entries and then applies them in a batch, but SCM needs to route each entry to a handler, and then the different handlers apply the entry.
So if we change to the way OM does it, a good amount of refactoring would need to happen, which might not be appropriate in this PR. E.g., we would need to move the handlers into the buffer class and insert entries into the buffer; the buffer would be the place to route entries to the right state managers.
What do you think? I can create a JIRA to track such refactoring so we won't need to expose currentBatchOperation.
How about returning an anonymous subclass of RDBBatchOperation whose commit() throws a RuntimeException? That way you can only write to the batch, but there is no way to commit it.
RDBBatchOperation#commit() is called by RDBStore#commitBatchOperation().
Also add javadoc to getCurrentBatchOperation(), noting that the returned batch cannot be committed.
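The idea above can be sketched like this. BatchOp is a hypothetical stand-in interface, not the actual RDBBatchOperation API; only the wrap-and-block-commit pattern is the point.

```java
// Hypothetical stand-in for a batch-write interface.
interface BatchOp {
  void put(String key, String value);
  void commit();
}

// Wrapper that still accepts writes but refuses to commit, so callers
// are forced to go through the buffer's flush(), which also persists
// the applied index together with the data.
final class NonCommittableBatch implements BatchOp {
  private final BatchOp delegate;

  NonCommittableBatch(BatchOp delegate) {
    this.delegate = delegate;
  }

  @Override
  public void put(String key, String value) {
    delegate.put(key, value);  // writes pass through untouched
  }

  @Override
  public void commit() {
    throw new UnsupportedOperationException(
        "commit via DBTransactionBuffer#flush() only");
  }
}
```

This keeps the accidental-commit mistake loud at runtime instead of producing a silent inconsistency.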
Glen's idea might be much easier. That's a good option.
I think I will separate it into:
- Add javadoc to getCurrentBatchOperation() as a reminder not to use this batch to commit.
- Apply a solution to fix this issue. Created https://issues.apache.org/jira/browse/HDDS-4661 to track it.

Because getCurrentBatchOperation() is not a user-facing API, only Ozone developers could make such mistakes, and in the short term we have the code review process to catch those. But I agree we should apply a fix for the longer term.
Force-pushed dfddaed to 1e0787a
@ChenSammi @runzhiwang @GlenGeng @linyiqun I have rebased and updated this PR. It now includes takeSnapshot and loadSnapshot with a test.
Will fix the failing tests.
I don't see this method used in this PR. Will it be used in a follow-up task? If not, we could remove it and also remove the volatile keyword from term/snapshotIndex, because I didn't find any concurrent updates to these two variables.
Got it. I will remove this method.
The volatile is a good point. I am also looking for comments pointing out whether there are potential concurrent operations. I will remove volatile if no such situation is raised.
SCMRatisSnapshotInfo is an immutable object after being created. How about making term and snapshotIndex final and removing updateTermIndex()?
That makes sense. Done
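The immutability suggestion agreed to here amounts to something like the sketch below. Field names mirror the discussion, but this is not the real SCMRatisSnapshotInfo, which carries more state.

```java
// Hypothetical immutable snapshot-info holder: both fields are final,
// so there is nothing for an updateTermIndex() method to do.
final class SnapshotInfoSketch {
  private final long term;
  private final long snapshotIndex;

  SnapshotInfoSketch(long term, long snapshotIndex) {
    this.term = term;
    this.snapshotIndex = snapshotIndex;
  }

  long getTerm() { return term; }
  long getIndex() { return snapshotIndex; }
  // A newer snapshot simply means constructing a new instance.
}
```

Making the fields final also removes the need for the volatile keyword questioned elsewhere in this review, since final fields are safely published after construction.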
GlenGeng-awx left a comment
Thanks for the job!
Better to also solve the following task in this PR; it should be part of the job of loading the snapshot.
https://issues.apache.org/jira/browse/HDDS-4533
Will it be better to call latestSnapshot = latestTrxInfo.toSnapshotInfo(); here, and remove the setLatestSnapshot() method?
If the caller always has to call flush() and setLatestSnapshot() together, it is better to merge them to avoid human mistakes.
The snapshot info is also set during initialization, so flush() is not the only place that needs to update that information.
How about reverting the change to process() and changing it like this?
applyTransactionFuture.complete(process(request));
transactionBuffer.updateLatestTrxInfo(SCMTransactionInfo.builder()
    .setCurrentTerm(trx.getLogEntry().getTerm())
    .setTransactionIndex(trx.getLogEntry().getIndex())
    .build());
Then process() does not need to know about the trxInfo.
+1
How about moving this shouldUpdate() into SCMTransactionInfo as a method isEmpty()? We'd better encapsulate the magic numbers 0 and -1 inside SCMTransactionInfo.
Makes sense
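The encapsulation agreed to here could look roughly like the sketch below. The class name is a stand-in for SCMTransactionInfo, and the defaults (term 0, index -1) are the magic numbers mentioned above.

```java
// Hypothetical sketch: the default "no transaction yet" values live
// inside the class, so callers never compare against 0 or -1 directly.
final class TxInfoSketch {
  private static final long DEFAULT_TERM = 0;
  private static final long DEFAULT_INDEX = -1;

  private final long term;
  private final long transactionIndex;

  TxInfoSketch(long term, long transactionIndex) {
    this.term = term;
    this.transactionIndex = transactionIndex;
  }

  // Factory for the placeholder used before any snapshot exists.
  static TxInfoSketch defaultInfo() {
    return new TxInfoSketch(DEFAULT_TERM, DEFAULT_INDEX);
  }

  boolean isEmpty() {
    return term == DEFAULT_TERM && transactionIndex == DEFAULT_INDEX;
  }
}
```

A caller then writes `if (!info.isEmpty()) { ... }` instead of spelling out the sentinel values, which is the readability win being discussed.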
Better to remove the info log at line 133 and replace line 156 with:
LOG.info("Current Snapshot Index {}, takeSnapshot took {} ms",
    getLastAppliedTermIndex(), Time.monotonicNow() - startTime);
Done
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
@GlenGeng I prefer to fix such issues separately, so I created a PR for it: #1796
Force-pushed 22b07be to b1aa53e
shouldUpdate() is not very intuitive in the context of SCMTransactionInfo; how about isInitialized() or something like that?
isInitialized is a good name.
DBTransactionBuffer can be fetched by calling SCMHAManager#getDBTransactionBuffer, so this function newPipelineManagerWithMockBuffer is not needed:
// Create PipelineStateManager
StateManager stateManager = PipelineStateManagerV2Impl
    .newBuilder().setPipelineStore(pipelineStore)
    .setRatisServer(scmhaManager.getRatisServer())
    .setNodeManager(nodeManager)
    .setSCMDBTransactionBuffer(scmhaManager.getDBTransactionBuffer())
    .build();
This is a very good point.
Should it be
if (latestTrxInfo.isInitialized()) {
  updateLastAppliedTermIndex(...);
}
?
Actually it is !latestTrxInfo.isInitialized().
SCMTransactionInfo latestTrxInfo = buffer.getLatestTrxInfo(); comes from the buffer, which loads the SCMTransactionInfo from the DB. If no snapshot was taken before (e.g., a brand-new SCM), there is no SCMTransactionInfo in the DB, so the buffer creates a default one.
So we only need to call updateLastAppliedTermIndex when latestTrxInfo is not the default one (i.e., it was loaded from the DB and is meaningful).
I think the reset of the RocksDB batch operation should be made independent of flush(). We may or may not need to reinitialize it every time we call flush(). For example, shutting down the Raft server instance may initiate the last snapshot but will not require the batch reinitialization.
I agree with it :-)
In fact, we discussed not exposing the batch operation above, and I created https://issues.apache.org/jira/browse/HDDS-4661.
As this PR is becoming larger and larger, I am planning to address the batch-operation-related comments in HDDS-4661, including when to init and close it.
How do you feel? Do you agree with my idea?
I am OK with addressing it in a different PR.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Unintended change?
Yes :) I will undo it.
The failed UT is the decommission one, which is known to be flaky.
Can we rename applyTransaction to applyTransactionSerial, as this is a serialized operation anyway?
I created https://issues.apache.org/jira/browse/HDDS-4684.
After checking the Ratis state machine interface, I can see two functions:
/**
* Called for transactions that have been committed to the RAFT log. This step is called
* sequentially in strict serial order that the transactions have been committed in the log.
* The SM is expected to do only necessary work, and leave the actual apply operation to the
* applyTransaction calls that can happen concurrently.
* @param trx the transaction state including the log entry that has been committed to a quorum
* of the raft peers
* @return The Transaction context.
*/
TransactionContext applyTransactionSerial(TransactionContext trx);
/**
* Apply a committed log entry to the state machine. This method can be called concurrently with
* the other calls, and there is no guarantee that the calls will be ordered according to the
* log commit order.
* @param trx the transaction state including the log entry that has been committed to a quorum
* of the raft peers
*/
CompletableFuture<Message> applyTransaction(TransactionContext trx);
So there is no API like the following:
Message applyTransactionSerial(TransactionContext trx);
that is, a function that applies a transaction in strict serial order and returns a Message.
I am planning to send an email to the Ratis community to discuss the intention of having only one applyTransaction that returns a Message (as a CompletableFuture) but can be called concurrently.
We can address this API change in HDDS-4684.
Please add a @VisibleForTesting annotation.
Done
Force-pushed 0d51d78 to 76509c8
bshashikant left a comment
The patch looks OK. The changes discussed will be addressed in subsequent JIRAs.
The transaction buffer flush can now only happen via a Ratis snapshot, but if Ratis is not enabled, there needs to be a RocksDB sync on every DB update, or we need a way to periodically flush the buffer changes to the DB. This can be done as part of a separate JIRA.
Thanks @amaliujia for the contribution.
What changes were proposed in this pull request?
Design doc: https://docs.google.com/document/d/1uy4_ER2V6nNQJ7_5455Wz8NmI142JHPnif6Y1OdPi8E/edit?usp=sharing
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-3208
How was this patch tested?
UT