HDDS-3208. Implement Ratis snapshot on SCM #1725
Conversation
I found one of the failed tests is
cc @nandakumar131 @ChenSammi @GlenGeng |
GlenGeng-awx left a comment
LGTM, only some inline comments.
Could we rename the title to "Implement Ratis takeSnapshot on SCM"?
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Can we remove the 'else {}' branch? Once we migrate to batch operations, there will be no other way to do DB writes.
This is left for the MockHAManager code path, in which the buffer is NULL :)
If we update MockHAManager to have a sort of MockBuffer implementation, then we can remove this else branch, because at that point I think all existing code will access this buffer.
I can create a JIRA to track adding a buffer to MockHAManager and removing this else :-)
https://issues.apache.org/jira/browse/HDDS-4634
Created the JIRA to add a buffer to MockHAManager so we can remove the else.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMDBTransactionBuffer.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
- Will it be better to put SCMTransactionInfo near OMTransactionInfo, under interface-storage?
- Can we reuse OMTransactionInfo? For example, rename OMTransactionInfo to TransactionInfo, and use it in both OM and SCM?
To reuse OMTransactionInfo, my current answer is probably no.
Though OMTransactionInfo and SCMTransactionInfo seem to do the same thing for now, decoupling is good in case either one needs to add or drop fields while the other does not, especially now that the SCM snapshot is still under development.
I think we can consider this after the SCM snapshot is stable. I can create a JIRA, though, to track the merge of the two transaction infos.
Sorry, I'm not clear why we cannot reuse OMTransactionInfo. They have the same fields.
I think we are better off reusing OMTransactionInfo after merging HDDS-2823 back to master.
Not reusing OMTransactionInfo is an attempt to reduce the risk of conflicts when syncing master into HDDS-2823. Also, to reuse OMTransactionInfo we would be better off renaming it to TransactionInfo, which would itself be a source of conflicts, unless we can change it in the master branch first. I have also added some more functions to SCMTransactionInfo; if we reuse OMTransactionInfo, those new functions would become another source of conflicts.
To conclude: we can reuse it after merging SCM HA back to master; the goal is to reduce potential conflicts, if there could be any.
SCM HA is in a development branch anyway, and I am sure we will need to clean up lots of stuff eventually. As long as we have JIRAs, we won't lose track of such work.
OMTransactionInfo was written by Bharat. Since SCM and OM have quite similar requirements, we may have a talk with him to determine whether OM and SCM can share the same TransactionInfo and Codec. @amaliujia What do you think?
Sure, we can do that. I will need to add some changes in the main branch, then sync them to HDDS-2823, and then reuse them in SCM.
I think we should do that separately, as it will take time. I have created https://issues.apache.org/jira/browse/HDDS-4660 to track this effort.
Could we avoid using this constructor? We may need to change the related test cases.
Same as https://issues.apache.org/jira/browse/HDDS-4634.
This is because MockHAManager does not have a mock buffer implementation.
Could we implement the mock buffer in this PR? Missing this change will pollute the production code and will also burden the ongoing PR for the deleted block log.
...ds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMTransactionInfoCodec.java
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManagerV2Impl.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMDBTransactionBuffer.java
With #1733 we will be able to write tests for snapshots based on a single-Ratis-server SCM setup in MiniOzoneCluster.
Shall we add some checks here to guarantee the new info is greater than the current lastTrxInfo?
Good point. I will add a check or a LOG.error for that case.
Theoretically, the Ratis log is applied with a monotonically increasing index, so the new info should be greater than the current lastTrxInfo. But a check or a log will better handle rare cases or bugs.
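The guard being discussed could be sketched roughly as below. This is an illustrative stand-in, not the actual SCMStateMachine or SCMTransactionInfo code; the class names (TrxInfo, TrxBuffer) and the default values (term 0, index -1) are taken only from this conversation.

```java
// Stand-in for SCMTransactionInfo: ordered by term first, then index,
// mirroring how Ratis TermIndex values are compared.
final class TrxInfo {
  final long term;
  final long transactionIndex;

  TrxInfo(long term, long transactionIndex) {
    this.term = term;
    this.transactionIndex = transactionIndex;
  }

  int compareWith(TrxInfo other) {
    return term != other.term
        ? Long.compare(term, other.term)
        : Long.compare(transactionIndex, other.transactionIndex);
  }
}

// Stand-in for the transaction buffer holding the latest applied info.
final class TrxBuffer {
  private TrxInfo latestTrxInfo = new TrxInfo(0, -1);

  void updateLatestTrxInfo(TrxInfo info) {
    if (info.compareWith(latestTrxInfo) <= 0) {
      // Ratis applies log entries with monotonically increasing indexes,
      // so a non-increasing update indicates a bug.
      throw new IllegalStateException("non-increasing transaction info");
    }
    latestTrxInfo = info;
  }
}
```

Whether such a violation should fail fast (as here) or only be logged is exactly the question raised below.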
Done
In my opinion, we should stop the Ratis server here if such a case arises. It is a fatal case from which we cannot move ahead.
Such a case should be considered a bug and be fixed.
I created a JIRA to further discuss how to deal with such a case: https://issues.apache.org/jira/browse/HDDS-4723
Since we add a new table to the SCM DB, we need a JIRA to track how to handle backward compatibility when upgrading a non-HA SCM to an HA SCM.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMMetadataStore.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/metadata/SCMDBDefinition.java
@amaliujia, thanks for working on this, I left a few comments.
@ChenSammi thank you! I will start addressing the comments, since there are a bunch of actionable ones already.
Could we add javadoc for this? That will make it easier for others to understand. At least, we should document the purpose of this transaction buffer.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMTransactionInfo.java
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Exposing currentBatchOperation has a risk: if someone commits this batch operation without calling flush(), then the applyIndex is not written into RocksDB.
Hmm, I think the goal here is that committing batch operations won't need to flush the DB. The flushing is controlled by takeSnapshot().
Is there a case where committing is followed by a flush?
I'm afraid that if someone does not know they must call flush() to sync, but syncs currentBatchOperation directly, an inconsistency will happen.
Do you know of a better mechanism to achieve the buffering?
I studied what OM does, and it is different from SCM. OM just buffers entries and then applies them in a batch, but SCM needs to route each entry to a handler, and then the different handlers apply the entry.
So if we change to the way OM does it, a good amount of refactoring would need to happen, which might not be appropriate in this PR. E.g., we would need to move the handlers into the buffer class and insert entries into the buffer; the buffer would be the place to route entries to the right state managers.
What do you think? I can create a JIRA to track such refactoring so we won't need to expose currentBatchOperation.
How about returning an anonymous subclass of RDBBatchOperation whose commit() throws a RuntimeException? That way you can only write to the batch, but there is no way to commit it.
RDBBatchOperation#commit() is called by RDBStore#commitBatchOperation().
Also add javadoc to getCurrentBatchOperation(), noting that the returned batch cannot be committed.
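The idea above can be sketched like this. BatchOp is a hypothetical stand-in interface, not the actual RDBBatchOperation API; only the wrap-and-block-commit pattern is the point.

```java
// Hypothetical stand-in for a batch-write interface.
interface BatchOp {
  void put(String key, String value);
  void commit();
}

// Wrapper that still accepts writes but refuses to commit, so callers
// are forced to go through the buffer's flush(), which also persists
// the applied index together with the data.
final class NonCommittableBatch implements BatchOp {
  private final BatchOp delegate;

  NonCommittableBatch(BatchOp delegate) {
    this.delegate = delegate;
  }

  @Override
  public void put(String key, String value) {
    delegate.put(key, value);  // writes pass through untouched
  }

  @Override
  public void commit() {
    throw new UnsupportedOperationException(
        "commit via DBTransactionBuffer#flush() only");
  }
}
```

This keeps the accidental-commit mistake loud at runtime instead of producing a silent inconsistency.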
Glen's idea might be much easier. That's a good option.
I think I will separate it into:
- Add javadoc to getCurrentBatchOperation() as a reminder not to use this batch to commit.
- Apply a solution to fix this issue. Created https://issues.apache.org/jira/browse/HDDS-4661 to track it.

Because getCurrentBatchOperation() is not a user-facing API, only Ozone developers could make such mistakes, and in the short term we have the code review process to catch those. But I agree we should apply a fix for the longer term.
Force-pushed dfddaed to 1e0787a
@ChenSammi @runzhiwang @GlenGeng @linyiqun I have rebased and updated this PR. It now includes takeSnapshot and loadSnapshot with a test.
Will fix the failing tests.
I don't see this method used in this PR. Will it be used in a follow-up task? If not, we could remove it and also remove the volatile keyword from term/snapshotIndex, because I didn't find any concurrent updates to these two variables.
Got it. I will remove this method.
The volatile is a good point. I am also looking for comments pointing out whether there are potential concurrent operations. I will remove volatile if no such situation is raised.
SCMRatisSnapshotInfo is an immutable object after being created. How about making term and snapshotIndex final and removing updateTermIndex()?
That makes sense. Done
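The immutability suggestion agreed to here amounts to something like the sketch below. Field names mirror the discussion, but this is not the real SCMRatisSnapshotInfo, which carries more state.

```java
// Hypothetical immutable snapshot-info holder: both fields are final,
// so there is nothing for an updateTermIndex() method to do.
final class SnapshotInfoSketch {
  private final long term;
  private final long snapshotIndex;

  SnapshotInfoSketch(long term, long snapshotIndex) {
    this.term = term;
    this.snapshotIndex = snapshotIndex;
  }

  long getTerm() { return term; }
  long getIndex() { return snapshotIndex; }
  // A newer snapshot simply means constructing a new instance.
}
```

Making the fields final also removes the need for the volatile keyword questioned elsewhere in this review, since final fields are safely published after construction.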
GlenGeng-awx left a comment
Thanks for the job!
Better to also solve the following task in this PR; it should be part of the job of loading the snapshot.
https://issues.apache.org/jira/browse/HDDS-4533
Will it be better to call latestSnapshot = latestTrxInfo.toSnapshotInfo(); here, and remove the setLatestSnapshot() method?
If the caller always has to call flush() and setLatestSnapshot() together, it is better to merge them to avoid human mistakes.
The snapshot info is also set during initialization, so flush() is not the only place that needs to update that information.
How about reverting the change to process() and changing it like this?
applyTransactionFuture.complete(process(request));
transactionBuffer.updateLatestTrxInfo(SCMTransactionInfo.builder()
    .setCurrentTerm(trx.getLogEntry().getTerm())
    .setTransactionIndex(trx.getLogEntry().getIndex())
    .build());
Then process() does not need to know about the trxInfo.
+1
How about moving this shouldUpdate() into SCMTransactionInfo as a method isEmpty()? We'd better encapsulate the magic numbers 0 and -1 inside SCMTransactionInfo.
Makes sense
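The encapsulation agreed to here could look roughly like the sketch below. The class name is a stand-in for SCMTransactionInfo, and the defaults (term 0, index -1) are the magic numbers mentioned above.

```java
// Hypothetical sketch: the default "no transaction yet" values live
// inside the class, so callers never compare against 0 or -1 directly.
final class TxInfoSketch {
  private static final long DEFAULT_TERM = 0;
  private static final long DEFAULT_INDEX = -1;

  private final long term;
  private final long transactionIndex;

  TxInfoSketch(long term, long transactionIndex) {
    this.term = term;
    this.transactionIndex = transactionIndex;
  }

  // Factory for the placeholder used before any snapshot exists.
  static TxInfoSketch defaultInfo() {
    return new TxInfoSketch(DEFAULT_TERM, DEFAULT_INDEX);
  }

  boolean isEmpty() {
    return term == DEFAULT_TERM && transactionIndex == DEFAULT_INDEX;
  }
}
```

A caller then writes `if (!info.isEmpty()) { ... }` instead of spelling out the sentinel values, which is the readability win being discussed.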
Better to remove the info log at line 133 and replace line 156 with:
LOG.info("Current Snapshot Index {}, takeSnapshot took {} ms",
    getLastAppliedTermIndex(), Time.monotonicNow() - startTime);
Done
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
@GlenGeng I prefer to fix such issues separately, so I created a PR for it: #1796
Force-pushed 22b07be to b1aa53e
shouldUpdate() is not very intuitive in the context of SCMTransactionInfo; how about isInitialized() or something like that?
isInitialized is a good name.
DBTransactionBuffer can be fetched by calling SCMHAManager#getDBTransactionBuffer, so this function newPipelineManagerWithMockBuffer is not needed:
// Create PipelineStateManager
StateManager stateManager = PipelineStateManagerV2Impl
    .newBuilder().setPipelineStore(pipelineStore)
    .setRatisServer(scmhaManager.getRatisServer())
    .setNodeManager(nodeManager)
    .setSCMDBTransactionBuffer(scmhaManager.getDBTransactionBuffer())
    .build();
This is a very good point.
Should it be
if (latestTrxInfo.isInitialized()) {
  updateLastAppliedTermIndex(...);
}
?
Actually it is !latestTrxInfo.isInitialized().
SCMTransactionInfo latestTrxInfo = buffer.getLatestTrxInfo(); comes from the buffer, which loads the SCMTransactionInfo from the DB. If no snapshot was taken before (e.g., a brand-new SCM), there is no SCMTransactionInfo in the DB, so the buffer creates a default one.
So we only need to call updateLastAppliedTermIndex when latestTrxInfo is not the default one (i.e., it was loaded from the DB and is meaningful).
I think the reset of the RocksDB batch operation should be made independent of flush(). We may or may not need to reinitialize it every time we call flush(). For example, shutting down the Raft server instance may initiate the last snapshot but will not require the batch reinitialization.
I agree with it :-)
In fact, we discussed not exposing the batch operation above, and I created https://issues.apache.org/jira/browse/HDDS-4661.
As this PR is becoming larger and larger, I am planning to address the batch-operation-related comments in HDDS-4661, including when to init and close it.
How do you feel? Do you agree with my idea?
I am OK with addressing it in a different PR.
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java
Unintended change?
Yes :) I will undo it.
The failed UT is the decommission one, which is known to be flaky.
Can we rename applyTransaction to applyTransactionSerial, as this is a serialized operation anyway?
I created https://issues.apache.org/jira/browse/HDDS-4684.
After checking the Ratis state machine interface, I can see two functions:
/**
* Called for transactions that have been committed to the RAFT log. This step is called
* sequentially in strict serial order that the transactions have been committed in the log.
* The SM is expected to do only necessary work, and leave the actual apply operation to the
* applyTransaction calls that can happen concurrently.
* @param trx the transaction state including the log entry that has been committed to a quorum
* of the raft peers
* @return The Transaction context.
*/
TransactionContext applyTransactionSerial(TransactionContext trx);
/**
* Apply a committed log entry to the state machine. This method can be called concurrently with
* the other calls, and there is no guarantee that the calls will be ordered according to the
* log commit order.
* @param trx the transaction state including the log entry that has been committed to a quorum
* of the raft peers
*/
CompletableFuture<Message> applyTransaction(TransactionContext trx);
So there is no API like the following:
Message applyTransactionSerial(TransactionContext trx);
that is, a function that applies a transaction in strict serial order and returns a Message.
I am planning to send an email to the Ratis community to discuss the intention of having only one applyTransaction that returns a Message (as a CompletableFuture) but can be called concurrently.
We can address this API change in HDDS-4684.
Please add a @VisibleForTesting annotation.
Done
Force-pushed 0d51d78 to 76509c8
bshashikant left a comment
The patch looks OK. The changes discussed will be addressed in subsequent JIRAs.
The transaction buffer flush can now only happen via a Ratis snapshot, but if Ratis is not enabled, there needs to be a RocksDB sync on every DB update, or we need a way to periodically flush the buffer changes to the DB. This can be done as part of a separate JIRA.
Thanks @amaliujia for the contribution.
What changes were proposed in this pull request?
Design doc: https://docs.google.com/document/d/1uy4_ER2V6nNQJ7_5455Wz8NmI142JHPnif6Y1OdPi8E/edit?usp=sharing
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-3208
How was this patch tested?
UT