-
Notifications
You must be signed in to change notification settings - Fork 590
HDDS-4568. Add SCMContext to SCM HA #1737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
amaliujia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the changes related to new Ratis API are in master branch already. It might be better to merge master to 2823 branch and then rebase this PR to remove those changes (e.g. changes in XceiverServerRatis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of a change is contained in master (#1728)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. so this is from the cherry picked commit from master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it make sense to use SCMContext to get current term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use SCMContext to encapsulate raft related info, so that components in SCM won't need to hold a reference of SCMHAManager or SCMRatisServer.
For non-HA mode or unit test, we just need an empty SCMContext, instead of a mocked SCMHAManager or a mocked SCMRatisServer.
linyiqun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GlenGeng , I left one minor comment below.
I see now we use the term index value as the SCM leader check. This is used across SCM internal components. Does this will cover the client request behaviour?
For example, one client configured a single SCM address that is a Follower role and then send the request. Will it update the SCM metadata, like pipeline, containers? I see there was a isLeader check before but now that was removed in HDDS-4551. Do you know the context for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also catch the NotLeaderException like other places does? There are some other places we don't catch this exception that is swallowed by IOException.
} catch (NotLeaderException nle) {
} catch (IOException e) {
// We may tolerate a number of failures for sometime
// but if it continues to fail, at some point we need to raise
// an exception and probably fail the SCM ? At present, it simply
// continues to retry the scanning.
LOG.error("Failed to get block deletion transactions from delTX log",
e);
return EmptyTaskResult.newResult();
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will fix them in the next patch.
Hey Yiqun, thanks for the review!
For now, we need client to know all the SCM instances that engage in the SCM raft cluster, if client send request to a follower SCM, it will get a NotLeaderException, and failover to the next SCM instance.
We removed the leader check in HDDS-4551, since all the metadata updates that will be saved into rocksdb will go through ratis, as a RaftClientRequest, so if underly Raft is in a non-Leader role, the replied RaftClientReply will be injected with a NotLeaderException. Please check If needed, we can schedule a zoom meeting to discuss about current SCM HA design. |
|
@GlenGeng , thanks for the detailed explanation! |
ee5e166 to
369efc5
Compare
369efc5 to
4ea79af
Compare
|
Thanks Glen to rebase PR! Will try to give another pass on this PR. |
b358bfc to
b38236c
Compare
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/ha/TestSCMContext.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Left a comment.
runzhiwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks @GlenGeng for detailed explanation. The changes look good to me except one minor comment. Thanks for the efforts. |
What changes were proposed in this pull request?
We want to provide SCMContext, which would be a single source of truth for some key information that is shared across all components within SCM.
SCMContext holds two kind of key information:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4568
How was this patch tested?
CI