-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT #2184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @runzhiwang Please take a look at this PR. Thanks ! |
bharatviswa504
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
elek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started a discussion on the mailing list and would like to continue the discussion o community meeting.
I propose to avoid any usage of SNAPSHOT Ratis versions as last release is delayed because of similar reasons.
Releasing 2.0.1 from Ratis seems to be very easy as Ratis is TLP.
|
Let me make it clear: It's not a blocking comment as we already have a SNAPSHOT dependency, feel free to go ahead. But I am planning to suggest 2.0.1 Ratis release. Can you please share what is the Ratis commit which was required here? (Just to be sure it's included if we do a patch release..) |
|
@elek for SCM HA snapshot implementation to work we require RATIS-1326, RATIS-1356, and RATIS-1369. |
|
@GlenGeng I think pushing empty commit is better for triggering new CI run than closing/reopening. Most importantly it keeps the PR's build history easily available. We also gain more data points (regarding CI failures), because it triggers two builds: PR and push. It is also slightly less noisy. |
|
Thanks @elek for looking at this PR, and overall I agree with your suggestion. Thanks @bharatviswa504 for the explanation and the sharing of the commit and the related context.
Thanks @adoroszlai for the suggestion. I will follow it. |
|
@GlenGeng Can you check if test failure is related I see one of the test related to InstallSnapshot is failing. |
|
@bharatviswa504 After installSnapshot, the lifecycle of follower's SCMStateMachine is in PAUSED state. The reason is, after |
|
|
||
| @Override | ||
| public void reinitialize() { | ||
| if (getLifeCycleState() == LifeCycle.State.PAUSED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @hanishakoneru does OmStateMachine also need similar code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: How these tests are working before? (Not able to understand that? can you shed some info)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After notifyInstallSnapshotFromLeader is called, ratis calls
if (reply != null) {
LOG.info("{}: StateMachine successfully installed snapshot index {}. Reloading the StateMachine.",
getMemberId(), reply.getIndex());
stateMachine.pause();
state.updateInstalledSnapshotIndex(reply);
state.reloadStateMachine(reply.getIndex());
}
stateMachine.pause(); will make SM to be in PAUSED state, state.reloadStateMachine(reply.getIndex()) will trigger StateMachineUpdater#reload() to be called, which will then call stateMachine.reinitialize();.
This is the reason of the fix.
As far as I known, at the end of TestSCMInstallSnapshotWithHA#testInstallSnapshot, the followerSCM is also in PAUSED state, which is not checked before.
And for testInstallOldCheckpointFailure, notifyInstallSnapshotFromLeader is not really called, since without the fix in RATIS-1369, downloading snapshot taken at index 0 is ignore by follower SCM.
|
I think, in Ratis we need to fix the applyTransaction behaviour as well. If the stateMachine is in paused state, the applyIndex should not increase in the StateMachineUpdater. This is not happening right now. Then, as part of reloading the stateMachine as part of reinitializing the stateMachine, it can call unpause(). |
|
Have filed https://issues.apache.org/jira/browse/RATIS-1370 for ratis issue. |
|
Thanks @GlenGeng for the work and others for the reviews. |
…ing-upgrade-master-merge2 * upstream/master: (56 commits) HDDS-2212. Genconf tool should generate config files for secure clust… (apache#1788) HDDS-5166. Remove duplicate assignment of OZONE_OPTS for freon and sh (apache#2195) Revert "HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)" HDDS-4983. Display key offset for each block in command key info (apache#2051) HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177) HDDS-4585. Support bucket acl operation in S3g (apache#1701) HDDS-5153. Decommissioning a dead node should complete immediately (apache#2190) HDDS-5147. Intermittent test failure in TestContainerDeletionChoosingPolicy#testRandomChoosingPolicy (apache#2188) HDDS-5152. Fix Suggested leader in Client. (apache#2189) HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT (apache#2184) HDDS-4515. Datanodes should be able to persist and load CRL (apache#2181) HDDS-5060. [SCM HA Security] Make InterSCM grpc channel secure. (apache#2187) HDDS-5051. Ensure failover to suggested leader if any for NotLeaderException. (apache#2141) HDDS-5127. Fix getServiceList when SCM HA is enabled (apache#2173) HDDS-4889. Add simple CI check for docs (apache#2156) HDDS-5131. Use timeout in github actions (apache#2176) HDDS-5103. Fix Install Snapshot Mechanism in SCMStateMachine. (apache#2155) HDDS-5124. Use OzoneConsts.OZONE_TIME_ZONE instead of "GMT" (apache#2166) HDDS-5047. Refactor Pipeline to use ReplicationConfig instead of factor/type (apache#2096) HDDS-5083. Bump version of common-compress (apache#2139) ... Conflicts: hadoop-hdds/common/pom.xml hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManager.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMStorageConfig.java hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestSCMNodeManager.java hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/MiniOzoneClusterImpl.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMStorage.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconStorageContainerManagerFacade.java
…k-in-auth * HDDS-3698-nonrolling-upgrade: (57 commits) Fix compilation errors afte merge Update javassist in recon pom Fix changes introduced in merge that failed TestSCMNodeManager upgrade tests Fix checkstyle Fix intermittent test failure TestSCMNodeManager#testSetNodeOpStateAndCommandFired after merge Skip scm init default layout version in TestOzoneConfigurationFields HDDS-2212. Genconf tool should generate config files for secure clust… (apache#1788) HDDS-5166. Remove duplicate assignment of OZONE_OPTS for freon and sh (apache#2195) Revert "HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)" HDDS-4983. Display key offset for each block in command key info (apache#2051) HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177) HDDS-4585. Support bucket acl operation in S3g (apache#1701) HDDS-5153. Decommissioning a dead node should complete immediately (apache#2190) HDDS-5147. Intermittent test failure in TestContainerDeletionChoosingPolicy#testRandomChoosingPolicy (apache#2188) HDDS-5152. Fix Suggested leader in Client. (apache#2189) HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT (apache#2184) HDDS-4515. Datanodes should be able to persist and load CRL (apache#2181) HDDS-5060. [SCM HA Security] Make InterSCM grpc channel secure. (apache#2187) HDDS-5051. Ensure failover to suggested leader if any for NotLeaderException. (apache#2141) HDDS-5127. Fix getServiceList when SCM HA is enabled (apache#2173) HDDS-4889. Add simple CI check for docs (apache#2156) HDDS-5131. Use timeout in github actions (apache#2176) HDDS-5103. Fix Install Snapshot Mechanism in SCMStateMachine. (apache#2155) HDDS-5124. Use OzoneConsts.OZONE_TIME_ZONE instead of "GMT" (apache#2166) HDDS-5047. Refactor Pipeline to use ReplicationConfig instead of factor/type (apache#2096) ...
Co-authored-by: Doroszlai, Attila <[email protected]> (cherry picked from commit ce29843) Change-Id: I26e47e4136b67b5628bb35caba1a8b2bcecf8a9b
What changes were proposed in this pull request?
Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5148
How was this patch tested?
CI