Skip to content

Conversation

@GlenGeng-awx
Copy link
Contributor

What changes were proposed in this pull request?

Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5148

How was this patch tested?

CI

@GlenGeng-awx
Copy link
Contributor Author

cc @runzhiwang Please take a look at this PR. Thanks !

@GlenGeng-awx GlenGeng-awx requested a review from runzhiwang April 26, 2021 12:35
Copy link
Contributor

@bharatviswa504 bharatviswa504 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Copy link
Member

@elek elek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started a discussion on the mailing list and would like to continue the discussion o community meeting.

I propose to avoid any usage of SNAPSHOT Ratis versions as last release is delayed because of similar reasons.

Releasing 2.0.1 from Ratis seems to be very easy as Ratis is TLP.

@elek
Copy link
Member

elek commented Apr 26, 2021

Let me make it clear: It's not a blocking comment as we already have a SNAPSHOT dependency, feel free to go ahead. But I am planning to suggest 2.0.1 Ratis release.

Can you please share what is the Ratis commit which was required here? (Just to be sure it's included if we do a patch release..)

@bharatviswa504
Copy link
Contributor

@elek for SCM HA snapshot implementation to work we require RATIS-1326, RATIS-1356, and RATIS-1369.
The commit related to this is 00f0c85

@adoroszlai
Copy link
Contributor

@GlenGeng I think pushing empty commit is better for triggering new CI run than closing/reopening. Most importantly it keeps the PR's build history easily available. We also gain more data points (regarding CI failures), because it triggers two builds: PR and push. It is also slightly less noisy.

@GlenGeng-awx
Copy link
Contributor Author

GlenGeng-awx commented Apr 27, 2021

Thanks @elek for looking at this PR, and overall I agree with your suggestion. Thanks @bharatviswa504 for the explanation and the sharing of the commit and the related context.

I think pushing empty commit is better for triggering new CI run than closing/reopening

Thanks @adoroszlai for the suggestion. I will follow it.

@bharatviswa504
Copy link
Contributor

@GlenGeng Can you check if test failure is related I see one of the test related to InstallSnapshot is failing.

@GlenGeng-awx
Copy link
Contributor Author

@bharatviswa504 After installSnapshot, the lifecycle of follower's SCMStateMachine is in PAUSED state.

          stateMachine.followerEvent().notifyInstallSnapshotFromLeader(roleInfoProto, firstAvailableLogTermIndex)
              .whenComplete((reply, exception) -> {
                if (exception != null) {
                  LOG.warn("{}: Failed to notify StateMachine to InstallSnapshot. Exception: {}",
                      getMemberId(), exception.getMessage());
                  inProgressInstallSnapshotRequest.compareAndSet(firstAvailableLogTermIndex, null);
                  return;
                }

                if (reply != null) {
                  LOG.info("{}: StateMachine successfully installed snapshot index {}. Reloading the StateMachine.",
                      getMemberId(), reply.getIndex());
                  stateMachine.pause();
                  state.updateInstalledSnapshotIndex(reply);
                  state.reloadStateMachine(reply.getIndex());
                }
                inProgressInstallSnapshotRequest.compareAndSet(firstAvailableLogTermIndex, null);
              });
        } 

The reason is, after notifyInstallSnapshotFromLeader is called, ratis call stateMachine.pause(); again.
I am checking it.


@Override
public void reinitialize() {
if (getLifeCycleState() == LifeCycle.State.PAUSED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @hanishakoneru does OmStateMachine also need similar code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: How these tests are working before? (Not able to understand that? can you shed some info)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After notifyInstallSnapshotFromLeader is called, ratis calls

                if (reply != null) {
                  LOG.info("{}: StateMachine successfully installed snapshot index {}. Reloading the StateMachine.",
                      getMemberId(), reply.getIndex());
                  stateMachine.pause();
                  state.updateInstalledSnapshotIndex(reply);
                  state.reloadStateMachine(reply.getIndex());
                }

stateMachine.pause(); will make SM to be in PAUSED state, state.reloadStateMachine(reply.getIndex()) will trigger StateMachineUpdater#reload() to be called, which will then call stateMachine.reinitialize();.

This is the reason of the fix.

As far as I known, at the end of TestSCMInstallSnapshotWithHA#testInstallSnapshot, the followerSCM is also in PAUSED state, which is not checked before.

And for testInstallOldCheckpointFailure, notifyInstallSnapshotFromLeader is not really called, since without the fix in RATIS-1369, downloading snapshot taken at index 0 is ignore by follower SCM.

@bshashikant
Copy link
Contributor

I think, in Ratis we need to fix the applyTransaction behaviour as well. If the stateMachine is in paused state, the applyIndex should not increase in the StateMachineUpdater. This is not happening right now.

Then, as part of reloading the stateMachine as part of reinitializing the stateMachine, it can call unpause().

@bshashikant
Copy link
Contributor

Have filed https://issues.apache.org/jira/browse/RATIS-1370 for ratis issue.

@bshashikant bshashikant merged commit ce29843 into apache:master Apr 28, 2021
@bshashikant
Copy link
Contributor

Thanks @GlenGeng for the work and others for the reviews.

errose28 added a commit to errose28/ozone that referenced this pull request May 4, 2021
…ing-upgrade-master-merge2

* upstream/master: (56 commits)
  HDDS-2212. Genconf tool should generate config files for secure clust… (apache#1788)
  HDDS-5166. Remove duplicate assignment of OZONE_OPTS for freon and sh (apache#2195)
  Revert "HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)"
  HDDS-4983. Display key offset for each block in command key info (apache#2051)
  HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)
  HDDS-4585. Support bucket acl operation in S3g (apache#1701)
  HDDS-5153. Decommissioning a dead node should complete immediately (apache#2190)
  HDDS-5147. Intermittent test failure in TestContainerDeletionChoosingPolicy#testRandomChoosingPolicy (apache#2188)
  HDDS-5152. Fix Suggested leader in Client. (apache#2189)
  HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT (apache#2184)
  HDDS-4515. Datanodes should be able to persist and load CRL (apache#2181)
  HDDS-5060. [SCM HA Security] Make InterSCM grpc channel secure. (apache#2187)
  HDDS-5051. Ensure failover to suggested leader if any for NotLeaderException. (apache#2141)
  HDDS-5127. Fix getServiceList when SCM HA is enabled (apache#2173)
  HDDS-4889. Add simple CI check for docs (apache#2156)
  HDDS-5131. Use timeout in github actions (apache#2176)
  HDDS-5103. Fix Install Snapshot Mechanism in SCMStateMachine. (apache#2155)
  HDDS-5124. Use OzoneConsts.OZONE_TIME_ZONE instead of "GMT" (apache#2166)
  HDDS-5047. Refactor Pipeline to use ReplicationConfig instead of factor/type (apache#2096)
  HDDS-5083. Bump version of common-compress (apache#2139)
  ...

Conflicts:
	hadoop-hdds/common/pom.xml
	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConfigKeys.java
	hadoop-hdds/common/src/main/java/org/apache/hadoop/ozone/OzoneConsts.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/HddsDatanodeService.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/SCMNodeManager.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/pipeline/PipelineManager.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMStorageConfig.java
	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java
	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestSCMNodeManager.java
	hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/MiniOzoneClusterImpl.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMStorage.java
	hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
	hadoop-ozone/recon/src/main/java/org/apache/hadoop/ozone/recon/scm/ReconStorageContainerManagerFacade.java
errose28 added a commit to errose28/ozone that referenced this pull request May 13, 2021
…k-in-auth

* HDDS-3698-nonrolling-upgrade: (57 commits)
  Fix compilation errors afte merge Update javassist in recon pom Fix changes introduced in merge that failed TestSCMNodeManager upgrade tests Fix checkstyle Fix intermittent test failure TestSCMNodeManager#testSetNodeOpStateAndCommandFired after merge Skip scm init default layout version in TestOzoneConfigurationFields
  HDDS-2212. Genconf tool should generate config files for secure clust… (apache#1788)
  HDDS-5166. Remove duplicate assignment of OZONE_OPTS for freon and sh (apache#2195)
  Revert "HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)"
  HDDS-4983. Display key offset for each block in command key info (apache#2051)
  HDDS-5144. Create github check to alert when dependency tree is changed (apache#2177)
  HDDS-4585. Support bucket acl operation in S3g (apache#1701)
  HDDS-5153. Decommissioning a dead node should complete immediately (apache#2190)
  HDDS-5147. Intermittent test failure in TestContainerDeletionChoosingPolicy#testRandomChoosingPolicy (apache#2188)
  HDDS-5152. Fix Suggested leader in Client. (apache#2189)
  HDDS-5148. Bump ratis version to 2.1.0-ff8aa66-SNAPSHOT (apache#2184)
  HDDS-4515. Datanodes should be able to persist and load CRL (apache#2181)
  HDDS-5060. [SCM HA Security] Make InterSCM grpc channel secure. (apache#2187)
  HDDS-5051. Ensure failover to suggested leader if any for NotLeaderException. (apache#2141)
  HDDS-5127. Fix getServiceList when SCM HA is enabled (apache#2173)
  HDDS-4889. Add simple CI check for docs (apache#2156)
  HDDS-5131. Use timeout in github actions (apache#2176)
  HDDS-5103. Fix Install Snapshot Mechanism in SCMStateMachine. (apache#2155)
  HDDS-5124. Use OzoneConsts.OZONE_TIME_ZONE instead of "GMT" (apache#2166)
  HDDS-5047. Refactor Pipeline to use ReplicationConfig instead of factor/type (apache#2096)
  ...
bharatviswa504 pushed a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
Co-authored-by: Doroszlai, Attila <[email protected]>
(cherry picked from commit ce29843)
Change-Id: I26e47e4136b67b5628bb35caba1a8b2bcecf8a9b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants