@hanishakoneru
Contributor

What changes were proposed in this pull request?

OzoneManagerStateMachine#notifyInstallSnapshotFromLeader() checks the incoming roleInfoProto and proceeds with the install snapshot request only if the role is Leader. This check is wrong: roleInfoProto contains the receiving node's own ID, not the leader's.
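To illustrate why such a check rejects the request, here is a minimal sketch using a simplified model. The `RoleInfo` class, its `knownLeaderId` field, and both helper methods are hypothetical stand-ins invented for this example; they are not the Ratis or Ozone types, and the way the merged patch actually obtains the leader ID is not shown in this conversation. The node names `om2` and `om3` are taken from the log snippet in the review.

```java
import java.util.Optional;

// Illustrative model only; these types are NOT the Ratis or Ozone classes.
final class InstallSnapshotCheckSketch {

    enum Role { LEADER, FOLLOWER }

    // Hypothetical stand-in for the role info passed to the callback.
    // It describes the node that receives the notification, not the sender.
    static final class RoleInfo {
        final String selfId;
        final Role selfRole;
        final String knownLeaderId; // assumed field: the leader as seen by this node

        RoleInfo(String selfId, Role selfRole, String knownLeaderId) {
            this.selfId = selfId;
            this.selfRole = selfRole;
            this.knownLeaderId = knownLeaderId;
        }
    }

    // The check as described in the PR text: continue only if the role is LEADER.
    // On a follower the role info describes the follower, so this never proceeds.
    static Optional<String> leaderWithRoleCheck(RoleInfo info) {
        if (info.selfRole != Role.LEADER) {
            return Optional.empty();
        }
        return Optional.of(info.selfId);
    }

    // Assumed alternative: drop the role check and use the leader ID
    // that the receiving node already knows about.
    static Optional<String> leaderWithoutRoleCheck(RoleInfo info) {
        return Optional.ofNullable(info.knownLeaderId);
    }

    public static void main(String[] args) {
        // om2 is a follower that is notified to install a snapshot from om3.
        RoleInfo om2 = new RoleInfo("om2", Role.FOLLOWER, "om3");
        System.out.println("with role check, proceeds: "
            + leaderWithRoleCheck(om2).isPresent());
        System.out.println("without role check, leader: "
            + leaderWithoutRoleCheck(om2).orElse("unknown"));
    }
}
```

In this model the follower never passes the role check because the role info it receives is its own, which matches the behaviour described above.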

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-4063

How was this patch tested?

Tested manually on a docker cluster.

Contributor

@bharatviswa504 bharatviswa504 left a comment

+1 LGTM.
Tested it on an OM HA cluster. installSnapshot now works when logs are missing.

Scenario tried:

  1. Stopped one of the follower OMs.
  2. Ran freon.
  3. Deleted logs from the leader and the other follower.

Restarted the OM and saw that its transaction info is now up to date.

Log Snippet:
2020-08-05 23:30:35,660 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: inconsistency entries. Reply:om3<-om2#1:FAIL,INCONSISTENCY,nextIndex:21994,term:31,followerCommit:21992
2020-08-05 23:30:35,675 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,675 DEBUG org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Latest Snapshot Info 27#21988
2020-08-05 23:30:35,675 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: notifyInstallSnapshot: nextIndex is 21994 but the leader's first available index is 45162.
2020-08-05 23:30:35,676 INFO org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Received install snapshot notification from OM leader: om3 with term index: (t:28, i:45162)
2020-08-05 23:30:35,677 INFO org.apache.hadoop.ozone.om.OzoneManager: Downloading checkpoint from leader OM om3 and reloading state from the checkpoint.
2020-08-05 23:30:35,677 INFO org.apache.hadoop.ozone.om.snapshot.OzoneManagerSnapshotProvider: Downloading latest checkpoint from Leader OM om3. Checkpoint URL: https://bv-oz-3.bv-oz.root.hwx.site:9875/dbCheckpoint?flushBeforeCheckpoint=true
2020-08-05 23:30:35,680 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,681 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: Snapshot Installation by StateMachine is in progress.
2020-08-05 23:30:35,731 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,731 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: receive installSnapshot: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: Snapshot Installation by StateMachine is in progress.
2020-08-05 23:30:35,733 INFO org.apache.ratis.server.impl.RaftServerImpl: om2@group-9F198C4C3682: reply installSnapshot: om3<-om2#0:FAIL-t31,IN_PROGRESS
2020-08-05 23:30:35,734 INFO org.apache.ratis.grpc.server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om3->om2#0-t31,notify:(t:28, i:45162)
2020-08-05 23:30:36,087 INFO org.apache.hadoop.ozone.om.snapshot.OzoneManagerSnapshotProvider: Successfully downloaded latest checkpoint from leader OM: om3
2020-08-05 23:30:36,089 INFO org.apache.hadoop.ozone.om.OzoneManager: Downloaded checkpoint from Leader om3 to the location /var/lib/hadoop-ozone/om/ratis/snapshot/om.db-om3-1596670235677
2020-08-05 23:30:36,175 INFO org.apache.hadoop.ozone.om.OzoneManager: Installing checkpoint with OMTransactionInfo org.apache.hadoop.ozone.om.ratis.OMTransactionInfo@e19e

@bharatviswa504 bharatviswa504 merged commit cc5901f into apache:master Aug 5, 2020
@bharatviswa504
Contributor

Thank You @hanishakoneru for the contribution.

llemec pushed a commit to llemec/hadoop-ozone that referenced this pull request Aug 7, 2020
ChenSammi pushed a commit to ChenSammi/ozone that referenced this pull request Aug 25, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request Sep 3, 2020
@hanishakoneru hanishakoneru deleted the HDDS-4063 branch December 1, 2020 21:30