Skip to content

Conversation

@bharatviswa504
Copy link
Contributor

What changes were proposed in this pull request?

During reload state for any exception terminate OM, and also cleanup DB backup folder.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4501

How was this patch tested?

Existing tests, this issue is idenitified from one of our customer issues, due to this follower OM went into bad state.

}
} catch (Exception e) {
LOG.error("Failed to delete the backup of the original DB {}", dbBackup);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bharat, in this scenario where the OM state was not reloaded with new checkpoint, it would be better to keep the backup also in place. Just in case the OM needs to be manually reversed back to old state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya thought of it, but my thought process here is when OM restarts it will get a new install Snapshot and try again. Because when OM it starts, anyway it cannot use this backup path, until some one manual replacing.

But I think to be on safer side, will remove this code for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@hanishakoneru hanishakoneru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bharatviswa504 for finding and fixing this.
+1 pending CI.

@bharatviswa504 bharatviswa504 merged commit 4b69f08 into apache:master Nov 24, 2020
@bharatviswa504
Copy link
Contributor Author

Thank You @hanishakoneru for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Dec 1, 2020
* HDDS-3698-upgrade:
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  HDDS-4497. Recon File Size Count task throws SQL Exception. (apache#1612)
errose28 added a commit to errose28/ozone that referenced this pull request Dec 1, 2020
* HDDS-3698-upgrade:
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  HDDS-4497. Recon File Size Count task throws SQL Exception. (apache#1612)
errose28 added a commit to errose28/ozone that referenced this pull request Jan 5, 2021
* master: (40 commits)
  HDDS-4473. Reduce number of sortDatanodes RPC calls (apache#1610)
  HDDS-4485. [DOC] add the authentication rules of the Ozone Ranger. (apache#1603)
  HDDS-4528. Upgrade slf4j to 1.7.30 (apache#1639)
  HDDS-4424. Update README with information how to report security issues (apache#1548)
  HDDS-4484. Use RaftServerImpl isLeader instead of periodic leader update logic in OM and isLeaderReady for read/write requests (apache#1638)
  HDDS-4429. Create unit test for SimpleContainerDownloader. (apache#1551)
  HDDS-4461. Reuse compiled binaries in acceptance test (apache#1588)
  HDDS-4511: Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower. (apache#1625)
  HDDS-4510. SCM can avoid creating RetriableDatanodeEventWatcher for deletion command ACK (apache#1626)
  HDDS-3363. Intermittent failure in testContainerImportExport (apache#1618)
  HDDS-4370. Datanode deletion service can avoid storing deleted blocks. (apache#1620)
  HDDS-4512. Remove unused netty3 transitive dependency (apache#1627)
  HDDS-4481. With HA OM can send deletion blocks to SCM multiple times. (apache#1608)
  HDDS-4487. SCM can avoid using RETRIABLE_DATANODE_COMMAND for datanode deletion commands. (apache#1621)
  HDDS-4471. GrpcOutputStream length can overflow (apache#1617)
  HDDS-4308. Fix issue with quota update (apache#1489)
  HDDS-4392. [DOC] Add Recon architecture to docs (apache#1602)
  HDDS-4501. Reload OM State fail should terminate OM for any exceptions. (apache#1622)
  HDDS-4492. CLI flag --quota should default to 'spaceQuota' to preserve backward compatibility. (apache#1609)
  HDDS-3689. Add various profiles to MiniOzoneChaosCluster to run different modes. (apache#1420)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants