
Conversation

@xBis7 (Contributor) commented Mar 8, 2023

What changes were proposed in this pull request?

In #4140, we added an integration test, testOMHAMetrics, where we restart the leader OM and then wait for a new leader to be elected so that we can check the metrics during the leader change. A timeout of 80000 millis was added on the wait for the new leader. An OM failover can last minutes, so we need to increase the timeout for this test.
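
For context, the wait in question is a bounded poll for the new leader. Below is a minimal sketch of that pattern, assuming the GenericTestUtils.waitFor helper and a MiniOzoneHAClusterImpl handle; the method, class, and import names are illustrative, not copied from the test.

import java.util.concurrent.TimeoutException;

import org.apache.hadoop.ozone.MiniOzoneHAClusterImpl;
import org.apache.hadoop.ozone.om.OzoneManager;
import org.apache.ozone.test.GenericTestUtils;

final class LeaderWaitSketch {
  // Poll until some OM reports itself as the ready leader, or give up
  // after timeoutMillis (80000 ms before this change, 300000 ms after).
  static OzoneManager waitForNewLeader(MiniOzoneHAClusterImpl cluster,
      int timeoutMillis) throws TimeoutException, InterruptedException {
    GenericTestUtils.waitFor(() -> cluster.getOMLeader() != null,
        1000 /* check every second */, timeoutMillis);
    return cluster.getOMLeader();
  }
}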

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8035

How was this patch tested?

This patch was tested manually on a local machine. The @Test annotation was replaced with @RepeatedTest(100) and the test was run locally with Maven. With the original 80000 ms timeout the test failed, while with the increased 300000 ms timeout it passed all 100 repetitions.

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.ozone.om.TestOzoneManagerHAWithData
[INFO] Tests run: 107, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 444.62 s - in org.apache.hadoop.ozone.om.TestOzoneManagerHAWithData
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 107, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
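
For reference, the annotation swap used for the local stress run looks roughly like the sketch below (JUnit 5); the class name and test body here are placeholders, not the committed code.

import org.junit.jupiter.api.RepeatedTest;

// Sketch of the local stress setup, not the committed test class.
class OmHaMetricsStressSketch {
  // @Test was temporarily replaced with @RepeatedTest for the local run
  // and reverted before committing.
  @RepeatedTest(100)
  void testOMHAMetrics() throws Exception {
    // ... restart the leader OM, wait for a new leader, verify OMHAMetrics ...
  }
}

A single method can then be repeated locally with something like mvn test -Dtest='TestOzoneManagerHAWithData#testOMHAMetrics' (Surefire syntax; the exact module and profile flags depend on the build).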

@xBis7 (Contributor Author) commented Mar 8, 2023

@neils-dev @adoroszlai Can you please take a look at this PR?

@adoroszlai (Contributor) left a comment

Thanks @xBis7 for identifying the problem, and running repeated executions with the fix.

Are there any timeouts related to OM Ratis that could be tweaked to reduce the time needed for leader election after restart?

@xBis7 (Contributor Author) commented Mar 8, 2023

Are there any timeouts related to OM Ratis that could be tweaked to reduce the time needed for leader election after restart?

@adoroszlai I've been looking into OzoneManagerRatisServer.newRaftProperties() and OMConfigKeys and I can't see anything that would make a difference. I tweaked some of them and then retried the tests, but there was no change. Do you have something specific in mind?

@adoroszlai (Contributor) commented

Thanks @xBis7 for checking.

Do you have something specific in mind?

No, I'm not familiar with these properties.

@adoroszlai (Contributor) left a comment

@xBis7 I have run 100 iterations of the test and it timed out in 30% of the repetitions even with the 300 s limit.

https://github.com/adoroszlai/hadoop-ozone/actions/runs/4393202042

@xBis7 (Contributor Author) commented Mar 13, 2023

@adoroszlai Thanks for testing it. I'll track it down and fix it.

@neils-dev added the gr label Mar 16, 2023
@xBis7 (Contributor Author) commented Mar 22, 2023

@adoroszlai Although the test always passes when I run it locally on repeat, when I use the workflow you shared it fails more often than it succeeds.

I have also confirmed that there is no leader change during the repetitions. On the first repetition the leader is always Node-1 and it changes to Node-3; on every subsequent repetition it is always Node-3.

I've added a check that all three OMs are up and running. I've set the maximum timeout to the amount of time we wait for the Ratis failover; I don't think the timeout should be longer than that. Furthermore, I've changed the check

- getCluster().getOMLeader().isLeaderReady();
+ getCluster().getOMLeader(); 

since getOMLeader() also checks whether the leader is ready and returns null if it is not.
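
To illustrate why the extra isLeaderReady() call is redundant, here is a rough sketch of the behavior relied on above; the loop and accessor names are assumptions based on this description, not the actual cluster code.

import java.util.List;

import org.apache.hadoop.ozone.om.OzoneManager;

// Illustration only: the cluster helper returns an OM only if it is the
// ready leader, otherwise null, so checking isLeaderReady() again on the
// result adds nothing.
class LeaderLookupSketch {
  private List<OzoneManager> ozoneManagers;

  OzoneManager getOMLeader() {
    for (OzoneManager om : ozoneManagers) {
      if (om.isLeaderReady()) {   // only a ready leader is ever returned
        return om;
      }
    }
    return null;                  // no ready leader yet
  }
}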

There is an underlying issue here that causes the timeout, and it might even have to do with the MiniCluster and its inner workings. I don't have the time to investigate further and we have other priorities. I plan to convert this into a draft PR and maybe close it later on.

@adoroszlai (Contributor) commented

Thanks @xBis7 for the update.

when I use the workflow you shared, it fails more often than it succeeds.

That may happen because GitHub runners have fewer resources than your dev environment, or because of the difference in repetitions (repeating a single test method vs. the entire test class).

I don't have the time to investigate more and we have other priorities.

No problem, we can keep it marked as @Flaky.

@xBis7 (Contributor Author) commented Mar 22, 2023

@adoroszlai I can put back the @Flaky annotation and we can keep the improvements for whoever picks up fixing it.

@xBis7 (Contributor Author) commented Mar 31, 2023

@adoroszlai The latest changes fix the timeout issue. I've launched multiple workflows and it's not occurring anymore. But this revealed another underlying issue that might not even be related to the test: during a leader change, the metrics don't get updated.

OMHAMetrics relies on OzoneManager.updatePeerList() being called; at the end of that method we unregister the metrics and then register them again. My understanding was that every time an OM gets started, stopped, or restarted there is a configuration change and OMStateMachine calls that method. That doesn't seem to be the case.
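
For context, a rough sketch of the registration pattern described above; the field and method names are placeholders standing in for the real OzoneManager / OMHAMetrics code, not verified signatures.

import java.util.List;

// Illustration only: metrics are torn down and re-created at the end of a
// peer-list update, so if no configuration change happens on a leader
// change, the metrics keep reporting the old leader.
class OzoneManagerSketch {
  private Object omhaMetrics;        // placeholder for OMHAMetrics
  private String omNodeId = "om1";   // placeholder node id
  private String leaderId = "om3";   // placeholder leader id

  void updatePeerList(List<String> newPeers) {
    // ... apply the peer list change ...
    omhaMetrics = null;                                 // unregister old metrics
    omhaMetrics = createHaMetrics(omNodeId, leaderId);  // register fresh metrics
  }

  private Object createHaMetrics(String nodeId, String leader) {
    return nodeId + "/" + leader;    // placeholder for the real metrics factory
  }
}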

Here are the latest four workflows, where you can see that there are no timeout failures. All remaining failures are due to the metrics not getting updated.

https://github.com/xBis7/ozone/actions/runs/4566947556

https://github.com/xBis7/ozone/actions/runs/4567066352

https://github.com/xBis7/ozone/actions/runs/4574892711

https://github.com/xBis7/ozone/actions/runs/4574961334

@xBis7 (Contributor Author) commented Apr 4, 2023

@adoroszlai The latest changes fix all the issues that I encountered while looking into this ticket.

To sum everything up: testOMHAMetrics was failing most of the time while waiting for a new leader to be elected. This was either because the leader was not elected in time or because there weren't enough pipelines allocated, and we hit the timeout while waiting for them. Moving the test to a new class resolved the timeout issue: there is no interference from other tests, and every repetition starts with 3 available OMs.

Here are the three latest workflows, running TestOzoneManagerHAWithData on repeat without testOMHAMetrics in it. All passing.

https://github.com/xBis7/ozone/actions/runs/4596538780

https://github.com/xBis7/ozone/actions/runs/4596542019

https://github.com/xBis7/ozone/actions/runs/4596958030

After resolving the timeout issue, another error was uncovered. In 30% of the repetitions the metrics weren't getting updated during a leader change: the old leader was now a follower, but its metrics were still registered with the old state. To fix this, notifyLeaderChanged() from StateMachine.EventApi was overridden and used to initialize the metrics. Unlike updatePeerList(), this method gets called on every leader change, and it had not previously been overridden in OzoneManagerStateMachine.
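
A minimal sketch of that approach, assuming Ratis' StateMachine.EventApi signature for notifyLeaderChanged(); the metrics re-initialization call shown in the comment is a hypothetical helper standing in for whatever OzoneManagerStateMachine actually delegates to.

import org.apache.ratis.protocol.RaftGroupMemberId;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.statemachine.impl.BaseStateMachine;

class LeaderChangeAwareStateMachine extends BaseStateMachine {
  @Override
  public void notifyLeaderChanged(RaftGroupMemberId groupMemberId,
      RaftPeerId newLeaderId) {
    // Invoked on every leader change (unlike a peer-list/config change),
    // so re-registering the HA metrics here keeps them in sync with the
    // current leader.
    // e.g. ozoneManager.omHAMetricsInit(newLeaderId.toString()); // hypothetical helper
  }
}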

Two latest workflows, running the new class TestOzoneManagerHAMetrics on repeat. All passing.

https://github.com/xBis7/ozone/actions/runs/4607890850

https://github.com/xBis7/ozone/actions/runs/4607671480

@adoroszlai (Contributor) left a comment

@xBis7 (Contributor Author) commented Apr 5, 2023

@adoroszlai Thanks for reviewing this patch!

@adoroszlai merged commit cb90fda into apache:master Apr 5, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Apr 6, 2023