HDDS-8035. Intermittent timeout in TestOzoneManagerHAWithData.testOMHAMetrics #4362
Conversation
@neils-dev @adoroszlai Can you please take a look at this PR?
adoroszlai
left a comment
Thanks @xBis7 for identifying the problem, and running repeated executions with the fix.
Are there any timeouts related to OM Ratis that could be tweaked to reduce the time needed for leader election after restart?
@adoroszlai I've been looking into …
Thanks @xBis7 for checking.
No, I'm not familiar with these properties.
adoroszlai
left a comment
@xBis7 I have run 100x iterations of the test and it has timed out in 30% of repetitions even with 300s limit.
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4393202042
@adoroszlai Thanks for testing it. I'll track it down and fix it.
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
@adoroszlai Although the test always passes when I run it locally on repeat, when I use the workflow you shared it fails more often than it succeeds. I have also confirmed that during repetitions there is no leader change: on the first repetition the leader is always Node-1, it then changes to Node-3, and for every subsequent repetition it stays Node-3. I've added a check that all three OMs are up and running, and I've set the maximum timeout to the amount of time we wait for the Ratis failover; I don't think the timeout should be longer than that. Furthermore, I've changed the check from `getCluster().getOMLeader().isLeaderReady();` to `getCluster().getOMLeader();`. There is an underlying issue here that causes the timeout, and it might even have to do with the MiniCluster and its workings. I don't have the time to investigate more and we have other priorities. I plan to convert this into a draft PR and maybe close it later on.
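The bounded wait discussed above amounts to a polling loop with a deadline. The sketch below is an illustrative stand-in, not Ozone's actual test code: the `waitFor` helper and the simulated `leaderElected` supplier are made up here, whereas the real test would poll `getCluster().getOMLeader()` on the MiniCluster (Hadoop-style tests often use a `GenericTestUtils.waitFor`-like utility for this).

```java
import java.util.function.BooleanSupplier;

public class WaitForLeader {

  /** Polls the check every intervalMillis until it passes or the deadline elapses. */
  static boolean waitFor(BooleanSupplier check, long intervalMillis,
      long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
      if (check.getAsBoolean()) {
        return true;                       // condition met within the bound
      }
      if (System.currentTimeMillis() >= deadline) {
        return false;                      // timed out, as in the flaky runs
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Simulated election: the "leader" appears about 50 ms after start.
    long start = System.currentTimeMillis();
    BooleanSupplier leaderElected =
        () -> System.currentTimeMillis() - start > 50;
    // A generous 300000 ms bound, matching the timeout used in the patch.
    System.out.println(waitFor(leaderElected, 10, 300_000)); // prints "true"
  }
}
```

With a bound shorter than the real failover time, the same loop returns false, which is exactly the intermittent timeout the ticket describes.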
Thanks @xBis7 for the update.
That may happen because GitHub runners have fewer resources than your dev environment, or because of the difference in repetitions (repeating a single test method vs. the entire test class).
No problem, we can keep it marked as …
@adoroszlai I can put back the …
@adoroszlai The latest changes fix the timeout issue. I've launched multiple workflows and it's not occurring anymore. But this revealed another underlying issue that might not even have to do with the test: during a leader change the metrics don't get updated.
Latest four workflows, where you can see there is no timeout failure; all failures are due to the metrics not getting updated:
https://github.com/xBis7/ozone/actions/runs/4566947556
https://github.com/xBis7/ozone/actions/runs/4567066352
https://github.com/xBis7/ozone/actions/runs/4574892711
https://github.com/xBis7/ozone/actions/runs/4574961334
...ne/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOzoneManagerHAWithData.java
@adoroszlai The latest changes fix all the issues that I encountered while looking into this ticket. To sum everything up:

Three latest workflows, running …
https://github.com/xBis7/ozone/actions/runs/4596538780
https://github.com/xBis7/ozone/actions/runs/4596542019
https://github.com/xBis7/ozone/actions/runs/4596958030

After resolving the timeout issue, another error was uncovered. For 30% of the repetitions, during a leader change, the metrics weren't getting updated. The old leader was now a follower but its metrics were still registered with the old state. To fix this, …

Two latest workflows, running the new class …
https://github.com/xBis7/ozone/actions/runs/4607890850
https://github.com/xBis7/ozone/actions/runs/4607671480
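The stale-metrics symptom summarized above (a demoted leader still registered with its old state) can be shown in miniature with an unregister-then-reregister on role change. Everything here is an illustrative stand-in: the `REGISTRY` map, node ids, and `OMRole` enum are made up for the sketch and are not Ozone's real metrics API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OmHaMetricsSketch {

  enum OMRole { LEADER, FOLLOWER }

  // Stand-in for a metrics registry keyed by OM node id.
  static final Map<String, OMRole> REGISTRY = new ConcurrentHashMap<>();

  /** On a Ratis role change, drop the stale entry and register the new state. */
  static void onRoleChange(String nodeId, OMRole newRole) {
    REGISTRY.remove(nodeId);        // unregister the old (possibly stale) metrics
    REGISTRY.put(nodeId, newRole);  // register metrics reflecting the new role
  }

  public static void main(String[] args) {
    onRoleChange("om1", OMRole.LEADER);
    onRoleChange("om1", OMRole.FOLLOWER);     // leader change: om1 is demoted
    System.out.println(REGISTRY.get("om1"));  // prints "FOLLOWER", not stale LEADER
  }
}
```

Without the `remove`-then-`put` on every role change, the registry would keep reporting `LEADER` for a node that had already stepped down, which matches the 30% failure mode described above.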
adoroszlai
left a comment
Thanks @xBis7 for persisting with this issue.
100/100 passed:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4617974982
@adoroszlai Thanks for reviewing this patch!
* master: (155 commits)
  update readme (apache#4535)
  HDDS-8374. Disable flaky unit test: TestContainerStateCounts
  HDDS-8016. updated the ozone doc for linked bucket and deletion async limitation (apache#4526)
  HDDS-8237. [Snapshot] loadDb() used by SstFiltering service creates extraneous directories. (apache#4446)
  HDDS-8035. Intermittent timeout in TestOzoneManagerHAWithData.testOMHAMetrics (apache#4362)
  HDDS-8039. Allow container inspector to run from ozone debug. (apache#4337)
  HDDS-8304. [Snapshot] Reduce flakiness in testSkipTrackingWithZeroSnapshot (apache#4487)
  HDDS-7974. [Snapshot] KeyDeletingService to be aware of Ozone snapshots (apache#4486)
  HDDS-8368. ReplicationManager: Create ContainerReplicaOp with correct target Datanode (apache#4532)
  HDDS-8358. Fix the space usage comparator in ContainerBalancerSelectionCriteria (apache#4527)
  HDDS-8359. ReplicationManager: Fix getContainerReplicationHealth() so that it builds ContainerCheckRequest correctly (apache#4528)
  HDDS-8361. Useless object in TestOzoneBlockTokenIdentifier (apache#4517)
  HDDS-8325. Consolidate and refine RocksDB metrics of services (apache#4506)
  HDDS-8135. Incorrect synchronization during certificate renewal in DefaultCertificateClient. (apache#4381)
  HDDS-8127. Exclude deleted containers from Recon container count (apache#4440)
  HDDS-8364. ReadReplicas may give wrong results with topology-aware read enabled (apache#4522)
  HDDS-8354. Avoid WARNING about ObjectEndpoint#get (apache#4515)
  HDDS-8324. DN data cache gets removed randomly asking for data from disk (apache#4499)
  HDDS-8291. Upgrade to Hadoop 3.3.5 (apache#4484)
  HDDS-8355. Mark TestOMRatisSnapshots#testInstallSnapshot as flaky
  ...
What changes were proposed in this pull request?
In #4140 , we added an integration test, `testOMHAMetrics`, where we restart the leader OM and then wait for a new leader to be elected, in order to check the metrics during the leader change. A timeout, set to 80000 millis, was added on the wait for the new leader. An OM failover can last minutes, so we need to expand the timeout for the test.

What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8035
How was this patch tested?
This patch was tested manually locally.
The `@Test` annotation was replaced with `@RepeatedTest(100)` and the test was run with Maven locally. With the original timeout of 80000 millis the test failed, while with the increased timeout of 300000 millis the test passed all 100 repetitions.
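As a rough numeric illustration of why the bump matters: a failover that "can last minutes" blows past the old 80000 ms bound but fits comfortably in the new 300000 ms one. The 120000 ms election delay below is a hypothetical value chosen only for the sketch, not a measured figure from the test.

```java
public class TimeoutBounds {

  /** True if a failover taking electionMillis fits within timeoutMillis. */
  static boolean fits(long electionMillis, long timeoutMillis) {
    return electionMillis <= timeoutMillis;
  }

  public static void main(String[] args) {
    long election = 120_000;                      // hypothetical 2-minute failover
    System.out.println(fits(election, 80_000));   // old bound: prints "false"
    System.out.println(fits(election, 300_000));  // new bound: prints "true"
  }
}
```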