
Conversation

@xBis7 (Contributor) commented Mar 8, 2023

What changes were proposed in this pull request?

In #4140, we added an integration test, testOMHAMetrics, where we restart the leader OM and then wait for a new leader to be elected so that we can check the metrics during the leader change. A timeout of 80000 millis was added on the wait for the new leader. An OM failover can last minutes, so we need to increase the timeout for this test.
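
For context, the wait in question is a bounded poll for the new leader. Below is a minimal sketch of that pattern, assuming the GenericTestUtils.waitFor helper and a MiniOzoneHAClusterImpl handle; the method, class, and import names are illustrative, not copied from the test.

import java.util.concurrent.TimeoutException;

import org.apache.hadoop.ozone.MiniOzoneHAClusterImpl;
import org.apache.hadoop.ozone.om.OzoneManager;
import org.apache.ozone.test.GenericTestUtils;

final class LeaderWaitSketch {
  // Poll until some OM reports itself as the ready leader, or give up
  // after timeoutMillis (80000 ms before this change, 300000 ms after).
  static OzoneManager waitForNewLeader(MiniOzoneHAClusterImpl cluster,
      int timeoutMillis) throws TimeoutException, InterruptedException {
    GenericTestUtils.waitFor(() -> cluster.getOMLeader() != null,
        1000 /* check every second */, timeoutMillis);
    return cluster.getOMLeader();
  }
}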

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8035

How was this patch tested?

This patch was tested manually on a local machine. The @Test annotation was replaced with @RepeatedTest(100) and the test was run locally with Maven. With the original 80000 ms timeout the test failed, while with the increased 300000 ms timeout it passed all 100 repetitions.

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.ozone.om.TestOzoneManagerHAWithData
[INFO] Tests run: 107, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 444.62 s - in org.apache.hadoop.ozone.om.TestOzoneManagerHAWithData
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 107, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
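
For reference, the annotation swap used for the local stress run looks roughly like the sketch below (JUnit 5); the class name and test body here are placeholders, not the committed code.

import org.junit.jupiter.api.RepeatedTest;

// Sketch of the local stress setup, not the committed test class.
class OmHaMetricsStressSketch {
  // @Test was temporarily replaced with @RepeatedTest for the local run
  // and reverted before committing.
  @RepeatedTest(100)
  void testOMHAMetrics() throws Exception {
    // ... restart the leader OM, wait for a new leader, verify OMHAMetrics ...
  }
}

A single method can then be repeated locally with something like mvn test -Dtest='TestOzoneManagerHAWithData#testOMHAMetrics' (Surefire syntax; the exact module and profile flags depend on the build).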

@xBis7 (Contributor Author) commented Mar 8, 2023

@neils-dev @adoroszlai Can you please take a look at this PR?

@adoroszlai (Contributor) left a comment

Thanks @xBis7 for identifying the problem, and running repeated executions with the fix.

Are there any timeouts related to OM Ratis that could be tweaked to reduce the time needed for leader election after restart?

@xBis7 (Contributor Author) commented Mar 8, 2023

Are there any timeouts related to OM Ratis that could be tweaked to reduce the time needed for leader election after restart?

@adoroszlai I've been looking into OzoneManagerRatisServer.newRaftProperties() and OMConfigKeys and I can't see anything that would make a difference. I tweaked some of them and then retried the tests, but there was no change. Do you have something specific in mind?

@adoroszlai (Contributor) commented

Thanks @xBis7 for checking.

Do you have something specific in mind?

No, I'm not familiar with these properties.

@adoroszlai (Contributor) left a comment

@xBis7 I have run 100 iterations of the test and it timed out in 30% of the repetitions even with the 300 s limit.

https://github.com/adoroszlai/hadoop-ozone/actions/runs/4393202042

@xBis7 (Contributor Author) commented Mar 13, 2023

@adoroszlai Thanks for testing it. I'll track it down and fix it.

@neils-dev added the gr label Mar 16, 2023
@xBis7 (Contributor Author) commented Mar 22, 2023

@adoroszlai Although the test always passes when I run it locally on repeat, when I use the workflow you shared it fails more often than it succeeds.

I have also confirmed that there is no leader change during the repetitions. On the first repetition the leader is always Node-1 and it changes to Node-3; on every subsequent repetition it is always Node-3.

I've added a check that all three OMs are up and running. I've set the maximum timeout to the amount of time we wait for the Ratis failover; I don't think the timeout should be longer than that. Furthermore, I've changed the check

- getCluster().getOMLeader().isLeaderReady();
+ getCluster().getOMLeader(); 

since getOMLeader() also checks whether the leader is ready and returns null if it is not.
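
To illustrate why the extra isLeaderReady() call is redundant, here is a rough sketch of the behavior relied on above; the loop and accessor names are assumptions based on this description, not the actual cluster code.

import java.util.List;

import org.apache.hadoop.ozone.om.OzoneManager;

// Illustration only: the cluster helper returns an OM only if it is the
// ready leader, otherwise null, so checking isLeaderReady() again on the
// result adds nothing.
class LeaderLookupSketch {
  private List<OzoneManager> ozoneManagers;

  OzoneManager getOMLeader() {
    for (OzoneManager om : ozoneManagers) {
      if (om.isLeaderReady()) {   // only a ready leader is ever returned
        return om;
      }
    }
    return null;                  // no ready leader yet
  }
}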

There is an underlying issue here that causes the timeout, and it might even have to do with the MiniCluster and its inner workings. I don't have the time to investigate further and we have other priorities. I plan to convert this into a draft PR and maybe close it later on.

@adoroszlai (Contributor) commented

Thanks @xBis7 for the update.

when I use the workflow you shared, it fails more often than it succeeds.

That may happen because GitHub runners have fewer resources than your dev environment, or because of the difference in repetitions (repeating a single test method vs. the entire test class).

I don't have the time to investigate more and we have other priorities.

No problem, we can keep it marked as @Flaky.

@xBis7 (Contributor Author) commented Mar 22, 2023

@adoroszlai I can put back the @Flaky annotation and we can keep the improvements for whoever picks up fixing it.

@xBis7 (Contributor Author) commented Mar 31, 2023

@adoroszlai The latest changes fix the timeout issue. I've launched multiple workflows and it's not occurring anymore. But this revealed another underlying issue that might not even be related to the test: during a leader change, the metrics don't get updated.

OMHAMetrics relies on OzoneManager.updatePeerList() being called; at the end of that method we unregister the metrics and then register them again. My understanding was that every time an OM gets started, stopped, or restarted there is a configuration change and OMStateMachine calls that method. That doesn't seem to be the case.
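
For context, a rough sketch of the registration pattern described above; the field and method names are placeholders standing in for the real OzoneManager / OMHAMetrics code, not verified signatures.

import java.util.List;

// Illustration only: metrics are torn down and re-created at the end of a
// peer-list update, so if no configuration change happens on a leader
// change, the metrics keep reporting the old leader.
class OzoneManagerSketch {
  private Object omhaMetrics;        // placeholder for OMHAMetrics
  private String omNodeId = "om1";   // placeholder node id
  private String leaderId = "om3";   // placeholder leader id

  void updatePeerList(List<String> newPeers) {
    // ... apply the peer list change ...
    omhaMetrics = null;                                 // unregister old metrics
    omhaMetrics = createHaMetrics(omNodeId, leaderId);  // register fresh metrics
  }

  private Object createHaMetrics(String nodeId, String leader) {
    return nodeId + "/" + leader;    // placeholder for the real metrics factory
  }
}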

Here are the latest four workflows, where you can see that there are no timeout failures. All remaining failures are due to the metrics not getting updated.

https://github.com/xBis7/ozone/actions/runs/4566947556

https://github.com/xBis7/ozone/actions/runs/4567066352

https://github.com/xBis7/ozone/actions/runs/4574892711

https://github.com/xBis7/ozone/actions/runs/4574961334

@xBis7 (Contributor Author) commented Apr 4, 2023

@adoroszlai The latest changes fix all the issues that I encountered while looking into this ticket.

To sum everything up: testOMHAMetrics was failing most of the time while waiting for a new leader to be elected. This was either because the leader was not elected in time or because there weren't enough pipelines allocated, and we hit the timeout while waiting for them. Moving the test to a new class resolved the timeout issue: there is no interference from other tests, and every repetition starts with 3 available OMs.

Here are the three latest workflows, running TestOzoneManagerHAWithData on repeat without testOMHAMetrics in it. All passing.

https://github.com/xBis7/ozone/actions/runs/4596538780

https://github.com/xBis7/ozone/actions/runs/4596542019

https://github.com/xBis7/ozone/actions/runs/4596958030

After resolving the timeout issue, another error was uncovered. In 30% of the repetitions the metrics weren't getting updated during a leader change: the old leader was now a follower, but its metrics were still registered with the old state. To fix this, notifyLeaderChanged() from StateMachine.EventApi was overridden and used to initialize the metrics. Unlike updatePeerList(), this method gets called on every leader change, and it had not previously been overridden in OzoneManagerStateMachine.
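
A minimal sketch of that approach, assuming Ratis' StateMachine.EventApi signature for notifyLeaderChanged(); the metrics re-initialization call shown in the comment is a hypothetical helper standing in for whatever OzoneManagerStateMachine actually delegates to.

import org.apache.ratis.protocol.RaftGroupMemberId;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.statemachine.impl.BaseStateMachine;

class LeaderChangeAwareStateMachine extends BaseStateMachine {
  @Override
  public void notifyLeaderChanged(RaftGroupMemberId groupMemberId,
      RaftPeerId newLeaderId) {
    // Invoked on every leader change (unlike a peer-list/config change),
    // so re-registering the HA metrics here keeps them in sync with the
    // current leader.
    // e.g. ozoneManager.omHAMetricsInit(newLeaderId.toString()); // hypothetical helper
  }
}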

Two latest workflows, running the new class TestOzoneManagerHAMetrics on repeat. All passing.

https://github.com/xBis7/ozone/actions/runs/4607890850

https://github.com/xBis7/ozone/actions/runs/4607671480

@adoroszlai (Contributor) left a comment

@xBis7 (Contributor Author) commented Apr 5, 2023

@adoroszlai Thanks for reviewing this patch!

@adoroszlai merged commit cb90fda into apache:master Apr 5, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Apr 6, 2023