-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-4428. IT TestOzoneManagerHAMetadataOnly.testOMRetryCache is flaky #1579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Assert.assertTrue(logCapturer.getOutput().contains("created volume:" | ||
| + volumeName)); | ||
|
|
||
| Thread.sleep(100L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for this sleep?
Is created volume: logged more than once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error occurred on line 423,I think that when logCapturer.clearOutput(); is executed, there will be a situation where the previous instruction has not been completed, which will cause the data in the cache to not be completely cleared.This will cause this error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there will be a situation where the previous instruction has not been completed
If that is so, the assertion above shouldn't fail?
Assert.assertTrue(logCapturer.getOutput().contains("created volume:"
+ volumeName));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the assert true ensures, that created volume is loged once for the volume, then it clears the log. Then on the same request, it expects the same log line isn't logged again. I think the create Volume line is logged more than once. So, it passes assertTrue. since the line was logged atleast once, then clears it and the other occurrences come post clearing it, thus the failure.
I was running the test, and it logged multiple time -
2020-11-28 12:51:24,067 [OM StateMachine ApplyTransaction Thread - 0] INFO volume.OMVolumeCreateRequest (OMVolumeCreateRequest.java:validateAndUpdateCache(196)) - created volume:7249ac13-c180-40f8-a636-8d4eb481ca48 for user:ayushsaxena
2020-11-28 12:51:24,067 [OM StateMachine ApplyTransaction Thread - 0] INFO volume.OMVolumeCreateRequest (OMVolumeCreateRequest.java:validateAndUpdateCache(196)) - created volume:7249ac13-c180-40f8-a636-8d4eb481ca48 for user:ayushsaxena
2020-11-28 12:51:24,067 [OM StateMachine ApplyTransaction Thread - 0] INFO volume.OMVolumeCreateRequest (OMVolumeCreateRequest.java:validateAndUpdateCache(196)) - created volume:7249ac13-c180-40f8-a636-8d4eb481ca48 for user:ayushsaxena
So chances are we cleared the log, after first occurrence.
Do you have the logs when this test fails, Can you check this there whether created volume is logged multiple times, I think it is getting logged once per OM.(Wild Guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If in case you are able to repro, try something like this -
- Assert.assertTrue(logCapturer.getOutput().contains("created volume:"
- + volumeName));
+ GenericTestUtils.waitFor(() -> StringUtils
+ .countMatches(logCapturer.getOutput(), "created volume:" + volumeName)
+ == getCluster().getOzoneManagersList().size() + 1, 100, 5000);
If this fixes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fixed, the issue was previously it used to check Create Volume, but as it is running on all 3 OM's So, other OM's when they run they have this in the log, even after log capture clear. I have changed the log output condition check, which has resolved this issue test failing.
For more info refer below commit
HDDS-4262. Use ClientID and CallID from Rpc Client to detect retry requests (#1436)
I think it is getting logged once per OM.(Wild Guess)
Yes it is logged once per OM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is still failing
https://github.com/apache/ozone/pull/1538/checks?check_run_id=1336371166#step:4:3235
There is one more link in the jira, I think it is failing in the code added as part of HDDS-4262?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I gave the wrong Jira reference.
The issue is fixed by below commit.
4669043
This is fixed by HDDS-4329, and the link provided run is Oct 31st, and HDDS-4329 got committed Dec 8th, 2020.
Check is changed as below in the test.
// If 2nd time executed, it will fail with Volume creation failed. check
// for that.
Assert.assertFalse(logCapturer.getOutput().contains(
"Volume creation failed"));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanx Bharat, Yes, This one should solve the issue according to me.
@wycccccc can you give a try as well, If it is working for you as well, we can close this.
Thanx for the efforts here. :-)
bharatviswa504
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I beleive we don't require the changes, this is fixed as part od HDDS-4262.
Are you still seeing this issue, In recent runs, I don't see this test failing?
| Assert.assertTrue(logCapturer.getOutput().contains("created volume:" | ||
| + volumeName)); | ||
|
|
||
| Thread.sleep(100L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fixed, the issue was previously it used to check Create Volume, but as it is running on all 3 OM's So, other OM's when they run they have this in the log, even after log capture clear. I have changed the log output condition check, which has resolved this issue test failing.
For more info refer below commit
HDDS-4262. Use ClientID and CallID from Rpc Client to detect retry requests (#1436)
I think it is getting logged once per OM.(Wild Guess)
Yes it is logged once per OM.
|
Thanx @ayushtkn @bharatviswa504 review. I agree with you that the issue has been resolved.So I will close this pr. |

What changes were proposed in this pull request?
Fix the bug about
org.apache.hadoop.ozone.om.TestOzoneManagerHAMetadataOnly.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4428
How was this patch tested?
In my original test, an error occurred on average 30 times,but now there are no more errors in several tests.

``