-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-6646. Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled #5217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
hemantk-12
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @devmadhuu for fixing it.
Can you please also add how this patch would fix the issue in the PR description?
| return; | ||
| } | ||
| deleteRootRecursively(fileStatuses); | ||
| Thread.sleep(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use GenericTestUtils#waitFor instead of Thread.sleep()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use GenericTestUtils#waitFor instead of
Thread.sleep()?
@hemantk-12 thanks for the review, I tried using GenericTestUtils, however this in actual run throwing always TimeOutException and not working in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprise that Thread.sleep() works fine but GenericTestUtils#waitFor doesn't. Just curious what was lambda used?
I feel this should work if it is just about cleaning dir.
GenericTestUtils.waitFor(() -> {
try {
FileStatus[] fileStatus = fs.listStatus(ROOT);
return fileStatus != null && fileStatus.length == 0;
} catch (IOException e) {
// You can decide if it should be false or throw RTE.
return false;
}
}, 100, 500);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemantk-12 for review. I have handled as per your suggestion.
This patch directly alone not contributing the resolution of this issue, however another patch which has fixed issue in listStatus API fixed the root cause as this test also was showing intermittent fail in deleteRootDir only, but when running multiple times in multiple jobs, this looks to be a race condition when tests running in parallel. And after adding sleep, I can see list status showing correct results. |
|
@sumitagrawl Pls review. |
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/fs/ozone/TestOzoneFileSystem.java
Outdated
Show resolved
Hide resolved
sumitagrawl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devmadhuu thanks for working over this, have few query over the changes.
| return; | ||
| } | ||
| deleteRootRecursively(fileStatuses); | ||
| Thread.sleep(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How sleep for 500ms solves the problem?
And Who throws TimeoutException in what case make it flaky ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I don't get either why TimeoutException is added here. May be @devmadhuu forgot to remove it while reverting some change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since now GenericTestUtils is being used, I have not removed TimeoutException. This change is not directly related, however many other tests are getting impacted because of deleteRootDIr cleanup, so this is important as part of this test as well.
Can you please add the same in the description? |
| Thread.sleep(500); | ||
| fileStatuses = fs.listStatus(ROOT); | ||
| FileStatus[] finalFileStatuses = fileStatuses; | ||
| GenericTestUtils.waitFor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are supposed to fetch fs.listStatus(ROOT); inside the lambda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemantk-12 , Its handled as per your suggestion. Kindly re-review.
|
@sumitagrawl @hemantk-12 @sadanand48 Kindly re-review. |
|
Thanks @devmadhuu , patch looks good to me, IIUC, there are 3 tests in question:
|
hemantk-12
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing review comments.
Overall changes are fine.
Few doubts and suggestions:
- PR title is incorrect I think. It is supposed to be
HDDS-6646 Intermittent failure .... - Can you please share the workflow run where you run the test 10 times? I found this, but in last iteration test still failed. Also I would suggest to run only
testRenameToTrashEnabledand have a green CI for it. - I think deleteRootDir is not needed here because it is deleted in @After. If you agree, remove this either as part of this PR or may be separate PR.
| // We can safely assert only trash directory here. | ||
| // Asserting Current or checkpoint directory is not feasible here in this | ||
| // test due to independent TrashEmptier thread running in cluster and | ||
| // possible flakyness is hard to avoid unless we test this test case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // possible flakyness is hard to avoid unless we test this test case | |
| // possible flakiness is hard to avoid unless we test this test case |
|
|
||
| // Call moveToTrash. We can't call protected fs.rename() directly | ||
| trash.moveToTrash(path); | ||
| // Added this assertion here and will be tested as part of testTrash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following this comment. What do you mean by will be tested as part of testTrash? This is testTrash test. Is there another testTrash? What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following this comment. What do you mean by
will be tested as part of testTrash? This istestTrashtest. Is there anothertestTrash? What am I missing?
As I mentioned, it will be handled as part of 6545 JIRA, and @sadanand48 will be working on it. Had a discussion on reason of flakiness and need to handle separately in testTrash
@hemantk-12 , we can change PR title, however it's better to keep with JIRA title. I'll re-run the test in multiple runs once again and get green CI. I think we should keep deleteRootDir change as i observed that without this change, CI will not pass. Many other PR are blocked because of this and @after annotation makes it call after every test case including |
|
|
|
Here is the successful green CI from fork run in fork. |
hemantk-12
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch @devmadhuu.
Overall looks good to me.
deleteRootDir is not needed in testListStatusOnLargeDirectory because it is deleted in @After.
I still believe there is no need to call deleteRootDir in testListStatusOnLargeDirectory. Because it is not expected from HDDS-6646, it can be handle as a separate task.
sumitagrawl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
| */ | ||
| @Test | ||
| @Flaky("HDDS-6646") | ||
| public void testRenameToTrashEnabled() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the test removal intended? because earlier commit had fixed the test and description also says so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was the test removal intended? because earlier commit had fixed the test and description also says so.
Yes @sadanand48 , as HDDS-6645 will be taking care of testTrash test case fix, so the only assertion which was different from testRenameToTrashEnabled and testTrash has been moved to testTrash and testRenameToTrashEnabled has been removed as per suggestion also from @sumitagrawl
| // Waiting for double buffer flush before calling listStatus() again | ||
| // seem to have mitigated the flakiness in cleanup(), but at the cost of | ||
| // almost doubling the test run time. M1 154s->283s (all 4 sets of params) | ||
| cluster.getOzoneManager().awaitDoubleBufferFlush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the removal of this line belongs to this PR?
#5244 still looks required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. 5 out of 9 commits merged to master branch after this PR starts to have flaky integration (filesystem) again as a result of this removal. Need to merge #5244 asap.
https://github.com/apache/ozone/actions/runs/6100351514/job/16554482439
https://github.com/apache/ozone/actions/runs/6105400581/job/16568864195
https://github.com/apache/ozone/actions/runs/6107352731/job/16574193577
https://github.com/apache/ozone/actions/runs/6110706683/job/16584346159
https://github.com/apache/ozone/actions/runs/6114248488/job/16595437359
What changes were proposed in this pull request?
This test case was flaky with listStatus issue which got resolved as part of HDDS-9041
Following points taken care as part of this PR:
testTrashas part of HDDS-6645deleteRootDircleanup method has removed awaitDoubleBuffer call though removal was not intended as part of this PR, however failure of assertion indeleteRootDirwas due to the issue in listStatus API which is being fixed as part of this HDDS-9233. listStatus() API is not correctly iterating cache. #5244.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6646
How was this patch tested?
This patch was tested with 10 times job runs with existing test cases.