@devmadhuu devmadhuu commented Aug 24, 2023

What changes were proposed in this pull request?

This test case was flaky because of a listStatus issue, which was resolved as part of HDDS-9041.

The following points are addressed in this PR:

  1. testRenameToTrashEnabled: This test has been removed in this PR; its assertions will be handled in testTrash as part of HDDS-6645.
  2. testTrash: This patch improves it, but it is still marked flaky because it can still fail in some cases. This will be handled as part of HDDS-6645.
  3. deleteRootDir: The awaitDoubleBufferFlush call was removed from this cleanup method, although that removal was not intended as part of this PR. The assertion failure in deleteRootDir was caused by an issue in the listStatus API, which is being fixed as part of HDDS-9233 (listStatus() API is not correctly iterating cache, #5244).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6646

How was this patch tested?

This patch was tested by running the job 10 times with the existing test cases.

Contributor

@hemantk-12 hemantk-12 left a comment

Thanks @devmadhuu for fixing it.

Can you please also add how this patch would fix the issue in the PR description?

return;
}
deleteRootRecursively(fileStatuses);
Thread.sleep(500);
Contributor

Can we use GenericTestUtils#waitFor instead of Thread.sleep()?

Contributor Author

> Can we use GenericTestUtils#waitFor instead of Thread.sleep()?

@hemantk-12 thanks for the review. I tried using GenericTestUtils, but in actual runs it always throws TimeoutException, so it does not work in this case.

Contributor

@hemantk-12 hemantk-12 Aug 28, 2023

I'm surprised that Thread.sleep() works fine but GenericTestUtils#waitFor doesn't. Just curious, what lambda was used?

I feel this should work if it is just about cleaning the directory.

    GenericTestUtils.waitFor(() -> {
      try {
        FileStatus[] fileStatus = fs.listStatus(ROOT);
        return fileStatus != null && fileStatus.length == 0;
      } catch (IOException e) {
        // You can decide whether this should return false or throw a RuntimeException.
        return false;
      }
    }, 100, 500);
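
For readers unfamiliar with this style of wait, here is a minimal, self-contained sketch of what a polling wait does compared with a fixed sleep. The `waitFor` helper below is a hypothetical stand-in written only for illustration, not the real GenericTestUtils implementation, and the class name and the simulated cleanup counter are assumptions as well.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class PollingWaitSketch {

  // Hypothetical stand-in for a waitFor-style helper: re-evaluates the check
  // every intervalMs and throws TimeoutException once timeoutMs has elapsed.
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs)
      throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("Condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated cleanup: the "directory" becomes empty on the third poll.
    AtomicInteger remaining = new AtomicInteger(3);
    waitFor(() -> remaining.decrementAndGet() <= 0, 10, 500);
    System.out.println("directory is empty");

    // A condition that never holds ends in TimeoutException, not a silent pass.
    try {
      waitFor(() -> false, 10, 50);
    } catch (TimeoutException e) {
      System.out.println("timed out as expected");
    }
  }
}
```

Unlike a fixed Thread.sleep(500), the loop returns as soon as the condition holds and fails loudly when it never does, which is why a check that can never become true shows up as a TimeoutException.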

Contributor Author

Thanks @hemantk-12 for the review. I have handled it as per your suggestion.

@devmadhuu
Contributor Author

> Thanks @devmadhuu for fixing it.
>
> Can you please also add how this patch would fix the issue in the PR description?

This patch alone does not directly resolve the issue. Another patch, which fixed an issue in the listStatus API, addressed the root cause: this test was also failing intermittently, and only in deleteRootDir. When run multiple times across multiple jobs, it looks like a race condition that occurs when tests run in parallel. After adding the sleep, I can see listStatus returning correct results.

@devmadhuu devmadhuu marked this pull request as ready for review August 26, 2023 11:49
@devmadhuu
Contributor Author

@sumitagrawl Pls review.

Contributor

@sumitagrawl sumitagrawl left a comment

@devmadhuu thanks for working on this. I have a few queries about the changes.

return;
}
deleteRootRecursively(fileStatuses);
Thread.sleep(500);
Contributor

How does sleeping for 500 ms solve the problem?
And what throws TimeoutException, and in which case does it make the test flaky?

Contributor

@hemantk-12 hemantk-12 Aug 28, 2023

+1

I don't get why TimeoutException is added here either. Maybe @devmadhuu forgot to remove it while reverting some change.

Contributor Author

Since GenericTestUtils is now being used, I have not removed TimeoutException. This change is not directly related; however, many other tests are affected by the deleteRootDir cleanup, so it is important for this test as well.

@hemantk-12
Contributor

> > Thanks @devmadhuu for fixing it.
> > Can you please also add how this patch would fix the issue in the PR description?
>
> This patch alone does not directly resolve the issue. Another patch, which fixed an issue in the listStatus API, addressed the root cause: this test was also failing intermittently, and only in deleteRootDir. When run multiple times across multiple jobs, it looks like a race condition that occurs when tests run in parallel. After adding the sleep, I can see listStatus returning correct results.

Can you please add the same in the description?

Thread.sleep(500);
fileStatuses = fs.listStatus(ROOT);
FileStatus[] finalFileStatuses = fileStatuses;
GenericTestUtils.waitFor(
Contributor

You are supposed to fetch fs.listStatus(ROOT); inside the lambda.

#5217 (comment)
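
To illustrate why the fetch belongs inside the lambda, here is a small self-contained sketch. It does not use the real FileSystem API: `ROOT_ENTRIES` and `listStatus()` are hypothetical in-memory stand-ins for the directory and for `fs.listStatus(ROOT)`. A snapshot captured before the lambda never changes, so a wait on it can only time out, while a lambda that re-queries sees the deletion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class StaleSnapshotSketch {

  // Hypothetical in-memory stand-in for the entries under ROOT.
  static final List<String> ROOT_ENTRIES = new ArrayList<>();

  // Hypothetical stand-in for fs.listStatus(ROOT): returns a fresh copy each call.
  static String[] listStatus() {
    return ROOT_ENTRIES.toArray(new String[0]);
  }

  public static void main(String[] args) {
    ROOT_ENTRIES.add("dir1");

    // Captured once, outside the lambda: this array is frozen at one entry.
    String[] snapshot = listStatus();
    BooleanSupplier staleCheck = () -> snapshot.length == 0;

    // Fetched inside the lambda: every evaluation sees the current state.
    BooleanSupplier freshCheck = () -> listStatus().length == 0;

    // The cleanup finishes after the snapshot was taken.
    ROOT_ENTRIES.clear();

    System.out.println(staleCheck.getAsBoolean()); // false, however long we poll
    System.out.println(freshCheck.getAsBoolean()); // true
  }
}
```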

Contributor Author

Thanks @hemantk-12, it's handled as per your suggestion. Kindly re-review.

@devmadhuu devmadhuu marked this pull request as draft August 29, 2023 13:00
@devmadhuu
Contributor Author

@sumitagrawl @hemantk-12 @sadanand48 Kindly re-review.

@devmadhuu devmadhuu marked this pull request as ready for review August 29, 2023 15:56
@sadanand48
Contributor

Thanks @devmadhuu, the patch looks good to me. IIUC, there are 3 tests in question:

  1. testRenameToTrashEnabled: which the current patch fixes.
  2. testTrash: this patch improves it, but it is still marked flaky because it can still fail in some cases. This will be handled as part of HDDS-6645.
  3. testListStatusOnLargeDirectory: fixing deleteRoot by adding a wait should stabilise this test further.

Contributor

@hemantk-12 hemantk-12 left a comment

Thanks for addressing review comments.

Overall changes are fine.

A few doubts and suggestions:

  1. I think the PR title is incorrect. It is supposed to be HDDS-6646 Intermittent failure ....
  2. Can you please share the workflow run where you ran the test 10 times? I found this, but in the last iteration the test still failed. Also, I would suggest running only testRenameToTrashEnabled and getting a green CI for it.
  3. I think deleteRootDir is not needed here because it is deleted in @After. If you agree, remove it either as part of this PR or maybe in a separate PR.

// We can safely assert only trash directory here.
// Asserting Current or checkpoint directory is not feasible here in this
// test due to independent TrashEmptier thread running in cluster and
// possible flakyness is hard to avoid unless we test this test case
Contributor

Suggested change
- // possible flakyness is hard to avoid unless we test this test case
+ // possible flakiness is hard to avoid unless we test this test case


// Call moveToTrash. We can't call protected fs.rename() directly
trash.moveToTrash(path);
// Added this assertion here and will be tested as part of testTrash
Contributor

I'm not following this comment. What do you mean by "will be tested as part of testTrash"? This is the testTrash test. Is there another testTrash? What am I missing?

Contributor Author

@devmadhuu devmadhuu Aug 29, 2023

> I'm not following this comment. What do you mean by will be tested as part of testTrash? This is testTrash test. Is there another testTrash? What am I missing?

As I mentioned, it will be handled as part of HDDS-6645, and @sadanand48 will be working on it. We discussed the reason for the flakiness, and it needs to be handled separately in testTrash.

@devmadhuu
Contributor Author

devmadhuu commented Aug 29, 2023

> Thanks for addressing review comments.
>
> Overall changes are fine.
>
> A few doubts and suggestions:
>
>   1. I think the PR title is incorrect. It is supposed to be HDDS-6646 Intermittent failure ....
>   2. Can you please share the workflow run where you ran the test 10 times? I found this, but in the last iteration the test still failed. Also, I would suggest running only testRenameToTrashEnabled and getting a green CI for it.
>   3. I think deleteRootDir is not needed here because it is deleted in @After. If you agree, remove it either as part of this PR or maybe in a separate PR.

@hemantk-12, we can change the PR title; however, it's better to keep it in line with the JIRA title.

I'll re-run the test multiple times once again and get a green CI.

I think we should keep the deleteRootDir change, as I observed that CI does not pass without it. Many other PRs are blocked because of this, and the @After annotation makes it run after every test case, including testRenameToTrashEnabled.

@devmadhuu devmadhuu marked this pull request as draft August 29, 2023 18:14
@hemantk-12
Contributor

hemantk-12 commented Aug 29, 2023

> @hemantk-12, we can change the PR title; however, it's better to keep it in line with the JIRA title.
>
> I'll re-run the test multiple times once again and get a green CI.
>
> I think we should keep the deleteRootDir change, as I observed that CI does not pass without it. Many other PRs are blocked because of this.

  1. I didn't mean to remove the Jira ID or the Jira title; I only meant the format. I meant HDDS-6646. Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled, not Hdds 6646 - Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled. The PR title is supposed to start with HDDS-JiraID. followed by the PR or Jira title.
  2. I am curious how calling deleteRootDir() solves the issue in the testListStatusOnLargeDirectory test. Did you try removing it and running? Do you have a run where it failed after it was removed?
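
As a rough illustration of the title format described in point 1, the check below accepts titles that start with HDDS-JiraID. followed by a title. The regular expression and the class name are assumptions made for this sketch; they are not taken from any project tooling.

```java
import java.util.regex.Pattern;

public class PrTitleCheckSketch {

  // Assumed pattern: "HDDS-<digits>." then a space and a non-empty title.
  static final Pattern TITLE = Pattern.compile("^HDDS-\\d+\\. \\S.*$");

  static boolean isValidTitle(String title) {
    return TITLE.matcher(title).matches();
  }

  public static void main(String[] args) {
    String suffix = "Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled";
    System.out.println(isValidTitle("HDDS-6646. " + suffix));  // true
    System.out.println(isValidTitle("Hdds 6646 - " + suffix)); // false
    System.out.println(isValidTitle("HDDS-6646 - " + suffix)); // false
  }
}
```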

@devmadhuu devmadhuu changed the title Hdds 6646 - Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled HDDS-6646 - Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled Aug 30, 2023
@devmadhuu devmadhuu changed the title HDDS-6646 - Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled HDDS-6646. Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled Aug 30, 2023
@devmadhuu
Contributor Author

devmadhuu commented Aug 30, 2023

>   1. I didn't mean to remove Jira ID or Jira Title. ... just mean the continuation. I meant HDDS-6646. Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled not Hdds 6646 - Intermittent failure in TestOzoneFileSystem#testRenameToTrashEnabled. PR title is supposed to start with HDDS-JiraID. followed by PR or jIra title
>   2. I am curious how deleteRootDir() calling solves the issue in test testListStatusOnLargeDirectory. Did you try to remove it and run? Do you have a run where it failed after it was removed?

  1. I fixed the PR title. Thanks for pointing it out.
  2. The testListStatusOnLargeDirectory test was failing like any other test because of the deleteRootDir cleanup. No changes were made to it except for deleteRootDir, so maybe we can remove the mention of this test from the PR description. I have updated the PR description. Please have a look.
  3. And here is the link to the successful, green CI run with multiple jobs for only the TestOzoneFileSystem#testRenameToTrashEnabled case.

@devmadhuu
Contributor Author

Here is the successful green CI run from my fork.

Contributor

@hemantk-12 hemantk-12 left a comment

Thanks for the patch @devmadhuu.

Overall looks good to me.

deleteRootDir is not needed in testListStatusOnLargeDirectory because it is deleted in @After.

I still believe there is no need to call deleteRootDir in testListStatusOnLargeDirectory. Since that is outside the scope of HDDS-6646, it can be handled as a separate task.

Contributor

@sumitagrawl sumitagrawl left a comment

LGTM +1

*/
@Test
@Flaky("HDDS-6646")
public void testRenameToTrashEnabled() throws Exception {
Contributor

Was the test removal intended? The earlier commit had fixed the test, and the description also says so.

Contributor Author

> Was the test removal intended? because earlier commit had fixed the test and description also says so.

Yes @sadanand48. Since HDDS-6645 will take care of fixing the testTrash test case, the only assertion that differed between testRenameToTrashEnabled and testTrash has been moved to testTrash, and testRenameToTrashEnabled has been removed, as also suggested by @sumitagrawl.

@sumitagrawl sumitagrawl merged commit 6657d0c into apache:master Sep 6, 2023
// Waiting for double buffer flush before calling listStatus() again
// seem to have mitigated the flakiness in cleanup(), but at the cost of
// almost doubling the test run time. M1 154s->283s (all 4 sets of params)
cluster.getOzoneManager().awaitDoubleBufferFlush();
Contributor

I'm not sure the removal of this line belongs to this PR?

#5244 still looks required.

Contributor Author

Yes @smengcl, #5244 is needed anyway. I think the line got removed while resolving a conflict.
