Skip to content

Conversation

@ArafatKhan2198
Copy link
Contributor

What changes were proposed in this pull request?

The testDeletedContainerCount() method of TestContainerStateCounts is considered flaky due to the processing order of the container report. If the pipeline does not exist, then the container is not added until the next container report is processed.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8374

How was this patch tested?

Ran the UT several times to make sure it's not flaky

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for the fix. I'll try repeated runs. In the meantime, please fix checkstyle error.

// then container is not added until the next container report is processed
StorageContainerDatanodeProtocolProtos.SCMHeartbeatRequestProto
heartbeatRequestProto =
StorageContainerDatanodeProtocolProtos.SCMHeartbeatRequestProto.newBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/api/TestContainerStateCounts.java
 406: Line is longer than 80 characters (found 84).

Copy link
Contributor Author

@ArafatKhan2198 ArafatKhan2198 Apr 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out. For some reason the ./checkStyle script in dev tools did not find it out. But I have changed it

@adoroszlai
Copy link
Contributor

@ArafatKhan2198 seems like the test failed in the first run:

Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.517 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.api.TestContainerStateCounts
Error:  org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.testContainerStatsCount  Time elapsed: 37.468 s  <<< ERROR!
java.util.concurrent.TimeoutException: timeout: after 30000 millis
	at org.apache.ozone.test.LambdaTestUtils$GenerateTimeout.evaluate(LambdaTestUtils.java:677)
	at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:154)
	at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:192)
	at org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.waitAndCheckConditionAfterHeartbeat(TestContainerStateCounts.java:413)
	at org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.testContainerStatsCount(TestContainerStateCounts.java:369)

https://github.com/ArafatKhan2198/ozone/actions/runs/4608724963/jobs/8144896789#step:6:2706

@ArafatKhan2198
Copy link
Contributor Author

@ArafatKhan2198 seems like the test failed in the first run:

Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 37.517 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.api.TestContainerStateCounts
Error:  org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.testContainerStatsCount  Time elapsed: 37.468 s  <<< ERROR!
java.util.concurrent.TimeoutException: timeout: after 30000 millis
	at org.apache.ozone.test.LambdaTestUtils$GenerateTimeout.evaluate(LambdaTestUtils.java:677)
	at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:154)
	at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:192)
	at org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.waitAndCheckConditionAfterHeartbeat(TestContainerStateCounts.java:413)
	at org.apache.hadoop.ozone.recon.api.TestContainerStateCounts.testContainerStatsCount(TestContainerStateCounts.java:369)

https://github.com/ArafatKhan2198/ozone/actions/runs/4608724963/jobs/8144896789#step:6:2706

My apologise @adoroszlai. By mistake, I reverted back to the previous value after running the tests successfully. Please run the tests now. I have already run them a total of 20 times, and they have passed.

image

@adoroszlai
Copy link
Contributor

Thanks @ArafatKhan2198 for the update, but the test is still flaky, failed in 14/100 runs:

https://github.com/adoroszlai/hadoop-ozone/actions/runs/4610115793

@adoroszlai adoroszlai marked this pull request as draft April 4, 2023 17:10
@ArafatKhan2198 ArafatKhan2198 marked this pull request as ready for review April 11, 2023 07:10
@adoroszlai
Copy link
Contributor

Thanks @ArafatKhan2198 for updating the patch. Have you run repeated tests in CI? Can you please share the link to the CI run?

@ArafatKhan2198
Copy link
Contributor Author

@adoroszlai adoroszlai requested review from dombizita and smengcl April 11, 2023 15:50
@adoroszlai
Copy link
Contributor

Thanks @ArafatKhan2198 for updating the patch and for repeated runs, the results look good. I'll defer review of the actual fix to folks more familiar with Recon.

@adoroszlai adoroszlai removed their request for review April 11, 2023 15:51
@ArafatKhan2198
Copy link
Contributor Author

@devmadhuu @adoroszlai Can you please take a look

Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 . lgtm

@adoroszlai
Copy link
Contributor

Please try to avoid force-push when updating the PR. Here are some great articles that explain why:

https://developers.mattermost.com/blog/submitting-great-prs/#4-avoid-force-pushing
https://www.freecodecamp.org/news/optimize-pull-requests-for-reviewer-happiness#request-a-review

@adoroszlai
Copy link
Contributor

@ArafatKhan2198 I may be missing something, but the latest commit 3fb721a doesn't seem to be related to the fix.

@ArafatKhan2198
Copy link
Contributor Author

@ArafatKhan2198 I may be missing something, but the latest commit 3fb721a doesn't seem to be related to the fix.

Hi @adoroszlai
In the pull request HDDS-7994, I have added a few more parameters to the JSON response of the clusterStateEndpoint of Recon. One of these parameters is keys pending for deletion. When we decided to expose this information to the UI, there was a suggestion to change the response parameter "deletedKeys" to "keysAwaitingDeletion". This parameter shows the keys in the deleted table waiting to be deleted by the keyDeletionService running in the background. Since HDDS-7994 has already been merged, I did not get a chance to change the response. Therefore, I have decided to incorporate this small name change in this PR.

@adoroszlai
Copy link
Contributor

Therefore, I have decided to incorporate this small name change in this PR.

It would be better to avoid incorporating unrelated changes.

  1. We already had successful CI for this PR, could have been merged. Now we need to trigger another run.
  2. Commit message would be confusing (fix flaky test vs. rename JSON element).
  3. CI needs to run fewer checks if only unit test is changed.

@ArafatKhan2198
Copy link
Contributor Author

Therefore, I have decided to incorporate this small name change in this PR.

It would be better to avoid incorporating unrelated changes.

  1. We already had successful CI for this PR, could have been merged. Now we need to trigger another run.
  2. Commit message would be confusing (fix flaky test vs. rename JSON element).
  3. CI needs to run fewer checks if only unit test is changed.

Understood, I have removed those changes.
Thanks for the feedback

@adoroszlai adoroszlai merged commit bbcb810 into apache:master Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants