Skip to content

Conversation

@devmadhuu
Copy link
Contributor

What changes were proposed in this pull request?

This PR is to fix flakiness in listStatus API. It was observed that, cleanup() method in TestOzoneFileSystem (in CI integration (filesystem)), which is triggered after EACH test method is incredibly flaky. However when we add a call to wait for double buffer flush, it no longer seems flaky.
Issue was related to a possible race condition where table cache iterator is flushed already and isKeyDeleted check may not work as expected before putting entries in cacheKeyMap in listStatus API.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9233

How was this patch tested?

This patch is tested with multiple job runs with 50 iterations of single test class to repro of flakiness of listStatus API which was flaky when called in deleteRootDir() method called from cleanup method. Here is the green CI run with multiple iterations of test run.

@devmadhuu devmadhuu marked this pull request as ready for review September 4, 2023 10:44
@devmadhuu
Copy link
Contributor Author

@smengcl @sumitagrawl Pls review.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmadhuu Thanks for working over this, have minor comment

}

public static boolean isKeyDeleted(String key, Table keyTable) {
public static boolean isKeyDeleted(String key, Table keyTable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of method is only check in cache, not in table. And as per caller, its checking in cache while iterate the table keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of method is only check in cache, not in table. And as per caller, its checking in cache while iterate the table keys.

I have reverted the change.

@adoroszlai
Copy link
Contributor

Nit: "Investigate ..." sounds like a task, not a code change. Please update the PR title to describe the change itself.

@adoroszlai
Copy link
Contributor

Test failure seems related:

[INFO] Running org.apache.hadoop.ozone.om.TestKeyManagerImpl
Error:  Tests run: 24, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 16.742 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestKeyManagerImpl
Error:  org.apache.hadoop.ozone.om.TestKeyManagerImpl.testListStatusWithDeletedEntriesInCache  Time elapsed: 0.122 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <[key-10, key-12, key-16, key-2, key-22, key-24, key-26, key-28, key-30, key-32, key-34, key-36, key-38, key-4, key-40, key-42, key-44, key-46, key-48, key-50, key-52, key-54, key-56, key-58, key-6, key-62, key-66, key-70, key-72, key-74, key-76, key-78, key-8, key-80, key-82, key-84, key-86, key-88, key-90, key-92, key-94, key-96, key-98]> but was: <[key-10, key-12, key-16, key-2, key-22, key-26, key-30, key-34, key-38, key-40, key-44, key-48, key-52, key-56, key-6, key-62, key-66, key-70, key-74, key-78, key-80, key-84, key-88, key-92, key-96]>
	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
	at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
	at org.apache.hadoop.ozone.om.TestKeyManagerImpl.testListStatusWithDeletedEntriesInCache(TestKeyManagerImpl.java:1180)

https://github.com/devmadhuu/ozone/actions/runs/6107000532/job/16573141510#step:5:3771

@devmadhuu devmadhuu changed the title HDDS-9233. Investigate whether listStatus() is correctly iterating cache. HDDS-9233. listStatus() API is not correctly iterating cache. Sep 8, 2023
@devmadhuu
Copy link
Contributor Author

Nit: "Investigate ..." sounds like a task, not a code change. Please update the PR title to describe the change itself.

Done.

@devmadhuu
Copy link
Contributor Author

Test failure seems related:

[INFO] Running org.apache.hadoop.ozone.om.TestKeyManagerImpl
Error:  Tests run: 24, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 16.742 s <<< FAILURE! - in org.apache.hadoop.ozone.om.TestKeyManagerImpl
Error:  org.apache.hadoop.ozone.om.TestKeyManagerImpl.testListStatusWithDeletedEntriesInCache  Time elapsed: 0.122 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <[key-10, key-12, key-16, key-2, key-22, key-24, key-26, key-28, key-30, key-32, key-34, key-36, key-38, key-4, key-40, key-42, key-44, key-46, key-48, key-50, key-52, key-54, key-56, key-58, key-6, key-62, key-66, key-70, key-72, key-74, key-76, key-78, key-8, key-80, key-82, key-84, key-86, key-88, key-90, key-92, key-94, key-96, key-98]> but was: <[key-10, key-12, key-16, key-2, key-22, key-26, key-30, key-34, key-38, key-40, key-44, key-48, key-52, key-56, key-6, key-62, key-66, key-70, key-74, key-78, key-80, key-84, key-88, key-92, key-96]>
	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
	at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:177)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1141)
	at org.apache.hadoop.ozone.om.TestKeyManagerImpl.testListStatusWithDeletedEntriesInCache(TestKeyManagerImpl.java:1180)

https://github.com/devmadhuu/ozone/actions/runs/6107000532/job/16573141510#step:5:3771

This is fixed.

@sadanand48 sadanand48 changed the title HDDS-9233. listStatus() API is not correctly iterating cache. HDDS-9233. listStatus() API is not correctly iterating cache. Sep 8, 2023
@adoroszlai
Copy link
Contributor

adoroszlai commented Sep 8, 2023

Ran TestOzoneFileSystem 100x with the fix. There were only 3 failures of testListStatusOnKeyNameContainDelimiter, compared to 60% failure rate (of various test cases) on master. So this is definitely a big improvement.

@sadanand48 @sumitagrawl please check if you're OK with the latest patch

@smengcl
Copy link
Contributor

smengcl commented Sep 8, 2023

Ran TestOzoneFileSystem 100x with the fix. There were only 3 failures of testListStatusOnKeyNameContainDelimiter, compared to 60% failure rate (of various test cases) on master. So this is definitely a big improvement.

@sadanand48 @sumitagrawl please check if you're OK with the latest patch

Thanks @adoroszlai . #5093 might address testListStatusOnKeyNameContainDelimiter flakiness.

@devmadhuu Patch lgtm in its current state. I just have some nits regarding the code comments.

@adoroszlai
Copy link
Contributor

Full integration (filesystem) test also passed 10x.

@adoroszlai adoroszlai merged commit f1552bc into apache:master Sep 9, 2023
@adoroszlai
Copy link
Contributor

Thanks @devmadhuu for the patch, @sadanand48, @smengcl, @sumitagrawl for the review. Merged, since this improves test results significantly. Let's address any further comments in a follow-up.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Oct 24, 2023
…he. (apache#5244)

(cherry picked from commit f1552bc)
Change-Id: I07c1ec1d2c4f761af7401b8925bc2b335c4ef251
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants