HDDS-7941. Race condition in getFileStatus causes flaky testObjectStoreCreateWithO3fs #5252

devmadhuu · 2023-09-07T04:20:05Z

What changes were proposed in this pull request?

This PR fixes a possible race condition in the getFileStatus call: the keyTable cache can be flushed while the table iterator is being iterated in the method org.apache.hadoop.ozone.om.KeyManagerImpl#createFakeDirIfShould.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7941

How was this patch tested?

This patch was tested with multiple CI job runs to check whether the flaky test failure from this JIRA, caused by the issue described above, still occurs. Here is the green CI run after applying the patch.

devmadhuu · 2023-09-07T04:23:24Z

@sumitagrawl @sadanand48 @smengcl Pls review.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneListStatusHelper.java

hemantk-12 · 2023-09-07T22:43:54Z

I don't understand how this patch solves the issue. To me, nothing is changing: instead of using the cache iterator, you are just getting the value directly from the cache.
As a simple check, I ran the test 25 times, with and without the fix, and it passed every time.

It would be great if you could add more details on how this change fixes it.

devmadhuu · 2023-09-08T07:13:42Z

This test fails in the checkPath method of TestOzoneFSWithObjectStoreCreate.testObjectStoreCreateWithO3fs: even after a key on a given path is deleted, the getFileStatus API still returns a valid file status. After analyzing the code flow of getFileStatus, I found that the isKeyDeleted method is the problem: after the cache iterator on the table is obtained, the cache can be flushed at an unpredictable time, so the key-deletion check can give a wrong result. In this test case we do some rename operations, after which the original set of key paths is sent in a batch. For each key path in turn, the code flow hits createFakeDirIfShould and checks for the existence of the key here:

if (exists
    && cacheKey.startsWith(targetKey)
    && !Objects.equals(cacheKey, targetKey)) {
  LOG.debug("Fake dir {} required for {}", targetKey, cacheKey);
  return createDirectoryKey(cacheValue.getCacheValue(), dirKey);
}

For a given key, control returns from inside this if check. If, after iterating over all keys in the batch, a key is not found in the cache, we don't want to lose the deleted-key entries: after this while loop we iterate over keyTable, and during that iteration we must not rely on the isKeyDeleted method to check the cache, because cache flushes are unpredictable. So we also build a list of deleted keys in the while loop.
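The idea described above can be illustrated with a small standalone sketch. It is not the actual KeyManagerImpl code: the class name, the map-based cache and table view, and the helper methods are all hypothetical stand-ins, and the flush is simulated by clearing the cache. It only shows why delete markers remembered during the cache pass give a stable answer when the cache is flushed before the table pass.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class DeletedKeySnapshotSketch {
  // Hypothetical stand-in for the keyTable cache:
  // an empty Optional plays the role of a delete marker.
  static final Map<String, Optional<String>> cache =
      new ConcurrentSkipListMap<>();

  // Live lookup, similar in spirit to an isKeyDeleted-style check.
  static boolean isDeletedInLiveCache(String key) {
    Optional<String> value = cache.get(key);
    return value != null && !value.isPresent();
  }

  // Pass 1: walk the cache once and remember every delete marker.
  static Set<String> collectDeletedKeys() {
    Set<String> deletedKeys = new HashSet<>();
    for (Map.Entry<String, Optional<String>> e : cache.entrySet()) {
      if (!e.getValue().isPresent()) {
        deletedKeys.add(e.getKey());
      }
    }
    return deletedKeys;
  }

  // Pass 2, racy variant: consults the live cache for each table key.
  static boolean hasChildUsingLiveCache(Map<String, String> tableView,
      String targetKey) {
    for (String key : tableView.keySet()) {
      if (key.startsWith(targetKey) && !key.equals(targetKey)
          && !isDeletedInLiveCache(key)) {
        return true;
      }
    }
    return false;
  }

  // Pass 2, stable variant: consults the set collected in pass 1.
  static boolean hasChildUsingSnapshot(Map<String, String> tableView,
      String targetKey, Set<String> deletedKeys) {
    for (String key : tableView.keySet()) {
      if (key.startsWith(targetKey) && !key.equals(targetKey)
          && !deletedKeys.contains(key)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String targetKey = "dir/";
    // Table view as seen by an iterator opened before the delete was flushed.
    Map<String, String> tableView = new TreeMap<>();
    tableView.put("dir/file", "keyInfo");
    // The cache holds a delete marker for the same key.
    cache.put("dir/file", Optional.empty());

    Set<String> deletedKeys = collectDeletedKeys();
    // Simulated flush: the delete marker is evicted from the cache.
    cache.clear();

    System.out.println("live cache check, fake dir needed: "
        + hasChildUsingLiveCache(tableView, targetKey));
    System.out.println("snapshot check, fake dir needed: "
        + hasChildUsingSnapshot(tableView, targetKey, deletedKeys));
  }
}
```

In this sketch the live-cache variant reports that a fake directory is still needed, because the delete marker is gone by the time it is consulted, while the snapshot variant correctly skips the deleted key.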

Here is the failed CI run before the patch and green CI run after applying this patch.

smengcl · 2023-09-08T22:19:37Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java

+      if (!exists) {
+        deletedKeys.add(cacheKey);
+      }
    }


This could dump a lot of keyTable cache entries into the temporary set deletedKeys, and this method can be called frequently from getFileStatus / getOzoneFileStatus.

Is there a better way to fix this? Since the only caller is already holding a BUCKET_LOCK:

ozone/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java

Line 1127 in e4cbc20

metadataManager.getLock().acquireReadLock(BUCKET_LOCK, volumeName,

It might be fine for now since keyTable cache should only hold a few entries that are not flushed yet.

Thanks @smengcl for your review. Another way to fix this is PR #5093, which uses a table.isExist() API. My only concern is that the cache is always faster, and in #5093 isExist is called until we find the target key, and it is also called inside the getNextKey() call. Please have a look at PR #5093.

@smengcl Kindly advise.

Alright. Let's merge this PR now for a fix. We could improve this further in #5093 .

smengcl

+1 pending CI

devmadhuu · 2023-09-22T04:53:10Z

Thanks @smengcl for review. CI all green.

sumitagrawl

@devmadhuu Thanks for working on this, LGTM +1

…estObjectStoreCreateWithO3fs (apache#5252) (cherry picked from commit 89c76cf) Change-Id: I88c213bdbcbfea4226cd2297b78d68854a5ecd7e

deveshsingh added 13 commits August 24, 2023 18:43

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

2eb1f36

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

56462ad

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

9a06841

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

0cb0131

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

d2563bc

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

0fd54fc

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

4d61435

Merge remote-tracking branch 'origin/master' into HDDS-7941

9826f48

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

b0ce191

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

4056b12

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

782b197

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

23a598a

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

08982cb

devmadhuu mentioned this pull request Sep 7, 2023

HDDS-9233. listStatus() API is not correctly iterating cache. #5244

Merged

hemantk-12 reviewed Sep 7, 2023

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneListStatusHelper.java

HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.

667d570

smengcl changed the title ~~HDDS-7941. Intermittent failure in testObjectStoreCreateWithO3fs.~~ HDDS-7941. Race condition in getFIleStatus causes flaky testObjectStoreCreateWithO3fs Sep 8, 2023

smengcl changed the title ~~HDDS-7941. Race condition in getFIleStatus causes flaky testObjectStoreCreateWithO3fs~~ HDDS-7941. Race condition in getFileStatus causes flaky testObjectStoreCreateWithO3fs Sep 8, 2023

smengcl reviewed Sep 8, 2023

sadanand48 mentioned this pull request Sep 20, 2023

HDDS-6645. Intermittent timeout in TestOzoneFileSystem#testTrash. #5330

Merged

devmadhuu mentioned this pull request Sep 21, 2023

HDDS-8877. Intermittent failure in TestOzoneFileSystem#testListStatusOnKeyNameContainDelimiter #5093

Closed

HDDS-7941. Handled review comments.

0d08eef

smengcl approved these changes Sep 22, 2023

sumitagrawl approved these changes Sep 22, 2023

sumitagrawl merged commit 89c76cf into apache:master Sep 22, 2023
