
Conversation

@aswinshakil
Member

What changes were proposed in this pull request?

List status is done as a batch process. In listStatusFSO, it first goes through the cache, and if the cache size is greater than the batch size we return just the cached entries. The problem is that the cache has gaps, and the keys that fall within these gaps are ignored. Because list status continues the listing from the last element of the batch (the cache is sorted), we never iterate over the keys in those gaps again.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6474

How was this patch tested?

This patch was tested on a cluster.

@umamaheswararao
Contributor

@aswinshakil thank you for digging into this tricky issue with a lot of patience. I appreciate the effort.
I think it would be a good idea to add a unit test.
From our offline discussions, the issue exists mostly in the old version; the latest version works.

So, let's add a test in your old-version fork. Then we could use the same test in master, so that even if it breaks later, the test will catch it.

Since this issue was reported against an older version: @Mengqi could you please confirm in which version you noticed this?

@kerneltime
Contributor

@aswinshakil I need to look at the listStatusFSO call in addition to buildFinalStatusList to understand the bug. I will finish the review later today or tomorrow.

fileStatusFinalList.add(fileStatus);
keyInfoList.add(fileStatus.getKeyInfo());
countEntries++;
if (countEntries >= numEntries) {
Contributor

Are the keys in fileStatusFinalList a sum of both the file and directory caches, sorted and then truncated to numEntries? I think the overall code for listStatusFSO needs to be revisited.

Member Author

Yes, they are. Along with that, we also add keys and directories from the DB.

@rakeshadr
Contributor

Thanks a lot @aswinshakil for the continuous effort in analysing and digging out the root cause.

It looks like the OM TableCache is tricky for the FSO and non-FSO list status code paths, which can create inconsistency across batches.

Please go through the following scenario:

Say the below keys exist in OM.

The keys "x/y/z/a1"..."x/y/z/a1023" and "x/y/z/b1"..."x/y/z/b1023" are in the DBTable.
Assume "x/y/z/b1024" exists only in the TableCache.

Now, an Ozone client lists status with keyName = "x/y/z/":

BatchSize = 1024. It will take two iterations to fetch all the values from OM.

Iteration-1:
keyName = "x/y/z/" and startKey = ""

step-1) Seek all the elements from the TableCache, skipping deleted keys.
Now Batch-1 is ["x/y/z/b1024"].
step-2) Seek all the elements from the DBTable.
Now Batch-1 gets ["x/y/z/a1"..."x/y/z/a1023"] from the DBTable.

The FinalList of Batch-1 is the combined values: ["x/y/z/a1"..."x/y/z/a1023", "x/y/z/b1024"].

Iteration-2:
keyName = "x/y/z/" and startKey = "x/y/z/b1024", which is the last element in Batch-1.
The FinalList of Batch-2 is an empty list, because there are no elements after "x/y/z/b1024".

Since "x/y/z/b1"..."x/y/z/b1023" come before the start key "x/y/z/b1024", all these keys are skipped and never fetched from the DBTable.

Proposal: how about using the TableCache only for checking deleted keys, and not adding the cached values into the listing keys?
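
For illustration, here is a minimal self-contained simulation of the skip described above (plain Java with hypothetical, zero-padded key names so string order matches numeric order; this is not the actual listStatusFSO code):

import java.util.TreeSet;

public class ListSkipSimulation {
  public static void main(String[] args) {
    int batchSize = 1024;
    // DBTable holds "a0001".."a1023" and "b0001".."b1023".
    TreeSet<String> dbTable = new TreeSet<>();
    for (int i = 1; i <= 1023; i++) {
      dbTable.add(String.format("a%04d", i));
      dbTable.add(String.format("b%04d", i));
    }
    // TableCache holds only "b1024".
    TreeSet<String> cache = new TreeSet<>();
    cache.add("b1024");

    String startKey = "";
    int listed = 0;
    while (true) {
      TreeSet<String> batch = new TreeSet<>();
      // step-1: take everything from the cache after startKey.
      batch.addAll(cache.tailSet(startKey, false));
      // step-2: fill from the DBTable after startKey, stopping at batchSize.
      for (String key : dbTable.tailSet(startKey, false)) {
        if (batch.size() >= batchSize) {
          break;                     // "b0001".."b1023" are never reached in iteration-1
        }
        batch.add(key);
      }
      if (batch.isEmpty()) {
        break;
      }
      listed += batch.size();
      startKey = batch.last();       // iteration-2 starts after "b1024" and finds nothing
    }
    // Prints "listed 1024 of 2047 keys": "b0001".."b1023" were skipped.
    System.out.println("listed " + listed + " of " + (dbTable.size() + cache.size()) + " keys");
  }
}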

@kaijchen
Member

kaijchen commented May 4, 2022

Thanks @rakeshadr for explaining the problem.

I think we can scan the cache and the table simultaneously and merge the results, providing ordered output.
However, this requires the cache to be a SortedMap.

// input: keyName, startKey, numKeys

cIt = cache.seek(startKey);
tIt = table.seek(startKey);
for (i = 0; i < numKeys; i++) {
    // check that cIt and tIt are still valid; if one is exhausted, take from the other
    if (cIt.key().compareTo(tIt.key()) < 0) {
        it = cIt;
    } else {
        it = tIt;
    }
    result.add(it.key());
    it.next();
}
return result;
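
For illustration, a runnable Java version of this merge over two sorted maps (hypothetical plain-Java structures; Ozone's actual TableCache and Table iterators look different):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

public class MergedListing {

  /** List up to numKeys keys >= startKey by merging a sorted cache and a sorted table. */
  static List<String> list(TreeMap<String, ?> cache, TreeMap<String, ?> table,
                           String startKey, int numKeys) {
    Iterator<String> cIt = cache.tailMap(startKey, true).keySet().iterator();
    Iterator<String> tIt = table.tailMap(startKey, true).keySet().iterator();
    String cKey = cIt.hasNext() ? cIt.next() : null;
    String tKey = tIt.hasNext() ? tIt.next() : null;

    List<String> result = new ArrayList<>();
    while (result.size() < numKeys && (cKey != null || tKey != null)) {
      // Pick the smaller of the two current keys and advance only that iterator.
      if (tKey == null || (cKey != null && cKey.compareTo(tKey) <= 0)) {
        if (cKey.equals(tKey)) {
          // Same key in both: the cache copy wins, skip the table copy.
          tKey = tIt.hasNext() ? tIt.next() : null;
        }
        result.add(cKey);
        cKey = cIt.hasNext() ? cIt.next() : null;
      } else {
        result.add(tKey);
        tKey = tIt.hasNext() ? tIt.next() : null;
      }
    }
    return result;
  }
}

A real merge would also have to skip cache entries that mark deleted keys, which this sketch ignores.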

@umamaheswararao
Contributor

umamaheswararao commented May 4, 2022

@rakeshadr I have a question on the following two points:

Assume, "x/y/z/b1024" only exists in TableCache.
Proposal: How about to use TableCache only for checking the deleted keys and do not add the cached values into the listing keys ?

How can we ignore the cache? From point 1, entries can be present only in the cache and not in the DB table, right? So if we ignore the cache, where would we include x/y/z/b1024 from?

@rakeshadr
Contributor

How can we ignore the cache? From point 1, entries can be present only in the cache and not in the DB table, right? So if we ignore the cache, where would we include x/y/z/b1024 from?

Yes, the latest entry "x/y/z/b1024" won't be included in the final list. The TableCache is a transaction cache, and in the normal case the cached entries are flushed to the DB and then cleaned up. In an ideal situation, flushing happens within milliseconds or even nanoseconds. IMHO, it's OK to not include recent entries rather than induce errors in the system. Listing keys can be eventually consistent for newly added entries.

@rakeshadr
Contributor

I think we can scan the cache and the table simultaneously and merge the results, providing ordered output.
However, this requires the cache to be a SortedMap.

Thanks @kaijchen for the comment. But in which batch should the cached entries be included? In the above example, "x/y/z/b1024" has to be included in the last batch. Assume there are many batches for the listed keys. Here we need to find the index for the cached item and the exact batch, otherwise it may affect the order, doesn't it?

@kaijchen
Member

kaijchen commented May 5, 2022

Here we need to find the index for the cached item and the exact batch, otherwise it may affect the order, doesn't it?

I think it will be fine, as long as we iterate them in sorted order.
In your example, the first batch will be "x/y/z/a1".."x/y/z/a1023", "x/y/z/b1".
Then startKey will be "x/y/z/b1", and the second batch will be "x/y/z/b1".."x/y/z/b1024".

@aswinshakil
Member Author

@kaijchen @umamaheswararao @rakeshadr @kerneltime Thanks everyone for the review. @kaijchen's last comment is correct. Since everything is sorted, it doesn't matter that we also take cached keys that are out of bounds for the batch. As we iterate over the DB, these out-of-bound keys get pushed to the end of the TreeMap. As a result, in the final fileStatusFinalList we truncate these keys because they are out of bounds.
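
For illustration, a rough sketch of that idea (plain Java with hypothetical names; the real code builds fileStatusFinalList from file status objects rather than bare key strings):

import java.util.TreeSet;

public class BatchTruncation {

  /** Merge cache and DB keys after startKey into one sorted set, then keep only numEntries. */
  static TreeSet<String> listBatch(TreeSet<String> cache, TreeSet<String> dbTable,
                                   String startKey, int numEntries) {
    TreeSet<String> merged = new TreeSet<>();
    // Take every cached key after startKey, even ones beyond the batch boundary.
    merged.addAll(cache.tailSet(startKey, false));
    // At most numEntries DB keys can ever end up in one batch, so reading that many is enough.
    int fromDb = 0;
    for (String key : dbTable.tailSet(startKey, false)) {
      if (fromDb >= numEntries) {
        break;
      }
      merged.add(key);
      fromDb++;
    }
    // Out-of-bound keys (including cached ones) sort to the end and are truncated here;
    // they are picked up again in the next batch, whose startKey is the last key returned.
    TreeSet<String> batch = new TreeSet<>();
    for (String key : merged) {
      if (batch.size() >= numEntries) {
        break;
      }
      batch.add(key);
    }
    return batch;
  }
}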

@rakeshadr
Contributor

@kaijchen @umamaheswararao @rakeshadr @kerneltime Thanks everyone for the review. @kaijchen's last comment is correct. Since everything is sorted, it doesn't matter that we also take cached keys that are out of bounds for the batch. As we iterate over the DB, these out-of-bound keys get pushed to the end of the TreeMap. As a result, in the final fileStatusFinalList we truncate these keys because they are out of bounds.

@kaijchen, @aswinshakil +1, agreed with the proposal. I would appreciate it if you could add a test case to cover the example we discussed: listing across batches with cached key paths.

You can probably write the test case in the TestKeyManagerImpl suite and add a cache entry like below:

        omMetadataManager.getOpenKeyTable(BucketLayout)
            .addCacheEntry(new CacheKey<>(ozoneKey),
                new CacheValue<>(Optional.of(omKeyInfo), trxnLogIndex));

for (int i = 0; i < 1300; i++) {
  Path key = new Path(parent, "tempKey" + i);
  ContractTestUtils.touch(getFs(), key);
  // To add keys to the cache
Contributor

@kerneltime @avijayanhwx @rakeshadr @mukul1987 do you know why a simple put is not keeping it in the cache? Only with rename does it keep things in the cache. It would be good to know the difference. Maybe in a separate JIRA if we need more time for investigation.

Contributor

@umamaheswararao @aswinshakil

Only with rename does it keep things in the cache. It would be good to know the difference.

Are you still seeing a table cache entry leaked in the latest code with the rename FSO operation?

Details about the flushing and table cache cleanup logic (hope this helps):

step-1) Assume a user performed o3fs#rename(srcKey, dstKey).

step-2) The call reaches the OM server, which performs the rename metadata update at OM: OMKeyRenameRequest -> validateAndUpdateCache().
Here OM updates the keys and adds them to the TableCache.

step-3) An async thread runs OzoneManagerDoubleBuffer#flushTransactions().
Here it adds/updates the DBTable and does TableCache cleanup by removing the flushed transaction Id.

There are two reasons for seeing an item/key in the TableCache:
case-1) The entry has been added to the TableCache and not yet flushed to the DB. In reality, the lifetime of a table cache entry sits within this pretty small window.
case-2) Bug situation: the TableCache cleanup is not done because of a missing table name in the response logic. This is a bug.
For example, the rename response class should list the DBTable names to be looked at for cleanup on flushing. If a table name is not correctly mentioned, that entry will not be removed and will remain in the TableCache. This is the bug case.

@CleanupTableInfo(cleanupTables = {FILE_TABLE, DIRECTORY_TABLE})
public class OMKeyRenameResponseWithFSO extends OMKeyRenameResponse {

Contributor

Thanks @rakeshadr for the details. I'm not sure I followed 100%, but we need to address one of them, right? The behavior of putting into / evicting from the cache should be consistent between put and rename?

Contributor

@umamaheswararao umamaheswararao May 10, 2022

@aswinshakil Let's provide a little more detail in the comments. It would be great to add Javadoc about what scenario it's testing.

@umamaheswararao
Contributor

umamaheswararao commented May 10, 2022

Thanks @aswinshakil for working on this patch.
I have a scenario to check:
Let's say the cache has elements like 1124 ..... 102, 101, 0.
But they may be loaded into the TreeMap in a different order.... what if only 1124...101 get loaded?
Now we go through the dirTable, which has 1-100. So the tempCache holds 1....1124. We need only 1024, so it should pick 1...1024.

But the startKey for the next iteration is 1024. It did not get a chance to load 0 from the TableCache before, and it will not consider it now, since the startKey (1024) has already advanced past it.

Is this a possible case, or am I missing something here?

@kaijchen
Member

Let's say the cache has elements like 0, 101, 102, ....... 1125.....
But they may be loaded into the TreeMap in a different order.... what if 101....1124 = 1024 get loaded?
Now we go through the dirTable, which has 1-100. So the tempCache holds 1....1124. We need only 1024, so it should pick 1...1024.

In this case, the first iteration should do:

  1. cache.seek(FIRST) = 0
  2. db.seek(FIRST) = 1

Then 0 - 1023 will be the result of the first iteration.

@umamaheswararao
Contributor

Just to make it clear: I had an offline chat with @kaijchen. He was talking with respect to his solution proposal. However, the PR was not loading all the cache elements, so the issue pointed out above seems possible.

Thanks @aswinshakil for correcting it. I'm not sure how much impact loading everything from the cache will have. Just to note, @aswinshakil showed me that non-FSO does load everything into a SortedMap.

I will take a look at the latest patch shortly. I would also request @rakeshadr to check the changes once and comment if you feel loading the whole cache into a TreeMap can have any impact. Thanks.

@aswinshakil aswinshakil requested a review from rakeshadr May 17, 2022 18:18
Contributor

@rakeshadr rakeshadr left a comment

@aswinshakil Thanks for the contribution.

+1 LGTM. Pending CI.

I would appreciate it if you could review the patch #3444, which makes the output sorted.

@aswinshakil
Member Author

The listStatusFSO() implementation has been changed since this patch; that work is tracked in HDDS-6788. I am keeping the test testListStatusFSO() here to capture any inconsistencies in listStatus in the future.

Contributor

@umamaheswararao umamaheswararao left a comment

LGTM

@umamaheswararao umamaheswararao merged commit e021a06 into apache:master Jun 1, 2022
@umamaheswararao
Contributor

The issue originally discussed here was covered and fixed in the new FSO list status implementation under HDDS-6788.

This PR added the test case for the same.

Thanks a lot, Aswin Shakil Balasubramanian, for the great, dedicated investigation of this issue and for figuring out the actual root cause. Appreciated!

errose28 added a commit to errose28/ozone that referenced this pull request Jun 7, 2022
* master: (87 commits)
  HDDS-6686. Do Leadship check before SASL token verification. (apache#3382)
  HDDS-4364: [FSO]List FileStatus : startKey can be a non-existed path (apache#3481)
  HDDS-6091. Add file checksum to OmKeyInfo (apache#3201)
  HDDS-6706. Exposing Volume Information Metrics to the DataNode UI (apache#3478)
  HDDS-6759: Add listblock API in MockDatanodeStorage (apache#3452)
  HDDS-5821 Container cache management for closing RockDB  (apache#3426)
  HDDS-6683. Refactor OM server bucket layout configuration usage (apache#3477)
  HDDS-6824. Revert changes made in proto.lock by HDDS-6768. (apache#3480)
  HDDS-6811. Bucket create message with layout type (apache#3479)
  HDDS-6810. Add a optional flag to trigger listStatus as part of listKeys for FSO buckets. (apache#3461)
  HDDS-6828. Revert RockDB version pending leak fixes (apache#3475)
  HDDS-6764: EC: DN ability to create RECOVERING containers for EC reconstruction. (apache#3458)
  HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (apache#3453)
  HDDS-6823. Intermittent failure in TestOzoneECClient#testExcludeOnDNMixed (apache#3476)
  HDDS-6820. Bucket Layout Post-Finalization Validators for ACL Requests. (apache#3472)
  HDDS-6819. Add LEGACY to AllowedBucketLayouts in CreateBucketHandler (apache#3473)
  HDDS-4859. [FSO]ListKeys: seek all the files/dirs from startKey to keyPrefix (apache#3466)
  HDDS-6705 Add metrics for volume statistics including disk capacity, usage, Reserved (apache#3430)
  HDDS-6474. Add test to cover the FSO bucket list status with beyond batch boundary and cache. (apache#3379). Contributed by aswinshakil
  HDDS-6280. Support Container Balancer HA (apache#3423)
  ...
