
Conversation

@aswinshakil
Member

What changes were proposed in this pull request?

List status is done as a batch process. In listStatusFSO, it first goes through the cache, and if the cache size is greater than the batch size we return just the cached entries. The problem is that the cache has gaps, and the keys that fall within these gaps are ignored. Because list status continues the listing from the last element of the batch (the cache is sorted), we never iterate over the keys in those gaps again.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6474

How was this patch tested?

This patch was tested on a cluster.

@umamaheswararao
Contributor

@aswinshakil thank you for digging into this tricky issue with a lot of patience. I appreciate the effort.
I think it would be a good idea to add a unit test.
From our offline discussions, the issue exists mostly in the old version; the latest version works.

So, let's add a test in your old-version fork. Then we could use the same test in master, so that even if it breaks later, the test will catch it.

Since this issue was reported against an older version: @Mengqi could you please confirm in which version you noticed this?

@kerneltime
Contributor

@aswinshakil I need to look at the listStatusFSO call in addition to buildFinalStatusList to understand the bug. I will finish the review later today or tomorrow.

fileStatusFinalList.add(fileStatus);
keyInfoList.add(fileStatus.getKeyInfo());
countEntries++;
if (countEntries >= numEntries) {
Contributor

Are the keys in fileStatusFinalList a sum of both the file and directory caches, sorted and then truncated to numEntries? I think the overall code for listStatusFSO needs to be revisited.

Member Author

Yes, they are. Along with that, we also add keys and directories from the DB.

@rakeshadr
Contributor

Thanks a lot @aswinshakil for the continuous effort in analysing and digging out the root cause.

It looks like the OM TableCache is tricky for the FSO and non-FSO list status code paths, which can create inconsistency across batches.

Please go through the following scenario:

Say the below keys exist in OM.

The keys "x/y/z/a1"..."x/y/z/a1023" and "x/y/z/b1"..."x/y/z/b1023" are in the DBTable.
Assume "x/y/z/b1024" exists only in the TableCache.

Now, an Ozone client lists status with keyName = "x/y/z/":

BatchSize = 1024. It will take two iterations to fetch all the values from OM.

Iteration-1:
keyName = "x/y/z/" and startKey = ""

step-1) Seek all the elements from the TableCache, skipping deleted keys.
Now Batch-1 is ["x/y/z/b1024"].
step-2) Seek all the elements from the DBTable.
Now Batch-1 gets ["x/y/z/a1"..."x/y/z/a1023"] from the DBTable.

The FinalList of Batch-1 is the combined values: ["x/y/z/a1"..."x/y/z/a1023", "x/y/z/b1024"].

Iteration-2:
keyName = "x/y/z/" and startKey = "x/y/z/b1024", which is the last element in Batch-1.
The FinalList of Batch-2 is an empty list, because there are no elements after "x/y/z/b1024".

Since "x/y/z/b1"..."x/y/z/b1023" come before the start key "x/y/z/b1024", all these keys are skipped and never fetched from the DBTable.

Proposal: how about using the TableCache only for checking deleted keys, and not adding the cached values into the listing keys?
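
For illustration, here is a minimal self-contained simulation of the skip described above (plain Java with hypothetical, zero-padded key names so string order matches numeric order; this is not the actual listStatusFSO code):

import java.util.TreeSet;

public class ListSkipSimulation {
  public static void main(String[] args) {
    int batchSize = 1024;
    // DBTable holds "a0001".."a1023" and "b0001".."b1023".
    TreeSet<String> dbTable = new TreeSet<>();
    for (int i = 1; i <= 1023; i++) {
      dbTable.add(String.format("a%04d", i));
      dbTable.add(String.format("b%04d", i));
    }
    // TableCache holds only "b1024".
    TreeSet<String> cache = new TreeSet<>();
    cache.add("b1024");

    String startKey = "";
    int listed = 0;
    while (true) {
      TreeSet<String> batch = new TreeSet<>();
      // step-1: take everything from the cache after startKey.
      batch.addAll(cache.tailSet(startKey, false));
      // step-2: fill from the DBTable after startKey, stopping at batchSize.
      for (String key : dbTable.tailSet(startKey, false)) {
        if (batch.size() >= batchSize) {
          break;                     // "b0001".."b1023" are never reached in iteration-1
        }
        batch.add(key);
      }
      if (batch.isEmpty()) {
        break;
      }
      listed += batch.size();
      startKey = batch.last();       // iteration-2 starts after "b1024" and finds nothing
    }
    // Prints "listed 1024 of 2047 keys": "b0001".."b1023" were skipped.
    System.out.println("listed " + listed + " of " + (dbTable.size() + cache.size()) + " keys");
  }
}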

@kaijchen
Member

kaijchen commented May 4, 2022

Thanks @rakeshadr for explaining the problem.

I think we can scan the cache and the table simultaneously and merge the results, providing ordered output.
However, this requires the cache to be a SortedMap.

// input: keyName, startKey, numKeys

cIt = cache.seek(startKey);
tIt = table.seek(startKey);
for (i = 0; i < numKeys; i++) {
    // check that cIt and tIt are still valid; if one is exhausted, take from the other
    if (cIt.key().compareTo(tIt.key()) < 0) {
        it = cIt;
    } else {
        it = tIt;
    }
    result.add(it.key());
    it.next();
}
return result;
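
For illustration, a runnable Java version of this merge over two sorted maps (hypothetical plain-Java structures; Ozone's actual TableCache and Table iterators look different):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

public class MergedListing {

  /** List up to numKeys keys >= startKey by merging a sorted cache and a sorted table. */
  static List<String> list(TreeMap<String, ?> cache, TreeMap<String, ?> table,
                           String startKey, int numKeys) {
    Iterator<String> cIt = cache.tailMap(startKey, true).keySet().iterator();
    Iterator<String> tIt = table.tailMap(startKey, true).keySet().iterator();
    String cKey = cIt.hasNext() ? cIt.next() : null;
    String tKey = tIt.hasNext() ? tIt.next() : null;

    List<String> result = new ArrayList<>();
    while (result.size() < numKeys && (cKey != null || tKey != null)) {
      // Pick the smaller of the two current keys and advance only that iterator.
      if (tKey == null || (cKey != null && cKey.compareTo(tKey) <= 0)) {
        if (cKey.equals(tKey)) {
          // Same key in both: the cache copy wins, skip the table copy.
          tKey = tIt.hasNext() ? tIt.next() : null;
        }
        result.add(cKey);
        cKey = cIt.hasNext() ? cIt.next() : null;
      } else {
        result.add(tKey);
        tKey = tIt.hasNext() ? tIt.next() : null;
      }
    }
    return result;
  }
}

A real merge would also have to skip cache entries that mark deleted keys, which this sketch ignores.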

@umamaheswararao
Contributor

umamaheswararao commented May 4, 2022

@rakeshadr I have a question on the following two points:

Assume, "x/y/z/b1024" only exists in TableCache.
Proposal: How about to use TableCache only for checking the deleted keys and do not add the cached values into the listing keys ?

How can we ignore the cache? From point 1, entries can be present only in the cache and not in the DB table, right? So if we ignore the cache, where would we include x/y/z/b1024 from?

@rakeshadr
Contributor

How can we ignore the cache? From point 1, entries can be present only in the cache and not in the DB table, right? So if we ignore the cache, where would we include x/y/z/b1024 from?

Yes, the latest entry "x/y/z/b1024" won't be included in the final list. The TableCache is a transaction cache, and in the normal case the cached entries are flushed to the DB and then cleaned up. In an ideal situation, flushing happens within milliseconds or even nanoseconds. IMHO, it's OK to not include recent entries rather than induce errors in the system. Listing keys can be eventually consistent for newly added entries.

@rakeshadr
Contributor

I think we can scan the cache and the table simultaneously and merge the results, providing ordered output.
However, this requires the cache to be a SortedMap.

Thanks @kaijchen for the comment. But in which batch should the cached entries be included? In the above example, "x/y/z/b1024" has to be included in the last batch. Assume there are many batches for the listed keys. Here we need to find the index for the cached item and the exact batch, otherwise it may affect the order, doesn't it?

@kaijchen
Member

kaijchen commented May 5, 2022

Here we need to find the index for the cached item and the exact batch, otherwise it may affect the order, doesn't it?

I think it will be fine, as long as we iterate them in sorted order.
In your example, the first batch will be "x/y/z/a1".."x/y/z/a1023", "x/y/z/b1".
Then startKey will be "x/y/z/b1", and the second batch will be "x/y/z/b1".."x/y/z/b1024".

@aswinshakil
Member Author

@kaijchen @umamaheswararao @rakeshadr @kerneltime Thanks everyone for the review. @kaijchen's last comment is correct. Since everything is sorted, it doesn't matter that we also take cached keys that are out of bounds for the batch. As we iterate over the DB, these out-of-bound keys get pushed to the end of the TreeMap. As a result, in the final fileStatusFinalList we truncate these keys because they are out of bounds.
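
For illustration, a rough sketch of that idea (plain Java with hypothetical names; the real code builds fileStatusFinalList from file status objects rather than bare key strings):

import java.util.TreeSet;

public class BatchTruncation {

  /** Merge cache and DB keys after startKey into one sorted set, then keep only numEntries. */
  static TreeSet<String> listBatch(TreeSet<String> cache, TreeSet<String> dbTable,
                                   String startKey, int numEntries) {
    TreeSet<String> merged = new TreeSet<>();
    // Take every cached key after startKey, even ones beyond the batch boundary.
    merged.addAll(cache.tailSet(startKey, false));
    // At most numEntries DB keys can ever end up in one batch, so reading that many is enough.
    int fromDb = 0;
    for (String key : dbTable.tailSet(startKey, false)) {
      if (fromDb >= numEntries) {
        break;
      }
      merged.add(key);
      fromDb++;
    }
    // Out-of-bound keys (including cached ones) sort to the end and are truncated here;
    // they are picked up again in the next batch, whose startKey is the last key returned.
    TreeSet<String> batch = new TreeSet<>();
    for (String key : merged) {
      if (batch.size() >= numEntries) {
        break;
      }
      batch.add(key);
    }
    return batch;
  }
}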

@rakeshadr
Contributor

@kaijchen @umamaheswararao @rakeshadr @kerneltime Thanks everyone for the review. @kaijchen's last comment is correct. Since everything is sorted, it doesn't matter that we also take cached keys that are out of bounds for the batch. As we iterate over the DB, these out-of-bound keys get pushed to the end of the TreeMap. As a result, in the final fileStatusFinalList we truncate these keys because they are out of bounds.

@kaijchen, @aswinshakil +1, agreed with the proposal. I would appreciate it if you could add a test case to cover the example we discussed: listing across batches with cached key paths.

You can probably write the test case in the TestKeyManagerImpl suite and add a cache entry like below:

        omMetadataManager.getOpenKeyTable(BucketLayout)
            .addCacheEntry(new CacheKey<>(ozoneKey),
                new CacheValue<>(Optional.of(omKeyInfo), trxnLogIndex));

for (int i = 0; i < 1300; i++) {
  Path key = new Path(parent, "tempKey" + i);
  ContractTestUtils.touch(getFs(), key);
  // To add keys to the cache
Contributor

@kerneltime @avijayanhwx @rakeshadr @mukul1987 do you know why a simple put is not keeping it in the cache? Only with rename does it keep things in the cache. It would be good to know the difference. Maybe in a separate JIRA if we need more time for investigation.

Contributor

@umamaheswararao @aswinshakil

Only with rename does it keep things in the cache. It would be good to know the difference.

Are you still seeing a table cache entry leaked in the latest code with the rename FSO operation?

Details about the flushing and table cache cleanup logic (hope this helps):

step-1) Assume a user performed o3fs#rename(srcKey, dstKey).

step-2) The call reaches the OM server, which performs the rename metadata update at OM: OMKeyRenameRequest -> validateAndUpdateCache().
Here OM updates the keys and adds them to the TableCache.

step-3) An async thread runs OzoneManagerDoubleBuffer#flushTransactions().
Here it adds/updates the DBTable and does TableCache cleanup by removing the flushed transaction Id.

There are two reasons for seeing an item/key in the TableCache:
case-1) The entry has been added to the TableCache and not yet flushed to the DB. In reality, the lifetime of a table cache entry sits within this pretty small window.
case-2) Bug situation: the TableCache cleanup is not done because of a missing table name in the response logic. This is a bug.
For example, the rename response class should list the DBTable names to be looked at for cleanup on flushing. If a table name is not correctly mentioned, that entry will not be removed and will remain in the TableCache. This is the bug case.

@CleanupTableInfo(cleanupTables = {FILE_TABLE, DIRECTORY_TABLE})
public class OMKeyRenameResponseWithFSO extends OMKeyRenameResponse {

Contributor

Thanks @rakeshadr for the details. I'm not sure I followed 100%, but we need to address one of them, right? The behavior of putting into / evicting from the cache should be consistent between put and rename?

Contributor

@umamaheswararao umamaheswararao May 10, 2022

@aswinshakil Let's provide a little more detail in the comments. It would be great to add Javadoc about what scenario it's testing.

@umamaheswararao
Contributor

umamaheswararao commented May 10, 2022

Thanks @aswinshakil for working on this patch.
I have a scenario to check:
Let's say the cache has elements like 1124 ..... 102, 101, 0.
But they may be loaded into the TreeMap in a different order.... what if only 1124...101 get loaded?
Now we go through the dirTable, which has 1-100. So the tempCache holds 1....1124. We need only 1024, so it should pick 1...1024.

But the startKey for the next iteration is 1024. It did not get a chance to load 0 from the TableCache before, and it will not consider it now, since the startKey (1024) has already advanced past it.

Is this a possible case, or am I missing something here?

@kaijchen
Member

Let's say the cache has elements like 0, 101, 102, ....... 1125.....
But they may be loaded into the TreeMap in a different order.... what if 101....1124 = 1024 get loaded?
Now we go through the dirTable, which has 1-100. So the tempCache holds 1....1124. We need only 1024, so it should pick 1...1024.

In this case, the first iteration should do:

  1. cache.seek(FIRST) = 0
  2. db.seek(FIRST) = 1

Then 0 - 1023 will be the result of the first iteration.

@umamaheswararao
Contributor

Just to make it clear: I had an offline chat with @kaijchen. He was talking with respect to his solution proposal. However, the PR was not loading all the cache elements, so the issue pointed out above seems possible.

Thanks @aswinshakil for correcting it. I'm not sure how much impact loading everything from the cache will have. Just to note, @aswinshakil showed me that non-FSO does load everything into a SortedMap.

I will take a look at the latest patch shortly. I would also request @rakeshadr to check the changes once and comment if you feel loading the whole cache into a TreeMap can have any impact. Thanks.

@aswinshakil aswinshakil requested a review from rakeshadr May 17, 2022 18:18
Contributor

@rakeshadr rakeshadr left a comment

@aswinshakil Thanks for the contribution.

+1 LGTM. Pending CI.

I would appreciate it if you could review the patch #3444, which makes the output sorted.

@aswinshakil
Member Author

The listStatusFSO() implementation has been changed since this patch; that work is tracked in HDDS-6788. I am keeping the test testListStatusFSO() here to capture any inconsistencies in listStatus in the future.

Contributor

@umamaheswararao umamaheswararao left a comment

LGTM

@umamaheswararao umamaheswararao merged commit e021a06 into apache:master Jun 1, 2022
@umamaheswararao
Contributor

The issue originally discussed here was covered and fixed in the new FSO list status implementation under HDDS-6788.

This PR added the test case for the same.

Thanks a lot, Aswin Shakil Balasubramanian, for the great, dedicated investigation of this issue and for figuring out the actual root cause. Appreciated!

errose28 added a commit to errose28/ozone that referenced this pull request Jun 7, 2022
* master: (87 commits)
  HDDS-6686. Do Leadship check before SASL token verification. (apache#3382)
  HDDS-4364: [FSO]List FileStatus : startKey can be a non-existed path (apache#3481)
  HDDS-6091. Add file checksum to OmKeyInfo (apache#3201)
  HDDS-6706. Exposing Volume Information Metrics to the DataNode UI (apache#3478)
  HDDS-6759: Add listblock API in MockDatanodeStorage (apache#3452)
  HDDS-5821 Container cache management for closing RockDB  (apache#3426)
  HDDS-6683. Refactor OM server bucket layout configuration usage (apache#3477)
  HDDS-6824. Revert changes made in proto.lock by HDDS-6768. (apache#3480)
  HDDS-6811. Bucket create message with layout type (apache#3479)
  HDDS-6810. Add a optional flag to trigger listStatus as part of listKeys for FSO buckets. (apache#3461)
  HDDS-6828. Revert RockDB version pending leak fixes (apache#3475)
  HDDS-6764: EC: DN ability to create RECOVERING containers for EC reconstruction. (apache#3458)
  HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (apache#3453)
  HDDS-6823. Intermittent failure in TestOzoneECClient#testExcludeOnDNMixed (apache#3476)
  HDDS-6820. Bucket Layout Post-Finalization Validators for ACL Requests. (apache#3472)
  HDDS-6819. Add LEGACY to AllowedBucketLayouts in CreateBucketHandler (apache#3473)
  HDDS-4859. [FSO]ListKeys: seek all the files/dirs from startKey to keyPrefix (apache#3466)
  HDDS-6705 Add metrics for volume statistics including disk capacity, usage, Reserved (apache#3430)
  HDDS-6474. Add test to cover the FSO bucket list status with beyond batch boundary and cache. (apache#3379). Contributed by aswinshakil
  HDDS-6280. Support Container Balancer HA (apache#3423)
  ...
