HDDS-6474. Fixed File System Optimized (FSO) bucket list status. #3379
Conversation
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java (resolved review thread)
@aswinshakil thank you for digging into this tricky issue with a lot of patience. I appreciate the effort. So, let's add a test in your old-version fork. Then we could use the same test in master, so that even if it breaks later, the test will catch it. Since this issue was reported against: @Mengqi could you please confirm in which version you noticed this?
@aswinshakil I need to look at the […]
```java
fileStatusFinalList.add(fileStatus);
keyInfoList.add(fileStatus.getKeyInfo());
countEntries++;
if (countEntries >= numEntries) {
```
Are the keys in fileStatusFinalList a sum of both the file and directory caches, sorted and then truncated to numEntries? I think the overall code for listStatusFSO needs to be revisited.
Yes, it is. Along with that, we also add keys and directories from the DB.
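As a rough mental model of "combine, sort, truncate" (the class and method names below are hypothetical, not the actual KeyManagerImpl code), the final batch can be thought of as everything from the caches and the DB poured into one sorted map and then cut off at numEntries:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "combine file + dir entries, sort, truncate".
// None of these names come from KeyManagerImpl; they only model the idea.
class CombineSortTruncateSketch {
  static Map<String, String> listBatch(Map<String, String> cachedFiles,
                                       Map<String, String> cachedDirs,
                                       Map<String, String> dbFiles,
                                       Map<String, String> dbDirs,
                                       int numEntries) {
    // A TreeMap keeps keys sorted and de-duplicates cache vs. DB entries.
    TreeMap<String, String> combined = new TreeMap<>();
    combined.putAll(dbFiles);
    combined.putAll(dbDirs);
    combined.putAll(cachedFiles);   // cache entries shadow DB entries for the same key
    combined.putAll(cachedDirs);

    // Truncate to the requested batch size.
    TreeMap<String, String> batch = new TreeMap<>();
    for (Map.Entry<String, String> e : combined.entrySet()) {
      if (batch.size() >= numEntries) {
        break;
      }
      batch.put(e.getKey(), e.getValue());
    }
    return batch;
  }
}
```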
Thanks a lot @aswinshakil for the continuous effort in analysing and digging out the root cause. It looks like the OM TableCache is tricky for the FSO and non-FSO listing status code paths, and it can create inconsistency across batches. Please go through the following scenario.

Say the keys below exist in OM. These keys […]. Now the Ozone client runs listing status with keyName = […], BatchSize = 1024. It will take two iterations to fetch all the values from OM.

Iteration-1: step-1) Seek all the elements from the TableCache if they are not deleted keys. […] The final list of Batch-1 will be the combined values: [ […] ].

Iteration-2: Since […].

Proposal: how about using the TableCache only for checking deleted keys, and not adding the cached values into the listing keys?
Thanks @rakeshadr for explaining the problem. I think we can scan the cache and the table simultaneously and merge the results, providing ordered output.

```java
// input: keyName, startKey, numKeys
cIt = cache.seek(startKey);
tIt = table.seek(startKey);
for (i = 0; i < numKeys; i++) {
  // check cIt and tIt are valid
  if (cIt.key() < tIt.key()) {
    it = cIt;
  } else {
    it = tIt;
  }
  result.add(it.key());
  it.next();
}
return result;
```
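To make the idea concrete, here is a self-contained Java sketch of the same merge over two sorted views. The TreeMap-based seek and the class/method names are illustrative assumptions, not the actual Ozone TableCache/Table iterator API:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Illustrative merge of a sorted cache view and a sorted table view, in the
// spirit of the pseudocode above. Names and types are assumptions.
final class MergedListing {
  static TreeMap<String, String> list(TreeMap<String, String> cache,
                                      TreeMap<String, String> table,
                                      String startKey, int numKeys) {
    TreeMap<String, String> result = new TreeMap<>();
    // "Seek" both views to startKey.
    Iterator<Map.Entry<String, String>> cIt =
        cache.tailMap(startKey, true).entrySet().iterator();
    Iterator<Map.Entry<String, String>> tIt =
        table.tailMap(startKey, true).entrySet().iterator();
    Map.Entry<String, String> c = cIt.hasNext() ? cIt.next() : null;
    Map.Entry<String, String> t = tIt.hasNext() ? tIt.next() : null;

    while (result.size() < numKeys && (c != null || t != null)) {
      // Pick the smaller key; on a tie the cache entry wins so that the
      // fresher (not yet flushed) value shadows the table value.
      if (t == null || (c != null && c.getKey().compareTo(t.getKey()) <= 0)) {
        if (t != null && c.getKey().equals(t.getKey())) {
          t = tIt.hasNext() ? tIt.next() : null;   // skip the shadowed table entry
        }
        result.put(c.getKey(), c.getValue());
        c = cIt.hasNext() ? cIt.next() : null;
      } else {
        result.put(t.getKey(), t.getValue());
        t = tIt.hasNext() ? tIt.next() : null;
      }
    }
    return result;
  }
}
```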
@rakeshadr I have a question on the following two points:
How can we ignore the cache? From point 1, entries can be present only in the cache and not in the DB table, right? So, if we ignore the cache, from where would we include x/y/z/b1024?
Yes, the latest entry "x/y/z/b1024" won't be included in the final list. The TableCache is a transaction cache, and in the normal case the cached entries will be flushed to the DB and then cleaned up. In an ideal situation, flushing happens in millis or even nanos. IMHO, it's OK to not include recent entries rather than induce errors in the system. Listing keys can be eventually consistent for newly added entries.
Thanks @kaijchen for the comment. But in which batch should the cached entries be included? In the above example, "x/y/z/b1024" has to be included in the last batch. Assume there are many batches for listing the keys. Here we need to find out the index of the cached item and the exact batch, otherwise it may affect the order, isn't it?
I think it will be fine, as long as we iterate them in sorted order.
@kaijchen @umamaheswararao @rakeshadr @kerneltime Thanks everyone for the review. As per @kaijchen's last comment, it is correct. Since everything is sorted, even if we take cached keys that are out of bounds for that batch, as we iterate the DB these out-of-bound keys are pushed to the end of the TreeMap. As a result, in the final […]
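A toy demonstration of why the ordering works out (the key names, counts, and batching loop below are assumptions for illustration, not the patch itself): because the merged TreeMap is sorted, a cache-only key beyond the current batch boundary simply sits after the cut-off and is returned by the batch whose range it falls into.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model: 1300 keys in the DB, plus one key that exists only in the cache.
// Batch size 1024. Merging everything into one TreeMap keeps the cache-only
// key in sorted position, so it lands in whichever batch its key falls into.
public class BatchOrderDemo {
  public static void main(String[] args) {
    TreeMap<String, String> merged = new TreeMap<>();
    for (int i = 0; i < 1300; i++) {
      merged.put(String.format("key%04d", i), "fromDb");
    }
    merged.put("key1299a", "fromCacheOnly");   // not flushed to the DB yet

    int batchSize = 1024;
    String startKey = "";
    int batch = 1;
    while (true) {
      // Resume strictly after the last key returned by the previous batch.
      NavigableMap<String, String> rest = merged.tailMap(startKey, false);
      if (rest.isEmpty()) {
        break;
      }
      int count = Math.min(batchSize, rest.size());
      String lastKey = null;
      int seen = 0;
      for (String k : rest.keySet()) {
        if (++seen > count) {
          break;
        }
        lastKey = k;
      }
      System.out.println("batch " + batch + ": " + count + " keys, last = " + lastKey);
      startKey = lastKey;
      batch++;
    }
    // Prints: batch 1 ends at key1023, batch 2 ends at key1299a, so the
    // cache-only key is listed exactly once, in the right order.
  }
}
```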
@kaijchen, @aswinshakil +1, agreed with the proposal. I'd appreciate it if you could add a test case to cover the example we discussed: listing across batches with cached key paths. You can probably write the test case in the TestKeyManagerImpl suite and add a cache entry like below:
```java
for (int i = 0; i < 1300; i++) {
  Path key = new Path(parent, "tempKey" + i);
  ContractTestUtils.touch(getFs(), key);
  // To add keys to the cache
```
@kerneltime @avijayanhwx @rakeshadr @mukul1987 do you know why a simple put does not keep the key in the cache? Only with rename does it keep things in the cache. It's good to know the difference. Maybe in a separate JIRA if we need more time for investigation.
> Only with rename does it keep things in the cache. It's good to know the difference.

Are you still seeing a table cache entry leaked in the latest code, with the rename FSO operation?
Details about the flushing and table cache cleanup logic (hope this helps):

step-1) Assume the user performed o3fs#rename(srcKey, dstKey).
step-2) The call reaches the OM server, which does the rename metadata update at the OM: OMKeyRenameRequest -> validateAndUpdateCache(). Here the OM updates the keys and adds them into the TableCache.
step-3) An async thread then runs OzoneManagerDoubleBuffer#flushTransactions(). Here it adds/updates the DB table and does the TableCache cleanup by removing the flushed transaction Id.

There are two reasons for seeing an item/key in the TableCache:

case-1) The entry was added to the TableCache and not yet flushed to the DB. In reality the lifetime of a table cache entry is within this pretty small window.
case-2) Bug situation: the cleanup of the TableCache is not done because of a missing table name in the response logic. For example, the rename response class should list the DB table names to be looked at for cleanup on flushing. If a table name is not mentioned correctly, that entry will not be removed and will remain in the TableCache. This is the bug case.

```java
@CleanupTableInfo(cleanupTables = {FILE_TABLE, DIRECTORY_TABLE})
public class OMKeyRenameResponseWithFSO extends OMKeyRenameResponse {
```
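A stripped-down model of the two cases above (all names here are hypothetical, not the actual Ozone classes): in the normal case the flush writes the entry to the DB and evicts it from the cache, while in the bug case the response does not list the table for cleanup, so the cache entry is never evicted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the table-cache lifetime described above; these are
// toy maps, not the Ozone TableCache/Table classes.
class TableCacheLifecycleSketch {
  private final Map<String, String> cache = new HashMap<>(); // unflushed entries
  private final Map<String, String> db = new HashMap<>();    // durable table

  // step-2: validateAndUpdateCache() puts the renamed key into the cache.
  void applyRename(String dstKey, String keyInfo) {
    cache.put(dstKey, keyInfo);
  }

  // step-3: the double-buffer flush. 'cleanupTables' plays the role of the
  // @CleanupTableInfo(cleanupTables = {...}) annotation on the response class.
  void flushTransactions(Set<String> cleanupTables, String thisTable) {
    db.putAll(cache);                  // write pending entries to the DB table
    if (cleanupTables.contains(thisTable)) {
      cache.clear();                   // case-1: normal cleanup after flush
    }
    // case-2 (bug): if 'thisTable' is missing from cleanupTables, the entry
    // stays in the cache even though it has already been flushed to the DB.
  }
}
```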
Thanks @rakeshadr for the details. I'm not sure I followed 100%, but we need to address one of these, right? The behavior of putting entries into, and evicting them from, the cache should be consistent between put and rename?
@aswinshakil Let's provide a little more detail in the comments. It would be great to add a Javadoc explaining what scenario this is testing.
Thanks @aswinshakil for working on this patch. But the startKey for the next iteration is 1024. It did not get a chance to load 0 from the TableCache before, and it will not consider it now since the startKey (1024) has already advanced. Is this a possible case, or am I missing something here?
...egration-test/src/test/java/org/apache/hadoop/fs/ozone/TestRootedOzoneFileSystemWithFSO.java (resolved review thread)
In this case, the first iteration should do: […] Then […]
Just to make it clear: I had an offline chat with @kaijchen. He was talking with respect to his solution proposal. However, the PR was not loading all cache elements, so the issues pointed out above seem possible. Thanks @aswinshakil for correcting it. Not sure how much impact loading everything from the cache will have. Just to note, @aswinshakil showed me that non-FSO does load everything into a sorted map. I will take a look at the latest patch shortly. I would also request @rakeshadr to check the changes once, and please comment if you feel loading the whole cache into a TreeMap can have any impact. Thanks.
rakeshadr left a comment:
@aswinshakil Thanks for the contribution.
+1 LGTM. Pending CI.
I'd appreciate it if you could review patch #3444, which makes the output sorted.
umamaheswararao left a comment:
LGTM
The originally discussed issue is covered and fixed in the new FSO list status implementation in HDDS-6788. This issue added the test case for the same. Thanks a lot Aswin Shakil Balasubramanian for the great, dedicated investigation of this issue and for figuring out the actual root cause. Appreciated!
* master: (87 commits)
  * HDDS-6686. Do Leadship check before SASL token verification. (apache#3382)
  * HDDS-4364: [FSO]List FileStatus : startKey can be a non-existed path (apache#3481)
  * HDDS-6091. Add file checksum to OmKeyInfo (apache#3201)
  * HDDS-6706. Exposing Volume Information Metrics to the DataNode UI (apache#3478)
  * HDDS-6759: Add listblock API in MockDatanodeStorage (apache#3452)
  * HDDS-5821 Container cache management for closing RockDB (apache#3426)
  * HDDS-6683. Refactor OM server bucket layout configuration usage (apache#3477)
  * HDDS-6824. Revert changes made in proto.lock by HDDS-6768. (apache#3480)
  * HDDS-6811. Bucket create message with layout type (apache#3479)
  * HDDS-6810. Add a optional flag to trigger listStatus as part of listKeys for FSO buckets. (apache#3461)
  * HDDS-6828. Revert RockDB version pending leak fixes (apache#3475)
  * HDDS-6764: EC: DN ability to create RECOVERING containers for EC reconstruction. (apache#3458)
  * HDDS-6795: EC: PipelineStateMap#addPipeline should not have precondition checks post db updates (apache#3453)
  * HDDS-6823. Intermittent failure in TestOzoneECClient#testExcludeOnDNMixed (apache#3476)
  * HDDS-6820. Bucket Layout Post-Finalization Validators for ACL Requests. (apache#3472)
  * HDDS-6819. Add LEGACY to AllowedBucketLayouts in CreateBucketHandler (apache#3473)
  * HDDS-4859. [FSO]ListKeys: seek all the files/dirs from startKey to keyPrefix (apache#3466)
  * HDDS-6705 Add metrics for volume statistics including disk capacity, usage, Reserved (apache#3430)
  * HDDS-6474. Add test to cover the FSO bucket list status with beyond batch boundary and cache. (apache#3379). Contributed by aswinshakil
  * HDDS-6280. Support Container Balancer HA (apache#3423)
  * ...
What changes were proposed in this pull request?

List status is done as a batch process. In listStatusFSO, it goes through the cache, and if the cache size is greater than the batch size we return just the cache. The problem is that the cache has gaps, and the keys that fall within those gaps are ignored. List status continues the listing from the last element of the batch (the cache is sorted), so we never iterate over the keys in the gaps again.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-6474

How was this patch tested?

This patch was tested on a cluster.
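As a toy reproduction of the described bug (the key names, sizes, and helper code below are assumptions for illustration, not the real KeyManagerImpl logic), the following sketch shows how serving a batch from a gappy cache and then resuming from the batch's last key permanently skips the DB-only keys inside the gaps:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy reproduction of the gap bug: the cache holds only every other key.
// If a whole batch is answered from the cache, the DB-only keys inside the
// gaps are skipped, and because the next batch starts after the last
// returned key, they are never listed at all.
public class CacheGapBugDemo {
  public static void main(String[] args) {
    TreeMap<String, String> db = new TreeMap<>();
    TreeMap<String, String> cache = new TreeMap<>();
    for (int i = 0; i < 40; i++) {
      String key = String.format("key%02d", i);
      db.put(key, "v");
      if (i % 2 == 0) {
        cache.put(key, "v");
      }
    }

    int batchSize = 10;

    // Buggy listing: batch 1 is served from the cache only.
    List<String> batch1 = new ArrayList<>(cache.keySet()).subList(0, batchSize);
    String lastKey = batch1.get(batch1.size() - 1);          // "key18"

    // Batch 2 resumes from lastKey in the DB, so key01, key03, ..., key17
    // (the DB keys inside the cache's gaps) were never returned.
    List<String> batch2 = new ArrayList<>(db.tailMap(lastKey, false).keySet());

    System.out.println("batch 1: " + batch1);
    System.out.println("batch 2 starts at: " + batch2.get(0)); // "key19"
  }
}
```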