HDDS-8528. [Snapshot] Custom SnapshotCache implementation to replace LoadingCache #4567
…LoadingCache Change-Id: Id28c09eae42d3ba5e90117bbea37891c725b32da
|
I tend to prefer approach 1. The guava doc has specific warnings about using weak/soft values: https://guava.dev/releases/19.0/api/docs/com/google/common/cache/CacheBuilder.html#weakValues()
"Weak values will be garbage collected once they are weakly reachable. This makes them a poor candidate for caching; consider softValues() instead."
"Warning: in most circumstances it is better to set a per-cache maximum size instead of using soft references. You should only use this method if you are well familiar with the practical consequences of soft references."
So I am hesitant about approach 2.
For me the main problem with approach 1 is that the implementation doesn't yet support try-with-resources. Your concern about encapsulation is valid. As you say, "putting the decrement ref count logic in OmSnapshot#close (so that we can wrap OmSnapshot inside a try-with-resources) would also break the encapsulation thus considered hacky by myself."
Would it be better if, instead of the refCount interface, we had a cacheEntry wrapper class that manages the reference counting and also implements the close? The cache entry constructor would take an omSnapshot as a parameter, and would have a getter method to allow the end user to access the actual omSnapshot methods. So users would always do a try-with-resources{} to get a cache entry and then unwrap that to get the omSnapshot. Or maybe we could make the cache entry implement the IOMetadataReader interface as well, so there would usually be no unwrapping needed. |
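To make the wrapper idea above concrete, here is a minimal sketch, assuming a hypothetical `CacheEntry<T>` class and a stand-in `FakeSnapshot` type (neither is the actual Ozone code, and the method names `acquire()`, `get()` and `getRefCount()` are invented for illustration). The entry owns the reference count and implements `AutoCloseable`, so the decrement lives in the wrapper rather than in the snapshot class.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demo entry point; all class and method names here are hypothetical.
class CacheEntryDemo {
  public static void main(String[] args) {
    CacheEntry<FakeSnapshot> entry = new CacheEntry<>(new FakeSnapshot("snap1"));
    // try-with-resources guarantees close() runs, which decrements the count.
    try (CacheEntry<FakeSnapshot> held = entry.acquire()) {
      System.out.println(held.get().getName() + " refCount=" + held.getRefCount());
    }
    System.out.println("after close refCount=" + entry.getRefCount());
  }
}

// Stand-in for OmSnapshot; it knows nothing about reference counting.
class FakeSnapshot {
  private final String name;

  FakeSnapshot(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }
}

// Wrapper that manages the reference count and implements close().
class CacheEntry<T> implements AutoCloseable {
  private final T value;
  private final AtomicInteger refCount = new AtomicInteger();

  CacheEntry(T value) {
    this.value = value;
  }

  // Increment the count and return this entry for use in try-with-resources.
  CacheEntry<T> acquire() {
    refCount.incrementAndGet();
    return this;
  }

  // Getter that unwraps the underlying object.
  T get() {
    return value;
  }

  int getRefCount() {
    return refCount.get();
  }

  // Decrement only; eviction decisions would belong to the cache.
  @Override
  public void close() {
    refCount.decrementAndGet();
  }
}
```

With this shape the snapshot class stays unaware of the cache, at the cost of one extra unwrapping call at each use site.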
|
I also think some of the snapshot user code should be modified to support try-with-resources. For example, this code: Lines 83 to 91 in ddfc4b7
does the checkForSnapshot() call and then passes a reference to the omSnapshot into the double buffer thread. I think those sorts of calls should be modified just to pass the snapshot key into the double buffer thread and have it get the snapshot from the cache, (with try-with-resources.) The way it is implemented now, the reference is passed between two threads, and try-with-resources requires it all happen in the same thread. |
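A rough sketch of the key-passing idea, assuming a hypothetical `TinySnapshotCache` and a single-threaded executor standing in for the double buffer thread (the key format and all names below are made up for illustration). Only the key crosses the thread boundary; the worker acquires and releases the handle itself, so try-with-resources stays within one thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class KeyHandoffDemo {
  public static void main(String[] args) throws Exception {
    TinySnapshotCache cache = new TinySnapshotCache();
    // Stand-in for the double buffer thread.
    ExecutorService flusher = Executors.newSingleThreadExecutor();
    String key = "/vol1/bucket1/snap1"; // hypothetical snapshot table key

    // Pass the key, not the snapshot reference, to the other thread.
    Future<String> result = flusher.submit(() -> {
      try (TinySnapshotCache.Handle h = cache.get(key)) {
        return h.snapshot() + " refs=" + cache.refCount(key);
      }
    });

    System.out.println(result.get());
    flusher.shutdown();
    System.out.println("refs after=" + cache.refCount(key));
  }
}

// Hypothetical cache that only tracks per-key reference counts.
class TinySnapshotCache {
  private final ConcurrentHashMap<String, AtomicInteger> refs =
      new ConcurrentHashMap<>();

  // Increment the count for the key and hand out a closeable handle.
  Handle get(String key) {
    refs.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    return new Handle(key);
  }

  int refCount(String key) {
    AtomicInteger count = refs.get(key);
    return count == null ? 0 : count.get();
  }

  final class Handle implements AutoCloseable {
    private final String key;

    Handle(String key) {
      this.key = key;
    }

    // Placeholder for access to the real snapshot object.
    String snapshot() {
      return "snapshot@" + key;
    }

    @Override
    public void close() {
      refs.get(key).decrementAndGet();
    }
  }
}
```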
|
Thanks @GeorgeJahad for the comment.
Yes I read the doc before I went through the implementation. Though I should have used
That could work. I had the same idea briefly but I am not too fond of the extra layer either. However, this would preserve the encapsulation. |
Back when @aswinshakil was implementing this in #4486 I asked the same question: why didn't we just pass the snapshot table key to the |
|
Only for the snapshot use case, I am inclined towards approach #2. Reason:
I was thinking to use WeakReference with WeakHashMap which I guess would be similar to |
|
@smengcl I have no experience with "softValues()" but my understanding is that it won't do any reclamation until the heap is completely full. Is that correct?
My fear is that this is not a degenerate case but will rather be common once enough snapshot operations have been performed. |
It does seem to be implemented here, or am I misunderstanding your point? Lines 167 to 174 in ddfc4b7
|
I had the same impression when I first looked at it, but it is not bounded because |
Thanks for the reminder. |
…se `ReferenceCounted<>` to the public and replace every single instance of `OmSnapshot` with `ReferenceCounted<OmSnapshot>` so as they could be auto-closed with try-with-resources.
|
@GeorgeJahad In the latest commit posted above (8b57f1f), I wrapped The plan is to replace every single instance of
I will postpone further work on this until #4642 is merged to avoid stepping on each other's toes. In the meantime, |
|
@smengcl I took a quick look at ReferenceCounted.java and SnapshotCache.java. They look good to me. (The cleanup() method probably requires some synchronization.) I'll take a closer look when the PR comes out of draft state. |
…etadataReader>` so as they could be auto-closed with try-with-resources.
Conflicts: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshot.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMDirectoriesPurgeRequestWithFSO.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyPurgeRequest.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/snapshot/OMSnapshotMoveDeletedKeysResponse.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java
…ature; build failure fixes.
|
This PR is ready for review at this point. The overall structure would not receive any big changes from me without further comments. I will mark the PR ready for review once I resolve the test timeout (caused by strict thread ID checking in |
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
Conflicts: hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshot.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMDirectoriesPurgeRequestWithFSO.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/key/OMKeyPurgeRequest.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/snapshot/OMSnapshotMoveDeletedKeysRequest.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/DirectoryDeletingService.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDeletingService.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotUtils.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshotManager.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/request/key/TestOMKeyPurgeRequestAndResponse.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/service/TestSnapshotDeletingService.java
Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/snapshot/OMSnapshotMoveDeletedKeysRequest.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/response/key/OMKeyPurgeResponse.java hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/request/snapshot/TestOMSnapshotPurgeRequestAndResponse.java hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
|
Hey @smengcl It looks like this comment got marked as resolved just as I was responding to it: #4567 (comment) Please take another look |
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java
    }
  }

  return v;
Returning v here will put it back in the dbMap, but if it gets removed in cleanup() above, I don't think that is what we want.
Good point. Replaced this line with return dbMap.get(k);
Might look hacky, but it works. Tested with a snippet:
import java.util.concurrent.ConcurrentHashMap;

class Untitled {
  public static void main(String[] args) {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    String k1 = "k1";
    map.put(k1, "val1");
    map.compute(k1, (k, v) -> {
      System.out.format("k=%s, v=%s\n", k, v);
      // Remove the entry from inside the remapping function
      map.remove(k);
      // Return the current mapping (null) instead of v
      return map.get(k);
    });
    System.out.format("k=%s, v=%s\n", k1, map.get(k1));
  }
}

Output:
k=k1, v=val1
k=k1, v=null
Never mind. It turns out that, according to its javadoc, it is a bad idea to update other mappings inside map.compute() anyway. In this case it corrupts the map's size count, which was caught by my UT.
Repro snippet:
import java.util.concurrent.ConcurrentHashMap;

class Untitled {
  public static void main(String[] args) {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    String k1 = "k1";
    map.put(k1, "v1");
    map.put("k2", "v2");
    System.out.println("- Map dump 1");
    map.forEach((k, v) -> {
      System.out.format("k=%s, v=%s\n", k, v);
    });
    System.out.println();
    System.out.println("- compute()");
    map.compute(k1, (k, v) -> {
      System.out.format("k=%s, v=%s, size=%d\n", k, v, map.size());
      // Modifying the map from inside the remapping function
      map.remove(k);
      System.out.format("k=%s, v=%s, size=%d\n", k, v, map.size());
      return map.get(k);
    });
    System.out.println();
    // Map size becomes incorrect at this point
    System.out.format("k=%s, v=%s, size=%d\n", k1, map.get(k1), map.size());
    System.out.println("- Map dump 2");
    map.forEach((k, v) -> {
      System.out.format("k=%s, v=%s\n", k, v);
    });
  }
}

Output:
- Map dump 1
k=k1, v=v1
k=k2, v=v2
- compute()
k=k1, v=v1, size=2
k=k1, v=v1, size=1
k=k1, v=null, size=0
- Map dump 2
k=k2, v=v2
You can see that the map reports size=0 while it obviously still has the k2 entry in it.
I have moved cleanup() out of compute() and added additional syncing.
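For illustration only, a minimal sketch of that shape, assuming a hypothetical class with a `dbMap` and a size `limit`; the eviction policy is deliberately simplified (it drops arbitrary other keys rather than checking reference counts) and is not the actual SnapshotCache logic. The remapping function touches only its own key, and `cleanup()` runs afterwards under a lock.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names and eviction policy are assumptions.
class CleanupOutsideCompute {
  private final ConcurrentHashMap<String, String> dbMap =
      new ConcurrentHashMap<>();
  private final int limit;

  CleanupOutsideCompute(int limit) {
    this.limit = limit;
  }

  String get(String key) {
    // The lambda only computes the value for its own key and
    // never calls back into dbMap.
    String value = dbMap.compute(key, (k, v) -> v != null ? v : "loaded:" + k);
    // Eviction happens after compute() has returned.
    cleanup(key);
    return value;
  }

  // synchronized so that two threads do not evict concurrently.
  private synchronized void cleanup(String keep) {
    Iterator<String> it = dbMap.keySet().iterator();
    while (dbMap.size() > limit && it.hasNext()) {
      String k = it.next();
      if (!k.equals(keep)) {
        it.remove(); // removes the mapping through the key-set view
      }
    }
  }

  int size() {
    return dbMap.size();
  }

  boolean contains(String key) {
    return dbMap.containsKey(key);
  }

  public static void main(String[] args) {
    CleanupOutsideCompute cache = new CleanupOutsideCompute(1);
    cache.get("k1");
    cache.get("k2");
    System.out.format("size=%d, has k2=%b%n", cache.size(), cache.contains("k2"));
  }
}
```

Because no map mutation happens inside the lambda, the size counter is only ever adjusted by ordinary top-level operations.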
Conflicts: hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
…the lambda; sync `cleanup()`
Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java
|
That looks like it fixes the issues I found. Thanks! |
|
Thanks @GeorgeJahad and @hemantk-12 for reviewing this! I will merge this shortly. |
What changes were proposed in this pull request?
This is a follow-up to HDDS-7935.
Opening this as a draft for suggestions. @GeorgeJahad @hemantk-12, could you also take a look at this? I'd like to get some opinions and feedback on the overall approach.
The core of this PR is the `SnapshotCache` class and the `ReferenceCounted` interface. One can start from there. `SnapshotCache` is supposed to be thread-safe.
With this approach, every single time an `OmSnapshot` instance is retrieved using `snapshotCache.get()`, a corresponding `snapshotCache.close()` has to be called at the end of its usage (e.g. using try-with-resources). Otherwise, there will be leakages when the reference count is not decremented. I have identified all existing usages of `snapshotCache.get()`, wrapped in another helper method or not, on latest master branch and added the logic there on how they should be properly handled.
In the meantime, I will explore if `CacheBuilder#softValues()` would be a viable alternative (potential approach 2), where we could probably rely on JVM itself to do the "reference counting" on the instances. Done in HDDS-7935.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8528
How was this patch tested?