HDDS-7935. [Snapshot] LRU Cache entries may get evicted/closed during long running processes #4568
Conversation
… long running processes Change-Id: I1402062eb264e7a4a27014e3cd9f1ba91b6a18bd
Change-Id: I1e8b244d48e34b33588e0ca7a64ad0dbc25d5123
Conflicts: hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
Change-Id: I653dfc6373862ae39e2e71a77b8998439dcf57b3
Change-Id: I915f770e4864647d324264640b3d8300508df97e
Change-Id: Icb08433e935f14b119eb76e4740038491eea2ac5
hemantk-12
left a comment
Thanks for the patch @smengcl
```java
final String errorMsg = "no longer active";
LambdaTestUtils.intercept(OMException.class, errorMsg,
final String errorMsg1 = "no longer active";
LambdaTestUtils.intercept(FileNotFoundException.class, errorMsg1,
```
+1 on checking exception.
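For context, a minimal sketch of how `LambdaTestUtils.intercept` pins down both the exception class and the message; the call inside the lambda is a hypothetical stand-in for the actual read under test:

```java
// Fails the test unless the lambda throws FileNotFoundException with a message
// containing "no longer active".
LambdaTestUtils.intercept(FileNotFoundException.class, "no longer active",
    () -> readKeyFromDeletedSnapshot());   // hypothetical call under test
```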
@GeorgeJahad Please kindly take a look at this. We plan to fix the early-close issue at hand with …
prashantpogde
left a comment
thanks for the patch @smengcl. The changes look good to me.
hemantk-12
left a comment
LGTM.
lgtm. thanks for taking this off my hands!
One question though, do you know if it is possible to set both `softValues()` and `maximumSize()` on the cache? Would that give us the desired behaviour without reference counting?
Change-Id: I6720da0c4740bf626b9737272e8f35b0621f1cb8
@GeorgeJahad AFAIU, unfortunately as soon as …
Yes, that's my understanding too.
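For reference, setting both on one Guava builder would look roughly like the sketch below (key/value types and the size of 512 are illustrative; the open question above is about the resulting eviction semantics, not the builder API):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class BothSettingsSketch {
  public static void main(String[] args) {
    // Hard cap on entry count plus softly-referenced values that the GC may also
    // reclaim under memory pressure once nothing strong references them.
    Cache<String, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(512)   // illustrative size
        .softValues()
        .build();

    cache.put("snap-1", new Object());
    System.out.println("cached entries: " + cache.size());
  }
}
```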
What changes were proposed in this pull request?
This is approach 2 of 2 that might fix the issue that bothers SnapDiff, where the current `LoadingCache` behaves like a simple LRU cache. We had no control over when an `OmSnapshot` instance can be evicted and closed, which can cause the snapshot DB instance to be closed prematurely while SnapDiff is still running in the background, crashing the OM.

For approach 1, which implements a custom `SnapshotCache` and the whole modified-LRU logic from scratch, see #4567.

This approach 2 replaces the hard limit (`.maximumSize()`) with `.softValues()`. This allows the JVM garbage collector to collect the values once they are no longer strongly referenced, for instance from SnapDiff or Hadoop FS API read operations. The `OmSnapshot#finalize` addition should be able to properly close the RocksDB handle.
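To make the mechanism concrete, here is a small self-contained sketch of the pattern (a Guava cache built with `softValues()` plus a `finalize()` safety net). `SnapshotHandle`, `SoftValueCacheSketch`, and everything inside them are illustrative stand-ins, not the actual `OmSnapshot`/`OmSnapshotManager` code:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Stand-in for OmSnapshot: owns a resource (the snapshot's RocksDB handle in the
// real code) and releases it in finalize() once GC reclaims the soft-referenced value.
class SnapshotHandle implements AutoCloseable {
  private final String snapshotKey;

  SnapshotHandle(String snapshotKey) {
    this.snapshotKey = snapshotKey;
  }

  @Override
  public void close() {
    // Real code would close the snapshot DB handle here.
    System.out.println("Closed snapshot " + snapshotKey);
  }

  @Override
  protected void finalize() throws Throwable {
    try {
      close();   // safety net: runs when GC collects the evicted value
    } finally {
      super.finalize();
    }
  }
}

public class SoftValueCacheSketch {
  public static void main(String[] args) throws Exception {
    // softValues() replaces maximumSize(): a value can only be reclaimed by the GC
    // once nothing (e.g. a long-running SnapDiff job) holds a strong reference to it.
    LoadingCache<String, SnapshotHandle> cache = CacheBuilder.newBuilder()
        .softValues()
        .build(new CacheLoader<String, SnapshotHandle>() {
          @Override
          public SnapshotHandle load(String key) {
            return new SnapshotHandle(key);
          }
        });

    SnapshotHandle inUse = cache.get("snap-1");   // strong ref keeps it alive
    System.out.println("Loaded " + (inUse != null));
  }
}
```

In this model, the handle held by `inUse` cannot be reclaimed while the strong reference exists, which is the property a long-running SnapDiff needs.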
The `ozone.om.snapshot.cache.max.size` effectively becomes a soft limit (the same as approach 1), with a warning printed in `checkForSnapshot()` when the cache size exceeds the soft limit.
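A rough sketch of what that warning-only check could look like; `snapshotCache`, `cacheSoftLimit`, and `LOG` are placeholder names, not the actual fields:

```java
// The soft limit only warns; nothing is force-evicted or closed here, since
// reclaiming values is left to the GC via the soft references.
long cacheSize = snapshotCache.size();
if (cacheSize > cacheSoftLimit) {
  LOG.warn("Snapshot cache size ({}) exceeds the soft limit ({}).",
      cacheSize, cacheSoftLimit);
}
```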
In a degenerate case where OM JVM heap usage is pushed to the absolute limit, JVM GC could collect an `OmSnapshot` in the cache as soon as, e.g., a Hadoop FS API read request is served. However, in this case OM performance would tank anyway and the snapshot cache would be the least of its concerns. Note: `OmSnapshot` should not consume much (on-heap) memory by itself; the DB instance could consume a good chunk of off-heap memory, which is not the JVM's concern.

Under normal conditions, the cache would still be able to serve its purpose.
This is not fully tested yet. It is much cleaner than approach 1 if it works as expected. In the worst case, we fall back to approach 1.
We are also already overriding `finalize()` in `PipeInputStream` and other Managed DB wrappers. So I guess it is fine (for now) to add one more usage here in `OmSnapshot`.

cc @GeorgeJahad
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7935
How was this patch tested?