
Conversation

@smengcl (Contributor) commented Apr 14, 2023

What changes were proposed in this pull request?

This is approach 2 of 2 to fix the issue that bothers SnapDiff: the current LoadingCache behaves like a simple LRU cache, so we have no control over when an OmSnapshot instance is evicted and closed. This can cause the snapshot DB instance to be closed prematurely while SnapDiff is still running in the background, crashing the OM.

For approach 1, which implements a custom SnapshotCache and the whole modified-LRU logic from scratch, see #4567.

This approach 2 replaces the hard limit (.maximumSize()) with .softValues(). This allows the JVM garbage collector to collect cached values once they are no longer strongly referenced, for instance by SnapDiff or Hadoop FS API read operations. The added OmSnapshot#finalize override should then properly close the RocksDB handle.
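
For illustration, a minimal sketch of the kind of Guava cache construction described above, assuming the LoadingCache is built with CacheBuilder; the class, the Snapshot stand-in, and the removal listener are placeholders rather than the actual OmSnapshotManager code:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalCause;
import com.google.common.cache.RemovalListener;

class SoftValueSnapshotCacheSketch {

  // Stand-in for OmSnapshot: any resource-holding value works for the sketch.
  static class Snapshot implements AutoCloseable {
    @Override
    public void close() {
      // The real code would close the snapshot's RocksDB handle here.
    }
  }

  // Observe values that the GC has reclaimed (cause COLLECTED).
  private static final RemovalListener<String, Snapshot> LISTENER =
      notification -> {
        if (notification.getCause() == RemovalCause.COLLECTED) {
          System.out.println("snapshot " + notification.getKey()
              + " was garbage-collected");
        }
      };

  private final LoadingCache<String, Snapshot> cache =
      CacheBuilder.newBuilder()
          // No maximumSize(): nothing is evicted purely by entry count.
          // softValues() lets the GC reclaim a value only once no strong
          // reference (e.g. from a running SnapDiff job) remains.
          .softValues()
          .removalListener(LISTENER)
          .build(new CacheLoader<String, Snapshot>() {
            @Override
            public Snapshot load(String key) {
              return new Snapshot(); // real code opens the snapshot DB here
            }
          });

  Snapshot get(String key) {
    return cache.getUnchecked(key);
  }
}
```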

The ozone.om.snapshot.cache.max.size config effectively becomes a soft limit (the same as in approach 1), with a warning printed in checkForSnapshot() when the cache size exceeds it.
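
A rough sketch of what such a soft-limit warning could look like (the method, class, and message wording are assumptions, not the actual checkForSnapshot() code):

```java
import com.google.common.cache.Cache;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class SnapshotCacheSoftLimitSketch {

  private static final Logger LOG =
      LoggerFactory.getLogger(SnapshotCacheSoftLimitSketch.class);

  private SnapshotCacheSoftLimitSketch() {
  }

  static void warnIfOverSoftLimit(Cache<?, ?> cache, long softLimit) {
    // Cache#size() is approximate, which is good enough for a warning.
    long size = cache.size();
    if (size > softLimit) {
      LOG.warn("Snapshot cache size ({}) exceeds the soft limit ({}) set by "
          + "ozone.om.snapshot.cache.max.size; entries are only reclaimed by "
          + "the GC once they are no longer referenced.", size, softLimit);
    }
  }
}
```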

In a degenerate case where OM JVM heap usage is pushed to the absolute limit, the JVM GC could collect an OmSnapshot in the cache as soon as, for example, a Hadoop FS API read request is served. However, in that case OM performance would tank anyway and the snapshot cache would be the least of its concerns. Note: OmSnapshot itself should not consume much (on-heap) memory; the DB instance could consume a good chunk of off-heap memory, which is not the JVM's concern.

Under normal conditions, the cache would still be able to serve its purpose.

This is not fully tested yet. If it works as expected, this is much cleaner than approach 1; in the worst case, we fall back to approach 1.

We already override finalize() in PipeInputStream and other Managed DB wrappers, so it should be fine (for now) to add one more usage here in OmSnapshot.
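
A minimal sketch of the finalize()-as-safety-net pattern referred to above; the class name and its resource are placeholders, not the actual OmSnapshot implementation:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Placeholder for a cache value that owns an off-heap resource (in the real
// code, the snapshot's RocksDB handle).
class FinalizableSnapshotSketch implements AutoCloseable {

  private final AtomicBoolean closed = new AtomicBoolean(false);

  @Override
  public void close() {
    // Idempotent close, safe to call from both user code and finalize().
    if (closed.compareAndSet(false, true)) {
      // Release the native handle here in the real implementation.
    }
  }

  // Safety net: if the GC collects a soft-referenced cache value that nobody
  // closed explicitly, the resource still gets released. finalize() has been
  // deprecated since Java 9; the PR accepts it here because other Managed DB
  // wrappers in the codebase already rely on the same pattern.
  @SuppressWarnings("deprecation")
  @Override
  protected void finalize() throws Throwable {
    try {
      close();
    } finally {
      super.finalize();
    }
  }
}
```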

cc @GeorgeJahad

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7935

How was this patch tested?

  • All existing tests should pass.
  • Pending SnapDiff test additions that intentionally exceed the cache limit.
    • Possibly new test cases that trigger GC while SnapDiff is still running, to see if it can still finish without crashing the OM (see the sketch after this list).
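
A minimal sketch of the kind of GC test intended, written against a plain Guava cache rather than the real OmSnapshotManager (JUnit 5 is assumed): a value that is still strongly referenced, the way SnapDiff would hold an OmSnapshot, must survive a GC even though the cache only holds it via a soft reference.

```java
import static org.junit.jupiter.api.Assertions.assertSame;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.junit.jupiter.api.Test;

class SoftValueCacheGcSketchTest {

  @Test
  void stronglyReferencedValueSurvivesGc() throws Exception {
    LoadingCache<String, Object> cache = CacheBuilder.newBuilder()
        .softValues()
        .build(CacheLoader.from((String key) -> new Object()));

    // Strong reference, analogous to a SnapDiff job holding the OmSnapshot.
    Object inUse = cache.get("snap1");

    // System.gc() is only a hint, and soft references are cleared under
    // memory pressure at the latest; either way the strongly referenced
    // value must not disappear.
    System.gc();
    cache.cleanUp();

    // The entry still resolves to the same, still-usable instance.
    assertSame(inUse, cache.get("snap1"));
  }
}
```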

… long running processes

Change-Id: I1402062eb264e7a4a27014e3cd9f1ba91b6a18bd
@smengcl smengcl added the snapshot (https://issues.apache.org/jira/browse/HDDS-6517) label Apr 14, 2023
smengcl added 6 commits April 20, 2023 15:48
Change-Id: I1e8b244d48e34b33588e0ca7a64ad0dbc25d5123
Conflicts:
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
Change-Id: I653dfc6373862ae39e2e71a77b8998439dcf57b3
Change-Id: I915f770e4864647d324264640b3d8300508df97e
Change-Id: Icb08433e935f14b119eb76e4740038491eea2ac5
@smengcl smengcl requested a review from prashantpogde April 27, 2023 16:48

@hemantk-12 (Contributor) left a comment

Thanks for the patch @smengcl

final String errorMsg = "no longer active";
LambdaTestUtils.intercept(OMException.class, errorMsg,
final String errorMsg1 = "no longer active";
LambdaTestUtils.intercept(FileNotFoundException.class, errorMsg1,

+1 on checking exception.

@smengcl (Contributor, Author) commented Apr 27, 2023

@GeorgeJahad Please kindly take a look at this. We plan to fix the early-close issue at hand with softValues() first, then do the custom reference counting (i.e. #4567).

@prashantpogde (Contributor) left a comment

Thanks for the patch @smengcl. The changes look good to me.

@hemantk-12 (Contributor) left a comment

LGTM.

@GeorgeJahad (Contributor) commented:

lgtm. thanks for taking this off my hands!

@GeorgeJahad (Contributor) commented:

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@smengcl (Contributor, Author) commented May 2, 2023

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@GeorgeJahad AFAIU, unfortunately as soon as maximumSize(limit) is added, limit becomes a hard limit. The cache will start invalidating entries whenever the limit would be exceeded, so we would have the same issue as before.
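
For illustration, a small self-contained sketch of that behavior against Guava directly (placeholder names, not Ozone code): both options can be set together, but size-based eviction still fires with cause SIZE, regardless of whether the evicted value is still strongly referenced by an in-flight operation, so the premature-close problem would remain.

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;

final class HardLimitStillEvictsSketch {

  public static void main(String[] args) {
    RemovalListener<Integer, Object> listener = n ->
        System.out.println("evicted key " + n.getKey()
            + " cause=" + n.getCause());

    LoadingCache<Integer, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(2)   // hard limit: eviction purely by entry count
        .softValues()     // does not prevent size-based eviction
        .removalListener(listener)
        .build(CacheLoader.from((Integer key) -> new Object()));

    Object stillInUse = cache.getUnchecked(1); // strong reference held
    cache.getUnchecked(2);
    cache.getUnchecked(3); // exceeds maximumSize: one entry is evicted with
                           // cause SIZE even if its value is still in use
    cache.cleanUp();       // listener prints the SIZE-based eviction

    // The held reference itself stays valid as a Java object, but the cache
    // has already dropped the entry and no longer coordinates its lifecycle,
    // which is exactly the premature-close problem for OmSnapshot.
    System.out.println("held value still reachable: " + (stillInUse != null));
  }
}
```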

@hemantk-12 (Contributor) commented May 2, 2023

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@GeorgeJahad AFAIU, unfortunately as soon as maximumSize(limit) is added, limit becomes a hard limit. The cache will start invalidating entries whenever the limit would be exceeded, so we would have the same issue as before.

Yes, that's my understanding too.

@smengcl smengcl merged commit ea4b01b into apache:master May 3, 2023
@smengcl smengcl deleted the HDDS-7935-approach-2 branch May 3, 2023 20:54
