
Conversation

@smengcl (Contributor) commented Apr 14, 2023

What changes were proposed in this pull request?

This is approach 2 of 2 to fix the issue that bothers SnapDiff: the current LoadingCache behaves like a simple LRU cache, so we have no control over when an OmSnapshot instance is evicted and closed. This can cause the snapshot DB instance to be closed prematurely while SnapDiff is still running in the background, crashing the OM.

For approach 1, which implements a custom SnapshotCache and the whole modified-LRU logic from scratch, see #4567.

This approach 2 replaces the hard limit (.maximumSize()) with .softValues(). This allows the JVM garbage collector to collect cached values once they are no longer strongly referenced, for instance by SnapDiff or Hadoop FS API read operations. The added OmSnapshot#finalize override should then properly close the RocksDB handle.
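
For illustration, a minimal sketch of the kind of Guava cache construction described above, assuming the LoadingCache is built with CacheBuilder; the class, the Snapshot stand-in, and the removal listener are placeholders rather than the actual OmSnapshotManager code:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalCause;
import com.google.common.cache.RemovalListener;

class SoftValueSnapshotCacheSketch {

  // Stand-in for OmSnapshot: any resource-holding value works for the sketch.
  static class Snapshot implements AutoCloseable {
    @Override
    public void close() {
      // The real code would close the snapshot's RocksDB handle here.
    }
  }

  // Observe values that the GC has reclaimed (cause COLLECTED).
  private static final RemovalListener<String, Snapshot> LISTENER =
      notification -> {
        if (notification.getCause() == RemovalCause.COLLECTED) {
          System.out.println("snapshot " + notification.getKey()
              + " was garbage-collected");
        }
      };

  private final LoadingCache<String, Snapshot> cache =
      CacheBuilder.newBuilder()
          // No maximumSize(): nothing is evicted purely by entry count.
          // softValues() lets the GC reclaim a value only once no strong
          // reference (e.g. from a running SnapDiff job) remains.
          .softValues()
          .removalListener(LISTENER)
          .build(new CacheLoader<String, Snapshot>() {
            @Override
            public Snapshot load(String key) {
              return new Snapshot(); // real code opens the snapshot DB here
            }
          });

  Snapshot get(String key) {
    return cache.getUnchecked(key);
  }
}
```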

The ozone.om.snapshot.cache.max.size config effectively becomes a soft limit (the same as in approach 1), with a warning printed in checkForSnapshot() when the cache size exceeds it.
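
A rough sketch of what such a soft-limit warning could look like (the method, class, and message wording are assumptions, not the actual checkForSnapshot() code):

```java
import com.google.common.cache.Cache;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class SnapshotCacheSoftLimitSketch {

  private static final Logger LOG =
      LoggerFactory.getLogger(SnapshotCacheSoftLimitSketch.class);

  private SnapshotCacheSoftLimitSketch() {
  }

  static void warnIfOverSoftLimit(Cache<?, ?> cache, long softLimit) {
    // Cache#size() is approximate, which is good enough for a warning.
    long size = cache.size();
    if (size > softLimit) {
      LOG.warn("Snapshot cache size ({}) exceeds the soft limit ({}) set by "
          + "ozone.om.snapshot.cache.max.size; entries are only reclaimed by "
          + "the GC once they are no longer referenced.", size, softLimit);
    }
  }
}
```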

In a degenerate case where OM JVM heap usage is pushed to the absolute limit, the JVM GC could collect an OmSnapshot in the cache as soon as, for example, a Hadoop FS API read request is served. However, in that case OM performance would tank anyway and the snapshot cache would be the least of its concerns. Note: OmSnapshot itself should not consume much (on-heap) memory; the DB instance could consume a good chunk of off-heap memory, which is not the JVM's concern.

Under normal conditions, the cache would still be able to serve its purpose.

This is not fully tested yet. If it works as expected, this is much cleaner than approach 1; in the worst case, we fall back to approach 1.

We already override finalize() in PipeInputStream and other Managed DB wrappers, so it should be fine (for now) to add one more usage here in OmSnapshot.
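
A minimal sketch of the finalize()-as-safety-net pattern referred to above; the class name and its resource are placeholders, not the actual OmSnapshot implementation:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Placeholder for a cache value that owns an off-heap resource (in the real
// code, the snapshot's RocksDB handle).
class FinalizableSnapshotSketch implements AutoCloseable {

  private final AtomicBoolean closed = new AtomicBoolean(false);

  @Override
  public void close() {
    // Idempotent close, safe to call from both user code and finalize().
    if (closed.compareAndSet(false, true)) {
      // Release the native handle here in the real implementation.
    }
  }

  // Safety net: if the GC collects a soft-referenced cache value that nobody
  // closed explicitly, the resource still gets released. finalize() has been
  // deprecated since Java 9; the PR accepts it here because other Managed DB
  // wrappers in the codebase already rely on the same pattern.
  @SuppressWarnings("deprecation")
  @Override
  protected void finalize() throws Throwable {
    try {
      close();
    } finally {
      super.finalize();
    }
  }
}
```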

cc @GeorgeJahad

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7935

How was this patch tested?

  • All existing tests should pass.
  • Pending SnapDiff test additions that intentionally exceed the cache limit.
    • Possibly new test cases that trigger GC while SnapDiff is still running, to see if it can still finish without crashing the OM (see the sketch after this list).
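
A minimal sketch of the kind of GC test intended, written against a plain Guava cache rather than the real OmSnapshotManager (JUnit 5 is assumed): a value that is still strongly referenced, the way SnapDiff would hold an OmSnapshot, must survive a GC even though the cache only holds it via a soft reference.

```java
import static org.junit.jupiter.api.Assertions.assertSame;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import org.junit.jupiter.api.Test;

class SoftValueCacheGcSketchTest {

  @Test
  void stronglyReferencedValueSurvivesGc() throws Exception {
    LoadingCache<String, Object> cache = CacheBuilder.newBuilder()
        .softValues()
        .build(CacheLoader.from((String key) -> new Object()));

    // Strong reference, analogous to a SnapDiff job holding the OmSnapshot.
    Object inUse = cache.get("snap1");

    // System.gc() is only a hint, and soft references are cleared under
    // memory pressure at the latest; either way the strongly referenced
    // value must not disappear.
    System.gc();
    cache.cleanUp();

    // The entry still resolves to the same, still-usable instance.
    assertSame(inUse, cache.get("snap1"));
  }
}
```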

… long running processes

Change-Id: I1402062eb264e7a4a27014e3cd9f1ba91b6a18bd
@smengcl smengcl added the snapshot (https://issues.apache.org/jira/browse/HDDS-6517) label Apr 14, 2023
smengcl added 6 commits April 20, 2023 15:48
Change-Id: I1e8b244d48e34b33588e0ca7a64ad0dbc25d5123
Conflicts:
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
Change-Id: I653dfc6373862ae39e2e71a77b8998439dcf57b3
Change-Id: I915f770e4864647d324264640b3d8300508df97e
Change-Id: Icb08433e935f14b119eb76e4740038491eea2ac5
@smengcl smengcl requested a review from prashantpogde April 27, 2023 16:48

@hemantk-12 (Contributor) left a comment

Thanks for the patch @smengcl

final String errorMsg = "no longer active";
LambdaTestUtils.intercept(OMException.class, errorMsg,
final String errorMsg1 = "no longer active";
LambdaTestUtils.intercept(FileNotFoundException.class, errorMsg1,

+1 on checking exception.

@smengcl (Contributor, Author) commented Apr 27, 2023

@GeorgeJahad Please kindly take a look at this. We plan to fix the early-close issue at hand with softValues() first, then do the custom reference counting (i.e. #4567).

@prashantpogde (Contributor) left a comment

Thanks for the patch @smengcl. The changes look good to me.

@hemantk-12 (Contributor) left a comment

LGTM.

@GeorgeJahad (Contributor) commented:

lgtm. thanks for taking this off my hands!

@GeorgeJahad (Contributor) commented:

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@smengcl (Contributor, Author) commented May 2, 2023

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@GeorgeJahad AFAIU, unfortunately as soon as maximumSize(limit) is added, limit becomes a hard limit. The cache will start invalidating entries whenever the limit would be exceeded, so we would have the same issue as before.
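
For illustration, a small self-contained sketch of that behavior against Guava directly (placeholder names, not Ozone code): both options can be set together, but size-based eviction still fires with cause SIZE, regardless of whether the evicted value is still strongly referenced by an in-flight operation, so the premature-close problem would remain.

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;

final class HardLimitStillEvictsSketch {

  public static void main(String[] args) {
    RemovalListener<Integer, Object> listener = n ->
        System.out.println("evicted key " + n.getKey()
            + " cause=" + n.getCause());

    LoadingCache<Integer, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(2)   // hard limit: eviction purely by entry count
        .softValues()     // does not prevent size-based eviction
        .removalListener(listener)
        .build(CacheLoader.from((Integer key) -> new Object()));

    Object stillInUse = cache.getUnchecked(1); // strong reference held
    cache.getUnchecked(2);
    cache.getUnchecked(3); // exceeds maximumSize: one entry is evicted with
                           // cause SIZE even if its value is still in use
    cache.cleanUp();       // listener prints the SIZE-based eviction

    // The held reference itself stays valid as a Java object, but the cache
    // has already dropped the entry and no longer coordinates its lifecycle,
    // which is exactly the premature-close problem for OmSnapshot.
    System.out.println("held value still reachable: " + (stillInUse != null));
  }
}
```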

@hemantk-12 (Contributor) commented May 2, 2023

One question though, do you know if it is possible to set both softValues() and maximumSize() on the cache? Would that give us the desired behaviour without reference counting?

@GeorgeJahad AFAIU, unfortunately as soon as maximumSize(limit) is added, limit becomes a hard limit. The cache will start invalidating entries whenever the limit would be exceeded, so we would have the same issue as before.

Yes, that's my understanding too.

@smengcl smengcl merged commit ea4b01b into apache:master May 3, 2023
@smengcl smengcl deleted the HDDS-7935-approach-2 branch May 3, 2023 20:54
