HDDS-10103. Simplified snapshotCache by using single reentrant lock instead multiple locks #5968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Closed

hemantk-12 wants to merge 7 commits into apache:master from hemantk-12:HDDS-10103

Contributor

hemantk-12 commented Jan 10, 2024

What changes were proposed in this pull request?

Currently we use different locks to provide consistent view of the snapshot cache. Which are causing multiple problem e.g. deadlock, race condition.
In this change, a reentrant lock is used to achieve the synchronization which give the consistency view of the cache. And Guava's Striped is used to distribute the lock so that multiple thread can access the cache at the same time.

What is the link to the Apache JIRA

How was this patch tested?

Existing unit and integration tests.


          Simplified snapshotCache by using single reentrant lock instead multi…

57ffd27

…ple locks

hemantk-12 requested review from aswinshakil, smengcl and swamirishi

January 10, 2024 00:31

hemantk-12 added 3 commits

January 9, 2024 18:05


          Fixed unit test

5daa3a9


          Removed pendingEvictionList and its usage from snapshotCache

369df59


          Removed ReferenceCountedCallback

7b0285a

adoroszlai added the snapshot label

hemantk-12 marked this pull request as ready for review

January 10, 2024 20:21

Member

aswinshakil commented Jan 10, 2024

Thanks for fixing this @hemantk-12. There can still be a case where this can result in a deadlock. For example,
At timet0 Thread 1 -> Acquires lock for snap1
At timet0 Thread 2 -> Acquires lock for snap2

Now both the threads are doing the cleanup(), We may end up in a scenario.
At time t1 Thread 1 -> Waiting for lock snap2
At time t1 Thread 2 -> Waiting for lock snap1

Let me know if this is possible.

Contributor Author

hemantk-12 commented Jan 11, 2024

There can still be a case where this can result in a deadlock. For example, At timet0 Thread 1 -> Acquires lock for snap1 At timet0 Thread 2 -> Acquires lock for snap2

Now both the threads are doing the cleanup(), We may end up in a scenario. At time t1 Thread 1 -> Waiting for lock snap2 At time t1 Thread 2 -> Waiting for lock snap1

Currently cleanup is synchronized at class level and should prevent this. Only one thread will be doing clean up at a time.

I think we can move cleanup here out of the lock.


          Moved out cleanup in get function out of the lock

684fd31

Member

aswinshakil commented Jan 11, 2024

Even with synchronized, we can still hit this scenario.
Thread 1(holding snap1 lock) -> waiting to get lock synchronized method
Thread 2(holding snap2 lock) -> doing cleanup() waiting for snap1 lock

Contributor Author

hemantk-12 commented Jan 11, 2024 •

edited

Loading

@aswinshakil actually you are right (Good catch), even after the synchronized cleanup, it is possible the thread is waiting for the lock which is acquired by another thread.

For example,
At timet0 Thread 1 -> Acquires lock for snap1
At timet0 Thread 2 -> Acquires lock for snap2

Now thread one acquires the cleanup() lock and starts the clean up, but it will stuck for snap2 lock because it has not been released so far.
At time t1 Thread 1 -> Waiting for lock snap2

Moving cleanup in get function out of the lock should fix it.

After the change,
At timet0 Thread 1 -> Acquires lock for snap1 -> does dbMap#getOrCompute() -> release the lock
At timet0 Thread 2 -> Acquires lock for snap2 -> does dbMap#getOrCompute() -> release the lock

One of the thread will acquire the cleanup lock because it is synchronized. Let's say, Thread 1 gets it.
At time t1 Thread 1 -> acquire the cleanup lock -> acquire snap1 (and other keys sequentially) lock -> remove the entry if possible -> release snap1 lock -> release clean up lock.

aswinshakil reviewed

View reviewed changes

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java Show resolved Hide resolved

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java Outdated Show resolved Hide resolved

hemantk-12 added 2 commits

January 10, 2024 16:35


          Moved out cleanup in release function out of the lock

98a771b


          minor fix

dee9b7a

Member

aswinshakil commented Jan 11, 2024

LGTM + 1. It would be great if someone else also took a look at it. Thanks @hemantk-12 for working on this.

smengcl reviewed

View reviewed changes

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

    
                // Soft-limit of the total number of snapshot DB instances allowed to be

                // opened on the OM.

                private final int cacheSizeLimit;

                private final Striped<ReadWriteLock> striped;

Contributor

smengcl Jan 11, 2024

Adding a javadoc paragraph above this line to explain the intended usage of this for other maintainers would be appreciated.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java

    
                  this.dbMap = new ConcurrentHashMap<>();

                  this.pendingEvictionList =

                      Collections.synchronizedSet(new LinkedHashSet<>());

                  this.dbMap = new HashMap<>();

Contributor

smengcl Jan 11, 2024 •

edited

Loading

Could thread visibility be an issue here with plain HashMap since dbMap can be mutated by multiple threads and we are not using synchronize on this particular Map?

https://stackoverflow.com/a/11050613

We likely still need ConcurrentHashMap here.

Contributor Author

hemantk-12 Jan 11, 2024

You made a good point. I never heard before that HashMap could lead to infinite loop if multiple thread are updating it.

But using ConcurrentHashMap with locks could lead to the same problem as #5751.

There is another approach to just use the ConcurrentHashMap. I created draft PR: #5986 according to that. My only concern is if it will throw ConcurrentModificationException while iterating over the map when another thread is updating or adding entries to it. I read couple of blogs and seems like it should not be a problem if I go with compute function https://stackoverflow.com/questions/37127285/iterate-over-concurrenthashmap-while-deleting-entries and https://www.digitalocean.com/community/tutorials/concurrenthashmap-in-java

Let me know what you think about this approach.

smengcl reviewed

View reviewed changes

Contributor

smengcl left a comment

Thanks @hemantk-12 . Overall looks fine. I have some comments inline.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java Show resolved Hide resolved

hemantk-12 mentioned this pull request

HDDS-10103. Simplified snapshotCache with just one ConcurrentHashMap. #5986

Merged

swamirishi requested changes

View reviewed changes

Contributor

swamirishi left a comment •

edited

Loading

@hemantk-12 thanks for the patch. I believe there is still an issue with the locking.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java Show resolved Hide resolved

swamirishi closed this

Contributor

swamirishi commented Jan 11, 2024

Closing the pull request in favour of #5986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels