Skip to content

Conversation

@hemantk-12
Copy link
Contributor

What changes were proposed in this pull request?

Currently we use different locks to provide consistent view of the snapshot cache. Which are causing multiple problem e.g. deadlock, race condition.
In this change, a reentrant lock is used to achieve the synchronization which give the consistency view of the cache. And Guava's Striped is used to distribute the lock so that multiple thread can access the cache at the same time.

What is the link to the Apache JIRA

HDDS-10103

How was this patch tested?

Existing unit and integration tests.

@adoroszlai adoroszlai added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jan 10, 2024
@hemantk-12 hemantk-12 marked this pull request as ready for review January 10, 2024 20:21
@aswinshakil
Copy link
Member

Thanks for fixing this @hemantk-12. There can still be a case where this can result in a deadlock. For example,
At timet0 Thread 1 -> Acquires lock for snap1
At timet0 Thread 2 -> Acquires lock for snap2

Now both the threads are doing the cleanup(), We may end up in a scenario.
At time t1 Thread 1 -> Waiting for lock snap2
At time t1 Thread 2 -> Waiting for lock snap1

Let me know if this is possible.

@hemantk-12
Copy link
Contributor Author

There can still be a case where this can result in a deadlock. For example, At timet0 Thread 1 -> Acquires lock for snap1 At timet0 Thread 2 -> Acquires lock for snap2

Now both the threads are doing the cleanup(), We may end up in a scenario. At time t1 Thread 1 -> Waiting for lock snap2 At time t1 Thread 2 -> Waiting for lock snap1

Currently cleanup is synchronized at class level and should prevent this. Only one thread will be doing clean up at a time.

I think we can move cleanup here out of the lock.

@aswinshakil
Copy link
Member

Even with synchronized, we can still hit this scenario.
Thread 1(holding snap1 lock) -> waiting to get lock synchronized method
Thread 2(holding snap2 lock) -> doing cleanup() waiting for snap1 lock

@hemantk-12
Copy link
Contributor Author

hemantk-12 commented Jan 11, 2024

@aswinshakil actually you are right (Good catch), even after the synchronized cleanup, it is possible the thread is waiting for the lock which is acquired by another thread.

For example,
At timet0 Thread 1 -> Acquires lock for snap1
At timet0 Thread 2 -> Acquires lock for snap2

Now thread one acquires the cleanup() lock and starts the clean up, but it will stuck for snap2 lock because it has not been released so far.
At time t1 Thread 1 -> Waiting for lock snap2

Moving cleanup in get function out of the lock should fix it.

After the change,
At timet0 Thread 1 -> Acquires lock for snap1 -> does dbMap#getOrCompute() -> release the lock
At timet0 Thread 2 -> Acquires lock for snap2 -> does dbMap#getOrCompute() -> release the lock

One of the thread will acquire the cleanup lock because it is synchronized. Let's say, Thread 1 gets it.
At time t1 Thread 1 -> acquire the cleanup lock -> acquire snap1 (and other keys sequentially) lock -> remove the entry if possible -> release snap1 lock -> release clean up lock.

@aswinshakil
Copy link
Member

LGTM + 1. It would be great if someone else also took a look at it. Thanks @hemantk-12 for working on this.

// Soft-limit of the total number of snapshot DB instances allowed to be
// opened on the OM.
private final int cacheSizeLimit;
private final Striped<ReadWriteLock> striped;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a javadoc paragraph above this line to explain the intended usage of this for other maintainers would be appreciated.

this.dbMap = new ConcurrentHashMap<>();
this.pendingEvictionList =
Collections.synchronizedSet(new LinkedHashSet<>());
this.dbMap = new HashMap<>();
Copy link
Contributor

@smengcl smengcl Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could thread visibility be an issue here with plain HashMap since dbMap can be mutated by multiple threads and we are not using synchronize on this particular Map?

https://stackoverflow.com/a/11050613

We likely still need ConcurrentHashMap here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You made a good point. I never heard before that HashMap could lead to infinite loop if multiple thread are updating it.

But using ConcurrentHashMap with locks could lead to the same problem as #5751.

There is another approach to just use the ConcurrentHashMap. I created draft PR: #5986 according to that. My only concern is if it will throw ConcurrentModificationException while iterating over the map when another thread is updating or adding entries to it. I read couple of blogs and seems like it should not be a problem if I go with compute function https://stackoverflow.com/questions/37127285/iterate-over-concurrenthashmap-while-deleting-entries and https://www.digitalocean.com/community/tutorials/concurrenthashmap-in-java

Let me know what you think about this approach.

Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hemantk-12 . Overall looks fine. I have some comments inline.

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hemantk-12 thanks for the patch. I believe there is still an issue with the locking.

@swamirishi swamirishi closed this Jan 11, 2024
@swamirishi
Copy link
Contributor

Closing the pull request in favour of #5986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants