fix: resolve database deadlock: #4989

Merged
merged 4 commits into from
Apr 18, 2024

Conversation

seelabs
Collaborator

@seelabs seelabs commented Apr 12, 2024

The rotateWithLock function holds a lock while it calls a callback function that's passed in by the caller. This design is fragile and needs to be used very carefully, because any callback that reaches back into the same object can try to take the lock again. In this case, at least one caller passed in a callback that eventually relocks the mutex on the same thread, which is undefined behavior (UB) for a regular mutex; a deadlock was observed. The caller is in SHAMapStoreImpl, and its callback calls clearCaches. clearCaches can in turn call fetchNodeObject, which tries to relock the mutex.
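
The call chain described above can be reproduced in miniature. The following is only a sketch, not the rippled code: `CheckedMutex` and `RotatingStore` are hypothetical names, and the checked wrapper exists solely so that the same-thread relock is reported as an exception instead of triggering the undefined behavior a plain `std::mutex` would.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical wrapper: reports a same-thread relock instead of running
// into the undefined behavior of relocking a std::mutex.
class CheckedMutex
{
    std::mutex mutex_;
    std::atomic<std::thread::id> owner_{std::thread::id{}};

public:
    void lock()
    {
        if (owner_.load() == std::this_thread::get_id())
            throw std::logic_error("mutex relocked by owning thread");
        mutex_.lock();
        owner_.store(std::this_thread::get_id());
    }

    void unlock()
    {
        owner_.store(std::thread::id{});
        mutex_.unlock();
    }
};

// Hypothetical stand-in for the rotating database (not the rippled class).
class RotatingStore
{
    CheckedMutex mutex_;
    int generation_ = 0;

public:
    int fetchNodeObject()
    {
        std::lock_guard<CheckedMutex> lock(mutex_);
        return generation_;
    }

    // Holds the lock while running caller-supplied code.
    void rotateWithLock(std::function<void()> const& callback)
    {
        std::lock_guard<CheckedMutex> lock(mutex_);
        callback();  // may re-enter fetchNodeObject() on this thread
        ++generation_;
    }
};

// Returns true if the callback's nested lock attempt was detected.
bool relockDetected()
{
    RotatingStore store;
    try
    {
        store.rotateWithLock([&store] { store.fetchNodeObject(); });
    }
    catch (std::logic_error const&)
    {
        return true;
    }
    return false;
}
```

Calling `relockDetected()` exercises the same shape as the reported bug: the callback runs under the lock and then asks for the lock again on the same thread.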

This patch resolves the issue by changing the mutex type to a recursive_mutex. Ideally, the code would be rewritten so that it does not hold the mutex during the callback, at which point the mutex could be changed back to a regular mutex.
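
As a rough illustration of the two options mentioned here, the sketch below contrasts them on a toy class. Both class names are hypothetical, and the ordering in the second variant (callback first, then a short locked update) is an assumption made for the example, not a description of how rippled would be restructured.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Variant 1 (the approach this patch takes): keep calling the callback
// under the lock, but use a recursive mutex so a nested lock on the same
// thread is permitted.
class RecursiveStore
{
    std::recursive_mutex mutex_;
    int generation_ = 0;

public:
    int fetchNodeObject()
    {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        return generation_;
    }

    void rotateWithLock(std::function<void()> const& callback)
    {
        std::lock_guard<std::recursive_mutex> lock(mutex_);
        callback();  // nested fetchNodeObject() is now well defined
        ++generation_;
    }
};

// Variant 2 (the "ideal" direction): run the callback with no lock held,
// then take a regular mutex only for the state update.
class UnlockedCallbackStore
{
    std::mutex mutex_;
    int generation_ = 0;

public:
    int fetchNodeObject()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return generation_;
    }

    void rotate(std::function<void()> const& callback)
    {
        callback();  // no lock held, so the callback may lock freely
        std::lock_guard<std::mutex> lock(mutex_);
        ++generation_;
    }
};
```

The second variant narrows the critical section, but it also means other threads can observe the object between the callback and the update, which is part of why such a rewrite is harder than swapping the mutex type.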

Type of Change

  • [x] Bug fix (non-breaking change which fixes an issue)

@codecov-commenter

codecov-commenter commented Apr 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.0%. Comparing base (24a275b) to head (e6e26d4).
Report is 1 commit behind head on develop.

Additional details and impacted files


@@           Coverage Diff           @@
##           develop   #4989   +/-   ##
=======================================
  Coverage     70.9%   71.0%           
=======================================
  Files          796     796           
  Lines        66727   66727           
  Branches     10981   10973    -8     
=======================================
+ Hits         47333   47345   +12     
+ Misses       19394   19382   -12     
Files Coverage Δ
src/ripple/nodestore/impl/DatabaseRotatingImp.h 50.0% <ø> (ø)

... and 3 files with indirect coverage changes

Collaborator

@scottschurr scottschurr left a comment

I wish that the lock handling did not require a recursive_mutex. But we are where we are, and this is a pragmatic solution. 👍

Contributor

@HowardHinnant HowardHinnant left a comment

I do not see a low-risk alternative at this time.

@seelabs seelabs added the Passed label (Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.) Apr 16, 2024
@ximinez ximinez merged commit cd737ad into XRPLF:develop Apr 18, 2024
16 of 17 checks passed
sophiax851 pushed a commit to sophiax851/rippled that referenced this pull request Jun 12, 2024
Co-authored-by: Ed Hennis <[email protected]>
ximinez added a commit to ximinez/rippled that referenced this pull request Aug 7, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Aug 7, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Aug 21, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Sep 6, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Sep 11, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Sep 11, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Sep 25, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Oct 15, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Oct 18, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Oct 31, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Nov 4, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Nov 5, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Nov 8, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Nov 13, 2024
ximinez added a commit to ximinez/rippled that referenced this pull request Nov 27, 2024
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
5 participants