Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash inside OverlayImpl loops over ids_ #5071

Merged
merged 8 commits into from
Aug 2, 2024

Conversation

Bronek
Copy link
Collaborator

@Bronek Bronek commented Jul 24, 2024

High Level Overview of Change

Functions OverlayImpl::getActivePeers and OverlayImpl::findPeerByPublicKey may cause a crash when iterating over ids_

Context of Change

Functions OverlayImpl::getActivePeers and OverlayImpl::findPeerByPublicKey use the following buggy pattern to iterate over ids_ :

for (auto& [id, w] : ids_)
{
    if (auto p = w.lock())
    {
        // ...
    } // p destroyed here
}

The problem is that, given specific thread scheduling, when p is destroyed at a marked location, it might also cause the destruction of the PeerImp that it owned since the start of if (auto p = w.lock()) block. This destruction in turn will call OverlayImpl::onPeerDeactivate which will remove the current element from ids_ collection, hence invalidating the iterator to the removed element. This is all happening while p is being destroyed, that is before the for at the start of the loop tries to increment the iterator (that just got invalidated) to reach to the next element. Incrementing the invalidated operator is an invalid operation, in practice it will yield bad data, causing a crash.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Performance (increase or change in throughput and/or latency)
  • Tests (you added tests for code that already exists, or your new feature included in this PR)
  • Documentation update
  • Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)
  • Release

@Bronek Bronek changed the title Fix crash inside OverladImpl loops over ids_ Fix crash inside OverlayImpl loops over ids_ Jul 24, 2024
@Bronek Bronek force-pushed the bugfix/crash_in_OverlayImpl_loops branch 2 times, most recently from f273f4d to 89fd018 Compare July 24, 2024 19:53
Copy link

codecov bot commented Jul 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.3%. Comparing base (e5aa605) to head (c3deeff).
Report is 22 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5071     +/-   ##
=========================================
- Coverage     71.4%   71.3%   -0.0%     
=========================================
  Files          796     796             
  Lines        67073   67075      +2     
  Branches     10866   10881     +15     
=========================================
- Hits         47881   47854     -27     
- Misses       19192   19221     +29     
Files with missing lines Coverage Δ
src/xrpld/overlay/detail/OverlayImpl.cpp 29.8% <100.0%> (+1.1%) ⬆️

... and 8 files with indirect coverage changes

Impacted file tree graph

@Bronek Bronek requested a review from JoelKatz July 25, 2024 11:01
@Bronek Bronek marked this pull request as ready for review July 25, 2024 14:12
@Bronek
Copy link
Collaborator Author

Bronek commented Jul 25, 2024

Here's a smaller reproduction of the bug https://godbolt.org/z/9rGsx9eEP and of the proposed fix https://godbolt.org/z/qPbc9oYM9

@Bronek Bronek requested a review from ximinez July 25, 2024 15:10
@Bronek Bronek force-pushed the bugfix/crash_in_OverlayImpl_loops branch from b7500df to fa59802 Compare July 31, 2024 12:27
@ximinez ximinez self-assigned this Aug 1, 2024
Copy link
Collaborator

@ximinez ximinez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

I did a little bit of a search, and came across a few more places that at a glance could have a non-zero chance of having the same problem, and could probably be quickly and easily fixed with this same pattern. Would they be worth changing in the PR, or another one?

@thejohnfreeman
Copy link
Collaborator

@ximinez those first two are fine because the collection being iterated is held in a local variable, and the destructor of the objects in the collection cannot remove that object from the collection. I don't think the third is in danger, either, but I haven't looked very deeply.

@Bronek
Copy link
Collaborator Author

Bronek commented Aug 2, 2024

@ximinez those first two are fine because the collection being iterated is held in a local variable, and the destructor of the objects in the collection cannot remove that object from the collection. I don't think the third is in danger, either, but I haven't looked very deeply.

Yes, the first two operate on a stack-based collection within the scope of the function, so these are not a problem. The last one is not a problem either, because both collections deltas_ and skipLists_ operate on objects which do not remove themselves from a collection inside their respective destructors :
https://github.com/ximinez/rippled/blob/1e3ab9f76615cf9bb74b7bfeb6d2bd1cd257fe33/src/xrpld/app/ledger/detail/SkipListAcquire.cpp#L48
https://github.com/ximinez/rippled/blob/1e3ab9f76615cf9bb74b7bfeb6d2bd1cd257fe33/src/xrpld/app/ledger/detail/LedgerDeltaAcquire.cpp#L52

Their base class TimeoutCounter doesn't do anything like that either

https://github.com/ximinez/rippled/blob/1e3ab9f76615cf9bb74b7bfeb6d2bd1cd257fe33/src/xrpld/app/ledger/detail/TimeoutCounter.cpp#L118
https://github.com/ximinez/rippled/blob/1e3ab9f76615cf9bb74b7bfeb6d2bd1cd257fe33/src/xrpld/app/ledger/detail/TimeoutCounter.h#L97

@Bronek Bronek added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Aug 2, 2024
@ximinez
Copy link
Collaborator

ximinez commented Aug 2, 2024

The last one is not a problem either, because both collections deltas_ and skipLists_ operate on objects which do not remove themselves from a collection inside their respective destructors

I am thinking a little bit about future-proofing. This seems like a good pattern to use in general just in case the behavior of these objects changes in the future. However, that is definitely beyond the scope of this PR.

@ximinez ximinez merged commit ffc343a into XRPLF:develop Aug 2, 2024
19 of 20 checks passed
vlntb pushed a commit to vlntb/rippled that referenced this pull request Aug 23, 2024
@Bronek Bronek added the Found with Antithesis Bugs identified with the help of Anithesis continuous testing platform label Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Found with Antithesis Bugs identified with the help of Anithesis continuous testing platform Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants