Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ipld): fix races in the Retriever's doRequest method #815

Merged
merged 4 commits into from
Jun 14, 2022

Conversation

Wondertan
Copy link
Member

@Wondertan Wondertan commented Jun 12, 2022

Mainly this PR fixes two races:

  1. One race happens when in one slice slot, two routines are trying to write the same share. Two because we have two sources for each share: Row and Col. Also, this does not happen on a happy path, and only in case, Retriever tries to fetch as many shares as possible to reconstruct the block, explaining why the race was uncovered with test(share): add Full Node reconstruction tests #702.

The mentioned race is a known one, and we had an incorrect attempt previously to fix it cca832a.

This fix is correct and achieved by keeping a mutex per share. If a share comes from any of two sources first, it locks the slot and prevents a share from another source from being written.

  1. The second race happened after the square was reconstructed. During the reconstruction, some new shares could arrive and wait on the lock. Once reconstructed, they were unlocked and continued to write in an already filled-up square slice causing multiple races with the same root. Basically, the reconstructed square was returned to the user, while Retriever internal was continue to writing. This PR fixes that and ensure that after unlocking writings routines check if reconstruction happened and if so aborts them.

Closes #814

P.S. The complexity of the code in Retriever is starting to grow, and it's hard to test it. We will likely need to refactor the code(I have a brief picture in my head). However, though, this should happen after all the API improvements land in rsmt2d.

@Wondertan Wondertan added the kind:fix Attached to bug-fixing PRs label Jun 12, 2022
@Wondertan Wondertan requested a review from vgonkivs June 12, 2022 09:35
@Wondertan Wondertan self-assigned this Jun 12, 2022
@Wondertan
Copy link
Member Author

Wondertan commented Jun 12, 2022

Oh shoot, I just realized I am using a feature from 1.18 here and we are not updated yet, ahhh.

@codecov-commenter
Copy link

codecov-commenter commented Jun 12, 2022

Codecov Report

Merging #815 (7af60f3) into main (a2dbbbe) will increase coverage by 0.03%.
The diff coverage is 83.72%.

@@            Coverage Diff             @@
##             main     #815      +/-   ##
==========================================
+ Coverage   53.30%   53.33%   +0.03%     
==========================================
  Files         119      119              
  Lines        6833     6858      +25     
==========================================
+ Hits         3642     3658      +16     
- Misses       2821     2826       +5     
- Partials      370      374       +4     
Impacted Files Coverage Δ
ipld/retriever.go 85.36% <83.72%> (+1.19%) ⬆️
fraud/bad_encoding.go 31.11% <0.00%> (-3.34%) ⬇️
das/daser.go 73.68% <0.00%> (-2.64%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a2dbbbe...7af60f3. Read the comment docs.

@Wondertan
Copy link
Member Author

Ok, so this race fixing PR uncovered another race on CI. Nice

@Wondertan
Copy link
Member Author

Ok, so now this PR fixes two independent races. Will update the description

@Wondertan Wondertan changed the title fix(ipld): fix race in the Retriever's doRequest method fix(ipld): fix races in the Retriever's doRequest method Jun 13, 2022
@Wondertan Wondertan force-pushed the hlib/ipld/fix-race-retriever branch from 8d446a4 to 2ad02cf Compare June 13, 2022 13:55
@Wondertan
Copy link
Member Author

Rebased on main with go1.18. Removed temporal semaphore as we have the needed feature from 1.18 now

* During which the square was continued to be written in while used outside of the retrievalSession
* Also, unlock on defer to simplify code a bit
@Wondertan Wondertan force-pushed the hlib/ipld/fix-race-retriever branch from 2ad02cf to 6eb883b Compare June 13, 2022 15:04
Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. The complexity of the code in Retriever is starting to grow, and it's hard to test it. We will likely need to refactor the code(I have a brief picture in my head). However, though, this should happen after all the API improvements land in rsmt2d.

Can you please write that down (also which changes in r2mt2d are necessary and how they related to simplifying the retriever code here).

This is really hard to reason about and I can't really tell if this PR fixes things while introducing a bunch of others.

ipld/retriever.go Outdated Show resolved Hide resolved
@Wondertan
Copy link
Member Author

@liamsi, to reason about the code is better to check out and see the whole code, rather than changes only on GH.

@Wondertan
Copy link
Member Author

I also put a lot of effort into comments. If they don't make sense, pls point that out. Otherwise, I am pretty confident in this code now.

Copy link
Member

@liamsi liamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes with the comments make sense (besides one case I commented on).

ipld/retriever.go Outdated Show resolved Hide resolved
ipld/retriever.go Outdated Show resolved Hide resolved
@Wondertan
Copy link
Member Author

Wondertan commented Jun 13, 2022

Thank you @liamsi!

@Wondertan Wondertan merged commit b1880b6 into main Jun 14, 2022
@Wondertan Wondertan deleted the hlib/ipld/fix-race-retriever branch June 14, 2022 09:53
@Wondertan Wondertan mentioned this pull request Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:fix Attached to bug-fixing PRs
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

ipld: race in Retriever
5 participants