shutdown notifications engine when closing a bitswap session #4658
very strange test failure...
0b80b2c to 5833ecc
not strange, I just wasn't paying attention.
@Stebalien I think my second commit is the right fix for the wantlist thing, but the test doesn't actually reproduce the issue. I think it's because there are a couple of different places the wantlist is accounted for, and bitswap.Wantlist is not the same as the wantmanager where the sessions track their wants.
cs, _ := cid.Cast([]byte(c))
live = append(live, cs)
}
bs.CancelWants(live, s.id)
What happens if another session wants a cid which is cancelled here?
Wants are tracked per session (note that we pass in the session ID)
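To make the per-session point concrete, here is a minimal sketch of want tracking keyed by session ID. It is not the bitswap wantmanager code: `wantTracker`, `Want`, `Cancel`, and `Wanted` are hypothetical names, and plain strings stand in for CIDs. The idea is only that a cancel from one session removes that session's interest, and a key is dropped entirely once no session wants it.

```go
package main

import "sync"

// wantTracker is a hypothetical stand-in for a want manager. It records
// which session IDs want each key, so cancelling for one session leaves
// the interest of other sessions intact.
type wantTracker struct {
	mu    sync.Mutex
	wants map[string]map[uint64]struct{} // key -> set of session IDs
}

func newWantTracker() *wantTracker {
	return &wantTracker{wants: make(map[string]map[uint64]struct{})}
}

// Want records that session ses wants key.
func (w *wantTracker) Want(key string, ses uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.wants[key] == nil {
		w.wants[key] = make(map[uint64]struct{})
	}
	w.wants[key][ses] = struct{}{}
}

// Cancel removes ses's interest in each key. It returns the keys that no
// session wants any more; only those would need a cancel sent to peers.
func (w *wantTracker) Cancel(keys []string, ses uint64) []string {
	w.mu.Lock()
	defer w.mu.Unlock()
	var dropped []string
	for _, k := range keys {
		sessions := w.wants[k]
		if _, ok := sessions[ses]; !ok {
			continue // this session never wanted k
		}
		delete(sessions, ses)
		if len(sessions) == 0 {
			delete(w.wants, k)
			dropped = append(dropped, k)
		}
	}
	return dropped
}

// Wanted reports whether any session still wants key.
func (w *wantTracker) Wanted(key string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return len(w.wants[key]) > 0
}
```

With two sessions wanting the same key, `Cancel` for the first session returns nothing for that key and `Wanted` still reports true for it.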
Are you not running into the issue I have here: #4659 (comment)? It looks like unsubscribing after the session is shut down causes the unsubscribe function to hang (looks like a bug in the pubsub library we're using). Actually, we should probably just replace that library (https://github.com/briantigerchow/pubsub). It's way too complicated for what it does (and we don't actually need to spawn a goroutine like that).
@Stebalien I didn't notice any issues, all the tests passed when I ran them. And I'm wary of replacing a library that has caused us zero problems in three years.
Calling Unsub after Shutdown will hang forever trying to write to the cmds channel. This patch has the same bug as mine. Try downloading something you don't have from the gateway, cancel, and then run:
My primary motivation was getting rid of the goroutine but we need goroutines anyways (one per subscription, actually) to make our contexts work... However, we still need to fix the unsubscribe issue (and the easiest way would be to at least modify this library).
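The hang described above can be illustrated with a toy command-loop pubsub. This is a sketch, not the code of the pubsub library in question: `miniPubSub`, its fields, and the string commands are all made up for illustration. A bare `ps.cmds <- cmd` after the loop has exited would block forever because nothing reads the unbuffered channel; selecting on a `done` channel as well is one possible guard.

```go
package main

// miniPubSub is a hypothetical miniature of a command-loop pubsub.
type miniPubSub struct {
	cmds chan string   // unbuffered; read only by loop
	done chan struct{} // closed once loop has exited
}

func newMiniPubSub() *miniPubSub {
	ps := &miniPubSub{cmds: make(chan string), done: make(chan struct{})}
	go ps.loop()
	return ps
}

// loop handles commands until it sees "shutdown".
func (ps *miniPubSub) loop() {
	defer close(ps.done)
	for cmd := range ps.cmds {
		if cmd == "shutdown" {
			return
		}
		// Other commands (sub, unsub, pub) would be handled here.
	}
}

// Shutdown stops the loop and waits for it to exit. It is safe to call
// more than once.
func (ps *miniPubSub) Shutdown() {
	select {
	case ps.cmds <- "shutdown":
	case <-ps.done:
	}
	<-ps.done
}

// Unsub sends an unsubscribe command. A bare send on ps.cmds would block
// forever after Shutdown; selecting on ps.done lets it return false instead.
func (ps *miniPubSub) Unsub(topic string) bool {
	select {
	case ps.cmds <- "unsub:" + topic:
		return true
	case <-ps.done:
		return false
	}
}
```

Calling `Unsub` before `Shutdown` returns true; calling it afterwards returns false instead of blocking.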
@whyrusleeping off the top of your head, do you recall why we have separate notification engines per session instead of just using the main one?
@Stebalien I actually don't remember off the top of my head. I think it was something around me thinking that multiple different subscriptions wouldn't play nicely. But they should... I'd say try it. It seems like it could work.
License: MIT Signed-off-by: Jeromy <[email protected]>
License: MIT Signed-off-by: Steven Allen <[email protected]>
…g it down Otherwise, we'll deadlock and leak a goroutine. This fix is kind of crappy but modifying the pubsub library would have been worse (and, really, it *is* reasonable to say "don't use the pubsub instance after shutting it down"). License: MIT Signed-off-by: Steven Allen <[email protected]>
974eaa0 to 5395826
@whyrusleeping so, I fixed the unsubscribe after shutdown deadlock by just not unsubscribing after shutting down. I'm not happy with the fix but cleanly fixing the pubsub library didn't seem possible either due to its API (e.g., when a user calls I figured if we couldn't have a nice fix, we might as well have a small fix internally and avoid maintaining a fork.
// Interrupt in-progress subscriptions.
close(ps.cancel)
// Wait for them to finish.
ps.wg.Wait()
This will wait for all active wants to be cancelled, which happens if the caller closes the session, right? I would like to see a test around this.
No, it'll just wait for all unsubscribes to finish (which should happen immediately after we close the ps.cancel channel).
However, I have added a test to ensure that shutting down the PubSub while a subscription is active works (and doesn't block as it did before). Is that what you're looking for?
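As a rough illustration of the shutdown pattern in the reviewed snippet (close a cancel channel, then wait on a WaitGroup), here is a self-contained sketch. The `notifier` type, its `closed` flag, and the one-value `Subscribe` are assumptions made for the example and are not the PR's actual wrapper; only the `cancel` channel and `wg` mirror names in the diff.

```go
package main

import "sync"

// notifier is a hypothetical wrapper showing the shutdown pattern only.
type notifier struct {
	mu     sync.Mutex
	closed bool
	cancel chan struct{}  // closed by Shutdown to interrupt subscriptions
	wg     sync.WaitGroup // counts in-progress subscription goroutines
}

func newNotifier() *notifier {
	return &notifier{cancel: make(chan struct{})}
}

// Subscribe forwards at most one value from src to the returned channel,
// then closes it. After Shutdown it returns an already closed channel.
func (n *notifier) Subscribe(src <-chan string) <-chan string {
	out := make(chan string, 1)
	n.mu.Lock()
	if n.closed {
		n.mu.Unlock()
		close(out)
		return out
	}
	n.wg.Add(1) // registered under the lock, so it cannot race with Wait
	n.mu.Unlock()

	go func() {
		defer n.wg.Done()
		defer close(out)
		select {
		case v, ok := <-src:
			if ok {
				out <- v
			}
		case <-n.cancel:
			// Interrupted by Shutdown.
		}
	}()
	return out
}

// Shutdown interrupts in-progress subscriptions and waits for them to finish.
func (n *notifier) Shutdown() {
	n.mu.Lock()
	if n.closed {
		n.mu.Unlock()
		return
	}
	n.closed = true
	n.mu.Unlock()

	close(n.cancel)
	n.wg.Wait()
}
```

In this sketch, `Shutdown` returns even while a subscription is still waiting on a source that never delivers, which is the behaviour the added test is described as checking.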
License: MIT Signed-off-by: Steven Allen <[email protected]>
Real test failures:
(will deadlock) License: MIT Signed-off-by: Steven Allen <[email protected]>
So that I don't forget where I left off... Canceling wants for one session sometimes prevents the other session from receiving its requested block. So far, I've narrowed this down to the
@Stebalien status here?
@whyrusleeping still debugging.
…ages Before, we weren't using a pointer so we were throwing away the update. License: MIT Signed-off-by: Steven Allen <[email protected]>
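The commit message above describes a common Go pitfall: mutating a copy instead of the original. The actual code path is not shown in this thread, so the following is only a generic sketch of that class of bug with made-up names (`entry`, `bumpByValue`, `bumpByPointer`).

```go
package main

// entry is a hypothetical record; it does not mirror any bitswap type.
type entry struct {
	Key      string
	Priority int
}

// bumpByValue looks like it updates every entry, but the loop variable e
// is a copy of each element, so the update is thrown away.
func bumpByValue(entries []entry) {
	for _, e := range entries {
		e.Priority++ // modifies the copy only
	}
}

// bumpByPointer iterates over pointers, so the update reaches the
// original entries.
func bumpByPointer(entries []*entry) {
	for _, e := range entries {
		e.Priority++
	}
}
```

Indexing into the slice (`entries[i].Priority++`) is another way to make the update stick without switching to pointers.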
Well... that's an old bug. Unfortunately, that fix still doesn't fix everything.
very weird behaviour here. Putting a sleep after
Putting the
Okay, looking at wantlist messages being sent from peer A to peer B. In success cases we have one of:
But in the failure case, I'm always seeing:
Which explains why it hangs: we cancelled our request, and the other peer is respecting that. Now the question is, what makes us cancel?
So with the above pattern,
This should result in three messages getting sent: a want, a cancel, and a want. My only thought now is that the messages are getting reordered in transit somehow.
The messages are delivered by throwing them off into a goroutine...
Adding a random delay to every message send reproduces the bug ~30% of the time.
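To show why a goroutine per message can reorder a want/cancel/want sequence, and one way to avoid it, here is a small sketch. It is not the fix applied in this PR: `orderedSender` and its methods are hypothetical. Starting `go send(msg)` for each message gives no ordering guarantee between sends, whereas a single goroutine draining a queue delivers messages in the order they were enqueued.

```go
package main

// orderedSender is a hypothetical per-peer sender. One goroutine drains
// the queue, so messages reach send in the order Send was called.
// Compare with `go send(msg)` per message, where the scheduler is free
// to run the sends in any order.
type orderedSender struct {
	queue chan string
	done  chan struct{}
}

// newOrderedSender starts the single delivery goroutine. The send
// callback stands in for writing a message to the network.
func newOrderedSender(send func(string)) *orderedSender {
	s := &orderedSender{
		queue: make(chan string, 16),
		done:  make(chan struct{}),
	}
	go func() {
		defer close(s.done)
		for msg := range s.queue {
			send(msg)
		}
	}()
	return s
}

// Send enqueues msg for delivery; it blocks if the queue is full.
func (s *orderedSender) Send(msg string) {
	s.queue <- msg
}

// Close stops accepting messages and waits until all queued messages
// have been delivered. Send must not be called after Close.
func (s *orderedSender) Close() {
	close(s.queue)
	<-s.done
}
```

Sending "want", "cancel", "want" through this queue delivers them in exactly that order, so the final state on the receiving side is a want rather than a cancel.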
License: MIT Signed-off-by: Jeromy <[email protected]>