shutdown notifications engine when closing a bitswap session #4658

Merged
merged 8 commits into from
Feb 13, 2018

Conversation

whyrusleeping
Member

License: MIT
Signed-off-by: Jeromy [email protected]

@ghost assigned whyrusleeping Feb 4, 2018
@ghost added the status/in-progress label Feb 4, 2018
@whyrusleeping
Member Author

Very strange test failure...

@whyrusleeping
Member Author

Not strange, I just wasn't paying attention.

@whyrusleeping
Member Author

@Stebalien I think my second commit is the right fix for the wantlist thing, but the test doesn't actually reproduce the issue. I think that's because there are a couple of different places the wantlist is accounted for, and bitswap.Wantlist is not the same as the wantmanager where the sessions track their wants.

cs, _ := cid.Cast([]byte(c))
live = append(live, cs)
}
bs.CancelWants(live, s.id)
Member

What happens if another session wants a cid which is cancelled here?

Member Author

Wants are tracked per session (note that we pass in the session ID)
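
Not the actual wantmanager code, just a minimal sketch of the per-session bookkeeping being described here (all names are illustrative): each want keeps the set of session IDs that asked for it, and a cancel from one session only turns into a network-level cancel once no session still wants the key.

```go
package main

import "fmt"

// wantEntry tracks which sessions currently want a given key.
type wantEntry struct {
	sessions map[uint64]struct{}
}

// wantTable is a toy stand-in for the want manager's bookkeeping.
type wantTable struct {
	wants map[string]*wantEntry // keyed by CID string
}

func newWantTable() *wantTable {
	return &wantTable{wants: make(map[string]*wantEntry)}
}

// Want records that session ses wants key k.
func (wt *wantTable) Want(k string, ses uint64) {
	e, ok := wt.wants[k]
	if !ok {
		e = &wantEntry{sessions: make(map[uint64]struct{})}
		wt.wants[k] = e
	}
	e.sessions[ses] = struct{}{}
}

// Cancel drops session ses's interest in key k and reports whether a
// network-level cancel should go out (i.e. no session wants it anymore).
func (wt *wantTable) Cancel(k string, ses uint64) bool {
	e, ok := wt.wants[k]
	if !ok {
		return false
	}
	delete(e.sessions, ses)
	if len(e.sessions) == 0 {
		delete(wt.wants, k)
		return true
	}
	return false
}

func main() {
	wt := newWantTable()
	wt.Want("QmFoo", 1)
	wt.Want("QmFoo", 2)
	// Session 1 shuts down and cancels its wants; session 2 still wants the
	// block, so no cancel goes out on the wire.
	fmt.Println(wt.Cancel("QmFoo", 1)) // false
	fmt.Println(wt.Cancel("QmFoo", 2)) // true
}
```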

@Stebalien
Member

Are you not running into the issue I have here: #4659 (comment)? It looks like unsubscribing after the session is shut down causes the unsubscribe function to hang (looks like a bug in the pubsub library we're using).

Actually, we should probably just replace that library (https://github.com/briantigerchow/pubsub). It's way too complicated for what it does (and we don't actually need to spawn a goroutine like that).

@whyrusleeping
Member Author

@Stebalien I didn't notice any issues; all the tests passed when I ran them. And I'm wary of replacing a library that has caused us zero problems in three years.

@Stebalien
Member

Calling Unsub after Shutdown will hang forever trying to write to the cmds channel. This patch has the same bug as mine. Try downloading something you don't have from the gateway, cancel, and then run:

> curl 'http://localhost:5001/debug/pprof/goroutine?debug=2' | grep 'pubsub.*Unsub'

> And I'm wary of replacing a library that has caused us zero problems in three years.

My primary motivation was getting rid of the goroutine, but we need goroutines anyway (one per subscription, actually) to make our contexts work... However, we still need to fix the unsubscribe issue (and the easiest way would be to at least modify this library).
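
For context, a stripped-down illustration of the hang being described; this is not the real briantigerchow/pubsub code, just a toy with the same single-command-loop shape. Once the loop goroutine exits on shutdown, any later command send blocks forever.

```go
package main

import "fmt"

// Toy model of a pubsub with a single command-serving goroutine.
type pubsub struct {
	cmds chan string
}

func newPubSub() *pubsub {
	ps := &pubsub{cmds: make(chan string)}
	go ps.loop()
	return ps
}

// loop is the one goroutine that reads every command.
func (ps *pubsub) loop() {
	for c := range ps.cmds {
		if c == "shutdown" {
			return // loop exits: nothing reads ps.cmds anymore
		}
		// handle sub/unsub/publish...
	}
}

func (ps *pubsub) Shutdown() { ps.cmds <- "shutdown" }

// Unsub deadlocks if called after Shutdown: it writes to the cmds channel,
// but the goroutine that would read it has already returned.
func (ps *pubsub) Unsub() { ps.cmds <- "unsub" }

func main() {
	ps := newPubSub()
	ps.Shutdown()
	fmt.Println("shut down")
	// ps.Unsub() // would block forever, as seen in the pprof goroutine dump
}
```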

@Stebalien
Member

@whyrusleeping off the top of your head, do you recall why we have separate notification engines per session instead of just using the main one?

@whyrusleeping
Member Author

@Stebalien I actually don't remember off the top of my head. I think it was something around me thinking that multiple different subscriptions wouldn't play nicely. But they should... I'd say try it. It seems like it could work.

whyrusleeping and others added 4 commits February 9, 2018 12:18
…g it down

Otherwise, we'll deadlock and leak a goroutine. This fix is kind of crappy but
modifying the pubsub library would have been worse (and, really, it *is*
reasonable to say "don't use the pubsub instance after shutting it down").

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member

@whyrusleeping so, I fixed the unsubscribe after shutdown deadlock by just not unsubscribing after shutting down. I'm not happy with the fix but cleanly fixing the pubsub library didn't seem possible either due to its API (e.g., when a user calls AddSub(ch, topics...) on a pubsub instance after shutting it down, should we close ch? That could panic if we had already registered it and closed it).

I figured if we couldn't have a nice fix, we might as well have a small fix internally and avoid maintaining a fork.

// Interrupt in-progress subscriptions.
close(ps.cancel)
// Wait for them to finish.
ps.wg.Wait()
Member Author

This will wait for all active wants to be cancelled, which happens if the caller closes the session, right? I would like to see a test around this.

Member

No, it'll just wait for all unsubscribes to finish (which should happen immediately after we close the ps.cancel channel).

However, I have added a test to ensure that shutting down the PubSub while a subscription is active works (and doesn't block as it did before). Is that what you're looking for?
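
Roughly the shape of the workaround being described (illustrative names, not the actual patch): goroutines that would normally unsubscribe also select on a shared cancel channel, and shutdown closes that channel and then waits on a WaitGroup, so nothing is left blocked on a channel nobody reads.

```go
package main

import (
	"fmt"
	"sync"
)

type session struct {
	cancel chan struct{} // closed on shutdown
	unsub  chan string   // unsubscribe requests for the pubsub loop
	wg     sync.WaitGroup
}

func newSession() *session {
	return &session{cancel: make(chan struct{}), unsub: make(chan string)}
}

// subscribe spawns a goroutine that will eventually try to unsubscribe.
// If shutdown has already started, it skips the unsubscribe instead of
// blocking on a channel nobody reads.
func (s *session) subscribe(topic string) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		// ...wait for blocks, context cancellation, etc...
		select {
		case s.unsub <- topic:
			// normal unsubscribe, handled by the pubsub loop
		case <-s.cancel:
			// shutting down: the whole pubsub is going away, skip it
		}
	}()
}

func (s *session) shutdown() {
	// Interrupt in-progress subscriptions, then wait for them to finish.
	close(s.cancel)
	s.wg.Wait()
	fmt.Println("all subscription goroutines exited")
}

func main() {
	s := newSession()
	s.subscribe("QmFoo")
	s.shutdown()
}
```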

@Stebalien
Member

Real test failures:

--- FAIL: TestMultipleSessions (20.01s)
	session_test.go:284: bad juju
--- FAIL: TestWantlistClearsOnCancel (0.00s)
	session_test.go:318: expected empty wantlist

(will deadlock)

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member

So that I don't forget where I left off...

Canceling wants for one session sometimes prevents the other session from receiving its requested block. So far, I've narrowed this down to the mq.out.Cancel(e.Cid) line in the addMessage function of msgQueue in wantmanager.go. Commenting that line out fixes it.

@whyrusleeping
Member Author

@Stebalien status here?

@Stebalien
Member

@whyrusleeping still debugging.

…ages

Before, we weren't using a pointer so we were throwing away the update.

License: MIT
Signed-off-by: Steven Allen <[email protected]>
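
The bug pattern from that commit message, reduced to a toy example (the msgQueue name and fields here are just illustrative): fetching a struct by value means the update lands on a copy and is thrown away; holding a pointer makes the update stick.

```go
package main

import "fmt"

type msgQueue struct {
	pending int
}

func main() {
	// Buggy: q is a copy of the map value, so the increment is thrown away.
	queues := map[string]msgQueue{"peerA": {}}
	q := queues["peerA"]
	q.pending++
	fmt.Println(queues["peerA"].pending) // 0 -- the update was lost

	// Fixed: store pointers so updates land on the shared value.
	ptrQueues := map[string]*msgQueue{"peerA": {}}
	p := ptrQueues["peerA"]
	p.pending++
	fmt.Println(ptrQueues["peerA"].pending) // 1
}
```
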
@Stebalien
Member

Well... that's an old bug. Unfortunately, that fix still doesn't fix everything. go test -v -count 100 -run TestMultipleSessions hangs after ~20 runs (better than the 1-5 before...).

@whyrusleeping
Member Author

Very weird behaviour here. Putting a sleep after cancel1() makes the issue stop happening, but moving cancel1() to after the next GetBlocks call also makes it stop happening.

@whyrusleeping
Member Author

Putting the cancel1() after the NewSession call still allows me to reproduce the issue.

@whyrusleeping
Member Author

Okay, looking at wantlist messages being sent from peer A to peer B. In success cases we have one of:

  • A -> B: Want X
    or
  • A -> B: Want X
  • A -> B: Want X
    or
  • A -> B: Want X
  • A -> B: Cancel X
  • A -> B: Want X

But in the failure case, I'm always seeing:

  • A -> B: Want X
  • A -> B: Want X
  • A -> B: Cancel X

That explains why it hangs: we cancelled our request, and the other peer is respecting that. Now the question is: what makes us cancel?

@whyrusleeping
Member Author

So with the above pattern, addMessage gets called three times:

  • Want X (session 1)
  • Cancel X (session 1)
  • Want X (session 2)

This should result in three messages getting sent: a want, a cancel, and a want. My only thought now is that the messages are getting reordered in transit somehow.

@whyrusleeping
Member Author

Oh look: https://github.com/ipfs/go-ipfs/blob/master/exchange/bitswap/testnet/virtual.go#L80

The messages are delivered by throwing them off into a goroutine...

@whyrusleeping
Member Author

Adding a random delay to every message send causes the bug to reproduce ~30% of the time.
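
A tiny repro of what the last few comments describe (illustrative only, not the virtual testnet code): handing each message send to its own goroutine, with a small random delay, means Want X / Cancel X / Want X can be observed as Want X / Want X / Cancel X on the receiving side.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

func main() {
	var (
		mu        sync.Mutex
		delivered []string
		wg        sync.WaitGroup
	)

	send := func(msg string) {
		wg.Add(1)
		// Deliver in a fresh goroutine with a small random delay, like the
		// virtual testnet does -- ordering between sends is not preserved.
		go func() {
			defer wg.Done()
			time.Sleep(time.Duration(rand.Intn(3)) * time.Millisecond)
			mu.Lock()
			delivered = append(delivered, msg)
			mu.Unlock()
		}()
	}

	// Issued in the "correct" order...
	send("Want X")   // session 1
	send("Cancel X") // session 1 shuts down
	send("Want X")   // session 2

	wg.Wait()
	// ...but the receiver may observe e.g. [Want X, Want X, Cancel X],
	// leaving the block permanently cancelled on the remote side.
	fmt.Println(delivered)
}
```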
