Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent resetAddrConn from re-adding addConn after being deleted by balancer #1354

Closed
wants to merge 2 commits into from

Conversation

nghialv
Copy link
Contributor

@nghialv nghialv commented Jul 6, 2017

Currently an addrConn is being added to the cc.conns without checking whether the address still exists in balancing targets list.
In some cases, for example, the client receives a goaway frame immediately after the addrConn has been deleted from the cc.conns list but before the tearDown of addrConn.
This causes a new addrConn will be added and an endless resetTransport loop begins.

This is easy to happen while shutting down multiple servers. The following loop will take enough time for some goaway frames to come.
https://github.com/grpc/grpc-go/blob/master/clientconn.go#L535

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

@nghialv
Copy link
Contributor Author

nghialv commented Jul 6, 2017

@googlebot I signed it!

@googlebot
Copy link

CLAs look good, thanks!

@menghanl
Copy link
Contributor

menghanl commented Jul 12, 2017

Thanks for debugging and reporting the issue.

I think we could still use cc.conns as the new cc.addrs map you added in this PR if we can keep cc.conns consistent with the addresses specified by balancer.

The root issue I'm seeing here is that, resetAddrConn is used to create new addrConn in some cases (balancer adds more addresses), and to replace the old addrConn with a new one in some cases (goaway happens).
The bug as you described is, in the latter case, if such an addrConn doesn't exist in cc.conns, we will still add a new addrConn, which is wrong.

resetAddrConn takes a parameter to specify the error to tearDown the addrConn.
The error is always nil if we are adding a new addrConn, and is always non-nil if we are replacing the addrConn.

The fix I'm proposing is that, we should check both if tearDownErr != nil and if stale := cc.conns[ac.addr]; stale == nil.
If tearDownErr != nil (we are replacing addrConn, not adding a new one), and also stale == nil (the addrConn we are trying to replace was removed), we should not add a new addrConn to cc.conns.
Please let me know if this sounds correct.

@menghanl
Copy link
Contributor

menghanl commented Jul 12, 2017

A second thought on this, maybe we should never replace the old addrConn with new one.
In the case of connection error happens, we could just reset the underlying connection.
Created #1369 for this. Please take a look and let me know what you think.

Ignore this. It's a bit more complicated than that.

@nghialv
Copy link
Contributor Author

nghialv commented Jul 13, 2017

Hi

Thanks for your review.

Yes, I think we should keep cc.conns consistent with the addresses of balancer.
And rather than recreating a new addrConn, it is better to only reset the underlying transport connection. However, we need to carefully consider the impact from that change. (Maybe that change should be done in the future)

And in the scope of this PR, if we ensure that the tearDownError is always non-nil while replacing the addrConn, we should use stale and that error to determine whether or not the new addrConn should be added.

@nghialv
Copy link
Contributor Author

nghialv commented Jul 18, 2017

Hi @menghanl

How about this PR?
And can you tell me the date of the next release?
(We want to apply bug fixes of #1353, #1354 into our code)

@menghanl
Copy link
Contributor

Our release cycle is every 6 weeks.
So the next release is scheduled July 19 (tomorrow).

About this PR, I updated #1369 with some new changes, so we can replace transport not the addrConn. Please take a look at that.

If we cannot include the fix for this in the next release (1.5.0), we can have a minor release (1.5.1) to include this later.

@menghanl
Copy link
Contributor

#1369 has been merged, and the problem should be solved.
Can you try and see if you can still produce the issue?
Also, if you don't mind, can you clean up this PR and only include the test you added?

@nghialv
Copy link
Contributor Author

nghialv commented Jul 24, 2017

I tried with the code at master and it worked fine.
About the test in this PR, I think it is not necessary after merging #1369 because we made sure that resetAddrConn is only called when a new address is added. (maybe the name of resetAddrConn method should be changed)
So you can close this PR without merging.
I'm hoping for a minor release (1.5.1) in this week.

Thanks @menghanl

@nghialv
Copy link
Contributor Author

nghialv commented Jul 27, 2017

@menghanl Can you tell me the date of the next release (1.5.1)?

@menghanl
Copy link
Contributor

Sorry for delay.
v1.5.1 was released.

@nghialv
Copy link
Contributor Author

nghialv commented Jul 28, 2017

Great! Thank you.

@menghanl
Copy link
Contributor

Closing this PR now. Thanks.

@menghanl menghanl closed this Jul 28, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
@nghialv nghialv deleted the conns branch October 29, 2024 06:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants