grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts #7390
Merged
4 commits
c3a3d1c Update conn state to prevent concurrent connection attempts (arjan-bal)
6214c9d Make callers of resetBackoff() lock the mutex (arjan-bal)
ff977b3 Add doc comment for resetTransportAndUnlock (arjan-bal)
76ef33f Merge remote-tracking branch 'source/master' into fix_conn_connect_race (arjan-bal)
We should probably file a bug for this; otherwise, won't it be a behavior change?
Can you look at #7365 (comment) for the details of this bug?
What change in behaviour are you concerned about?
Oh, I was just asking with respect to release notes, because currently the bug points to a test flake. Anyway, I just checked and it doesn't matter, because release notes refer to the fix PR and not the issue. In the release notes, though, we should prefix the package:
balancer: Fix race condition that could lead to multiple transports being created in parallel
Also, it looks like without your fix, there is a case where resetTransport can error out and return without updating the connectivity state
grpc-go/clientconn.go
Line 1237 in bdd707e
Maybe we can call resetTransport() in the same critical section, instead of releasing the lock and acquiring it again in resetTransport()?
I don't see any benefit in adding the `acCtx.Err()` check in `connect`, because even `resetTransport` sets the state to `Connecting` and releases the lock:
grpc-go/clientconn.go
Lines 1262 to 1263 in bdd707e
This means that the context can be cancelled (and the addrConn subsequently shut down) after the channel is in the `connecting` state, even without the change.
IIUC, we just need to ensure that we don't set the connecting state after the channel enters shutdown. The check for the shutdown state at the top should be enough protection to ensure that the shutdown state comes only after we enter `connecting`.
We could rename `resetTransport` to `resetTransportLocked` and expect the callers to hold the lock while calling this method. However, `resetTransport` releases the lock temporarily. Add to this that `ac.updateAddrs` calls `resetTransport` in a new goroutine, so it can't hold the lock until `resetTransport` completes. It feels a little risky to make that change. I don't know for sure, but I feel we could end up in a situation where the lock is not released correctly, resulting in a deadlock.
I don't want to make that change as the first option.
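For readers following the thread, here is a minimal sketch of the "caller locks, callee unlocks" convention that the PR title and the `resetTransportAndUnlock` commit allude to. It is not the grpc-go implementation: the `conn` type, its string states, and the `resetAndUnlock`, `dial`, and `updateAsync` names are all hypothetical. The goroutine case relies on the documented property of `sync.Mutex` that one goroutine may lock it and another may unlock it.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical type used only to illustrate the locking convention.
type conn struct {
	mu    sync.Mutex
	state string
	dials int
}

// resetAndUnlock must be called with c.mu held. It updates the state under
// the caller's lock and then releases the lock before doing the slow work.
func (c *conn) resetAndUnlock() {
	c.state = "connecting"
	c.mu.Unlock()
	c.dial() // slow work runs without the lock held
}

// dial stands in for creating a transport; it re-acquires the lock briefly.
func (c *conn) dial() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dials++
	c.state = "ready"
}

// connect checks the state and hands the held lock to resetAndUnlock, so no
// other caller can pass the idle check in between.
func (c *conn) connect() {
	c.mu.Lock()
	if c.state != "idle" {
		c.mu.Unlock()
		return
	}
	c.resetAndUnlock()
}

// updateAsync models a caller that runs the reset in a new goroutine: the
// lock is acquired here and released inside the goroutine.
func (c *conn) updateAsync() <-chan struct{} {
	done := make(chan struct{})
	c.mu.Lock()
	go func() {
		defer close(done)
		c.resetAndUnlock()
	}()
	return done
}

func main() {
	c := &conn{state: "idle"}
	c.connect()
	c.connect() // no-op: the state is no longer idle
	<-c.updateAsync()
	fmt.Println(c.dials, c.state)
}
```

The deadlock concern raised above corresponds to any path in `resetAndUnlock` that returns without calling `Unlock`; in this sketch there is exactly one path, which keeps the convention easy to audit.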
From the code, it looks like in the case of `acCtx.Err()`, resetTransport() doesn't update the state and returns, so the state will still be idle; but after your fix, in the case of `acCtx.Err()`, the state will be updated to `connecting`. Am I missing something?
From my understanding, `ac.ctx` is used to control the creation of remote connections, while `ac.state` is used to synchronize all the state transitions for the `addrConn`. `ac.connect()` doesn't deal with creating remote connections, so it doesn't need to check `ac.ctx.Err()`. It needs to ensure the transition to `Connecting` is valid, which it does by locking the mutex and verifying that `ac.state != Shutdown`. `ac.ctx` is used to avoid doing throwaway work that takes significant time (creating a remote conn).
Please let me know if your understanding is different.
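The point about validating the transition under the mutex can be illustrated with a small model. This is only a sketch under assumptions: `addrConn` here is a simplified, hypothetical stand-in (not the real grpc-go type), the `state` constants are invented for the example, and `attempts` merely counts how many connection attempts would have been started.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a hypothetical stand-in for a connectivity state.
type state int

const (
	idle state = iota
	connecting
	shutdown
)

// addrConn is a simplified model, not the real grpc-go type.
type addrConn struct {
	mu       sync.Mutex
	state    state
	attempts int // number of connection attempts started
}

// connect starts an attempt only if the addrConn is idle. The check and the
// transition to connecting share one critical section, so two concurrent
// callers cannot both pass the check.
func (ac *addrConn) connect() bool {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	if ac.state != idle { // covers shutdown and an attempt already in progress
		return false
	}
	ac.state = connecting
	ac.attempts++
	return true
}

// tearDown moves the addrConn to shutdown under the same mutex.
func (ac *addrConn) tearDown() {
	ac.mu.Lock()
	defer ac.mu.Unlock()
	ac.state = shutdown
}

// raceConnect calls connect from n goroutines and reports attempts started.
func raceConnect(ac *addrConn, n int) int {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ac.connect()
		}()
	}
	wg.Wait()
	ac.mu.Lock()
	defer ac.mu.Unlock()
	return ac.attempts
}

func main() {
	fmt.Println(raceConnect(&addrConn{}, 8)) // prints 1

	closed := &addrConn{}
	closed.tearDown()
	fmt.Println(closed.connect()) // prints false
}
```

Because the shutdown transition takes the same mutex, a `connect` that runs after `tearDown` can never move the state back to connecting, which is the guarantee the comment above relies on.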
Discussed offline: it doesn't matter if resetTransport() returns an error after the state has been updated to `connecting`.