Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts #7390

Merged
merged 4 commits into from
Jul 9, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions clientconn.go
Original file line number Diff line number Diff line change
Expand Up @@ -918,6 +918,9 @@ func (ac *addrConn) connect() error {
ac.mu.Unlock()
return nil
}
// Update the state to ensure no concurrent requests start once the lock
// is released.
ac.updateConnectivityState(connectivity.Connecting, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably file a bug for this otherwise it will be a behavior change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you look at #7365 (comment) for the details of this bug?

What change in behaviour are you concerned about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I was just asking with respect to release notes because currently the bug points to test flake. Anyways, I just checked it doesn't matter because release notes refer to the fix PR and not the issue. Although in the release notes, we should prefix the package

balancer: Fix race condition that could lead to multiple transports being created in parallel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it looks like without your fix, there is a case where resetTransport can error out and return without updating the connectivity state

if acCtx.Err() != nil {

May be we can make the resetTransport() in the same critical section instead of releasing lock and aquiring again in resetTransport()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any benefit in adding the acCtcx.Err check in connect because even resetTransport sets the the state to Connecting and releases the lock:

grpc-go/clientconn.go

Lines 1262 to 1263 in bdd707e

ac.updateConnectivityState(connectivity.Connecting, nil)
ac.mu.Unlock()

This means that the context can be cancelled (and subsequently addrConn shutdown) after the channel is in connecting state even without the change.

IIUC we just need to ensure that we don't set connecting state after the channel enters shutdown.

The test for shutdown state on top should be enough protection to ensure shutdown state comes only after we enter connecting.

Copy link
Contributor Author

@arjan-bal arjan-bal Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can make the resetTransport() in the same critical section instead of releasing lock and aquiring again in resetTransport()?

We could rename resetTransport to resetTransportLocked and expect the callers to hold the lock while calling this method. However, resetTransport releases the lock temporarily. Add to this that ac.updateAddrs calls resetTransport in a new go routine so it can't hold the lock till resetTransport completes. It feels a little risky to make that change. I don't know for sure, but I feel we could end up in a situation where the lock is not released correctly resulting in a deadlock.

I don't want do make that change as the first option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any benefit in adding the acCtcx.Err check in connect because even resetTransport sets the the state to Connecting and releases the lock

From the code it looks like in case of acCtcx.Err, resetTransport() doesn't update the state and return so state will be still idle but after your fix in case of acCtcx.Err state will be updated to connecting. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, ac.ctx is used to control the creation of remote connections while ac.state is used to synchronize all the state transitions for the addrConn. ac.connect() doesn't deal with creating remote connections, so it doesn't need to check ac.ctx.Err(). It needs to ensure the transition to Connecting is valid, which it does by locking the mutex and verifying that ac.state != Shutdown

ac.ctx is used to avoid doing throw away work which takes significant time (creating a remote conn).

Please let me know if your understanding is different.

Copy link
Contributor

@purnesh42H purnesh42H Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: it doesn't matter if resetTransport() returns error after state being updated to connecting

ac.mu.Unlock()

ac.resetTransport()
Expand Down
Loading