Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts #7390

Merged
merged 4 commits into from
Jul 9, 2024

Conversation

arjan-bal
Copy link
Contributor

@arjan-bal arjan-bal commented Jul 4, 2024

This change ensures that caller of resetTransport() keep holding the mutex instead of releasing it and having resetTransport() re-acquire it. This ensures that no concurrent requests are able to start once caller of resetTransport does some validation and calls resetTransport.

See #7365 (comment) for more details.

Tested

Verified that Test/AuthorityRevive no longer flakes for 100000 attempts with the change.

Fixes: #7365

RELEASE NOTES:

  • client: fix race that could lead to orphaned connections and associated resources.

@arjan-bal arjan-bal added this to the 1.66 Release milestone Jul 4, 2024
@arjan-bal arjan-bal changed the title grpc: Update conn state to prevent concurrent connection attempts grpc: Update conn state early to prevent concurrent connection attempts Jul 4, 2024
Copy link

codecov bot commented Jul 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.35%. Comparing base (daab563) to head (76ef33f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7390      +/-   ##
==========================================
- Coverage   81.51%   81.35%   -0.17%     
==========================================
  Files         348      348              
  Lines       26744    26741       -3     
==========================================
- Hits        21801    21754      -47     
- Misses       3764     3793      +29     
- Partials     1179     1194      +15     
Files Coverage Δ
clientconn.go 92.40% <100.00%> (-0.99%) ⬇️

... and 23 files with indirect coverage changes

clientconn.go Outdated
@@ -918,6 +918,9 @@ func (ac *addrConn) connect() error {
ac.mu.Unlock()
return nil
}
// Update the state to ensure no concurrent requests start once the lock
// is released.
ac.updateConnectivityState(connectivity.Connecting, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably file a bug for this otherwise it will be a behavior change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you look at #7365 (comment) for the details of this bug?

What change in behaviour are you concerned about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I was just asking with respect to release notes because currently the bug points to test flake. Anyways, I just checked it doesn't matter because release notes refer to the fix PR and not the issue. Although in the release notes, we should prefix the package

balancer: Fix race condition that could lead to multiple transports being created in parallel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it looks like without your fix, there is a case where resetTransport can error out and return without updating the connectivity state

if acCtx.Err() != nil {

May be we can make the resetTransport() in the same critical section instead of releasing lock and aquiring again in resetTransport()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any benefit in adding the acCtcx.Err check in connect because even resetTransport sets the the state to Connecting and releases the lock:

grpc-go/clientconn.go

Lines 1262 to 1263 in bdd707e

ac.updateConnectivityState(connectivity.Connecting, nil)
ac.mu.Unlock()

This means that the context can be cancelled (and subsequently addrConn shutdown) after the channel is in connecting state even without the change.

IIUC we just need to ensure that we don't set connecting state after the channel enters shutdown.

The test for shutdown state on top should be enough protection to ensure shutdown state comes only after we enter connecting.

Copy link
Contributor Author

@arjan-bal arjan-bal Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can make the resetTransport() in the same critical section instead of releasing lock and aquiring again in resetTransport()?

We could rename resetTransport to resetTransportLocked and expect the callers to hold the lock while calling this method. However, resetTransport releases the lock temporarily. Add to this that ac.updateAddrs calls resetTransport in a new go routine so it can't hold the lock till resetTransport completes. It feels a little risky to make that change. I don't know for sure, but I feel we could end up in a situation where the lock is not released correctly resulting in a deadlock.

I don't want do make that change as the first option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any benefit in adding the acCtcx.Err check in connect because even resetTransport sets the the state to Connecting and releases the lock

From the code it looks like in case of acCtcx.Err, resetTransport() doesn't update the state and return so state will be still idle but after your fix in case of acCtcx.Err state will be updated to connecting. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, ac.ctx is used to control the creation of remote connections while ac.state is used to synchronize all the state transitions for the addrConn. ac.connect() doesn't deal with creating remote connections, so it doesn't need to check ac.ctx.Err(). It needs to ensure the transition to Connecting is valid, which it does by locking the mutex and verifying that ac.state != Shutdown

ac.ctx is used to avoid doing throw away work which takes significant time (creating a remote conn).

Please let me know if your understanding is different.

Copy link
Contributor

@purnesh42H purnesh42H Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: it doesn't matter if resetTransport() returns error after state being updated to connecting

@arjan-bal arjan-bal requested a review from purnesh42H July 4, 2024 12:02
@purnesh42H
Copy link
Contributor

purnesh42H commented Jul 5, 2024

Release notes needs to be prefixed with "package name:" in this case balancer:?

@arjan-bal
Copy link
Contributor Author

Release notes needs to be prefixed with ":" in this case balancer:?

This is a change in the outermost grpc package. Added the package name in the release notes.

@purnesh42H
Copy link
Contributor

is Test/ServerSideXDS_WithValidAndInvalidSecurityConfiguration failure related to change or just a flake?

@arjan-bal
Copy link
Contributor Author

is Test/ServerSideXDS_WithValidAndInvalidSecurityConfiguration failure related to change or just a flake?

It's a flake, there's an existing issue for this and I've commented on the issue about this failure: #6914

Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@arjan-bal arjan-bal assigned dfawley and arjan-bal and unassigned purnesh42H and dfawley Jul 8, 2024
@arjan-bal arjan-bal changed the title grpc: Update conn state early to prevent concurrent connection attempts grpc: Hold mutex while calling resetTransport to prevent concurrent connection attempts Jul 8, 2024
@arjan-bal arjan-bal force-pushed the fix_conn_connect_race branch from 375bdd7 to 6214c9d Compare July 8, 2024 19:18
@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Jul 8, 2024
@dfawley
Copy link
Member

dfawley commented Jul 8, 2024

RELEASE NOTES:

  • Fix race condition that could lead to multiple transports being created in parallel.

What is the user-visible symptom here? A memory leak? Or just an extra connection that was attempted, but that will quickly go away on its own anyway? (Or will the extra connection stick around until the channel is closed?)

@@ -1231,8 +1228,7 @@ func (ac *addrConn) adjustParams(r transport.GoAwayReason) {
}
}

func (ac *addrConn) resetTransport() {
ac.mu.Lock()
func (ac *addrConn) resetTransportAndUnlock() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a short comment here:

// resetTransportAndUnlock unconditionally connects the addrConn.
//
// ac.mu must be held by the caller, and this function will guarantee it is released.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ac.mu must be held by the caller, and this function will guarantee it is released

should we have code check for this as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the doc comment.

Copy link
Contributor Author

@arjan-bal arjan-bal Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have code check for this as well?

We don't verify if the mutex is locked in other functions in the code which assume the caller has the lock. These functions have the suffix Locked in the name.

  • There is a TryLock() method on sync/mutex, but its use is discouraged.
  • resetTransportAndUnlock is a private method and we run tests with race detector so we can catch incorrect usage.
  • When the resetTransportAndUnlock method tries to unlock a mutex that isn't locked, it will panic.

Copy link
Contributor

@purnesh42H purnesh42H Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant just having a doc doesn't enforce the mutex should be locked. Unlocking a mutex that is not locked in Go will result in a runtime panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you suggest how to enforce the locking?
How is the caller expected to handle the error if resetTransportAndUnlock doesn't panic?
Since its a private method, doesn't a panic make the failure more visible and ensure incorrect usages are caught by tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it will probably require to implement custom mutex with some boolean field but since we have precedences of methods (e.g https://github.com/grpc/grpc-go/blob/master/clientconn.go#L728) unlocking mutex, devs will be aware of these

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of the function and the comment should be sufficient for this.

@arjan-bal
Copy link
Contributor Author

arjan-bal commented Jul 9, 2024

What is the user-visible symptom here? A memory leak? Or just an extra connection that was attempted, but that will quickly go away on its own anyway? (Or will the extra connection stick around until the channel is closed?)

What I'm seeing in the test is that only one transport is closed when the subConn is updated (ac.transport) and the other transport is orphaned. The orphaned transports get closed when the server is shutdown at the end of the test. Updated the release notes.

@dfawley dfawley changed the title grpc: Hold mutex while calling resetTransport to prevent concurrent connection attempts grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts Jul 9, 2024
@dfawley dfawley merged commit 45d44a7 into grpc:master Jul 9, 2024
12 of 13 checks passed
printchard pushed a commit to printchard/grpc-go that referenced this pull request Jul 11, 2024
printchard pushed a commit to printchard/grpc-go that referenced this pull request Jul 30, 2024
printchard pushed a commit to printchard/grpc-go that referenced this pull request Jul 30, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 6, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test/AuthorityRevive in google.golang.org/grpc/xds/internal/xdsclient/tests
3 participants