Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky Test/AuthorityRevive in google.golang.org/grpc/xds/internal/xdsclient/tests #7365

Closed
atollena opened this issue Jun 28, 2024 · 3 comments · Fixed by #7390
Closed

Flaky Test/AuthorityRevive in google.golang.org/grpc/xds/internal/xdsclient/tests #7365

atollena opened this issue Jun 28, 2024 · 3 comments · Fixed by #7390

Comments

@atollena
Copy link
Collaborator

https://github.com/grpc/grpc-go/actions/runs/9711071699/job/26803095072?pr=7364

--- FAIL: Test (14.05s)
    --- FAIL: Test/AuthorityRevive (0.00s)
        authority_test.go:76: Created new snapshot cache...
        tlogger.go:116: INFO server.go:685 [core] [Server #518]Server created  (t=+259.644µs)
        authority_test.go:76: Registered Aggregated Discovery Service (ADS)...
        authority_test.go:76: xDS management server serving at: 127.0.0.1:44757...
        authority_test.go:79: Created new snapshot cache...
        tlogger.go:116: INFO server.go:685 [core] [Server #519]Server created  (t=+316.09µs)
        authority_test.go:79: Registered Aggregated Discovery Service (ADS)...
        authority_test.go:79: xDS management server serving at: 127.0.0.1:41611...
        tlogger.go:116: INFO server.go:881 [core] [Server #518 ListenSocket #520]ListenSocket created  (t=+404.195µs)
        tlogger.go:116: INFO server.go:881 [core] [Server #519 ListenSocket #521]ListenSocket created  (t=+480.408µs)
        tlogger.go:116: INFO bootstrap.go:571 [xds] [xds-bootstrap] Bootstrap config for creating xds-client: &{xDSServers:[0xc00076abd0] cpcs:map[] serverListenerResourceNameTemplate: clientDefaultListenerResourceNameTemplate:%s authorities:map[test-authority1:0xc00076d830 test-authority2:0xc00076d860 test-authority3:0xc00076d890] node:{ID:4974a10e-6dfc-42ee-b520-e37a6fc2948f Cluster: Locality:{Region: Zone: SubZone:} Metadata:<nil> userAgentName:gRPC Go userAgentVersionType:{UserAgentVersion:1.66.0-dev} clientFeatures:[envoy.lb.does_not_support_overprovisioning xds.config.resource-in-sotw]} certProviderConfigs:map[]}  (t=+538.716µs)
        tlogger.go:116: INFO client_new.go:87 [xds] [xds-client 0xc0004ea0a0] Created client to xDS management server: 127.0.0.1:44757-insecure-  (t=+573.17µs)
        server.go:229: Created new resource snapshot...
        server.go:235: Updated snapshot cache with resource snapshot...
        tlogger.go:116: INFO clientconn.go:1687 [core] original dial target is: "127.0.0.1:44757"  (t=+847.172µs)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #522]Channel created  (t=+875.374µs)
        tlogger.go:116: INFO clientconn.go:191 [core] [Channel #522]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/127.0.0.1:44757", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}  (t=+912.994µs)
        tlogger.go:116: INFO clientconn.go:192 [core] [Channel #522]Channel authority set to "127.0.0.1:44757"  (t=+935.887µs)
        tlogger.go:116: INFO resolver_wrapper.go:197 [core] [Channel #522]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:44757",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:44757",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+1.022228ms)
        tlogger.go:116: INFO balancer_wrapper.go:103 [core] [Channel #522]Channel switches to new LB policy "pick_first"  (t=+1.056141ms)
        tlogger.go:116: INFO clientconn.go:852 [core] [Channel #522 SubChannel #523]Subchannel created  (t=+1.096707ms)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #522]Channel Connectivity change to CONNECTING  (t=+1.115833ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #522]Channel exiting idle mode  (t=+1.137313ms)
        tlogger.go:116: INFO transport.go:237 [xds] [xds-client 0xc0004ea0a0] [127.0.0.1:44757] Created transport to server "127.0.0.1:44757"  (t=+1.165655ms)
        tlogger.go:116: INFO clientconn.go:1[213](https://github.com/grpc/grpc-go/actions/runs/9711071699/job/26803095072?pr=7364#step:8:214) [core] [Channel #522 SubChannel #523]Subchannel Connectivity change to CONNECTING  (t=+1.259239ms)
        tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.285688ms)
        tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.429717ms)
        authority_test.go:298: Unexpected new transport created to management server
        tlogger.go:116: WARNING transport.go:335 [xds] [xds-client 0xc0004ea0a0] [127.0.0.1:44757] Creating new ADS stream failed: rpc error: code = Canceled desc = received context error while waiting for new LB policy update: context canceled  (t=+1.926424ms)
        tlogger.go:116: INFO clientconn.go:539 [core] [Channel #522]Channel Connectivity change to SHUTDOWN  (t=+1.947954ms)
        tlogger.go:116: INFO resolver_wrapper.go:100 [core] [Channel #522]Closing the name resolver  (t=+1.969675ms)
        tlogger.go:116: INFO balancer_wrapper.go:135 [core] [Channel #522]ccBalancerWrapper: closing  (t=+1.989221ms)
        tlogger.go:116: INFO clientconn.go:1213 [core] [Channel #522 SubChannel #523]Subchannel Connectivity change to SHUTDOWN  (t=+2.015099ms)
        tlogger.go:116: INFO clientconn.go:1560 [core] [Channel #522 SubChannel #523]Subchannel deleted  (t=+2.035908ms)
        tlogger.go:116: INFO clientconn.go:309 [core] [Channel #522]Channel deleted  (t=+2.054984ms)
        tlogger.go:116: INFO clientimpl.go:100 [xds] [xds-client 0xc0004ea0a0] Shutdown  (t=+2.127708ms)
        tlogger.go:116: INFO server.go:817 [core] [Server #519 ListenSocket #521]ListenSocket deleted  (t=+2.179294ms)
        tlogger.go:116: INFO server.go:817 [core] [Server #518 ListenSocket #520]ListenSocket deleted  (t=+2.[217](https://github.com/grpc/grpc-go/actions/runs/9711071699/job/26803095072?pr=7364#step:8:218)455ms)
@arjan-bal
Copy link
Contributor

There is roughly 0.4% flakiness when run on forge: 399 out of 100000 failures

@arjan-bal
Copy link
Contributor

arjan-bal commented Jul 4, 2024

Investigation

It looks like the subchannel picks the same address twice in case of failures

tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.285688ms)
tlogger.go:116: INFO clientconn.go:1329 [core] [Channel #522 SubChannel #523]Subchannel picks a new address "127.0.0.1:44757" to connect  (t=+1.429717ms)

tryAllAddrs is being called twice.

@arjan-bal
Copy link
Contributor

arjan-bal commented Jul 4, 2024

Root Cause

  • Balancer Wrapper calls connect in a new go routine: https://github.com/grpc/grpc-go/blame/bdd707e642e40cf75db5ac3f0f6af48077f48368/balancer_wrapper.go#L276C22-L276C22
  • This in turn calls addrConn.connect() which briefly locks the mutex, ensures the channel is idle releases the mutex before calling resetTransport here:

    grpc-go/clientconn.go

    Lines 914 to 925 in bdd707e

    if ac.state != connectivity.Idle {
    if logger.V(2) {
    logger.Infof("connect called on addrConn in non-idle state (%v); ignoring.", ac.state)
    }
    ac.mu.Unlock()
    return nil
    }
    ac.mu.Unlock()
    ac.resetTransport()
    return nil
    }
  • resetTransport locks the mutex when it starts, it also sets the state to connecting to prevent parallel connections:

    grpc-go/clientconn.go

    Lines 1234 to 1262 in bdd707e

    func (ac *addrConn) resetTransport() {
    ac.mu.Lock()
    acCtx := ac.ctx
    if acCtx.Err() != nil {
    ac.mu.Unlock()
    return
    }
    addrs := ac.addrs
    backoffFor := ac.dopts.bs.Backoff(ac.backoffIdx)
    // This will be the duration that dial gets to finish.
    dialDuration := minConnectTimeout
    if ac.dopts.minConnectTimeout != nil {
    dialDuration = ac.dopts.minConnectTimeout()
    }
    if dialDuration < backoffFor {
    // Give dial more time as we keep failing to connect.
    dialDuration = backoffFor
    }
    // We can potentially spend all the time trying the first address, and
    // if the server accepts the connection and then hangs, the following
    // addresses will never be tried.
    //
    // The spec doesn't mention what should be done for multiple addresses.
    // https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md#proposed-backoff-algorithm
    connectDeadline := time.Now().Add(dialDuration)
    ac.updateConnectivityState(connectivity.Connecting, nil)

When addrConn.connect releases the mutex after checking for idleness, another call to addrConn.connect can come in which also sees the channel as idle because resetTransport hasn't acquired the lock yet. So we have two connection attempts in parallel.

A simple fix it to set the state to connecting while addrConn.connect has the mutex locked. I tried it and it fixed the flakiness. Will discuss with the team and raise a PR.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
2 participants