-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ringhash: more e2e tests from c-core #7334
base: master
Are you sure you want to change the base?
Conversation
8dafccc
to
d6d26a5
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #7334 +/- ##
==========================================
- Coverage 81.47% 81.41% -0.07%
==========================================
Files 348 348
Lines 26761 26786 +25
==========================================
+ Hits 21804 21807 +3
- Misses 3772 3778 +6
- Partials 1185 1201 +16
|
Follow up to grpc#7271 to fix grpc#6072. This adds a dozen more end to end tests. There are tests that I did not port, specifically: - TestRingHash_TransientFailureSkipToAvailableReady was flaky when I ported it, so I removed it while investigating. - TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky, I also removed it while investigating. - TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime, I'm not sure we implement this behavior, and if we do, it's not working the same way as in c-core, where the order of subchannel connection attempts is based on the resolver address order rather than the ring order. I will follow up with fixes for each one of the remaining tests.
d6d26a5
to
321d866
Compare
@atollena : Looks like there are some merge conflicts. Could you please take care of them before I start looking. Thanks. |
Done. This is ready for review. |
FYI the remaining 4 tests that were failing, mentioned in the description, are caused by #7363. |
// TestRingHash_IdleToReady tests that the channel will go from idle to ready | ||
// via connecting; (though it is not possible to catch the connecting state | ||
// before moving to ready). | ||
func (s) TestRingHash_IdleToReady(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an internal API that allows one to subscribe to connectivity state changes on the channel. See:
Line 139 in f199062
SubscribeToConnectivityStateChanges any // func(*grpc.ClientConn, grpcsync.Subscriber) |
Do you think it makes sense to use that for this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. That would allow us to see all transitions. But this is an internal API that is unused within the project. Do you have some usages within Google, since IIUC you don't enforce internal there? I would rather not depend on an API that is internal & unused, and would perhaps even propose to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two places we intend to use this eventually within the codebase, but just haven't found time to do it.
- Monitoring the state of the xDS client channel. We currently get around this by using
WaitForReady
call option in the ADS stream call. - Monitoring the state of the RLS control channel. We currently have some complicated logic to achieve this and it makes the associated test a little flaky. It's been on my wish list for a while to switch RLS to use this internal API and make that test non-flaky.
So, I'd be happy if we end up using it here and see some usage for this API. But I'm totally OK if you dont want to use it as well. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer not to add that in as part of this PR, maybe adding a TODO to use the internal API would work as well. Thanks.
} | ||
} | ||
|
||
// DialContext implements a context dialer for use with grpc.WithContextDialer | ||
// dial option for a BlockingDialer. | ||
func (d *BlockingDialer) DialContext(ctx context.Context, addr string) (net.Conn, error) { | ||
d.mu.Lock() | ||
holds := d.holds[addr] | ||
if len(holds) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know why I missed this in the first pass. Can we invert the conditional and have less indented code?
d.mu.Lock()
holds := d.holds[addr]
if len(holds) == 0 {
// No hold for this addr.
d.mu.Unlock()
return d.dialer.DialContext(ctx, "tcp", addr)
}
hold := holds[0]
d.holds[addr] = holds[1:]
d.mu.Unlock()
logger.Infof("Hold %p: Intercepted connection attempt to addr %q", hold, addr)
close(hold.waitCh)
select {
case <-hold.blockCh:
if hold.err != nil {
return nil, hold.err
}
return d.dialer.DialContext(ctx, "tcp", addr)
case <-ctx.Done():
logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr)
}
// holds maps network addresses to a list of holds for that address. | ||
holds map[string][]*Hold | ||
// dialer dials connections when they are not blocked. | ||
dialer *net.Dialer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized that this is initialized to net.Dialer
unconditionally in NewBlockingDialer
. If we don't imagine a case where this could be set to anything else, we could get rid of this field and where we do d.dialer.DialContext(ctx, "tcp", addr)
, we could do (&net.Dialer{}).DialContext(ctx, "tcp", addr)
. Or we do imagine the case where this field could be set to something else, we should probably accept it as an argument in NewBlockingDialer
. What do you think?
} | ||
return d.dialer.DialContext(ctx, "tcp", addr) | ||
case <-ctx.Done(): | ||
logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe s/cancelled/timed out/
?
} | ||
return d.dialer.DialContext(ctx, "tcp", addr) | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this under the case
for <-h.waitCh
to make it more explicit.
// blockCh is closed when the connection attempt should resume. | ||
blockCh chan error | ||
// err is the error to return when the connection attempt is failed. | ||
err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either blockCh
should be used to communicate the error
value between Resume/Fail
and DialContext
(instead of using a separate field err
and having to add a comment saying it is synchronized via blockCh
). Or blockCh
should be of type chan struct{}
to be clear that no value is being communicated via this channel. I would prefer the former.
go func() { | ||
conn, err := d.DialContext(ctx, lis.Addr().String()) | ||
if err != nil { | ||
t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err) | ||
} | ||
conn.Close() | ||
done <- struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: if err != nil
, conn
is most likely to be nil
and conn.Close()
would panic. Could we rewrite this as:
go func() {
defer close(done)
conn, err := d.DialContext(ctx, lis.Addr().String())
if err != nil {
t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err)
return
}
conn.Close()
}
// makeNonExistentBackends returns a slice of strings with num listeners, each | ||
// of which is closed immediately. Useful to simulate servers that are | ||
// unreachable. | ||
func makeNonExistentBackends(t *testing.T, num int) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: mark this method as a test helper as well, with t.Helper()
?
// endpointResource creates a ClusterLoadAssignment containing a single locality | ||
// with the given addresses. | ||
func endpointResource(t *testing.T, clusterName string, addrs []string) *v3endpointpb.ClusterLoadAssignment { | ||
// We must set the host name socket address in EDS, as the ring hash policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Helper()
here as well.
// TestRingHash_IdleToReady tests that the channel will go from idle to ready | ||
// via connecting; (though it is not possible to catch the connecting state | ||
// before moving to ready). | ||
func (s) TestRingHash_IdleToReady(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer not to add that in as part of this PR, maybe adding a TODO to use the internal API would work as well. Thanks.
}() | ||
|
||
// Wait for the connection attempt to the real backend. | ||
hold := dialer.Hold(backend.Address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create this hold before we spawn the goroutine that makes the RPC? If not, we could have a case where the connection attempt to this address is made before we get here, right?
// Tests that when the first pick is down leading to a transient failure, we | ||
// will move on to the next ring hash entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment needs updating probably.
Follow up to #7271 to fix #6072.
This adds a dozen more end to end tests.
There are tests that I did not port, specifically:
TestRingHash_SwitchToLowerPriorityAndThenBack
was also flaky when ported as-is, I also removed it while investigating.TestRingHash_TransientFailureSkipToAvailableReady
was flaky when ported as-is, so I removed it while investigating.TestRingHash_ReattemptWhenAllEndpointsUnreachable
was flaky when ported as-is, so I removed it while investigating.TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime
, I'm not sure we implement this behavior, and if we do, it's not working the same way as in c-core, where the order of subchannel connection attempts is based on the resolver address order rather than the ring order.I will follow up with fixes for each one of the remaining tests.
RELEASE NOTES: none