ringhash: more e2e tests from c-core #7334

atollena · 2024-06-19T10:26:54Z

Follow up to #7271 to fix #6072.

This adds a dozen more end-to-end tests.

There are tests that I did not port, specifically:

- TestRingHash_TransientFailureSkipToAvailableReady was flaky when ported as-is, so I removed it while investigating.
- TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky when ported as-is, so I removed it as well while investigating.
- TestRingHash_ReattemptWhenAllEndpointsUnreachable was flaky when ported as-is, so I removed it while investigating.
- TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime: I'm not sure we implement this behavior, and if we do, it does not work the same way as in c-core, where the order of subchannel connection attempts is based on the resolver address order rather than the ring order.

I will follow up with fixes for each one of the remaining tests.

RELEASE NOTES: none

codecov · 2024-06-19T10:41:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.41%. Comparing base (98e5dee) to head (3926a09).
Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7334      +/-   ##
==========================================
- Coverage   81.47%   81.41%   -0.07%     
==========================================
  Files         348      348              
  Lines       26761    26786      +25     
==========================================
+ Hits        21804    21807       +3     
- Misses       3772     3778       +6     
- Partials     1185     1201      +16

Files	Coverage Δ
internal/testutils/blocking_context_dialer.go	`100.00% <100.00%> (+16.66%)`	⬆️

... and 22 files with indirect coverage changes

easwars · 2024-06-26T22:32:47Z

@atollena : Looks like there are some merge conflicts. Could you please take care of them before I start looking. Thanks.

atollena · 2024-06-27T10:57:33Z

> @atollena : Looks like there are some merge conflicts. Could you please take care of them before I start looking. Thanks.

Done. This is ready for review.

atollena · 2024-06-28T07:58:35Z

FYI, the 4 remaining tests mentioned in the description as failing are failing because of #7363.

easwars · 2024-06-28T18:52:51Z

xds/internal/balancer/ringhash/e2e/ringhash_balancer_test.go

+// TestRingHash_IdleToReady tests that the channel will go from idle to ready
+// via connecting; (though it is not possible to catch the connecting state
+// before moving to ready).
+func (s) TestRingHash_IdleToReady(t *testing.T) {


We have an internal API that allows one to subscribe to connectivity state changes on the channel. See:

grpc-go/internal/internal.go

Line 139 in f199062

SubscribeToConnectivityStateChanges any // func(*grpc.ClientConn, grpcsync.Subscriber)

Do you think it makes sense to use that for this test?

Interesting. That would allow us to see all transitions. But this is an internal API that is unused within the project. Do you have some usages within Google, since IIUC you don't enforce internal there? I would rather not depend on an API that is internal & unused, and would perhaps even propose to remove it.

There are two places we intend to use this eventually within the codebase, but just haven't found time to do it.

- Monitoring the state of the xDS client channel. We currently get around this by using the WaitForReady call option in the ADS stream call.
- Monitoring the state of the RLS control channel. We currently have some complicated logic to achieve this, and it makes the associated test a little flaky. It's been on my wish list for a while to switch RLS to use this internal API and make that test non-flaky.

So, I'd be happy if we end up using it here and see some usage for this API. But I'm totally OK if you don't want to use it as well. Thanks.

If you prefer not to add that in as part of this PR, maybe adding a TODO to use the internal API would work as well. Thanks.
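For readers unfamiliar with the subscription approach discussed in this thread, here is a minimal standalone sketch of how a subscriber could record every state transition, including a short-lived connecting state. It does not use grpc-go: the `subscriber` interface, the `state` type, `stateRecorder`, and `publish` are all hypothetical stand-ins defined only for this illustration, and the callback shape (`OnMessage(msg any)`) is an assumption about what such a subscriber looks like.

```go
package main

import (
	"fmt"
	"sync"
)

// subscriber is a local stand-in for a message subscriber: one callback per
// published message. It is defined for this sketch and is not the grpc-go type.
type subscriber interface {
	OnMessage(msg any)
}

// state is a hypothetical stand-in for a channel connectivity state.
type state string

const (
	idle       state = "IDLE"
	connecting state = "CONNECTING"
	ready      state = "READY"
)

// stateRecorder records every state it is notified about, in order.
type stateRecorder struct {
	mu   sync.Mutex
	seen []state
}

// OnMessage appends the state to the recorded list and ignores other messages.
func (r *stateRecorder) OnMessage(msg any) {
	s, ok := msg.(state)
	if !ok {
		return
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.seen = append(r.seen, s)
}

// transitions returns a copy of the recorded states.
func (r *stateRecorder) transitions() []state {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]state(nil), r.seen...)
}

// publish delivers each state to the subscriber, the way a channel would on
// every connectivity change.
func publish(sub subscriber, states ...state) {
	for _, s := range states {
		sub.OnMessage(s)
	}
}

func main() {
	rec := &stateRecorder{}
	publish(rec, idle, connecting, ready)
	fmt.Println(rec.transitions())
}
```

Because the recorder is notified of every change rather than polling, a test built this way could assert on the full sequence instead of only the final state.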

easwars · 2024-07-12T19:56:21Z

internal/testutils/blocking_context_dialer.go

 	}
 }

 // DialContext implements a context dialer for use with grpc.WithContextDialer
 // dial option for a BlockingDialer.
 func (d *BlockingDialer) DialContext(ctx context.Context, addr string) (net.Conn, error) {
+	d.mu.Lock()
+	holds := d.holds[addr]
+	if len(holds) > 0 {


Don't know why I missed this in the first pass. Can we invert the conditional and have less indented code?

	d.mu.Lock()
	holds := d.holds[addr]
	if len(holds) == 0 {
		// No hold for this addr.
		d.mu.Unlock()
		return d.dialer.DialContext(ctx, "tcp", addr)
	}
	hold := holds[0]
	d.holds[addr] = holds[1:]
	d.mu.Unlock()

	logger.Infof("Hold %p: Intercepted connection attempt to addr %q", hold, addr)
	close(hold.waitCh)
	select {
	case <-hold.blockCh:
		if hold.err != nil {
			return nil, hold.err
		}
		return d.dialer.DialContext(ctx, "tcp", addr)
	case <-ctx.Done():
		logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr)
	}

easwars · 2024-07-16T14:47:56Z

internal/testutils/blocking_context_dialer.go

+	// holds maps network addresses to a list of holds for that address.
+	holds map[string][]*Hold
+	// dialer dials connections when they are not blocked.
+	dialer *net.Dialer


Just realized that this is initialized to net.Dialer unconditionally in NewBlockingDialer. If we don't imagine a case where this could be set to anything else, we could get rid of this field, and where we do d.dialer.DialContext(ctx, "tcp", addr), we could do (&net.Dialer{}).DialContext(ctx, "tcp", addr). Or, if we do imagine a case where this field could be set to something else, we should probably accept it as an argument in NewBlockingDialer. What do you think?
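The second option in this comment (accepting the dialer as an argument) could look roughly like the sketch below. This is an illustration under assumptions, not the code in the PR: `blockingDialer`, `newBlockingDialer`, `dialFunc`, `pipeDial`, and `dialAddr` are hypothetical names, and the hold logic is left out entirely so that only the injection point is visible.

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// dialFunc has the same signature as (*net.Dialer).DialContext.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// blockingDialer is a hypothetical, reduced version of the type under review.
// Holds are omitted; only the underlying dial function is kept.
type blockingDialer struct {
	dial dialFunc
}

// newBlockingDialer uses dial for attempts that are not held, falling back to
// a zero-value net.Dialer when dial is nil.
func newBlockingDialer(dial dialFunc) *blockingDialer {
	if dial == nil {
		dial = (&net.Dialer{}).DialContext
	}
	return &blockingDialer{dial: dial}
}

// DialContext forwards to the injected dial function over TCP.
func (d *blockingDialer) DialContext(ctx context.Context, addr string) (net.Conn, error) {
	return d.dial(ctx, "tcp", addr)
}

// pipeDial returns a dialFunc that records each target and hands back one end
// of an in-memory pipe, so the sketch runs without touching the network.
func pipeDial(seen *[]string) dialFunc {
	return func(_ context.Context, network, addr string) (net.Conn, error) {
		*seen = append(*seen, network+"://"+addr)
		client, server := net.Pipe()
		server.Close()
		return client, nil
	}
}

// dialAddr dials addr once with a background context and closes the result.
func dialAddr(d *blockingDialer, addr string) error {
	conn, err := d.DialContext(context.Background(), addr)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	var seen []string
	d := newBlockingDialer(pipeDial(&seen))
	if err := dialAddr(d, "localhost:1234"); err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	fmt.Println(seen)
}
```

The trade-off is the one the comment names: an injectable dial function makes the type testable without real sockets, at the cost of a constructor argument that may never be set to anything but the default.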

easwars · 2024-07-16T14:50:17Z

internal/testutils/blocking_context_dialer.go

+			}
+			return d.dialer.DialContext(ctx, "tcp", addr)
+		case <-ctx.Done():
+			logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr)


Maybe s/cancelled/timed out/ ?

easwars · 2024-07-16T14:53:33Z

internal/testutils/blocking_context_dialer.go

 	}
-	return d.dialer.DialContext(ctx, "tcp", addr)
+	return true


Could we move this under the case for <-h.waitCh to make it more explicit?

easwars · 2024-07-16T14:57:11Z

internal/testutils/blocking_context_dialer.go

+	// blockCh is closed when the connection attempt should resume.
+	blockCh chan error
+	// err is the error to return when the connection attempt is failed.
+	err error


Either blockCh should be used to communicate the error value between Resume/Fail and DialContext (instead of using a separate field err and having to add a comment saying it is synchronized via blockCh). Or blockCh should be of type chan struct{} to be clear that no value is being communicated via this channel. I would prefer the former.
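The first option (carrying the error over blockCh itself) can be sketched as follows. This is a simplified, hypothetical model written for illustration: the lower-case `hold` type, `resume`, `fail`, `dial`, and `runHeld` are not the PR's API, and the `connect` callback stands in for the real network dial. The channel is given a buffer of one here so that resume or fail does not block if the attempt has already given up on its context; that buffering is a choice made for this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// hold is a simplified, hypothetical version of the Hold discussed above.
type hold struct {
	// waitCh is closed when a connection attempt reaches the hold.
	waitCh chan struct{}
	// blockCh carries the outcome: nil to resume, non-nil to fail.
	blockCh chan error
}

func newHold() *hold {
	return &hold{waitCh: make(chan struct{}), blockCh: make(chan error, 1)}
}

// resume lets the blocked attempt proceed. Call resume or fail at most once.
func (h *hold) resume() { h.blockCh <- nil }

// fail makes the blocked attempt return err.
func (h *hold) fail(err error) { h.blockCh <- err }

// dial signals that the attempt arrived, blocks on the hold, and then either
// runs connect or returns the error passed to fail.
func (h *hold) dial(ctx context.Context, connect func() error) error {
	close(h.waitCh)
	select {
	case err := <-h.blockCh:
		if err != nil {
			return err
		}
		return connect()
	case <-ctx.Done():
		return ctx.Err()
	}
}

// errRefused is an injected error used only by this sketch.
var errRefused = errors.New("connection refused (injected)")

// runHeld drives one held attempt: it waits for the attempt to reach the hold,
// then fails it with failWith, or resumes it when failWith is nil.
func runHeld(failWith error) error {
	h := newHold()
	res := make(chan error, 1)
	go func() {
		res <- h.dial(context.Background(), func() error { return nil })
	}()
	<-h.waitCh
	if failWith != nil {
		h.fail(failWith)
	} else {
		h.resume()
	}
	return <-res
}

func main() {
	fmt.Println("resumed:", runHeld(nil))
	fmt.Println("failed:", runHeld(errRefused))
}
```

With the error travelling on the channel, there is no separate field whose synchronization has to be explained in a comment.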

easwars · 2024-07-16T15:02:22Z

internal/testutils/blocking_context_dialer_test.go

+	go func() {
+		conn, err := d.DialContext(ctx, lis.Addr().String())
+		if err != nil {
+			t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err)
+		}
+		conn.Close()
+		done <- struct{}{}


Nit: if err != nil, conn is most likely to be nil and conn.Close() would panic. Could we rewrite this as:

	go func() {
		defer close(done)
		conn, err := d.DialContext(ctx, lis.Addr().String())
		if err != nil {
			t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err)
			return
		}
		conn.Close()
	}()
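The same pattern, pulled out of the test so it can be run on its own, is sketched below. The names `dialAndClose`, `failingDial`, and `errDial` are hypothetical and exist only for this illustration; the point is that the early return keeps Close from being called on a nil connection, and the deferred close(done) still signals completion on both paths.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errDial is an injected error used only by this sketch.
var errDial = errors.New("dial failed (injected)")

// failingDial simulates a dial that fails and therefore returns no connection.
func failingDial() (io.Closer, error) { return nil, errDial }

// dialAndClose runs dial in a goroutine and closes the returned connection.
// done is closed when the goroutine exits; errCh receives the dial error, if
// there was one.
func dialAndClose(dial func() (io.Closer, error)) (done chan struct{}, errCh chan error) {
	done = make(chan struct{})
	errCh = make(chan error, 1)
	go func() {
		defer close(done)
		conn, err := dial()
		if err != nil {
			errCh <- err
			return // conn is nil here, so Close must not be called
		}
		conn.Close()
	}()
	return done, errCh
}

func main() {
	done, errCh := dialAndClose(failingDial)
	<-done
	select {
	case err := <-errCh:
		fmt.Println("dial error:", err)
	default:
		fmt.Println("dial succeeded")
	}
}
```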

easwars · 2024-07-16T15:12:13Z

xds/internal/balancer/ringhash/e2e/ringhash_balancer_test.go

+// makeNonExistentBackends returns a slice of strings with num listeners, each
+// of which is closed immediately. Useful to simulate servers that are
+// unreachable.
+func makeNonExistentBackends(t *testing.T, num int) []string {


Nit: mark this method as a test helper as well, with t.Helper()?

easwars · 2024-07-16T15:14:57Z

xds/internal/balancer/ringhash/e2e/ringhash_balancer_test.go

+// endpointResource creates a ClusterLoadAssignment containing a single locality
+// with the given addresses.
+func endpointResource(t *testing.T, clusterName string, addrs []string) *v3endpointpb.ClusterLoadAssignment {
+	// We must set the host name socket address in EDS, as the ring hash policy


t.Helper() here as well.

easwars · 2024-07-16T15:28:49Z

xds/internal/balancer/ringhash/e2e/ringhash_balancer_test.go

+	}()
+
+	// Wait for the connection attempt to the real backend.
+	hold := dialer.Hold(backend.Address)


Should we create this hold before we spawn the goroutine that makes the RPC? If not, we could have a case where the connection attempt to this address is made before we get here, right?
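The ordering concern raised here can be illustrated with a small model. In the sketch below, `registry`, `hold`, `dial`, and `interceptedWhenHeldFirst` are hypothetical names invented for this example, and the hold table is reduced to a counter per address. A dial that runs before any hold is registered is not intercepted, so registering the hold before spawning the goroutine removes the race.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical, stripped-down hold table keyed by address.
type registry struct {
	mu    sync.Mutex
	holds map[string]int // number of pending holds per address
}

func newRegistry() *registry { return &registry{holds: map[string]int{}} }

// hold registers one hold for addr.
func (r *registry) hold(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.holds[addr]++
}

// dial reports whether the attempt to addr was intercepted by a hold.
func (r *registry) dial(addr string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.holds[addr] == 0 {
		return false
	}
	r.holds[addr]--
	return true
}

// interceptedWhenHeldFirst registers the hold and only then starts the
// goroutine that dials, so the attempt cannot slip past the hold.
func interceptedWhenHeldFirst(addr string) bool {
	r := newRegistry()
	r.hold(addr) // registered before any dial can happen
	res := make(chan bool, 1)
	go func() { res <- r.dial(addr) }()
	return <-res
}

func main() {
	fmt.Println("held first:", interceptedWhenHeldFirst("localhost:1234"))

	// A dial with no hold registered yet goes straight through.
	fmt.Println("no hold yet:", newRegistry().dial("localhost:1234"))
}
```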

easwars · 2024-07-16T15:30:53Z

xds/internal/balancer/ringhash/e2e/ringhash_balancer_test.go

+// Tests that when the first pick is down leading to a transient failure, we
+// will move on to the next ring hash entry.


This comment probably needs updating.

atollena added the Type: Testing label Jun 19, 2024

atollena added this to the 1.66 Release milestone Jun 19, 2024

atollena requested a review from easwars June 19, 2024 10:31

atollena assigned easwars Jun 19, 2024

atollena force-pushed the issue-6072-part-2 branch from 8dafccc to d6d26a5 Compare June 19, 2024 10:35

atollena force-pushed the issue-6072-part-2 branch from d6d26a5 to 321d866 Compare June 19, 2024 11:10

atollena marked this pull request as ready for review June 20, 2024 11:47

easwars assigned atollena and unassigned easwars Jun 26, 2024

Merge branch 'master' into issue-6072-part-2

5b17c36

atollena assigned easwars and unassigned atollena Jun 27, 2024

cleanup

5bfc245

atollena mentioned this pull request Jun 28, 2024

ringhash: fix bug where ring hash can be stuck in transient failure despite having available endpoints #7364

Draft

easwars reviewed Jun 28, 2024

View reviewed changes

easwars assigned atollena and unassigned easwars Jun 28, 2024

comments from easwars + backport test changes from issue-7363

3926a09

atollena assigned easwars and unassigned atollena Jul 2, 2024

easwars reviewed Jul 16, 2024

View reviewed changes

easwars assigned atollena and unassigned easwars Jul 16, 2024
