Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ringhash: more e2e tests from c-core #7334

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

atollena
Copy link
Collaborator

@atollena atollena commented Jun 19, 2024

Follow up to #7271 to fix #6072.

This adds a dozen more end to end tests.

There are tests that I did not port, specifically:

I will follow up with fixes for each one of the remaining tests.

RELEASE NOTES: none

Copy link

codecov bot commented Jun 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.41%. Comparing base (98e5dee) to head (3926a09).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7334      +/-   ##
==========================================
- Coverage   81.47%   81.41%   -0.07%     
==========================================
  Files         348      348              
  Lines       26761    26786      +25     
==========================================
+ Hits        21804    21807       +3     
- Misses       3772     3778       +6     
- Partials     1185     1201      +16     
Files Coverage Δ
internal/testutils/blocking_context_dialer.go 100.00% <100.00%> (+16.66%) ⬆️

... and 22 files with indirect coverage changes

Follow up to grpc#7271 to fix
grpc#6072.

This adds a dozen more end to end tests.

There are tests that I did not port, specifically:

- TestRingHash_TransientFailureSkipToAvailableReady was flaky when I ported it,
so I removed it while investigating.

- TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky, I also removed
it while investigating.

- TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime, I'm not sure
we implement this behavior, and if we do, it's not working the same way as in
c-core, where the order of subchannel connection attempts is based on the
resolver address order rather than the ring order.

I will follow up with fixes for each one of the remaining tests.
@atollena atollena marked this pull request as ready for review June 20, 2024 11:47
@easwars
Copy link
Contributor

easwars commented Jun 26, 2024

@atollena : Looks like there are some merge conflicts. Could you please take care of them before I start looking. Thanks.

@easwars easwars assigned atollena and unassigned easwars Jun 26, 2024
@atollena atollena assigned easwars and unassigned atollena Jun 27, 2024
@atollena
Copy link
Collaborator Author

@atollena : Looks like there are some merge conflicts. Could you please take care of them before I start looking. Thanks.

Done. This is ready for review.

@atollena
Copy link
Collaborator Author

FYI the remaining 4 tests that were failing, mentioned in the description, are caused by #7363.

internal/testutils/blocking_context_dialer.go Show resolved Hide resolved
internal/testutils/blocking_context_dialer.go Outdated Show resolved Hide resolved
internal/testutils/blocking_context_dialer.go Show resolved Hide resolved
internal/testutils/blocking_context_dialer_test.go Outdated Show resolved Hide resolved
internal/testutils/blocking_context_dialer_test.go Outdated Show resolved Hide resolved
// TestRingHash_IdleToReady tests that the channel will go from idle to ready
// via connecting; (though it is not possible to catch the connecting state
// before moving to ready).
func (s) TestRingHash_IdleToReady(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an internal API that allows one to subscribe to connectivity state changes on the channel. See:

SubscribeToConnectivityStateChanges any // func(*grpc.ClientConn, grpcsync.Subscriber)

Do you think it makes sense to use that for this test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. That would allow us to see all transitions. But this is an internal API that is unused within the project. Do you have some usages within Google, since IIUC you don't enforce internal there? I would rather not depend on an API that is internal & unused, and would perhaps even propose to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two places we intend to use this eventually within the codebase, but just haven't found time to do it.

  • Monitoring the state of the xDS client channel. We currently get around this by using WaitForReady call option in the ADS stream call.
  • Monitoring the state of the RLS control channel. We currently have some complicated logic to achieve this and it makes the associated test a little flaky. It's been on my wish list for a while to switch RLS to use this internal API and make that test non-flaky.

So, I'd be happy if we end up using it here and see some usage for this API. But I'm totally OK if you dont want to use it as well. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer not to add that in as part of this PR, maybe adding a TODO to use the internal API would work as well. Thanks.

@easwars easwars assigned atollena and unassigned easwars Jun 28, 2024
@atollena atollena assigned easwars and unassigned atollena Jul 2, 2024
}
}

// DialContext implements a context dialer for use with grpc.WithContextDialer
// dial option for a BlockingDialer.
func (d *BlockingDialer) DialContext(ctx context.Context, addr string) (net.Conn, error) {
d.mu.Lock()
holds := d.holds[addr]
if len(holds) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know why I missed this in the first pass. Can we invert the conditional and have less indented code?

	d.mu.Lock()
	holds := d.holds[addr]
	if len(holds) == 0 {
		// No hold for this addr.
		d.mu.Unlock()
		return d.dialer.DialContext(ctx, "tcp", addr)
	}
	
	hold := holds[0]
	d.holds[addr] = holds[1:]
	d.mu.Unlock()

	logger.Infof("Hold %p: Intercepted connection attempt to addr %q", hold, addr)
	close(hold.waitCh)
	select {
	case <-hold.blockCh:
		if hold.err != nil {
			return nil, hold.err
		}
		return d.dialer.DialContext(ctx, "tcp", addr)
	case <-ctx.Done():
		logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr)
	}

// holds maps network addresses to a list of holds for that address.
holds map[string][]*Hold
// dialer dials connections when they are not blocked.
dialer *net.Dialer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that this is initialized to net.Dialer unconditionally in NewBlockingDialer. If we don't imagine a case where this could be set to anything else, we could get rid of this field and where we do d.dialer.DialContext(ctx, "tcp", addr), we could do (&net.Dialer{}).DialContext(ctx, "tcp", addr). Or we do imagine the case where this field could be set to something else, we should probably accept it as an argument in NewBlockingDialer. What do you think?

}
return d.dialer.DialContext(ctx, "tcp", addr)
case <-ctx.Done():
logger.Infof("Hold %p: Connection attempt to addr %q cancelled", hold, addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe s/cancelled/timed out/ ?

}
return d.dialer.DialContext(ctx, "tcp", addr)
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this under the case for <-h.waitCh to make it more explicit.

Comment on lines +86 to +89
// blockCh is closed when the connection attempt should resume.
blockCh chan error
// err is the error to return when the connection attempt is failed.
err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either blockCh should be used to communicate the error value between Resume/Fail and DialContext (instead of using a separate field err and having to add a comment saying it is synchronized via blockCh). Or blockCh should be of type chan struct{} to be clear that no value is being communicated via this channel. I would prefer the former.

Comment on lines +65 to +71
go func() {
conn, err := d.DialContext(ctx, lis.Addr().String())
if err != nil {
t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err)
}
conn.Close()
done <- struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: if err != nil, conn is most likely to be nil and conn.Close() would panic. Could we rewrite this as:

	go func() {
		defer close(done)
		conn, err := d.DialContext(ctx, lis.Addr().String())
		if err != nil {
			t.Errorf("BlockingDialer.DialContext() got error: %v, want success", err)
			return
		}
		conn.Close()
	}

// makeNonExistentBackends returns a slice of strings with num listeners, each
// of which is closed immediately. Useful to simulate servers that are
// unreachable.
func makeNonExistentBackends(t *testing.T, num int) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: mark this method as a test helper as well, with t.Helper()?

// endpointResource creates a ClusterLoadAssignment containing a single locality
// with the given addresses.
func endpointResource(t *testing.T, clusterName string, addrs []string) *v3endpointpb.ClusterLoadAssignment {
// We must set the host name socket address in EDS, as the ring hash policy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Helper() here as well.

// TestRingHash_IdleToReady tests that the channel will go from idle to ready
// via connecting; (though it is not possible to catch the connecting state
// before moving to ready).
func (s) TestRingHash_IdleToReady(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer not to add that in as part of this PR, maybe adding a TODO to use the internal API would work as well. Thanks.

}()

// Wait for the connection attempt to the real backend.
hold := dialer.Hold(backend.Address)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create this hold before we spawn the goroutine that makes the RPC? If not, we could have a case where the connection attempt to this address is made before we get here, right?

Comment on lines +1449 to +1450
// Tests that when the first pick is down leading to a transient failure, we
// will move on to the next ring hash entry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment needs updating probably.

@easwars easwars assigned atollena and unassigned easwars Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ringhash: improve test coverage
2 participants