Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: support fallback within an authority #7701

Merged
merged 4 commits into from
Nov 6, 2024

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Oct 4, 2024

#a71-xds-fallback
Fixes #6902

This PR adds support to fallback to a lower priority server (within an authority) when a higher priority goes down and also adds support to revert to a higher priority server when it comes back up.

This PR also adds e2e style tests for the scenarios specified in the fallback interop test spec.

RELEASE NOTES:

  • xdsclient: add support to fallback to lower priority servers when higher priority ones are down

@easwars easwars force-pushed the xdsclient_ads_stream branch 7 times, most recently from 9c6247d to 0bf5608 Compare October 10, 2024 00:07
@purnesh42H purnesh42H added this to the 1.69 Release milestone Oct 16, 2024
@easwars easwars force-pushed the xdsclient_ads_stream branch 4 times, most recently from c960691 to ed16d72 Compare October 23, 2024 18:21
@easwars easwars force-pushed the xdsclient_ads_stream branch from ed16d72 to 88b1de8 Compare November 1, 2024 16:24
@easwars easwars force-pushed the xdsclient_ads_stream branch from 88b1de8 to 4909aa4 Compare November 1, 2024 16:35
@easwars easwars changed the title xdsclient: complete refactor and fallback support xdsclient: support fallback within an authority Nov 1, 2024
@easwars easwars requested review from zasweq and purnesh42H November 1, 2024 16:39
@easwars easwars added the Type: Feature New features or improvements in behavior label Nov 1, 2024
@easwars easwars marked this pull request as ready for review November 1, 2024 16:40
Copy link

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 68.54839% with 39 lines in your changes missing coverage. Please review.

Project coverage is 81.77%. Comparing base (d66fc3a) to head (72930ff).
Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/xdsclient/authority.go 68.54% 28 Missing and 11 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7701      +/-   ##
==========================================
+ Coverage   81.71%   81.77%   +0.05%     
==========================================
  Files         374      373       -1     
  Lines       38166    37844     -322     
==========================================
- Hits        31188    30947     -241     
+ Misses       5699     5597     -102     
- Partials     1279     1300      +21     
Files with missing lines Coverage Δ
xds/internal/xdsclient/client_new.go 85.71% <ø> (+2.38%) ⬆️
xds/internal/xdsclient/authority.go 78.14% <68.54%> (-11.61%) ⬇️

... and 32 files with indirect coverage changes

xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/authority.go Show resolved Hide resolved
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
xds/internal/xdsclient/authority.go Outdated Show resolved Hide resolved
fallbackChannel := a.xdsChannelConfigs[fallbackServerIdx]

// If the server to fallback to already has an xdsChannel, it means that
// this connectivity error is from a server with a higher priority. There
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't quite understand this comment. If there is an error from higher priority server, we should fallback to a lower priority server. How is having an existing channel for fallback server makes any difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets say the authority has two servers: primary and fallback. We start off with primary, and say it fails and therefore we switch to fallback. And lets say the connection to fallback works and we get all resources from it and we are happy. Now, we get another error from the primary (in fact, we will keep getting stream errors from it since we retry the stream with backoff). At this point, we will see that we already have a channel to the next server in the list which is the fallback server, and therefore we have nothing to do here.

// resource that has not yet been cached.
//
// Only executed in the context of a serializer callback.
func (a *authority) uncachedWatcherExists() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name seems to convey "if any uncached watcher exist" where it should convey if there is a watcher that is looking for an uncached resource. May be rename to something like "uncachedResourceExistToWatch" or flip the bool to "allWatchableResourceCached" or simply "areAllRequestedResourceCached"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// existing resources.
//
// Only executed in the context of a serializer callback.
func (a *authority) triggerFallbackOnStreamFailure(failingServerConfig *bootstrap.ServerConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this function is simply falling back and not checking any stream failures. I guess that's done by caller? In that case name can be simply "triggerFallback". Also, in docstring we should mention it switches to next fallback server from the list (i.e. next lower priority than current failing one) and if no more servers below current then it returns as no-op. May be name can be more explicit "switchToNextFallbackServerIfAvailable"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to go with fallbackToNextServerIfPossible since Available is a term that can be viewed differently in this context, i.e. that the server is actually available for traffic, which is not what we check here.

// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
state.xdsChannelConfigs = append(state.xdsChannelConfigs[:idx], state.xdsChannelConfigs[idx+1:]...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this panic due to idx + 1 because of all these mutations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if it hits the last index does the index+1: panic, or is it just like nil or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to go with a map instead of a slice and that significantly simplifies this piece of code.

@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Nov 5, 2024
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some testing comments

xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Show resolved Hide resolved
xds/internal/xdsclient/tests/fallback_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I still get confused by xc.xc, can we switch the top level symbol to xcc or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, in a few other places as well. Also, I renames the struct fields to be more descriptive.

// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)
delete(state.xdsChannelConfigs, xc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two operations can only hit once per xDS Channel right? I think you can break out of the for once this hits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also, ended up inverting the conditional.

@zasweq zasweq removed their assignment Nov 5, 2024
@easwars easwars merged commit b3393d9 into grpc:master Nov 6, 2024
15 checks passed
@easwars easwars deleted the xdsclient_ads_stream branch November 6, 2024 19:52
misvivek pushed a commit to misvivek/grpc-go that referenced this pull request Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xdsclient: support fallback within each xDS server authority [A71]
3 participants