xdsclient: support fallback within an authority #7701
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7701 +/- ##
==========================================
+ Coverage   81.71%   81.77%   +0.05%
==========================================
  Files         374      373       -1
  Lines       38166    37844     -322
==========================================
- Hits        31188    30947     -241
+ Misses       5699     5597     -102
- Partials     1279     1300      +21
fallbackChannel := a.xdsChannelConfigs[fallbackServerIdx]

// If the server to fallback to already has an xdsChannel, it means that
// this connectivity error is from a server with a higher priority. There
I didn't quite understand this comment. If there is an error from a higher priority server, we should fall back to a lower priority server. How does having an existing channel for the fallback server make any difference?
Let's say the authority has two servers: primary and fallback. We start off with the primary; say it fails, and we therefore switch to the fallback. Say the connection to the fallback works, we get all resources from it, and we are happy. Now we get another error from the primary (in fact, we will keep getting stream errors from it, since we retry the stream with backoff). At this point, we see that we already have a channel to the next server in the list, which is the fallback server, and therefore we have nothing to do here.
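The scenario in the reply above can be sketched roughly as follows. This is a simplified, hypothetical model and not the code from the PR: the types `serverConfig` and `authority`, the `hasChannel` field, and the `fallbackIfNeeded` method are invented for illustration. Only the idea of checking whether the next server in the list already has a channel comes from the discussion.

```go
package main

import "fmt"

// serverConfig is a hypothetical stand-in for one entry in an authority's
// ordered server list (index 0 has the highest priority).
type serverConfig struct {
	name       string
	hasChannel bool // true once a channel to this server has been created
}

// authority is a hypothetical, stripped-down model of an xDS authority.
type authority struct {
	servers []*serverConfig
}

// fallbackIfNeeded handles a stream error from the server at failingIdx.
// It reports whether a new channel to the next server was created.
func (a *authority) fallbackIfNeeded(failingIdx int) bool {
	next := failingIdx + 1
	if next >= len(a.servers) {
		// The failing server is the last one in the list: nothing to fall back to.
		return false
	}
	if a.servers[next].hasChannel {
		// An earlier error already triggered fallback to this server, so a
		// repeated error from the higher priority server requires no action.
		return false
	}
	a.servers[next].hasChannel = true
	return true
}

func main() {
	a := &authority{servers: []*serverConfig{
		{name: "primary", hasChannel: true},
		{name: "fallback"},
	}}
	fmt.Println(a.fallbackIfNeeded(0)) // first error from primary: fallback channel is created
	fmt.Println(a.fallbackIfNeeded(0)) // retried stream fails again: nothing to do
	fmt.Println(a.fallbackIfNeeded(1)) // fallback is last in the list: nothing to do
}
```

In this model, the repeated stream errors from the primary become no-ops because the check is against the state of the next server, not against the server that reported the error.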
xds/internal/xdsclient/authority.go
Outdated
// resource that has not yet been cached.
//
// Only executed in the context of a serializer callback.
func (a *authority) uncachedWatcherExists() bool {
nit: the name seems to convey "does any uncached watcher exist", whereas it should convey whether there is a watcher looking for an uncached resource. Maybe rename it to something like "uncachedResourceExistToWatch", or flip the bool to "allWatchableResourceCached", or simply "areAllRequestedResourceCached".
Done.
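To make the naming distinction in this thread concrete, here is a minimal sketch of a check for "some watched resource is not yet cached". The `resourceState` type, its fields, and the function name `anyWatchedResourceUncached` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// resourceState is a hypothetical per-resource record: how many watchers are
// registered for the resource and whether an update for it has been cached.
type resourceState struct {
	watchers int
	cached   bool
}

// anyWatchedResourceUncached reports whether at least one resource has a
// registered watcher but no cached copy yet. The subject of the check is the
// resource, not the watcher, which is the point raised in the review.
func anyWatchedResourceUncached(resources map[string]*resourceState) bool {
	for _, st := range resources {
		if st.watchers > 0 && !st.cached {
			return true
		}
	}
	return false
}

func main() {
	resources := map[string]*resourceState{
		"listener-a": {watchers: 1, cached: true},
		"cluster-b":  {watchers: 2, cached: false},
	}
	fmt.Println(anyWatchedResourceUncached(resources))
}
```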
xds/internal/xdsclient/authority.go
Outdated
// existing resources.
//
// Only executed in the context of a serializer callback.
func (a *authority) triggerFallbackOnStreamFailure(failingServerConfig *bootstrap.ServerConfig) {
nit: this function simply falls back and does not check for any stream failures. I guess that's done by the caller? In that case the name can simply be "triggerFallback". Also, the docstring should mention that it switches to the next fallback server in the list (i.e. the next lower priority server after the currently failing one), and that if there are no more servers below the current one, it returns as a no-op. Maybe the name can be more explicit: "switchToNextFallbackServerIfAvailable"?
Decided to go with "fallbackToNextServerIfPossible", since "Available" is a term that can be viewed differently in this context, i.e. as meaning that the server is actually available for traffic, which is not what we check here.
xds/internal/xdsclient/authority.go
Outdated
// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
state.xdsChannelConfigs = append(state.xdsChannelConfigs[:idx], state.xdsChannelConfigs[idx+1:]...)
Can this panic due to idx+1, given all these mutations?
Yeah, if it hits the last index, does the idx+1: slice expression panic, or does it just evaluate to nil or something?
Decided to go with a map instead of a slice and that significantly simplifies this piece of code.
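On the question raised in this thread: in Go, a slice expression whose low bound equals the slice length, such as `s[len(s):]`, is valid and yields an empty slice, so removing the last element with the `append` idiom on its own does not panic. The sketch below shows that, and then the map-based form of removal. The helper `removeAt` and the string-keyed set are illustrative only; the PR keys its map by channel config.

```go
package main

import "fmt"

// removeAt returns s with the element at idx removed. It is safe for the last
// index because s[len(s):] is a valid, empty slice expression in Go.
// Note that it reuses (and therefore overwrites) the backing array of s.
func removeAt(s []string, idx int) []string {
	return append(s[:idx], s[idx+1:]...)
}

func main() {
	channels := []string{"primary", "fallback"}
	channels = removeAt(channels, len(channels)-1) // idx+1 == len(channels): no panic
	fmt.Println(channels)

	// With a set keyed by the channel, removal is a single delete call, and
	// deleting entries while ranging over a map is permitted by the language.
	set := map[string]struct{}{"primary": {}, "fallback": {}}
	for name := range set {
		if name == "fallback" {
			delete(set, name)
		}
	}
	fmt.Println(len(set))
}
```

The slice idiom becomes fragile mainly when the slice is reassigned inside a `range` loop over that same slice, because the loop keeps iterating over the original length. The map form avoids index bookkeeping altogether.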
Some testing comments
LGTM
xds/internal/xdsclient/authority.go
Outdated
// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)
Nit: I still get confused by xc.xc. Can we switch the top level symbol to xcc or something?
Done, in a few other places as well. Also, I renamed the struct fields to be more descriptive.
xds/internal/xdsclient/authority.go
Outdated
// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)
delete(state.xdsChannelConfigs, xc)
These two operations can only hit once per xDS channel, right? I think you can break out of the for loop once this hits.
Done. Also, ended up inverting the conditional.
#a71-xds-fallback
Fixes #6902
This PR adds support for falling back to a lower priority server (within an authority) when a higher priority server goes down, and for reverting to a higher priority server when it comes back up. It also adds e2e style tests for the scenarios specified in the fallback interop test spec.
RELEASE NOTES: