dns + round_robin not updating address list in v1.26 #3353
Comments
If you can reliably reproduce this, can you enable logging and include the output here (or in a pastebin)? https://github.com/grpc/grpc-go#how-to-turn-on-logging
Yes very easy to repro:
It looks like I was missing something in the chain of events. I was testing a small footprint deployment and for a brief period there might be zero endpoints, thus the resolution error. This seems to be a state it won't recover from. There were no further log messages even upon subsequent retries. If I start with one replica of my service, add a second one, then remove the first:
In this case we've missed the addition of a second endpoint until the first one fails. The good news is there is no user error, but we're still not connecting to all the available endpoints as they change. Here's the 1.25.1 behavior:
(b) I start with 1 service replica, then add a second. It shows up ~30 seconds later so that I have two endpoints available:
Interesting. I think this is an important step for reproducing; I'll take a look at this.
This sounds like it's working as intended. The DNS resolver does not regularly poll as of 1.26; it only refreshes when connection errors occur. This is a behavior change from 1.25 (see #3228), which is why the new backend appears there without any error occurring. You can set
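To make the difference between the two refresh policies concrete, here is a minimal simulation using only the standard language features. It is not grpc-go code: `fakeDNS`, `client`, `Tick`, and `ConnError` are hypothetical names invented for illustration, and the model assumes that a connection error triggers re-resolution under both policies while only the 1.25-style client also refreshes on a timer.

```go
package main

// fakeDNS is a hypothetical in-memory stand-in for the DNS records of a service.
type fakeDNS struct {
	records []string
	lookups int // number of lookups performed so far
}

// Lookup returns a copy of the current records and counts the query.
func (d *fakeDNS) Lookup() []string {
	d.lookups++
	return append([]string(nil), d.records...)
}

// client caches the last resolved address list.
// poll == true models 1.25-style periodic polling;
// poll == false models 1.26-style refresh on connection errors only.
type client struct {
	dns   *fakeDNS
	addrs []string
	poll  bool
}

// newClient performs the initial resolution, as a dial would.
func newClient(dns *fakeDNS, poll bool) *client {
	return &client{dns: dns, addrs: dns.Lookup(), poll: poll}
}

// Tick models one polling interval elapsing with no connection errors.
func (c *client) Tick() {
	if c.poll {
		c.addrs = c.dns.Lookup()
	}
}

// ConnError models a connection failure, which re-resolves under both policies.
func (c *client) ConnError() {
	c.addrs = c.dns.Lookup()
}
```

In this model, adding a second record and calling `Tick` updates only the polling client; the on-error client keeps its single address until `ConnError` is called, which mirrors the scenario reported above where the second replica is missed until the first one fails.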
@travisgroth if you don't mind, could you run your scenario with zero endpoints using #3361 and confirm this fixes the problem, i.e. the ClientConn should now connect to the new endpoint when it is added. Thanks!
(a) Great! It seems to be resolved with that branch. We get a failed SC and it continues through the empty list until we get a new one that gets to a ready state:
(b) I feel like there are more corner cases that might trip over the total removal of the client regularly checking in for endpoints. Artificially closing connections regularly seems like a disruptive way of forcing a rediscovery interval. It also removes a lot of control over discovery semantics from the client. I wouldn't mind so much having to explicitly enable the feature, but being able to trigger client discovery outside of a server-initiated close sounds much more resilient. Thoughts?
@travisgroth thanks for testing! When this is merged I'll make a patch release with it. Regarding (b), this is a common concern, however I don't think it's a problem in practice. Please see grpc/grpc#12295 (comment) (and several other comments in that issue) for more discussion on this topic. Note that HTTP/1 recreates connections far more often, so cycling these connections periodically (every 20 minutes or so) shouldn't be a major impact. If you need more responsive discovery, using something other than DNS that doesn't need to be polled at all would be recommended.
I appreciate the quick response on this! After reviewing the thread, I see the reasoning. I'm happy to sideline the DNS discussion for the moment and see how the settings work in practice.
What version of gRPC are you using?
1.26
What version of Go are you using (go version)?
1.13.4
What operating system (Linux, Windows, …) and version?
Linux / Kubernetes / Docker - various.
What did you do?
When using grpc.WithBalancerName(roundrobin.Name), with a dns:/// Dial target, errors do not trigger re-resolve.
Detail: a client starts with dns:///my-service as a dial target and the round_robin load balancer. my-service DNS returns an IP 192.168.0.19. Due to an operational change, my-service is moved and DNS is updated to return 192.168.0.20. The service stops running on 192.168.0.19, refusing connections.
This is approximately the workflow that would occur when using Kubernetes and a headless service if a pod is replaced. This happens for a variety of reasons, including routine updates and rescheduling.
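As a rough illustration of why a stale address list matters here, the sketch below shows a simplified round-robin picker that only knows the addresses it was last given. It is not the grpc-go round_robin implementation; `rrPicker`, `Update`, and `Pick` are hypothetical names, and the picker stands in for whatever the balancer does with resolver updates.

```go
package main

import "sync"

// rrPicker is a hypothetical, simplified stand-in for a round_robin balancer.
// It rotates over the most recent address list it has been given.
type rrPicker struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// Update replaces the address list, as a resolver update would.
func (p *rrPicker) Update(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.addrs = append([]string(nil), addrs...)
	p.next = 0
}

// Pick returns the next address in rotation, or false if the list is empty.
func (p *rrPicker) Pick() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.addrs) == 0 {
		return "", false
	}
	addr := p.addrs[p.next%len(p.addrs)]
	p.next++
	return addr, true
}
```

If the picker is never given the new record, every `Pick` keeps returning 192.168.0.19 even after the service has moved to 192.168.0.20, which is the persistent failure described in this report.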
What did you expect to see?
(a) Upon receiving a connection refused, DNS should be re-queried for an updated endpoint list
(b) Without receiving an error from an existing endpoint, the load balancer should be repopulated occasionally with the latest records in DNS as endpoint sets change
This was the behavior previous to 1.26. By rolling code back to grpc 1.25.1, we were able to restore this behavior.
#3165 is likely the culprit.
What did you see instead?
(a) Endpoint list is never re-queried and the client simply throws a persistent error:
(b) Since every subsequent request fails, the endpoint list does not seem to be getting updated with new records in DNS at all.
Additional
I recognize the APIs around DNS and load balancing may have changed recently (and are evolving), but I'm not seeing any clear indication of how to retain the expected behavior.