Fix load balancing of pilot connections#20957
Conversation
🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test? Courtesy of your friendly test nag.

/retest

1 similar comment

/retest
| "keepalive_time": 300 | ||
| } | ||
| }, | ||
| "max_requests_per_connection": 1, |
Sorry, I did not fully understand how this will help. Are you saying that when pilot's gRPC server disconnects this Envoy, it reconnects on the same connection, and that is why max age is not working? We do not have this setting and we see proxies disconnected at +/- 30 mins regularly.
I do not know the exact problem yet, but please note that envoyproxy/envoy#9668 changed the HTTP/2 connection behaviour. IIUC, with that change there will be multiple connections to the same host, and even if max requests per connection is hit, Envoy might use another connection to the same host from the pool. Read the circuit-breaking doc updates in that PR to see if it still works the way you are thinking.
see the linked comment in the PR description. As far as I can tell this only works today if you are connecting to pilot over plain text (port 15010); everything else is broken
Ok. AFAICT, this should not depend on whether you are on plain text or TLS. Probably the change you made to convert to gRPC should take care of that, and this is not needed?
there's 3 cases:

- direct to pilot, using the grpc stack - works
- with a pilot sidecar - does not work; this happens when using TLS on 1.4 and earlier
- direct to pilot, using the net/http stack - does not work; this happens when 1.4 proxies connect to a 1.5 pilot over TLS only, but in no other cases. Technically we could make that case use the grpc stack as well, we just have not yet

so it's not necessarily about TLS or not; that just determines other factors which impact this
case 3 is broken for all standard 1.5 use cases without this PR as well, as otherwise the net/http stack is always used
the max requests per connection setting is needed for the case with a pilot sidecar. Maybe we don't care about that since we removed the sidecar by default in 1.5
Ok. Thanks for the details. We use the 2nd case, with TLS on 1.5 master, and it seems proxies are getting disconnected properly. But if you have tested and this flag does not have any side effects, we can leave it as is.
grpcServer          *grpc.Server
secureHTTPServer    *http.Server
secureGRPCServer    *grpc.Server
secureHTTPServerDNS *http.Server
You are removing this server, which is orthogonal to the LB issue, I think. But looks good.
it's not orthogonal, I just didn't explain it well. Right now we serve HTTP and gRPC over the same port. To do that we use the net/http stack instead of the grpc stack, and net/http doesn't support max connection age.
Yep, then the same should apply to secureHTTPServer?
secureHTTPServer is mostly legacy that will be removed soon, so I didn't want to touch it too much, but I can if we are ok with this approach and want the same applied there?
Do we have any e2e or integration tests for this?

We do not. This seems like a challenging thing to add this sort of test for; do you have any ideas of how to do this without a ton of overhead? We have great coverage for this in our stress testing, which is going to be more realistic than any e2e test (assuming we don't run e2e tests for a day).
Agree, e2e is hard for this case. Monitoring like this https://snapshot.raintank.io/dashboard/snapshot/XVhdky21rBQX5nZW4rj20MvAHmL3Aq1i?orgId=2&fullscreen&panelId=47 is relatively convincing.

I have run this for a few days now and the balancing is working correctly. Is there anything else needed?

@howardjohn Thanks, that is great. LGTM

/cherrypick release-1.5
@howardjohn: new pull request created: #21126

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
* Fix load balancing of pilot connections

  Context: istio#11181 (comment)
* fix repetitive code
* Update goldens
Context:
#11181 (comment)

Basically, this moves back to the grpc stack, and adds a max of 1 request per connection to force envoy to rebalance.