
Fix load balancing of pilot connections #20957

Merged: istio-testing merged 3 commits into istio:master from howardjohn:pilot/cmux on Feb 14, 2020

Conversation

@howardjohn (Member)

Context:
#11181 (comment)

Basically, this moves back to the gRPC stack, and adds a max of 1 request per connection to force Envoy to rebalance.
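For reference, the proxy-side half of this lands in the Envoy bootstrap's cluster config for the pilot connection. A hedged fragment (field names per Envoy's cluster config; the cluster name and type here are illustrative, not necessarily Istio's exact values):

```json
{
  "name": "xds-grpc",
  "type": "STRICT_DNS",
  "upstream_connection_options": {
    "tcp_keepalive": {
      "keepalive_time": 300
    }
  },
  "max_requests_per_connection": 1
}
```

With max_requests_per_connection set to 1, Envoy opens a fresh upstream connection per request, so an L4 load balancer in front of pilot gets to re-pick a backend each time.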

@howardjohn howardjohn requested review from a team as code owners February 7, 2020 22:05
@istio-policy-bot

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Feb 7, 2020
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 7, 2020
@howardjohn howardjohn requested a review from a team as a code owner February 7, 2020 23:28
@howardjohn (Member, Author)

/retest

1 similar comment
@howardjohn (Member, Author)

/retest

"keepalive_time": 300
}
},
"max_requests_per_connection": 1,
Contributor:

Sorry, I did not fully understand how this will help. Are you saying that when pilot's gRPC server disconnects this Envoy, it is reconnecting on the same connection, and that is why max age is not working? We do not have this, and we see it is disconnected at +/- 30 mins regularly.

@ramaraochavali (Contributor) commented Feb 8, 2020:

I do not know the exact problem yet, but please note that envoyproxy/envoy#9668 changed the HTTP/2 connection behaviour. IIUC, with that change there will be multiple connections to the same host, and even if max requests per connection is hit, it might use another connection to the same host from the pool. Read the circuit breaking doc updates in that PR to see if it still works the way you are thinking/tried.

@howardjohn (Member, Author):

See the linked comment in the PR description. As far as I can tell, this only works today if you are connecting to pilot over plain text (port 15010); everything else is broken.

Contributor:

Ok. AFAICT, this should not depend on whether you are on plain text or TLS. Probably the change you made to convert to gRPC should take care of that, and this is not needed?

@howardjohn (Member, Author):

There are 3 cases:

  1. Direct to pilot, using the gRPC stack - works.
  2. With a pilot sidecar - does not work. This happens when using TLS on 1.4 and earlier.
  3. Direct to pilot, using the net/http stack - does not work. This happens when 1.4 proxies connect to a 1.5 pilot over TLS only, but no other cases. Technically we could make that case use the gRPC stack as well; we just have not yet.

So it's not necessarily about TLS or not; that just determines other factors which impact this.

@howardjohn (Member, Author):

Case 3 is broken for all standard 1.5 use cases without this PR as well, as otherwise the net/http stack is always used.

@howardjohn (Member, Author):

The max requests per connection is needed for the case with a pilot sidecar. Maybe we don't care about that, since we removed the sidecar by default in 1.5.

Contributor:

Ok. Thanks for the details. We use the 2nd case and use TLS with 1.5 master, and it seems proxies are getting disconnected properly. But if you have tested and this flag does not have any side effects, we can leave it as is.

grpcServer *grpc.Server
secureHTTPServer *http.Server
secureGRPCServer *grpc.Server
secureHTTPServerDNS *http.Server
Member:

You are removing this server, which is orthogonal to the LB issue, I think.

But looks good.

@howardjohn (Member, Author):

It's not orthogonal; I just didn't explain it well. Right now we serve HTTP and gRPC over the same port. To do that we use the net/http stack instead of the gRPC stack, which doesn't support max connection age.

Member:

Yep, then the same should apply to secureHTTPServer?

@howardjohn (Member, Author):

secureHTTPServer is mostly legacy that will be removed soon, so I didn't want to touch it too much. But I can, if we are OK with this approach and want the same applied there?

@ayj (Contributor) commented Feb 10, 2020:

Do we have any e2e or integration tests for this?

@howardjohn (Member, Author)

Do we have any e2e or integration tests for this?

We do not. This seems like a challenging thing to add this sort of test for; do you have any ideas of how to do it without a ton of overhead? We have great coverage for this in our stress testing, which is going to be more realistic than any e2e test (assuming we don't run e2e tests for a day).

@hzxuzhonghu (Member)

Agree, e2e is hard for this case. Monitoring like this https://snapshot.raintank.io/dashboard/snapshot/XVhdky21rBQX5nZW4rj20MvAHmL3Aq1i?orgId=2&fullscreen&panelId=47 is relatively convincing.

@howardjohn (Member, Author)

I have run this for a few days now and the balancing is working correctly. Is there anything else needed?

@ramaraochavali (Contributor)

@howardjohn Thanks, that is great. LGTM

@istio-testing istio-testing merged commit 6bff985 into istio:master Feb 14, 2020
@howardjohn (Member, Author)

/cherrypick release-1.5

@istio-testing (Collaborator)

@howardjohn: new pull request created: #21126


In response to this:

/cherrypick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sdake pushed a commit to sdake/istio that referenced this pull request Feb 21, 2020
* Fix load balancing of pilot connections

Context:
istio#11181 (comment)

* fix repetitive code

* Update goldens
9 participants