
Fix load balancing of pilot connections #20957

Merged: istio-testing merged 3 commits into istio:master from howardjohn:pilot/cmux on Feb 14, 2020

Conversation

@howardjohn (Member)

Context:
#11181 (comment)

Basically, this moves back to the gRPC stack, and adds a max of 1 request per connection to force Envoy to rebalance.
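For reference, the proxy-side half of this lands in the Envoy bootstrap's cluster config for the pilot connection. A hedged fragment (field names per Envoy's cluster config; the cluster name and type here are illustrative, not necessarily Istio's exact values):

```json
{
  "name": "xds-grpc",
  "type": "STRICT_DNS",
  "upstream_connection_options": {
    "tcp_keepalive": {
      "keepalive_time": 300
    }
  },
  "max_requests_per_connection": 1
}
```

With max_requests_per_connection set to 1, Envoy opens a fresh upstream connection per request, so an L4 load balancer in front of pilot gets to re-pick a backend each time.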

@howardjohn howardjohn requested review from a team as code owners February 7, 2020 22:05
@istio-policy-bot

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Feb 7, 2020
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 7, 2020
@howardjohn howardjohn requested a review from a team as a code owner February 7, 2020 23:28
@howardjohn (Member, Author)

/retest

1 similar comment
@howardjohn (Member, Author)

/retest

"keepalive_time": 300
}
},
"max_requests_per_connection": 1,
Contributor:

Sorry, I did not fully understand how this will help. Are you saying that when pilot's gRPC server disconnects this Envoy, it is reconnecting on the same connection, and that is why max age is not working? We do not have this, and we see it is disconnected at +/- 30 mins regularly.

@ramaraochavali (Contributor) commented Feb 8, 2020:

I do not know the exact problem yet, but please note that envoyproxy/envoy#9668 changed the HTTP/2 connection behaviour. IIUC, with that change there will be multiple connections to the same host, and even if max requests per connection is hit, it might use another connection to the same host from the pool. Read the circuit breaking doc updates in that PR to see if it still works the way you are thinking/tried.

@howardjohn (Member, Author):

See the linked comment in the PR description. As far as I can tell, this only works today if you are connecting to pilot over plain text (port 15010); everything else is broken.

Contributor:

Ok. AFAICT, this should not depend on whether you are on plain text or TLS. Probably the change you made to convert to gRPC should take care of that, and this is not needed?

@howardjohn (Member, Author):

There are 3 cases:

  1. Direct to pilot, using the gRPC stack - works.
  2. With a pilot sidecar - does not work. This happens when using TLS on 1.4 and earlier.
  3. Direct to pilot, using the net/http stack - does not work. This happens when 1.4 proxies connect to a 1.5 pilot over TLS only, but no other cases. Technically we could make that case use the gRPC stack as well; we just have not yet.

So it's not necessarily about TLS or not; that just determines other factors which impact this.

@howardjohn (Member, Author):

Case 3 is broken for all standard 1.5 use cases without this PR as well, as otherwise the net/http stack is always used.

@howardjohn (Member, Author):

The max requests per connection is needed for the case with a pilot sidecar. Maybe we don't care about that, since we removed the sidecar by default in 1.5.

Contributor:

Ok. Thanks for the details. We use the 2nd case and use TLS with 1.5 master, and it seems proxies are getting disconnected properly. But if you have tested and this flag does not have any side effects, we can leave it as is.

grpcServer *grpc.Server
secureHTTPServer *http.Server
secureGRPCServer *grpc.Server
secureHTTPServerDNS *http.Server
Member:

You are removing this server, which is orthogonal to the LB issue, I think.

But looks good.

@howardjohn (Member, Author):

It's not orthogonal; I just didn't explain it well. Right now we serve HTTP and gRPC over the same port. To do that we use the net/http stack instead of the gRPC stack, which doesn't support max connection age.

Member:

Yep, then the same should apply to secureHTTPServer?

@howardjohn (Member, Author):

secureHTTPServer is mostly legacy that will be removed soon, so I didn't want to touch it too much. But I can, if we are OK with this approach and want the same applied there?

@ayj (Contributor) commented Feb 10, 2020:

Do we have any e2e or integration tests for this?

@howardjohn (Member, Author)

Do we have any e2e or integration tests for this?

We do not. This seems like a challenging thing to add this sort of test for; do you have any ideas of how to do it without a ton of overhead? We have great coverage for this in our stress testing, which is going to be more realistic than any e2e test (assuming we don't run e2e tests for a day).

@hzxuzhonghu (Member)

Agree, e2e is hard for this case. Monitoring like this https://snapshot.raintank.io/dashboard/snapshot/XVhdky21rBQX5nZW4rj20MvAHmL3Aq1i?orgId=2&fullscreen&panelId=47 is relatively convincing.

@howardjohn (Member, Author)

I have run this for a few days now and the balancing is working correctly. Is there anything else needed?

@ramaraochavali (Contributor)

@howardjohn Thanks, that is great. LGTM

@istio-testing istio-testing merged commit 6bff985 into istio:master Feb 14, 2020
@howardjohn (Member, Author)

/cherrypick release-1.5

@istio-testing (Collaborator)

@howardjohn: new pull request created: #21126


In response to this:

/cherrypick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sdake pushed a commit to sdake/istio that referenced this pull request Feb 21, 2020
* Fix load balancing of pilot connections

Context:
istio#11181 (comment)

* fix repetitive code

* Update goldens
9 participants