hubble-relay: persist connections to hubble peers #11335
Conversation
This commit updates hubble-relay so that, instead of opening and closing a connection to every Hubble peer for every gRPC request, connections are persisted. I see two main approaches to this:

1. The optimistic approach (the one this commit implements): a new connection is established when a peer is added or updated. When the connection cannot be established, a new attempt is made at the next client request, without blocking the client request itself.
2. The pessimistic approach: a new connection is established when a peer is added or updated. For each peer, a goroutine ensures that the connection is healthy by issuing a periodic heartbeat and attempts to (re)connect if need be.

With the optimistic approach, a client request might end up with incomplete data; however, there isn't much overhead. The pessimistic approach limits the problem of incomplete data (without eliminating it). On the other hand, it generates heartbeat traffic, which would be non-negligible in a 5000-node cluster, and it forces the use of goroutines to track the connection status of every peer. It would also add code complexity (for instance, the connection-monitoring goroutine for a node has to be taken down when that node is deleted).

Given the above, I chose option 1.

Signed-off-by: Robin Hahling <[email protected]>
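The optimistic approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `peerManager`, `upsertPeer`, and `connFor` are hypothetical names, and `conn`/`dialFunc` stand in for a real `*grpc.ClientConn` and dialer so the sketch runs without a gRPC server.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn stands in for a *grpc.ClientConn in this sketch.
type conn struct{ addr string }

// dialFunc is a pluggable dialer so the sketch is testable without a server.
type dialFunc func(addr string) (*conn, error)

// peerManager persists one connection per peer (the optimistic approach):
// it dials when a peer is added or updated and, if that fails, records a
// nil connection; the next request retries the dial.
type peerManager struct {
	mu    sync.Mutex
	dial  dialFunc
	conns map[string]*conn
}

func newPeerManager(dial dialFunc) *peerManager {
	return &peerManager{dial: dial, conns: make(map[string]*conn)}
}

// upsertPeer is called when the peer service reports an added/updated peer.
func (m *peerManager) upsertPeer(addr string) {
	c, err := m.dial(addr) // best effort; a failure is tolerated
	m.mu.Lock()
	defer m.mu.Unlock()
	if err != nil {
		m.conns[addr] = nil // remember the peer, retry at request time
		return
	}
	m.conns[addr] = c
}

// connFor returns the persisted connection for a peer, retrying the dial if
// the earlier attempt failed. The retry happens inline here; the idea in the
// PR is that a failed retry does not block the client request, which simply
// proceeds with incomplete data.
func (m *peerManager) connFor(addr string) (*conn, error) {
	m.mu.Lock()
	c := m.conns[addr]
	m.mu.Unlock()
	if c != nil {
		return c, nil
	}
	c, err := m.dial(addr)
	if err != nil {
		return nil, err // request proceeds without this peer's data
	}
	m.mu.Lock()
	m.conns[addr] = c
	m.mu.Unlock()
	return c, nil
}

func main() {
	fail := true
	dial := func(addr string) (*conn, error) {
		if fail {
			return nil, errors.New("peer unreachable")
		}
		return &conn{addr: addr}, nil
	}
	m := newPeerManager(dial)
	m.upsertPeer("10.0.0.1:4244") // first dial fails; not fatal
	if _, err := m.connFor("10.0.0.1:4244"); err != nil {
		fmt.Println("request saw incomplete data:", err)
	}
	fail = false                         // peer becomes reachable again
	c, err := m.connFor("10.0.0.1:4244") // retried lazily, now succeeds
	fmt.Println(c != nil, err == nil)
}
```

Note how the failed dial never blocks or aborts a request: the worst case is a response missing that peer's data, which is exactly the trade-off accepted by option 1.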
I think the approach is fine overall. I have not found any obvious bugs or races.
The only disadvantage I see is that if you have an unresponsive peer, you will have multiple goroutines trying to connect to it at the same time (assuming you are handling multiple GetFlows requests), which is not great. See the comment for details.
@rolinh another option is to let the grpc library handle keep-alive: https://godoc.org/google.golang.org/grpc#WithKeepaliveParams. The code itself looks totally fine to me, but this kind of pull request is difficult for me to approve or disapprove without actual measurements.

Doing this basically means picking the pessimistic approach.
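For reference, the library-level keep-alive suggested above would look roughly like this. This is a configuration sketch only: `dialPeer` is a hypothetical helper, the durations are illustrative rather than recommended values, and TLS setup is omitted.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialPeer dials a Hubble peer with client-side keep-alive enabled, so the
// gRPC library itself pings idle connections instead of the application
// running its own heartbeat goroutines.
func dialPeer(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // TLS setup omitted in this sketch
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s of inactivity
			Timeout:             10 * time.Second, // wait 10s for the ping ack
			PermitWithoutStream: true,             // ping even with no active RPCs
		}))
}
```

As the reply notes, this still amounts to the pessimistic approach: every idle connection generates periodic ping traffic, which scales with cluster size.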
In terms of the overall approach, there is still a bigger discussion to be had on how long we want to persist connections. It's a trade-off: We can either keep connections open for as long as possible (which means we might waste resources if they end up not being used), or we close them after some idle-time (which means we will impose significant latency on the client requests which have to re-establish all connections again).
It's not clear to me yet what exactly this trade-off should look like; however, I think it's clear that we do not want to create a connection per node per request. We do want to be able to re-use connections between requests. In that regard, this PR is a clear step forward over the status quo.
Therefore, I am approving this, as the code itself looks solid to me. I'm assuming there will be follow-ups, such as notifying clients about unresponsive peers (#11360) and eventually a policy implementing the aforementioned trade-off.
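The "close after some idle-time" side of the trade-off described above could be sketched like this. All names here are hypothetical (this policy is not implemented in the PR), and the millisecond durations are shortened purely so the example runs quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// idlePool reuses connections across requests but reaps any that have been
// unused for longer than maxIdle, trading reconnect latency for resources.
type idlePool struct {
	mu       sync.Mutex
	lastUsed map[string]time.Time
	maxIdle  time.Duration
}

func newIdlePool(maxIdle time.Duration) *idlePool {
	return &idlePool{lastUsed: make(map[string]time.Time), maxIdle: maxIdle}
}

// touch marks a peer connection as just used by a request.
func (p *idlePool) touch(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.lastUsed[addr] = time.Now()
}

// reap drops connections idle longer than maxIdle and returns how many were
// dropped. A real pool would call conn.Close() instead of just forgetting.
func (p *idlePool) reap(now time.Time) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := 0
	for addr, t := range p.lastUsed {
		if now.Sub(t) > p.maxIdle {
			delete(p.lastUsed, addr)
			n++
		}
	}
	return n
}

func main() {
	p := newIdlePool(50 * time.Millisecond)
	p.touch("peer-a")
	p.touch("peer-b")
	time.Sleep(80 * time.Millisecond)
	p.touch("peer-b") // peer-b stays warm; peer-a has gone idle
	fmt.Println(p.reap(time.Now()))
}
```

Only peer-a exceeds the idle threshold, so the reaper drops exactly one connection; the next request to peer-a would pay the reconnect latency mentioned above.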
test-me-please
Fixes #11229.
Note: I hope someone comes up with an option 3 that would be the best of both worlds :)