
hubble-relay: persist connections to hubble peers #11335

Merged
merged 1 commit into from
May 6, 2020

Conversation

rolinh
Member

@rolinh rolinh commented May 5, 2020

This commit updates hubble-relay so that, instead of opening and closing
a connection to every Hubble peer for every gRPC request, connections
are persisted.

I see two main approaches to this:

1. The optimistic approach (the one this commit implements):
   A new connection is established when a peer is added or updated. When
   the connection cannot be established, a new attempt is made at the
   next client request, without blocking the client request itself.
2. The pessimistic approach:
   A new connection is established when a peer is added or updated. For
   each peer, a goroutine ensures that the connection is healthy by
   issuing a periodic heartbeat and attempts to (re)connect if need be.

With the optimistic approach, a client request might end up with
incomplete data. However, this approach carries little overhead. The
pessimistic approach reduces the problem of incomplete data (without
eliminating it). On the other hand, it generates heartbeat traffic,
which would be non-negligible in a 5,000-node cluster. It also requires
a goroutine to track the connection status of every peer, and induces
more code complexity (for instance, the corresponding connection
monitoring goroutine has to be taken down when a node is deleted).
Given the above, I chose option 1.

Fixes #11229.

Note: I hope someone comes up with an option 3 that would be the best of both worlds :)

@rolinh rolinh added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels May 5, 2020
Signed-off-by: Robin Hahling <[email protected]>
@coveralls

coveralls commented May 5, 2020

Coverage Status

Coverage increased (+0.01%) to 44.433% when pulling d059d61 on pr/rolinh/hubble-relay-persist-conections into 1a77f03 on master.

@rolinh rolinh force-pushed the pr/rolinh/hubble-relay-persist-conections branch from f8d4196 to d059d61 on May 5, 2020 12:12
@rolinh rolinh marked this pull request as ready for review May 5, 2020 12:14
@rolinh rolinh requested a review from a team May 5, 2020 12:14
Member

@gandro gandro left a comment


I think the approach is fine overall. I have not found any obvious bugs or races.

The only disadvantage I see is that if you have an unresponsive peer, you will have multiple goroutines trying to connect to it at the same time (assuming you are handling multiple GetFlows requests), which is not great. See the comment for details.

pkg/hubble/relay/observer.go (review thread, resolved)
@michi-covalent
Contributor

@rolinh another option is to let the grpc library handle keep-alive: https://godoc.org/google.golang.org/grpc#WithKeepaliveParams

the code itself looks totally fine to me, but this kind of pull request is difficult for me to approve or disapprove without actual measurements.

@rolinh
Member Author

rolinh commented May 6, 2020

@rolinh another option is to let the grpc library handle keep-alive: https://godoc.org/google.golang.org/grpc#WithKeepaliveParams

Doing this basically means picking the pessimistic approach.

Member

@gandro gandro left a comment


In terms of the overall approach, there is still a bigger discussion to be had on how long we want to persist connections. It's a trade-off: We can either keep connections open for as long as possible (which means we might waste resources if they end up not being used), or we close them after some idle-time (which means we will impose significant latency on the client requests which have to re-establish all connections again).

It's not clear to me yet what exactly this trade-off should look like - however, I think it's clear that we do not want to create a connection per node per request. We do want to be able to re-use connections between requests. In that regard, this PR is a clear step forward over the status quo.

Therefore, I am approving this, as the code itself looks solid to me. I'm assuming there will be follow-ups, such as us notifying clients about unresponsive peers (#11360) and eventually implementing a policy which implements the aforementioned trade-off.

@rolinh
Copy link
Member Author

rolinh commented May 6, 2020

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2020
@rolinh rolinh merged commit 854f625 into master May 6, 2020
@rolinh rolinh deleted the pr/rolinh/hubble-relay-persist-conections branch May 6, 2020 16:24