
hubble-relay: persist connections to hubble peers #11335

Merged
merged 1 commit into from
May 6, 2020

Conversation

rolinh
Member

@rolinh rolinh commented May 5, 2020

This commit updates hubble-relay so that, instead of opening and closing
a connection to every Hubble peer for every gRPC request, connections
are persisted.

I see two main approaches to this:

1. The optimistic approach (the one this commit implements):
   A new connection is established when a peer is added or updated. When
   the connection cannot be established, a new attempt is made at the
   next client request, without blocking the client request itself.
2. The pessimistic approach:
   A new connection is established when a peer is added or updated. For
   each peer, a goroutine ensures that the connection is healthy by
   issuing a periodic heartbeat and attempts to (re)connect if need be.

With the optimistic approach, a client request might end up with
incomplete data. However, this approach carries little overhead. The
pessimistic approach reduces the problem of incomplete data (without
eliminating it). On the other hand, it generates heartbeat traffic,
which would be non-negligible in a 5,000-node cluster. It also requires
a goroutine to track the connection status of every peer, and induces
more code complexity (for instance, the corresponding connection
monitoring goroutine has to be taken down when a node is deleted).
Given the above, I chose option 1.

Fixes #11229.

Note: I hope someone comes up with an option 3 that would be the best of both worlds :)

@rolinh rolinh added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels May 5, 2020
Signed-off-by: Robin Hahling <[email protected]>
@coveralls

coveralls commented May 5, 2020

Coverage Status

Coverage increased (+0.01%) to 44.433% when pulling d059d61 on pr/rolinh/hubble-relay-persist-conections into 1a77f03 on master.

@rolinh rolinh force-pushed the pr/rolinh/hubble-relay-persist-conections branch from f8d4196 to d059d61 on May 5, 2020 12:12
@rolinh rolinh marked this pull request as ready for review May 5, 2020 12:14
@rolinh rolinh requested a review from a team May 5, 2020 12:14
Member

@gandro gandro left a comment


I think the approach is fine overall. I have not found any obvious bugs or races.

The only disadvantage I see is that if you have an unresponsive peer, you will have multiple goroutines trying to connect to it at the same time (assuming you are handling multiple GetFlows requests), which is not great. See the comment for details.

pkg/hubble/relay/observer.go (review thread, resolved)
@michi-covalent
Contributor

@rolinh another option is to let the grpc library handle keep-alive: https://godoc.org/google.golang.org/grpc#WithKeepaliveParams

the code itself looks totally fine to me, but this kind of pull request is difficult for me to approve or disapprove without actual measurements.

@rolinh
Member Author

rolinh commented May 6, 2020

@rolinh another option is to let the grpc library handle keep-alive: https://godoc.org/google.golang.org/grpc#WithKeepaliveParams

Doing this basically means picking the pessimistic approach.

Member

@gandro gandro left a comment


In terms of the overall approach, there is still a bigger discussion to be had on how long we want to persist connections. It's a trade-off: We can either keep connections open for as long as possible (which means we might waste resources if they end up not being used), or we close them after some idle-time (which means we will impose significant latency on the client requests which have to re-establish all connections again).

It's not clear to me yet what exactly this trade-off should look like - however, I think it's clear that we do not want to create a connection per node per request. We do want to be able to re-use connections between requests. In that regard, this PR is a clear step forward over the status quo.

Therefore, I am approving this, as the code itself looks solid to me. I'm assuming there will be follow-ups, such as us notifying clients about unresponsive peers (#11360) and eventually implementing a policy which implements the aforementioned trade-off.

@rolinh
Copy link
Member Author

rolinh commented May 6, 2020

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 6, 2020
@rolinh rolinh merged commit 854f625 into master May 6, 2020
@rolinh rolinh deleted the pr/rolinh/hubble-relay-persist-conections branch May 6, 2020 16:24