-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove deprecated gRPC Dial
calls
#2676
Conversation
Dial
calls
064de73
to
e409abd
Compare
✅ Deploy Preview for tetragon ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
e409abd
to
dca9dcc
Compare
Nice, taking a look shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DialContext got deprecated, so we switch to NewClient that does not
connect to the server, it waits for the first RPC to be called to
establish a connection, thus the retry mechanism at connect no longer
make sense. It could be reimplemented as a retry helper for the RPC
call.
This makes sense to me. Assuming this is true and we establish a new connection on the first RPC call (and not a new connection on every subsequent call), why don't we have the client call a healthcheck RPC or similar (I think we already have one) and put the old retry logic in there?
cmd/tetra/common/flags.go
Outdated
func ResolveServerAddress() (string, error) { | ||
if ServerAddress == "" { | ||
sa, err := readActiveServerAddressFromFile(defaults.InitInfoFile) | ||
|
||
if err != nil { | ||
logger.GetLogger().WithError(err).WithField( | ||
"defaultServerAddress", defaultServerAddress, | ||
).Debug("failed to resolve server address reading init info file, using default value") | ||
return defaultServerAddress, nil | ||
} | ||
|
||
logger.GetLogger().WithFields(logrus.Fields{ | ||
"InitInfoFile": defaults.InitInfoFile, | ||
"ServerAddress": sa, | ||
}).Debug("resolved server address using info file") | ||
return sa, nil | ||
} | ||
|
||
return ServerAddress, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're always returning nil here. Perhaps we could consider removing the error return in the type signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed thanks
I would not do RPC checks that are not asked by the user, it would slow down and would be unexpected if it breaks no? I would maybe do the retry on the request itself or just not do it and let people use retry mechanism on |
I don't have super strong feelings one way or another, but I do think it would be nice to support retries. Especially since we've had a lot of problems with connection timeouts in tetra in the past. |
dca9dcc
to
c811061
Compare
This is an idea on how it could be done, c811061. It would need a rewrite for those using I'll see if I can make work gRPC build-in retry mechanism before trying to extend this. However this work as expected and works along interrupt and timeout which is nice (it wasn't the case before). |
c811061
to
a696b32
Compare
Okay! I spent some amount of time to satisfy you on this one, I deep dive into their built-in backoff mechanism, see a696b32. Now it should work and it's transparent! |
That's amazing 😮 |
a696b32
to
06b85c9
Compare
The goal is to remove the connect function later on when dropping the deprecated DialContext usage in favor of new NewClient. Signed-off-by: Mahe Tardy <[email protected]>
Switch from grpc.DialContext to grpc.NewClient which does not perform any I/O and wait lazily for the first RPC call to be made to connect. This is the way to go as DialContext is getting deprecated and the gRPC team has been writing that using the other is an anti-pattern. https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md Signed-off-by: Mahe Tardy <[email protected]>
DialContext got deprecated, so we switch to NewClient that does not connect to the server, it waits for the first RPC to be called to establish a connection, thus the retry mechanism at connect no longer make sense. It could be reimplemented as a retry helper for the RPC call. Signed-off-by: Mahe Tardy <[email protected]>
Reuse the machinery created in tetra/common to use grpc.NewClient instead of the deprecated DialContext. Signed-off-by: Mahe Tardy <[email protected]>
Here we keep the previous behavior by waiting for the connection with WaitForStateChange instead of connecting at a later stage when performing a RPC. Signed-off-by: Mahe Tardy <[email protected]>
Signed-off-by: cilium-renovate[bot] <134692979+cilium-renovate[bot]@users.noreply.github.com>
gRGC A6 - gRPC Retry Design (a.k.a. built in backoff retry) https://github.com/grpc/proposal/blob/master/A6-client-retries.md was implemented by grpc/grpc-go#2111 but unusable for a long time since maxAttempts was limited to hardcoded 5 (grpc/grpc-go#4615), recent PR fixed that grpc/grpc-go#7229. It's transparent to the user, to see it in action, make sure the gRPC server is unreachable (do not start tetragon for example), run tetra with: GRPC_GO_LOG_SEVERITY_LEVEL=warning <tetra cmd> Note that logs don't always have the time to be pushed before exit so output might be a bit off but the number of retries is respected (you can debug or synchronously print in the grpc/stream.c:shouldRetry or :withRetry to verify). Also note that the final backoff duration is completely random and chosen between 0 and the final duration that was computed via to the params: https://github.com/grpc/grpc-go/blob/v1.65.0/stream.go#L702 Signed-off-by: Mahe Tardy <[email protected]>
06b85c9
to
f5a92b6
Compare
Fixes #2675.
Consider also that now with those patches, the
retries
flag is no longer doing anything, since it was only retrying on the initial connection (whiletimeout
now properly works on all calls). This could be a follow-up PR I think: remove it or add a helper to perform retry on the actual call in case of gRPC errors (and not the actual underlying method error)?