fix: Do not attempt relogin on remote BadParameter errors#36866
codingllama merged 5 commits into master
|
Created as a draft to gather reviewer feedback, as it's hard to say what kinds of errors could fall into this. Backport label only to v15 for the same reason. |
|
FWIW, the check on The retry logic in Connect uses a very primitive check on the error message because we're yet to implement the more sophisticated check (https://github.com/gravitational/teleport.e/issues/853): teleport/web/packages/teleterm/src/ui/utils/retryWithRelogin.ts Lines 95 to 102 in b37d06a So far we haven't seen any problems due to Connect not checking Overall I think we should do this. I wish we made that change before the test plan, but at this point it's better to make it today than tomorrow. |
|
OK, I figured out what happened. Since #30578 we automatically translate gRPC errors into trace errors. Before that commit, server-side failures materialized as status.Error, which didn't trigger relogin. With the interceptors in place, a good chunk of server-side errors are now (correctly) handled as BadParameter, and those do trigger relogin. It's hard to say what the original condition was meant to handle. Maybe we had "local" lib/client BadParameters that we wanted to retry.
Yep, this is my stance too, especially after the bisect adventure above. |
There are several HTTP calls that will return teleport/api/client/alpn_conn_upgrade.go Line 253 in a0dbc26 Line 158 in cf2c705 |
fec0466 to 71764cf
|
I've pushed a change that marks errors originating from the interceptors and avoids retrying those. This should move us closer to the pre-#30578 behavior and cause fewer side effects. Some brave soul can then try to remove retries for all BadParameters another time. PTAL. |
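A minimal sketch of that idea, using only the standard library. The BadParameterError and RemoteError types below are simplified stand-ins for trace.BadParameterError and interceptors.RemoteError, the function name and error messages are made up for illustration, and the real IsErrorResolvableWithRelogin likely checks more conditions than this.

```go
package main

import (
	"errors"
	"fmt"
)

// BadParameterError is a simplified stand-in for trace.BadParameterError.
type BadParameterError struct{ Message string }

func (e *BadParameterError) Error() string { return e.Message }

// RemoteError marks an error as having come from the server side.
// It mirrors the idea of interceptors.RemoteError; the real type may differ.
type RemoteError struct{ Err error }

func (e *RemoteError) Error() string { return e.Err.Error() }
func (e *RemoteError) Unwrap() error { return e.Err }

// isResolvableWithRelogin (hypothetical name) reports whether a relogin
// might fix err. Anything marked as remote is never retried, even if it
// wraps a BadParameterError.
func isResolvableWithRelogin(err error) bool {
	var remoteErr *RemoteError
	if errors.As(err, &remoteErr) {
		return false
	}
	var badParam *BadParameterError
	return errors.As(err, &badParam)
}

func main() {
	local := &BadParameterError{Message: "missing proxy address"}
	remote := &RemoteError{Err: &BadParameterError{Message: "invalid role name"}}

	fmt.Println(isResolvableWithRelogin(local))  // true
	fmt.Println(isResolvableWithRelogin(remote)) // false
}
```

The remote check runs first on purpose: errors.As walks the whole chain, so a BadParameterError nested inside a RemoteError would otherwise still match.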
|
Fixed a few tests and removed wrapping from stream io.EOF errors (which are often used as stop guards). |
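To illustrate why wrapping io.EOF is a problem, here is a small self-contained sketch. The remoteError type, the wrap helpers, and the fake stream are all hypothetical and only loosely modeled on what an intercepted Recv might do; the point is that many receive loops compare against io.EOF directly, and that comparison fails once the error is wrapped.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// remoteError is a hypothetical marker wrapper for server-side errors.
type remoteError struct{ err error }

func (e *remoteError) Error() string { return e.err.Error() }
func (e *remoteError) Unwrap() error { return e.err }

// alwaysWrap marks every non-nil error as remote, io.EOF included.
func alwaysWrap(err error) error {
	if err == nil {
		return nil
	}
	return &remoteError{err: err}
}

// wrapRecvErr lets io.EOF pass through untouched and marks everything else.
func wrapRecvErr(err error) error {
	if err == nil || errors.Is(err, io.EOF) {
		return err
	}
	return &remoteError{err: err}
}

// drain mimics a typical receive loop that uses a bare io.EOF comparison
// as its stop guard.
func drain(recv func() (string, error)) ([]string, error) {
	var out []string
	for {
		msg, err := recv()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, msg)
	}
}

// fakeStream yields msgs in order and then io.EOF, passed through wrap.
func fakeStream(msgs []string, wrap func(error) error) func() (string, error) {
	i := 0
	return func() (string, error) {
		if i < len(msgs) {
			i++
			return msgs[i-1], nil
		}
		return "", wrap(io.EOF)
	}
}

func main() {
	msgs := []string{"a", "b"}

	// The wrapped EOF slips past the stop guard and surfaces as a failure.
	_, err := drain(fakeStream(msgs, alwaysWrap))
	fmt.Println("always wrap:", err)

	// With EOF left unwrapped the loop terminates cleanly.
	got, err := drain(fakeStream(msgs, wrapRecvErr))
	fmt.Println("skip EOF:", got, err)
}
```

Callers that use errors.Is(err, io.EOF) would survive either way, but plenty of existing code uses the direct comparison, so leaving io.EOF unwrapped is the safer choice.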
|
@codingllama See the table below for backport results.
|
|
@codingllama given that the original PR that introduced this behavior was backported to v14, v13 and v12, should we also backport this to other branches besides v15? |
|
@tiago agreed - I was just waiting for the testplan to finish so this (hopefully) gets exercised a bit more. I'll mail out the backports so we don't forget about it. |
* fix: Do not attempt relogin on trace.BadParameterError * Special case remote errors in IsErrorResolvableWithRelogin * Do not wrap Recv EOF errors * Fix various tests * Use errors.Is to check for io.EOF
* fix: Do not attempt relogin on remote BadParameter errors (#36866) * fix: Do not attempt relogin on trace.BadParameterError * Special case remote errors in IsErrorResolvableWithRelogin * Do not wrap Recv EOF errors * Fix various tests * Use errors.Is to check for io.EOF * Fix imports for v13 * fix: Do not wrap io.EOF intercepted by stream Sends (#37647) * Verify that intercepted stream Sends wrap io.EOF * fix: Do not wrap io.EOF intercepted by stream Sends * Use a helper func, fix duplicate Send/Recv calls * Fix typo
* fix: Do not attempt relogin on remote BadParameter errors (#36866) * fix: Do not attempt relogin on trace.BadParameterError * Special case remote errors in IsErrorResolvableWithRelogin * Do not wrap Recv EOF errors * Fix various tests * Use errors.Is to check for io.EOF * fix: Do not wrap io.EOF intercepted by stream Sends (#37647) * Verify that intercepted stream Sends wrap io.EOF * fix: Do not wrap io.EOF intercepted by stream Sends * Use a helper func, fix duplicate Send/Recv calls * Fix typo
// https://github.com/gravitational/teleport/pull/30578.
var remoteErr *interceptors.RemoteError
if errors.As(err, &remoteErr) {
	return false
@codingllama should we treat all remote errors equally here?
We can get "Original Error: *interceptors.RemoteError access denied: client credentials have expired, please relogin.", for which we return false, but it seems that in this particular case it should be true?
I ask because I'm working on adding a client cache to Connect. This means that the client no longer checks the user cert before making a call and instead has to rely on errors from the server.
The original comment #38202 (comment).
FYI @ravicious
This block here is a poor stopgap to keep tsh from retrying all kinds of errors it shouldn't. As you have observed, it is prone to making bad choices. So is the entire idea of IsErrorResolvableWithRelogin, as it loses all context from the call site.
I think the better way of doing this is explicitly marking errors as retriable by wrapping them with a RetryableError type. This way we can check for that wrapper (with errors.As) and return true if we find it. That means finding the client-side callsite for GenerateUserCerts, inspecting the response and marking it as retriable accordingly.
I'd also recommend that we have a guard error for "client credentials have expired" - as in an exported var we can check for in its entirety - so we don't go looking for specific substrings.
That's my 2c on the issue. Happy to talk more.
That means finding the client-side callsite for GenerateUserCerts, inspecting the response and marking it as retriable accordingly.
So we would have to wrap methods in api/client/client.go manually? Or could we use the interceptor (where we add RemoteError) and check for 'client credentials have expired' there?
Wrapping client.go manually is the best choice, imo, as there is no loss of context and little chance of false-positives. That's how I think this should have started.
For a "generic" place I would do it here, not in the interceptor.
Attempting relogin on BadParameter errors makes tsh retry legitimate failures, hiding those failures from users and instead drowning them in login attempts. An example from #36749:
This PR avoids such retries by skipping relogin on all RPC errors. This behavior is similar to Teleport pre-#30578, which is the source of the regression.
#36749, #36850
Changelog: Fix tsh trying to relogin on fatal errors