Skip to content

fix: Do not attempt relogin on remote BadParameter errors#36866

Merged
codingllama merged 5 commits intomasterfrom
codingllama/relogin-too-eager
Jan 18, 2024
Merged

fix: Do not attempt relogin on remote BadParameter errors#36866
codingllama merged 5 commits intomasterfrom
codingllama/relogin-too-eager

Conversation

@codingllama
Copy link
Copy Markdown
Contributor

@codingllama codingllama commented Jan 18, 2024

Attemping relogin on BadParameter error makes tsh retry legitimate failures, hiding those failures from users and instead drowning them in login attempts.

An example from #36749:

$ tsh mfa rm 5ci      # Attempts to remove last passwordless device, which is a failure
Tap any security key  # OK, this is part of the flow
Detected security key tap
# At this point we already got the failure, but relogin is triggered.
Enter password for Teleport user llama: # As a user I'm thinking "weird, I was already logged in"
Tap any security key        # This is the login tap
Detected security key tap
Tap any security key        # This is the "mfa rm" tap, as a user I'm more than confused now
Detected security key tap
ERROR: cannot delete last passwordless credential for user  # tsh gives up here, surfaces actual error

This PRs avoids such retries by skipping relogin on all RPC errors. This behavior is similar to Teleport pre-#30578, which is the source for the regression.

#36749, #36850

Changelog: Fix tsh trying to relogin on fatal errors

@codingllama
Copy link
Copy Markdown
Contributor Author

Created as a draft to gather reviewer feedback, as it's hard to say what kinds of errors could fall into this. Backport label only to v15 for the same reason.

@ravicious
Copy link
Copy Markdown
Member

FWIW, the check on BadParameter has been used for this retry since 2019. I remember being surprised by this back when I extracted IsErrorResolvableWithRelogin, because BadParameter is a pretty generic error in the context of expired certs (see the "gRPC interceptors" section in #13020).

The retry logic in Connect uses a very primitive check on the error message because we're yet to implement the more sophisticated check (https://github.com/gravitational/teleport.e/issues/853):

export function isRetryable(error: unknown): boolean {
// TODO(ravicious): Replace this with actual check on metadata.
return (
error instanceof Error &&
(error.message.includes('ssh: handshake failed') ||
error.message.includes('ssh: cert has expired'))
);
}

So far we haven't seen any problems due to Connect not checking BadParameter. In theory Connect has less surface area than tsh when it comes to possible Go errors that might get thrown. OTOH, when Connect and tsh try to make a request using an expired cert, don't they both go through pretty much exactly the same code path?

Overall I think we should do this. I wish we made that change before the test plan, but at this point it's better to make it today than tomorrow.

@codingllama
Copy link
Copy Markdown
Contributor Author

OK, I figured out what happened. Since #30578 we automatically translate gRPC errors into trace errors. Before that commit server-side failures materialized as status.Error, which didn't trigger relogin, but since the interceptors a good chunk of server-side errors are now (correctly) handled as BadParameter and do trigger relogin.

It's hard to say what the original condition was meant to handle. Maybe we had "local" lib/client BadParameters that we wanted to retry.

Overall I think we should do this. I wish we made that change before the test plan, but at this point it's better to make it today than tomorrow.

Yep, this is my stance too, especially after the bisect adventure above.

@tigrato
Copy link
Copy Markdown
Contributor

tigrato commented Jan 18, 2024

It's hard to say what the original condition was meant to handle. Maybe we had "local" lib/client BadParameters that we wanted to retry.

There are several HTTP calls that will return trace.BadParameter when the response returns non-200 responses in api/client. Maybe those were the ones we were interested in retrying

return nil, trace.BadParameter("failed to switch Protocols %v", resp.StatusCode)

return nil, trace.BadParameter("unable to proxy connection: %v", resp.Status)

@codingllama codingllama changed the title fix: Do not attempt relogin on trace.BadParameterError fix: Do not attempt relogin on remote BadParameter errors Jan 18, 2024
@codingllama codingllama force-pushed the codingllama/relogin-too-eager branch from fec0466 to 71764cf Compare January 18, 2024 19:35
@codingllama
Copy link
Copy Markdown
Contributor Author

I've pushed a change that marks errors originated from the interceptors and avoids retrying those. This should move us closer to the behavior pre-#30578 and cause less side-effects. Some brave soul can then try to remove retries for all BadParameters another time.

PTAL.

@codingllama codingllama marked this pull request as ready for review January 18, 2024 19:38
@github-actions github-actions Bot requested review from Tener and hugoShaka January 18, 2024 19:39
Copy link
Copy Markdown
Contributor

@tigrato tigrato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @codingllama !

@codingllama
Copy link
Copy Markdown
Contributor Author

Fixed a few tests and removed wrapping from stream io.EOF errors (which are often used as stop guards).

Comment thread api/utils/grpc/interceptors/errors.go Outdated
@codingllama codingllama enabled auto-merge January 18, 2024 21:00
@codingllama codingllama added this pull request to the merge queue Jan 18, 2024
Merged via the queue into master with commit 96defc6 Jan 18, 2024
@codingllama codingllama deleted the codingllama/relogin-too-eager branch January 18, 2024 21:35
@public-teleport-github-review-bot
Copy link
Copy Markdown

@codingllama See the table below for backport results.

Branch Result
branch/v15 Create PR

@tigrato
Copy link
Copy Markdown
Contributor

tigrato commented Jan 19, 2024

@codingllama given that the original PR that introduced this behavior was backported to v14, v13 and v12, should we also backport this to other branches besides v15?

@codingllama
Copy link
Copy Markdown
Contributor Author

@tiago agreed - I was just waiting for the testplan to finish so this (hopefully) gets exercised a bit more.

I'll mail out the backports so we don't forget about it.

codingllama added a commit that referenced this pull request Jan 19, 2024
* fix: Do not attempt relogin on trace.BadParameterError

* Special case remote errors in IsErrorResolvableWithRelogin

* Do not wrap Recv EOF errors

* Fix various tests

* Use errors.Is to check for io.EOF
codingllama added a commit that referenced this pull request Jan 19, 2024
* fix: Do not attempt relogin on trace.BadParameterError

* Special case remote errors in IsErrorResolvableWithRelogin

* Do not wrap Recv EOF errors

* Fix various tests

* Use errors.Is to check for io.EOF
codingllama added a commit that referenced this pull request Jan 19, 2024
* fix: Do not attempt relogin on trace.BadParameterError

* Special case remote errors in IsErrorResolvableWithRelogin

* Do not wrap Recv EOF errors

* Fix various tests

* Use errors.Is to check for io.EOF
github-merge-queue Bot pushed a commit that referenced this pull request Feb 1, 2024
* fix: Do not attempt relogin on remote BadParameter errors (#36866)

* fix: Do not attempt relogin on trace.BadParameterError

* Special case remote errors in IsErrorResolvableWithRelogin

* Do not wrap Recv EOF errors

* Fix various tests

* Use errors.Is to check for io.EOF

* Fix imports for v13

* fix: Do not wrap io.EOF intercepted by stream Sends (#37647)

* Verify that intercepted stream Sends wrap io.EOF

* fix: Do not wrap io.EOF intercepted by stream Sends

* Use a helper func, fix duplicate Send/Recv calls

* Fix typo
github-merge-queue Bot pushed a commit that referenced this pull request Feb 1, 2024
* fix: Do not attempt relogin on remote BadParameter errors (#36866)

* fix: Do not attempt relogin on trace.BadParameterError

* Special case remote errors in IsErrorResolvableWithRelogin

* Do not wrap Recv EOF errors

* Fix various tests

* Use errors.Is to check for io.EOF

* fix: Do not wrap io.EOF intercepted by stream Sends (#37647)

* Verify that intercepted stream Sends wrap io.EOF

* fix: Do not wrap io.EOF intercepted by stream Sends

* Use a helper func, fix duplicate Send/Recv calls

* Fix typo
Comment thread lib/client/api.go
// https://github.com/gravitational/teleport/pull/30578.
var remoteErr *interceptors.RemoteError
if errors.As(err, &remoteErr) {
return false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codingllama should we treat all remote errors equally here?
We can get Original Error: *interceptors.RemoteError access denied: client credentials have expired, please relogin. for which we return false but it seems that for this particular case it should be true?

I ask because I'm working on adding a client cache to Connect, and this means that the client no longer checks the user cert before making call, but instead it has to rely on errors from the server.

The original comment #38202 (comment).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ravicious

This block here is a poor stopgap to avoid tsh from retrying all kinds of errors it shouldn't be. As you have observed it is prone to making bad choices - the entire idea of IsErrorResolvableWithRelogin is, as it loses all context from the call site.

I think the better way of doing this is explicitly marking errors as retriable by wrapping them with a RetryableError type. This way we can check for that wrapper (with errors.As) and return true if we find it. That means finding the client-side callsite for GenerateUserCerts, inspecting the response and marking it as retriable accordingly.

I'd also recommend that we have a guard error for "client credentials have expired" - as in an exported var we can check for in its entirety - so we don't go looking for specific substrings.

That's my 2c on the issue. Happy to talk more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means finding the client-side callsite for GenerateUserCerts, inspecting the response and marking it as retriable accordingly.

So we would have to wrap methods in api/client/client.go manually? Or could we use the interceptor (where we add RemoteError) and check for 'client credentials have expired' there?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapping client.go manually is the best choice, imo, as there is no loss of context and little chance of false-positives. That's how I think this should have started.

For a "generic" place I would do it here, not in the interceptor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants