fix: avoid inadvertent deletion of active HSM keys #25025
Merged
Conversation
Contributor
Author
@codingllama @AntonAM PTAL
AntonAM
approved these changes
Apr 26, 2023
Contributor
AntonAM
left a comment
Yeah, pretty unfortunate. How did we find out, from an angry customer? 😅
Contributor
Author
Luckily I found it while running through the test plan, before a customer could find it for us 😬 I have reached out to the one customer who I know is using this hardware to warn them about the issue. I will feel a bit better once we ship this fix.
codingllama
approved these changes
Apr 26, 2023
This is a partial fix for #25017

The latest version of the YubiHSM2 SDK has changed the behavior for keys longer than 2 bytes, which used to be silently truncated for all operations. This causes an unfortunate interaction with `DeleteUnusedKeys` when the SDK is upgraded in an active Teleport cluster. Because none of the active keys can be queried from the HSM individually by their ID, but they can be listed by their label, all of the active keys end up being deleted. Yeah that's bad.

`DeleteUnusedKeys` is written this way in an attempt to be "stateless". Trying to synchronously delete keys at the instant they are rotated out during a CA rotation would be error-prone. If the auth server were to restart or crash at the wrong moment, you could be left with an orphaned key on your HSM forever, with no reference to it stored by Teleport or anywhere else.

Instead, the Auth server labels all keys it creates with its own host UUID. Then periodically (during startup) it lists all keys in the HSM that are labeled with its own UUID, and if they are not currently active, deletes them. This goes catastrophically wrong when individual lookup operations fail, but list operations succeed.

The fix here is to avoid deleting any keys if any single lookup fails. The YubiHSM2 SDK version `2023.1` is still not supported, but with this fix at least we won't delete any active keys.
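The safeguard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Teleport's actual code: the `keyStore` interface, the in-memory `memStore` fake, and the `deleteUnusedKeys` signature are all hypothetical names invented for this example. The point it shows is the ordering: every active key must be found by an individual lookup before any labeled key is considered for deletion.

```go
package main

import (
	"errors"
	"fmt"
)

// keyStore is a hypothetical stand-in for an HSM client. It is not the
// YubiHSM2 SDK or Teleport's keystore API.
type keyStore interface {
	Lookup(id string) error                     // find one key by its ID
	ListByLabel(label string) ([]string, error) // IDs of keys carrying a label
	Delete(id string) error
}

// deleteUnusedKeys deletes keys labeled with hostUUID that are not active.
// If any active key cannot be looked up individually, it returns an error
// before deleting anything.
func deleteUnusedKeys(ks keyStore, hostUUID string, activeIDs []string) ([]string, error) {
	active := make(map[string]struct{}, len(activeIDs))
	for _, id := range activeIDs {
		if err := ks.Lookup(id); err != nil {
			return nil, fmt.Errorf("active key %q lookup failed, refusing to delete any keys: %w", id, err)
		}
		active[id] = struct{}{}
	}

	labeled, err := ks.ListByLabel(hostUUID)
	if err != nil {
		return nil, fmt.Errorf("listing keys: %w", err)
	}

	var deleted []string
	for _, id := range labeled {
		if _, ok := active[id]; ok {
			continue // still in use
		}
		if err := ks.Delete(id); err != nil {
			return deleted, fmt.Errorf("deleting key %q: %w", id, err)
		}
		deleted = append(deleted, id)
	}
	return deleted, nil
}

var errNotFound = errors.New("key not found")

// memStore is an in-memory fake. lookupBroken simulates the failure mode
// from the description: lookups by ID fail while listing by label works.
type memStore struct {
	keys         map[string]string // key ID -> label
	lookupBroken bool
}

func (m *memStore) Lookup(id string) error {
	if m.lookupBroken {
		return errNotFound
	}
	if _, ok := m.keys[id]; !ok {
		return errNotFound
	}
	return nil
}

func (m *memStore) ListByLabel(label string) ([]string, error) {
	var ids []string
	for id, l := range m.keys {
		if l == label {
			ids = append(ids, id)
		}
	}
	return ids, nil
}

func (m *memStore) Delete(id string) error {
	delete(m.keys, id)
	return nil
}
```

With a `memStore` whose `lookupBroken` field is set, `deleteUnusedKeys` returns an error and leaves every key in place, which is the behavior the fix aims for.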
f38a892 to
b0f5b6b
Compare
@nklaassen See the table below for backport results.
This was referenced Apr 26, 2023
nklaassen
added a commit
that referenced
this pull request
Nov 10, 2023
This PR fixes the cleanup code for unused keys in GCP KMS, which is currently failing to delete unused keys and emitting confusing warning logs whenever keys have been generated by multiple different Auth servers (no actual functionality is currently broken).

This bug was introduced in #25025. The issue introduced there is that the function now checks that all currently active keys have actually been found in the keyring, but the ListCryptoKeys call used a filter that excluded all keys created by different Auth servers.

This fix improves some of the error messages, and also uses a more permissive filter in the ListCryptoKeys call to make sure we can list keys created by any auth server, but will only delete keys created by the local auth server.

changelog: Fix cleanup of unused GCP KMS keys
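The follow-up logic can be sketched as below. This is a rough illustration only: it does not call the GCP KMS client, and the `kmsKey` type, its `HostUUID` field, and the `unusedLocalKeys` function are hypothetical names chosen for the example. It assumes the caller has already listed every key in the keyring with the permissive filter.

```go
package main

import "fmt"

// kmsKey is a hypothetical, simplified view of a key in the keyring.
type kmsKey struct {
	Name     string
	HostUUID string // which Auth server created the key (assumed to come from a label)
}

// unusedLocalKeys returns the names of keys that are safe to delete:
// created by the local Auth server and not currently active. It fails if
// any active key is missing from the listing, so a listing that is too
// narrow leads to an error rather than to deletions.
func unusedLocalKeys(all []kmsKey, activeNames []string, localHostUUID string) ([]string, error) {
	present := make(map[string]struct{}, len(all))
	for _, k := range all {
		present[k.Name] = struct{}{}
	}

	active := make(map[string]struct{}, len(activeNames))
	for _, name := range activeNames {
		if _, ok := present[name]; !ok {
			return nil, fmt.Errorf("active key %q not found in keyring, refusing to delete any keys", name)
		}
		active[name] = struct{}{}
	}

	var unused []string
	for _, k := range all {
		if k.HostUUID != localHostUUID {
			continue // never delete keys created by another Auth server
		}
		if _, ok := active[k.Name]; ok {
			continue // still in use
		}
		unused = append(unused, k.Name)
	}
	return unused, nil
}
```

If `all` were built from a listing restricted to the local host's keys, an active key created by another Auth server would be missing from it and the active-key check would fail every time, which matches the symptom described above.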