fix: avoid inadvertent deletion of active HSM keys #25025

Merged
nklaassen merged 1 commit into master from nklaassen/fix-key-deletion
Apr 26, 2023

@nklaassen
Contributor

This is a partial fix for #25017

The latest version of the YubiHSM2 SDK has changed the behavior for key IDs longer than 2 bytes, which used to be silently truncated for all operations.
This causes an unfortunate interaction with DeleteUnusedKeys when the SDK is upgraded in an active Teleport cluster.
After the upgrade, none of the active keys can be queried from the HSM individually by their ID, but they can still be listed by their label, so they are all treated as unused and end up being deleted.

DeleteUnusedKeys is written this way in an attempt to be "stateless". Trying to synchronously delete keys at the instant they are rotated out during a CA rotation would be error-prone. If the auth server were to restart or crash at the wrong moment, you could be left with an orphaned key on your HSM forever, with no reference to it stored by Teleport or anywhere else.

Instead, the Auth server labels all keys it creates with its own host UUID.
Then periodically (during startup) it lists all keys in the HSM that are labeled with its own UUID, and if they are not currently active, deletes them.
This goes catastrophically wrong when individual lookup operations fail, but list operations succeed.

The fix here is to avoid deleting any keys if any single lookup fails.
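
For readers who want the shape of this check, here is a minimal sketch. It is not the code from this PR: the `hsm` interface, the `memHSM` fake, `errNotFound`, and the lowercase `deleteUnusedKeys` are all hypothetical names made up for illustration. It only shows the idea described above: look up every active key first, and refuse to delete anything if any single lookup fails.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for a failed lookup by ID.
var errNotFound = errors.New("key not found")

// hsm is a hypothetical, minimal view of an HSM (not the real PKCS#11 API).
type hsm interface {
	FindByID(id string) error
	ListByLabel(label string) ([]string, error)
	Delete(id string) error
}

// deleteUnusedKeys deletes keys labeled with hostUUID that are not active.
// It aborts before deleting anything if any active key cannot be looked up.
func deleteUnusedKeys(h hsm, hostUUID string, activeIDs []string) error {
	active := make(map[string]struct{}, len(activeIDs))
	for _, id := range activeIDs {
		if err := h.FindByID(id); err != nil {
			return fmt.Errorf("looking up active key %q, not deleting any keys: %w", id, err)
		}
		active[id] = struct{}{}
	}
	labeled, err := h.ListByLabel(hostUUID)
	if err != nil {
		return fmt.Errorf("listing keys for host %q: %w", hostUUID, err)
	}
	for _, id := range labeled {
		if _, ok := active[id]; ok {
			continue // still in use
		}
		if err := h.Delete(id); err != nil {
			return fmt.Errorf("deleting unused key %q: %w", id, err)
		}
	}
	return nil
}

// memHSM is an in-memory fake. lookupBroken simulates the failure mode
// where lookups by ID fail while listing by label still succeeds.
type memHSM struct {
	keys         map[string]string // key ID -> host UUID label
	lookupBroken bool
}

func (m *memHSM) FindByID(id string) error {
	if m.lookupBroken {
		return errNotFound
	}
	if _, ok := m.keys[id]; !ok {
		return errNotFound
	}
	return nil
}

func (m *memHSM) ListByLabel(label string) ([]string, error) {
	var ids []string
	for id, l := range m.keys {
		if l == label {
			ids = append(ids, id)
		}
	}
	return ids, nil
}

func (m *memHSM) Delete(id string) error {
	delete(m.keys, id)
	return nil
}

func main() {
	h := &memHSM{keys: map[string]string{
		"active-1": "host-a",
		"old-1":    "host-a",
		"other-1":  "host-b",
	}}

	h.lookupBroken = true
	err := deleteUnusedKeys(h, "host-a", []string{"active-1"})
	fmt.Println("broken lookups, got error:", err != nil, "keys left:", len(h.keys))

	h.lookupBroken = false
	err = deleteUnusedKeys(h, "host-a", []string{"active-1"})
	fmt.Println("working lookups, error:", err, "keys left:", len(h.keys))
}
```

The tradeoff in this sketch is that a single transient lookup failure skips cleanup entirely for that run, which leaves unused keys around a little longer but never removes an active one.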

The YubiHSM2 SDK version 2023.1 is still not supported, but with this fix at least we won't delete any active keys.

@nklaassen
Contributor Author

@codingllama @AntonAM PTAL

Contributor

@AntonAM AntonAM left a comment


Yeah, pretty unfortunate. How did we find out, from an angry customer? 😅

Comment thread lib/auth/keystore/pkcs11.go Outdated
@nklaassen
Contributor Author

nklaassen commented Apr 26, 2023

Luckily I found it while running through the test plan before a customer could find it for us 😬 I have reached out to the one customer who I know is using this hardware to warn them about the issue. I will feel a bit better once we ship this fix.

Comment thread lib/auth/keystore/gcp_kms.go Outdated
Comment thread lib/auth/keystore/pkcs11.go Outdated
Comment thread lib/auth/keystore/pkcs11.go Outdated
@nklaassen nklaassen force-pushed the nklaassen/fix-key-deletion branch from f38a892 to b0f5b6b Compare April 26, 2023 16:15
@nklaassen nklaassen enabled auto-merge April 26, 2023 16:23
@nklaassen nklaassen added this pull request to the merge queue Apr 26, 2023
Merged via the queue into master with commit e1d2305 Apr 26, 2023
@nklaassen nklaassen deleted the nklaassen/fix-key-deletion branch April 26, 2023 16:53
@public-teleport-github-review-bot

@nklaassen See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v11 | Create PR |
| branch/v12 | Create PR |
| branch/v13 | Create PR |

nklaassen added a commit that referenced this pull request Nov 10, 2023
This PR fixes the cleanup code for unused keys in GCP KMS which is
currently failing to delete unused keys and emitting confusing warning
logs whenever keys have been generated by multiple different Auth
servers (no actual functionality is currently broken).

This bug was introduced in
#25025. The issue
introduced there is that the function now checks that all currently
active keys have actually been found in the keyring, but the
ListCryptoKeys call used a filter that excluded all keys created by
different Auth servers.

This fix improves some of the error messages, and also uses a more
permissive filter in the ListCryptoKeys call to make sure we can list
keys created by any auth server, but will only delete keys created by
the local auth server.
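
A rough sketch of the selection logic described above, under stated assumptions: `kmsKey`, its `HostUUID` field, and `unusedLocalKeys` are hypothetical names, and the input slice stands in for whatever the ListCryptoKeys call returns. This is not the actual Teleport code or the GCP KMS client API. It only illustrates why the listing has to cover keys from every Auth server while deletion stays limited to the local one.

```go
package main

import "fmt"

// kmsKey is a hypothetical stand-in for a listed KMS key.
type kmsKey struct {
	Name     string
	HostUUID string // label recording which Auth server created the key
}

// unusedLocalKeys returns the names of keys that were created by the local
// Auth server and are no longer active. It returns an error, and selects
// nothing, if any active key is missing from the listing.
func unusedLocalKeys(all []kmsKey, activeNames []string, localHostUUID string) ([]string, error) {
	listed := make(map[string]struct{}, len(all))
	for _, k := range all {
		listed[k.Name] = struct{}{}
	}
	active := make(map[string]struct{}, len(activeNames))
	for _, name := range activeNames {
		if _, ok := listed[name]; !ok {
			return nil, fmt.Errorf("active key %q not found in listing, refusing to delete any keys", name)
		}
		active[name] = struct{}{}
	}
	var unused []string
	for _, k := range all {
		if k.HostUUID != localHostUUID {
			continue // never delete keys created by another Auth server
		}
		if _, ok := active[k.Name]; ok {
			continue // still in use
		}
		unused = append(unused, k.Name)
	}
	return unused, nil
}

func main() {
	all := []kmsKey{
		{Name: "key-a-active", HostUUID: "host-a"},
		{Name: "key-a-old", HostUUID: "host-a"},
		{Name: "key-b-active", HostUUID: "host-b"},
	}
	active := []string{"key-a-active", "key-b-active"}

	// Permissive listing: every active key is found, one local key is unused.
	unused, err := unusedLocalKeys(all, active, "host-a")
	fmt.Println(unused, err)

	// Listing restricted to host-a keys: the host-b active key is missing,
	// so the safety check fails and nothing is selected for deletion.
	unused, err = unusedLocalKeys(all[:2], active, "host-a")
	fmt.Println(unused, err)
}
```

The second call in `main` mirrors the failure described in this commit message: with a listing restricted to the local Auth server's keys, the "all active keys were found" check can never pass once another Auth server has generated an active key.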

changelog: Fix cleanup of unused GCP KMS keys