feat: seamless migration to HSM/KMS#36549
Conversation
Are we backporting this? |
No, the plan is for it to go out in v15 |
The PR makes it much easier to migrate an existing Teleport cluster from software keys to HSM or KMS keys.
Previously, as soon as an HSM/KMS was configured in the teleport.yaml for an auth server, it would immediately refuse to sign any more certificates with software keys. This was meant to defend against someone configuring an HSM and then forgetting to perform the necessary CA migrations, thinking they were protected by the HSM when in fact they were still using the old software keys.
In practice, this just made it very difficult to migrate a cluster since there would be downtime where you couldn't even use tctl remotely because there was no way to log in. In a dual-auth cluster you could theoretically avoid downtime but the process was arcane and difficult to execute. Check out the docs changes to see all the steps I was able to remove here.
This will be critical for enabling Cloud to start using AWS KMS keys.
TODO: add a cluster alert (probably not for Cloud) when an HSM/KMS is configured but not actively used yet because a CA rotation is needed. Necessary now that `tctl status` no longer prints this (it has no way to tell).
| config.Auth.ListenAddr.Addr = net.JoinHostPort(hostName, "0") | ||
| config.Auth.ListenAddr.Addr = net.JoinHostPort("localhost", "0") |
not sure why, but I had to make this change to get the tests to pass on my macbook
| if (cfg.PKCS11 != PKCS11Config{}) { | ||
| backend, err := newPKCS11KeyStore(&cfg.PKCS11, logger) | ||
| return &Manager{backend: backend}, trace.Wrap(err) | ||
| pkcs11Backend, err := newPKCS11KeyStore(&cfg.PKCS11, cfg.Logger) | ||
| return &Manager{ | ||
| backendForNewKeys: pkcs11Backend, | ||
| usableSigningBackends: []backend{pkcs11Backend, softwareBackend}, | ||
| }, trace.Wrap(err) | ||
| } | ||
| if (cfg.GCPKMS != GCPKMSConfig{}) { | ||
| backend, err := newGCPKMSKeyStore(ctx, &cfg.GCPKMS, logger) | ||
| return &Manager{backend: backend}, trace.Wrap(err) | ||
| gcpBackend, err := newGCPKMSKeyStore(ctx, &cfg.GCPKMS, cfg.Logger) | ||
| return &Manager{ | ||
| backendForNewKeys: gcpBackend, | ||
| usableSigningBackends: []backend{gcpBackend, softwareBackend}, | ||
| }, trace.Wrap(err) | ||
| } | ||
| if (cfg.AWSKMS != AWSKMSConfig{}) { | ||
| backend, err := newAWSKMSKeystore(ctx, &cfg.AWSKMS, logger) | ||
| return &Manager{backend: backend}, trace.Wrap(err) | ||
| awsBackend, err := newAWSKMSKeystore(ctx, &cfg.AWSKMS, cfg.Logger) | ||
| return &Manager{ | ||
| backendForNewKeys: awsBackend, | ||
| usableSigningBackends: []backend{awsBackend, softwareBackend}, | ||
| }, trace.Wrap(err) | ||
| } | ||
| return &Manager{backend: newSoftwareKeyStore(&cfg.Software, logger)}, nil | ||
| return &Manager{ | ||
| backendForNewKeys: softwareBackend, | ||
| usableSigningBackends: []backend{softwareBackend}, | ||
| }, nil |
This is the meat of the actual change. Instead of only having a single keystore backend for everything, the keystore manager has one preferred backend for any new keys it will generate (this is only used to generate CA keys) and a list of backends it can use to sign stuff, in preference order. I always include the software keystore as the last element in the list, so the auth can always sign certs if there are any software keys in the CA.
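To make the preference-order idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR: only the field names `backendForNewKeys` and `usableSigningBackends` come from the diff above. The `backend` interface shape, `caKey`, `keyKind`, `simpleBackend`, `newManager`, and `pickSigner` are all hypothetical names invented for illustration, and the real keystore deals with actual key material rather than string tags.

```go
package main

import (
	"errors"
	"fmt"
)

// keyKind is a hypothetical tag for where a CA key's private material lives.
type keyKind string

const (
	kindSoftware keyKind = "software"
	kindHSM      keyKind = "hsm"
)

// caKey is a hypothetical stand-in for one key pair stored in a CA.
type caKey struct {
	kind keyKind
	id   string
}

// backend is a simplified, assumed interface; the real one is richer.
type backend interface {
	name() string
	canSignWith(k caKey) bool
}

// simpleBackend can use any key whose kind matches its own.
type simpleBackend struct {
	label string
	kind  keyKind
}

func (b simpleBackend) name() string             { return b.label }
func (b simpleBackend) canSignWith(k caKey) bool { return k.kind == b.kind }

// manager mirrors the two fields introduced in the diff.
type manager struct {
	backendForNewKeys     backend
	usableSigningBackends []backend // in preference order
}

// newManager always appends the software backend last, so existing
// software keys stay usable after an HSM is configured.
func newManager(hsmConfigured bool) *manager {
	software := simpleBackend{label: "software", kind: kindSoftware}
	if hsmConfigured {
		hsm := simpleBackend{label: "hsm", kind: kindHSM}
		return &manager{
			backendForNewKeys:     hsm,
			usableSigningBackends: []backend{hsm, software},
		}
	}
	return &manager{
		backendForNewKeys:     software,
		usableSigningBackends: []backend{software},
	}
}

var errNoUsableKey = errors.New("no usable signing key in CA")

// pickSigner walks the backends in preference order and returns the
// first backend that can sign with one of the CA's keys.
func (m *manager) pickSigner(keys []caKey) (backend, caKey, error) {
	for _, b := range m.usableSigningBackends {
		for _, k := range keys {
			if b.canSignWith(k) {
				return b, k, nil
			}
		}
	}
	return nil, caKey{}, errNoUsableKey
}

func main() {
	m := newManager(true)
	softwareOnly := []caKey{{kind: kindSoftware, id: "old-software-key"}}
	midRotation := []caKey{
		{kind: kindSoftware, id: "old-software-key"},
		{kind: kindHSM, id: "new-hsm-key"},
	}
	for _, keys := range [][]caKey{softwareOnly, midRotation} {
		b, k, err := m.pickSigner(keys)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("signing with %s via %s backend\n", k.id, b.name())
	}
}
```

In this sketch, a manager with an HSM configured still signs with the old software key while the CA contains nothing else, and switches to the HSM key as soon as one is present in the CA, which is the behavior that lets a migration proceed without downtime.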
|
@smallinsky @zmb3 @r0mant I'm hoping I can get this in for the v15 cutoff and the bot chose all of you for one reason or another, not sure if I need one or two more reviews |
|
@nklaassen See the table below for backport results.
|
Backport #36899 to branch/v13. The actual fix is a few characters in lib/auth/keystore/pkcs11.go. I'm also backporting changes to test files from #36549 that this PR built on top of, which make it easier to run all HSM unit and integration tests with a connected YubiHSM2 (which I did when putting together this backport). Instead of merging all changes in the integration tests, I just checked out the state of them from branch/v14 in #37296.
Backport #36899 to branch/v12. The actual fix is a few characters in lib/auth/keystore/pkcs11.go. I'm also backporting changes to test files from #36549 that this PR built on top of, which make it easier to run all HSM unit and integration tests with a connected YubiHSM2 (which I did when putting together this backport). Instead of merging all changes in the integration tests, I just checked out the state of them from branch/v13 in #37301. Changelog: fixes CA key generation when two auth servers share a single YubiHSM2
This PR makes it much easier to migrate an existing Teleport cluster from software keys to HSM or KMS keys.
Previously, as soon as an HSM/KMS was configured in the teleport.yaml for an auth server, it would immediately refuse to sign any more certificates with software keys. This was meant to defend against someone configuring an HSM and then forgetting to perform the necessary CA migrations, thinking they were protected by the HSM when in fact they were still using the old software keys.
In practice, this just made it very difficult to migrate a cluster since there would be downtime where you couldn't even use tctl remotely because there was no way to log in. In a dual-auth cluster you could theoretically avoid downtime but the process was arcane and difficult to execute. Check out the docs changes to see the parts I was able to remove here.
This will be critical for enabling Cloud to start using AWS KMS keys.
TODO: (in a following PR) add a cluster alert (probably not for Cloud) when an HSM/KMS is configured but not actively used yet because a CA rotation is needed. Necessary now that `tctl status` no longer prints this (it has no way to tell).
Changelog: Improved the migration experience when configuring HSM or KMS backing for CA key material