Conversation
the active health checker performs signing requests against CAs.
failed signing requests are reported back to a callback which will
trip teleports readiness state.
the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.
this can be enabled by configuring
```yaml
keystore:
health_check:
active:
enabled: true
```
this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker
10c171e to
4ea3649
Compare
4ea3649 to
107e524
Compare
3d8584d to
d7e281a
Compare
0771836 to
3e0a81e
Compare
| hc, err := NewActiveHealthChecker(ActiveHealthCheckConfig{ | ||
| Callback: func(err error) { | ||
| calls = append(calls, err) | ||
| wg.Done() |
There was a problem hiding this comment.
seems this may be called twice, flaky test detector is catching a negative waitgroup counter
you could probably try using synctest here instead of a fake clock https://pkg.go.dev/testing/synctest, then you could just "sleep" instead of advancing the clock and if done right it should guarantee the health check runs before your test continues
There was a problem hiding this comment.
👍 I'll give it a shot, I haven't messed with synctest before.
There was a problem hiding this comment.
good to know this is an option, thanks. updated in 214f551
| func (m *Manager) GetTLSSigner(ctx context.Context, ca types.CertAuthority) (crypto.Signer, error) { | ||
| _, signer, err := m.getTLSCertAndSigner(ctx, ca.GetActiveKeys()) | ||
| return signer, trace.Wrap(err) | ||
| } |
There was a problem hiding this comment.
couldn't you just call GetTLSCertAndSigner instead of adding this method?
There was a problem hiding this comment.
Yeah I was trying to make the API more consistent but I could also just drop the cert in the health checker
There was a problem hiding this comment.
doesn't really matter, but if you prefer to add the GetTLSSigner method, can you just move it right next to GetTLSCertAndSigner?
| } | ||
|
|
||
| caService := local.NewCAService(bk) | ||
| //nolint:staticcheck // SA1019 This should be updated to use [services.NewCertAuthorityWatcher] |
There was a problem hiding this comment.
this test of the deprecated watcher should not be updated though
There was a problem hiding this comment.
I think this is just a bad diff, the deprecated watcher test is not modified.
There was a problem hiding this comment.
I think it's right, this test should exercise the deprecated watcher, I just don't think the comment should say it should be updated (i assume it's a copy/paste error)
There was a problem hiding this comment.
I miss understood. I will update to say something along the lines of "remove the test when the deprecated method is no longer used"
| if last == nil { | ||
| return c.signers[0], nil | ||
| } | ||
| for i := range c.signers { | ||
| if c.signers[i].Equal(last) { | ||
| return c.signers[(i+1)%n], nil | ||
| } | ||
| } | ||
| return c.signers[0], nil |
There was a problem hiding this comment.
The process of selecting the next signer is a bit funky. The local signer store isn't sorted, and the API doesn't specify that the upstream data source sending the new signer list needs to send them sorted. Additionally, this seeking logic just always restarts at the beginning if it can't find where it left off.
Admittedly, the set of signers for a cluster is small and relatively static, so these may be non-issues. I doubt we'd actually have problems unless something else was going wrong, but if you find the time I think it would be better practice to do one of the following:
-
Sort the local signer store by
<domain-name>/<kind>(or vise-versa) and then select the lexicographically next value. That way we don't need to rely on the upstream yielding a sorted sequence, and can continue where we left off in the face of removals. -
Do away with stepping through one signer each "tick" in favor of just health checking all of them each tick. If load is a concern, adding a cpl hundred ms of rate limiting to the inner check loop would probably be fine.
There was a problem hiding this comment.
good point, I will update to use the lexical sorting approach.
This is mostly intended for KMS integrations which have quota and cost implications. for example aws kms asymmetric crypto operations have a default 1000 req/second limit (for a region not per key) and costs $0.15 / 10000 requests.
at a rate of 1 req/minute the cost works out to $7.88 per key per year. Or just $7.88 since we're doing 1 key at a time.
we don't do it often but we did add two CAs in v18 and each CA can contain multiple keys so I'd rather keep the load/cost of this small and predictable.
* keystore: add active health checker
the active health checker performs signing requests against CAs.
failed signing requests are reported back to a callback which will
trip teleports readiness state.
the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.
this can be enabled by configuring
```yaml
keystore:
health_check:
active:
enabled: true
```
this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker
* improve godocs
* simplify logic
* fix outdated comment
* add file license header
* fix lint
* return curr signer on health failure
* refactor watch loop to improve readability
* fix comments
* add context to deletekey godocs
* move GetTLSSigner closer to similar func def
* use synctest and remove fake clock
* update lint comment
* use lexical sorting to select next key
* keystore: add active health checker
the active health checker performs signing requests against CAs.
failed signing requests are reported back to a callback which will
trip teleports readiness state.
the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.
this can be enabled by configuring
```yaml
keystore:
health_check:
active:
enabled: true
```
this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker
* improve godocs
* simplify logic
* fix outdated comment
* add file license header
* fix lint
* return curr signer on health failure
* refactor watch loop to improve readability
* fix comments
* add context to deletekey godocs
* move GetTLSSigner closer to similar func def
* use synctest and remove fake clock
* update lint comment
* use lexical sorting to select next key
* [v18] keystore: add active health checker (#61962) * keystore: add active health checker the active health checker performs signing requests against CAs. failed signing requests are reported back to a callback which will trip teleports readiness state. the health checker watches CA events to ensure that the keys being tested are kept up-to-date. this can be enabled by configuring ```yaml keystore: health_check: active: enabled: true ``` this config format leaves room for adding alternative health checking methods and additional parameters for the active health checker * improve godocs * simplify logic * fix outdated comment * add file license header * fix lint * return curr signer on health failure * refactor watch loop to improve readability * fix comments * add context to deletekey godocs * move GetTLSSigner closer to similar func def * use synctest and remove fake clock * update lint comment * use lexical sorting to select next key * remove use of synctest since it is experimental in 1.24
the active health checker performs signing requests against CAs.
failed signing requests are reported back to a callback which will trip teleports readiness state.
the health checker watches CA events to ensure that the keys being tested are kept up-to-date.
this can be enabled by configuring
this config format leaves room for adding alternative health checking methods and additional parameters for the active health checker
changelog: Added support for health checks to monitor cert authority availability and affect Teleport Auth readiness