Skip to content

keystore: add active health checker#61962

Merged
dboslee merged 15 commits intomasterfrom
david/kms-healthcheck
Jan 5, 2026
Merged

keystore: add active health checker#61962
dboslee merged 15 commits intomasterfrom
david/kms-healthcheck

Conversation

@dboslee
Copy link
Copy Markdown
Contributor

@dboslee dboslee commented Dec 3, 2025

the active health checker performs signing requests against CAs.

failed signing requests are reported back to a callback which will trip teleports readiness state.

the health checker watches CA events to ensure that the keys being tested are kept up-to-date.

this can be enabled by configuring

keystore:
  health_check:
    active:
      enabled: true

this config format leaves room for adding alternative health checking methods and additional parameters for the active health checker

changelog: Added support for health checks to monitor cert authority availability and affect Teleport Auth readiness

the active health checker performs signing requests against CAs.

failed signing requests are reported back to a callback which will
trip teleports readiness state.

the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.

this can be enabled by configuring

```yaml
keystore:
  health_check:
    active:
      enabled: true

```

this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker
@dboslee dboslee force-pushed the david/kms-healthcheck branch 3 times, most recently from 10c171e to 4ea3649 Compare December 4, 2025 02:33
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 4ea3649 to 107e524 Compare December 4, 2025 19:08
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 3d8584d to d7e281a Compare December 5, 2025 20:19
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 0771836 to 3e0a81e Compare December 5, 2025 20:38
@dboslee dboslee marked this pull request as ready for review December 5, 2025 20:44
Comment thread lib/auth/keystore/health/active_test.go Outdated
hc, err := NewActiveHealthChecker(ActiveHealthCheckConfig{
Callback: func(err error) {
calls = append(calls, err)
wg.Done()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this may be called twice, flaky test detector is catching a negative waitgroup counter

you could probably try using synctest here instead of a fake clock https://pkg.go.dev/testing/synctest, then you could just "sleep" instead of advancing the clock and if done right it should guarantee the health check runs before your test continues

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll give it a shot, I haven't messed with synctest before.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know this is an option, thanks. updated in 214f551

Comment thread lib/auth/keystore/health/active.go
Comment thread lib/auth/keystore/manager.go Outdated
Comment on lines +319 to +322
func (m *Manager) GetTLSSigner(ctx context.Context, ca types.CertAuthority) (crypto.Signer, error) {
_, signer, err := m.getTLSCertAndSigner(ctx, ca.GetActiveKeys())
return signer, trace.Wrap(err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't you just call GetTLSCertAndSigner instead of adding this method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was trying to make the API more consistent but I could also just drop the cert in the health checker

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't really matter, but if you prefer to add the GetTLSSigner method, can you just move it right next to GetTLSCertAndSigner?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved in 3b14f71

Comment thread lib/services/watcher_test.go Outdated
}

caService := local.NewCAService(bk)
//nolint:staticcheck // SA1019 This should be updated to use [services.NewCertAuthorityWatcher]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test of the deprecated watcher should not be updated though

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just a bad diff, the deprecated watcher test is not modified.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's right, this test should exercise the deprecated watcher, I just don't think the comment should say it should be updated (i assume it's a copy/paste error)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I miss understood. I will update to say something along the lines of "remove the test when the deprecated method is no longer used"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the lint comment in 6b773ab

Comment thread lib/auth/keystore/health/active.go
Comment on lines +206 to +214
if last == nil {
return c.signers[0], nil
}
for i := range c.signers {
if c.signers[i].Equal(last) {
return c.signers[(i+1)%n], nil
}
}
return c.signers[0], nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process of selecting the next signer is a bit funky. The local signer store isn't sorted, and the API doesn't specify that the upstream data source sending the new signer list needs to send them sorted. Additionally, this seeking logic just always restarts at the beginning if it can't find where it left off.

Admittedly, the set of signers for a cluster is small and relatively static, so these may be non-issues. I doubt we'd actually have problems unless something else was going wrong, but if you find the time I think it would be better practice to do one of the following:

  • Sort the local signer store by <domain-name>/<kind> (or vise-versa) and then select the lexicographically next value. That way we don't need to rely on the upstream yielding a sorted sequence, and can continue where we left off in the face of removals.

  • Do away with stepping through one signer each "tick" in favor of just health checking all of them each tick. If load is a concern, adding a cpl hundred ms of rate limiting to the inner check loop would probably be fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I will update to use the lexical sorting approach.

This is mostly intended for KMS integrations which have quota and cost implications. for example aws kms asymmetric crypto operations have a default 1000 req/second limit (for a region not per key) and costs $0.15 / 10000 requests.

at a rate of 1 req/minute the cost works out to $7.88 per key per year. Or just $7.88 since we're doing 1 key at a time.

we don't do it often but we did add two CAs in v18 and each CA can contain multiple keys so I'd rather keep the load/cost of this small and predictable.

@dboslee dboslee requested a review from nklaassen December 17, 2025 19:25
@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from rudream January 3, 2026 01:10
@dboslee dboslee enabled auto-merge January 5, 2026 14:14
@dboslee dboslee added this pull request to the merge queue Jan 5, 2026
Merged via the queue into master with commit a5a9e0b Jan 5, 2026
42 of 43 checks passed
@dboslee dboslee deleted the david/kms-healthcheck branch January 5, 2026 15:56
@dboslee dboslee restored the david/kms-healthcheck branch January 5, 2026 17:28
dboslee added a commit that referenced this pull request Jan 6, 2026
* keystore: add active health checker

the active health checker performs signing requests against CAs.

failed signing requests are reported back to a callback which will
trip teleports readiness state.

the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.

this can be enabled by configuring

```yaml
keystore:
  health_check:
    active:
      enabled: true

```

this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker

* improve godocs

* simplify logic

* fix outdated comment

* add file license header

* fix lint

* return curr signer on health failure

* refactor watch loop to improve readability

* fix comments

* add context to deletekey godocs

* move GetTLSSigner closer to similar func def

* use synctest and remove fake clock

* update lint comment

* use lexical sorting to select next key
21KennethTran pushed a commit that referenced this pull request Jan 6, 2026
* keystore: add active health checker

the active health checker performs signing requests against CAs.

failed signing requests are reported back to a callback which will
trip teleports readiness state.

the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.

this can be enabled by configuring

```yaml
keystore:
  health_check:
    active:
      enabled: true

```

this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker

* improve godocs

* simplify logic

* fix outdated comment

* add file license header

* fix lint

* return curr signer on health failure

* refactor watch loop to improve readability

* fix comments

* add context to deletekey godocs

* move GetTLSSigner closer to similar func def

* use synctest and remove fake clock

* update lint comment

* use lexical sorting to select next key
github-merge-queue bot pushed a commit that referenced this pull request Jan 14, 2026
* [v18] keystore: add active health checker (#61962)

* keystore: add active health checker

the active health checker performs signing requests against CAs.

failed signing requests are reported back to a callback which will
trip teleports readiness state.

the health checker watches CA events to ensure that the keys being
tested are kept up-to-date.

this can be enabled by configuring

```yaml
keystore:
  health_check:
    active:
      enabled: true

```

this config format leaves room for adding alternative health checking methods
and additional parameters for the active health checker

* improve godocs

* simplify logic

* fix outdated comment

* add file license header

* fix lint

* return curr signer on health failure

* refactor watch loop to improve readability

* fix comments

* add context to deletekey godocs

* move GetTLSSigner closer to similar func def

* use synctest and remove fake clock

* update lint comment

* use lexical sorting to select next key

* remove use of synctest since it is experimental in 1.24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants