[v18] Add health checks to Kubernetes #60492
Merged
rana merged 10 commits into branch/v18 from rana/backport-kube-health-checks/v18 on Oct 29, 2025
@rana - this PR will require admin approval to merge due to its size. Consider breaking it up into a series of smaller changes.
GavinFrazar approved these changes on Oct 28, 2025
tigrato approved these changes on Oct 28, 2025
rosstimothy approved these changes on Oct 29, 2025
- Add Kubernetes label matchers to `Matcher` for `HealthCheckConfig`
- Add message `KubernetesServerStatusV3`
- Add `status` field to `KubernetesServerV3`
- Add `target_health` field to `Kube` for UI
- Regenerate Terraform schema and docs for `HealthCheckConfig`
- Add Kubernetes label matchers to Terraform test `TestImportHealthCheckConfig`

Relates to #58413

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
…er` interface (#59396)

The main intent of this refactoring is to provide health check extensibility for Kubernetes while supporting the existing DB health checks. A `HealthChecker` interface is added to support the different health check approaches of DBs and Kubernetes. Existing DB TCP health check logic is moved to a new `TargetDialer` struct. Changes:

- Added `HealthChecker` interface with two functions:
  - `CheckHealth(ctx context.Context) ([]string, error)`
  - `GetProtocol() types.TargetHealthProtocol`
- Added `TargetDialer` struct, which encapsulates the existing TCP health check logic
- Changed `Target` struct to use the `HealthChecker` interface
- Changed `worker.checkHealth` to call the new `CheckHealth` function
- Removed the `protocol` field from `healthCheckConfig`
- Added `TargetHealthProtocolHTTP` for use with Kubernetes health checks
- Moved and renamed test `Test_dialEndpoints` to `TestTargetDialer_dialEndpoints`
- Added files `net.go` and `net_test.go` for `TargetDialer`

Part of #58413
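A minimal, self-contained sketch of the interface split described in that commit. The two interface methods come from the commit message; everything else (the `Endpoints` and `Timeout` fields, the string-typed protocol enum) is an illustrative assumption — the real types live in Teleport's `healthcheck` and `types` packages:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// TargetHealthProtocol stands in for types.TargetHealthProtocol,
// simplified to a string for this sketch.
type TargetHealthProtocol string

const (
	TargetHealthProtocolTCP  TargetHealthProtocol = "tcp"
	TargetHealthProtocolHTTP TargetHealthProtocol = "http"
)

// HealthChecker mirrors the interface added in #59396: one implementation
// per resource kind (DB TCP dialing today, Kubernetes HTTP checks next).
type HealthChecker interface {
	// CheckHealth returns the endpoints that were checked and an error
	// when the target is unhealthy.
	CheckHealth(ctx context.Context) ([]string, error)
	GetProtocol() TargetHealthProtocol
}

// TargetDialer encapsulates the pre-existing DB TCP health check logic.
// Field names here are assumptions for illustration.
type TargetDialer struct {
	Endpoints []string
	Timeout   time.Duration
}

func (d *TargetDialer) CheckHealth(ctx context.Context) ([]string, error) {
	for _, ep := range d.Endpoints {
		conn, err := net.DialTimeout("tcp", ep, d.Timeout)
		if err != nil {
			return d.Endpoints, fmt.Errorf("dial %s: %w", ep, err)
		}
		conn.Close()
	}
	return d.Endpoints, nil
}

func (d *TargetDialer) GetProtocol() TargetHealthProtocol {
	return TargetHealthProtocolTCP
}

func main() {
	// Dial a listener we control so the example is self-contained.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Close()
		}
	}()

	var hc HealthChecker = &TargetDialer{
		Endpoints: []string{ln.Addr().String()},
		Timeout:   time.Second,
	}
	eps, err := hc.CheckHealth(context.Background())
	fmt.Println(hc.GetProtocol(), len(eps), err == nil)
}
```

The point of the split: the worker only sees `HealthChecker`, so adding Kubernetes HTTP checks means adding another implementation rather than threading protocol-specific branches through the worker.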
Implements core health check logic for Kubernetes.
- Added functions `CheckHealth` and `GetProtocol` to `kubeDetails`
- Added a `HealthCheckManager` field to the Kubernetes `TLSServerConfig`
- Kube agent and proxy now instantiate a `HealthCheckManager`
- Added a `HealthCheckConfigReader` to interfaces `ReadKubernetesAccessPoint` and `ReadProxyAccessPoint`
- Added `HealthCheckConfig` read-only permissions for proxy and kube
- Added `HealthCheckConfig` watching for proxy and kube
- Added functions to `KubeServer` interface and `KubernetesServerV3` struct:
- `GetTargetHealth() TargetHealth`
- `SetTargetHealth(h TargetHealth)`
- `GetTargetHealthStatus() TargetHealthStatus`
- `SetTargetHealthStatus(status TargetHealthStatus)`
- Health check supports dynamic discovery of Kubernetes clusters. Calls to `startHealthCheck()` and `stopHealthCheck()` were rearranged.
- Added functions `startHeartbeatAndHealthCheck()` and `stopHeartbeatAndHealthCheck()`
- Moved the call to `HealthCheckManager.Start()` outside of the Kubernetes proxy server, providing a future option to reuse `HealthCheckManager` across multiple proxy services
- Removed `Status` initialization in KubernetesServerV3 `CheckAndSetDefaults()`
- Added `kubernetesLabelMatchers` to `healthCheckConfig` struct. Default presets are omitted until the entire kube health check is complete.
- Added Kubernetes matcher checking to `ValidateHealthCheckConfig()`
- Changed `ValidateHealthCheckConfig()` to allow zero total DB matchers and Kubernetes matchers.
- Changed KubernetesServerV3 `GetTargetHealthStatus()` to return `TargetHealthStatusUnknown` instead of an empty string
- Changed health check worker `getTargetHealthTimeout` default timeout to `4s` from `10s`. This potentially reduces the response time of the initial heartbeat polling call to `GetServerInfo()`.
Part of #58413
Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
Adds health status indicators for Kubernetes clusters on the Resources page. Unhealthy clusters are highlighted, and clicking them opens a side panel displaying server information. Changes include:

- New `KubeServer` protobuf message and `ListKubernetesServers` RPC
- Web and Connect API endpoints for fetching Kubernetes server data
- Health status filtering in `matchAndFilterKubeClusters`
- `TargetHealth` fields added to frontend/backend types
- Updated `StatusInfo.tsx` to display `kube_cluster` data

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Kubernetes servers are grouped by health and dialed in order of healthy, unknown, then unhealthy, with random shuffling within each group for load distribution. The grouping and shuffling are implemented generically in a separate iterator function, `OrderByTargetHealthStatus()`, for reuse and testability. Changes:

- Added `healthcheck.OrderByTargetHealthStatus()` function with tests

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Health check configuration can be disabled and enabled through the matcher. Disabling the matcher disables health checks for this configuration's resources, including databases, Kubernetes clusters, and any future resources. Changes:

- Added `disabled` field to the health check matcher proto
- Updated matcher selection logic
- Updated unit tests

Part of #58413
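As a sketch, a config using the new field might look like the following. The placement of `disabled` under `spec.match` follows the `health_check_config` resources shown in this PR's manual testing notes; the resource name and labels here are illustrative:

```yaml
kind: health_check_config
metadata:
  name: example
spec:
  match:
    # disabled: true turns off health checks for every resource this
    # config matches (databases, Kubernetes clusters, future kinds).
    disabled: true
    kubernetes_labels:
    - name: '*'
      values:
      - '*'
```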
A new Kubernetes-specific health check page is added. The existing Kubernetes troubleshooting documentation page is updated with health-check-specific error resolutions.

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Gavin Frazar <gavin.frazar@goteleport.com>
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
Health checks are enabled for all Kubernetes clusters by default. A design of one default health check config per resource type is implemented. This choice eases adoption of health checks, supports existing clusters that already have database health checks, and avoids migrating the backend database. A new Kubernetes-specific `default-kube` health check config is added; the database-specific `default` health check config already exists and is preserved.

A virtual default design is implemented by returning health check configs from memory if they don't exist in the backend database. This approach has the benefit of not re-inserting default values into the backend after they're deleted, which a prior approach did. Virtual defaults are added at the local health check service level and returned from the functions `GetHealthCheckConfig` and `ListHealthCheckConfigs`. Virtual defaults may be written to, updated in, and deleted from the backend. While a virtual default may be deleted, deletion has the net effect of resetting the config to default settings, matching all resources of that type (db, kube). Virtual defaults are always returned from health check `get` and `list` functions.

Changes:

- Added `default-kube` health check config specific to Kubernetes only
- Updated local service functions `GetHealthCheckConfig` and `ListHealthCheckConfigs` to return virtual defaults
- Added unit tests
- Updated health check documentation with `default-kube` and info about virtual defaults

Part of #58413

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
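A rough sketch of the virtual-default lookup described above. The map-backed `Service`, the trimmed `HealthCheckConfig` struct, and the method bodies are illustrative stand-ins for the real backend-backed local service; only the function names and the `default`/`default-kube` config names come from the PR:

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// HealthCheckConfig is a trimmed stand-in for the real resource.
type HealthCheckConfig struct {
	Name string
	Spec string
}

// ErrNotFound stands in for the backend's not-found error.
var ErrNotFound = errors.New("not found")

// virtualDefaults are served from memory when absent from the backend,
// so deleting a default never requires re-inserting it.
var virtualDefaults = map[string]HealthCheckConfig{
	"default":      {Name: "default", Spec: "match all databases"},
	"default-kube": {Name: "default-kube", Spec: "match all kube clusters"},
}

// Service sketches the local health check config service layered over a
// backend, represented here as a plain map.
type Service struct {
	backend map[string]HealthCheckConfig
}

// GetHealthCheckConfig prefers the persisted config and falls back to the
// in-memory virtual default.
func (s *Service) GetHealthCheckConfig(name string) (HealthCheckConfig, error) {
	if c, ok := s.backend[name]; ok {
		return c, nil
	}
	if c, ok := virtualDefaults[name]; ok {
		return c, nil
	}
	return HealthCheckConfig{}, ErrNotFound
}

// ListHealthCheckConfigs merges backend configs with any virtual defaults
// the backend does not override, so defaults always appear in list output.
func (s *Service) ListHealthCheckConfigs() []HealthCheckConfig {
	merged := map[string]HealthCheckConfig{}
	for name, c := range virtualDefaults {
		merged[name] = c
	}
	for name, c := range s.backend {
		merged[name] = c
	}
	names := make([]string, 0, len(merged))
	for n := range merged {
		names = append(names, n)
	}
	sort.Strings(names)
	out := make([]HealthCheckConfig, 0, len(names))
	for _, n := range names {
		out = append(out, merged[n])
	}
	return out
}

func main() {
	svc := &Service{backend: map[string]HealthCheckConfig{}}
	c, _ := svc.GetHealthCheckConfig("default-kube")
	fmt.Println(c.Name, len(svc.ListHealthCheckConfigs())) // default-kube 2
}
```

Persisting an edited default (e.g. via `tctl edit`) simply shadows the in-memory entry, which matches the "no longer virtual" behavior exercised in the manual testing below.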
Filters virtual defaults to allow explicit config references. Part of #58413
A v18 backport of multiple PRs:

- `healthcheck` for Kubernetes extensibility with `HealthChecker` interface #59396

Relates to:
Changelog: Added health checks for enrolled Kubernetes clusters
Manual Testing
Deleted backend database and started with a clean slate.
Started teleport with auth + proxy + kube agent.
1. Initial state observed
This is equivalent to a new cluster installation.
- No `health_check_config` records. Virtual default being used.

`tctl` exercised

- `tctl get kube_server/colima`. Kube cluster `target_health` is healthy.

  ```yaml
  kind: kube_server
  spec:
    ...
    version: 18.2.10
  status:
    target_health:
      address: 192.168.106.2:57500
      message: 1 health check passed
      protocol: http
      status: healthy
      transition_reason: threshold_reached
      transition_timestamp: "2025-10-28T22:32:12.643374Z"
  ```

- `tctl top` and saw two raw metrics present. Two metrics are expected in the initial state.

  ```
  ╭Prometheus Metrics───────────────────────────────────────────╮
  │ Filter: teleport_resources_health_status                    │
  │                                                             │
  │ 2 items • 663 filtered                                      │
  │                                                             │
  │ teleport_resources_health_status_healthy{type="kubernetes"} │
  │ 1.00                                                        │
  │                                                             │
  │ teleport_resources_health_status_unknown{type="kubernetes"} │
  │ 0.00                                                        │
  ```

UIs viewed

- `colima` is present and in a normal state
- `colima` is present and in a normal state

2. Unhealthy state observed
Kubernetes was stopped to simulate a network failure with `colima stop`.

`tctl` exercised

- `tctl get kube_server/colima`. Kube cluster `target_health` is unhealthy.

  ```yaml
  kind: kube_server
  spec:
    ...
    version: 18.2.10
  status:
    target_health:
      address: 192.168.106.2:57500
      message: 1 health check failed
      protocol: http
      status: unhealthy
      transition_error: Unable to contact the Kubernetes cluster. Please see
        the Kubernetes Access Troubleshooting guide,
        https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting.
      transition_reason: threshold_reached
      transition_timestamp: "2025-10-28T23:07:58.797991Z"
  ```

- `tctl top` and saw three raw metrics present. Three metrics are expected now that an unhealthy cluster is reported.

  ```
  ╭Prometheus Metrics─────────────────────────────────────────────╮
  │ Filter: teleport_resources_health_status                      │
  │                                                               │
  │ 3 items • 664 filtered                                        │
  │                                                               │
  │ teleport_resources_health_status_healthy{type="kubernetes"}   │
  │ 0.00                                                          │
  │                                                               │
  │ teleport_resources_health_status_unknown{type="kubernetes"}   │
  │ 0.00                                                          │
  │                                                               │
  │ teleport_resources_health_status_unhealthy{type="kubernetes"} │
  │ 1.00                                                          │
  ```

UIs viewed
- `colima` is present and shows a warning indicator. Clicking the Kubernetes cluster resource displays a side panel with error details.
- `colima` is present and shows a warning indicator. Clicking the Kubernetes cluster resource displays a side panel with error details.

Kubernetes was restarted with `colima start`.

`tctl` exercised

- `tctl get kube_server/colima`. Kube cluster `target_health` is healthy.

  ```yaml
  kind: kube_server
  spec:
    ...
    version: 18.2.10
  status:
    target_health:
      address: 192.168.106.2:57500
      message: 2 health checks passed
      protocol: http
      status: healthy
      transition_reason: threshold_reached
      transition_timestamp: "2025-10-28T23:34:30.71887Z"
  ```

UIs viewed

- `colima` is present and in a normal state
- `colima` is present and in a normal state

3. Disabled state observed
The virtual default health check config was disabled and persisted to the backend.
- `tctl edit health_check_config/default-kube` and set `disabled: true`.

  ```diff
  kind: health_check_config
  metadata:
    description: Enables health checks for all Kubernetes clusters by default.
    name: default-kube
    revision: d796f007-e60c-4747-8dde-f479aff6b743
  spec:
    match:
  +   disabled: true
      kubernetes_labels:
      - name: '*'
        values:
        - '*'
  ```

- `tctl get kube_server/colima`. Kube cluster `target_health` is disabled.

  ```yaml
  kind: kube_server
  spec:
    ...
    version: 18.2.10
  status:
    target_health:
      message: No health check config matches this resource
      protocol: http
      status: unknown
      transition_reason: disabled
      transition_timestamp: "2025-10-29T00:59:24.040189Z"
  ```

- The backend now stores `/health_check_config/default-kube` with a value. The default is no longer virtual.
- `tctl edit health_check_config/default-kube`, setting `disabled: false`.

  ```diff
  kind: health_check_config
  metadata:
    description: Enables health checks for all Kubernetes clusters by default.
    name: default-kube
    revision: d796f007-e60c-4747-8dde-f479aff6b743
  spec:
    match:
  +   disabled: false
      kubernetes_labels:
      - name: '*'
        values:
        - '*'
  ```

- `tctl get kube_server/colima`. Kube cluster `target_health` is enabled.

  ```yaml
  kind: kube_server
  spec:
    ...
    version: 18.2.10
  status:
    target_health:
      address: 192.168.106.2:57500
      message: 1 health check passed
      protocol: http
      status: healthy
      transition_reason: threshold_reached
      transition_timestamp: "2025-10-29T01:06:18.393322Z"
  ```

4. Excluded matcher state observed
The kube matcher was configured to exclude the kube cluster.
- `tctl edit health_check_config/default-kube` and added a kube matcher label `does-not-exist`.

  ```diff
  kind: health_check_config
  metadata:
    description: Enables health checks for all Kubernetes clusters by default.
    name: default-kube
    revision: d796f007-e60c-4747-8dde-f479aff6b743
  spec:
    match:
      kubernetes_labels:
  +   - name: 'does-not-exist'
        values:
        - '*'
  ```

5. Virtual default behaviors observed
At this point `default-kube` was persisted to the backend from previous steps. It's no longer virtual.

Teleport was restarted.
This is equivalent to an existing cluster upgraded to version `18.3` and running for the first time.

- `tctl get health_check_config/default-kube` shows the persisted config between restarts.
- `tctl rm health_check_config/default-kube` and deleted the persisted virtual default.
- No `health_check_config` records. Virtual default being used.