
[v18] Add health checks to Kubernetes #60492

Merged
rana merged 10 commits into branch/v18 from rana/backport-kube-health-checks/v18
Oct 29, 2025
Conversation

Contributor

@rana rana commented Oct 23, 2025

A v18 backport of multiple PRs:

Relates to:


Changelog: Added health checks for enrolled Kubernetes clusters


Manual Testing
  • Environment: macOS, Colima, Kubernetes v1.33.3+k3s1
  • Branch: rana/backport-kube-health-checks/v18

Deleted backend database and started with a clean slate.

Started Teleport with auth + proxy + kube agent.

1. Initial state observed

This is equivalent to a new cluster installation.

  • Kube health checks are enabled by default, as shown in the Teleport terminal output
2025-10-28T15:21:35.935-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-28T15:21:35.935-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default-kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-28T15:21:35.942-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Backend DB contains no health_check_config records; the virtual default is being used.
  • tctl exercised
    • Ran tctl get kube_server/colima. Kube cluster target_health is healthy.
    kind: kube_server
    spec:
      ...
      version: 18.2.10
    status:
      target_health:
        address: 192.168.106.2:57500
        message: 1 health check passed
        protocol: http
        status: healthy
        transition_reason: threshold_reached
        transition_timestamp: "2025-10-28T22:32:12.643374Z"
    • Ran tctl top and saw two raw metrics present. Two metrics are expected in the initial state.
    ╭Prometheus Metrics───────────────────────────────────────────╮
    │   Filter: teleport_resources_health_status                  │
    │                                                             │
    │   2 items • 663 filtered                                    │
    │                                                             │
    │ teleport_resources_health_status_healthy{type="kubernetes"} │
    │ 1.00                                                        │
    │                                                             │
    │ teleport_resources_health_status_unknown{type="kubernetes"} │
    │ 0.00                                                        │
  • UIs viewed
    • Viewed Web UI. Kube resource colima is present and in a normal state
    • Viewed Connect UI. Kube resource colima is present and in a normal state
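The interval and threshold values in the startup log above (interval:30s, timeout:5s, healthy_threshold:2, unhealthy_threshold:1) drive the status transitions seen throughout these tests. Below is a minimal, illustrative sketch of how such a threshold-based tracker might behave; it is not Teleport's actual worker, and the single-pass transition out of `unknown` is an assumption inferred from the "1 health check passed" log line:

```go
package main

import "fmt"

// healthState is a toy threshold-based health tracker, not Teleport's
// implementation. It mirrors the observed behavior: the first pass from
// "unknown" flips straight to healthy, recovering from unhealthy takes
// healthy_threshold passes, and one failure suffices when
// unhealthy_threshold is 1.
type healthState struct {
	status                               string // "unknown", "healthy", or "unhealthy"
	passing, failing                     int
	healthyThreshold, unhealthyThreshold int
}

// report records one check result and returns the (possibly new) status.
func (h *healthState) report(passed bool) string {
	if passed {
		h.passing++
		h.failing = 0
		// Assumption: the initial transition out of "unknown" needs only one pass.
		if h.status == "unknown" || h.passing >= h.healthyThreshold {
			h.status = "healthy"
		}
	} else {
		h.failing++
		h.passing = 0
		if h.failing >= h.unhealthyThreshold {
			h.status = "unhealthy"
		}
	}
	return h.status
}

func main() {
	h := &healthState{status: "unknown", healthyThreshold: 2, unhealthyThreshold: 1}
	fmt.Println(h.report(true))  // first pass from unknown: healthy
	fmt.Println(h.report(false)) // single failure: unhealthy
	fmt.Println(h.report(true))  // one pass is not enough to recover
	fmt.Println(h.report(true))  // second consecutive pass recovers
}
```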

2. Unhealthy state observed

Kubernetes was stopped to simulate a network failure with colima stop.

  • Teleport terminal output shows an unhealthy kube cluster
2025-10-28T16:07:58.797-07:00 DEBU [KUBERNETE] Failed health check target_name:colima target_kind:kube_cluster target_origin: error:[
ERROR REPORT:
Original Error: *trace.ConnectionProblemError Unable to contact the Kubernetes cluster. Please see the Kubernetes Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting.
Stack Trace:
        github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:346 github.com/gravitational/teleport/lib/kube/proxy.(*kubeDetails).checkHealthReadyz
        github.com/gravitational/teleport/lib/kube/proxy/cluster_details.go:324 github.com/gravitational/teleport/lib/kube/proxy.(*kubeDetails).CheckHealth
        github.com/gravitational/teleport/lib/healthcheck/worker.go:278 github.com/gravitational/teleport/lib/healthcheck.(*worker).checkHealth
        github.com/gravitational/teleport/lib/healthcheck/worker.go:216 github.com/gravitational/teleport/lib/healthcheck.(*worker).run
        runtime/asm_arm64.s:1268 runtime.goexit
User Message: Unable to contact the Kubernetes cluster. Please see the Kubernetes Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting.] healthcheck/worker.go:293
2025-10-28T16:07:58.797-07:00 WARN [KUBERNETE] Target became unhealthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check failed healthcheck/worker.go:417
  • tctl exercised

    • Ran tctl get kube_server/colima. Kube cluster target_health is unhealthy.
    kind: kube_server
    spec:
      ...
      version: 18.2.10
    status:
      target_health:
        address: 192.168.106.2:57500
        message: 1 health check failed
        protocol: http
        status: unhealthy
        transition_error: Unable to contact the Kubernetes cluster. Please see the Kubernetes
          Access Troubleshooting guide, https://goteleport.com/docs/enroll-resources/kubernetes-access/troubleshooting.
        transition_reason: threshold_reached
        transition_timestamp: "2025-10-28T23:07:58.797991Z"
    • Ran tctl top and saw three raw metrics present. Three metrics are expected now that an unhealthy cluster is reported.
    ╭Prometheus Metrics─────────────────────────────────────────────╮
    │   Filter: teleport_resources_health_status                    │
    │                                                               │
    │   3 items • 664 filtered                                      │
    │                                                               │
    │ teleport_resources_health_status_healthy{type="kubernetes"}   │
    │ 0.00                                                          │
    │                                                               │
    │ teleport_resources_health_status_unknown{type="kubernetes"}   │
    │ 0.00                                                          │
    │                                                               │
    │ teleport_resources_health_status_unhealthy{type="kubernetes"} │
    │ 1.00                                                          │
  • UIs viewed

    • I waited about 5m and clicked the refresh icon.
    • Viewed Web UI. Kube resource colima is present and shows a warning indicator. Clicking the Kubernetes cluster resource displays a side panel with error details.
    (screenshot: kube-healthcheck-unhealthy-webui-1)
    • Viewed Connect UI. Kube resource colima is present and shows a warning indicator. Clicking the Kubernetes cluster resource displays a side panel with error details.
    (screenshot: kube-healthcheck-unhealthy-connectui-1)

Kubernetes was restarted with colima start.

  • Teleport terminal output shows a healthy kube cluster
2025-10-28T16:34:30.718-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:2 health checks passed healthcheck/worker.go:411
  • tctl exercised
    • Ran tctl get kube_server/colima. Kube cluster target_health is healthy.
    kind: kube_server
    spec:
      ...
      version: 18.2.10
    status:
      target_health:
        address: 192.168.106.2:57500
        message: 2 health checks passed
        protocol: http
        status: healthy
        transition_reason: threshold_reached
        transition_timestamp: "2025-10-28T23:34:30.71887Z"
  • UIs viewed
    • I waited about 5m and clicked the refresh icon.
    • Viewed Web UI. Kube resource colima is present and in a normal state
    • Viewed Connect UI. Kube resource colima is present and in a normal state

3. Disabled state observed

The virtual default health check config was disabled and persisted to the backend.

  • Ran tctl edit health_check_config/default-kube and set disabled: true.
kind: health_check_config
metadata:
  description: Enables health checks for all Kubernetes clusters by default.
  name: default-kube
  revision: d796f007-e60c-4747-8dde-f479aff6b743
spec:
  match:
+    disabled: true
    kubernetes_labels:
    - name: '*'
      values:
      - '*'
  • Teleport terminal output shows health checks disabled for the kube cluster
2025-10-28T17:59:24.040-07:00 INFO [KUBERNETE] Health checker stopped target_name:colima target_kind:kube_cluster target_origin: healthcheck/worker.go:251
2025-10-28T17:59:24.040-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:disabled message:No health check config matches this resource healthcheck/worker.go:423
  • Ran tctl get kube_server/colima. Kube cluster target_health shows status unknown with reason disabled.
kind: kube_server
spec:
  ...
  version: 18.2.10
status:
  target_health:
    message: No health check config matches this resource
    protocol: http
    status: unknown
    transition_reason: disabled
    transition_timestamp: "2025-10-29T00:59:24.040189Z"
  • Backend DB table kv contains key /health_check_config/default-kube with a value. The default is no longer virtual.
  • Ran tctl edit health_check_config/default-kube setting disabled: false.
kind: health_check_config
metadata:
  description: Enables health checks for all Kubernetes clusters by default.
  name: default-kube
  revision: d796f007-e60c-4747-8dde-f479aff6b743
spec:
  match:
+    disabled: false
    kubernetes_labels:
    - name: '*'
      values:
      - '*'
  • Teleport terminal output shows health checks enabled for the kube cluster
2025-10-28T18:06:17.049-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default-kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-28T18:06:17.050-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-28T18:06:18.393-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Ran tctl get kube_server/colima. Kube cluster target_health is healthy again.
kind: kube_server
spec:
  ...
  version: 18.2.10
status:
  target_health:
    address: 192.168.106.2:57500
    message: 1 health check passed
    protocol: http
    status: healthy
    transition_reason: threshold_reached
    transition_timestamp: "2025-10-29T01:06:18.393322Z"

4. Excluded matcher state observed

The kube matcher was configured to exclude the kube cluster.

  • Ran tctl edit health_check_config/default-kube and added a kube matcher label does-not-exist.
kind: health_check_config
metadata:
  description: Enables health checks for all Kubernetes clusters by default.
  name: default-kube
  revision: d796f007-e60c-4747-8dde-f479aff6b743
spec:
  match:
    kubernetes_labels:
+    - name: 'does-not-exist'
      values:
      - '*'
  • Teleport terminal output shows health checks are disabled for the kube cluster
2025-10-28T18:17:52.039-07:00 INFO [KUBERNETE] Health checker stopped target_name:colima target_kind:kube_cluster target_origin: healthcheck/worker.go:251
2025-10-28T18:17:52.039-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:disabled message:No health check config matches this resource healthcheck/worker.go:423
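The exclusion behavior above can be illustrated with a toy label matcher. This is not Teleport's actual matching code; `matchLabels` and its semantics are simplified assumptions based on the `kubernetes_labels` examples shown (a `{name: "*", values: ["*"]}` entry matches every cluster, while a `does-not-exist` label name matches nothing unless the cluster carries that label):

```go
package main

import "fmt"

// matchLabels reports whether a resource's labels satisfy every matcher
// entry. Hypothetical simplification: a "*" name matches any label set,
// and a "*" value matches any label value.
func matchLabels(matchers map[string][]string, resourceLabels map[string]string) bool {
	for name, values := range matchers {
		if name == "*" {
			continue // wildcard name: matches regardless of labels
		}
		got, ok := resourceLabels[name]
		if !ok {
			return false // required label missing, e.g. "does-not-exist"
		}
		matched := false
		for _, v := range values {
			if v == "*" || v == got {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{"env": "dev"}
	fmt.Println(matchLabels(map[string][]string{"*": {"*"}}, labels))              // matches everything
	fmt.Println(matchLabels(map[string][]string{"does-not-exist": {"*"}}, labels)) // excludes the cluster
}
```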

5. Virtual default behaviors observed

At this point default-kube was persisted to the backend from previous steps. It's no longer virtual.

Teleport was restarted.

This is equivalent to an existing cluster upgraded to version 18.3 and running for the first time.

  • Teleport terminal output shows health checks are disabled for the kube cluster
2025-10-28T18:21:55.255-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:disabled message:No health check config matches this resource healthcheck/worker.go:423
  • tctl get health_check_config/default-kube shows the persisted config between restarts.
kind: health_check_config
metadata:
  description: Enables health checks for all Kubernetes clusters by default.
  name: default-kube
  revision: b9003591-26d5-4a75-853e-0e7e17544b69
spec:
  match:
    kubernetes_labels:
    - name: does-not-exist
      values:
      - '*'
  • Ran tctl rm health_check_config/default-kube and deleted the persisted virtual default.
  • Teleport terminal output shows the virtual default deleted and health checks re-enabled for the kube cluster. Deleting a persisted virtual default has the net effect of resetting to default settings.
2025-10-28T18:29:37.550-07:00 INFO  emitting audit event event_type:health_check_config.delete fields:map[addr.remote:127.0.0.1:63550 cluster_name:teleport-laptop code:THCC003I ei:0 event:health_check_config.delete expires:0001-01-01T00:00:00Z name:default-kube time:2025-10-29T01:29:37.55Z trace.component:audit uid:6cbad5eb-84e5-4c48-a86c-1b319a5adef0 user:7314368c-a7e9-42a1-be49-86342d4b91c4.teleport-laptop user_cluster_name:teleport-laptop user_kind:3 user_roles:[Admin]] events/emitter.go:287
2025-10-28T18:29:38.209-07:00 INFO [KUBERNETE] Health checker started target_name:colima target_kind:kube_cluster target_origin: health_check_config:default-kube interval:30s timeout:5s healthy_threshold:2 unhealthy_threshold:1 healthcheck/worker.go:234
2025-10-28T18:29:38.209-07:00 DEBU [KUBERNETE] Target health status is unknown target_name:colima target_kind:kube_cluster target_origin: reason:initialized message:Health checker initialized healthcheck/worker.go:423
2025-10-28T18:29:46.730-07:00 INFO [KUBERNETE] Target became healthy target_name:colima target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Ran tctl rm health_check_config/default-kube again, this time deleting the unaltered virtual default.
  • Teleport terminal output shows the virtual default deleted, but the deletion has no net effect on health checks when the default was never altered by a user.
2025-10-28T18:31:46.746-07:00 INFO  emitting audit event event_type:health_check_config.delete fields:map[addr.remote:127.0.0.1:63561 cluster_name:teleport-laptop code:THCC003I ei:0 event:health_check_config.delete expires:0001-01-01T00:00:00Z name:default-kube time:2025-10-29T01:31:46.746Z trace.component:audit uid:b2fd41ba-6e0e-4c6b-b08e-d15e7220adfa user:7314368c-a7e9-42a1-be49-86342d4b91c4.teleport-laptop user_cluster_name:teleport-laptop user_kind:3 user_roles:[Admin]] events/emitter.go:287
  • Backend DB contains no health_check_config records; the virtual default is being used.

@github-actions
Contributor

github-actions bot commented Oct 23, 2025

Amplify deployment status

| Branch | Commit | Job ID | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- | --- | --- |
| rana/backport-kube-health-checks/v18 | HEAD | 1 | ❌ FAILED | rana-backport-kube-health-checks-v18 | 2025-10-29 18:26:20 |

@rana rana force-pushed the rana/backport-kube-health-checks/v18 branch from 0ec7622 to 525c865 Compare October 27, 2025 18:34
@rana rana marked this pull request as ready for review October 27, 2025 23:23
@github-actions github-actions bot added the database-access, documentation, size/xl, and ui labels Oct 27, 2025
@public-teleport-github-review-bot

@rana - this PR will require admin approval to merge due to its size. Consider breaking it up into a series of smaller changes.

@rana rana force-pushed the rana/backport-kube-health-checks/v18 branch from f1d49ce to fba612f Compare October 29, 2025 16:18
@rana rana force-pushed the rana/backport-kube-health-checks/v18 branch from fba612f to a7f0da6 Compare October 29, 2025 16:28
rana and others added 3 commits October 29, 2025 09:38
- Add Kubernetes label matchers to `Matcher` for `HealthCheckConfig`
- Add message `KubernetesServerStatusV3`
- Add `status` field to `KubernetesServerV3`
- Add `target_health` field to `Kube` for UI
- Regenerate Terraform schema and docs for `HealthCheckConfig`
- Add Kubernetes label matchers to Terraform test `TestImportHealthCheckConfig`

Relates to #58413

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
…er` interface (#59396)

The main intent of refactoring is to provide health check extensibility for Kubernetes while supporting the existing DB health checks. A `HealthChecker` interface is added to support the different health check approaches of DBs and Kubernetes. Existing DB TCP health check logic is moved to a new `TargetDialer` struct.

Changes:
- Added `HealthChecker` interface with two functions:
    - `CheckHealth(ctx context.Context) ([]string, error)`
    - `GetProtocol() types.TargetHealthProtocol`
- Added `TargetDialer` struct which encapsulates existing TCP health check logic
- Changed `Target` struct to use the `HealthChecker` interface
- Changed `worker.checkHealth` to call the new `CheckHealth` function
- Removed a `protocol` field from `healthCheckConfig`
- Added `TargetHealthProtocolHTTP` for use with Kubernetes health checks
- Moved and renamed test `Test_dialEndpoints` to `TestTargetDialer_dialEndpoints`
- Added files `net.go` and `net_test.go` for `TargetDialer`

Part of #58413
Implements core health check logic for Kubernetes.

- Added functions `CheckHealth` and `GetProtocol` to `kubeDetails`
- Added a `HealthCheckManager` field to the Kubernetes `TLSServerConfig`
- Kube agent and proxy now instantiate a `HealthCheckManager`
- Added a `HealthCheckConfigReader` to interfaces `ReadKubernetesAccessPoint` and `ReadProxyAccessPoint`
- Added `HealthCheckConfig` read-only permissions for proxy and kube
- Added `HealthCheckConfig` watching for proxy and kube
- Added functions to `KubeServer` interface and `KubernetesServerV3` struct:
    - `GetTargetHealth() TargetHealth`
    - `SetTargetHealth(h TargetHealth)`
    - `GetTargetHealthStatus() TargetHealthStatus`
    - `SetTargetHealthStatus(status TargetHealthStatus)`

- Health check supports dynamic discovery of Kubernetes clusters. Calls to `startHealthCheck()` and `stopHealthCheck()` were rearranged.
    - Added functions `startHeartbeatAndHealthCheck()` and `stopHeartbeatAndHealthCheck()`
- Moved call to `HealthCheckManager.Start()` outside of the Kubernetes proxy server providing a future option to reuse `HealthCheckManager` in multiple proxy services
- Removed `Status` initialization in KubernetesServerV3 `CheckAndSetDefaults()`
- Added `kubernetesLabelMatchers` to `healthCheckConfig` struct. Default presets are omitted until the entire kube health check is complete.
- Added Kubernetes matcher checking to `ValidateHealthCheckConfig()`
- Changed `ValidateHealthCheckConfig()` to allow zero total DB matchers and Kubernetes matchers.
- Changed KubernetesServerV3 `GetTargetHealthStatus()` to return `TargetHealthStatusUnknown` instead of an empty string
- Changed health check worker `getTargetHealthTimeout` default timeout to `4s` from `10s`. This potentially reduces the response time of the initial heartbeat polling call to `GetServerInfo()`.

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
rana and others added 7 commits October 29, 2025 09:38
Adds health status indicators for Kubernetes clusters on the Resources
page. Unhealthy clusters are highlighted, and clicking them opens a side
panel displaying server information.

Changes include:
- New `KubeServer` protobuf message and `ListKubernetesServers` RPC
- Web and Connect API endpoints for fetching Kubernetes server data
- Health status filtering in `matchAndFilterKubeClusters`
- `TargetHealth` fields added to frontend/backend types
- Updated `StatusInfo.tsx` to display `kube_cluster` data

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Kubernetes servers are grouped by health, and dialed in order of healthy, unknown, then unhealthy, with random shuffling within each group for load distribution.

The grouping and shuffling is implemented generically in a separate iterator function `OrderByTargetHealthStatus()` for reuse and testability.

Changes:
- Added `healthcheck.OrderByTargetHealthStatus()` function with tests

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
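A rough sketch of the grouping-and-shuffling described above follows. The real `healthcheck.OrderByTargetHealthStatus` is generic and iterator-based; this simplified slice version only illustrates the ordering invariant (healthy, then unknown, then unhealthy, shuffled within each group):

```go
package main

import (
	"fmt"
	"math/rand"
)

type server struct {
	name   string
	health string // "healthy", "unknown", or "unhealthy"
}

// orderByHealth groups servers by health status, shuffles each group for
// load distribution, and concatenates the groups in dialing priority order.
// Illustrative stand-in, not the real iterator implementation.
func orderByHealth(servers []server) []server {
	groups := map[string][]server{}
	for _, s := range servers {
		groups[s.health] = append(groups[s.health], s)
	}
	var out []server
	for _, status := range []string{"healthy", "unknown", "unhealthy"} {
		g := groups[status]
		rand.Shuffle(len(g), func(i, j int) { g[i], g[j] = g[j], g[i] })
		out = append(out, g...)
	}
	return out
}

func main() {
	ordered := orderByHealth([]server{
		{"a", "unhealthy"}, {"b", "healthy"}, {"c", "unknown"}, {"d", "healthy"},
	})
	for _, s := range ordered {
		fmt.Println(s.name, s.health) // healthy servers print first, unhealthy last
	}
}
```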
Health check configuration can be disabled and enabled through the matcher. Disabling the matcher disables health checks for this configuration's resources including databases, Kubernetes clusters, and any future resources.

Changes:
- Added `disabled` field to health check matcher proto
- Updated matcher selection logic
- Updated unit tests

Part of #58413
A new Kubernetes-specific health check page is added. The existing Kubernetes troubleshooting documentation page is updated with health check specific error resolutions.

Part of #58413

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Gavin Frazar <gavin.frazar@goteleport.com>
Co-authored-by: Paul Gottschling <paul.gottschling@goteleport.com>
Health checks are enabled for all Kubernetes clusters by default.

A design of creating one health check config default per resource type is implemented. This choice eases adoption of health checks, supports existing clusters that already have database health checks, and avoids migrating the backend database. A new Kubernetes-specific `default-kube` health check config is added; the existing database-specific `default` health check config is preserved.

A virtual default design is implemented by returning health check configs from memory when they don't exist in the backend database. This approach has the benefit of not re-inserting default values into the backend after they're deleted, a drawback of a prior approach.

Virtual defaults are added at the local health check service level and returned from the `GetHealthCheckConfig` and `ListHealthCheckConfigs` functions. Virtual defaults may be written to, updated in, and deleted from the backend. Deleting a virtual default has the net effect of resetting the config to default settings, matching all resources of that type (db, kube). Virtual defaults are always returned from the health check `get` and `list` functions.

Changes:
- Added `default-kube` health check config specific to Kubernetes only
- Updated local service functions `GetHealthCheckConfig` and `ListHealthCheckConfigs` to return virtual defaults
- Added unit tests
- Updated health check documentation with `default-kube` and info about virtual defaults

Part of #58413

Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
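The virtual-default read path described above can be sketched as a fallback lookup. Names and the string-valued configs below are illustrative, not Teleport's actual service API; the point is only that reads fall back to an in-memory default when the backend has no record, so deleting a persisted default resets it rather than removing health checks:

```go
package main

import "fmt"

// store sketches the virtual-default behavior: the backend map holds
// persisted (user-edited) configs, the defaults map holds in-memory
// virtual defaults such as "default-kube".
type store struct {
	backend  map[string]string
	defaults map[string]string
}

func (s *store) GetHealthCheckConfig(name string) (string, bool) {
	if cfg, ok := s.backend[name]; ok {
		return cfg, true // a persisted config shadows the virtual default
	}
	if cfg, ok := s.defaults[name]; ok {
		return cfg, true // nothing persisted: serve the virtual default
	}
	return "", false
}

func (s *store) Delete(name string) {
	delete(s.backend, name) // the virtual default reappears on the next read
}

func main() {
	s := &store{
		backend:  map[string]string{},
		defaults: map[string]string{"default-kube": "enabled, matches all kube clusters"},
	}
	cfg, _ := s.GetHealthCheckConfig("default-kube")
	fmt.Println(cfg) // virtual default
	s.backend["default-kube"] = "disabled by user"
	cfg, _ = s.GetHealthCheckConfig("default-kube")
	fmt.Println(cfg) // persisted override
	s.Delete("default-kube")
	cfg, _ = s.GetHealthCheckConfig("default-kube")
	fmt.Println(cfg) // back to the virtual default, i.e. reset
}
```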
The generic `Resources` function is backported to the generic `Service` type.
`Resources` is used in a related commit enabling health checks for Kubernetes (#60544).

Part of #58413
Filters virtual defaults to allow explicit config references.

Part of #58413
@rana rana force-pushed the rana/backport-kube-health-checks/v18 branch from a7f0da6 to 98fe91a Compare October 29, 2025 16:38
@rana rana added this pull request to the merge queue Oct 29, 2025
Merged via the queue into branch/v18 with commit 93ceddb Oct 29, 2025
43 of 44 checks passed
@rana rana deleted the rana/backport-kube-health-checks/v18 branch October 29, 2025 17:24