Add RFD for Kubernetes health checks #58065

Merged
rana merged 1 commit into master from rfd/0223-k8s-health-checks on Sep 11, 2025

@rana (Contributor) commented Aug 19, 2025

A request for discussion on Kubernetes health checks.

Relates to:

@rana force-pushed the rfd/0223-k8s-health-checks branch 4 times, most recently from d39994f to 146b47d, August 25, 2025 18:17
@rana added the kubernetes-access, rfd, and health-check labels Aug 25, 2025
@rana requested review from creack, rosstimothy and tigrato August 25, 2025 18:22
@rana marked this pull request as ready for review August 25, 2025 18:30
@github-actions bot requested review from gzdunek and strideynet August 25, 2025 18:30
@rana added the no-changelog label (indicates that a PR does not require a changelog entry) Aug 25, 2025
@rana removed the review request for gzdunek and strideynet August 25, 2025 20:11
@rana marked this pull request as draft August 25, 2025 20:11
@rana force-pushed the rfd/0223-k8s-health-checks branch 2 times, most recently from ceb68bc to cc95e76, August 25, 2025 20:13
@rana marked this pull request as ready for review August 25, 2025 20:15
@github-actions bot requested a review from bernardjkim August 25, 2025 20:15
@rana removed the request for review from bernardjkim August 26, 2025 00:41
@rana force-pushed the rfd/0223-k8s-health-checks branch 3 times, most recently from d396d51 to 7893980, August 26, 2025 22:43
@rana changed the title from "Kubernetes health checks RFD" to "Add RFD for Kubernetes health checks" Aug 27, 2025
@tigrato (Contributor) left a comment

Kubernetes health checks should go beyond simply verifying that the Kubernetes API server is responsive.
A critical aspect to validate is whether each agent can operate correctly - specifically, that it has the necessary permissions to impersonate users and groups within the cluster.

While not strictly required, it’s also valuable to test Kubernetes GET Pod requests. This ensures that session start/end events are properly populated with fields like kubernetes_pod_labels and kubernetes_pod_image.

Another important area comes from a recent discussion with @programmerq, @webvictim, and a customer. The customer had a script that mistakenly copied Teleport agent state between different clusters. This led to agents across clusters sharing the same host_uid, which caused proxies to make incorrect routing decisions. Instead of forwarding requests to Cluster A, they were sent to Cluster B, since both clusters reported identical host_uid values. Detecting and preventing this scenario - and providing visibility when it occurs - would be a significant improvement.
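The merged RFD's summary says health checks use the Kubernetes `SelfSubjectAccessReview` API to verify RBAC configuration. A minimal sketch of how the impersonation point above could be expressed that way, assuming one review per impersonated resource: in Kubernetes RBAC, impersonation is the `impersonate` verb on the core-group (`""`) resources `users` and `groups`. The Go types and the `impersonationReviews` helper are hypothetical and only build the request bodies with the standard library; sending them to the API server through an authenticated client is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceAttributes mirrors the subset of the authorization.k8s.io/v1
// ResourceAttributes fields used in this sketch.
type ResourceAttributes struct {
	Verb     string `json:"verb"`
	Group    string `json:"group"`
	Resource string `json:"resource"`
}

type SelfSubjectAccessReviewSpec struct {
	ResourceAttributes ResourceAttributes `json:"resourceAttributes"`
}

type SelfSubjectAccessReview struct {
	APIVersion string                      `json:"apiVersion"`
	Kind       string                      `json:"kind"`
	Spec       SelfSubjectAccessReviewSpec `json:"spec"`
}

// impersonationReviews is a hypothetical helper that returns one review per
// resource the agent must be able to impersonate. The users/groups list is an
// assumption taken from the comment above, not the RFD's final check set.
func impersonationReviews() []SelfSubjectAccessReview {
	var reviews []SelfSubjectAccessReview
	for _, res := range []string{"users", "groups"} {
		reviews = append(reviews, SelfSubjectAccessReview{
			APIVersion: "authorization.k8s.io/v1",
			Kind:       "SelfSubjectAccessReview",
			Spec: SelfSubjectAccessReviewSpec{
				// Group "" is the Kubernetes core API group.
				ResourceAttributes: ResourceAttributes{Verb: "impersonate", Group: "", Resource: res},
			},
		})
	}
	return reviews
}

func main() {
	for _, r := range impersonationReviews() {
		body, err := json.Marshal(r)
		if err != nil {
			panic(err)
		}
		// Each body would be POSTed to
		// /apis/authorization.k8s.io/v1/selfsubjectaccessreviews.
		fmt.Println(string(body))
	}
}
```

Using the caller's own identity (a *self* review) means the check answers exactly the question that matters here: can this agent's credentials impersonate, without needing any extra permissions to inspect RBAC objects.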

@rosstimothy (Contributor) commented
> Kubernetes health checks should go beyond simply verifying that the Kubernetes API server is responsive. A critical aspect to validate is whether each agent can operate correctly - specifically, that it has the necessary permissions to impersonate users and groups within the cluster.

I strongly agree here. These health checks should be able to indicate that the Kubernetes Cluster is usable by Teleport users.

> While not strictly required, it’s also valuable to test Kubernetes GET Pod requests. This ensures that session start/end events are properly populated with fields like kubernetes_pod_labels and kubernetes_pod_image.

I don't think that we should be using health checks to validate the presence of data in audit events.

> Another important area comes from a recent discussion with @programmerq, @webvictim, and a customer. The customer had a script that mistakenly copied Teleport agent state between different clusters. This led to agents across clusters sharing the same host_uid, which caused proxies to make incorrect routing decisions. Instead of forwarding requests to Cluster A, they were sent to Cluster B, since both clusters reported identical host_uid values. Detecting and preventing this scenario - and providing visibility when it occurs - would be a significant improvement.

I think this should be out of scope for health checks. Duplicate UUIDs are a massive footgun, but I don't know that we should be complicating health checking to catch this scenario.

@TeleLos added the c-cv (Internal Customer Reference) label

@TeleLos (Contributor) commented Aug 28, 2025

I added c-cv because they are interested in this feature, specifically for databases.

@rana (Contributor, PR author) commented Aug 28, 2025

> I added c-cv because they are interested in this feature, specifically for databases.

@TeleLos Thank you. Health checks for databases were already released in v18.

@rana force-pushed the rfd/0223-k8s-health-checks branch 4 times, most recently from 32b8891 to 090b479, September 8, 2025 20:43
@public-teleport-github-review-bot bot removed the request for review from creack September 11, 2025 08:14
@rana force-pushed the rfd/0223-k8s-health-checks branch from 8fe417c to 33ce1a0, September 11, 2025 15:15
The RFD proposes automated health checks for Kubernetes clusters with monitoring from the Web UI, `tctl`, and Prometheus metrics. Health checks use the Kubernetes `SelfSubjectAccessReview` API to verify connectivity and RBAC configuration.

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>

Relates to #58413
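To illustrate the `SelfSubjectAccessReview` approach described in the commit message, here is a minimal sketch that turns review responses into a health status. It assumes three status values (`healthy`, `unhealthy`, `unknown`) and a hypothetical `healthFromReviews` helper; the actual status names and aggregation rules are defined in the RFD, not here. Only the `status.allowed`, `status.denied`, `status.reason`, and `status.evaluationError` response fields come from the Kubernetes `authorization.k8s.io/v1` API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reviewStatus mirrors the status fields of an authorization.k8s.io/v1
// SelfSubjectAccessReview response. Denied implies Allowed is false, so the
// health decision below only needs to check Allowed.
type reviewStatus struct {
	Allowed         bool   `json:"allowed"`
	Denied          bool   `json:"denied,omitempty"`
	Reason          string `json:"reason,omitempty"`
	EvaluationError string `json:"evaluationError,omitempty"`
}

type reviewResponse struct {
	Status reviewStatus `json:"status"`
}

// healthFromReviews (hypothetical) reports "healthy" only when every review
// was allowed. The first denied review makes the cluster "unhealthy" and its
// reason is returned; an unparsable or missing response yields "unknown".
func healthFromReviews(raw []string) (string, string) {
	if len(raw) == 0 {
		return "unknown", "no access reviews were performed"
	}
	for _, r := range raw {
		var resp reviewResponse
		if err := json.Unmarshal([]byte(r), &resp); err != nil {
			return "unknown", fmt.Sprintf("decoding access review: %v", err)
		}
		if !resp.Status.Allowed {
			reason := resp.Status.Reason
			if reason == "" {
				reason = resp.Status.EvaluationError
			}
			return "unhealthy", reason
		}
	}
	return "healthy", ""
}

func main() {
	allowed := `{"status":{"allowed":true}}`
	denied := `{"status":{"allowed":false,"reason":"no RBAC policy matched"}}`
	fmt.Println(healthFromReviews([]string{allowed, allowed}))
	fmt.Println(healthFromReviews([]string{allowed, denied}))
}
```

Treating a decoding failure or an empty result as `unknown` rather than `unhealthy` keeps "we could not tell" separate from "the agent lacks permissions", which is the distinction an operator needs when triaging.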
@rana force-pushed the rfd/0223-k8s-health-checks branch from 33ce1a0 to 7b12dd6
@rana added this pull request to the merge queue Sep 11, 2025
Merged via the queue into master with commit ac7a000 Sep 11, 2025
42 checks passed
@rana deleted the rfd/0223-k8s-health-checks branch September 11, 2025 16:19
@rana mentioned this pull request Oct 23, 2025

Labels

c-cv, health-check, kubernetes-access, no-changelog, rfd, size/md

5 participants