Relax Kubernetes CRD discovery when building cache #36214

Merged
tigrato merged 1 commit into master from tigrato/relax-crd-discovery-on-bad-configs
Jan 3, 2024
@tigrato commented Jan 2, 2024

The Teleport Kubernetes Service runs a monitor that continuously watches the resources registered in the Kubernetes cluster via API discovery. The goal is to keep an up-to-date representation of all resources existing in the cluster so that they can be registered for Teleport's per-resource RBAC.

Having an up-to-date representation allows us to unmarshal the API responses and filter them when the custom resources are local.

When Kubernetes APIs are registered through non-local services (i.e. the API is served by a pod running within the cluster, like the metrics API) and those services aren't healthy (e.g. the pod isn't running, the selector is invalid, or the cluster has no nodes), the discovery watcher returns an error and fails. This is a misconfiguration, but it seems to be a common problem.

This PR relaxes the discovery mechanism: it no longer requires every API to return its resources when some of them are currently unavailable.

When `client.Discovery().ServerGroupsAndResources()` returns a `*discovery.ErrGroupDiscoveryFailed`, it also returns partial results, which we now use for registration.

Changelog: Safeguard against the disruption of cluster access caused by incorrect Kubernetes APIService configurations.

Comment thread lib/kube/proxy/scheme.go Outdated
Teleport Kubernetes Service has a monitor that constantly watches the
Resources registered in the Kubernetes Cluster via API Discovery. The
goal is to keep an up-to-date representation of all resources existing
in the cluster in order to be able to register them for Teleport
per-Resource RBAC.

Having an up-to-date representation allows us to unmarshal the API
responses and filter them when the custom resources are local.

When the Kubernetes APIs are registered using non-local services - i.e.
the API is served by a POD running within the cluster like metrics API -
and those services aren't healthy - i.e. pod not running, invalid
selector, cluster has no nodes - the discovery watcher returns an error
and fails. This is an improper configuration but seems to be a common
problem.

This PR relaxes the discovery mechanism and doesn't enforce that all
APIs return their resources if they aren't currently available.

When the `client.Discovery().ServerGroupsAndResources()` returns a
`*discovery.ErrGroupDiscoveryFailed`, it also returns the partial
results that we will use for registration.

Signed-off-by: Tiago Silva <tiago.silva@goteleport.com>
@tigrato force-pushed the tigrato/relax-crd-discovery-on-bad-configs branch from c3907f0 to 29fd151 on January 3, 2024 10:36
@tigrato added this pull request to the merge queue Jan 3, 2024
Merged via the queue into master with commit 2b3691c Jan 3, 2024
@tigrato deleted the tigrato/relax-crd-discovery-on-bad-configs branch January 3, 2024 14:29
@public-teleport-github-review-bot

@tigrato See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v14 | Create PR |

Envek added a commit to Envek/teleport that referenced this pull request Jan 4, 2024
…se-anon-key

* origin/master: (344 commits)
  Undelete CreateHostUserMode_HOST_USER_MODE_DROP (gravitational#36273)
  allow cwd to be changed in difftest (gravitational#35946)
  Auth device list component (gravitational#36235)
  make unified resources responsive (gravitational#35961)
  Support running Teleport in a "hot reload" mode (gravitational#35040)
  Prevent deleting enum values, allow deleting enum reservations in types.proto (gravitational#36248)
  Remove support for legacy (Amazon Linux 2) AMIs (gravitational#36153)
  Bump version(s) used for teleport-lab and teleport-quickstart (gravitational#36167)
  Allow Reconciler update handler to examine old value during update (gravitational#36171)
  Validate the user still exists during account reset (gravitational#35676)
  ButtonTextWithAddIcon shared component (gravitational#36103)
  Refactor hostname resolution for SSH connections via the WebUI (gravitational#35773)
  add structuredClone to jest JSDOMEnvironment (gravitational#36213)
  fix flaky `lib/auth` cache-enabled tests (gravitational#36216)
  Report resource usage counts by handling heartbeat events (gravitational#35968)
  Reviewer bot should use the stable version of Go (gravitational#36242)
  RFD 0153 Resource Guidelines (gravitational#34103)
  Use cmp and cmpots properly in operator tests (gravitational#36215)
  Relax Kubernetes CRD discovery when building cache (gravitational#36214)
  Add Access List messages to TAG protobuf (gravitational#36176)
  ...