Skip to content

integrations/operator: re-use the teleport client instead of creating a new one#34050

Merged
hugoShaka merged 3 commits intomasterfrom
hugo/operator-reuse-client
Nov 7, 2023
Merged

integrations/operator: re-use the teleport client instead of creating a new one#34050
hugoShaka merged 3 commits intomasterfrom
hugo/operator-reuse-client

Conversation

@hugoShaka
Copy link
Copy Markdown
Contributor

@hugoShaka hugoShaka commented Oct 30, 2023

Fixes #24110

This PR addresses several major issues of the Teleport Operator:

  • potentially high CPU and memory usage when reconciling broken resources
  • the operator caused a high log volume on the Teleport container
  • teleport clients were not properly closed, causing a memory leak

This should the ongoing memory issues several users reported, should largely reduce the impact of broken reconciliation and reduce the memory spikes when reconciling many resources. Another PR from @tigrato will reduce the amount of unnecessary reconciliations. With both PRs we should be in a much better place in terms of CPU/memory load on the operator side.

How it works

With this PR the embedded tbot now caches the client and can skip the whole connection dance if the certs have not changed. In the future, tbot will return us a single client with rolling certificates, which will simplify the whole thing.

This PR also wraps the teleport client in a new structure that contains an RWLock and tracks who is using the client. This allows us to ensure no one is using the client before closing it.

Finally, this PR adds an RWLock on tbot's memory destination to ensure safe reads from the sidecar (we don't want to read while tbot is writing the renewed cert, this would end badly).

changelog: The operator reuses its connection to Teleport. Reduces CPU usage, logs, and fixes a memory leak.

@hugoShaka hugoShaka added bug kube-operator Issues related to Kube Operator labels Oct 30, 2023
Comment thread integrations/operator/sidecar/tbot.go Outdated
Copy link
Copy Markdown
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - let's put that RWMutex in to the Memory Destination though.

Copy link
Copy Markdown
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See slack, just realised where the root of your memory leak is.

Comment thread integrations/operator/sidecar/tbot.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we ever closing the client?

Copy link
Copy Markdown
Contributor

@strideynet strideynet Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the leak I spotted - the problem also exists in the existing code. We can probably adjust this PR to fix this since the other content of the PR is still valuable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 13de098a38a33c7890cc95bd6886c6a41fa8e4aa

This is not super clean, and we'll definitely want to get rid of this when tbot will send us clients with in-place cert-renewal. However, tbot changes won't be backported to v12/v13, so we need the current fix for those versions.

Comment thread integrations/operator/sidecar/tbot.go Outdated
Comment thread integrations/operator/sidecar/tbot.go Outdated
@hugoShaka hugoShaka force-pushed the hugo/operator-reuse-client branch from 68243a8 to 13de098 Compare October 31, 2023 15:25
Comment thread lib/tbot/config/destination_memory.go Outdated
Comment thread integrations/operator/sidecar/tbot.go Outdated
Comment thread integrations/operator/sidecar/tbot.go Outdated
@hugoShaka hugoShaka force-pushed the hugo/operator-reuse-client branch 2 times, most recently from 360cf2b to b3b6690 Compare October 31, 2023 19:38
Comment thread integrations/operator/controllers/resources/github_connector_controller.go Outdated
Copy link
Copy Markdown
Contributor

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hugoShaka hugoShaka force-pushed the hugo/operator-reuse-client branch from b30cb50 to ba32bcb Compare November 2, 2023 18:10
Comment thread integrations/operator/sidecar/tbot.go Outdated
@hugoShaka hugoShaka requested a review from tigrato November 6, 2023 16:58
@hugoShaka hugoShaka force-pushed the hugo/operator-reuse-client branch from d4bed68 to 90121a4 Compare November 7, 2023 20:04
@hugoShaka hugoShaka added this pull request to the merge queue Nov 7, 2023
Merged via the queue into master with commit df257be Nov 7, 2023
@hugoShaka hugoShaka deleted the hugo/operator-reuse-client branch November 7, 2023 22:24
@public-teleport-github-review-bot
Copy link
Copy Markdown

@hugoShaka See the table below for backport results.

Branch Result
branch/v12 Failed
branch/v13 Failed
branch/v14 Create PR

hugoShaka added a commit that referenced this pull request Nov 9, 2023
… a new one (#34050)

* integrations/operator: re-use the teleport client instead of creating a new one

* fix race condition

* address feedback + add godocs
hugoShaka added a commit that referenced this pull request Nov 10, 2023
github-merge-queue Bot pushed a commit that referenced this pull request Nov 13, 2023
…eating a new one (#34431)

* integrations/operator: re-use the teleport client instead of creating a new one (#34050)

* integrations/operator: re-use the teleport client instead of creating a new one

* fix race condition

* address feedback + add godocs

* fixup! integrations/operator: re-use the teleport client instead of creating a new one (#34050)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug kube-operator Issues related to Kube Operator size/md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Huge memory and cpu usage for teleport operator

4 participants