Slowly delete stale pool clusters #1512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged: openshift-merge-robot merged 2 commits into openshift:master from 2uasimojo:HIVE-1058/cycle-stale-clusters on Aug 31, 2021.
New file (22 lines):

```go
package clusterpool

import (
	"github.com/prometheus/client_golang/prometheus"

	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// metricStaleClusterDeploymentsDeleted tracks the total number of CDs we delete because they
	// became "stale". That is, the ClusterPool was modified in a substantive way such that these
	// CDs no longer match its spec. Note that this only counts stale CDs we've *deleted* -- there
	// may be other stale CDs we haven't gotten around to deleting yet.
	metricStaleClusterDeploymentsDeleted = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "hive_clusterpool_stale_clusterdeployments_deleted",
		Help: "The number of ClusterDeployments deleted because they no longer match the spec of their ClusterPool.",
	}, []string{"clusterpool_namespace", "clusterpool_name"})
)

func init() {
	metrics.Registry.MustRegister(metricStaleClusterDeploymentsDeleted)
}
```
Actually, does this need a check to make sure none are deleting? Otherwise it kinda looks like it might immediately wipe them all out (depending on how fast Installing clusters start showing up).
It shouldn't need that:
- Delete one: we'll skip this `case` because we'll be one short of `pool.size`, so `drift` will be < 0. It'll add a CD via L382, which will go to `Installing`.
- Subsequent reconciles will see the `Installing` clusters, so will skip this `case`.

Did I logic that right?
You are probably right. There are cases where the cache isn't updated for new objects and may not see the new Installing cluster, which is where we have to use expectations. If I'm correct that that might happen here, it's possible it could wipe out a couple at once before the Installing clusters start showing up. I'm not super confident in any of this, so we can see how it goes.
Could we add a counter metric here so we can see the rate at which clusters are getting replaced due to pool changes?
Then someone will have to explain to me what "expectations" are and how they work :P
Just a monotonic counter we increment every time we hit this branch? Or a gauge we reset once we're fully synced?
And the fancy rate calculation stuff happens in prometheus, right?
The best ref I've used for expectations is when Matthew first introduced them to Hive, and why: #518. I don't necessarily think we need to do this here, but it may be an edge case we'll hit where a couple get deleted when we only expected one. I do think an unclaimed-deleting check would likely rectify it, though: that should be reflected in the cache immediately, and by the time it's done deleting, the new cluster provisions should definitely be in the cache.
Just a counter we increment, yes; it'll tell us how often this is happening.
Correct.
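For reference, that rate calculation would be a PromQL query over the counter added in this PR, along the lines of the following (the 1h window is an arbitrary choice, not from the PR):

```promql
sum by (clusterpool_namespace, clusterpool_name) (
  rate(hive_clusterpool_stale_clusterdeployments_deleted[1h])
)
```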
Summary from today's discussion re expectations: we don't need them here because the deletion hits the cache immediately, which would result in the next reconcile having a nonzero `drift` and would thus miss the new branch.

Digging into the metric a bit...

You suggested a Counter that would keep a running total of the number of stale CDs we've ever* deleted.

The other option is a Gauge that denotes how many stale clusters exist at any given time. Would that have value? Would it allow us to calculate the above by some prom magic? Or perhaps we could do both.

*Since the last time this controller started? Or like ever ever? I never really understood when data are in the controller and when they're in prometheus...
Added the Counter metric in a separate commit. LMK if the Gauge seems like a good idea.
Counters are since the process started; prometheus handles the past data and adds the two together. In this case it's not so much the value of the counter that matters as the rate at which it's occurring.
A gauge does sound useful for the total number of stale clusters at any given time, rather than the rate at which we're doing replacements.