Conversation

@2uasimojo (Member) commented Aug 24, 2021

3dc3c3c / #1484 added setting and discovery of a version marker for ClusterPool and ClusterDeployment so we could tell whether they match. This commit goes a step further and uses that information to replace stale CDs such that the pool will eventually be consistent.

The algorithm we use is as follows:
If the pool is in steady state -- i.e. we're at capacity and all unclaimed CDs are finished installing -- delete one stale CD. This triggers another reconcile wherein the deleted CD is replaced. We're no longer in steady state until (at least) that CD finishes installing, whereupon we're eligible to repeat.

Reasoning:

  • We don't want to immediately trash all stale CDs, as this could result in significant downtime following a pool edit.
  • We don't want to exceed capacity (size or maxSize) to do this rotation, so we go "down" instead of "up".
  • It would have been nice to just wait for the replacement CDs to finish installing; but that would have entailed tracking those somehow as being distinct from CDs added through other code paths. That could get tricky, especially across multiple pool edits.

TODO:

  • As written, we'll even wait for stale CDs to finish installing before we start deleting. We should figure out a way to discount those when checking whether we're eligible for a deletion.
  • Consider exposing a knob allowing the consumer to tune how aggressively stale CDs are replaced.

HIVE-1058
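
A minimal sketch of that branch ordering, with drift, Installing, and Stale reduced to plain counts (the real controller works over CD collections; these helper names are illustrative only, not Hive's actual code):

```go
package main

import "fmt"

// decide models the steady-state rotation described above. drift is the
// number of unclaimed CDs minus the pool's size, so zero means "at
// capacity"; installing and stale stand in for len(cds.Installing()) and
// len(cds.Stale()).
func decide(drift, installing, stale int) string {
	switch {
	case drift < 0:
		// Below capacity: add CDs. This is also the path that replaces
		// a stale CD deleted on the previous reconcile.
		return "add clusters"
	case drift > 0:
		// Over capacity (e.g. size was lowered): delete extras.
		return "delete excess clusters"
	case installing == 0 && stale > 0:
		// Steady state: at capacity and everything installed. Rotate
		// exactly *one* stale CD; its replacement takes us out of
		// steady state until it finishes installing.
		return "delete one stale cluster"
	default:
		return "wait"
	}
}

func main() {
	// At capacity, all installed, two stale CDs left to rotate.
	fmt.Println(decide(0, 0, 2)) // delete one stale cluster
}
```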

@2uasimojo (Member Author)

/hold

Needs UT & live test

/cc @stevekuznetsov @gurnben

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2021
@openshift-ci openshift-ci bot requested review from abutcher and akhil-rane August 24, 2021 20:22
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2021
codecov bot commented Aug 24, 2021

Codecov Report

Merging #1512 (f3e437e) into master (8b4f8db) will increase coverage by 0.03%.
The diff coverage is 88.67%.


@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
+ Coverage   41.47%   41.51%   +0.03%     
==========================================
  Files         335      336       +1     
  Lines       30538    30569      +31     
==========================================
+ Hits        12666    12691      +25     
- Misses      16790    16794       +4     
- Partials     1082     1084       +2     
Impacted Files | Coverage Δ
pkg/controller/metrics/metrics.go | 40.71% <ø> (ø)
...g/controller/clusterpool/clusterpool_controller.go | 54.59% <72.72%> (+0.11%) ⬆️
pkg/controller/clusterpool/collections.go | 75.91% <92.50%> (+0.67%) ⬆️
pkg/controller/clusterpool/metrics.go | 100.00% <100.00%> (ø)

@2uasimojo 2uasimojo force-pushed the HIVE-1058/cycle-stale-clusters branch from 58cc192 to 2f12f8b Compare August 24, 2021 21:45
@dgoodwin (Contributor)

LGTM so far.

```go
	return reconcile.Result{}, err
}
// Special case for stale CDs: allow deleting one if all CDs are installed.
case drift == 0 && len(cds.Installing()) == 0 && len(cds.Stale()) > 0:
```
@dgoodwin (Contributor) commented Aug 25, 2021

Actually, does this need a check to make sure none are deleting? Otherwise it kinda looks like it might immediately wipe them all out (depending on how fast Installing clusters start showing up).

@2uasimojo (Member Author)

It shouldn't need that:

  • On this reconcile we Delete one.
  • The next reconcile will miss this case because we'll be one short of pool.size, so drift will be < 0. It'll add a CD via L382, which will go to Installing.
  • The next reconcile (and subsequent ones for the next 40m) will see nonzero Installing clusters, so will skip this case.

Did I logic that right?
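
Tracing those three bullets with the hypothetical decide() sketch from the PR description (a pool of size 3 with one stale CD):

```go
decide(0, 0, 1)  // reconcile 1: steady state, so delete the stale CD
decide(-1, 0, 0) // reconcile 2: one short of pool.size, so add a CD (the L382 path)
decide(0, 1, 0)  // reconcile 3+: nonzero Installing, so the branch is skipped
```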

Contributor

You are probably right. There are cases where the cache isn't updated for new objects and may not see the new Installing cluster, which is where we have to use expectations. If I'm correct that might happen here, it's possible it could wipe out a couple at once before the installing clusters start showing up. I'm not super confident in any of this, so we can see how it goes.

Contributor

Could we do a counter metric here, so we can see the rate at which clusters are getting replaced due to pool changes?

@2uasimojo (Member Author)

> use expectations

Then someone will have to explain to me what "expectations" are and how they work :P

> a counter metric

Just a monotonic counter we increment every time we hit this branch? Or a gauge we reset once we're fully synced?

And the fancy rate calculation stuff happens in prometheus, right?

Contributor

> use expectations
>
> Then someone will have to explain to me what "expectations" are and how they work :P

Best ref I've used for expectations was when Matthew first introduced them to Hive, and why: #518. I don't necessarily think we need to do this here, but it may be an edge case we'll hit where a couple get deleted when we only expected one. I do think an unclaimed-deleting check would likely rectify it, though: the deletion should be reflected in the cache immediately, and by the time it's done deleting, the new cluster provisions should definitely be in the cache.
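
For concreteness, a sketch of that extra guard, assuming a hypothetical Deleting count alongside Installing and Stale (Hive's actual collections helpers may differ):

```go
// eligibleToRotateStale adds the suggested unclaimed-deleting check: a
// CD's deletionTimestamp lands in the cache immediately, so requiring
// deleting == 0 closes the window where the previous rotation's
// replacement isn't yet visible as Installing.
func eligibleToRotateStale(drift, installing, deleting, stale int) bool {
	return drift == 0 && installing == 0 && deleting == 0 && stale > 0
}
```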

> a counter metric
>
> Just a monotonic counter we increment every time we hit this branch? Or a gauge we reset once we're fully synced?

Just a counter we increment, yes; it'll tell us how often this is happening.

> And the fancy rate calculation stuff happens in prometheus, right?

Correct.

@2uasimojo (Member Author)

Summary from today's discussion re expectations: we don't need them here because the deletion hits the cache immediately, which would result in the next reconcile having a nonzero drift and would thus miss the new branch.

Digging into the metric a bit...

You suggested a Counter that would keep a running total of the number of stale CDs we've ever* deleted.

The other option is a Gauge that denotes how many stale clusters exist at any given time. Would that have value? Would it allow us to calculate the above by some prom magic? Or perhaps we could do both.

*Since the last time this controller started? Or like ever ever? I never really understood when data are in the controller and when they're in prometheus...

@2uasimojo (Member Author) commented Aug 26, 2021

Added the Counter metric in a separate commit. LMK if the Gauge seems like a good idea.

Contributor

Counters are since the process started; prometheus keeps the past data and stitches the two together. In this case it's not so much the value of the counter we're after as the rate at which it's occurring.

A gauge does sound useful for the total number of stale CDs at any given time, as distinct from the rate at which we're doing replacements.
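
A minimal sketch of what that gauge could look like; it's not part of this PR, and the metric and label names here are invented for illustration:

```go
package clusterpool

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical GaugeVec tracking how many stale unclaimed CDs each pool
// currently has; the controller would Set() it on every reconcile.
var metricStaleCDsCurrent = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hive_clusterpool_stale_clusterdeployments",
		Help: "Current number of stale unclaimed ClusterDeployments per pool.",
	},
	[]string{"cluster_pool"},
)

// In the reconcile loop, after building the collections:
//   metricStaleCDsCurrent.WithLabelValues(pool.Name).Set(float64(len(cds.Stale())))
```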

@2uasimojo 2uasimojo force-pushed the HIVE-1058/cycle-stale-clusters branch from 2f12f8b to 5d89d37 Compare August 26, 2021 21:35
@2uasimojo (Member Author)

(Still holding for test)

@2uasimojo 2uasimojo force-pushed the HIVE-1058/cycle-stale-clusters branch from 5d89d37 to 37fb6af Compare August 30, 2021 20:24
@2uasimojo 2uasimojo force-pushed the HIVE-1058/cycle-stale-clusters branch from 37fb6af to c48acb3 Compare August 30, 2021 21:12
@2uasimojo (Member Author)

UT is in. Still need to test live.

Add a CounterVec, labeled by pool name, incremented every time we delete
a stale ClusterDeployment.

Part of HIVE-1058
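
A rough sketch of what that commit describes: a pool-labeled CounterVec bumped on each stale deletion. The metric name, label, and registration shown here are assumptions, not necessarily the PR's exact code:

```go
package clusterpool

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Assumed name and label; the metric registered by the commit may differ.
var metricStaleCDsDeleted = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hive_clusterpool_stale_clusterdeployments_deleted_total",
		Help: "Stale ClusterDeployments deleted so the pool converges, per pool.",
	},
	[]string{"cluster_pool"},
)

func init() {
	// controller-runtime serves this registry via the manager's /metrics endpoint.
	metrics.Registry.MustRegister(metricStaleCDsDeleted)
}

// From the branch that deletes a stale CD:
//   metricStaleCDsDeleted.WithLabelValues(pool.Name).Inc()
```

The rate piece then lives entirely in prometheus, e.g. rate(hive_clusterpool_stale_clusterdeployments_deleted_total[1h]) over the assumed name above.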
@2uasimojo 2uasimojo force-pushed the HIVE-1058/cycle-stale-clusters branch from c48acb3 to f3e437e Compare August 30, 2021 22:16
@2uasimojo (Member Author)

/test e2e

@2uasimojo (Member Author)

Tested live, works.

/hold cancel
/assign @dgoodwin

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 31, 2021
@dgoodwin (Contributor)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 31, 2021
@openshift-merge-robot openshift-merge-robot merged commit 694e5a8 into openshift:master Aug 31, 2021
openshift-ci bot commented Aug 31, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2uasimojo 2uasimojo deleted the HIVE-1058/cycle-stale-clusters branch August 31, 2021 16:32