ClusterPool:ClusterDeployment version & condition #1484
Conversation
This is currently stacked on #1474. /hold until that merges. Meanwhile, review only the second commit.
/test e2e
/test e2e-pool
/test e2e-pool
/retest
github.com/aws/aws-sdk-go v1.38.41
github.com/blang/semver/v4 v4.0.0
github.com/davecgh/go-spew v1.1.1
github.com/davegardnerisme/deephash v0.0.0-20210406090112-6d072427d830
Maybe use what we do elsewhere in the code:
hive/pkg/operator/hive/controllersconfig.go
Lines 86 to 91 in 57147b2
hasher := md5.New()
hasher.Write([]byte(fmt.Sprintf("%v", hiveControllersConfigMap.Data)))
for _, h := range additionalControllerConfigHashes {
	hasher.Write([]byte(h))
}
return hex.EncodeToString(hasher.Sum(nil))
I'll look into that. Is %v guaranteed to produce the same thing for (different instances of) identical objects? I know some languages don't/didn't guarantee map ordering, for example.
So yeah, I can't prove that %v breaks with map key order, but I can prove it doesn't play nice with pointers, whereas deephash does The Right Thing™: https://play.golang.org/p/1SQUEeJfOND
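For illustration, here is a minimal, self-contained sketch of the pointer problem (the struct and field names are made up for the example, not taken from the hive code): formatting a value that contains a pointer field with %v renders the pointer's address, so two logically identical objects produce different strings and therefore different hashes.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// spec is a stand-in for a config struct that has a pointer field.
type spec struct {
	Name *string
}

// hashV hashes the %v rendering of v, in the style of the controllersconfig.go snippet above.
func hashV(v interface{}) string {
	h := md5.New()
	h.Write([]byte(fmt.Sprintf("%v", v)))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a, b := "pool", "pool"
	s1, s2 := spec{Name: &a}, spec{Name: &b}
	// %v prints the address of the Name field (e.g. {0xc00008e1e0}), so two
	// logically identical specs hash differently.
	fmt.Println(hashV(s1) == hashV(s2)) // false
}
```

deephash, by contrast, is what handles the pointer case correctly per the playground link above.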
apis/hive/v1/clusterpool_types.go
Outdated
// gives a quick indication as to whether the ClusterDeployment's configuration matches that of
// the pool.
// +optional
PoolVersion string `json:"poolVersion,omitempty"`
Version, to me, has a lot of baggage in terms of format; an opaque digest doesn't fit well. Can we use some other name here?
🤔
I was following the naming of ResourceVersion, which is a close analog of this field: an opaque string that you're only supposed to compare for (in)equality to see if something has changed. Open to other suggestions.
I named the annotation cluster-pool-spec-hash. There are still vestiges of poolVersion in comments, variable names, method names. Hope that's acceptable. (I really didn't want to be naming variables clusterPoolSpecHash...)
Codecov Report
@@            Coverage Diff             @@
##           master    #1484      +/-   ##
==========================================
- Coverage   41.35%   41.23%   -0.12%
==========================================
  Files         335      334       -1
  Lines       30455    30265     -190
==========================================
- Hits        12594    12481     -113
+ Misses      16782    16722      -60
+ Partials     1079     1062      -17
Force-pushed from 22036cc to f2fc7c0 (Compare)
/test e2e
}

func calculatePoolVersion(clp *hivev1.ClusterPool) string {
	ba := []byte{}
Since we store the hashes in the annotation, what is the expected max length of the hash? The append here makes it seem like it grows with more fields we include?
Correct. Each individual hash is 64 bits == 16 hex chars, so right now the value is 64c. And yes, it would grow if we extended it to include more fields. Fortunately for us, there's effectively no limit to the length of an annotation value :)
(One thing that's sort of neat about concatenating the hashes of the individual fields is that you can tell, with a little bit of work, which field(s) changed. Not that I think we should advertise this as a feature or anything, but it may come in handy some day.)
Note that if we ever did decide to add new fields, installing that version of the code would spuriously invalidate the previous CDs no matter what formatting choices we make, now or then. So if we decide we want to include a dozen fields and we feel like a 192c hash is just too much and want to do something like use a hash-of-hashes instead, we can switch over at that time and it wouldn't be any worse.
Fortunately for us, there's effectively no limit to the length of an annotation value :)
Even though there is no limit, I would expect our hash to be capped at a constant length like most hashes. A hash value that keeps increasing in length just looks odd and might create confusion in the future, IMO.
Changed to hash the hashes, which fixes the length at 16c.
(ed. Notwithstanding what I said above, anyone confused about the actual value of the string shouldn't be looking at the string. It's supposed to be an opaque value -- again like kube's resourceVersion.)
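To make the shape concrete, here is a minimal sketch of the fixed-length "hash of the hashes" approach described in this thread. It is not the actual hive code: the standard library's FNV-1a hash stands in for deephash, and the field values are invented. Each pool field hashes to 16 hex characters; under the earlier scheme those per-field hashes were simply concatenated (64 characters for four fields), whereas hashing the concatenation keeps the final value at 16 characters no matter how many fields are added.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hash64 returns a 64-bit FNV-1a digest of s as 16 hex characters.
func hash64(s string) string {
	h := fnv.New64a()
	h.Write([]byte(s))
	return fmt.Sprintf("%016x", h.Sum64())
}

// poolVersion hashes each field individually, then hashes the concatenation
// of those per-field hashes so the result stays at a fixed 16 characters.
func poolVersion(fields ...string) string {
	concatenated := ""
	for _, f := range fields {
		concatenated += hash64(f) // 16 hex chars per field
	}
	return hash64(concatenated)
}

func main() {
	// Invented stand-ins for Platform, BaseDomain, ImageSetRef,
	// InstallConfigSecretTemplateRef.
	fmt.Println(poolVersion("aws/us-east-1", "example.com", "openshift-v4.8.0", ""))
}
```

With the earlier concatenation scheme, the intermediate `concatenated` string is what would have been stored directly.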
apis/hive/v1/clusterpool_types.go
Outdated
ClusterPoolCapacityAvailableCondition ClusterPoolConditionType = "CapacityAvailable"
// ClusterPoolClusterDeploymentsCurrentCondition indicates whether all unassigned (installing or ready)
// ClusterDeployments in the pool match the current configuration of the ClusterPool.
ClusterPoolClusterDeploymentsCurrentCondition ClusterPoolConditionType = "ClusterDeploymentsCurrent"
Hmm, two things came to mind:
- Having the condition name include ClusterDeployments seems to tie a user-facing thing to the internal implementation of how clusters are created; if something else provides clusters someday, it might be difficult to move away from this name.
- The Current suffix does not immediately give the idea that this condition allows us to see that the clusters we have are using the latest desired state, at least IMO.
🤔
"ClusterPools manage ClusterDeployments" is not an internal implementation detail at all. It's literally the first sentence in the doc. Maybe I'm misunderstanding your comment.
Current suffix does not immediately give the idea
"Current" in the sense of "Up to date" -- see definition #4. I'm open to suggestions here, noting that this variable name is already pretty long and mealy. ClusterPoolOwnedUnclaimedClusterDeploymentsInSyncWithPoolSpecCondition may be more precise, but...
Personally my preference is to not include ClusterDeployments in the condition name, so the best I could come up with was AllClustersCurrent.
Saying that the clusters are current implies a lot more than we're actually verifying. And the word "All" confuses the fact that we're only validating unclaimed CDs.
But I've changed it per your suggestion.
2uasimojo left a comment
Thanks for the review @abhinavdahiya!
/test e2e
2 similar comments
/test e2e

/test e2e
Force-pushed from 6b77d52 to 3820c30 (Compare)
Updated (forgot to …). I don't know wtf is wrong with e2e 😠
gdi, if I had a nickel for every time I updated the vendor type file instead of the base one...
This commit adds a mechanism for calculating a hash representing the following fields in ClusterPool.Spec:
- Platform
- BaseDomain
- ImageSetRef
- InstallConfigSecretTemplateRef

When creating a new ClusterDeployment for the pool, this hash is stored in a new annotation on the ClusterDeployment.

The clusterpool controller checks all unassigned (installing or ready) ClusterDeployments and sets a new condition on the ClusterPool. This condition is:
- "True" when all CDs are up to date with the pool (including the edge case where there are no CDs);
- "False" when one or more CDs are stale;
- "Unknown" if the annotation is missing from one or more CDs;
- (also "Unknown" if we haven't had a chance to figure it out yet).

Of note:
- When "False" or "Unknown", the condition's Message contains a list of the offending CDs (in the spirit of ClusterSync's "Failed" condition).
- "False" takes precedence over "Unknown" (including: we don't try to combine the CD lists in the Message).

HIVE-1058
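A rough sketch of the classification and precedence rules just described (the types, names, and hash values below are illustrative only, not the actual clusterpool controller code): any stale CD makes the condition "False"; otherwise any CD missing the annotation makes it "Unknown"; otherwise, including the zero-CD case, it is "True", and the offending CDs are collected for the Message.

```go
package main

import "fmt"

// cd is a stand-in for an unassigned ClusterDeployment in the pool.
type cd struct {
	Name     string
	PoolHash string // value of the pool-spec-hash annotation; "" if missing
}

// cdsCurrentStatus returns the condition status ("True"/"False"/"Unknown")
// and the offending CD names for the Message. "False" (stale CDs) takes
// precedence over "Unknown" (missing annotation), and the lists aren't combined.
func cdsCurrentStatus(poolHash string, unassigned []cd) (string, []string) {
	var stale, unknown []string
	for _, c := range unassigned {
		switch {
		case c.PoolHash == "":
			unknown = append(unknown, c.Name)
		case c.PoolHash != poolHash:
			stale = append(stale, c.Name)
		}
	}
	switch {
	case len(stale) > 0:
		return "False", stale
	case len(unknown) > 0:
		return "Unknown", unknown
	default:
		return "True", nil // includes the edge case of zero CDs
	}
}

func main() {
	status, offenders := cdsCurrentStatus("cafef00d1234abcd", []cd{
		{Name: "pool-abc12", PoolHash: "cafef00d1234abcd"}, // current
		{Name: "pool-def34", PoolHash: "0123456789abcdef"}, // stale
		{Name: "pool-ghi56"},                               // annotation missing
	})
	fmt.Println(status, offenders) // False [pool-def34]
}
```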
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, abhinavdahiya

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
3dc3c3c / openshift#1484 added setting and discovery of a version marker for ClusterPool and ClusterDeployment so we could tell whether they match. This commit goes a step further and uses that information to replace stale CDs such that the pool will eventually be consistent.

The algorithm we use is as follows: if the pool is in steady state -- i.e. we're at capacity and all unclaimed CDs are finished installing -- delete *one* stale CD. This triggers another reconcile wherein the deleted CD is replaced. We're no longer in steady state until (at least) that CD finishes installing, whereupon we're eligible to repeat.

Reasoning:
- We don't want to immediately trash all stale CDs, as this could result in significant downtime following a pool edit.
- We don't want to exceed capacity (size or maxSize) to do this rotation, so we go "down" instead of "up".
- It would have been nice to just wait for the *replacement* CDs to finish installing; but that would have entailed tracking those somehow as being distinct from CDs added through other code paths. That could get tricky, especially across multiple pool edits.

TODO:
- As written, we'll even wait for *stale* CDs to finish installing before we start deleting. We should figure out a way to discount those when checking whether we're eligible for a deletion.
- Consider exposing a knob allowing the consumer to tune how aggressively stale CDs are replaced.

HIVE-1058
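As a rough illustration of the rotation rule above (again with invented types and names, not the actual controller code): a single stale CD is chosen for deletion only when the pool is in steady state, i.e. at capacity with every unclaimed CD finished installing.

```go
package main

import "fmt"

// cd is a stand-in for an unclaimed ClusterDeployment in the pool.
type cd struct {
	Name      string
	Installed bool
	PoolHash  string
}

// pickStaleCDToDelete returns the name of one stale CD to delete, but only when
// the pool is in steady state: at capacity and with nothing still installing.
// Otherwise it returns "" and the controller waits for a later reconcile.
func pickStaleCDToDelete(poolHash string, size int, unclaimed []cd) string {
	if len(unclaimed) < size {
		return "" // below capacity: let the pool refill first
	}
	for _, c := range unclaimed {
		if !c.Installed {
			return "" // something is still installing; not steady state yet
		}
	}
	for _, c := range unclaimed {
		if c.PoolHash != poolHash {
			return c.Name // delete just this one; the next reconcile replaces it
		}
	}
	return "" // nothing stale: the pool is consistent
}

func main() {
	fmt.Println(pickStaleCDToDelete("new", 2, []cd{
		{Name: "pool-a", Installed: true, PoolHash: "new"},
		{Name: "pool-b", Installed: true, PoolHash: "old"},
	})) // pool-b
}
```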