Publish operator status #47
Conversation
7a03ce9 to 533e0f3 Compare
pkg/stub/handler.go
Outdated
		return err
	}

	if err := h.syncIngressUpdate(o); err != nil {
The syncOperatorStatus calls woven throughout the update logic seem more imperative than level-driven to me... would it be possible to simplify it to "if syncIngressUpdate returns an error, set a degraded status"? Or take the error into account when doing a state-observation driven computation unconditionally after the update, like:
- Attempt update
- Compute status by observing the state of the cluster after the update
- Take any error from (1) into account when computing in (2)
The point being it should be possible to re-compute status outside the context of the update function. That is, there could be a Handler.ComputeOperatorStatus() function which could be called at any time to compute the correct status absent any event to observe.
This also closes the current gap with status computation based on deletes: the current code only handles deletes in response to a delete event (something that needs refactoring); if any delete processing (or status processing during delete processing) fails, it should be possible to call a general status recomputation function later on (outside the context of a delete event) and get the right status.
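For illustration, here is a minimal sketch of the level-driven shape being suggested; the types and rules are hypothetical, not this PR's actual implementation. The point is only that status is a pure function of observed state:

```go
package stub

// observedState is a hypothetical stand-in for whatever listing of
// namespaces, ClusterIngresses, and DaemonSets the operator performs.
type observedState struct {
	namespaceExists bool
	numIngresses    int
	numReadyRouters int
}

// operatorStatus is a simplified stand-in for ClusterOperator conditions.
type operatorStatus struct {
	Available   bool
	Progressing bool
	Failing     bool
}

// computeOperatorStatus depends only on observed state, so it can be called
// at any time -- after an update, after a delete, or on a periodic resync --
// without needing the triggering event or its error.
func computeOperatorStatus(s observedState) operatorStatus {
	return operatorStatus{
		Failing:     !s.namespaceExists,
		Progressing: s.numReadyRouters != s.numIngresses,
		Available:   s.namespaceExists && s.numReadyRouters == s.numIngresses,
	}
}
```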
Yeah, it's a bit icky.
No, not all errors should cause "degraded" status. In particular, errors from which the operator can recover without intervention should not cause "degraded" status.
Maybe it would be helpful to list out error conditions, whether "degraded" status makes sense for each, and how we could otherwise infer "degraded" status from state:
In order to identify and remove orphaned resources, we must list them, list ClusterIngress resources, and then reconcile. To this end, it would be helpful to disentangle the resync logic from event handling. Ultimately, if the overhead of listing everything can be mostly mitigated by informers' caching, we may want to replace the edge-triggered event handling (ClusterIngress was added/modified/deleted, so add/modify/delete the appropriate resources) with edge-triggered reconciliation (some event involving some object in one of our namespaces happened, so get all relevant state and reconcile). This approach deviates from the pattern we see in operator-sdk and other operators though, and it is beyond the scope of this PR, I believe.
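As a rough sketch of the list-and-reconcile step described above (the "router-" name derivation is an assumption for illustration, not necessarily the operator's actual convention):

```go
package stub

// orphanedRouters compares the routers that should exist (one per
// ClusterIngress) against the router DaemonSets that do exist and returns the
// names of DaemonSets whose ClusterIngress is gone.
func orphanedRouters(ingressNames, daemonsetNames []string) []string {
	expected := map[string]bool{}
	for _, ing := range ingressNames {
		expected["router-"+ing] = true
	}
	var orphans []string
	for _, ds := range daemonsetNames {
		if !expected[ds] {
			orphans = append(orphans, ds)
		}
	}
	return orphans
}
```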
It might be useful to decouple the error handling and status discussions entirely. They aren't necessarily related. Error taxonomy as it pertains to work queue handling (e.g. is this error transient? Is it recoverable?) is useful but not generally relevant to status computation.
Edge-triggered reconciliation is indeed the approach we should be striving for, and is the only reliable approach to authoring a controller of this nature. We should be able to, in a single thread, (1) apply any changes and then (2) compute status by observing the resulting cluster state.
Any errors that occurred during step 1 are relevant only indirectly, insofar as their side effects on the cluster state which is observed to drive step 2. Categorization/analysis of errors during step 1 may drive things like retries, but the status function shouldn't use errors (a kind of edge event) as inputs. If that means we need to bite the bullet now and refactor …
Can you give an example state from which we could infer "degraded" status? Would it be all right if I removed "degraded" status from this PR (that is, delete all calls to …)?
For the operator itself, how about something as simple as a failure to ensure the … exists, expressed as a level-driven status calc algorithm?
Some similar analysis would be useful for the router ClusterRole.
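A rough sketch of that kind of level-driven check, assuming the resource in question is the router namespace mentioned below (the helper name is hypothetical, and the Get signature matches the client-go vintage in use here; newer client-go also takes a context argument):

```go
package stub

import (
	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// namespaceDegraded reports a degraded/Failing-style signal purely from the
// observed absence of the router namespace, independent of whatever event or
// error preceded the check.
func namespaceDegraded(client kubernetes.Interface) (bool, error) {
	_, err := client.CoreV1().Namespaces().Get("openshift-ingress", metav1.GetOptions{})
	if kerrors.IsNotFound(err) {
		// The namespace should exist by now; its absence is observable state.
		return true, nil
	}
	if err != nil {
		// A transient get error is not by itself evidence of degradation.
		return false, err
	}
	return false, nil
}
```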
#47 (comment) may also blur the line a bit between operator and ClusterIngress status, but it's a start. I think there's a case to be made that the operator should be trying to ensure the router namespace exists during reconciliation independent of any ClusterIngress existing, which may further clarify the boundaries.
What about transient errors from creating the namespace? Should we implement our own retry loop as part of reconciliation? We could even implement our own work queue à la CVO, but I thought operators generally were supposed to be moving away from that paradigm towards embracing operator-sdk and its patterns.
The operator-sdk doesn't give us any help for re-queueing yet, but it soon will. Don't spend too much time working around it; we can assume the happy path for a little while longer. The operator-sdk team has explicitly acknowledged the deficiencies of the current handler paradigm in these regards and has been actively working on refactoring to use the upstream controller-runtime framework as the underlying basis, which will provide us with first-class reconciliation features including retry/work queue control. Keep in mind the operator-sdk (at the version we started with) doesn't necessarily represent the de facto best practices for k8s controller design, and the project is evolving to catch up. We must avoid at all costs going to prod with an edge-driven design: that particular mistake was extremely costly to repair in the original OpenShift v3 controllers.
/hold

This appears to be setting an older version of status that has been superseded by openshift/cluster-version-operator#31.
533e0f3 to 5a46915 Compare
Latest push is a rewrite to be level-driven and to use the new ClusterOperator API. Labeled "WIP" because I have not been able to test it (although most new code has unit test coverage).
pkg/stub/handler.go
Outdated
// syncOperatorStatus computes the operator's current status and therefrom
// creates or updates the ClusterOperator resource for the operator.
func (h *Handler) syncOperatorStatus() {
Maybe move all this stuff and the new stuff in util into a new file (status.go)?
pkg/stub/handler.go
Outdated
// getOperatorState gets and returns the resources necessary to compute the
// operator's current state.
func (h *Handler) getOperatorState() (*corev1.Namespace, *ingressv1alpha1.ClusterIngressList, *appsv1.DaemonSetList, error) {
How about returning plain slices (e.g. []ClusterIngress) instead? The list wrappers are cumbersome to work with and provide no additional value in this case.
Fixed. My only concern is that I don't want to copy large structures accidentally; hopefully I've avoided doing so...
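A small sketch of the conversion for DaemonSets (the helper is hypothetical, not part of the PR): returning pointers into the list's Items avoids copying the large structs while still handing callers a plain slice.

```go
package stub

import (
	appsv1 "k8s.io/api/apps/v1"
)

// daemonSetsFromList unwraps the *List type once at the boundary so callers
// work with a plain slice; taking &list.Items[i] avoids copying each
// DaemonSet struct again.
func daemonSetsFromList(list *appsv1.DaemonSetList) []*appsv1.DaemonSet {
	if list == nil {
		return nil
	}
	out := make([]*appsv1.DaemonSet, 0, len(list.Items))
	for i := range list.Items {
		out = append(out, &list.Items[i])
	}
	return out
}
```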
pkg/stub/handler.go
Outdated
} else {
	failingCondition.Status = osv1.ConditionFalse
}
clusteroperator.SetStatusCondition(co, failingCondition)
Style issue: mutating input is troublesome. This function is close to stateless; how about just returning the new Status object based on immutable inputs?
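A sketch of the suggested style, using a local stand-in type rather than the real osv1 condition type: the function derives a fresh condition from immutable inputs and returns it, so nothing passed in is mutated.

```go
package stub

// condition is a stand-in for osv1.ClusterOperatorStatusCondition, used only
// to keep this sketch self-contained.
type condition struct {
	Type   string
	Status string
}

// computeFailingCondition is effectively stateless: it reads its inputs,
// builds a new condition, and leaves it to the caller to store the result.
func computeFailingCondition(namespaceExists bool) condition {
	c := condition{Type: "Failing", Status: "False"}
	if !namespaceExists {
		c.Status = "True"
	}
	return c
}
```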
Fixed.
pkg/stub/handler.go
Outdated
	Type:   osv1.OperatorAvailable,
	Status: osv1.ConditionUnknown,
}
if ingressList != nil && daemonsetList != nil {
I'm not sure this correlation is correct, the DaemonSet name is a function of (not equal to) the ClusterIngress name... also, I think you have to compare availability of daemonsets relative to their own status: e.g.:
- for each ingress
- get daemonset by ingress
- let fullyAvailable = (daemonset.desired == daemonset.ready)
I think a more natural way to do this sort of stuff would be to build up some indexed caches from the k8s stdlib. For example:
- DaemonSet cache with index:
- by ClusterIngress.Name where .Status.Desired == .Status.Ready
Then given a ClusterIngress you can do an O(1) lookup to see if there exists a ready DaemonSet associated with the ClusterIngress.
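A rough sketch of the per-ingress availability check described above; the "router-" name derivation and helper names are assumptions for illustration, and availability is judged against each DaemonSet's own desired count.

```go
package stub

import (
	appsv1 "k8s.io/api/apps/v1"
)

// routerDaemonSetName reflects the point that the DaemonSet name is derived
// from (not equal to) the ClusterIngress name; the prefix is an assumption.
func routerDaemonSetName(ingressName string) string {
	return "router-" + ingressName
}

// allRoutersAvailable looks up each ingress's DaemonSet by derived name and
// checks availability relative to that DaemonSet's own status rather than by
// comparing object counts.
func allRoutersAvailable(ingressNames []string, daemonsets []appsv1.DaemonSet) bool {
	byName := map[string]appsv1.DaemonSet{}
	for _, ds := range daemonsets {
		byName[ds.Name] = ds
	}
	for _, name := range ingressNames {
		ds, ok := byName[routerDaemonSetName(name)]
		if !ok {
			return false
		}
		if ds.Status.NumberAvailable < ds.Status.DesiredNumberScheduled {
			return false
		}
	}
	return true
}
```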
Maybe it would be better to start small and slightly incorrect and build up to unblock deployment? For example, if the count of daemonsets matches the count of ingresses, at least mark us as "Progressing" and not failed, and we can refine the algorithms in followups using strategies like in 5a46915#r226467331.
Would that help?
I'm not sure this correlation is correct, the DaemonSet name is a function of (not equal to) the ClusterIngress name...
Fixed (I think). Thanks!
also, I think you have to compare availability of daemonsets relative to their own status: e.g.:
- for each ingress
- get daemonset by ingress
- let fullyAvailable = (daemonset.desired == daemonset.ready)
What is daemonset.desired, and what is daemonset.ready? I chose to use ds.Status.NumberAvailable based on my reading of the docstrings for OperatorAvailable and DaemonSetStatus and a literal interpretation of "available". If my interpretation is not correct, we should specify what we mean by "available".
I think a more natural way to do this sort of stuff would be to build up some indexed caches from the k8s stdlib.
I've been trying to get by using operator-sdk. Presumably its client API will grow richer, so I don't want to move away from it prematurely.
Available is probably right; I didn't look closely at the docs for those fields.
Regarding indexers, no need to go outside the operator-sdk. Check this out. Given just a kubernetes.Interface (easily available at our entrypoint in main.go) you can create any number of indexes to pass along to the handler.
There are a lot of examples to help you prototype in unit tests, which should help.
Making liberal use of caches/indexes/appropriate data structures is going to be very useful here to avoid an ever-expanding rat's nest of imperative analysis code.
5a46915 to 2dc5f2b Compare
Latest push renames …
Threw together a quick contextual example of what I'm talking about:

package stub

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func TestComputeOperatorStatus(t *testing.T) {
	byStatus := func(obj interface{}) ([]string, error) {
		ds := obj.(*appsv1.DaemonSet)
		if ds.Status.NumberUnavailable == 0 {
			return []string{"available"}, nil
		}
		return []string{"unavailable"}, nil
	}
	mkds := func(name string, unavailable int32) *appsv1.DaemonSet {
		return &appsv1.DaemonSet{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Status:     appsv1.DaemonSetStatus{NumberUnavailable: unavailable},
		}
	}
	index := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{"byStatus": byStatus})
	index.Add(mkds("a", 0))
	index.Add(mkds("b", 0))
	index.Add(mkds("c", 1))
	index.Add(mkds("d", 2))
	index.Add(mkds("e", 1))
	unavailable, err := index.ByIndex("byStatus", "unavailable")
	if err != nil {
		t.Fatal(err)
	}
	if e, a := 3, len(unavailable); e != a {
		t.Errorf("expected %d unavailable, got %d", e, a)
	}
}

Lots of places to go from there to answer different questions. Status updates could be launched in another goroutine with a pretty frequent resync period provided they remain cache-driven.
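A minimal sketch of that resync goroutine, assuming the sync function stays cheap and cache-driven (the 30-second interval is an arbitrary choice):

```go
package stub

import "time"

// startStatusResync drives status recomputation from a periodic loop in its
// own goroutine rather than from individual events; sync would be something
// like the PR's syncOperatorStatus.
func startStatusResync(sync func(), stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				sync()
			case <-stop:
				return
			}
		}
	}()
}
```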
You mean using …?
I reviewed the examples, thanks!
Not sure it makes the code simpler, but I can give it a try. Note that operator-sdk seems to be moving towards using controller-runtime and promoting use of the split client, which would give us caching for free; that means using an indexer would probably not reduce API usage, although it could reduce CPU usage or LOC (though I'm skeptical about that).
David has warned us against relying on Indexer going forward; apparently it has consistency issues and has bitten us before in ways he can elaborate on if we're interested. His recommendation was to run filters on lists, not far from what you were doing.
/retest
1 similar comment
/retest
pkg/stub/status_test.go
Outdated
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSetStatusConditions(t *testing.T) {
- Given inputs, the output should be some expected []ClusterOperatorStatusCondition. Existing condition isn't an input to the status computation function. When would old ever be relevant?
- What about doing a few tweaks to allow expressing the cases as one-liners for easier comprehension?
Existing condition isn't an input to the status computation function. When would old ever be relevant?
It is, and I think it needs to be, if for no other reason than so that we update LastTransitionTime if and only if it is appropriate to do so (although the unit tests ignore LastTransitionTime presently). As far as testing, having the old conditions lets us test to make sure SetStatusCondition doesn't add a duplicate or fail to update an existing condition, for example.
What about doing a few tweaks to allow expressing the cases as one-liners for easier comprehension?
Maybe something like this?
type testInputs struct {
	haveNamespace bool
	numWanted, numAvailable, numMissing, numUnavailable int
}

type testOutputs struct {
	failing, progressing, available bool
}

// ...

testCases := []struct {
	description string
	in          testInputs
	out         testOutputs
}{
	{"1/2 ingresses available", testInputs{true, 2, 1, 0, 1}, testOutputs{false, true, false}},
	// ...
}

…and ignore condition reason and message? (That example is >80 characters but technically a one-liner...)
Edit: Fixed haveNamespace.
Existing condition isn't an input to the status computation function. When would old ever be relevant?
It is, and I think it needs to be, if for no other reason than so that we update LastTransitionTime if and only if it is appropriate to do so (although the unit tests ignore LastTransitionTime presently).
I guess it could make sense in this case. I'm not sure how important that is.
As far as testing, having the old conditions lets us test to make sure SetStatusCondition doesn't add a duplicate or fail to update an existing condition, for example.
Assuming our operator owns that field, whenever the operator wants to update the field it could atomically replace the entire field with newly computed contents. Adding complexity to diff and patch seems like overkill.
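To make the LastTransitionTime point concrete, here is a sketch (with a local stand-in condition type, not the PR's actual SetStatusCondition) of replacing conditions wholesale while only bumping the timestamp on a real transition, which is why the old conditions remain an input:

```go
package clusteroperator

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// statusCondition is a local stand-in for the ClusterOperator condition type.
type statusCondition struct {
	Type               string
	Status             string
	LastTransitionTime metav1.Time
}

// setStatusCondition returns a new slice in which newCond replaces any
// existing condition of the same type; LastTransitionTime is carried over
// unchanged when the Status value did not actually change.
func setStatusCondition(old []statusCondition, newCond statusCondition) []statusCondition {
	newCond.LastTransitionTime = metav1.Now()
	out := make([]statusCondition, 0, len(old)+1)
	found := false
	for _, c := range old {
		if c.Type != newCond.Type {
			out = append(out, c)
			continue
		}
		found = true
		if c.Status == newCond.Status {
			// No transition: keep the previous timestamp.
			newCond.LastTransitionTime = c.LastTransitionTime
		}
		out = append(out, newCond)
	}
	if !found {
		out = append(out, newCond)
	}
	return out
}
```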
Re: the tests, yeah, something like that. Here's a way I'd probably try it, but I personally prefer using special purpose structs for additional clarity; otherwise, at a glance it's hard to know what the scenario represents without tracing function calls:
type daemonset struct {
	name  string
	avail int32
}

tests := []struct {
	daemonsets []daemonset
}{
	{
		// I'd then convert these to DaemonSets during scenario processing
		daemonsets: []daemonset{
			{name: "router-a", avail: 3},
			{name: "router-b", avail: 0},
		},
	},
}
Makes me wish we could use named function parameters.
I implemented something very similar to what I outlined in my last comment.
Just saw your last two comments. I like using additional types such as your daemonset type for clarity, but I don't think it matters in the test case how we name the daemonsets or what specific value (besides zero or non-zero) we use for NumberAvailable.
Looks good; there's a lot of stuff that could be shared between operators.
Gopkg.toml
Outdated
[[constraint]]
  name = "github.com/openshift/cluster-version-operator"
  revision = "fe673cb712fa5e27001488fc088ac91bb553353d"
Not that relevant, but adding the newline should remove this little warning.
I didn't see any warning, but I think I accidentally added the newline anyway.
}
numIngresses := len(ingresses)
numDaemonsets := len(daemonsets)
if numIngresses == numDaemonsets {
When an Ingress CR is deleted, the corresponding daemonset may not be deleted yet (pending op), so numIngresses and numDaemonsets will differ; don't we want to say OperatorProgressing = True in this case?
I think we do want to say "Progressing" is true in that case. The docstring for osv1.OperatorProgressing says, "OperatorProgressing indicates that the operator is actively making changes to the binary maintained by the operator". When a ClusterIngress is deleted, then the operator reconciles, and then it is actively making changes to the binary (deleting the DaemonSet for the router) until the DaemonSet is gone.
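Concretely, the simplified count-based rule under discussion amounts to something like this (a sketch, not the PR's exact code): any mismatch between ClusterIngresses and router DaemonSets, including the delete case above where the DaemonSet briefly outlives its ClusterIngress, reads as the operator still making changes.

```go
// progressing implements the simplified count comparison: a mismatch in
// either direction means the operator has work left to do.
func progressing(numIngresses, numDaemonsets int) bool {
	return numIngresses != numDaemonsets
}
```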
cd2d220 to dca9cc9 Compare
Latest push renames …
Rebased.
5086f80 to c3008d8 Compare
Latest push adds a TODO to use cluster-version-operator's …
* Gopkg.lock:
* Gopkg.toml:
* vendor/github.com/openshift/cluster-version-operator: Import package in order to have the necessary API types for updating operator status.
c3008d8 to ff0fb63 Compare
Just needs to be e2e tested against an AWS cluster (using …).
Create (if necessary) and update a ClusterOperator object to indicate the operator's status.

* cmd/cluster-ingress-operator/main.go: Add a watch on daemonsets in the application namespace to trigger reconciliation on status updates.
* manifests/00-cluster-role.yaml: Allow the operator to create, get, and update clusteroperators.
* pkg/manifests/bindata.go: Regenerate.
* pkg/util/clusteroperator/status.go: Add new functions: SetStatusCondition and ConditionsEqual.
* pkg/util/clusteroperator/status_test.go: Add tests.
* pkg/stub/status.go: Add new functions: syncOperatorStatus, getOperatorState, and computeStatusConditions. Use getOperatorState in syncOperatorStatus to get the prerequisite state for computing the operator's status. Use computeStatusConditions in syncOperatorStatus to compute the status. Use clusteroperator.ConditionsEqual in syncOperatorStatus to determine whether an update is needed. Use clusteroperator.SetStatusCondition in computeStatusConditions to compute new conditions.
* pkg/stub/handler.go: Use syncOperatorStatus in Handle to update ClusterOperator.
* pkg/stub/handler_test.go: Add tests.
ff0fb63 to c279137 Compare
Latest push adds a watch on daemonsets in the "openshift-ingress" namespace so that status updates to daemonsets trigger status updates to clusteroperators, and fixes the namespace that …. Integration tests are blocked on openshift/installer#530.
Dan and I discussed testing of this PR, and we concluded that it is sufficient to test it manually using …
Great work, thanks.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
/hold cancel
/retest

Please review the full test history for this PR and help us cut down flakes.
LGTM
Add openshift/cluster-version-operator dependency

* Gopkg.lock:
* Gopkg.toml:
* vendor/github.com/openshift/cluster-version-operator: Import package in order to have the necessary API types for updating ClusterOperator.

Publish operator status

Create (if necessary) and update a ClusterOperator object to indicate the operator's status.

* cmd/cluster-ingress-operator/main.go: Add a watch on daemonsets in the application namespace to trigger reconciliation on status updates.
* manifests/00-cluster-role.yaml: Allow the operator to create, get, and update clusteroperators.
* pkg/manifests/bindata.go: Regenerate.
* pkg/util/clusteroperator/status.go: Add new functions: SetStatusCondition and ConditionsEqual.
* pkg/util/clusteroperator/status_test.go: Add tests.
* pkg/stub/status.go: Add new functions: syncOperatorStatus, getOperatorState, and computeStatusConditions. Use getOperatorState in syncOperatorStatus to get the prerequisite state for computing the operator's status. Use computeStatusConditions in syncOperatorStatus to compute the status. Use clusteroperator.ConditionsEqual in syncOperatorStatus to determine whether an update is needed. Use clusteroperator.SetStatusCondition in computeStatusConditions to compute new conditions.
* pkg/stub/handler.go: Use syncOperatorStatus in Handle to update ClusterOperator.
* pkg/stub/handler_test.go: Add tests.