
Conversation

@Miciah
Contributor

@Miciah Miciah commented Oct 12, 2018

Add openshift/cluster-version-operator dependency

  • Gopkg.lock:
  • Gopkg.toml:
  • vendor/github.com/openshift/cluster-version-operator: Import package in order to have the necessary API types for updating ClusterOperator.

Publish operator status

Create (if necessary) and update a ClusterOperator object to indicate the operator's status.

  • cmd/cluster-ingress-operator/main.go: Add a watch on daemonsets in the application namespace to trigger reconciliation on status updates.
  • manifests/00-cluster-role.yaml: Allow the operator to create, get, and update clusteroperators.
  • pkg/manifests/bindata.go: Regenerate.
  • pkg/util/clusteroperator/status.go: Add new functions: SetStatusCondition and ConditionsEqual.
  • pkg/util/clusteroperator/status_test.go: Add tests.
  • pkg/stub/status.go: Add new functions: syncOperatorStatus, getOperatorState, and computeStatusConditions.
    Use getOperatorState in syncOperatorStatus to get the prerequisite state for computing the operator's status.
    Use computeStatusConditions in syncOperatorStatus to compute the status.
    Use clusteroperator.ConditionsEqual in syncOperatorStatus to determine whether an update is needed.
    Use clusteroperator.SetStatusCondition in computeStatusConditions to compute new conditions.
  • pkg/stub/handler.go: Use syncOperatorStatus in Handle to update ClusterOperator.
  • pkg/stub/handler_test.go: Add tests.

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 12, 2018
@Miciah Miciah force-pushed the publish-OperatorStatus branch from 7a03ce9 to 533e0f3 on October 12, 2018 22:20
return err
}

if err := h.syncIngressUpdate(o); err != nil {
Contributor

The syncOperatorStatus calls woven throughout the update logic seem more imperative than level driven to me... would it be possible to simplify it to "if syncIngressUpdate returns an error, set a degraded status"? Or take the error into account when doing a state-observation driven computation unconditionally after the update, like:

  1. Attempt update
  2. Compute status by observing the state of the cluster after the update
    1. Take any error from (1) into account when computing in (2)

The point being it should be possible to re-compute status outside the context of the update function. That is, there could be a Handler.ComputeOperatorStatus() function which could be called at any time to compute the correct status absent any event to observe.

This also closes the current gap with status computation based on deletes: the current code will only handle deletes in response to a delete event (something which needs refactored); if any delete processing (or status processing during delete processing) fails, it should be possible to call a general status recompilation function later on (outside the context of a delete event) and get the right status.
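
Roughly, the shape I have in mind (a sketch only — updateClusterOperator is a made-up name here, while getOperatorState and computeStatusConditions already exist in this PR):

  // ComputeOperatorStatus recomputes status purely from observed cluster state,
  // so it can be called at any time, with or without a triggering event.
  func (h *Handler) ComputeOperatorStatus() error {
    ns, ingresses, daemonsets, err := h.getOperatorState()
    if err != nil {
      return fmt.Errorf("failed to get operator state: %v", err)
    }
    conditions := computeStatusConditions(ns, ingresses, daemonsets)
    // A hypothetical helper would create the ClusterOperator resource if
    // necessary and update it only when the conditions actually changed.
    return h.updateClusterOperator(conditions)
  }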

@Miciah
Contributor Author

Miciah commented Oct 15, 2018

The syncOperatorStatus calls woven throughout the update logic seem more imperative than level driven to me...

Yeah, it's a bit icky.

would it be possible to simplify it to "if syncIngressUpdate returns an error, set a degraded status"?

No, not all errors should cause "degraded" status. In particular, errors from which the operator can recover without intervention should not cause "degraded" status.

Or take the error into account when doing a state-observation driven computation unconditionally after the update, like:

  1. Attempt update
  2. Compute status by observing the state of the cluster after the update
    i. Take any error from (1) into account when computing in (2)

The point being it should be possible to re-compute status outside the context of the update function. That is, there could be a Handler.ComputeOperatorStatus() function which could be called at any time to compute the correct status absent any event to observe.

This approach is complicated by the fact that syncIngressUpdate decorates its errors, which means the error value that it returns does not respond to such introspective methods as errors.IsForbidden. We could stop decorating errors in order to preserve the information needed to decide whether an error were recoverable, use string matching to infer this information, or decorate using richer error types that included that information. However, the simplest and cleanest approach I can think of would be to add a Boolean return value retry and let Handle use that value to decide whether to return an error (in order to cause a retry) or log and set "degraded" status. This approach would still be edge-driven but would be simpler and less imperative. Does this approach seem reasonable to you?
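
As a sketch of the shape I mean (illustrative only — the current code does not return a retry value, and the setOperatorStatusDegraded call signature here is made up):

  // Inside Handle: syncIngressUpdate would gain a second return value saying
  // whether the failure is transient and worth retrying.
  retry, err := h.syncIngressUpdate(o)
  if err != nil {
    if retry {
      return err // returning the error causes the event to be retried
    }
    // Unrecoverable without intervention: record it and mark the operator
    // degraded instead of retrying forever.
    h.setOperatorStatusDegraded()
    return nil
  }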

Maybe it would be helpful to list out error conditions, whether "degraded" status makes sense for each, and how we could otherwise infer "degraded" status from state:

  • Error getting the watch namespace or operator name in NewHandler: This can only happen on startup, so the operator just errors out, which seems reasonable enough.
  • Error building a resource: This could be caused either by some fault or by an erroneous name or selector in the ClusterIngress, either of which would require intervention. Testing should catch the former, and validation should catch the latter. Once we add validation, do we even need to handle errors from the manifest factory?
  • Error creating, deleting, getting, or updating a resource, which could be any of the following:
    • Timeout, conflict, or other intermittent issue that we know is recoverable, in which case we should retry.
    • "Forbidden", which could be caused by a failure in CVO, a bad manifest, or possibly a misconfigured cluster, any of which requires intervention. We could use SARs to identify when we will get permissions errors, and maybe that would be a reasonable thing to do on startup, but there is still the possibility that RBAC configuration will change while the operator is running, in which case we still need to handle the errors.
    • "Bad request" or similar error, or some generic error, in which case we really have no clue whether it is recoverable. Right now we retry on these errors—should we default to assuming errors are unrecoverable?

This also closes the current gap with status computation based on deletes: the current code will only handle deletes in response to a delete event (something which needs refactored); if any delete processing (or status processing during delete processing) fails, it should be possible to call a general status recompilation function later on (outside the context of a delete event) and get the right status.

In order to identify and remove orphaned resources, we must list them, list ClusterIngress resources, and then reconcile. To this end, it would be helpful to disentangle the resync logic from event handling. Ultimately if the overhead of listing everything can be mostly mitigated by informers' caching, we may want to replace the edge-triggered event handling (ClusterIngress was added/modified/deleted, so add/modify/delete the appropriate resources) with edge-triggered reconciliation (some event involving some object in one of our namespaces happened, so get all relevant state and reconcile). This approach deviates from the pattern we see in operator-sdk and other operators though, and it is beyond the scope of this PR, I believe.

@ironcladlou
Contributor

@Miciah

Maybe it would be helpful to list out error conditions, whether "degraded" status makes sense for each, and how we could otherwise infer "degraded" status from state:
...

It might be useful to decouple the error handling and status discussions entirely. They aren't necessarily related. Error taxonomy as it pertains to work queue handling (e.g. is this error transient? Is it recoverable?) is useful but not generally relevant to status computation.

In order to identify and remove orphaned resources, we must list them, list ClusterIngress resources, and then reconcile. To this end, it would be helpful to disentangle the resync logic from event handling. Ultimately if the overhead of listing everything can be mostly mitigated by informers' caching, we may want to replace the edge-triggered event handling (ClusterIngress was added/modified/deleted, so add/modify/delete the appropriate resources) with edge-triggered reconciliation (some event involving some object in one of our namespaces happened, so get all relevant state and reconcile). This approach deviates from the pattern we see in operator-sdk and other operators though, and it is beyond the scope of this PR, I believe.

Edge triggered reconciliation is indeed the approach we should be striving for, and is the only reliable approach to authoring a controller of this nature. We should be able to, in a single thread,

  1. Try and reconcile
  2. Compute status

Any errors that occurred during step 1 are relevant only indirectly insofar as their side effects on the cluster state which is observed to drive step 2. Categorization/analysis of errors during step 1 may drive things like retries, but the status function shouldn't use errors (a kind of edge event) as inputs.

If that means we need to bite the bullet now and refactor syncIngressUpdate and deleteIngress out into a single level driven reconcile function, let's just do it and get it over with. I'm afraid of accumulating serious technical debt in the core controller at this stage.

@Miciah
Contributor Author

Miciah commented Oct 15, 2018

Can you give an example state from which we could infer "degraded" status?

Would it be all right if I removed "degraded" status from this PR (that is, delete all calls to setOperatorStatusDegraded) and made a follow-up PR to implement reconciliation? That would enable us to keep moving, fulfill the requirement of publishing OperatorStatus without incurring significant additional technical debt, and then focus on reconciliation.

@ironcladlou
Contributor

@Miciah

Can you give an example state from which we could infer "degraded" status?

For the operator itself, how about something as simple as a failure to ensure the openshift-cluster-ingress-router namespace exists (if we can't create that, then we can't satisfy any ClusterIngress resources at all)?

Expressed as a level driven status calc algorithm:

  1. There exist more than zero ClusterIngress resources in the openshift-cluster-ingress-operator namespace
  2. After a reconciliation, there does not exist a openshift-cluster-ingress-router namespace

Some similar analysis would be useful for the router ClusterRole.
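
As code, this might look something like the following inside the status computation (a sketch; the types follow the vendored osv1 API used elsewhere in this PR, and the OperatorFailing constant and Reason/Message strings are assumptions):

  failing := osv1.ClusterOperatorStatusCondition{
    Type:   osv1.OperatorFailing,
    Status: osv1.ConditionFalse,
  }
  if len(ingresses) > 0 && ns == nil {
    // ClusterIngress resources exist, but the router namespace does not.
    failing.Status = osv1.ConditionTrue
    failing.Reason = "NoRouterNamespace"
    failing.Message = "cluster ingresses exist but the router namespace could not be found"
  }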

@ironcladlou
Contributor

#47 (comment) may also blur the line a bit between operator and ClusterIngress status, but it's a start. I think there's a case to be made that the operator should be trying to ensure the router namespace exists during reconciliation independent of any ClusterIngress existing, which may further clarify the boundaries.

@Miciah
Contributor Author

Miciah commented Oct 15, 2018

What about transient errors from creating the namespace?

Should we implement our own retry loop as part of reconciliation? We could even implement our own work queue à la CVO, but I thought operators generally were supposed to be moving away from that paradigm towards embracing operator-sdk and its patterns.

@ironcladlou
Contributor

@Miciah

What about transient errors from creating the namespace?

Should we implement our own retry loop as part of reconciliation? We could even implement our own work queue à la CVO, but I thought operators generally were supposed to be moving away from that paradigm towards embracing operator-sdk and its patterns.

The operator-sdk doesn't give us any help for re-queueing yet, but it soon will. Don't spend too much time working around it. We can assume the happy path for a little while longer. The operator-sdk team has explicitly acknowledged the deficiencies of the current handler paradigm in these regards and has been actively working on refactoring to use the controller-runtime upstream framework as the underlying basis, which will provide us with first-class reconciliation features including retry/work queue control.

Keep in mind the operator-sdk (at the version we started with) doesn't necessarily represent the de facto best practices for k8s controller design, and the project is evolving to catch up. We must avoid at all costs going to prod with an edge-driven design: that particular mistake was extremely costly to repair in the original OpenShift v3 controllers.

@deads2k
Contributor

deads2k commented Oct 16, 2018

/hold

This appears to be setting an older version of status that has been superseded by openshift/cluster-version-operator#31

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 16, 2018
@Miciah Miciah force-pushed the publish-OperatorStatus branch from 533e0f3 to 5a46915 on October 18, 2018 20:26
@Miciah Miciah changed the title Publish OperatorStatus [WIP] Publish operator status Oct 18, 2018
@Miciah
Contributor Author

Miciah commented Oct 18, 2018

Latest push is a rewrite to be level-driven and to use the new ClusterOperator API. Labeled "WIP" because I have not been able to test it (although most new code has unit test coverage).


// syncOperatorStatus computes the operator's current status and therefrom
// creates or updates the ClusterOperator resource for the operator.
func (h *Handler) syncOperatorStatus() {
Contributor

Maybe move all this stuff and the new stuff in util into a new file (status.go)?


// getOperatorState gets and returns the resources necessary to compute the
// operator's current state.
func (h *Handler) getOperatorState() (*corev1.Namespace, *ingressv1alpha1.ClusterIngressList, *appsv1.DaemonSetList, error) {
Contributor

How about returning plain arrays (e.g. []ClusterIngress) instead? The list wrappers are cumbersome to work with and provide no additional value in this case

Contributor Author

Fixed. My only concern is that I don't want to copy large structures accidentally; hopefully I've avoided doing so...
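
For reference, the slice-returning shape is roughly (sketch):

  func (h *Handler) getOperatorState() (*corev1.Namespace, []ingressv1alpha1.ClusterIngress, []appsv1.DaemonSet, error) {
    // ... get the namespace and list the cluster ingresses and daemonsets ...
    return ns, ingressList.Items, daemonsetList.Items, nil
  }

Returning list.Items does not copy the elements themselves; a slice header only references its backing array, so the copies to watch for are explicit ones made later (for example, ranging by value over large structs).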

} else {
failingCondition.Status = osv1.ConditionFalse
}
clusteroperator.SetStatusCondition(co, failingCondition)
Contributor

Style issue: mutating input is troublesome. This function is close to stateless; how about just returning the new Status object based on immutable inputs?

Contributor Author

Fixed.

Type: osv1.OperatorAvailable,
Status: osv1.ConditionUnknown,
}
if ingressList != nil && daemonsetList != nil {
Contributor

I'm not sure this correlation is correct, the DaemonSet name is a function of (not equal to) the ClusterIngress name... also, I think you have to compare availability of daemonsets relative to their own status: e.g.:

  • for each ingress
    • get daemonset by ingress
      • let fullyAvailable = (daemonset.desired == daemonset.ready)

I think a more natural way to do this sort of stuff would be to build up some indexed caches from the k8s stdlib. For example:

  • DaemonSet cache with index:
    • by ClusterIngress.Name where .Status.Desired == .Status.Ready

Then given a ClusterIngress you can do an O(1) lookup to see if there exists a ready DaemonSet associated with the ClusterIngress.

Contributor

Maybe it would be better to start small and slightly incorrect and build up to unblock deployment? For example, if the count of daemonsets matches the count of ingresses, at least mark us as "Progressing" and not failed, and we can refine the algorithms in followups using strategies like in
5a46915#r226467331.

Would that help?

Contributor Author

I'm not sure this correlation is correct, the DaemonSet name is a function of (not equal to) the ClusterIngress name...

Fixed (I think). Thanks!

also, I think you have to compare availability of daemonsets relative to their own status: e.g.:

  • for each ingress
    • get daemonset by ingress
      • let fullyAvailable = (daemonset.desired == daemonset.ready)

What is daemonset.desired, and what is daemonset.ready? I chose to use ds.Status.NumberAvailable based on my reading the docstrings for OperatorAvailable and DaemonSetStatus and a literal interpretation of "available". If my interpretation is not correct, we should specify what we mean by "available".

I think a more natural way to do this sort of stuff would be to build up some indexed caches from the k8s stdlib.

I've been trying to get by using operator-sdk. Presumably its client API will grow richer, so I don't want to move away from it prematurely.

Contributor

Available is probably right, I didn't look closely at the docs for those fields.

Regarding indexers, no need to go outside the operator-sdk. Check this out. Given just a kubernetes.Interface (easily available at our entrypoint in main.go) you can create any number of indexes to pass along to the handler.

There are a lot of examples to help you prototype in unit tests, which should help.

Making liberal use of caches/indexes/appropriate data structures is going to be very useful here to avoid an ever-expanding rat's nest of imperative analysis code.

@Miciah Miciah force-pushed the publish-OperatorStatus branch from 5a46915 to 2dc5f2b on October 18, 2018 22:30
@Miciah
Contributor Author

Miciah commented Oct 18, 2018

Latest push renames setStatusConditions to computeStatusConditions and makes it functional, changes getOperatorState and computeStatusConditions to use slices instead of list types, moves syncOperatorStatus, getOperatorState, and computeStatusConditions to pkg/stub/status.go, makes SetStatusCondition functional, renames StatusEqual to ConditionsEqual and makes it work on slices of conditions instead of on statuses, and fixes the mapping of ClusterIngress name to DaemonSet name.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 18, 2018
@ironcladlou
Contributor

Threw together a quick contextual example of what I'm talking about:

package stub

import (
	"testing"

	"k8s.io/client-go/tools/cache"

	appsv1 "k8s.io/api/apps/v1"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestComputeOperatorStatus(t *testing.T) {
	byStatus := func(obj interface{}) ([]string, error) {
		ds := obj.(*appsv1.DaemonSet)
		if ds.Status.NumberUnavailable == 0 {
			return []string{"available"}, nil
		}
		return []string{"unavailable"}, nil
	}

	mkds := func(name string, unavailable int32) *appsv1.DaemonSet {
		return &appsv1.DaemonSet{ObjectMeta: metav1.ObjectMeta{Name: name}, Status: appsv1.DaemonSetStatus{NumberUnavailable: unavailable}}
	}

	index := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{"byStatus": byStatus})

	index.Add(mkds("a", 0))
	index.Add(mkds("b", 0))
	index.Add(mkds("c", 1))
	index.Add(mkds("d", 2))
	index.Add(mkds("e", 1))

	unavailable, err := index.ByIndex("byStatus", "unavailable")
	if err != nil {
		t.Fatal(err)
	}
	if e, a := 3, len(unavailable); e != a {
		t.Errorf("expected %d unavailable, got %d", e, a)
	}
}

Lots of places to go from there to answer different questions. Status updates could be launched in another goroutine with a pretty frequent resync period, provided they remain cache-driven.

@Miciah
Contributor Author

Miciah commented Oct 19, 2018

Regarding indexers, no need to go outside the operator-sdk. Check this out. Given just a kubernetes.Interface (easily available at our entrypoint in main.go) you can create any number of indexes to pass along to the handler.

You mean using k8sclient.GetKubeClient()? Looks like k8sclient is gone upstream, so we will need to change to the controller-runtime client eventually; do we have plans for that work?

There are a lot of examples to help you prototype in unit tests, which should help.

I reviewed the examples, thanks!

Making liberal use of caches/indexes/appropriate data structures is going to be very useful here to avoid an ever-expanding rat's nest of imperative analysis code.

Not sure it makes the code simpler, but I can give it a try. Note that operator-sdk seems to be moving towards using controller-runtime and promoting use of the split client, which would give us caching for free, which means that using an indexer would probably not reduce API usage, although it could reduce CPU usage or LOC (although I'm skeptical about that).

@ironcladlou
Contributor

David has warned us against relying on Indexer going forward; apparently it has consistency issues and has bitten us before in ways he can elaborate on if we're interested. His recommendation was to run filters on lists, not far from what you were doing.

@Miciah
Contributor Author

Miciah commented Oct 22, 2018

/retest

1 similar comment
@Miciah
Contributor Author

Miciah commented Oct 22, 2018

/retest

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestSetStatusConditions(t *testing.T) {
Contributor

  1. Given inputs, the output should be some expected []ClusterOperatorStatusCondition. Existing condition isn't an input to the status computation function. When would old ever be relevant?
  2. What about doing a few tweaks to allow expressing the cases as one-liners for easier comprehension?

Contributor Author

@Miciah Miciah Oct 22, 2018

Existing condition isn't an input to the status computation function. When would old ever be relevant?

It is, and I think it needs to be, if for no other reason than to ensure that we update LastTransitionTime if and only if it is appropriate to do so (although the unit tests presently ignore LastTransitionTime). As far as testing goes, having the old conditions lets us verify that SetStatusCondition doesn't add a duplicate condition or fail to update an existing one, for example.
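
For illustration, the general shape I have in mind for the helper (a sketch, not necessarily the exact code in this PR):

  // SetStatusCondition returns the conditions with the given condition added or
  // updated, moving LastTransitionTime only on an actual status transition.
  func SetStatusCondition(oldConditions []osv1.ClusterOperatorStatusCondition, condition *osv1.ClusterOperatorStatusCondition) []osv1.ClusterOperatorStatusCondition {
    condition.LastTransitionTime = metav1.Now()
    for i, c := range oldConditions {
      if c.Type != condition.Type {
        continue
      }
      if c.Status == condition.Status {
        // No transition: keep the previous timestamp.
        condition.LastTransitionTime = c.LastTransitionTime
      }
      newConditions := append([]osv1.ClusterOperatorStatusCondition{}, oldConditions...)
      newConditions[i] = *condition
      return newConditions
    }
    return append(append([]osv1.ClusterOperatorStatusCondition{}, oldConditions...), *condition)
  }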

What about doing a few tweaks to allow expressing the cases as one-liners for easier comprehension?

Maybe something like this?

type testInputs struct {
  haveNamespace bool
  numWanted, numAvailable, numMissing, numUnavailable int
}
type testOutputs struct {
  failing, progressing, available bool
}
// ...
  testCases := []struct {
    description string
    in testInputs
    out testOutputs
  }{
    { "1/2 ingresses available", testInputs{true, 2, 1, 0, 1}, testOutputs{false, true, false} },
    // ...

and ignore condition reason and message? (That example is >80 characters but technically a one-liner...)

Edit: Fixed haveNamespace.

Contributor

Existing condition isn't an input to the status computation function. When would old ever be relevant?

It is, and I think it needs to be, if for no other reason than to ensure that we update LastTransitionTime if and only if it is appropriate to do so (although the unit tests presently ignore LastTransitionTime).

I guess it could make sense in this case. I'm not sure how important that is.

As far as testing goes, having the old conditions lets us verify that SetStatusCondition doesn't add a duplicate condition or fail to update an existing one, for example.

Assuming our operator owns that field, whenever the operator wants to update the field it could atomically replace the entire field with newly computed contents. Adding complexity to diff and patch seems like overkill.

Contributor

Re: the tests, yeah, something like that. Here's a way I'd probably try it, but I personally prefer using special purpose structs for additional clarity, otherwise at a glance it's hard to know what the scenario represents without tracing function calls:

type daemonset struct {
  name  string
  avail int32
}
tests := []struct {
  daemonsets []daemonset
}{
  {
    // I'd then convert these to DaemonSets during scenario processing
    daemonsets: []daemonset{
      {name: "router-a", avail: 3},
      {name: "router-b", avail: 0},
    },
  },
}

Contributor

Makes me wish we could use named function parameters

Contributor Author

I implemented something very similar to what I outlined in my last comment.

Contributor Author

Just saw your last two comments. I like using additional types such as your daemonset type for clarity, but I don't think it matters in the test case how we name the daemonsets or what specific value (besides zero or non-zero) we use for NumberAvailable.

@imcsk8
Contributor

imcsk8 commented Oct 22, 2018

Looks good; there's a lot of stuff here that could be shared between operators.

Gopkg.toml Outdated

[[constraint]]
name = "github.com/openshift/cluster-version-operator"
revision = "fe673cb712fa5e27001488fc088ac91bb553353d"
Contributor

Not that relevant but adding the newline should remove this little warning

Contributor Author

I didn't see any warning, but I think I accidentally added the newline anyway.

}
numIngresses := len(ingresses)
numDaemonsets := len(daemonsets)
if numIngresses == numDaemonsets {


When an Ingress CR is deleted, the corresponding daemonset may not be deleted yet (pending operation), so numIngresses and numDaemonsets will differ; don't we want to say OperatorProgressing = True in this case?

Contributor Author

I think we do want to say "Progressing" is true in that case. The docstring for osv1.OperatorProgressing says, "OperatorProgressing indicates that the operator is actively making changes to the binary maintained by the operator". When a ClusterIngress is deleted, then the operator reconciles, and then it is actively making changes to the binary (deleting the DaemonSet for the router) until the DaemonSet is gone.
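
Concretely, the progressing computation could look something like this (a sketch; the Reason and Message strings are illustrative):

  progressing := osv1.ClusterOperatorStatusCondition{
    Type:   osv1.OperatorProgressing,
    Status: osv1.ConditionFalse,
  }
  if numIngresses != numDaemonsets {
    // A router daemonset is still being created or deleted for some
    // ClusterIngress, so the operator is actively making changes.
    progressing.Status = osv1.ConditionTrue
    progressing.Reason = "Reconciling"
    progressing.Message = "number of router daemonsets does not match number of cluster ingresses"
  }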

@Miciah Miciah force-pushed the publish-OperatorStatus branch from cd2d220 to dca9cc9 on October 22, 2018 23:15
@Miciah
Contributor Author

Miciah commented Oct 22, 2018

Latest push renames TestSetStatusCondition to TestComputeStatusConditions, simplifies the test cases therein, and adds TestSetStatusCondition to pkg/util/clusteroperator/status_test.go to make up for coverage lost in the TestComputeStatusConditions simplification.

@Miciah
Contributor Author

Miciah commented Oct 22, 2018

Rebased.

@Miciah Miciah force-pushed the publish-OperatorStatus branch 2 times, most recently from 5086f80 to c3008d8 on October 23, 2018 17:11
@Miciah
Contributor Author

Miciah commented Oct 23, 2018

Latest push adds a TODO to use cluster-version-operator's SetOperatorStatusCondition once we update to a version of cluster-version-operator that has it.

@Miciah Miciah changed the title [WIP] Publish operator status Publish operator status Oct 23, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2018
* Gopkg.lock:
* Gopkg.toml:
* vendor/github.com/openshift/cluster-version-operator: Import package
in order to have the necessary API types for updating operator status.
@Miciah Miciah force-pushed the publish-OperatorStatus branch from c3008d8 to ff0fb63 on October 23, 2018 17:43
@ironcladlou
Contributor

ironcladlou commented Oct 23, 2018

Just needs to be e2e tested against an AWS cluster (using hack/local-release.sh and make test-integration, etc.).

Create (if necessary) and update a ClusterOperator object to indicate the
operator's status.

* cmd/cluster-ingress-operator/main.go: Add a watch on daemonsets in the
application namespace to trigger reconciliation on status updates.
* manifests/00-cluster-role.yaml: Allow the operator to create, get, and
update clusteroperators.
* pkg/manifests/bindata.go: Regenerate.
* pkg/util/clusteroperator/status.go: Add new functions: SetStatusCondition
and ConditionsEqual.
* pkg/util/clusteroperator/status_test.go: Add tests.
* pkg/stub/status.go: Add new functions: syncOperatorStatus,
getOperatorState, and computeStatusConditions.
Use getOperatorState in syncOperatorStatus to get the prerequisite
state for computing the operator's status.
Use computeStatusConditions in syncOperatorStatus to compute the status.
Use clusteroperator.ConditionsEqual in syncOperatorStatus to determine
whether an update is needed.
Use clusteroperator.SetStatusCondition in computeStatusConditions to
compute new conditions.
* pkg/stub/handler.go: Use syncOperatorStatus in Handle to update
ClusterOperator.
* pkg/stub/handler_test.go: Add tests.
@Miciah Miciah force-pushed the publish-OperatorStatus branch from ff0fb63 to c279137 on October 26, 2018 16:44
@Miciah
Contributor Author

Miciah commented Oct 26, 2018

Latest push adds a watch on daemonsets in the "openshift-ingress" namespace so that status updates to daemonsets trigger status updates to clusteroperators, fixes the namespace that getOperatorState uses to list daemonsets, and changes %s to %q for interpolating ingress name into status.

Integration tests are blocked on openshift/installer#530.

@Miciah
Contributor Author

Miciah commented Oct 26, 2018

Dan and I discussed testing of this PR, and we concluded that it is sufficient to test it manually using make release-local, hack/uninstall.sh, and oc apply /tmp/... and verifying that the operator runs and creates the service and daemonset correctly. I have done this manual testing, so this PR is good to merge.

@ironcladlou
Contributor

Great work, thanks.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2018
@ironcladlou
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2018
@openshift-merge-robot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit c92416d into openshift:master Oct 30, 2018
@imcsk8
Contributor

imcsk8 commented Oct 30, 2018

LGTM
