Bug 2081895: lib/resourcebuilder/apiext: Restore check for Established=True CRDs #771
Conversation
Force-pushed from c1328ad to 4998c6c.
From the CVO logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521351550553821184/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5f5f9cd4b4-gwq8w_cluster-version-operator.log | grep customresourcedef | tail -n2
E0503 08:14:05.090838 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 792): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition
E0503 08:14:20.932028 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 792): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition

From the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521351550553821184/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-sbcr982j-stable-initial-sha256-862bbab55fec772e032e10f2765e9221f8a743e6fbcdd2d376bbe10a9e875c79/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/clusteroperators.config.openshift.io.yaml | yaml2json | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-03T05:07:37Z NamesAccepted=True NoConflicts: no conflicts found
2022-05-03T05:07:37Z Established=True InitialNamesAccepted: the initial names have been accepted

Looks like it's declaring Established=True, despite the CVO's "does not declare an Established status condition" error.
The upgrade job failed to install. Unrelated to this PR, because the installed version is the base branch.

/test e2e-agnostic-upgrade
New run failed the same way, and confirms that the CVO code is not seeing the conditions that show up in the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521622968680058880/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-644b6dd49d-rkdzj_cluster-version-operator.log | grep 'does not declare' | tail -n2
E0504 03:05:53.570755 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 791): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition: []
I0504 03:06:13.358865 1 sync_worker.go:1098] Update error 4 of 791: UpdatePayloadFailed Could not update customresourcedefinition "clusteroperators.config.openshift.io" (4 of 791) (*errors.errorString: CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition: [])
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521622968680058880/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-dwkpjrf0-stable-initial-sha256-862bbab55fec772e032e10f2765e9221f8a743e6fbcdd2d376bbe10a9e875c79/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/clusteroperators.config.openshift.io.yaml | yaml2json | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-03T23:08:12Z NamesAccepted=True NoConflicts: no conflicts found
2022-05-03T23:08:12Z Established=True InitialNamesAccepted: the initial names have been accepted
Force-pushed from 4998c6c to 11b7ecb.
I'd dropped this in cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-30, openshift#400), claiming:

> There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists [2,3]). With this commit, we have very minimal wait logic for either. Sufficiently unhealthy DaemonSet should be reported on via their associated ClusterOperator, and sufficiently unhealthy CRD should be reported on when we fail to push any custom resources consuming them (Task.Run retries will give the API server time to ready itself after accepting a CRD update before the CVO fails its sync cycle).

But from [1]:

> It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to push a dependent custom resource. But I was not correct in claiming that the CRD controller set no conditions. I was probably confused by e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), which broke the health-check inputs as described in 002591d (lib/resourcebuilder: Use actual resource in check*Health calls, 2022-05-03, openshift#771). The code I removed in cc9292a was in fact looking at the Established condition already.

This commit restores the Established check, but without the previous PollImmediateUntil wait.

[1]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
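For readers following the thread, a minimal sketch of what a single-shot Established check along these lines could look like (the function name and exact error wording are illustrative assumptions, not this PR's literal diff):

```go
package resourcebuilder

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// checkCustomResourceDefinitionHealth (hypothetical name) inspects the
// in-cluster CRD once, without polling; Task.Run retries supply the wait.
func checkCustomResourceDefinitionHealth(crd *apiextensionsv1.CustomResourceDefinition) error {
	for _, condition := range crd.Status.Conditions {
		if condition.Type != apiextensionsv1.Established {
			continue
		}
		if condition.Status == apiextensionsv1.ConditionTrue {
			return nil // the API server is serving the new resource
		}
		return fmt.Errorf("CustomResourceDefinition %s is not Established: %s: %s", crd.Name, condition.Reason, condition.Message)
	}
	return fmt.Errorf("CustomResourceDefinition %s does not declare an Established status condition", crd.Name)
}
```

The final error string in the sketch echoes the "does not declare an Established status condition" message seen in the CI logs above, which only fires when the check is handed an object with no conditions at all.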
Force-pushed from 11b7ecb to 38596fc.
Force-pushed from 38596fc to 978ac53.
Force-pushed from 978ac53 to ebb1271.
@wking: This pull request references Bugzilla bug 2081895, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug. Requesting review from QA contact.
lib/resourceapply/apireg.go (outdated)

	// ApplyAPIServicev1 applies the required API service to the cluster.
	func ApplyAPIServicev1(ctx context.Context, client apiregclientv1.APIServicesGetter, required *apiregv1.APIService, reconciling bool) (*apiregv1.APIService, bool, error) {
I'd leave it in so our resourcebuilder continues to support apply and delete for all the resources.
But looking at this further, this is saying CVO will never apply this resource. Did CVO ever apply the resource such that there could ever be an orphan? If not then yeah, I suppose there's no reason for the delete.
Looks like we dropped support for APIService in 5681a70 (#566) saying no 4.8 manifests were using it. Auditing recent z-stream tips:

$ oc adm release extract --to 4.11 quay.io/openshift-release-dev/ocp-release-nightly@sha256:080e5cf5e3e043ac0877b8f545ba2b596016f3fd76f3f457d15060603b3615e1
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.13-x86_64
$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.32-x86_64
$ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.39-x86_64
$ grep -r 'kind: APIService' 4.*
...no hits...

I guess I'll pivot to re-dropping support, and explain that above.
Before:

	if _, _, err := resourceapply.ApplyDeploymentv1(ctx, b.appsClientv1, typedObject, reconcilingMode); err != nil {
		return err

After:

	if actual, _, err := resourceapply.ApplyDeploymentv1(ctx, b.appsClientv1, typedObject, reconcilingMode); err != nil {
		return err
	} else if actual != nil {
So before your change if a resource was IsCreateOnly this would have segfault'ed? Seems like a pretty big hole.
Should the nil check be in each of the functions instead?
no, before my change we were passing typedObject through. That manifest content wasn't nil, but it didn't have anything useful in status either. I'm agnostic about nil checks or panics inside the check*Health functions; either way it would be a pretty serious bug to pass nil through to the health checker, and either way I expect we'll hear about it.
ACK. Did not notice that param change.
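To make the exchange above concrete, here is a minimal, hypothetical sketch of a Deployment health check that only does useful work when handed the in-cluster object returned by the apply call, whose Status the controller has populated; the function name, the nil guard, and the condition inspected are assumptions for illustration, not the CVO's actual code:

```go
package resourcebuilder

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// checkDeploymentHealth (hypothetical) reads status from the in-cluster
// Deployment.  A bare manifest object would pass trivially, because its
// Status carries no conditions; a nil object is treated as a caller bug.
func checkDeploymentHealth(deployment *appsv1.Deployment) error {
	if deployment == nil {
		return fmt.Errorf("no in-cluster Deployment to health-check")
	}
	for _, condition := range deployment.Status.Conditions {
		if condition.Type == appsv1.DeploymentAvailable && condition.Status == corev1.ConditionFalse {
			return fmt.Errorf("deployment %s/%s is not Available: %s: %s", deployment.Namespace, deployment.Name, condition.Reason, condition.Message)
		}
	}
	return nil
}
```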
/hold
The original APIService reconciliation landed in 662e182 (handle APIService, Service objects, 2018-09-27, openshift#26). We definitely reconcile a bunch of Service manifests (e.g. for serving operator metrics, including the CVO's own Service), but it's not clear what the use case was for APIService. DeleteAPIServicev1 landed in 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438), but was never consumed. We dropped APIService reconciliation support in 5681a70 (Drop APIService support, 2021-06-10, openshift#566). This commit drops the unused DeleteAPIServicev1.

Auditing z-stream tips from 4.3 through 4.11:

$ oc adm release extract --to 4.11 quay.io/openshift-release-dev/ocp-release-nightly@sha256:080e5cf5e3e043ac0877b8f545ba2b596016f3fd76f3f457d15060603b3615e1
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.13-x86_64
$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.32-x86_64
$ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.39-x86_64
$ oc adm release extract --to 4.7 quay.io/openshift-release-dev/ocp-release:4.7.50-x86_64
$ oc adm release extract --to 4.6 quay.io/openshift-release-dev/ocp-release:4.6.57-x86_64
$ oc adm release extract --to 4.5 quay.io/openshift-release-dev/ocp-release:4.5.41-x86_64
$ oc adm release extract --to 4.4 quay.io/openshift-release-dev/ocp-release:4.4.33-x86_64
$ oc adm release extract --to 4.3 quay.io/openshift-release-dev/ocp-release:4.3.40-x86_64
$ grep -r 'kind: APIService' 4.*
...no hits...
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover.

This also catches the generation script up with some past manual changes:

* 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438).
* 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628).
Force-pushed from ebb1271 to 6e12b9f.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jottofar, wking.
PDB/DNS stuff is unrelated.

/override ci/prow/e2e-agnostic-upgrade
/hold cancel
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade
@wking: all tests passed!
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 2081895 has been moved to the MODIFIED state.
I'd missed these Gets in 0833bba (lib/resourcebuilder: Use actual resource in check*Health calls, 2022-05-03, openshift#771). What I'd thought was happening:

1. Pass typedObject into check*Health.
2. No status on typedObject, so check*Health passed, even if the in-cluster resource was sad.

What was actually happening:

1. Pass typedObject into check*Health.
2. check*Health calls Get to see what's going on in the cluster.
3. Health check appropriately checks the health of the in-cluster resource.

However, now that 0833bba is passing in the just-retrieved in-cluster resource, the check*Health Get call is an API call that we don't need to make. Dropping it here saves us and the API server a small amount of CPU cycles and network bandwidth.
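A rough before/after sketch of the redundant Get being dropped; the client parameter and function names are assumptions, and checkDeploymentHealth refers to the illustrative sketch earlier in this thread rather than the CVO's exact helper:

```go
package resourcebuilder

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsclientv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
)

// Before (sketch): the health check re-fetched the object itself, costing an
// extra API round trip on every sync cycle.
func checkDeploymentHealthWithGet(ctx context.Context, client appsclientv1.DeploymentsGetter, deployment *appsv1.Deployment) error {
	actual, err := client.Deployments(deployment.Namespace).Get(ctx, deployment.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	return checkDeploymentHealth(actual)
}

// After (sketch): the caller passes the Deployment returned by the apply call
// straight into checkDeploymentHealth, so the Get above (and its round trip
// to the API server) is no longer needed.
```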
I'd dropped this in cc9292a (#400), claiming:

> There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists). With this commit, we have very minimal wait logic for either.

But from upstream docs:

> It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to push a dependent custom resource. But I was not correct in claiming that the CRD controller set no conditions. And the code I removed in cc9292a was in fact looking at the Established condition already.

This commit restores the Established check, but without the previous PollImmediateUntil wait. We'll see how it plays in CI. If we spend too long in between failed sync cycles waiting on new CRDs to show up, we can look at restoring a short wait.
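If a short wait ever does need to come back, one way to layer it on top of the single-shot check might look like the following sketch; the function, client parameter, one-second interval, and the checkCustomResourceDefinitionHealth helper all reuse the earlier illustrative sketch and are assumptions, not anything this PR adds:

```go
package resourcebuilder

import (
	"context"
	"time"

	apiextensionsclientv1 "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/typed/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForEstablished (hypothetical) polls the in-cluster CRD until it reports
// Established=True or the context is cancelled.
func waitForEstablished(ctx context.Context, client apiextensionsclientv1.CustomResourceDefinitionsGetter, name string) error {
	return wait.PollImmediateUntil(time.Second, func() (bool, error) {
		crd, err := client.CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat errors as transient and keep polling
		}
		return checkCustomResourceDefinitionHealth(crd) == nil, nil
	}, ctx.Done())
}
```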