Bug 2081895: lib/resourcebuilder/apiext: Restore check for Established=True CRDs #771
Conversation
Force-pushed from c1328ad to 4998c6c.
From the CVO logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521351550553821184/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5f5f9cd4b4-gwq8w_cluster-version-operator.log | grep customresourcedef | tail -n2
E0503 08:14:05.090838 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 792): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition
E0503 08:14:20.932028 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 792): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition

From the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521351550553821184/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-sbcr982j-stable-initial-sha256-862bbab55fec772e032e10f2765e9221f8a743e6fbcdd2d376bbe10a9e875c79/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/clusteroperators.config.openshift.io.yaml | yaml2json | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-03T05:07:37Z NamesAccepted=True NoConflicts: no conflicts found
2022-05-03T05:07:37Z Established=True InitialNamesAccepted: the initial names have been accepted

Looks like it's declaring Established=True, despite the CVO's "does not declare an Established status condition" error.
The upgrade job failed to install. Unrelated to this PR, because the installed version is the base branch.

/test e2e-agnostic-upgrade
New run failed the same way, and confirms that the CVO code is not seeing the conditions that show up in the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521622968680058880/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-644b6dd49d-rkdzj_cluster-version-operator.log | grep 'does not declare' | tail -n2
E0504 03:05:53.570755 1 task.go:112] error running apply for customresourcedefinition "clusteroperators.config.openshift.io" (4 of 791): CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition: []
I0504 03:06:13.358865 1 sync_worker.go:1098] Update error 4 of 791: UpdatePayloadFailed Could not update customresourcedefinition "clusteroperators.config.openshift.io" (4 of 791) (*errors.errorString: CustomResourceDefinition clusteroperators.config.openshift.io does not declare an Established status condition: [])
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/771/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1521622968680058880/artifacts/e2e-agnostic-upgrade/gather-must-gather/artifacts/must-gather.tar | tar xOz registry-build03-ci-openshift-org-ci-op-dwkpjrf0-stable-initial-sha256-862bbab55fec772e032e10f2765e9221f8a743e6fbcdd2d376bbe10a9e875c79/cluster-scoped-resources/apiextensions.k8s.io/customresourcedefinitions/clusteroperators.config.openshift.io.yaml | yaml2json | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2022-05-03T23:08:12Z NamesAccepted=True NoConflicts: no conflicts found
2022-05-03T23:08:12Z Established=True InitialNamesAccepted: the initial names have been accepted
Force-pushed from 4998c6c to 11b7ecb.
I'd dropped this in cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-30, openshift#400), claiming:

> There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists [2,3]). With this commit, we have very minimal wait logic for either. Sufficiently unhealthy DaemonSet should be reported on via their associated ClusterOperator, and sufficiently unhealthy CRD should be reported on when we fail to push any custom resources consuming them (Task.Run retries will give the API server time to ready itself after accepting a CRD update before the CVO fails its sync cycle).

But from [1]:

> It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to push a dependent custom resource. But I was not correct in claiming that the CRD controller set no conditions. I was probably confused by e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), which broke the health-check inputs as described in 002591d (lib/resourcebuilder: Use actual resource in check*Health calls, 2022-05-03, openshift#771). The code I removed in cc9292a was in fact looking at the Established condition already.

This commit restores the Established check, but without the previous PollImmediateUntil wait.

[1]: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
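For readers following the thread, a minimal sketch of what a single-shot Established check along these lines could look like (the function name and exact error wording are illustrative assumptions, not this PR's literal diff):

```go
package resourcebuilder

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// checkCustomResourceDefinitionHealth (hypothetical name) inspects the
// in-cluster CRD once, without polling; Task.Run retries supply the wait.
func checkCustomResourceDefinitionHealth(crd *apiextensionsv1.CustomResourceDefinition) error {
	for _, condition := range crd.Status.Conditions {
		if condition.Type != apiextensionsv1.Established {
			continue
		}
		if condition.Status == apiextensionsv1.ConditionTrue {
			return nil // the API server is serving the new resource
		}
		return fmt.Errorf("CustomResourceDefinition %s is not Established: %s: %s", crd.Name, condition.Reason, condition.Message)
	}
	return fmt.Errorf("CustomResourceDefinition %s does not declare an Established status condition", crd.Name)
}
```

The final error string in the sketch echoes the "does not declare an Established status condition" message seen in the CI logs above, which only fires when the check is handed an object with no conditions at all.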
Force-pushed from 11b7ecb to 38596fc.
Force-pushed from 38596fc to 978ac53.
Force-pushed from 978ac53 to ebb1271.
@wking: This pull request references Bugzilla bug 2081895, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug. Requesting review from QA contact.
lib/resourceapply/apireg.go (outdated)

	// ApplyAPIServicev1 applies the required API service to the cluster.
	func ApplyAPIServicev1(ctx context.Context, client apiregclientv1.APIServicesGetter, required *apiregv1.APIService, reconciling bool) (*apiregv1.APIService, bool, error) {
I'd leave it in so our resourcebuilder continues to support apply and delete for all the resources.
But looking at this further, this is saying CVO will never apply this resource. Did CVO ever apply the resource such that there could ever be an orphan? If not then yeah, I suppose there's no reason for the delete.
Looks like we dropped support for APIService in 5681a70 (#566) saying no 4.8 manifests were using it. Auditing recent z-stream tips:

$ oc adm release extract --to 4.11 quay.io/openshift-release-dev/ocp-release-nightly@sha256:080e5cf5e3e043ac0877b8f545ba2b596016f3fd76f3f457d15060603b3615e1
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.13-x86_64
$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.32-x86_64
$ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.39-x86_64
$ grep -r 'kind: APIService' 4.*
...no hits...

I guess I'll pivot to re-dropping support, and explain that above.
Before:

	if _, _, err := resourceapply.ApplyDeploymentv1(ctx, b.appsClientv1, typedObject, reconcilingMode); err != nil {
		return err

After:

	if actual, _, err := resourceapply.ApplyDeploymentv1(ctx, b.appsClientv1, typedObject, reconcilingMode); err != nil {
		return err
	} else if actual != nil {
So before your change if a resource was IsCreateOnly this would have segfault'ed? Seems like a pretty big hole.
Should the nil check be in each of the functions instead?
no, before my change we were passing typedObject through. That manifest content wasn't nil, but it didn't have anything useful in status either. I'm agnostic about nil checks or panics inside the check*Health functions; either way it would be a pretty serious bug to pass nil through to the health checker, and either way I expect we'll hear about it.
ACK. Did not notice that param change.
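To make the exchange above concrete, here is a minimal, hypothetical sketch of a Deployment health check that only does useful work when handed the in-cluster object returned by the apply call, whose Status the controller has populated; the function name, the nil guard, and the condition inspected are assumptions for illustration, not the CVO's actual code:

```go
package resourcebuilder

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// checkDeploymentHealth (hypothetical) reads status from the in-cluster
// Deployment.  A bare manifest object would pass trivially, because its
// Status carries no conditions; a nil object is treated as a caller bug.
func checkDeploymentHealth(deployment *appsv1.Deployment) error {
	if deployment == nil {
		return fmt.Errorf("no in-cluster Deployment to health-check")
	}
	for _, condition := range deployment.Status.Conditions {
		if condition.Type == appsv1.DeploymentAvailable && condition.Status == corev1.ConditionFalse {
			return fmt.Errorf("deployment %s/%s is not Available: %s: %s", deployment.Namespace, deployment.Name, condition.Reason, condition.Message)
		}
	}
	return nil
}
```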
/hold
The original APIService reconciliation landed in 662e182 (handle APIService, Service objects, 2018-09-27, openshift#26). We definitely reconcile a bunch of Service manifests (e.g. for serving operator metrics, including the CVO's own Service), but it's not clear what the use case was for APIService. DeleteAPIServicev1 landed in 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438), but was never consumed. We dropped APIService reconciliation support in 5681a70 (Drop APIService support, 2021-06-10, openshift#566). This commit drops the unused DeleteAPIServicev1.

Auditing z-stream tips from 4.3 through 4.11:

$ oc adm release extract --to 4.11 quay.io/openshift-release-dev/ocp-release-nightly@sha256:080e5cf5e3e043ac0877b8f545ba2b596016f3fd76f3f457d15060603b3615e1
$ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.13-x86_64
$ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.32-x86_64
$ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.39-x86_64
$ oc adm release extract --to 4.7 quay.io/openshift-release-dev/ocp-release:4.7.50-x86_64
$ oc adm release extract --to 4.6 quay.io/openshift-release-dev/ocp-release:4.6.57-x86_64
$ oc adm release extract --to 4.5 quay.io/openshift-release-dev/ocp-release:4.5.41-x86_64
$ oc adm release extract --to 4.4 quay.io/openshift-release-dev/ocp-release:4.4.33-x86_64
$ oc adm release extract --to 4.3 quay.io/openshift-release-dev/ocp-release:4.3.40-x86_64
$ grep -r 'kind: APIService' 4.*
...no hits...
I'd accidentally broken this in e8ffccb (lib: Add autogeneration for some resource* functionality, 2020-07-29, openshift#420), when I moved the health checks out of the Do methods and began passing typedObject into them. typedObject is the release manifest, which lacks the status attributes we want the health checks to cover.

This also catches the generation script up with some past manual changes:

* 0afb8a8 (Add a manifest annotation to be used for object deletion, 2020-08-17, openshift#438).
* 05e1af7 (Log resource diffs on update only in reconcile mode, 2021-07-13, openshift#628).
Force-pushed from ebb1271 to 6e12b9f.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jottofar, wking.
PDB/DNS stuff is unrelated.

/override ci/prow/e2e-agnostic-upgrade
/hold cancel
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade
@wking: all tests passed!
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 2081895 has been moved to the MODIFIED state.
I'd missed these Gets in 0833bba (lib/resourcebuilder: Use actual resource in check*Health calls, 2022-05-03, openshift#771). What I'd thought was happening:

1. Pass typedObject into check*Health.
2. No status on typedObject, so check*Health passed, even if the in-cluster resource was sad.

What was actually happening:

1. Pass typedObject into check*Health.
2. check*Health calls Get to see what's going on in the cluster.
3. Health check appropriately checks the health of the in-cluster resource.

However, now that 0833bba is passing in the just-retrieved in-cluster resource, the check*Health Get call is an API call that we don't need to make. Dropping it here saves us and the API server a small amount of CPU cycles and network bandwidth.
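A rough before/after sketch of the redundant Get being dropped; the client parameter and function names are assumptions, and checkDeploymentHealth refers to the illustrative sketch earlier in this thread rather than the CVO's exact helper:

```go
package resourcebuilder

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsclientv1 "k8s.io/client-go/kubernetes/typed/apps/v1"
)

// Before (sketch): the health check re-fetched the object itself, costing an
// extra API round trip on every sync cycle.
func checkDeploymentHealthWithGet(ctx context.Context, client appsclientv1.DeploymentsGetter, deployment *appsv1.Deployment) error {
	actual, err := client.Deployments(deployment.Namespace).Get(ctx, deployment.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	return checkDeploymentHealth(actual)
}

// After (sketch): the caller passes the Deployment returned by the apply call
// straight into checkDeploymentHealth, so the Get above (and its round trip
// to the API server) is no longer needed.
```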
I'd dropped this in cc9292a (#400), claiming:

> There's no object status for CRDs or DaemonSets that marks "we are really hurting". The v1.18.0 Kubernetes CRD and DaemonSet controllers do not set any conditions in their operand status (although the API for those conditions exists). With this commit, we have very minimal wait logic for either.

But from upstream docs:

> It might take a few seconds for the endpoint to be created. You can watch the Established condition of your CustomResourceDefinition to be true or watch the discovery information of the API server for your resource to show up.

So I was correct that we will hear about CRD issues when we fail to push a dependent custom resource. But I was not correct in claiming that the CRD controller set no conditions. And the code I removed in cc9292a was in fact looking at the Established condition already.

This commit restores the Established check, but without the previous PollImmediateUntil wait. We'll see how it plays in CI. If we spend too long in between failed sync cycles waiting on new CRDs to show up, we can look at restoring a short wait.
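If a short wait ever does need to come back, one way to layer it on top of the single-shot check might look like the following sketch; the function, client parameter, one-second interval, and the checkCustomResourceDefinitionHealth helper all reuse the earlier illustrative sketch and are assumptions, not anything this PR adds:

```go
package resourcebuilder

import (
	"context"
	"time"

	apiextensionsclientv1 "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/typed/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForEstablished (hypothetical) polls the in-cluster CRD until it reports
// Established=True or the context is cancelled.
func waitForEstablished(ctx context.Context, client apiextensionsclientv1.CustomResourceDefinitionsGetter, name string) error {
	return wait.PollImmediateUntil(time.Second, func() (bool, error) {
		crd, err := client.CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat errors as transient and keep polling
		}
		return checkCustomResourceDefinitionHealth(crd) == nil, nil
	}, ctx.Done())
}
```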