manifests: add cvo-overrides #415
Conversation
This allows the installer to block installation of components in the release manifests that cause conflicts with the old tectonic-operators.
I don't understand well enough to review your list of operators to ignore, but +1 on the approach to unstick us.

/retest

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
openshift/release#1814 fixes this error from e2e-aws:

```
which: no extended.test in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
/bin/bash: line 93: ginkgo: command not found
```

/retest
…lures

These are currently generating a lot of error messages. From [1] (testing openshift/installer#415):

```
Gathering artifacts ...
Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
Error from server (NotFound): the server could not find the requested resource
...
Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
Error from server (BadRequest): previous terminated container "kube-apiserver" in pod "kube-apiserver-2hf2w" not found
Error from server (BadRequest): previous terminated container "kube-apiserver" in pod "kube-apiserver-7pgl9" not found
...
```

Looking at the extracted logs, lots of them are zero bytes (which compresses to 20 bytes):

```
$ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
$ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
86
$ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
172
```

And, possibly because of the errors, the commands are slow, with one of the above lines coming out every second or so. The teardown container obviously does some other things as well, but it's taking a significant chunk of our e2e-aws time [2]:

```
2018/10/04 17:59:00 Running pod e2e-aws
2018/10/04 18:03:25 Container setup in pod e2e-aws completed successfully
2018/10/04 18:16:37 Container test in pod e2e-aws completed successfully
2018/10/04 18:33:31 Container teardown in pod e2e-aws completed successfully
2018/10/04 18:33:31 Pod e2e-aws succeeded after 34m31s
```

So 4.5 minutes for setup, 13 minutes for test, and 17 minutes for teardown. When the tests pass, we probably aren't going to be poking around in the logs, so drop log acquisition in those cases to speed up our CI.

[1]: https://api.ci.openshift.org/console/project/ci-op-w11cl72x/browse/pods/e2e-aws?tab=logs
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/build-log.txt
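A minimal sketch of the idea in that commit, assuming a bash teardown script that can see the test container's exit status; `TEST_EXIT_CODE` and `gather_artifacts` are hypothetical names for illustration, not the actual openshift/release implementation:

```bash
#!/bin/bash
# Hypothetical sketch: only gather pod/container logs when the tests failed.
# TEST_EXIT_CODE and gather_artifacts are illustrative placeholders.
if [[ "${TEST_EXIT_CODE:-0}" -ne 0 ]]; then
    echo "Gathering artifacts ..."
    gather_artifacts  # the slow, error-prone log acquisition described above
else
    echo "Tests passed; skipping log acquisition."
fi
echo "Deprovisioning cluster ..."
```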
With 10 pulls going at once.

These are currently generating a lot of error messages. From recent openshift/installer#415 tests [1]:

```
$ oc project ci-op-w11cl72x
$ oc logs e2e-aws -c teardown --timestamps
2018-10-04T18:17:06.557740109Z Gathering artifacts ...
2018-10-04T18:17:24.875374828Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
...
2018-10-04T18:17:29.331684772Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
2018-10-04T18:17:29.351919855Z Error from server (NotFound): the server could not find the requested resource
2018-10-04T18:17:39.592948165Z Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
...
2018-10-04T18:29:24.457841097Z Error from server (BadRequest): previous terminated container "kube-addon-operator" in pod "kube-addon-operator-775d4c8f8d-289zm" not found
2018-10-04T18:29:24.466213055Z Waiting for node logs to finish ...
2018-10-04T18:29:24.466289887Z Deprovisioning cluster ...
2018-10-04T18:29:24.483065903Z level=debug msg="Deleting security groups"
...
2018-10-04T18:33:29.857465158Z level=debug msg="goroutine deleteVPCs complete"
```

So 12 minutes to pull the logs, followed by four minutes for destroy-cluster. Looking at the extracted logs, lots of them are zero bytes (which compresses to 20 bytes):

```
$ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
$ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
86
$ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
172
```

So it's possible that the delay is due to the errors, or to a few large logs blocking the old, serial pod/container pulls.

With this commit, I've added a new 'queue' command. It checks how many background jobs we have using 'jobs' [2] and idles until we get below 10, then launches its particular command in the background. By using 'queue', we'll keep up to 10 log-fetches running in parallel, and the final 'wait' will block for any which still happen to be running by that point.

The previous gzip invocations used -c, which dates back to 82d333e (Set up artifact reporting for ci-operator jobs, 2018-05-17, openshift#867). But with these gzip filters running on stdin anyway, the -c was superfluous, so I've dropped it in this commit.

Moving the redirect target to a positional argument is a bit kludgy. I'd rather have a more familiar way of phrasing that redirect, but passing it in as ${1} was the best I've come up with.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/build-log.txt
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
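For illustration, here is a minimal bash sketch of a 'queue' helper along the lines described, with the redirect target passed in as ${1}; the actual implementation in openshift/release may differ in details such as the job limit and filter handling:

```bash
#!/bin/bash
# Sketch of the 'queue' helper described above (illustrative, not the
# exact openshift/release implementation).
function queue() {
    local TARGET="${1}"  # redirect target, passed as a positional argument
    shift
    # Idle until we drop below 10 running background jobs.
    local LIVE
    LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 10 ]]; do
        sleep 1
        LIVE="$(jobs | wc -l)"
    done
    # Launch this particular fetch in the background, gzipping on stdin
    # (no need for 'gzip -c' when gzip is already running as a filter).
    "${@}" | gzip > "${TARGET}" &
}

# Example usage: fetch logs for many pods, at most 10 fetches in parallel,
# then block on any which still happen to be running.
# for POD in "${PODS[@]}"; do
#     queue "artifacts/pods/${POD}.log.gz" oc logs "${POD}"
# done
# wait
```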
```yaml
  namespace: openshift-cluster-network-operator
  name: cluster-network-operator
  unmanaged: true
- kind: Deployment # this conflicts with tectonic-ingress-controller-operator
```
Not that it matters now, but this doesn't actually conflict
@ironcladlou if that's the case, feel free to open a PR to drop this override.
```yaml
  namespace: openshift-cluster-dns-operator
  name: cluster-dns-operator
  unmanaged: true
- kind: Deployment # this conflicts with kube-core-operator
```
This doesn't conflict.
Again, @ironcladlou, if that's the case, feel free to open a PR to drop this override.
Currently, for things like [1,2] that try to unstick us from some external change, we need to /hold the other approved PRs to get them out of the merge queue while the fix goes in. With the bot removed from our repository, those PRs would remove themselves as they failed naturally, and we'd just /retest them after the fix lands. We can turn the bot back on once we get back to one external workaround a week or so, vs. our current several per day ;). Docs for the repo: search syntax are in [3].

[1]: openshift/installer#415
[2]: openshift/installer#425
[3]: https://help.github.com/articles/searching-issues-and-pull-requests/#search-within-a-users-or-organizations-repositories
This allows the installer to block installation of components in the release manifests that cause conflicts with the old tectonic-operators.
Requires openshift/cluster-version-operator#30
@wking this should allow us to stop installing conflicting new operators
/cc @wking
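For context, a sketch of the kind of override entry this PR adds. The entry fields (kind, namespace, name, unmanaged) match the diff hunks quoted above; the file path and the surrounding manifest wrapper shown here are assumptions for illustration and may not match the PR-era layout exactly:

```bash
# Hypothetical sketch of a cvo-overrides entry, based on the hunks above.
# The path and the spec.overrides wrapper are assumptions; only the entry
# fields (kind/namespace/name/unmanaged) come from this PR's diff.
cat > manifests/cvo-overrides.yaml <<'EOF'
spec:
  overrides:
  - kind: Deployment                          # tell the CVO not to manage this component
    namespace: openshift-cluster-dns-operator
    name: cluster-dns-operator
    unmanaged: true                           # avoids fighting the old tectonic operator
EOF
```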