Conversation

abhinavdahiya (Contributor) commented Oct 4, 2018

This allows the installer to block installation of components in the
release manifests that are causing conflicts with the old
tectonic-operators.

Requires openshift/cluster-version-operator#30

@wking this should allow us to stop installing conflicting new operators
/cc @wking
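
For context, here is a sketch of the kind of override entry involved
(the field names come from the review snippets further down; the file
name and the append-style invocation are illustrative assumptions):

  # Hypothetical illustration: add an override entry telling the
  # cluster-version operator to leave this Deployment unmanaged.
  $ cat >> cvo-overrides.yaml <<'EOF'
  - kind: Deployment
    namespace: openshift-cluster-network-operator
    name: cluster-network-operator
    unmanaged: true
  EOF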

openshift-ci-robot added the size/M and approved labels Oct 4, 2018
wking (Member) commented Oct 4, 2018

I don't understand well enough to review your list of operators to ignore, but +1 on the approach to unstick us.

abhinavdahiya (Contributor, Author) commented:
/retest

openshift/cluster-version-operator#30 merged.

crawford (Contributor) commented Oct 4, 2018

/lgtm

openshift-ci-robot added the lgtm label Oct 4, 2018
openshift-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Details: Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhinavdahiya (Contributor, Author) commented:
openshift/release#1814 fixes this error from e2e-aws:

  which: no extended.test in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
  /bin/bash: line 93: ginkgo: command not found

/retest

openshift-merge-robot merged commit 93e9292 into openshift:master Oct 4, 2018
wking added a commit to wking/openshift-release that referenced this pull request Oct 4, 2018
…lures

These are currently generating a lot of error messages.  From [1]
(testing openshift/installer#415):

  Gathering artifacts ...
  Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
  Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
  Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
  Error from server (NotFound): the server could not find the requested resource
  ...
  Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
  Error from server (BadRequest): previous terminated container "kube-apiserver" in pod "kube-apiserver-2hf2w" not found
  Error from server (BadRequest): previous terminated container "kube-apiserver" in pod "kube-apiserver-7pgl9" not found
  ...

Looking at the extracted logs, lots of them are zero-length (an empty
log compresses to 20 bytes):

  $ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
  $ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
  86
  $ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
  172
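
For reference, an empty input gzips to a fixed 20 bytes of header and
trailer, which is why 20-byte artifacts mean empty logs:

  $ printf '' | gzip | wc -c
  20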

And, possibly because of the errors, the commands are slow, with one
of the above lines coming out every second or so.  The teardown
container obviously does some other things as well, but it's taking a
significant chunk of our e2e-aws time [2]:

  2018/10/04 17:59:00 Running pod e2e-aws
  2018/10/04 18:03:25 Container setup in pod e2e-aws completed successfully
  2018/10/04 18:16:37 Container test in pod e2e-aws completed successfully
  2018/10/04 18:33:31 Container teardown in pod e2e-aws completed successfully
  2018/10/04 18:33:31 Pod e2e-aws succeeded after 34m31s

So 4.5 minutes for setup, 13 minutes for test, and 17 minutes for
teardown.
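
Checking those numbers from the timestamps above (a quick sketch,
assuming GNU date):

  $ for phase in 'setup 17:59:00 18:03:25' 'test 18:03:25 18:16:37' 'teardown 18:16:37 18:33:31'
  > do
  >   set -- ${phase}
  >   echo "${1}: $(($(date -d "${3}" +%s) - $(date -d "${2}" +%s)))s"
  > done
  setup: 265s
  test: 792s
  teardown: 1014s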

When the tests pass, we probably aren't going to be poking around in
the logs, so drop log acquisition in those cases to speed up our CI.

[1]: https://api.ci.openshift.org/console/project/ci-op-w11cl72x/browse/pods/e2e-aws?tab=logs
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/build-log.txt
wking added a commit to wking/openshift-release that referenced this pull request Oct 4, 2018
With 10 pulls going at once.

These are currently generating a lot of error messages.  From recent
openshift/installer#415 tests [1]:

  $ oc project ci-op-w11cl72x
  $ oc logs e2e-aws -c teardown --timestamps
  2018-10-04T18:17:06.557740109Z Gathering artifacts ...
  2018-10-04T18:17:24.875374828Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
  ...
  2018-10-04T18:17:29.331684772Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
  2018-10-04T18:17:29.351919855Z Error from server (NotFound): the server could not find the requested resource
  2018-10-04T18:17:39.592948165Z Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
  ...
  2018-10-04T18:29:24.457841097Z Error from server (BadRequest): previous terminated container "kube-addon-operator" in pod "kube-addon-operator-775d4c8f8d-289zm" not found
  2018-10-04T18:29:24.466213055Z Waiting for node logs to finish ...
  2018-10-04T18:29:24.466289887Z Deprovisioning cluster ...
  2018-10-04T18:29:24.483065903Z level=debug msg="Deleting security groups"
  ...
  2018-10-04T18:33:29.857465158Z level=debug msg="goroutine deleteVPCs complete"

So 12 minutes to pull the logs, followed by four minutes for
destroy-cluster.

Looking at the extracted logs, lots of them are zero-length (an empty
log compresses to 20 bytes):

  $ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
  $ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
  86
  $ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
  172

So it's possible that the delay is due to the errors, or to a few
large logs blocking the old, serial pod/container pulls.

With this commit, I've added a new 'queue' command.  This command
checks to see how many background jobs we have using 'jobs' [2], and
idles until we get below 10.  Then it launches its particular command
in the background.  By using 'queue', we'll keep up to 10 log-fetches
running in parallel, and the final 'wait' will block for any which
still happen to be running by that point.

The previous gzip invocations used -c, which dates back to 82d333e
(Set up artifact reporting for ci-operator jobs, 2018-05-17, openshift#867).
But with these gzip filters running on stdin anyway, the -c was
superfluous.  I've dropped it in this commit.

Moving the redirect target to a positional argument is a bit kludgy.
I'd rather have a more familiar way of phrasing that redirect, but
passing it in as ${1} was the best I've come up with.
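
For the record, a minimal sketch of such a 'queue' helper (the 10-job
cap, the jobs(1)-based counting, and the ${1} redirect target are all
from the description above; the real script may differ in detail):

  # Wait until fewer than 10 background jobs are live, then run the
  # given command in the background with stdout redirected to "${1}".
  function queue() {
    local TARGET="${1}"
    shift
    while [ "$(jobs | wc -l)" -ge 10 ]; do
      sleep 1
    done
    "${@}" >"${TARGET}" &
  }

  # Hypothetical usage, with the gzip filter running on stdin:
  queue artifacts/pods/example.log.gz bash -c 'oc logs example | gzip'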

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/build-log.txt
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
    namespace: openshift-cluster-network-operator
    name: cluster-network-operator
    unmanaged: true
  - kind: Deployment # this conflicts with tectonic-ingress-controller-operator
ironcladlou (Contributor) commented:

Not that it matters now, but this doesn't actually conflict.

abhinavdahiya (Contributor, Author) replied:

@ironcladlou if that's the case, feel free to open a PR to drop this override.

    namespace: openshift-cluster-dns-operator
    name: cluster-dns-operator
    unmanaged: true
  - kind: Deployment # this conflicts with kube-core-operator
ironcladlou (Contributor) commented:
This doesn't conflict.

abhinavdahiya (Contributor, Author) replied:

Again, @ironcladlou, if that's the case, feel free to open a PR to drop this override.

abhinavdahiya deleted the cvo_overrides branch October 5, 2018 13:13
wking added a commit to wking/openshift-release that referenced this pull request Oct 5, 2018
Currently, for things like [1,2] that try to unstick us vs. some
external change, we need to /hold the other approved PRs to get them
out of the merge queue while the fix goes in.  With the bot removed
from our repository, those PRs would remove themselves as they failed
naturally, and we'd just /retest them after the fix lands.  We can
turn the bot back on once we're back to one external workaround a
week or so, vs. our current several per day ;).

Docs for the repo: syntax are in [3].
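
For example, something like the following (a hypothetical query; not
necessarily the bot's actual configuration):

  is:pr is:open label:lgtm label:approved repo:openshift/installer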

[1]: openshift/installer#415
[2]: openshift/installer#425
[3]: https://help.github.com/articles/searching-issues-and-pull-requests/#search-within-a-users-or-organizations-repositories