
Conversation

@cgwalters
Member

IIRC we did this just to speed up these tests because updating
workers 1 by 1 blew out our hour budget.

The router requires a minimum of two workers though, and we're just
going to be fighting its PDB.

Since customers can't sanely do this, let's stop doing it in our
tests. If our tests take too long...we'll have to either cut
down the tests or make them a periodic, etc.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 1, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2019
@kikisdeliveryservice kikisdeliveryservice requested review from kikisdeliveryservice and removed request for LorbusChris November 1, 2019 18:55
@kikisdeliveryservice
Contributor

Even if we do this, our tests will start failing, as we're right on the edge of timing out all the time as is...

@kikisdeliveryservice
Contributor

Let's see how this pans out in CI runs before LGTM.

@kikisdeliveryservice
Contributor

But if we do support this, should we really be changing the tests rather than letting network edge make accommodations?

@cgwalters
Member Author

Right, I noted that in this part of the commit message:

If our tests take too long...we'll have to either cut down the tests or make them a periodic, etc.

Another option is to make them optional, e.g. we could fairly easily move FIPS to a separate /test e2e-gcp-fips job that doesn't run by default.

I'm running the current e2e tests (without this commit) on a sacrificial 4.3/GCP cluster to see if I can figure out what's going on with those OutOfDisk errors.

@kikisdeliveryservice
Contributor

Another option is to make them optional, e.g. we could fairly easily move FIPS to a separate /test e2e-gcp-fips job that doesn't run by default.

This might make a lot of sense..

@cgwalters
Member Author

/retest

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Nov 1, 2019

Seriously GCPPPPP?!

level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to generate asset \"Master Machines\": failed to fetch availability zones: failed to list zones: Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci/zones?alt=json&filter=%28region+eq+https%3A%2F%2Fwww.googleapis.com%2Fcompute%2Fv1%2Fprojects%2Fopenshift-gce-devel-ci%2Fregions%2Fus-east1%29+%28status+eq+UP%29&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout" 

/test e2e-gcp-op

@cgwalters
Member Author

cgwalters commented Nov 1, 2019

So on the OutOfDisk thing...I am seeing the MCC report it, but...logging into the nodes, I don't see anything in journalctl -u kubelet --grep=OutOfDisk, and df -h looks completely fine (92% free).

walters@toolbox ~/s/d/r/ostree> oc logs deploy/machine-config-controller |tail
I1101 19:33:49.614434       1 node_controller.go:754] Setting node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.636455       1 node_controller.go:754] Setting node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.636563       1 node_controller.go:452] Pool worker: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.655497       1 node_controller.go:754] Setting node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.655648       1 node_controller.go:452] Pool worker: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.672322       1 node_controller.go:452] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:51.781481       1 node_controller.go:433] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:33:51.809661       1 node_controller.go:433] Pool worker: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:33:51.844602       1 node_controller.go:433] Pool worker: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:34:45.067040       1 node_controller.go:433] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is reporting OutOfDisk
walters@toolbox ~/s/d/r/ostree> oc get nodes -o wide
NAME                                                    STATUS                        ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
walter-lnblq-m-0.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.5                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-m-1.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.4                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-m-2.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.3                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal   NotReady,SchedulingDisabled   worker   54m   v1.16.2   10.0.32.2                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal   Ready,SchedulingDisabled      worker   54m   v1.16.2   10.0.32.3                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal   Ready,SchedulingDisabled      worker   54m   v1.16.2   10.0.32.4                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walters@toolbox ~/s/d/r/ostree> 
walters@toolbox ~> oc debug node/walter-lnblq-m-0.c.openshift-gce-devel.internal -- chroot /host df -h /
Starting pod/walter-lnblq-m-0copenshift-gce-develinternal-debug ...
To use host binaries, run `chroot /host`
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4       128G  9.0G  119G   7% /

Removing debug pod ...
walters@toolbox ~> 

@kikisdeliveryservice
Contributor

/skip

@kikisdeliveryservice
Contributor

Job hasn't officially finished but seeing in logs:

 --- FAIL: TestFIPS (1204.04s)
    mcd_test.go:524: Created fips-e254bc7b-3317-4ea4-9e3f-7833aef9a1a6
    mcd_test.go:115: Pool worker has rendered config fips-e254bc7b-3317-4ea4-9e3f-7833aef9a1a6 with rendered-worker-465d64a464236cf3a1d7d181960a5852 (waited 4.013533076s)
    mcd_test.go:530: pool worker didn't report updated to rendered-worker-465d64a464236cf3a1d7d181960a5852: timed out waiting for the condition 

So this isn't going to fix it for us... and since this failure is also being seen in e2e-aws, maybe the fix needs to be elsewhere...

@kikisdeliveryservice
Contributor

Will wait on final logs to verify..

@kikisdeliveryservice
Contributor

/test e2e-gcp-op

@kikisdeliveryservice
Contributor

Let's see if this passes today... and if we can make some modifications to get our e2e going..

@kikisdeliveryservice
Contributor

 level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Cluster operator authentication is still updating"
2019/11/04 19:05:42 Container setup in pod e2e-gcp-op failed, exit code 1, reason Error 

:(

/retest

@kikisdeliveryservice
Contributor

@cgwalters any ideas on how to proceed with this? (Of course CI is currently busted with a different problem entirely, AFAIK...)

@kikisdeliveryservice
Contributor

Looking at the failed run from Friday, the test ran for 5768.476s = 96.14 min, which is under the 120 min timeout?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 cd4ae7a link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-gcp-op cd4ae7a link /test e2e-gcp-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

@cgwalters any ideas on how to proceed with this? (Of course CI is currently busted with a different problem entirely, AFAIK...)

Nope, not sure what's going on yet. Only had about 10% bandwidth today for this. Will keep looking at it though.

@kikisdeliveryservice
Contributor

@cgwalters @ashcrow I'm noticing a bunch of test time discrepancies see my pr:
#1244

I'm going to do some other investigation and will report back also lodged some questions in test-platform.

@openshift-merge-robot openshift-merge-robot merged commit cd4ae7a into openshift:master Nov 5, 2019
