
Conversation

@cgwalters
Member

IIRC we did this just to speed up these tests because updating
workers 1 by 1 blew out our hour budget.

The router requires a minimum of two workers though, and we're just
going to be fighting its PDB.

Since customers can't sanely do this, let's stop doing it in our
tests. If our tests take too long...we'll have to either cut
down the tests or make them a periodic, etc.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 1, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2019
@kikisdeliveryservice kikisdeliveryservice requested review from kikisdeliveryservice and removed request for LorbusChris November 1, 2019 18:55
@kikisdeliveryservice
Contributor

Even if we do this, our tests will start failing, as we're right on the edge of timing out all the time as is...

@kikisdeliveryservice
Contributor

Let's see how this pans out in CI runs before LGTM.

@kikisdeliveryservice
Contributor

But if we do support this, should we really be changing the tests rather than letting network edge make accommodations?

@cgwalters
Member Author

Right, I noted that in this part of the commit message:

If our tests take too long...we'll have to either cut down the tests or make them a periodic, etc.

Another option is to make them optional, e.g. we could fairly easily move FIPS to a separate /test e2e-gcp-fips job that doesn't run by default.

I'm running the current e2e tests (without this commit) on a sacrificial 4.3/GCP cluster to see if I can figure out what's going on with those OutOfDisk errors.

@kikisdeliveryservice
Contributor

Another option is to make them optional, e.g. we could fairly easily move FIPS to a separate /test e2e-gcp-fips job that doesn't run by default.

This might make a lot of sense..

@cgwalters
Member Author

/retest

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Nov 1, 2019

Seriously GCPPPPP?!

level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to generate asset \"Master Machines\": failed to fetch availability zones: failed to list zones: Get https://www.googleapis.com/compute/v1/projects/openshift-gce-devel-ci/zones?alt=json&filter=%28region+eq+https%3A%2F%2Fwww.googleapis.com%2Fcompute%2Fv1%2Fprojects%2Fopenshift-gce-devel-ci%2Fregions%2Fus-east1%29+%28status+eq+UP%29&prettyPrint=false: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout" 

/test e2e-gcp-op

@cgwalters
Member Author

cgwalters commented Nov 1, 2019

So on the OutOfDisk thing...I am seeing the MCC report it, but...logging into the nodes, I don't see anything in journalctl -u kubelet --grep=OutOfDisk, and df -h looks completely fine (92% free).

walters@toolbox ~/s/d/r/ostree> oc logs deploy/machine-config-controller |tail
I1101 19:33:49.614434       1 node_controller.go:754] Setting node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.636455       1 node_controller.go:754] Setting node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.636563       1 node_controller.go:452] Pool worker: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.655497       1 node_controller.go:754] Setting node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal to desired config rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.655648       1 node_controller.go:452] Pool worker: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:49.672322       1 node_controller.go:452] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-d5a9d25defe14976fb373dc091ab98d2
I1101 19:33:51.781481       1 node_controller.go:433] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:33:51.809661       1 node_controller.go:433] Pool worker: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:33:51.844602       1 node_controller.go:433] Pool worker: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal is reporting Unschedulable
I1101 19:34:45.067040       1 node_controller.go:433] Pool worker: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is now reporting unready: node walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal is reporting OutOfDisk
walters@toolbox ~/s/d/r/ostree> oc get nodes -o wide
NAME                                                    STATUS                        ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION          CONTAINER-RUNTIME
walter-lnblq-m-0.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.5                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-m-1.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.4                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-m-2.c.openshift-gce-devel.internal         Ready                         master   64m   v1.16.2   10.0.0.3                    Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-a-gbtqd.c.openshift-gce-devel.internal   NotReady,SchedulingDisabled   worker   54m   v1.16.2   10.0.32.2                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-b-6r4rb.c.openshift-gce-devel.internal   Ready,SchedulingDisabled      worker   54m   v1.16.2   10.0.32.3                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walter-lnblq-w-c-4vczn.c.openshift-gce-devel.internal   Ready,SchedulingDisabled      worker   54m   v1.16.2   10.0.32.4                   Red Hat Enterprise Linux CoreOS 43.81.201911011153.0 (Ootpa)   4.18.0-147.el8.x86_64   cri-o://1.16.0-0.4.dev.rhaos4.3.giteed6aa1.el8-rc2
walters@toolbox ~/s/d/r/ostree> 
walters@toolbox ~> oc debug node/walter-lnblq-m-0.c.openshift-gce-devel.internal -- chroot /host df -h /
Starting pod/walter-lnblq-m-0copenshift-gce-develinternal-debug ...
To use host binaries, run `chroot /host`
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4       128G  9.0G  119G   7% /

Removing debug pod ...
walters@toolbox ~> 

@kikisdeliveryservice
Contributor

/skip

@kikisdeliveryservice
Contributor

Job hasn't officially finished but seeing in logs:

 --- FAIL: TestFIPS (1204.04s)
    mcd_test.go:524: Created fips-e254bc7b-3317-4ea4-9e3f-7833aef9a1a6
    mcd_test.go:115: Pool worker has rendered config fips-e254bc7b-3317-4ea4-9e3f-7833aef9a1a6 with rendered-worker-465d64a464236cf3a1d7d181960a5852 (waited 4.013533076s)
    mcd_test.go:530: pool worker didn't report updated to rendered-worker-465d64a464236cf3a1d7d181960a5852: timed out waiting for the condition 

So this isn't going to fix it for us... and since this failure is also being seen in e2e-aws, maybe the fix needs to be elsewhere...

@kikisdeliveryservice
Contributor

Will wait on final logs to verify..

@kikisdeliveryservice
Contributor

/test e2e-gcp-op

@kikisdeliveryservice
Contributor

Let's see if this passes today... and if we can make some modifications to get our e2e going..

@kikisdeliveryservice
Contributor

 level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Cluster operator authentication is still updating"
2019/11/04 19:05:42 Container setup in pod e2e-gcp-op failed, exit code 1, reason Error 

:(

/retest

@kikisdeliveryservice
Contributor

@cgwalters any ideas on how to proceed with this? (Of course CI is currently busted with a different problem entirely, AFAIK...)

@kikisdeliveryservice
Contributor

Looking at the failed run from Friday, the test ran for 5768.476s = 96.14 min, which is under the 120 min timeout?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 cd4ae7a link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-gcp-op cd4ae7a link /test e2e-gcp-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

@cgwalters any ideas on how to proceed with this? (Of course CI is currently busted with a different problem entirely, AFAIK...)

Nope, not sure what's going on yet. Only had about 10% bandwidth today for this. Will keep looking at it though.

@kikisdeliveryservice
Contributor

@cgwalters @ashcrow I'm noticing a bunch of test time discrepancies see my pr:
#1244

I'm going to do some other investigation and will report back also lodged some questions in test-platform.

@openshift-merge-robot openshift-merge-robot merged commit cd4ae7a into openshift:master Nov 5, 2019
