Conversation

@wking wking commented May 14, 2019

This should help avoid persistent failures like:

STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-527zc, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-5749s, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-fhf2k, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-gb9kw, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-lss79, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-trq9w, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-w997t, on node ip-10-0-57-149.ec2.internal
...
fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:170]: Pods were not evenly spread across zones.  0 in one zone and 4 in another zone
Expected
    <int>: 0
to be ~
    <int>: 4

In that case, the nodes were:

$ curl -s 'https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30/artifacts/e2e-aws-upi/nodes.json' | jq -r '.items[] | .metadata.name + "\t" + .metadata.labels["failure-domain.beta.kubernetes.io/zone"]'
ip-10-0-57-149.ec2.internal  us-east-1a
ip-10-0-62-7.ec2.internal    us-east-1a
ip-10-0-64-141.ec2.internal  us-east-1b
ip-10-0-70-141.ec2.internal  us-east-1b
ip-10-0-85-210.ec2.internal  us-east-1c
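
For illustration, here is a rough Python equivalent of that jq one-liner: group node names by their `failure-domain.beta.kubernetes.io/zone` label (the sample data below is a two-node subset of the run above):

```python
# Sketch of what the jq one-liner above extracts from nodes.json:
# map each node name to its failure-domain zone label, then group.
import json
from collections import defaultdict

ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"

def zones_by_node(nodes_json):
    items = json.loads(nodes_json)["items"]
    by_zone = defaultdict(list)
    for node in items:
        zone = node["metadata"]["labels"].get(ZONE_LABEL, "<unlabeled>")
        by_zone[zone].append(node["metadata"]["name"])
    return dict(by_zone)

sample = json.dumps({"items": [
    {"metadata": {"name": "ip-10-0-57-149.ec2.internal",
                  "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "ip-10-0-64-141.ec2.internal",
                  "labels": {ZONE_LABEL: "us-east-1b"}}},
]})
print(zones_by_node(sample))
```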

My guess is that the test logic assumes three zones are available for the pods because we have control-plane nodes in three zones. But before this commit, we had only two compute nodes, so we had:

us-east-1a  ip-10-0-57-149.ec2.internal        4 pods
us-east-1b  ip-10-0-64-141.ec2.internal        3 pods
us-east-1c  control-plane node but no compute  0 pods

With this commit, we will have compute in each zone, so the test should pass.
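
To make the failure concrete, here is a Python sketch of the kind of evenness check the e2e test performs (this is an approximation of the Go logic in ubernetes_lite.go, not the upstream code; the tolerance formula is an assumption):

```python
# Count pods per zone and require the smallest count to be close to
# the largest, as the zone-spread e2e test does.
import math
from collections import Counter

def check_even_spread(pod_zones, expected_zones):
    """pod_zones: zone label for each scheduled pod.
    expected_zones: zones the test believes are schedulable."""
    counts = Counter(pod_zones)
    # Zones with no pods at all still count as zero.
    per_zone = [counts.get(z, 0) for z in expected_zones]
    min_count, max_count = min(per_zone), max(per_zone)
    # Hypothetical tolerance: roughly half a zone's share of the pods.
    tolerance = math.ceil(len(pod_zones) / len(expected_zones) / 2)
    return max_count - min_count <= tolerance

# The failing run above: 4 pods in us-east-1a, 3 in us-east-1b,
# none in us-east-1c (control-plane node there, but no compute).
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
pods = ["us-east-1a"] * 4 + ["us-east-1b"] * 3
print(check_even_spread(pods, zones))  # prints False
```

With a compute node in every zone, a 3/2/2 spread of the same seven pods passes this check.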

CC @vrutkovs, @sdodson

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 14, 2019
…i-e2e: Add third AWS compute node

This should help avoid persistent failures like [1]:

  […]
[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30

wking commented May 14, 2019

e2e-aws-upi:

Failing tests:

[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s]

So this fixed the multi-zone failures :). Is this networking issue a flake?

/retest


wking commented May 14, 2019

e2e-aws-upi:

An error occurred (AlreadyExistsException) when calling the CreateStack operation: Stack [ci-op-i1xm5j2m-16c02-infra] already exists

Leaking stacks. Digging...


wking commented May 14, 2019

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/teardown.log | gunzip | tail -n3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-028eef9bda0a5008c" id=sg-028eef9bda0a5008c

Waiter StackDeleteComplete failed: Waiter encountered a terminal failure state

But I don't see any of:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/setup.log | gunzip | grep StackId
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-vpc/219e4cc0-7604-11e9-ac0b-0a16600979f0"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-infra/a13ae8d0-7604-11e9-b3d9-0e0ed2de56d2"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-security/20fb3980-7605-11e9-9222-0a5651437e88"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-bootstrap/7cf3b050-7605-11e9-8ab0-129cd46a326a"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-control-plane/fc6f7df0-7605-11e9-8136-0e3e6f4b77b8"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-0/46f407b0-7606-11e9-96ba-0adb15c4df9c"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-1/5a8ad4c0-7606-11e9-bf7d-1262c1c6cf8e"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-2/805a6580-7606-11e9-bb67-0a4c9dfbdd94"

now.
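
For reference, the stack names can be recovered from those StackId ARNs mechanically; a small sketch (the regex and sample log lines here are illustrative, drawn from the grep output above):

```python
# Recover stack names from the "StackId" ARNs in setup.log, e.g. to
# feed a cleanup loop. ARN form:
#   arn:aws:cloudformation:REGION:ACCOUNT:stack/NAME/UUID
import re

def stack_names(log_text):
    return re.findall(
        r'"StackId": "arn:aws:cloudformation:[^:]+:\d+:stack/([^/"]+)/',
        log_text)

log = '''
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-vpc/219e4cc0-7604-11e9-ac0b-0a16600979f0"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-infra/a13ae8d0-7604-11e9-b3d9-0e0ed2de56d2"
'''
print(stack_names(log))  # ['ci-op-i1xm5j2m-16c02-vpc', 'ci-op-i1xm5j2m-16c02-infra']
```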

/retest

@openshift-ci-robot

@wking: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/installer/master/e2e-aws-upi | 009f6d7 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-vsphere | 009f6d7 | link | /test pj-rehearse |
| ci/prow/pj-rehearse | 009f6d7 | link | /test pj-rehearse |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


wking commented May 14, 2019

e2e-aws-upi:

Failing tests:

[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s]

So same as before. I don't think we need to block this PR on that error, but I'll post a fix here if it's a template issue and I figure it out before someone has time to review the multi-zone fix ;).

@abhinavdahiya

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 82896f3 into openshift:master May 15, 2019
@openshift-ci-robot

@wking: Updated the following 2 configmaps:

  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml
  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml

In response to this:

This should help avoid persistent failures like: […]
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the third-aws-upi-compute-node branch May 15, 2019 22:18