ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Add third AWS compute node #3775
Conversation
ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Add third AWS compute node
This should help avoid persistent failures like [1]:
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-527zc, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-5749s, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-fhf2k, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-gb9kw, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-lss79, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-trq9w, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-w997t, on node ip-10-0-57-149.ec2.internal
...
fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:170]: Pods were not evenly spread across zones. 0 in one zone and 4 in another zone
Expected
<int>: 0
to be ~
<int>: 4
In that case, the nodes were:
$ curl -s 'https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30/artifacts/e2e-aws-upi/nodes.json' | jq -r '.items[] | .metadata.name + "\t" + .metadata.labels["failure-domain.beta.kubernetes.io/zone"]'
ip-10-0-57-149.ec2.internal us-east-1a
ip-10-0-62-7.ec2.internal us-east-1a
ip-10-0-64-141.ec2.internal us-east-1b
ip-10-0-70-141.ec2.internal us-east-1b
ip-10-0-85-210.ec2.internal us-east-1c
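To see the imbalance at a glance, the same nodes.json artifact can be reduced to a per-zone node count. The jq variant below is only an illustrative rewrite of the query above, not output captured from the run:
$ curl -s 'https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30/artifacts/e2e-aws-upi/nodes.json' | jq -r '[.items[].metadata.labels["failure-domain.beta.kubernetes.io/zone"]] | group_by(.)[] | "\(.[0])\t\(length)"'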
My guess is that the test logic assumes the pods have three zones available because we have control-plane nodes in three zones. But before this commit, we only had two compute nodes, so we had:
us-east-1a ip-10-0-57-149.ec2.internal 4 pods
us-east-1b ip-10-0-64-141.ec2.internal 3 pods
us-east-1c control-plane node but no compute 0 pods
With this commit, we will have compute in each zone, so the test
should pass.
[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30
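For context, the shape of the change is roughly this: instead of launching compute CloudFormation stacks for only two subnets, launch one per availability-zone subnet, so every zone that hosts a control-plane node also hosts a compute node. The loop below is a hedged sketch; the variable names, parameter keys, and template filename are assumptions for illustration, not copied from the cluster-launch-installer-upi-e2e template:
# Sketch: one compute stack per availability-zone subnet (illustrative names throughout).
for INDEX in 0 1 2; do
  SUBNET="${PRIVATE_SUBNETS[${INDEX}]}"  # assumed array of per-zone private subnet IDs
  aws cloudformation create-stack \
    --stack-name "${CLUSTER_NAME}-compute-${INDEX}" \
    --template-body "file://${ARTIFACT_DIR}/compute.yaml" \
    --parameters \
      "ParameterKey=InfrastructureName,ParameterValue=${INFRA_ID}" \
      "ParameterKey=Subnet,ParameterValue=${SUBNET}"
done
With compute-0 through compute-2 pinned to different subnets, the spread test sees schedulable capacity in every zone.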
Force-pushed from e5cb88d to 009f6d7
So that fixed the multi-zone failures :). Is this networking issue a flake? /retest
Leaking stacks. Digging...
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/teardown.log | gunzip | tail -n3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-028eef9bda0a5008c" id=sg-028eef9bda0a5008c
Waiter StackDeleteComplete failed: Waiter encountered a terminal failure state
But I don't see any of:
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/setup.log | gunzip | grep StackId
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-vpc/219e4cc0-7604-11e9-ac0b-0a16600979f0"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-infra/a13ae8d0-7604-11e9-b3d9-0e0ed2de56d2"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-security/20fb3980-7605-11e9-9222-0a5651437e88"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-bootstrap/7cf3b050-7605-11e9-8ab0-129cd46a326a"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-control-plane/fc6f7df0-7605-11e9-8136-0e3e6f4b77b8"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-0/46f407b0-7606-11e9-96ba-0adb15c4df9c"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-1/5a8ad4c0-7606-11e9-bf7d-1262c1c6cf8e"
"StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-2/805a6580-7606-11e9-bb67-0a4c9dfbdd94"now. /retest |
@wking: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
So same as before. I don't think we need to block this PR on that error, but I'll post a fix here if it's a template issue and I figure it out before someone has time to review the multi-zone fix ;).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@wking: Updated the following 2 configmaps:
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This should help avoid persistent failures like:
In that case, the nodes were:
My guess is that the test logic is guessing that the pods have three zones available because we have control-plane nodes in three zones. But before this commit, we only had the two compute nodes, so we had:
With this commit, we will have compute in each zone, so the test should pass.
CC @vrutkovs, @sdodson