Conversation

@wking wking commented May 14, 2019

This should help avoid persistent failures like:

STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-527zc, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-5749s, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-fhf2k, on node ip-10-0-57-149.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-gb9kw, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-lss79, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-trq9w, on node ip-10-0-64-141.ec2.internal
STEP: Getting zone name for pod ubelite-spread-rc-f3c96168-7346-11e9-9c48-0a58ac10c848-w997t, on node ip-10-0-57-149.ec2.internal
...
fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:170]: Pods were not evenly spread across zones.  0 in one zone and 4 in another zone
Expected
    <int>: 0
to be ~
    <int>: 4

In that case, the nodes were:

$ curl -s 'https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30/artifacts/e2e-aws-upi/nodes.json' | jq -r '.items[] | .metadata.name + "\t" + .metadata.labels["failure-domain.beta.kubernetes.io/zone"]'
ip-10-0-57-149.ec2.internal  us-east-1a
ip-10-0-62-7.ec2.internal    us-east-1a
ip-10-0-64-141.ec2.internal  us-east-1b
ip-10-0-70-141.ec2.internal  us-east-1b
ip-10-0-85-210.ec2.internal  us-east-1c
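
For illustration, here is a rough Python equivalent of that jq one-liner: group node names by their `failure-domain.beta.kubernetes.io/zone` label (the sample data below is a two-node subset of the run above):

```python
# Sketch of what the jq one-liner above extracts from nodes.json:
# map each node name to its failure-domain zone label, then group.
import json
from collections import defaultdict

ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"

def zones_by_node(nodes_json):
    items = json.loads(nodes_json)["items"]
    by_zone = defaultdict(list)
    for node in items:
        zone = node["metadata"]["labels"].get(ZONE_LABEL, "<unlabeled>")
        by_zone[zone].append(node["metadata"]["name"])
    return dict(by_zone)

sample = json.dumps({"items": [
    {"metadata": {"name": "ip-10-0-57-149.ec2.internal",
                  "labels": {ZONE_LABEL: "us-east-1a"}}},
    {"metadata": {"name": "ip-10-0-64-141.ec2.internal",
                  "labels": {ZONE_LABEL: "us-east-1b"}}},
]})
print(zones_by_node(sample))
```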

My guess is that the test logic assumes three zones are available for the pods because we have control-plane nodes in three zones. But before this commit, we had only two compute nodes, so we had:

us-east-1a  ip-10-0-57-149.ec2.internal        4 pods
us-east-1b  ip-10-0-64-141.ec2.internal        3 pods
us-east-1c  control-plane node but no compute  0 pods

With this commit, we will have compute in each zone, so the test should pass.
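
To make the failure concrete, here is a Python sketch of the kind of evenness check the e2e test performs (this is an approximation of the Go logic in ubernetes_lite.go, not the upstream code; the tolerance formula is an assumption):

```python
# Count pods per zone and require the smallest count to be close to
# the largest, as the zone-spread e2e test does.
import math
from collections import Counter

def check_even_spread(pod_zones, expected_zones):
    """pod_zones: zone label for each scheduled pod.
    expected_zones: zones the test believes are schedulable."""
    counts = Counter(pod_zones)
    # Zones with no pods at all still count as zero.
    per_zone = [counts.get(z, 0) for z in expected_zones]
    min_count, max_count = min(per_zone), max(per_zone)
    # Hypothetical tolerance: roughly half a zone's share of the pods.
    tolerance = math.ceil(len(pod_zones) / len(expected_zones) / 2)
    return max_count - min_count <= tolerance

# The failing run above: 4 pods in us-east-1a, 3 in us-east-1b,
# none in us-east-1c (control-plane node there, but no compute).
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
pods = ["us-east-1a"] * 4 + ["us-east-1b"] * 3
print(check_even_spread(pods, zones))  # prints False
```

With a compute node in every zone, a 3/2/2 spread of the same seven pods passes this check.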

CC @vrutkovs, @sdodson

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 14, 2019
…i-e2e: Add third AWS compute node

This should help avoid persistent failures like [1]:

  […]
[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/30

wking commented May 14, 2019

e2e-aws-upi:

Failing tests:

[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s]

So this fixed the multi-zone failures :). Is this networking issue a flake?

/retest


wking commented May 14, 2019

e2e-aws-upi:

An error occurred (AlreadyExistsException) when calling the CreateStack operation: Stack [ci-op-i1xm5j2m-16c02-infra] already exists

Leaking stacks. Digging...


wking commented May 14, 2019

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/teardown.log | gunzip | tail -n3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-028eef9bda0a5008c" id=sg-028eef9bda0a5008c

Waiter StackDeleteComplete failed: Waiter encountered a terminal failure state

But I don't see any of:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3775/rehearse-3775-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/e2e-aws-upi/container-logs/setup.log | gunzip | grep StackId
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-vpc/219e4cc0-7604-11e9-ac0b-0a16600979f0"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-infra/a13ae8d0-7604-11e9-b3d9-0e0ed2de56d2"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-security/20fb3980-7605-11e9-9222-0a5651437e88"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-bootstrap/7cf3b050-7605-11e9-8ab0-129cd46a326a"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-control-plane/fc6f7df0-7605-11e9-8136-0e3e6f4b77b8"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-0/46f407b0-7606-11e9-96ba-0adb15c4df9c"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-1/5a8ad4c0-7606-11e9-bf7d-1262c1c6cf8e"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-compute-2/805a6580-7606-11e9-bb67-0a4c9dfbdd94"

now.
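
For reference, the stack names can be recovered from those StackId ARNs mechanically; a small sketch (the regex and sample log lines here are illustrative, drawn from the grep output above):

```python
# Recover stack names from the "StackId" ARNs in setup.log, e.g. to
# feed a cleanup loop. ARN form:
#   arn:aws:cloudformation:REGION:ACCOUNT:stack/NAME/UUID
import re

def stack_names(log_text):
    return re.findall(
        r'"StackId": "arn:aws:cloudformation:[^:]+:\d+:stack/([^/"]+)/',
        log_text)

log = '''
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-vpc/219e4cc0-7604-11e9-ac0b-0a16600979f0"
    "StackId": "arn:aws:cloudformation:us-east-1:460538899914:stack/ci-op-i1xm5j2m-16c02-infra/a13ae8d0-7604-11e9-b3d9-0e0ed2de56d2"
'''
print(stack_names(log))  # ['ci-op-i1xm5j2m-16c02-vpc', 'ci-op-i1xm5j2m-16c02-infra']
```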

/retest

@openshift-ci-robot

@wking: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/installer/master/e2e-aws-upi | 009f6d7 | link | /test pj-rehearse |
| ci/rehearse/openshift/installer/master/e2e-vsphere | 009f6d7 | link | /test pj-rehearse |
| ci/prow/pj-rehearse | 009f6d7 | link | /test pj-rehearse |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


wking commented May 14, 2019

e2e-aws-upi:

Failing tests:

[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [Suite:k8s]

So same as before. I don't think we need to block this PR on that error, but I'll post a fix here if it's a template issue and I figure it out before someone has time to review the multi-zone fix ;).

@abhinavdahiya

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 82896f3 into openshift:master May 15, 2019
@openshift-ci-robot

@wking: Updated the following 2 configmaps:

  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml
  • prow-job-cluster-launch-installer-upi-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-upi-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e.yaml

In response to this:

This should help avoid persistent failures like: […]
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the third-aws-upi-compute-node branch May 15, 2019 22:18