ci-operator/templates/openshift/installer/cluster-launch-installer-upi-e2e: Initial AWS UPI template #3440
Conversation
Force-pushed fec962f to dac4d74.
petr-muller left a comment:
/approve
Force-pushed 007745d to 799fe4a.
No:

$ podman pull registry.svc.ci.openshift.org/ci-op-xqpc1pw7/release@sha256:56f0603347fb1e71596953ae1b4a637ab89dce8be8e268ca8cef8646631a3f22
$ podman run --rm -it --entrypoint rpm registry.svc.ci.openshift.org/ci-op-xqpc1pw7/release@sha256:56f0603347fb1e71596953ae1b4a637ab89dce8be8e268ca8cef8646631a3f22 -q awscli
package awscli is not installed

Can we get the build log from somewhere? Maybe this aspect of things doesn't play nicely with rehearsals? Wait...

$ podman inspect --format '{{.Config.Entrypoint}}' registry.svc.ci.openshift.org/ci-op-xqpc1pw7/release@sha256:56f0603347fb1e71596953ae1b4a637ab89dce8be8e268ca8cef8646631a3f22
[/usr/bin/cluster-version-operator]

How did a CVO image get over here?
The build log is at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/2/artifacts/build-logs/upi-installer.log.gz.

Each PR triggers a new CVO image build, even if the component is not part of it.
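For what it's worth, listing what that payload actually references might also help here; something like the following should work (a sketch, assuming oc is installed and can pull from the CI registry):

# pullspec copied from the rehearsal above; oc access to the CI registry is assumed
$ RELEASE=registry.svc.ci.openshift.org/ci-op-xqpc1pw7/release@sha256:56f0603347fb1e71596953ae1b4a637ab89dce8be8e268ca8cef8646631a3f22
$ oc adm release info "${RELEASE}"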
Rebased and updated with 799fe4a -> 71cdc8977, although in hindsight I implemented this under the assumption that openshift/installer#1649 had landed :p.
Force-pushed ee071e0 to 422d571.
Hooray, vSphere is now up to: I'm trying to figure out if that's vSphere's current standard...
you can use something similar to #3612
> you can use something similar to #3612
Are we committed to that approach? Seems like a hack to me. And if so, this isn't much more of a hack ;).
> Are we committed to that approach? Seems like a hack to me. And if so, this isn't much more of a hack ;).
That's only for our testing of UPI, and using the copied rhcos.json at least makes sure we don't need to bump bootimages here when we bump them in the installer.
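For reference, pulling an AMI out of a copied rhcos.json would look roughly like this (a sketch; the data/data/rhcos.json path and the .amis[region].hvm layout are my assumptions about the installer's metadata format, so adjust to whatever the copied file actually contains):

# sketch: extract the us-east-1 boot AMI from the installer's rhcos.json
# (path and jq filter are assumptions about the file layout)
$ curl -s https://raw.githubusercontent.com/openshift/installer/master/data/data/rhcos.json | jq -r '.amis["us-east-1"].hvm'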
> That's only for our testing of UPI...
Right, but holding out for a real fix will help motivate us to give customers something so that they don't have to bump bootimages when we bump them in the installer ;). Dog food, for the win :p
> That's only for our testing of UPI...
>
> Right, but holding out for a real fix will help motivate us to give customers something so that they don't have to bump bootimages when we bump them in the installer ;). Dog food, for the win :p
But we cannot test the bump in the installer if you hard-code it here.
> But we cannot test the bump in the installer if you hard-code it here.
Ahh, true. But we don't have tight doc coupling now, so there's some benefit in testing both the old AMI (which the docs will still be recommending) and the new AMI (which IPI will use).
/test pj-rehearse
With 136ab5408 -> 7b2c175fb, I've rebased onto master, dropped the openshift/installer#1706 workarounds now that that's landed, and pushed a compute node into the second private subnet to avoid failing:
Looking at the ClusterOperator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/23/artifacts/e2e-aws-upi/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "image-registry") | ([.status.conditions[] | {key: .type, value: .}] | from_entries).Progressing'
{
"lastTransitionTime": "2019-05-04T03:45:29Z",
"message": "All resources are successfully applied, but the deployment does not exist",
"reason": "WaitingForDeployment",
"status": "True",
"type": "Progressing"
}

Checking that Deployment:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/23/artifacts/e2e-aws-upi/must-gather/namespaces/openshift-image-registry/apps/deployments/image-registry.yaml | yaml2json | jq .status.conditions
[
{
"reason": "MinimumReplicasAvailable",
"message": "Deployment has minimum availability.",
"type": "Available",
"status": "True",
"lastTransitionTime": "2019-05-04T03:45:48Z",
"lastUpdateTime": "2019-05-04T03:45:48Z"
},
{
"reason": "NewReplicaSetAvailable",
"message": "ReplicaSet \"image-registry-7749f787d4\" has successfully progressed.",
"type": "Progressing",
"status": "True",
"lastTransitionTime": "2019-05-04T03:45:29Z",
"lastUpdateTime": "2019-05-04T03:45:48Z"
}
]

I dunno why it took so long that we timed out (by 19 seconds!?). But whatever, I'll just kick it again out of curiosity:

/retest
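If this keeps being a near-miss, it might be worth watching the rollout directly instead of relying on the must-gather snapshot (a sketch, assuming kubeconfig access to the test cluster while it's still up):

# sketch: block until the image-registry Deployment finishes rolling out (or times out)
$ oc -n openshift-image-registry rollout status deployment/image-registry --timeout=5m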
Hrm, same e2e-aws error as last time. This time:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/24/artifacts/e2e-aws-upi/installer/.openshift_install.log | grep fatal
time="2019-05-04T05:42:27Z" level=fatal msg="failed to initialize the cluster: Cluster operator image-registry is still updating: timed out waiting for the condition"
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/24/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/config.openshift.io/clusterversions.yaml | yaml2json | jq '.items[0].status.conditions[] | select(.type == "Failing" or .type == "Progressing")'
{
"lastTransitionTime": "2019-05-04T05:37:08Z",
"message": "Cluster operator image-registry is still updating",
"status": "True",
"type": "Failing",
"reason": "ClusterOperatorNotAvailable"
}
{
"lastTransitionTime": "2019-05-04T05:07:58Z",
"message": "Unable to apply 0.0.1-2019-05-04-044538: the cluster operator image-registry has not yet successfully rolled out",
"status": "True",
"type": "Progressing",
"reason": "ClusterOperatorNotAvailable"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/24/artifacts/e2e-aws-upi/must-gather/cluster-scoped-resources/config.openshift.io/clusteroperators/image-registry.yaml | yaml2json | jq .status.conditions
[
{
"status": "False",
"message": "The deployment does not have available replicas",
"lastTransitionTime": "2019-05-04T05:13:32Z",
"reason": "NoReplicasAvailable",
"type": "Available"
},
{
"status": "True",
"message": "The deployment has not completed",
"lastTransitionTime": "2019-05-04T05:13:32Z",
"reason": "DeploymentNotCompleted",
"type": "Progressing"
},
{
"status": "False",
"lastTransitionTime": "2019-05-04T05:13:32Z",
"type": "Degraded"
}
]
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/24/artifacts/e2e-aws-upi/pods/openshift-image-registry_cluster-image-registry-operator-fd4799767-9gfvh_cluster-image-registry-operator.log | gunzip | grep Unable
I0504 05:13:39.999971 1 controller.go:199] object changed: *v1.Config, Name=cluster (status=true): changed:status.conditions.5.message={"The deployment has not completed" -> "Unable to apply resources: unable to sync storage configuration: exactly one storage type should be configured at the same time, got 2: [EmptyDir S3]"}, changed:status.conditions.5.reason={"DeploymentNotCompleted" -> "Error"}, changed:status.observedGeneration={"2.000000" -> "3.000000"}

But I see no
Nope, good catch. Fixed with 7b2c175fb -> 7d4e4349b, which will leave the registry provisioning its own S3 bucket. Folks taking the UPI path may want that, or they may want to configure the registry to use an existing S3 bucket that they create themselves. For now, a registry-provisioned bucket seems easiest to exercise in CI.
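For the record, pointing the registry at a pre-created bucket should just be a patch to the operator config, roughly like the sketch below (bucket and region names are placeholders; the spec.storage.s3 fields are my reading of the configs.imageregistry.operator.openshift.io API):

# sketch: configure the registry operator to use an existing S3 bucket
# ("my-upi-registry" and "us-east-1" are placeholders)
$ oc patch configs.imageregistry.operator.openshift.io cluster --type=merge \
    --patch '{"spec":{"storage":{"s3":{"bucket":"my-upi-registry","region":"us-east-1"}}}}'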
Ok, this round, AWS: Still two
Double-checking the node locations, we get the expected distribution:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/25/artifacts/e2e-aws-upi/nodes.json | jq '.items[] | {name: .metadata.name, zone: .metadata.labels["failure-domain.beta.kubernetes.io/zone"], conditions: ([.status.conditions[] | {key: .type, value: .status}] | from_entries)}'
{
"name": "ip-10-0-50-219.ec2.internal",
"zone": "us-east-1a",
"conditions": {
"MemoryPressure": "False",
"DiskPressure": "False",
"PIDPressure": "False",
"Ready": "True"
}
}
{
"name": "ip-10-0-62-49.ec2.internal",
"zone": "us-east-1a",
"conditions": {
"MemoryPressure": "False",
"DiskPressure": "False",
"PIDPressure": "False",
"Ready": "True"
}
}
{
"name": "ip-10-0-69-73.ec2.internal",
"zone": "us-east-1b",
"conditions": {
"MemoryPressure": "False",
"DiskPressure": "False",
"PIDPressure": "False",
"Ready": "True"
}
}
{
"name": "ip-10-0-75-186.ec2.internal",
"zone": "us-east-1b",
"conditions": {
"MemoryPressure": "False",
"DiskPressure": "False",
"PIDPressure": "False",
"Ready": "True"
}
}
{
"name": "ip-10-0-91-97.ec2.internal",
"zone": "us-east-1c",
"conditions": {
"MemoryPressure": "False",
"DiskPressure": "False",
"PIDPressure": "False",
"Ready": "True"
}
}
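For a quicker summary, the same artifact collapses to a per-zone count with something like this (untested sketch against the same nodes.json):

# sketch: count nodes per availability zone
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/25/artifacts/e2e-aws-upi/nodes.json | jq -r '.items[].metadata.labels["failure-domain.beta.kubernetes.io/zone"]' | sort | uniq -c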
🤷‍♂️ /retest
[AWS][1]:
AWS: Those happened last time too, so they're probably real. Still digging...
/retest
e2e-aws has the same multi-AZ errors as before: Still dunno what's going on there.
/lgtm |
…i-e2e: Add AWS support
We can't indent the here-docs more deeply unless we use tabs and <<-EOF [1].

The more-specific bootstrap-exporter selector avoids the vSphere job's service attaching to the AWS job's exporter pod, etc.

The "${!SUBNET}" indirect parameter expansion spreads us over two zones to avoid failing [2]:

[sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Suite:openshift/conformance/parallel] [Suite:k8s]

I'm somewhat surprised that we need to set AWS_DEFAULT_REGION, but see [3]:

You must specify a region. You can also configure your region by running "aws configure".

[1]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/utilities/V3_chap02.html#tag_18_07_04
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/22
[3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_release/3440/rehearse-3440-pull-ci-openshift-installer-master-e2e-aws-upi/10
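As a minimal sketch of the "${!SUBNET}" indirection (illustrative variable names and subnet IDs, not the template's actual ones):

# sketch: "${!SUBNET}" expands the variable whose name is stored in SUBNET,
# so a loop index can pick a per-zone subnet variable
SUBNET_0=subnet-aaaa1111  # placeholder IDs
SUBNET_1=subnet-bbbb2222
for index in 0 1; do
  SUBNET="SUBNET_${index}"
  echo "compute node ${index} -> ${!SUBNET}"
done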
…bmits: Remove some timeout clobbers

Like 4372624 (remote timeout,grace_period from jobs, 2019-05-07, openshift#3713) and 62898a3 (Fixup few remaining fields, 2019-05-08, openshift#3713). This gives us the usual grace period and timeout for OpenShift tests, instead of clobbering the OpenShift values and falling back to the generic Prow defaults.
Pushed 7d4e4349b -> 9ab4e372e, fixing the generated-config error and adding an additional commit to fix that issue for the other presubmit jobs too (following earlier partial work in #3713). @sdodson, re-
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sdodson, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@wking: Updated the following 5 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I think #3775 will fix the multi-zone errors. |
Similar to #3305, but this template is AWS-specific. I'll look into unifying later. I also still need to wire up an installer job for this.
CC @abhinavdahiya, @cuppett, @staebler, @vrutkovs