@staebler
Contributor

In the vein of #2011, this PR adds an alibabacloud routes script that fixes hairpin.

Background: Alibaba cloud hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic to the apiserver vip to ourselves via iptables. However, we should only do this when our local apiserver is running.

The apiserver-watcher drops a $VIP.up or $VIP.down file, depending on the state of the apiserver. We then add or remove the iptables rules that short-circuit the load balancer accordingly.

Like Azure, we don't need to do this for external traffic, only local clients.
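
For readers unfamiliar with the Azure approach, the decision logic can be sketched roughly as below. This is a minimal illustration, not the script added in this PR: the `RUN_DIR` location, the `example-vips` chain name, and the `plan_for_vip` helper are all hypothetical, and the function only prints the iptables command it would run rather than executing it.

```shell
#!/bin/bash
# Sketch only. RUN_DIR and CHAIN are assumed names for this example,
# not values taken from the actual routes script.
RUN_DIR="${RUN_DIR:-/run/cloud-routes}"
CHAIN="${CHAIN:-example-vips}"

# Print the iptables command that would apply for one VIP.
# Nothing is executed, so this is safe to run anywhere.
plan_for_vip() {
    local vip="$1"
    if [ -e "${RUN_DIR}/${vip}.up" ]; then
        # Local apiserver is up: redirect VIP traffic to this host,
        # bypassing the load balancer.
        echo "iptables -t nat -A ${CHAIN} -d ${vip} -j REDIRECT"
    elif [ -e "${RUN_DIR}/${vip}.down" ]; then
        # Local apiserver is down: drop the redirect so traffic goes
        # back through the load balancer to another master.
        echo "iptables -t nat -D ${CHAIN} -d ${vip} -j REDIRECT"
    else
        # No marker file yet: leave the rules untouched.
        echo "noop"
    fi
}
```

Calling `plan_for_vip 10.0.3.153` with a `10.0.3.153.up` file present in `RUN_DIR` prints the add rule; with a `.down` file it prints the delete rule.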

  • How to verify it
    Install on alibabacloud, ensure connections to the internal API load balancer are reliable - both when the local apiserver process is running and stopped.

  • Description for the changelog
    Masters on alibaba can now reliably connect to the apiserver service, without encountering hairpin issues

Copy the files used to work around load balancer hairpin limitations
in Azure over to be used in Alibaba cloud.
@openshift-ci
Contributor

openshift-ci bot commented Jan 19, 2022

@staebler: This pull request references Bugzilla bug 2042655, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

Details

In response to this:

Bug 2042655: Alibaba hairpin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 19, 2022
@openshift-ci openshift-ci bot requested review from jkyros and yuqi-zhang January 19, 2022 22:38
@kikisdeliveryservice
Contributor

not sure why alibaba approvers didn't get auto assigned here, so manually doing so:
/assign @kwoodson @fabianofranz @rvanderp3

@kikisdeliveryservice kikisdeliveryservice removed the request for review from jkyros January 19, 2022 22:55
staebler added a commit to staebler/installer that referenced this pull request Jan 20, 2022
…o wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.
@staebler
Contributor Author

I have done two installs where the master nodes are all able to reach the ready state. This is with the reversion of the teardown delay added to the installer.

$ oc get nodes
NAME                                     STATUS   ROLES    AGE     VERSION
mstaeble-hq8b7-master-0                  Ready    master   14m     v1.23.0+60f5a1c
mstaeble-hq8b7-master-1                  Ready    master   14m     v1.23.0+60f5a1c
mstaeble-hq8b7-master-2                  Ready    master   14m     v1.23.0+60f5a1c
mstaeble-hq8b7-worker-us-east-1a-jfgwh   Ready    worker   84s     v1.23.0+60f5a1c
mstaeble-hq8b7-worker-us-east-1b-tbc4c   Ready    worker   2m36s   v1.23.0+60f5a1c

The cluster install succeeds.

INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/staebler/Documents/install-test/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.mstaeble.alicloud-dev.devcluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password: "<redacted>" 
INFO Time elapsed: 28m55s                         
$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-01-19-224345-ci-ln-tthw4qt-latest   True        False         2m57s   Cluster version is 4.10.0-0.ci.test-2022-01-19-224345-ci-ln-tthw4qt-latest

I see the routes service running on the masters.

# systemctl status openshift-alibabacloud-routes.service
● openshift-alibabacloud-routes.service - Work around Azure load balancer hairpin
   Loaded: loaded (/etc/systemd/system/openshift-alibabacloud-routes.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2022-01-20 01:44:14 UTC; 3min 5s ago
  Process: 39238 ExecStart=/bin/bash /opt/libexec/openshift-alibabacloud-routes.sh start (code=exited, status=0/SUCCESS)
 Main PID: 39238 (code=exited, status=0/SUCCESS)
      CPU: 43ms

Jan 20 01:44:14 mstaeble-hq8b7-master-0 systemd[1]: Started Work around Alibaba Cloud load balancer hairpin.
Jan 20 01:44:14 mstaeble-hq8b7-master-0 openshift-alibabacloud-routes[39238]: processing v4 vip 10.0.3.153
Jan 20 01:44:14 mstaeble-hq8b7-master-0 openshift-alibabacloud-routes[39238]: ensuring rule for 10.0.3.153 for internal clients
Jan 20 01:44:14 mstaeble-hq8b7-master-0 openshift-alibabacloud-routes[39238]: done applying vip rules
Jan 20 01:44:14 mstaeble-hq8b7-master-0 systemd[1]: openshift-alibabacloud-routes.service: Succeeded.
Jan 20 01:44:14 mstaeble-hq8b7-master-0 systemd[1]: openshift-alibabacloud-routes.service: Consumed 43ms CPU time

@kwoodson

@staebler This strategy seems like a solid approach. My only concern is maintaining this service file once Alibaba's LB feature fixes this problem. Is there a path to migration or removal of these rules? What is the downside for future maintenance?

@staebler
Contributor Author

> @staebler This strategy seems like a solid approach. My only concern is maintaining this service file once Alibaba's LB feature fixes this problem. Is there a path to migration or removal of these rules? What is the downside for future maintenance?

The service can continue to run even when the load balancer does support hairpinning. The service will detect that communication to the API is working and not add the iptables routes. However, we will need to maintain it indefinitely to continue to support 4.10 users that cannot do hairpinning. This is true even after Alibaba cloud makes the load balancer changes, as the configuration of the load balancers created in 4.10 will still not allow hairpinning.

… of azure

Update the hairpin files copied from Azure so that they reference Alibaba
cloud instead.

Add to the README for apiserver-watcher to describe the behavior on Alibaba
Cloud.
@jianli-wei

FYI I tried building with the 2 PRs, and got 4 successful installations. LGTM, thanks!

The build is from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1484374603642966016.

./openshift-install 4.10.0-0.ci.test-2022-01-21-042054-ci-ln-2sq07qb-latest
built from commit b3f8f5c251273721c3e772be5dec68a38d2b8977
release image registry.build01.ci.openshift.org/ci-ln-2sq07qb/release@sha256:d823761c865826c601fdd4c45a06481b02188c8399239ba93bebe0e3dc243f22
release architecture amd64

The QE flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/69306/ (SUCCESS, region: us-east-1)
01-21 13:15:02.769 level=debug msg=Time elapsed per stage:
01-21 13:15:02.769 level=debug msg= cluster: 2m31s
01-21 13:15:02.769 level=debug msg= bootstrap: 1m18s
01-21 13:15:02.769 level=debug msg=Bootstrap Complete: 13m18s
01-21 13:15:02.769 level=debug msg= API: 2m42s
01-21 13:15:02.769 level=debug msg= Bootstrap Destroy: 44s
01-21 13:15:02.769 level=debug msg= Cluster Operators: 12m50s
01-21 13:15:02.769 level=debug msg= Console: 1s
01-21 13:15:02.769 level=info msg=Time elapsed: 30m44s

The QE flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/69309/ (SUCCESS, region: cn-beijing)
01-21 13:23:03.471 level=debug msg=Time elapsed per stage:
01-21 13:23:03.472 level=debug msg= cluster: 2m0s
01-21 13:23:03.472 level=debug msg= bootstrap: 53s
01-21 13:23:03.472 level=debug msg=Bootstrap Complete: 17m21s
01-21 13:23:03.472 level=debug msg= API: 6m44s
01-21 13:23:03.472 level=debug msg= Bootstrap Destroy: 31s
01-21 13:23:03.472 level=debug msg= Cluster Operators: 12m0s
01-21 13:23:03.472 level=info msg=Time elapsed: 32m47s

The QE flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/69310/ (SUCCESS, region: cn-hangzhou)
01-21 13:39:00.600 level=debug msg=Time elapsed per stage:
01-21 13:39:00.600 level=debug msg= cluster: 1m46s
01-21 13:39:00.600 level=debug msg= bootstrap: 49s
01-21 13:39:00.600 level=debug msg=Bootstrap Complete: 32m34s
01-21 13:39:00.600 level=debug msg= API: 5m48s
01-21 13:39:00.600 level=debug msg= Bootstrap Destroy: 33s
01-21 13:39:00.600 level=debug msg= Cluster Operators: 12m40s
01-21 13:39:00.600 level=info msg=Time elapsed: 48m24s

The QE flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/69321/ (SUCCESS, region: ap-southeast-1)
01-21 14:09:50.588 level=debug msg=Time elapsed per stage:
01-21 14:09:50.588 level=debug msg= cluster: 2m5s
01-21 14:09:50.588 level=debug msg= bootstrap: 56s
01-21 14:09:50.588 level=debug msg=Bootstrap Complete: 23m10s
01-21 14:09:50.588 level=debug msg= API: 4m20s
01-21 14:09:50.588 level=debug msg= Bootstrap Destroy: 35s
01-21 14:09:50.588 level=debug msg= Cluster Operators: 5m47s
01-21 14:09:50.588 level=info msg=Time elapsed: 32m35s

The QE flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/69330/ (SUCCESS, region: cn-hongkong)
01-21 14:51:43.461 level=debug msg=Time elapsed per stage:
01-21 14:51:43.461 level=debug msg= cluster: 1m57s
01-21 14:51:43.461 level=debug msg= bootstrap: 48s
01-21 14:51:43.461 level=debug msg=Bootstrap Complete: 15m41s
01-21 14:51:43.461 level=debug msg= API: 4m47s
01-21 14:51:43.461 level=debug msg= Bootstrap Destroy: 39s
01-21 14:51:43.461 level=debug msg= Cluster Operators: 14m2s
01-21 14:51:43.461 level=info msg=Time elapsed: 33m11s

@kwoodson

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2022
@kwoodson

/assign @kikisdeliveryservice

@mtulio
Contributor

mtulio commented Jan 21, 2022

Successful installation on us-east-1.
I ran quick tests forcibly removing master nodes from the internal LB, and also stopping masters. The cluster keeps working as long as at least two masters are up and running, due to etcd quorum.
/lgtm
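
As a quick check of the quorum arithmetic behind this observation: etcd needs a majority of its members, floor(n/2) + 1, to stay available. A small sketch follows; the helper names `quorum` and `tolerated_failures` are made up for illustration.

```shell
#!/bin/bash
# Majority needed for an etcd cluster of n members: floor(n/2) + 1.
# Shell integer division already rounds down.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be lost while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

With three masters, `quorum 3` prints 2 and `tolerated_failures 3` prints 1, which matches the two-masters-up result above.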

@kwoodson

/approve

Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Thanks for the excellent and thorough run through @jianli-wei

@openshift-ci
Contributor

openshift-ci bot commented Jan 24, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, kwoodson, mtulio, staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2022
@@ -0,0 +1,11 @@
name: openshift-alibabacloud-routes.service
enabled: false
Contributor

I know this is copied over from the azure/gcp code but could you remind me why this is disabled by default? Or does it not matter since the path service triggers it?

@openshift-ci
Contributor

openshift-ci bot commented Jan 24, 2022

@staebler: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-workers-rhel8 | da4c446 | link | false | /test e2e-aws-workers-rhel8 |
| ci/prow/e2e-metal-ipi | da4c446 | link | false | /test e2e-metal-ipi |
| ci/prow/okd-e2e-aws | da4c446 | link | false | /test okd-e2e-aws |
| ci/prow/e2e-aws-upgrade-single-node | da4c446 | link | false | /test e2e-aws-upgrade-single-node |
| ci/prow/e2e-aws-workers-rhel7 | da4c446 | link | false | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws-disruptive | da4c446 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-vsphere-upgrade | da4c446 | link | false | /test e2e-vsphere-upgrade |
| ci/prow/e2e-aws-single-node | da4c446 | link | false | /test e2e-aws-single-node |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 04f026a into openshift:master Jan 25, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 25, 2022

@staebler: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 2042655 has not been moved to the MODIFIED state.

clnperez added a commit to openshift-powervs/installer that referenced this pull request Feb 1, 2022
* azure: Check HyperVGenerations for instance type

If an instance type that does not support HyperVGeneration version 1
then terraform returns an error mentioning there's support only for
V1. Adding a check during install config to check for the versions
supported by the instance type provided.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in regions
that do not support the cloud_essd disk category

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlapping problem of the VSwitch subnet of the Nat gateway with
 the master node VSwitch subnets

Signed-off-by: sunhui <[email protected]>

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue that ironic agent image cannot be downloaded due to network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, no longer create Nat gateways and
EIPs

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.

* update vendor for latest Alibaba API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0  x86_64=410.84.202201251210-0
Verification Steps:

Install a new 4.10 cluster
oc debug node/<node name> -- chroot /host rpm-ostree status
Verify that the deployment version matches the version from this PR
that matches the architecture you are testing on. (i.e. x86_64
should have version 410.84.202201251210-0)

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has shown to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroy not exist security group

The destroyer should not error when it attempts to delete a security group
 that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update sdk and terraform provider version, and add some endpoints of ECS
 service to fix endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destroy exist private zone

Should not destroy pre-configured alicloud DNS private zone

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
clnperez added a commit to openshift-powervs/installer that referenced this pull request Feb 3, 2022
* azure: Check HyperVGenerations for instance type

If an instance type that does not support HyperVGeneration version 1
then terraform returns an error mentioning there's support only for
V1. Adding a check during install config to check for the versions
supported by the instance type provided.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in regions
that do not support the cloud_essd disk category

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlapping problem of the VSwitch subnet of the Nat gateway with
 the master node VSwitch subnets

Signed-off-by: sunhui <[email protected]>

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue that ironic agent image cannot be downloaded due to network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* Azure Stack: Add UPI Instructions for internal CA

Many Azure Stack environments use internal CAs. In these cases
special steps are needed for a UPI install.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, no longer create Nat gateways and
EIPs

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.

* update vendor for latest Alibaba API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0  x86_64=410.84.202201251210-0
Verification Steps:

Install a new 4.10 cluster
oc debug node/<node name> -- chroot /host rpm-ostree status
Verify that the deployment version matches the version from this PR
that matches the architecture you are testing on. (i.e. x86_64
should have version 410.84.202201251210-0)

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has shown to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroy not exist security group

The destroyer should not error when it attempts to delete a security group
 that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update sdk and terraform provider version, and add some endpoints of ECS
 service to fix endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destroy exist private zone

Should not destroy pre-configured alicloud DNS private zone

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

* aws: Remove non-public AWS regions from list of regions

When creating the install-config, the installer displays regions
of all partitions of AWS. Certain regions also need extra information
for the validation to work and should not be taken as input since we
only ask for the bare minimum amount of information to create the
install config.

The best approach here would be to only display all the public regions
of AWS and allow for other regions after the install-config is created
to allow for the user to add the extra information.

* openstack: Don't shortcut cloud scraping if quota is unavailable

This results in an incorrect failure to validate network capabilities
because network extensions weren't loaded.

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
Co-authored-by: Matthew Booth <[email protected]>
clnperez added a commit to clnperez/installer that referenced this pull request Feb 25, 2022
* azure: Check HyperVGenerations for instance type

If an instance type does not support HyperVGeneration version 1,
terraform returns an error mentioning there's support only for V1.
Add a check during install-config validation for the versions
supported by the provided instance type.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset.

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.
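
For illustration, the snippet below calls `(reflect.StructField) IsExported` directly; it compiles only with go 1.17 or later, which is why a go 1.16 image fails. The `sample` type and `exportedFieldNames` helper are hypothetical and unrelated to k8s.io/json.

```go
package main

import (
	"fmt"
	"reflect"
)

// sample is a hypothetical struct used only for illustration.
type sample struct {
	Name   string // exported field
	hidden int    // unexported field
}

// exportedFieldNames returns the names of the exported fields of a struct value.
// It relies on reflect.StructField.IsExported, which was added in go 1.17.
func exportedFieldNames(v interface{}) []string {
	t := reflect.TypeOf(v)
	var names []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.IsExported() {
			names = append(names, f.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(exportedFieldNames(sample{})) // prints [Name]
}
```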

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in
regions that do not support the cloud_essd disk category

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlap between the NAT gateway's VSwitch subnet and the master
node VSwitch subnets

Signed-off-by: sunhui <[email protected]>
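
As a rough sketch of how such an overlap can be detected, the helper below relies on the fact that two CIDR blocks overlap exactly when one contains the other's network address. The `cidrsOverlap` name and the example CIDRs are hypothetical; they are not the installer's actual subnet layout.

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDR blocks share any address.
// CIDR blocks are either nested or disjoint, so checking each network
// address against the other block is sufficient.
func cidrsOverlap(a, b string) (bool, error) {
	_, netA, err := net.ParseCIDR(a)
	if err != nil {
		return false, err
	}
	_, netB, err := net.ParseCIDR(b)
	if err != nil {
		return false, err
	}
	return netA.Contains(netB.IP) || netB.Contains(netA.IP), nil
}

func main() {
	// Hypothetical subnets chosen only to show both outcomes.
	overlap, _ := cidrsOverlap("10.0.0.0/16", "10.0.1.0/24")
	fmt.Println(overlap) // the /24 lies inside the /16
	overlap, _ = cidrsOverlap("10.0.0.0/24", "10.0.1.0/24")
	fmt.Println(overlap) // adjacent /24 blocks are disjoint
}
```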

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue where the ironic agent image cannot be downloaded because of a network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, NAT gateways and EIPs are no longer
created

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.
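
A discriminated union of this kind can be sketched as below. This is an illustration only: the type, constant, and field names are hypothetical and should not be read as the actual openshift/api definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// referenceType is the discriminator: it names which lookup method is in use.
type referenceType string

const (
	byID   referenceType = "ID"
	byName referenceType = "Name"
	byTags referenceType = "Tags"
)

// resourceReference is a hypothetical discriminated union. Exactly one of
// ID, Name, or Tags should be set, and it must match Type.
type resourceReference struct {
	Type referenceType
	ID   *string
	Name *string
	Tags map[string]string
}

// validate checks that exactly one member is set and that it agrees with Type.
func (r resourceReference) validate() error {
	set := 0
	if r.ID != nil {
		set++
	}
	if r.Name != nil {
		set++
	}
	if len(r.Tags) > 0 {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of ID, Name, or Tags must be set")
	}
	switch r.Type {
	case byID:
		if r.ID == nil {
			return errors.New("type ID requires the ID field")
		}
	case byName:
		if r.Name == nil {
			return errors.New("type Name requires the Name field")
		}
	case byTags:
		if len(r.Tags) == 0 {
			return errors.New("type Tags requires the Tags field")
		}
	default:
		return fmt.Errorf("unknown reference type %q", r.Type)
	}
	return nil
}

func main() {
	id := "vsw-example" // hypothetical vswitch ID
	ref := resourceReference{Type: byID, ID: &id}
	fmt.Println(ref.validate() == nil)
}
```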

* update vendor for latest Alibaba API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0  x86_64=410.84.202201251210-0

Verification Steps:

1. Install a new 4.10 cluster.
2. Run `oc debug node/<node name> -- chroot /host rpm-ostree status`.
3. Verify that the deployment version matches the version from this PR
   for the architecture you are testing on (e.g. x86_64 should have
   version 410.84.202201251210-0).

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has proven to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroying a nonexistent security group

The destroyer should not error when it attempts to delete a security group
that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update the SDK and Terraform provider versions, and add some ECS service
endpoints to fix the endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destroying an existing private zone

Do not destroy a pre-configured alicloud DNS private zone

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
