@mtulio mtulio commented Jan 14, 2022

This PR sets the --tear-down-delay flag of cluster-bootstrap so that it waits 10m before tearing down the bootstrap control plane, giving two kube-apiserver pods time to become available before the bootstrap finishes. The delay applies to AlibabaCloud only; the default value will be set to 0.

The problem is caused by bootkube (cluster-bootstrap) finishing prematurely when kube-apiserver has been scheduled to only one master, combined with the SLB limitation where the SLB "cannot be accessed by backend servers"[1].
[1] https://www.alibabacloud.com/help/en/doc-detail/55206.htm

This is an investigation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=2035757
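
The per-platform choice described above can be pictured with a small shell sketch. The function name `tear_down_delay_for` and the platform strings are hypothetical and chosen only for illustration; this is not the code in the PR, only a picture of "10m on AlibabaCloud, 0 everywhere else".

```shell
# Hypothetical helper: pick the --tear-down-delay value for a platform.
# Only AlibabaCloud gets the 10m delay; every other platform keeps 0.
tear_down_delay_for() {
  case "$1" in
    alibabacloud) echo "10m" ;;
    *)            echo "0" ;;
  esac
}
```

For example, `tear_down_delay_for alibabacloud` prints the delay that would be passed as `--tear-down-delay`.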

openshift-ci bot commented Jan 14, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2022
@mtulio mtulio changed the title fix: wait for at least two kube-apiserver pods on alibabacloud WIP | lab: wait for at least two kube-apiserver pods on alibabacloud Jan 14, 2022
@mtulio mtulio force-pushed the wait-for-kas branch 4 times, most recently from f386d1d to 495284c Compare January 14, 2022 21:49
@mtulio mtulio changed the title WIP | lab: wait for at least two kube-apiserver pods on alibabacloud cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud Jan 14, 2022
@staebler staebler changed the title cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud Jan 17, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 17, 2022
openshift-ci bot commented Jan 17, 2022

@mtulio: This pull request references Bugzilla bug 2035757, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianli-wei

Details

In response to this:

Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from jianli-wei January 17, 2022 17:54
@staebler

Here is my analysis of the steps that cause the bug.

  1. One of the masters creates a kube-apiserver pod.
  2. The cluster-bootstrap on the bootstrap node sees that all of the kube-apiserver pods are ready (even though there has only been 1 created so far).
  3. The control plane on the bootstrap is torn down.
  4. The node for the master with the kube-apiserver stops reporting heartbeat since it cannot access the api server via api-int.
  5. The kube-apiserver pod stops behaving since its token is revoked.
  6. Nobody can access the api server any more.

The proposed solution presented in this PR is to delay the time between steps (2) and (3) so that the kube-apiservers on the other nodes have a chance to start while the temporary control plane on the bootstrap node is still running.

This is not a long-term solution as this does not address any issues that may arise during upgrades as new apiserver revisions are rolled out. If there is a time when there is only one kube-apiserver pod running, then the cluster will get stuck with a non-responding apiserver.
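
The delay between steps (2) and (3) can be sketched in shell as follows. This only illustrates the idea and is not cluster-bootstrap's implementation: the function name, its arguments, and the use of plain seconds instead of a duration string such as 10m are all assumptions.

```shell
# Hypothetical sketch: once the required pods are reported ready (step 2),
# wait a fixed delay before running the tear-down command (step 3).
tear_down_after_delay() {
  delay_seconds="$1"   # assumed unit: seconds (10m would be 600)
  shift
  echo "required pods ready; waiting ${delay_seconds}s before tear down"
  sleep "${delay_seconds}"
  "$@"                 # tear-down command supplied by the caller
}
```

A call such as `tear_down_after_delay 600 echo "bootstrap control plane torn down"` keeps the temporary control plane alive for ten more minutes, which is the window the other kube-apiservers need to start.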

BZ https://bugzilla.redhat.com/show_bug.cgi?id=2035757
@mtulio mtulio marked this pull request as ready for review January 17, 2022 18:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 17, 2022
openshift-ci bot commented Jan 17, 2022

@mtulio: This pull request references Bugzilla bug 2035757, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianli-wei

mtulio commented Jan 17, 2022

@staebler I just changed this to use the bootstrap template. PTAL.

Tests:
for alibabacloud:

$ jq -r '.storage.files[] | select(.path=="/usr/local/bin/bootkube.sh") | .contents.source' .local/clusters/tmp-alibaba/bootstrap.ign  |sed 's/data\:text\/plain\;charset\=utf\-8;base64,//g' |base64 -d |grep -A 10 CLUSTER_BOOTSTRAP_IMAGE -m2
[...]
--
        "${CLUSTER_BOOTSTRAP_IMAGE}" \
        start \
          --tear-down-early=false \
          --tear-down-delay="10m" \
          --asset-dir=/assets \
          --required-pods="${REQUIRED_PODS}"

for others (aws):

$ jq -r '.storage.files[] | select(.path=="/usr/local/bin/bootkube.sh") | .contents.source' .local/clusters/tmp-aws/bootstrap.ign  |sed 's/data\:text\/plain\;charset\=utf\-8;base64,//g' |base64 -d |grep -A 10 CLUSTER_BOOTSTRAP_IMAGE -m2
CLUSTER_BOOTSTRAP_IMAGE=$(image_for cluster-bootstrap)
[...]
        "${CLUSTER_BOOTSTRAP_IMAGE}" \
        start \
          --tear-down-early=false \
          --tear-down-delay="0" \
          --asset-dir=/assets \
          --required-pods="${REQUIRED_PODS}"
}
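
The pipelines above strip the data URL prefix with sed before decoding. Below is a minimal sketch of the same decoding step as a reusable function. It assumes the source is a base64 data URL and that a `base64` supporting `-d` (as in GNU coreutils) is available; the name `decode_data_url` is hypothetical.

```shell
# Hypothetical helper: decode a base64 data URL such as the value of
# .contents.source in an Ignition file entry.
decode_data_url() {
  # Drop everything up to and including the first "base64," marker,
  # then decode the remaining payload.
  printf '%s' "${1#*base64,}" | base64 -d
}
```

It could be fed from the same jq selection used in the tests, for example `decode_data_url "$(jq -r '.storage.files[] | select(.path=="/usr/local/bin/bootkube.sh") | .contents.source' bootstrap.ign)"`, which avoids spelling out the escaped sed pattern.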

@staebler staebler left a comment

Looks good. Just a nit about code structure.

@mtulio mtulio requested a review from staebler January 17, 2022 18:30
mtulio commented Jan 17, 2022

As we have no CI for this component yet, here is the result from a cluster installed with this version:

DEBUG Time elapsed per stage:                      
DEBUG            cluster: 2m35s                    
DEBUG          bootstrap: 1m17s                    
DEBUG Bootstrap Complete: 21m47s                   
DEBUG                API: 2m31s                    
DEBUG  Cluster Operators: 20m34s                   
INFO Time elapsed: 46m22s

NAME                                   STATUS   ROLES    AGE   VERSION
mrbkas-f7vb6-master-0                  Ready    master   56m   v1.23.0+60f5a1c
mrbkas-f7vb6-master-1                  Ready    master   54m   v1.23.0+60f5a1c
mrbkas-f7vb6-master-2                  Ready    master   55m   v1.23.0+60f5a1c

kube-apiserver-mrbkas-f7vb6-master-0         5/5     Running     0          20m
kube-apiserver-mrbkas-f7vb6-master-1         5/5     Running     0          12m
kube-apiserver-mrbkas-f7vb6-master-2         5/5     Running     0          14m

cc @kwoodson regarding CI PR:
openshift/release#20841

@jstuever

/uncc

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

mtulio commented Jan 18, 2022

@staebler Not sure if the /override is valid for the PR lifecycle. Do we need to send /override ci/prow/e2e-aws-upgrade again? The last summary shows that it ran. :(

openshift-ci bot commented Jan 19, 2022

@mtulio: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-ibmcloud | 91c637d | link | false | /test e2e-ibmcloud |
| ci/prow/e2e-azure-upi | 91c637d | link | false | /test e2e-azure-upi |
| ci/prow/e2e-libvirt | 91c637d | link | false | /test e2e-libvirt |
| ci/prow/e2e-crc | 91c637d | link | false | /test e2e-crc |
| ci/prow/e2e-aws-workers-rhel7 | 91c637d | link | false | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws-workers-rhel8 | 91c637d | link | false | /test e2e-aws-workers-rhel8 |
| ci/prow/e2e-aws-single-node | 91c637d | link | false | /test e2e-aws-single-node |
| ci/prow/okd-e2e-aws-upgrade | 91c637d | link | false | /test okd-e2e-aws-upgrade |
| ci/prow/okd-e2e-aws | 91c637d | link | false | /test okd-e2e-aws |

Full PR test history. Your PR dashboard.

mtulio commented Jan 19, 2022

/override ci/prow/e2e-aws-upgrade
Already skipped and not processed by ci bot

/override ci/prow/e2e-aws-upgrade The install is successful. The changes in this PR only affect the installation process.

openshift-ci bot commented Jan 19, 2022

@mtulio: mtulio unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file.

mtulio commented Jan 19, 2022

@mtulio: mtulio unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file.

@staebler @patrickdillon please skip it again.

mtulio commented Jan 19, 2022

/retest-required

@openshift-merge-robot openshift-merge-robot merged commit 6e2d76b into openshift:master Jan 19, 2022
openshift-ci bot commented Jan 19, 2022

@mtulio: All pull requests linked via external trackers have merged:

Bugzilla bug 2035757 has been moved to the MODIFIED state.

@mtulio mtulio deleted the wait-for-kas branch January 19, 2022 16:38
staebler added a commit to staebler/installer that referenced this pull request Jan 20, 2022
…o wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.
clnperez added a commit to openshift-powervs/installer that referenced this pull request Feb 1, 2022
* azure: Check HyperVGenerations for instance type

If an instance type that does not support HyperVGeneration version 1
then terraform returns an error mentioning there's support only for
V1. Adding a check during install config to check for the versions
supported by the instance type provided.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset.

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in regions
that do not support the cloud_essd disk category

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlapping problem of the VSwitch subnet of the Nat gateway with
 the master node VSwitch subnets

Signed-off-by: sunhui <[email protected]>

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue that ironic agent image cannot be downloaded due to network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, no longer create Nat gateways and
EIPs

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.

* update vendor for latest Aliababa API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0  x86_64=410.84.202201251210-0
Verification Steps:

Install a new 4.10 cluster
oc debug node/<node name> -- chroot /host rpm-ostree status
Verify that the deployment version matches the version from this PR
that matches the architecture you are testing on. (i.e. x86_64
should have version 410.84.202201251210-0)

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has shown to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroy not exist security group

The destroyer should not error when it attempts to delete a security group
 that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update sdk and terraform provider version, and add some endpoints of ECS
 service to fix endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destory exist private zone

Should not destroy pre-configured alicloud DNS private zone

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
clnperez added a commit to openshift-powervs/installer that referenced this pull request Feb 3, 2022
* azure: Check HyperVGenerations for instance type

If an instance type that does not support HyperVGeneration version 1
then terraform returns an error mentioning there's support only for
V1. Adding a check during install config to check for the versions
supported by the instance type provided.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset.

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in regions
that do not support the cloud_essd disk category

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlapping problem of the VSwitch subnet of the Nat gateway with
 the master node VSwitch subnets

Signed-off-by: sunhui <[email protected]>

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue that ironic agent image cannot be downloaded due to network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* Azure Stack: Add UPI Instructions for internal CA

Many Azure Stack environments use internal CAs. In these cases
special steps are needed for a UPI install.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, no longer create Nat gateways and
EIPs

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.

* update vendor for latest Aliababa API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0  x86_64=410.84.202201251210-0
Verification Steps:

Install a new 4.10 cluster
oc debug node/<node name> -- chroot /host rpm-ostree status
Verify that the deployment version matches the version from this PR
that matches the architecture you are testing on. (i.e. x86_64
should have version 410.84.202201251210-0)

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has shown to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroy not exist security group

The destroyer should not error when it attempts to delete a security group
 that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update sdk and terraform provider version, and add some endpoints of ECS
 service to fix endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destory exist private zone

Should not destroy pre-configured alicloud DNS private zone

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

* aws: Remove non-public AWS regions from list of regions

When creating the install-config, the installer displays regions
of all partitions of AWS. Certain regions also need extra information
for the validation to work and should not be taken as input since we
only ask for the bare minimum amount of information to create the
install config.

The best approach here would be to only display all the public regions
of AWS and allow for other regions after the install-config is created
to allow for the user to add the extra information.

* openstack: Don't shortcut cloud scraping if quota is unavailable

This results in an incorrect failure to validate network capabilities
because network extensions weren't loaded.

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
Co-authored-by: Matthew Booth <[email protected]>
clnperez added a commit to clnperez/installer that referenced this pull request Feb 25, 2022
* azure: Check HyperVGenerations for instance type

If an instance type that does not support HyperVGeneration version 1
then terraform returns an error mentioning there's support only for
V1. Adding a check during install config to check for the versions
supported by the instance type provided.

* Ensure removal of placement-groups during cluster destroy on AWS

* Adjust the startup order of httpd container

Run the httpd container after the coreos-downloader completes to ensure that the kernel parameters can be added correctly.

Signed-off-by: Zhou Hao <[email protected]>

* Add IP outputs for IBM terraform instances

Add the IP addresses for IBM bootstrap and master nodes to allow
collecting of logs from those nodes.

* Revert "Bug 2035757: cluster-bootstrap/alibaba: set tear-down-delay to wait kube-apiserver rolls out on AlibabaCloud (openshift#5535)"

This reverts commit 6e2d76b.

With openshift/machine-config-operator#2919, it is no longer necessary to delay the teardown
of the bootstrap control plane. The cluster will no longer get into an unusable state when there is only a single
kube-apiserver pod running.

* baremetal: networkConfig field now accepts yaml instead of string value

The current patch allows the user to specify the content of the install-config networkConfig field directly as a yaml object. Content validation (for a generic yaml) is now carried out by the install config asset.

* remove unused kube terraform provider

* vendor: update openshift/api to include some alibaba infra changes

* Update openshift/api to 6e0b1eb97188.
* Update kube modules to v0.23.0.
* Update controller-runtime to v0.11.0.
* Remove unused terraform-provider-kubernetes.

* hack: use go 1.17 for verifying codegen

The hack/verify-codegen.sh script was using an image that included
go 1.16. However, the updated k8s.io/json module calls the
`(reflect.StructField) IsExported` function, which is new in go 1.17.
Consequently, the script needs to be updated to use an image that
includes go 1.17 rather than 1.16.
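
For illustration only, a small program that calls the method in question. The `sample` type and `exportedFieldNames` helper are hypothetical; with a go 1.16 toolchain this file would fail to compile because `IsExported` is not defined there.

```go
package main

import (
	"fmt"
	"reflect"
)

// sample has one exported and one unexported field.
type sample struct {
	Name   string
	secret int
}

// exportedFieldNames returns the names of the exported fields of a struct
// value. It relies on reflect.StructField.IsExported, added in go 1.17.
func exportedFieldNames(v interface{}) []string {
	t := reflect.TypeOf(v)
	var names []string
	for i := 0; i < t.NumField(); i++ {
		if f := t.Field(i); f.IsExported() {
			names = append(names, f.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(exportedFieldNames(sample{}))
}
```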

* Bump Fedora CoreOS to 35.20220116.2.0

* Alibaba: fix system disk category of bootstrap

Remove the hard-coded value so that users can specify cloud_efficiency in
regions that do not support the cloud_essd disk category.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix creating public record being skipped

If the user chooses a base domain for which there is no zone, creating
the A record in the zone is simply skipped rather than raising an error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix VSwitch subnets overlap

Fix the overlap between the VSwitch subnet of the NAT gateway and the
master node VSwitch subnets.

Signed-off-by: sunhui <[email protected]>
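
A sketch of how two subnets can be tested for overlap, assuming both are given in CIDR notation. The `cidrsOverlap` helper and the example ranges are hypothetical and are not taken from the installer.

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDR blocks share any address.
// CIDR blocks are either nested or disjoint, so they overlap exactly
// when one of them contains the other's network address.
func cidrsOverlap(a, b string) (bool, error) {
	_, netA, err := net.ParseCIDR(a)
	if err != nil {
		return false, err
	}
	_, netB, err := net.ParseCIDR(b)
	if err != nil {
		return false, err
	}
	return netA.Contains(netB.IP) || netB.Contains(netA.IP), nil
}

func main() {
	// Hypothetical subnets for illustration.
	fmt.Println(cidrsOverlap("10.0.0.0/20", "10.0.8.0/24"))
	fmt.Println(cidrsOverlap("10.0.0.0/24", "10.0.1.0/24"))
}
```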

* remove unsupported options

* Add proxy for ironic-agent.service

Avoid the issue where the ironic agent image cannot be downloaded because of a network proxy.

Signed-off-by: Zhou Hao <[email protected]>

* Revert "remove unsupported options"

This reverts commit 2684f8d.

* remove unsupported options for existing resources

* Alibaba: fix resource creation for existing network

When users use an existing network, NAT gateways and EIPs are no longer
created.

Signed-off-by: sunhui <[email protected]>

* gen'd install configs yaml

* update alibaba for provider spec api changes

This change updates the alibaba provider spec usage related to the
vswitch, security groups, and resource group. The API for the provider
spec is changing to use a discriminated union to capture the various
methods for finding resources (by id, name, or tags).

It also updates several machine api references to note the bifurcated
nature of the api version between v1beta1 and v1.
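
As a rough illustration of the discriminated-union pattern mentioned above: a type field selects which one of the lookup fields is meaningful. All type, constant, and field names below are hypothetical and are not the openshift/api definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// referenceType is the discriminator: it names the lookup method in use.
type referenceType string

const (
	referenceByID   referenceType = "ID"
	referenceByName referenceType = "Name"
	referenceByTags referenceType = "Tags"
)

// resourceReference is a hypothetical union of the three lookup methods.
type resourceReference struct {
	Type referenceType
	ID   string
	Name string
	Tags map[string]string
}

// validate checks that exactly one member is set and that it matches Type.
func (r resourceReference) validate() error {
	set := 0
	if r.ID != "" {
		set++
	}
	if r.Name != "" {
		set++
	}
	if len(r.Tags) > 0 {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of ID, Name, or Tags must be set")
	}
	switch r.Type {
	case referenceByID:
		if r.ID == "" {
			return errors.New("type ID requires the ID field")
		}
	case referenceByName:
		if r.Name == "" {
			return errors.New("type Name requires the Name field")
		}
	case referenceByTags:
		if len(r.Tags) == 0 {
			return errors.New("type Tags requires the Tags field")
		}
	default:
		return fmt.Errorf("unknown reference type %q", r.Type)
	}
	return nil
}

func main() {
	// Hypothetical identifiers for illustration.
	fmt.Println(resourceReference{Type: referenceByID, ID: "vsw-example"}.validate())
	fmt.Println(resourceReference{Type: referenceByName, ID: "vsw-example"}.validate())
}
```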

* update vendor for latest Alibaba API changes

This change is to update the vendor references to support the Alibaba
resource reference updates to the API.

* remove validation related to unsupported options

* update validation for unsupported options

* openstack: Fix invalid-https-certificate detection

Fix the reference to an unbound variable; avoid incrementing the invalid
certificate counter in a subshell.

* Alibaba: fix support region list

Remove the unsupported regions Nanjing and Dubai.

Signed-off-by: sunhui <[email protected]>

* Bug 2043297: bump RHCOS 4.10 bootimage metadata

These changes will update the RHCOS 4.10 bootimage metadata in the installer.
This change includes fixes for the following BZs:

Bug 2008521 - gcp-hostname service should correct invalid search entries in resolv.conf
Bug 2043296 - Ignition fails when reusing existing statically-keyed LUKS volume
Bug 2043721 - Installer bootstrap hosts using outdated kubelet containing bugs

This change will also introduce artifacts for Aliyun, AWS GovCloud regions, and Nutanix.

Changes generated with:

$ cosa shell
[coreos-assembler]$ plume cosa2stream --target data/data/coreos/rhcos.json --distro rhcos --no-signatures \
--url https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases aarch64=410.84.202201251203-0 \
ppc64le=410.84.202201251004-0 s390x=410.84.202201251002-0 x86_64=410.84.202201251210-0

Verification steps:

1. Install a new 4.10 cluster.
2. Run `oc debug node/<node name> -- chroot /host rpm-ostree status`.
3. Verify that the deployment version matches the version from this PR
   for the architecture you are testing on (e.g. x86_64 should have
   version 410.84.202201251210-0).

* Bug 2045916: IBMCloud: Stop defaulting to dedicated storage profile

Move off the dedicated storage machine profile, as it has been shown to
be less reliable for provisioning on IBM Cloud.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2045916

* Alibaba: fix destroying a nonexistent security group

The destroyer should not error when it attempts to delete a security group
 that does not exist.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix endpoint error in some regions

Update sdk and terraform provider versions, and add some endpoints of the
ECS service to fix the endpoint error.

Signed-off-by: sunhui <[email protected]>

* Alibaba: update vendor

* Revert "update validation for unsupported options"

This reverts commit e5d628d.

* Revert "remove validation related to unsupported options"

This reverts commit 20f8626.

* Alibaba: support internal publish strategy

Support internal publish strategy for platform Alibaba Cloud

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix installer index panic

Add NAT gateway validation to check whether the region supports NAT gateways.

Signed-off-by: sunhui <[email protected]>

* remove validation for unsupported options

* Alibaba: fix destroying an existing private zone

The destroyer should not destroy a pre-configured alicloud DNS private zone.

Signed-off-by: sunhui <[email protected]>

* Alibaba: fix validation of resource group ID

Fix resource group ID validation errors caused by pagination issues

Signed-off-by: sunhui <[email protected]>
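
The commit does not show the fix itself. A typical shape for this kind of bug is a lookup that inspects only the first page of results; the sketch below walks every page instead. The `page` and `listFunc` types, `fakeLister`, and the group IDs are hypothetical stand-ins for the real list call, and the page size is assumed to be positive.

```go
package main

import "fmt"

// page is a hypothetical paginated response: one slice of IDs plus the
// total number of items across all pages.
type page struct {
	IDs   []string
	Total int
}

// listFunc stands in for the paginated list API (page numbers start at 1).
type listFunc func(pageNumber, pageSize int) (page, error)

// resourceGroupExists scans every page until the ID is found or all
// items have been seen. pageSize is assumed to be greater than zero.
func resourceGroupExists(list listFunc, id string, pageSize int) (bool, error) {
	seen := 0
	for pageNumber := 1; ; pageNumber++ {
		p, err := list(pageNumber, pageSize)
		if err != nil {
			return false, err
		}
		for _, got := range p.IDs {
			if got == id {
				return true, nil
			}
		}
		seen += len(p.IDs)
		if len(p.IDs) == 0 || seen >= p.Total {
			return false, nil
		}
	}
}

// fakeLister serves slices of a fixed list, for demonstration only.
func fakeLister(all []string) listFunc {
	return func(pageNumber, pageSize int) (page, error) {
		start := (pageNumber - 1) * pageSize
		if start > len(all) {
			start = len(all)
		}
		end := start + pageSize
		if end > len(all) {
			end = len(all)
		}
		return page{IDs: all[start:end], Total: len(all)}, nil
	}
}

func main() {
	list := fakeLister([]string{"rg-a", "rg-b", "rg-c"})
	// With a page size of 2, "rg-c" is only visible on the second page.
	fmt.Println(resourceGroupExists(list, "rg-c", 2))
}
```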

* update custom image ostype

* Bug 2047258: Read GovCloud from RHCOS stream

AMIs for GovCloud regions have been added to the RHCOS stream. Remove
validation requiring users to provide an AMI.

* Remove Caleb Boylan from core installer reviewers

Co-authored-by: rna-afk <[email protected]>
Co-authored-by: Joel Speed <[email protected]>
Co-authored-by: Zhou Hao <[email protected]>
Co-authored-by: Christopher J Schaefer <[email protected]>
Co-authored-by: staebler <[email protected]>
Co-authored-by: Andrea Fasano <[email protected]>
Co-authored-by: OpenShift Merge Robot <[email protected]>
Co-authored-by: Vadim Rutkovsky <[email protected]>
Co-authored-by: sunhui <[email protected]>
Co-authored-by: Jeff Nowicki <[email protected]>
Co-authored-by: Michael McCune <[email protected]>
Co-authored-by: Pierre Prinetti <[email protected]>
Co-authored-by: Huijing Hei <[email protected]>
Co-authored-by: patrickdillon <[email protected]>
Co-authored-by: Kiran Thyagaraja <[email protected]>
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

5 participants