
Conversation

@kwoodson commented Mar 4, 2022

The AlibabaCloud provider has an issue with the recent Terraform provider updates: when the bootstrap is torn down, the SLB server backends are removed. This appears to be a bug in the Terraform provider.

This change removes the teardown of the SLB for the bootstrap and adds a stage that attaches all of the control plane nodes, including the bootstrap, as backends. When the bootstrap is torn down, the instance is automatically removed from the SLB (Server Load Balancer).
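
A minimal sketch of what that stage's Terraform could look like. The alicloud_slb_backend_server resource and the slb_ids variable appear in the review diff below; the bootstrap_instance_id and master_instance_ids variable names are illustrative, not necessarily what this PR uses:

```hcl
# Sketch only: attach the bootstrap and all control plane instances as
# backend servers on every SLB. Variable names are illustrative.
resource "alicloud_slb_backend_server" "controlplane" {
  count            = length(var.slb_ids)
  load_balancer_id = var.slb_ids[count.index]

  # The bootstrap is attached alongside the masters; releasing the
  # bootstrap ECS instance later drops its backend entry automatically,
  # so no explicit SLB teardown is needed.
  dynamic "backend_servers" {
    for_each = concat([var.bootstrap_instance_id], var.master_instance_ids)
    content {
      server_id = backend_servers.value
      weight    = 90
    }
  }
}
```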

@kwoodson (Author) commented Mar 4, 2022

DEBUG Time elapsed per stage:                      
DEBUG                 cluster: 2m9s                
DEBUG               bootstrap: 1m2s                
DEBUG slb_attach_controlplane: 7s                  
DEBUG      Bootstrap Complete: 14m29s              
DEBUG                     API: 4m51s               
DEBUG       Bootstrap Destroy: 55s                 
DEBUG       Cluster Operators: 12m18s              
INFO Time elapsed: 31m6s          

@kwoodson (Author) commented Mar 4, 2022

/test e2e-alibaba

@kwoodson (Author) commented Mar 4, 2022

/test e2e-alibaba

@kwoodson (Author) commented Mar 4, 2022

@jsafrane The e2e-alibaba job runs and is able to install with this PR.

@staebler (Contributor) left a comment

The bootstrap VM is not being removed from the load balancers' backends.

Contributor

Maybe a virtual server group (alicloud_slb_server_group) could be used; of course, this is not required.

@bd233 (Contributor) commented Mar 10, 2022

This seems to be an issue in the terraform provider.

@kwoodson Have you located the cause of the issue? Can you briefly describe it? Thanks. I checked resourceAliyunSlbBackendServersDelete but found no problem.

@kwoodson (Author)

@bd233 The issue is that when the bootstrap teardown step happens, all of the servers are removed from the SLB. This causes the cluster installation to fail, as the oc commands no longer succeed. I'm able to manually add all of the control plane nodes back to the SLB, and the cluster is then functional, but that is not something we can do during the install.

By adding this extra step to the installation, all nodes are successfully added, and the bootstrap is removed once the instance is released.

Comment on lines 22 to 36
Contributor

This may have been overlooked in the message from my first review, but no code runs for this stage during bootstrap destroy to remove the bootstrap VM from the backends of the SLB.

@kwoodson (Author) commented Mar 14, 2022

@staebler I'm sorry for the late response. Just returning from PTO.

When I tested this code previously, releasing the bootstrap automatically removed the instance from the SLB's list of backends. I'll test this morning and re-verify that behavior.

Contributor

Is this handled implicitly by the bootstrap VM being deleted?

Author

Yes.

Contributor

Do you think there is any value in explicitly removing the bootstrap VM from the load balancer? If we are going to rely on implicit behavior, then we at least need some good comments explaining why the bootstrap VM does not need to be explicitly removed from the load balancer. A year from now, I am going to look at this code, not remember why we are not removing it from the load balancer, and think it was an oversight.

Contributor

So the effective difference is that the master backends are created after the bootstrap backends, right? If that works, I think that is a cleaner approach. I would like to see some comments on the stage explaining why the master backends need to be added after the bootstrap backends.

I also think that this will be a bit easier to revert once the bug is fixed in the terraform provider.
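
One hypothetical way to express that ordering in Terraform is separate backend resources with an explicit depends_on. A sketch only, with illustrative variable names; this mirrors the ordering idea above, not necessarily the PR's final code:

```hcl
# Sketch: masters are attached only after the bootstrap backends exist.
resource "alicloud_slb_backend_server" "bootstrap" {
  count            = length(var.slb_ids)
  load_balancer_id = var.slb_ids[count.index]

  backend_servers {
    server_id = var.bootstrap_instance_id # illustrative name
    weight    = 90
  }
}

resource "alicloud_slb_backend_server" "masters" {
  count            = length(var.slb_ids)
  load_balancer_id = var.slb_ids[count.index]
  depends_on       = [alicloud_slb_backend_server.bootstrap]

  dynamic "backend_servers" {
    for_each = var.master_instance_ids # illustrative name
    content {
      server_id = backend_servers.value
      weight    = 90
    }
  }
}
```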

Author

Running into the same issue now. Wondering if there is a race condition when running the next stage. I've had one successful cluster install and one failure. The bootstrap destroy ran, but the slb_attachment stage never ran.

DEBUG Plan: 0 to add, 0 to change, 11 to destroy. 
...
DEBUG Destroy complete! Resources: 11 destroyed.   

<-- load balancer backends were removed here -->

DEBUG Loading Install Config...                    
DEBUG   Loading SSH Key...                         
DEBUG   Loading Base Domain...                     
DEBUG     Loading Platform...                      
DEBUG   Loading Cluster Name...                    
DEBUG     Loading Base Domain...                   
DEBUG     Loading Platform...                      
DEBUG   Loading Networking...                      
DEBUG     Loading Platform...                      
DEBUG   Loading Pull Secret...                     
DEBUG   Loading Platform...                        
DEBUG Using Install Config loaded from state file  
INFO Waiting up to 40m0s (until 3:20PM) for the cluster at https://api.test.alicloud-dev.devcluster.openshift.com:6443 to initialize... 
W0316 14:40:59.738715 1143968 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.test.alicloud-dev.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 47.253.132.94:6443: connect: connection refused

@staebler Any ideas why the next stage did not proceed?

Contributor

Do you have a full log that you can show me? I suspect that you are not running the right binary if the "slb_attach_controlplane" stage did not run.

Author

After some troubleshooting in cluster.go and watching the stages, I can see that the stages execute sequentially: the cluster, the bootstrap, and then the attachment run successfully. Then the teardown occurs for the bootstrap, which removes the backend servers from the SLB.

Is there a way to defer the attachment until the teardown of the bootstrap is complete?

@staebler (Contributor) commented Mar 16, 2022

You don't want to defer the attachment. You want the masters attached while bootstrapping is proceeding.

@kwoodson (Author)

/test e2e-alibaba

@kwoodson (Author)

@staebler I think this is in line with what you were suggesting previously. It mirrors GCP's functionality and is variable driven. This was successful for cluster creation. PTAL

DEBUG Time elapsed per stage:                      
DEBUG                 cluster: 2m12s               
DEBUG               bootstrap: 1m13s               
DEBUG slb_attach_controlplane: 6s                  
DEBUG      Bootstrap Complete: 12m24s              
DEBUG                     API: 6m31s               
DEBUG       Bootstrap Destroy: 55s                 
DEBUG       Cluster Operators: 13m6s               
INFO Time elapsed: 30m2s   
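
For reference, a minimal sketch of the variable-driven shape, assuming a boolean like the ali_bootstrap_lb flag referenced in the review comments below (the instance-ID variable names are illustrative):

```hcl
# Sketch: gate the bootstrap's SLB membership behind a flag, mirroring GCP.
variable "ali_bootstrap_lb" {
  type        = bool
  default     = true
  description = "Attach the bootstrap instance to the SLBs."
}

locals {
  # On bootstrap destroy the stage re-applies with ali_bootstrap_lb = false,
  # so only the bootstrap entry drops out of the backend server list.
  backend_ids = var.ali_bootstrap_lb ? concat([var.bootstrap_instance_id], var.master_instance_ids) : var.master_instance_ids
}
```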

@staebler (Contributor) left a comment

I think this was in line with what you were suggesting previously.

Indeed!

Note that this will cause a merge conflict with #5715.

Contributor

Suggested change
load_balancer_id = "${element(var.slb_ids, count.index)}"
load_balancer_id = var.slb_ids[count.index]

Contributor

You don't need the quotes and dollar sign noise any more.

Suggested change
count = "${length(var.slb_ids)}"
count = length(var.slb_ids)

Contributor

Suggested change
server_id = "${backend_servers.value}"
server_id = backend_servers.value

Contributor

Put this in data/data/alibabacloud/variables-alibabacloud.tf instead.

Contributor

Add a newline.

Contributor

Remove this superfluous change.

Comment on lines 4 to 9
Contributor

Suggested change
"github.com/hashicorp/terraform-exec/tfexec"
"github.com/openshift/installer/pkg/terraform"
"github.com/openshift/installer/pkg/terraform/providers"
"github.com/openshift/installer/pkg/terraform/stages"
alitypes "github.com/openshift/installer/pkg/types/alibabacloud"
"github.com/pkg/errors"
"github.com/hashicorp/terraform-exec/tfexec"
"github.com/pkg/errors"
"github.com/openshift/installer/pkg/terraform"
"github.com/openshift/installer/pkg/terraform/providers"
"github.com/openshift/installer/pkg/terraform/stages"
alitypes "github.com/openshift/installer/pkg/types/alibabacloud"

Comment on lines 27 to 28
Contributor

Suggested change
// all of the controlplane backends to be removed from the SLB. This stage reattaches only the
// master controlplane nodes by calling apply with the ali_boostrap_lb set to false.
// all of the controlplane backends to be removed from the SLB. This stage attaches the
// master VMs and the bootstrap VMs as backend servers to the SLBs on create. Later,
// on bootstrap destroy, this stage removes only the bootstrap VM from the backend servers.

@staebler (Contributor) left a comment

One small nit. Otherwise it looks good.

Contributor

This description is not correct.

Suggested change
description = "The load balancer IDs of the bootstrap ECS."
description = "The IDs of the load balancers to which to attach the bootstrap and control plane VMs."

@kwoodson (Author)

Latest changes are successful. I'll update one more time to address the nit.

DEBUG Time elapsed per stage:                      
DEBUG                 cluster: 2m18s               
DEBUG               bootstrap: 1m12s               
DEBUG slb_attach_controlplane: 8s                  
DEBUG      Bootstrap Complete: 13m11s              
DEBUG                     API: 6m25s               
DEBUG       Bootstrap Destroy: 57s                 
DEBUG       Cluster Operators: 10m28s              
INFO Time elapsed: 28m21s                         

@staebler (Contributor) left a comment

/lgtm
/approve

openshift-ci bot added the lgtm label on Mar 18, 2022
openshift-ci bot commented Mar 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label on Mar 18, 2022
@openshift-bot (Contributor)

/retest-required

Please review the full test history for this PR and help us cut down flakes.

8 similar comments

@kwoodson (Author)

I'm not sure what okd-e2e-aws-upgrade has to do with this change. It might be good to be able to test only the areas touched by this Alibaba work. Let's see what happens.

@openshift-bot (Contributor)

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

openshift-ci bot commented Mar 19, 2022

@kwoodson: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                    Commit                                    Details  Required  Rerun command
ci/prow/e2e-alibaba          5d9ab1f699454b379c9cfdc43c93949842ea4db3  link     false     /test e2e-alibaba
ci/prow/e2e-azure-upi        5d9ab1f699454b379c9cfdc43c93949842ea4db3  link     false     /test e2e-azure-upi
ci/prow/e2e-ibmcloud         0e9bc90                                   link     false     /test e2e-ibmcloud
ci/prow/e2e-crc              0e9bc90                                   link     false     /test e2e-crc
ci/prow/okd-e2e-aws-upgrade  0e9bc90                                   link     false     /test okd-e2e-aws-upgrade

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
