
Conversation

@wking
Member

@wking wking commented Jul 31, 2018

Currently, the installer creates an ASG for both masters and workers. While this is okay for workers, this causes quite a few headaches for masters. For example, etcd will be run on the master nodes, but this will be unstable if Amazon is free to delete nodes and recreate them without etcd knowing about it. Additionally, this makes bootstrap tricky because every master needs to be identical. In order to break the dependency loop between the MachineConfigController (TNC) and the master nodes, a CNAME is currently used to temporarily redirect Ignition's requests to an S3 bucket with a pre-computed bootstrap config.

This commit drops the use of ASG for both masters and workers and instead creates individual machines. Once we adopt the use of MachineSets and MachineControllers, we'll regain the functionality we lost by dropping ASGs.

While renaming the previous *-asg modules, I've also renamed the main Terraform files to main.tf to comply with the standard module structure.

The new main.tf follows the example set by modules/aws/etcd/nodes.tf, which is why the instance resource declarations have moved to the end of the file. I've also adjusted the master policy to more closely match the more-restrictive etcd policy, because I don't think the masters will need to push S3 resources, etc.

I've also dropped some redundant names (e.g. master_profile -> master), because the profile-ness is already covered in the resource name (e.g. aws_iam_instance_profile.master).

The worker load balancer is a bit more complicated now, since we have to loop over both the provided load balancers and the created worker instances to set up associations.
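
For illustration, the per-instance worker attachment described above might look roughly like the following Terraform 0.11-style sketch (the variable and resource names here are hypothetical, not necessarily the ones in this PR):

resource "aws_elb_attachment" "workers" {
  # One attachment per (worker instance, load balancer) pair.
  count = "${var.instance_count * length(var.load_balancers)}"

  # count.index % var.instance_count selects the worker instance;
  # count.index / var.instance_count (integer division) selects the load balancer.
  instance = "${element(aws_instance.worker.*.id, count.index % var.instance_count)}"
  elb      = "${element(var.load_balancers, count.index / var.instance_count)}"
}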

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 31, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from c2dd6f7 to cff7e05 Compare July 31, 2018 21:31
Member Author

Ah, I still need to fix this...

@wking wking force-pushed the drop-auto-scaling-groups branch from cff7e05 to c9351ed Compare July 31, 2018 22:02
@crawford
Contributor

retest this please

@wking
Member Author

wking commented Jul 31, 2018

I'm not sure if the smoke tests will pick it up, but there's some sort of issue with (at least) the console at https://${CLUSTER_NAME}-api.${BASE_DOMAIN}:6443/console/ returning end-of-file before the TLS connection is established. I'm still trying to figure out what's going on there.

Contributor

One master <=> one subnet? How do we make sure we have enough subnet IDs?

Member Author

One master <=> one subnet? How do we make sure we have enough subnet IDs?

Hmm. I was trying to replace this:

vpc_zone_identifier = ["${var.subnet_ids}"]

And the line I have here is how etcd works now. Maybe it works because element has some built-in modulus protection?
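
For what it's worth, element() does have that protection: Terraform documents that the index is taken modulo the list length, so an etcd-style line like the one below (a sketch, names matching the snippet above) simply cycles through the subnets when the instance count exceeds the number of subnet IDs:

# element() wraps the index modulo the list length, so this never reads
# past the end of var.subnet_ids; extra instances reuse earlier subnets.
subnet_id = "${element(var.subnet_ids, count.index)}"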

Contributor

Same as in master.tf

Contributor

This is previous work of tectonic-installer. No need to edit this.

Member Author

This is previous work of tectonic-installer. No need to edit this.

Ah, thanks. Dropped with c9351ed -> 46de8e5.

@wking wking force-pushed the drop-auto-scaling-groups branch from c9351ed to 46de8e5 Compare July 31, 2018 23:28
@wking
Member Author

wking commented Aug 1, 2018

Timeouts in both Jenkins (during teardown):

00:38:59 module.vpc.aws_nat_gateway.nat_gw.1: Still destroying... (ID: nat-071d666045c18d567, 8m20s elapsed)
00:39:07 Cancelling nested steps due to timeout

and e2e-aws (during setup):

Waiting for API at https://ci-op-vwnb6vqd-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted
2018/08/01 01:30:54 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited

/retest

@wking
Member Author

wking commented Aug 1, 2018

This time e2e-aws errored with:

1 error(s) occurred:

* module.masters.aws_instance.master: 1 error(s) occurred:

* aws_instance.master: Error launching source instance: InvalidParameterValue: Value (ci-op-p1k7nnm1-68485-master-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
	status code: 400, request id: 4141cced-f796-4c3d-8ea8-c59d911d5911

But ci-op-p1k7nnm1-68485-master-profile looks like a reasonable profile name to me. I'll push a new commit with a small commit-message typo fix to trigger new tests.

@wking wking force-pushed the drop-auto-scaling-groups branch 2 times, most recently from 4f9d6b5 to e0dde38 Compare August 1, 2018 04:06
@wking
Member Author

wking commented Aug 1, 2018

The Jenkins error was:

04:13:57 module.vpc.aws_vpc.new_vpc: Destroying... (ID: vpc-7685de10)
04:13:57 module.vpc.aws_vpc.new_vpc: Destruction complete after 1s
04:13:57
04:13:57 Error: Error applying plan:
04:13:57
04:13:57 1 error(s) occurred:
04:13:57
04:13:57 * module.vpc.output.aws_lbs: Resource 'aws_elb.console' does not have attribute 'id' for variable 'aws_elb.console.id'

But aws_elb resources should definitely have an id attribute, so I'm not sure what that's about. Maybe we're missing a dependency in the Terraform DAG that's causing racy teardown?

The e2e-aws error was:

3 error(s) occurred:

* module.vpc.aws_route.to_nat_gw[1]: 1 error(s) occurred:

* aws_route.to_nat_gw.1: Error finding route after creating it: Unable to find matching route for Route Table (rtb-0896234416fbe24be) and destination CIDR block (0.0.0.0/0).
* module.vpc.aws_route.to_nat_gw[4]: 1 error(s) occurred:

* aws_route.to_nat_gw.4: Error finding route after creating it: Unable to find matching route for Route Table (rtb-0421e441c551bb583) and destination CIDR block (0.0.0.0/0).
* module.vpc.aws_route.to_nat_gw[5]: 1 error(s) occurred:

* aws_route.to_nat_gw.5: Error finding route after creating it: Unable to find matching route for Route Table (rtb-012e1b3ba024d050f) and destination CIDR block (0.0.0.0/0).

We've seen the "Error finding route..." issue a few times before; there may be a race or other bug there causing this instability.

/retest
retest this please

@wking
Member Author

wking commented Aug 1, 2018

This time I'm looking at the full Jenkins logs, and the smoke tests are failing:

...
Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
 Running smoke test...
...
--- FAIL: Test (2700.00s)
    --- FAIL: Test/Common (900.00s)
        --- FAIL: Test/Common/APIAvailable (900.00s)
        	smoke_test.go:112: failed with error: failed to connect to API server: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/version: EOF
        	smoke_test.go:113: retrying in 10s
...
        	smoke_test.go:112: failed with error: failed to connect to API server: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/version: EOF
        	smoke_test.go:113: retrying in 10s
        	common_test.go:18: Failed to connect to API server in 15m0s.
    --- FAIL: Test/Cluster (1800.00s)
        --- FAIL: Test/Cluster/AllNodesRunning (600.00s)
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/nodes: EOF
        	smoke_test.go:113: retrying in 10s
...
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/nodes: EOF
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:148: Failed to find 7 ready nodes in 10m0s.
        --- FAIL: Test/Cluster/AllResourcesCreated (600.00s)
        	cluster_test.go:215: looking for resources defined by the provided manifests...
        	cluster_test.go:268: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api: EOF
        	smoke_test.go:112: failed with error: all defined resources were not present
        	smoke_test.go:113: retrying in 30s
...
       	cluster_test.go:215: looking for resources defined by the provided manifests...
        	cluster_test.go:268: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api: EOF
        	smoke_test.go:112: failed with error: all defined resources were not present
        	smoke_test.go:113: retrying in 30s
        	cluster_test.go:292: timed out waiting for all manifests to be created after 10m0s
        --- FAIL: Test/Cluster/AllPodsRunning (600.00s)
        	smoke_test.go:112: failed with error: could not list pods: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/pods: EOF
        	smoke_test.go:113: retrying in 3s
...
        	smoke_test.go:112: failed with error: could not list pods: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/pods: EOF
        	smoke_test.go:113: retrying in 3s
        	cluster_test.go:71: Timed out waiting for pods to be ready.
FAIL
 Exiting... Destroying Tectonic and cleaning SSH keys...
...
module.etcd.aws_instance.etcd_node[0]: Destroying... (ID: i-0f6ca290b9571a64e)
module.etcd.aws_instance.etcd_node[2]: Destroying... (ID: i-0a6cfec594469d45b)
module.etcd.aws_instance.etcd_node[1]: Destroying... (ID: i-0d124acd0914412f7)
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 10s elapsed)
module.etcd.aws_instance.etcd_node.0: Still destroying... (ID: i-0f6ca290b9571a64e, 10s elapsed)
module.etcd.aws_instance.etcd_node.2: Still destroying... (ID: i-0a6cfec594469d45b, 10s elapsed)
...
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 8m20s elapsed)
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 8m30s elapsed)
Sending interrupt signal to process
Terminated
script returned exit code 143

So it looks like there were some smoke-test timeouts, and those timeouts caused the entire teardown process to time out.

@wking wking changed the title from "modules/aws: Drop auto-scaling groups" to "wip: modules/aws: Drop auto-scaling groups" Aug 3, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 3, 2018
@yifan-gu
Contributor

yifan-gu commented Aug 13, 2018

Looks like the etcd nodes can't come up because they can't get their Ignition configs from the TNC.
Debugging.

(updated)
The ELB for the TNC reports 0 healthy instances. After I added the master to the ELB manually, the etcd nodes were able to boot.

I'm going to try adding the instance IDs to the ELB to make it work, but it seems @crawford is moving the masters onto the etcd nodes, so we probably won't need an ELB for this anymore?

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from 4e9b293 to c42f215 Compare August 14, 2018 20:54
Member Author

I expect this would work either way, and var.instance_count is shorter and (at least for me) easier to read than length(var.load_balancers). Is there a reason to switch to using the latter, @yifan-gu?

Contributor

@yifan-gu yifan-gu Aug 14, 2018

This won't work. Say, for example:

length(var.load_balancers) == 4
instance_count = 2
count = 4 * 2 = 8, so count.index ranges over [0, 7]

Then count.index % var.instance_count only ranges over [0, 1], which won't cover all of the ELBs in the list, and count.index / var.instance_count ranges over [0, 3], which runs past the end of the instance list.

Contributor

But we can do instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.
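
To make that arithmetic concrete, here is the mapping for the numbers above (instance_count = 2, four load balancers, count = 8), written out as Terraform-style comments; it is only a worked example of the proposal, not code from the PR:

# count.index | instance = count.index % 2 | elb = count.index / 2
#      0      |             0              |          0
#      1      |             1              |          0
#      2      |             0              |          1
#      3      |             1              |          1
#      4      |             0              |          2
#      5      |             1              |          2
#      6      |             0              |          3
#      7      |             1              |          3
# Every instance gets attached to every load balancer exactly once.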

Member Author

@wking wking Aug 14, 2018

But we can do instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.

Done with c42f215 -> f883131.

@wking wking force-pushed the drop-auto-scaling-groups branch from c42f215 to f883131 Compare August 14, 2018 21:50
Member Author

@wking wking Aug 14, 2018

The worker aws_elb_attachment resource is replacing this. But I don't see an analogous entry in master for worker nodes. Can you explain why we need this in more detail, @yifan-gu?

Member Author

But I don't see an analogous entry in master for worker nodes.

@yifan-gu pointed me at the load_balancers entry in the old aws_autoscaling_group.masters. That's what this entry is replacing.
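
For context, the entry being replaced looked roughly like this in the old ASG-based module (a sketch with hypothetical variable names; classic ELBs were attached by the group itself rather than per instance):

resource "aws_autoscaling_group" "masters" {
  name                 = "${var.cluster_name}-masters"
  min_size             = "${var.instance_count}"
  max_size             = "${var.instance_count}"
  desired_capacity     = "${var.instance_count}"
  launch_configuration = "${aws_launch_configuration.master.id}"
  vpc_zone_identifier  = ["${var.subnet_ids}"]

  # The ASG attached every instance it launched to these classic ELBs;
  # aws_elb_attachment.masters now creates those associations explicitly.
  load_balancers = ["${var.aws_lbs}"]
}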

@wking wking force-pushed the drop-auto-scaling-groups branch from f883131 to 9421288 Compare August 14, 2018 22:44
@yifan-gu
Contributor

Looks like the test is failing due to a resource limit being hit.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from 9421288 to d93d94b Compare August 20, 2018 18:35
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2018
@wking
Member Author

wking commented Aug 20, 2018

@yifan-gu rebased around #127 with 9421288 -> d93d94b

Currently, the installer creates an ASG for both masters and workers.
While this is okay for workers, this causes quite a few headaches for
masters.  For example, etcd will be run on the master nodes, but this
will be unstable if Amazon is free to delete nodes and recreate them
without etcd knowing about it.  Additionally, this makes bootstrap
tricky because every master needs to be identical.  In order to break
the dependency loop between the MachineConfigController (TNC) and the
master nodes, a CNAME is currently used to temporarily redirect
Ignition's requests to an S3 bucket with a pre-computed bootstrap
config.

This commit drops the use of ASG for both masters and workers and
instead creates individual machines. Once we adopt the use of
MachineSets and MachineControllers, we'll regain the functionality we
lost by dropping ASGs.

While renaming the previous *-asg modules, I've also renamed the main
Terraform files to main.tf to comply with the standard module
structure [1].

The new main.tf follows the example set by modules/aws/etcd/nodes.tf,
which is why the instance resource declarations have moved to the end
of the file.  I've also adjusted the master policy to more closely
match the more-restrictive etcd policy, because I don't think the
masters will need to push S3 resources, etc.

I've also dropped some redundant names (e.g. "master_profile" ->
"master"), because the profile-ness is already covered in the resource
name (e.g. "aws_iam_instance_profile.master").

The worker load balancer (previously
aws_autoscaling_attachment.workers, now aws_elb_attachment.workers) is
a bit more complicated now, since we have to loop over both the
provided load balancers and the created worker instances to set up
associations.  There are also similar master load balancer
associations in aws_elb_attachment.masters, replacing the old
load_balancers entry in aws_autoscaling_group.masters.

[1]: https://www.terraform.io/docs/modules/create.html#standard-module-structure
@wking wking force-pushed the drop-auto-scaling-groups branch from d93d94b to 446e9f1 Compare August 20, 2018 18:38
@wking
Member Author

wking commented Aug 20, 2018

And I've dropped two more ssh_key entries with d93d94b -> 446e9f1.

@wking wking changed the title from "wip: modules/aws: Drop auto-scaling groups" to "modules/aws: Drop auto-scaling groups" Aug 20, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2018
@wking wking changed the title from "modules/aws: Drop auto-scaling groups" to "modules/aws: Drop auto-scaling groups (ASG)" Aug 21, 2018
@yifan-gu
Contributor

/retest

@yifan-gu
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yifan-gu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yifan-gu
Contributor

/retest

@openshift-merge-robot openshift-merge-robot merged commit 2414a30 into openshift:master Aug 24, 2018
@wking wking deleted the drop-auto-scaling-groups branch August 24, 2018 05:32
wking referenced this pull request in joelddiaz/cluster-operator Sep 6, 2018
stbenjam pushed a commit to stbenjam/installer that referenced this pull request Feb 10, 2021
Bug 1905307: cluster-baremetal-operator: add provisioning CR as related object
clnperez added a commit to clnperez/installer that referenced this pull request Dec 13, 2021