
Conversation

@wking
Member

@wking wking commented Jul 31, 2018

Currently, the installer creates an ASG for both masters and workers. While this is okay for workers, this causes quite a few headaches for masters. For example, etcd will be run on the master nodes, but this will be unstable if Amazon is free to delete nodes and recreate them without etcd knowing about it. Additionally, this makes bootstrap tricky because every master needs to be identical. In order to break the dependency loop between the MachineConfigController (TNC) and the master nodes, a CNAME is currently used to temporarily redirect Ignition's requests to an S3 bucket with a pre-computed bootstrap config.

This commit drops the use of ASG for both masters and workers and instead creates individual machines. Once we adopt the use of MachineSets and MachineControllers, we'll regain the functionality we lost by dropping ASGs.

While renaming the previous *-asg modules, I've also renamed the main Terraform files to main.tf to comply with the standard module structure.

The new main.tf follows the example set by modules/aws/etcd/nodes.tf, which is why the instance resource declarations have moved to the end of the file. I've also adjusted the master policy to more closely match the more-restrictive etcd policy, because I don't think the masters will need to push S3 resources, etc.

I've also dropped some redundant names (e.g. master_profile -> master), because the profile-ness is already covered in the resource name (e.g. aws_iam_instance_profile.master).

The worker load balancer is a bit more complicated now, since we have to loop over both the provided load balancers and the created worker instances to set up associations.
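
For illustration, the per-instance worker attachment described above might look roughly like the following Terraform 0.11-style sketch (the variable and resource names here are hypothetical, not necessarily the ones in this PR):

resource "aws_elb_attachment" "workers" {
  # One attachment per (worker instance, load balancer) pair.
  count = "${var.instance_count * length(var.load_balancers)}"

  # count.index % var.instance_count selects the worker instance;
  # count.index / var.instance_count (integer division) selects the load balancer.
  instance = "${element(aws_instance.worker.*.id, count.index % var.instance_count)}"
  elb      = "${element(var.load_balancers, count.index / var.instance_count)}"
}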

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 31, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from c2dd6f7 to cff7e05 Compare July 31, 2018 21:31
Member Author

Ah, I still need to fix this...

@wking wking force-pushed the drop-auto-scaling-groups branch from cff7e05 to c9351ed Compare July 31, 2018 22:02
@crawford
Contributor

retest this please

@wking
Member Author

wking commented Jul 31, 2018

I'm not sure if the smoke tests will pick it up, but there's some sort of issue with (at least) the console at https://${CLUSTER_NAME}-api.${BASE_DOMAIN}:6443/console/ returning end-of-file before the TLS connection is established. I'm still trying to figure out what's going on there.

Contributor

One master <=> one subnet? How do we make sure we have enough subnet IDs?

Member Author

One master <=> one subnet? How do we make sure we have enough subnet IDs?

Hmm. I was trying to replace this:

vpc_zone_identifier = ["${var.subnet_ids}"]

And the line I have here is how etcd works now. Maybe it works because element has some built-in modulus protection?
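
For what it's worth, element() does have that protection: Terraform documents that the index is taken modulo the list length, so an etcd-style line like the one below (a sketch, names matching the snippet above) simply cycles through the subnets when the instance count exceeds the number of subnet IDs:

# element() wraps the index modulo the list length, so this never reads
# past the end of var.subnet_ids; extra instances reuse earlier subnets.
subnet_id = "${element(var.subnet_ids, count.index)}"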

Contributor

Same as in master.tf

Contributor

This is previous work of tectonic-installer. No need to edit this.

Member Author

This is previous work of tectonic-installer. No need to edit this.

Ah, thanks. Dropped with c9351ed -> 46de8e5.

@wking wking force-pushed the drop-auto-scaling-groups branch from c9351ed to 46de8e5 Compare July 31, 2018 23:28
@wking
Member Author

wking commented Aug 1, 2018

Timeouts in both Jenkins (during teardown):

00:38:59 module.vpc.aws_nat_gateway.nat_gw.1: Still destroying... (ID: nat-071d666045c18d567, 8m20s elapsed)
00:39:07 Cancelling nested steps due to timeout

and e2e-aws (during setup):

Waiting for API at https://ci-op-vwnb6vqd-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted
2018/08/01 01:30:54 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited

/retest

@wking
Member Author

wking commented Aug 1, 2018

This time e2e-aws errored with:

1 error(s) occurred:

* module.masters.aws_instance.master: 1 error(s) occurred:

* aws_instance.master: Error launching source instance: InvalidParameterValue: Value (ci-op-p1k7nnm1-68485-master-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
	status code: 400, request id: 4141cced-f796-4c3d-8ea8-c59d911d5911

But ci-op-p1k7nnm1-68485-master-profile looks like a reasonable profile name to me. I'll push a new commit with a small commit-message typo fix to trigger new tests.

@wking wking force-pushed the drop-auto-scaling-groups branch 2 times, most recently from 4f9d6b5 to e0dde38 Compare August 1, 2018 04:06
@wking
Member Author

wking commented Aug 1, 2018

The Jenkins error was:

04:13:57 module.vpc.aws_vpc.new_vpc: Destroying... (ID: vpc-7685de10)
04:13:57 module.vpc.aws_vpc.new_vpc: Destruction complete after 1s
04:13:57
04:13:57 Error: Error applying plan:
04:13:57
04:13:57 1 error(s) occurred:
04:13:57
04:13:57 * module.vpc.output.aws_lbs: Resource 'aws_elb.console' does not have attribute 'id' for variable 'aws_elb.console.id'

But aws_elb resources should definitely have an id attribute, so I'm not sure what that's about. Maybe we're missing a dependency in the Terraform DAG that's causing racy teardown?

The e2e-aws error was:

3 error(s) occurred:

* module.vpc.aws_route.to_nat_gw[1]: 1 error(s) occurred:

* aws_route.to_nat_gw.1: Error finding route after creating it: Unable to find matching route for Route Table (rtb-0896234416fbe24be) and destination CIDR block (0.0.0.0/0).
* module.vpc.aws_route.to_nat_gw[4]: 1 error(s) occurred:

* aws_route.to_nat_gw.4: Error finding route after creating it: Unable to find matching route for Route Table (rtb-0421e441c551bb583) and destination CIDR block (0.0.0.0/0).
* module.vpc.aws_route.to_nat_gw[5]: 1 error(s) occurred:

* aws_route.to_nat_gw.5: Error finding route after creating it: Unable to find matching route for Route Table (rtb-012e1b3ba024d050f) and destination CIDR block (0.0.0.0/0).

We've seen the "Error finding route..." issue a few times before; there may be a race or other bug there causing this instability.

/retest
retest this please

@wking
Member Author

wking commented Aug 1, 2018

This time I'm looking at the full Jenkins logs, and the smoke tests are failing:

...
Apply complete! Resources: 4 added, 0 changed, 0 destroyed.
 Running smoke test...
...
--- FAIL: Test (2700.00s)
    --- FAIL: Test/Common (900.00s)
        --- FAIL: Test/Common/APIAvailable (900.00s)
        	smoke_test.go:112: failed with error: failed to connect to API server: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/version: EOF
        	smoke_test.go:113: retrying in 10s
...
        	smoke_test.go:112: failed with error: failed to connect to API server: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/version: EOF
        	smoke_test.go:113: retrying in 10s
        	common_test.go:18: Failed to connect to API server in 15m0s.
    --- FAIL: Test/Cluster (1800.00s)
        --- FAIL: Test/Cluster/AllNodesRunning (600.00s)
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/nodes: EOF
        	smoke_test.go:113: retrying in 10s
...
        	smoke_test.go:112: failed with error: failed to list nodes: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/nodes: EOF
        	smoke_test.go:113: retrying in 10s
        	cluster_test.go:148: Failed to find 7 ready nodes in 10m0s.
        --- FAIL: Test/Cluster/AllResourcesCreated (600.00s)
        	cluster_test.go:215: looking for resources defined by the provided manifests...
        	cluster_test.go:268: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api: EOF
        	smoke_test.go:112: failed with error: all defined resources were not present
        	smoke_test.go:113: retrying in 30s
...
       	cluster_test.go:215: looking for resources defined by the provided manifests...
        	cluster_test.go:268: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api: EOF
        	smoke_test.go:112: failed with error: all defined resources were not present
        	smoke_test.go:113: retrying in 30s
        	cluster_test.go:292: timed out waiting for all manifests to be created after 10m0s
        --- FAIL: Test/Cluster/AllPodsRunning (600.00s)
        	smoke_test.go:112: failed with error: could not list pods: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/pods: EOF
        	smoke_test.go:113: retrying in 3s
...
        	smoke_test.go:112: failed with error: could not list pods: Get https://ci-pr-88-2cc66-api.tectonic-ci.de:6443/api/v1/pods: EOF
        	smoke_test.go:113: retrying in 3s
        	cluster_test.go:71: Timed out waiting for pods to be ready.
FAIL
 Exiting... Destroying Tectonic and cleaning SSH keys...
...
module.etcd.aws_instance.etcd_node[0]: Destroying... (ID: i-0f6ca290b9571a64e)
module.etcd.aws_instance.etcd_node[2]: Destroying... (ID: i-0a6cfec594469d45b)
module.etcd.aws_instance.etcd_node[1]: Destroying... (ID: i-0d124acd0914412f7)
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 10s elapsed)
module.etcd.aws_instance.etcd_node.0: Still destroying... (ID: i-0f6ca290b9571a64e, 10s elapsed)
module.etcd.aws_instance.etcd_node.2: Still destroying... (ID: i-0a6cfec594469d45b, 10s elapsed)
...
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 8m20s elapsed)
module.etcd.aws_instance.etcd_node.1: Still destroying... (ID: i-0d124acd0914412f7, 8m30s elapsed)
Sending interrupt signal to process
Terminated
script returned exit code 143

So it looks like there were some smoke-test timeouts, and those timeouts caused the entire teardown process to time out.

@wking wking changed the title from "modules/aws: Drop auto-scaling groups" to "wip: modules/aws: Drop auto-scaling groups" Aug 3, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 3, 2018
@yifan-gu
Contributor

yifan-gu commented Aug 13, 2018

Looks like the etcd nodes can't come up because they can't get their Ignition configs from the TNC.
Debugging.

(updated)
The ELB for the TNC reports 0 healthy instances. After I added the master to the ELB manually, the etcd nodes were able to boot.

I'm going to try adding the instance IDs to the ELB to make it work, but it seems @crawford is moving the masters onto the etcd nodes, so we probably won't need an ELB for this anymore?

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 14, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from 4e9b293 to c42f215 Compare August 14, 2018 20:54
Member Author

I expect this would work either way, and var.instance_count is shorter and (at least for me) easier to read than length(var.load_balancers). Is there a reason to switch to using the latter, @yifan-gu?

Contributor

@yifan-gu yifan-gu Aug 14, 2018

This won't work. Say, for example:

length(var.load_balancers) == 4
instance_count = 2
count = 4 * 2 = 8, so count.index ranges over [0, 7]

Then count.index % var.instance_count only ranges over [0, 1], which won't cover all of the ELBs in the list, and count.index / var.instance_count ranges over [0, 3], which runs past the end of the instance list.

Contributor

But we can do instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.
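
To make that arithmetic concrete, here is the mapping for the numbers above (instance_count = 2, four load balancers, count = 8), written out as Terraform-style comments; it is only a worked example of the proposal, not code from the PR:

# count.index | instance = count.index % 2 | elb = count.index / 2
#      0      |             0              |          0
#      1      |             1              |          0
#      2      |             0              |          1
#      3      |             1              |          1
#      4      |             0              |          2
#      5      |             1              |          2
#      6      |             0              |          3
#      7      |             1              |          3
# Every instance gets attached to every load balancer exactly once.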

Member Author

@wking wking Aug 14, 2018

But we can do instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.

Done with c42f215 -> f883131.

@wking wking force-pushed the drop-auto-scaling-groups branch from c42f215 to f883131 Compare August 14, 2018 21:50
Member Author

@wking wking Aug 14, 2018

The worker aws_elb_attachment resource is replacing this. But I don't see an analogous entry in master for worker nodes. Can you explain why we need this in more detail, @yifan-gu?

Member Author

But I don't see an analogous entry in master for worker nodes.

@yifan-gu pointed me at the load_balancers entry in the old aws_autoscaling_group.masters. That's what this entry is replacing.
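
For context, the entry being replaced looked roughly like this in the old ASG-based module (a sketch with hypothetical variable names; classic ELBs were attached by the group itself rather than per instance):

resource "aws_autoscaling_group" "masters" {
  name                 = "${var.cluster_name}-masters"
  min_size             = "${var.instance_count}"
  max_size             = "${var.instance_count}"
  desired_capacity     = "${var.instance_count}"
  launch_configuration = "${aws_launch_configuration.master.id}"
  vpc_zone_identifier  = ["${var.subnet_ids}"]

  # The ASG attached every instance it launched to these classic ELBs;
  # aws_elb_attachment.masters now creates those associations explicitly.
  load_balancers = ["${var.aws_lbs}"]
}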

@wking wking force-pushed the drop-auto-scaling-groups branch from f883131 to 9421288 Compare August 14, 2018 22:44
@yifan-gu
Contributor

Looks like the test is failing due to a resource limit being hit.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2018
@wking wking force-pushed the drop-auto-scaling-groups branch from 9421288 to d93d94b Compare August 20, 2018 18:35
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 20, 2018
@wking
Member Author

wking commented Aug 20, 2018

@yifan-gu rebased around #127 with 9421288 -> d93d94b

Currently, the installer creates an ASG for both masters and workers.
While this is okay for workers, this causes quite a few headaches for
masters.  For example, etcd will be run on the master nodes, but this
will be unstable if Amazon is free to delete nodes and recreate them
without etcd knowing about it.  Additionally, this makes bootstrap
tricky because every master needs to be identical.  In order to break
the dependency loop between the MachineConfigController (TNC) and the
master nodes, a CNAME is currently used to temporarily redirect
Ignition's requests to an S3 bucket with a pre-computed bootstrap
config.

This commit drops the use of ASG for both masters and workers and
instead creates individual machines. Once we adopt the use of
MachineSets and MachineControllers, we'll regain the functionality we
lost by dropping ASGs.

While renaming the previous *-asg modules, I've also renamed the main
Terraform files to main.tf to comply with the standard module
structure [1].

The new main.tf follows the example set by modules/aws/etcd/nodes.tf,
which is why the instance resource declarations have moved to the end
of the file.  I've also adjusted the master policy to more closely
match the more-restrictive etcd policy, because I don't think the
masters will need to push S3 resources, etc.

I've also dropped some redundant names (e.g. "master_profile" ->
"master"), because the profile-ness is already covered in the resource
name (e.g. "aws_iam_instance_profile.master").

The worker load balancer (previously
aws_autoscaling_attachment.workers, now aws_elb_attachment.workers) is
a bit more complicated now, since we have to loop over both the
provided load balancers and the created worker instances to set up
associations.  There are also similar master load balancer
associations in aws_elb_attachment.masters, replacing the old
load_balancers entry in aws_autoscaling_group.masters.

[1]: https://www.terraform.io/docs/modules/create.html#standard-module-structure
@wking wking force-pushed the drop-auto-scaling-groups branch from d93d94b to 446e9f1 Compare August 20, 2018 18:38
@wking
Member Author

wking commented Aug 20, 2018

And I've dropped two more ssh_key entries with d93d94b -> 446e9f1.

@wking wking changed the title from "wip: modules/aws: Drop auto-scaling groups" to "modules/aws: Drop auto-scaling groups" Aug 20, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2018
@wking wking changed the title from "modules/aws: Drop auto-scaling groups" to "modules/aws: Drop auto-scaling groups (ASG)" Aug 21, 2018
@yifan-gu
Contributor

/retest

@yifan-gu
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yifan-gu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yifan-gu
Contributor

/retest

@openshift-merge-robot openshift-merge-robot merged commit 2414a30 into openshift:master Aug 24, 2018
@wking wking deleted the drop-auto-scaling-groups branch August 24, 2018 05:32
wking referenced this pull request in joelddiaz/cluster-operator Sep 6, 2018
stbenjam pushed a commit to stbenjam/installer that referenced this pull request Feb 10, 2021
Bug 1905307: cluster-baremetal-operator: add provisioning CR as related object
clnperez added a commit to clnperez/installer that referenced this pull request Dec 13, 2021