modules/aws: Drop auto-scaling groups (ASG) #88
c2dd6f7 to
cff7e05
Compare
tests/smoke/aws/README.md
Outdated
Ah, I still need to fix this...
cff7e05 to
c9351ed
Compare
retest this please
I'm not sure if the smoke tests will pick it up, but there's some sort of issue with (at least) the console at
modules/aws/master/main.tf
Outdated
one master <=> one subnet ? How do we make sure we have enough subnet ids
one master <=> one subnet ? How do we make sure we have enough subnet ids
Hmm. I was trying to replace this:
vpc_zone_identifier = ["${var.subnet_ids}"]
And the line I have here is how etcd works now. Maybe it works because element has some built-in modulus protection?
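On the modulus question: Terraform's built-in element(list, index) does wrap around, taking the index modulo the list length, so having more masters than subnets cycles through the subnets rather than failing. A minimal Python sketch of that behavior follows; the subnet IDs are hypothetical, and the element helper here only mimics the Terraform function for illustration.

```python
def element(items, index):
    """Mimic Terraform's element(): the index wraps modulo the list length."""
    if not items:
        raise ValueError("element() cannot be used with an empty list")
    return items[index % len(items)]


# Hypothetical subnet IDs; three masters spread over two subnets.
subnet_ids = ["subnet-a", "subnet-b"]
master_subnets = [element(subnet_ids, i) for i in range(3)]
print(master_subnets)  # ['subnet-a', 'subnet-b', 'subnet-a']
```

So the third master would land back in the first subnet, which is the "one master <=> one subnet" concern resolved by wrap-around rather than by requiring enough subnet ids.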
modules/aws/worker/main.tf
Outdated
Same as in master.tf
This is previous work of tectonic-installer. No need to edit this.
This is previous work of tectonic-installer. No need to edit this.
Ah, thanks. Dropped with c9351ed -> 46de8e5.
c9351ed to
46de8e5
Compare
Timeouts in both Jenkins (during teardown):
/retest
This time e2e-aws errored with: But
4f9d6b5 to
e0dde38
Compare
The Jenkins error was: But The e2e-aws error was: We've seen the "Error finding route..." issue a few times before, there may be a race or other bug there causing this instability.
/retest
This time I'm looking at the full Jenkins logs, and the smoke tests are failing: So it looks like there were some smoke-test timeouts, and those timeouts caused the entire teardown process to time out.
Looks like the etcd nodes can't come up because they can't get the ignition from tnc. (updated) I'm going to try to add the instance ids into the elb to make it work, but seems @crawford is moving the masters into the etcd nodes, so we probably don't need elb for it anymore?
4e9b293 to
c42f215
Compare
modules/aws/worker/main.tf
Outdated
I expect this would work either way, and var.instance_count is shorter and (at least for me) easier to read than length(var.load_balancers). Is there a reason to switch to using the latter, @yifan-gu?
This won't work, say if:
length(var.load_balancers) == 4
instance_count = 2
count_index = 4*2 = 8
Then, count.index % var.instance_count ranges from [0, 1], which won't include all elbs in the list, and count.index / var.instance_count ranges from [0, 3] which will be out of boundary.
But we can do instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.
But we can do
instance = [count.index % var.instance_count], elb = [count.index / var.instance_count], which will be identical.
Done with c42f215 -> f883131.
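To make the index arithmetic in this thread concrete, here is a small Python sketch (not the Terraform code itself) using the numbers from the example: 4 load balancers, 2 instances, and a count of 8. The function names pairs_broken and pairs_fixed are hypothetical, and the "broken" mapping is my reading of the comment above. Python's `//` stands in for the integer division the thread relies on.

```python
from itertools import product


def pairs_broken(lb_count, instance_count):
    """Mapping flagged as wrong: elb = i % instance_count, instance = i // instance_count."""
    total = lb_count * instance_count
    return [(i % instance_count, i // instance_count) for i in range(total)]


def pairs_fixed(lb_count, instance_count):
    """Mapping agreed on: elb = i // instance_count, instance = i % instance_count."""
    total = lb_count * instance_count
    return [(i // instance_count, i % instance_count) for i in range(total)]


lb_count, instance_count = 4, 2
broken = pairs_broken(lb_count, instance_count)
fixed = pairs_fixed(lb_count, instance_count)

# Broken: ELB indices stay in [0, 1], and instance indices reach 3,
# which is out of range when there are only 2 instances.
print(sorted({elb for elb, _ in broken}), max(inst for _, inst in broken))

# Fixed: every (elb, instance) pair appears exactly once.
print(sorted(fixed) == list(product(range(lb_count), range(instance_count))))
```

With the fixed mapping, the 8 attachments enumerate each of the 4 load balancers against each of the 2 instances, which is the full set of associations the thread is after.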
c42f215 to
f883131
Compare
modules/aws/master/main.tf
Outdated
But I don't see an analogous entry in master for worker nodes.
@yifan-gu pointed me at the load_balancers entry in the old aws_autoscaling_group.masters. That's what this entry is replacing.
f883131 to
9421288
Compare
Looks like the test is failing due to a resource limit being hit.
9421288 to
d93d94b
Compare
Currently, the installer creates an ASG for both masters and workers. While this is okay for workers, this causes quite a few headaches for masters. For example, etcd will be run on the master nodes, but this will be unstable if Amazon is free to delete nodes and recreate them without etcd knowing about it. Additionally, this makes bootstrap tricky because every master needs to be identical. In order to break the dependency loop between the MachineConfigController (TNC) and the master nodes, a CNAME is currently used to temporarily redirect Ignition's requests to an S3 bucket with a pre-computed bootstrap config.
This commit drops the use of ASG for both masters and workers and instead creates individual machines. Once we adopt the use of MachineSets and MachineControllers, we'll regain the functionality we lost by dropping ASGs.
While renaming the previous *-asg modules, I've also renamed the main Terraform files to main.tf to comply with the standard module structure [1].
The new main.tf files follow the example set by modules/aws/etcd/nodes.tf, which is why the instance resource declarations have moved to the end of the file. I've also adjusted the master policy to more closely match the more-restrictive etcd policy, because I don't think the masters will need to push S3 resources, etc.
I've also dropped some redundant names (e.g. "master_profile" -> "master"), because the profile-ness is already covered in the resource name (e.g. "aws_iam_instance_profile.master").
The worker load balancer (previously aws_autoscaling_attachment.workers, now aws_elb_attachment.workers) is a bit more complicated now, since we have to loop over both the provided load balancers and the created worker instances to set up associations. There are also similar master load balancer associations in aws_elb_attachment.masters, replacing the old load_balancers entry in aws_autoscaling_group.masters.
[1]: https://www.terraform.io/docs/modules/create.html#standard-module-structure
d93d94b to
446e9f1
Compare
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wking, yifan-gu
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest