Add enhancement for AWS Placement Groups in Machine API #995
Conversation
/uncc @imcleod @kbsingh
jeana-redhat
left a comment
Noted a few things while reading - pretty sure my questions don't need to be addressed until closer to actually creating product docs for this.
> ### User Stories
>
> - As an OpenShift cluster administrator, I would like to use a spread placement group to ensure that my critical
>   workloads are unlikely to suffer from hardware failures simultanesouly.
Suggested change:
> - workloads are unlikely to suffer from hardware failures simultanesouly.
> + workloads are unlikely to suffer from hardware failures simultaneously.
> - As an OpenShift cluster administrator, I would like to use a spread placement group to ensure that my critical
>   workloads are unlikely to suffer from hardware failures simultanesouly.
> - As an operator of HPC workloads on OpenShift, I would like to use a cluster placement group to ensure that there is
> HPC
I don't see this term in existing product docs. Google tells me it stands for "high-performance computing", I'm assuming that is the meaning here?
Yeah, that's my bad, I should have spelled that out on first usage, will update. You're correct.
All good. Mostly I am wondering if we refer to this in prod docs with some other term, since it seems like the idea of HPC ought to have come up. But that's for me to dig in on :)
elmiko
left a comment
This generally reads well to me @JoelSpeed. I added a comment about the CAO (Cluster Autoscaler Operator) that we discussed during our team standup. My only other comment is that I think it would be nice to call more attention to the idea that this feature is highly sensitive to user configuration, but I'm not sure where that might fit, maybe drawbacks?
> There are currently no viable alternatives. This feature is unique in AWS and there isn't an alternative way for users
> to make use of the benefits provided by placement groups.
I was trying to highlight that I don't think there's another way to place groups of instances together in AWS, so there is no alternative. I'll try to reword this.
> When using Cluster placement groups it is recommended to launch all instances of the same instance type and if
> possible, to launch instances simultaneously to prevent the likelihood that capacity limits are reached within the
> placement group.
maybe "AWS recommends launching ..." rather than "it is recommended to launch ..." to make it clear that we aren't recommending that users do something that we don't actually let them do...
> ###### Limitations of Spread Placement Groups
>
> - Spread placement groups are not supported for Dedicated Instances.
maybe clarify somehow that this isn't about "OpenShift Dedicated"? (likewise below)
Will prefix with AWS for now, perhaps linking to the AWS docs might help
> ### Open Questions
>
> - Do we need to propogate user defined tags from the Infrastructure resource onto the placement groups we create?
Suggested change:
> - Do we need to propogate user defined tags from the Infrastructure resource onto the placement groups we create?
> + Do we need to propagate user defined tags from the Infrastructure resource onto the placement groups we create?
> // This parameter is only used when a Machine is being created and the named
> // placement group does not exist.
It seems like maybe if the placement group already exists, that it should ensure that it has the correct type, and error out if not?
Otherwise someone might accidentally use the same GroupName for two different MachineSets with different placement needs.
(If MachineSets and placement groups were 1-to-1 then the placement group name could just be based off the MachineSet name...)
> It seems like maybe if the placement group already exists, that it should ensure that it has the correct type, and error out if not?

We could implement that, though at the moment the group type is optional, so would it make sense to ignore the check if the placement group type is empty?

> Otherwise someone might accidentally use the same GroupName for two different MachineSets with different placement needs.

It is valid to use the same placement group across multiple machinesets, though yes, I can see the intention here. I think if the user has set an intended type, then checking against this should be safe. The get-out-of-jail card is to either fix the type or remove the type field.

> (If MachineSets and placement groups were 1-to-1 then the placement group name could just be based off the MachineSet name...)

This isn't necessarily true as there are valid use cases for using the same placement group across multiple machinesets, have mentioned this elsewhere.
sdodson
left a comment
Very well put together enhancement, just a few questions and comments.
> ### Non-Goals
>
> - Allowing this feature to be leveraged during installation (day 0)
>   - Users will be able to modify MachineSets after manifest creation if required but this will not be exposed as a
if I render manifests and then modify them, will I be able to set a PlacementGroup?
Yes for worker nodes, this won't work for control plane machines though as we don't use MAPI to create them. Will clarify this statement to show this is workers only
>   - Users will be able to modify MachineSets after manifest creation if required but this will not be exposed as a
>     part of the installation configuration before manifest generation
> - Support for Placement Groups for Control Plane Machines
>   - Users may replace their control plane at a later date using future automation if they wish to have their control
If you claim this, I'd like to see exactly how this will happen. If you choose to permanently exclude it, you don't have to design it.
Heh, so the future automation I'm referring to is what I'm designing in #1008, now that the PR is up, I could link to this doc I guess but I think given how early that other enhancement is in review, I might wait a bit
>   ensure that there is
>   minimal latency between instances within my cluster, allowing maximal throughput of my workloads.
> - As an operator of a large scale distributed and replicated data store, I would like to use a partition placement
>   group to ensure that different shards of my data are replicated within different fault domains within the same
is this a current customer requirement or just explaining how else this could be used? I'm ok with the latter, but I'd like to understand
This one is not a current customer requirement as far as I understand. The customer ask we have that's driving this is for using Cluster placement groups, which is the second user story.
I want to implement this feature fully so we've explored all of the use cases and added user stories for each.
> The changes are outlined below:
>
> type Placement struct {
this is part of AWSMachineProviderConfig. How many of these are present in our standard installation?
This is part of AWSMachineProviderConfig, it is a singleton within the provider config.
On AWS, we create 3 MachineSets and 3 control plane Machines, each of those will contain an AWSMachineProviderConfig, and the MachineSets each create 1 Machine by default.
If a user wants to add placement groups during install, they probably want different placement groups per MachineSet, so I'd expect them to make up to 3 edits during the manifest phase
The installer will create 1 MachineSet for each availability zone. By default, the installer will include all availability zones in a region. So, taking us-east-1 as an example, not 3 edits but rather 6 edits.
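For illustration, each of those edits is a small addition to the placement stanza of a generated worker MachineSet manifest. A minimal sketch, assuming the inline groupName/groupType fields proposed at this point in the review; the file path and group name are hypothetical:

```yaml
# openshift/99_openshift-cluster-api_worker-machineset-0.yaml (illustrative path)
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
spec:
  template:
    spec:
      providerSpec:
        value:
          placement:
            region: us-east-1
            availabilityZone: us-east-1a
            groupName: mycluster-pg-us-east-1a  # hypothetical placement group name
            groupType: Cluster
```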
> // +optional
> GroupType AWSPlacementGroupType `json:"groupType,omitempty"`
>
> // PartitionCount specifies the number of partitions for a Partition placement
This description really makes this sound like it should be part of a discriminated union.
So I've added the type as an explicit option so we know what the customer wants us to do. As there (currently) is no configuration apart from for partition types, I didn't think we would need to add another struct level to make it a proper discriminated union, but I guess there's no harm in it.

To clarify, your suggestion is:

```golang
type Placement struct {
	Region           string
	AvailabilityZone string
	Tenancy          InstanceTenancy
	GroupType        AWSPlacementGroupType
	GroupName        string
	Partition        *AWSPartitionPlacement
}

type AWSPartitionPlacement struct {
	Count  int32
	Number int32
}
```
I'm wondering if @deads2k was suggesting that the entire placement group policy should be a discriminated union?
That example makes sense to me, but I would suggest breaking it out into a separate type, which could be inlined:

```golang
type Placement struct {
	// ...
	GroupUnion `json:",inline"`
}

// +union
type GroupUnion struct {
	// +unionDiscriminator
	// +kubebuilder:validation:Required
	// +required
	GroupType AWSPlacementGroupType

	// +optional
	Cluster *AWSClusterPlacement

	// +optional
	Partition *AWSPartitionPlacement

	// +optional
	Spread *AWSSpreadPlacement
}

type AWSClusterPlacement struct{}

type AWSPartitionPlacement struct {
	Count  int32
	Number int32
}

type AWSSpreadPlacement struct{}
```
I can do this yep, though, I'm not hugely familiar with the union types, does this mean users have to set one of the structs to be non-empty? Or can they just set the type for Cluster or Spread and omit the struct? Also, do we even need to add those structs to the API if they are empty/not required right now?
The user can specify the discriminator, or specify one of the structs, or both. I think you can omit the empty structs and corresponding fields if you prefer.
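As a minimal sketch of what that means for manifests (field names follow the Go snippet above and are assumptions, not a final API), a user could set just the discriminator, or the discriminator together with the matching member struct:

```yaml
# Discriminator only - sufficient for Cluster or Spread, which carry no extra configuration.
placement:
  groupName: my-placement-group  # hypothetical name
  groupType: Cluster
---
# Discriminator plus the matching member struct - needed when Partition options are set.
placement:
  groupName: my-placement-group
  groupType: Partition
  partition:
    count: 3
```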
> The changes are outlined below:
>
> type Placement struct {
can you link to the current docs for me?
> type Placement struct {
>     // Existing Fields
The existing fields in Placement are:
- `Region string json:"region,omitempty"`
- `AvailabilityZone string json:"availabilityZone,omitempty"`
- `Tenancy InstanceTenancy json:"tenancy,omitempty"`

Can you describe how these interact with cluster, partition, and spread options? Are they expected to be empty when setting a groupName? Does Spread make sense if there's only one AZ?
Region and AvailabilityZone remain the same; there's no effect, apart from that you may want to create multiple machinesets within a single AZ to make use of a partition placement group, though we would normally just spread machines across AZs.

Tenancy is affected because you can't use dedicated hosts with certain placement group types; this is called out in the sections where I talk about each type of placement group.

> Does Spread make sense if there's only one AZ?

Yes! Placement groups are a way of defining how your EC2 instances are laid out within an AZ, as opposed to across AZs. For spread in particular, each AZ in AWS is divided into 7 sections, each section having independent power and network for HA. When you create a spread placement group, it will place at most 1 instance in each of the 7 sections. If you use the same PG across 2 AZs, you then in theory have 14 sections, 7 in each AZ.
> types without detriment.
>
> Spread placement groups have a limited capacity. You can only launch seven instances within a Spread placement group
> within a single availability zone.
Can I use it for multiple availability zones normally?
There's nothing to stop you using it across multiple availability zones, but a MachineSet is tied to a single AZ, so each MachineSet gains a limit of seven instances when it has a spread placement group (unless they create more than one machineset in the same zone, in which case, fewer than 7 I would expect)
> // placement group does not exist.
> // Valid values are "Cluster", "Partition", "Spread", and omitted, which means
> // no opinion and the platform chooses a good default which may change over time.
> // The current default value is "Spread".
is "spread" equivalent to our current behavior? I honestly don't know.
No. The current behaviour is that an ec2 instance we create can go anywhere within the AZ. Just gonna paste a response from another comment as I think it explains the difference:

> For spread in particular, each AZ in AWS is divided into 7 sections, each section having independent power and network for HA. When you create a spread placement group, it will place at most 1 instance in each of the 7 sections. If you use the same PG across 2 AZs, you then in theory have 14 sections, 7 in each AZ.

In our current behaviour, we could quite easily create 5 instances and have them all end up in the same section of the datacenter, meaning a power failure in that section would wipe out all instances.

I've discussed this with @sdodson and we agreed that there isn't actually a sane default. If the value is omitted and we need to create a placement group, that should be considered an error; I've updated the description here.
> // PartitionNumber specifies the numbered partition in which instances should be launched.
> // This value is only used in conjunction with a `Partition` placement Group.
> // It is recommended to only use this value if multiple MachineSets share
> // a single Placement Group, in which case, each MachineSet should represent an individual partition number.
Can you provide an example of using this later in the enhancement? I think it's too big for godoc here, but I don't quite see it.
Yep, have added a yaml example to the Partition Placement Groups heading
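For readers of this thread, the example is roughly of the following shape, a sketch mirroring the snippet quoted later in this review (the group name is illustrative): three MachineSets share one partition placement group, each pinned to its own partition number.

```yaml
# providerSpec placement for the first of three MachineSets sharing one partition placement group.
placement:
  availabilityZone: us-east-1a
  group:
    name: my-partition-placement-group
  partitionNumber: 1  # the other two MachineSets would use 2 and 3
```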
> As Spread placement groups keep instances in different racks, they can be expanded over time and can also mix instance
> types without detriment.
>
> Spread placement groups have a limited capacity. You can only launch seven instances within a Spread placement group
when someone tries to scale beyond this, how will the failure be presented to the user?
An error will be presented to them on the Machine, which will have moved into the Failed phase; this should be sufficient to prompt them that they can't add a new instance to the placement group.
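As a sketch of how that surfaces (assuming the usual Machine API status fields; the exact message text comes from AWS and will vary), the failed Machine would report something like:

```yaml
# Illustrative status of a Machine that could not be launched into a full spread placement group.
status:
  phase: Failed
  errorMessage: >-
    error launching instance: insufficient capacity in placement group
```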
> To ensure we leave user accounts clean, we should remove any placement group that is no longer required.
>
> Each time a Machine is removed, we must check whether it belongs to a placement group.
> If it belongs to a placement group, we must also check that there are no other instances within the placement group.
why should we delete a placement group that we did not create?
We shouldn't, we will only delete placement groups that look like they were created by MAPI (tagged appropriately) and also only in the case when we can recreate the PG if required later (ie we know what groupType it is)
Does this mean we will still leak a placement group in the following scenario?
- User specifies a placement group on a Machine or MachineSet.
- OpenShift creates the placement group.
- User creates an instance outside of OpenShift and assigns it to the placement group.
- User deletes the Machine or MachineSet from step 1.
- OpenShift observes that the placement group is still in use and does not delete it.
- User deletes the instance from step 3.
Maybe that's fine since the user is doing funky things in this example. Anyway, I guess the installer's cluster deprovisioning logic would still clean up any unused placement group with the cluster tag.
> Maybe that's fine since the user is doing funky things in this example. Anyway, I guess the installer's cluster deprovisioning logic would still clean up any unused placement group with the cluster tag.

You're right, we would "leak" it in this case, but this isn't particularly bad as placement groups are a no-cost feature of AWS; the only problem would be if the customer leaked a lot of placement groups by doing this process and then reached the 500 group limit.

The installer deprovision should catch it though, as I'd already implemented and tested that before I implemented the removal logic in the POC.
> If using the Cluster Autoscaler, the Cluster Autoscaler will detect the failed launch and will, after some period (15
> minutes by default), attempt to scale a different group of instances to fulfil the capacity gap, if it is able to.
>
> We recommend that users leverage multiple MachineSets, each with their own Placement Group to minimise the likelihood
"minimise" how very british. :) I don't think we ever made a doc declaring american spelling as standard, but now I secretly suspect those people who add spell checkers of trying to backdoor it!
Entirely tongue in cheek, the spelling is fine :)
> I don't think we ever made a doc declaring american spelling as standard, but now I secretly suspect those people who add spell checkers of trying to backdoor it!

If I spell it with a z (pronounced "zed") then I get a red squiggly line under it ;)

I have wondered about whether we had an official spelling recommendation. I know product docs have to be american english, but this isn't customer facing (or is it?)

Until I'm forced to write and speak in american english, I'll stick to british and keep pronouncing quay in the british way as well ;)
> how very british

The Britishism that trips me up the most is the parsimonious spelling of "fulfil".

> I have wondered about whether we had an official spelling recommendation, I know product docs have to be american english but this isn't customer facing (or is it?)

Upstream has a similar guideline:

> The English-language documentation uses U.S. English spelling and grammar.

https://kubernetes.io/docs/contribute/style/style-guide/

I don't know whether that applies to enhancements though, and anyway, we're not upstream.

> Until I'm forced to write and speak in american english, I'll stick to british and keep pronouncing quay in the british way as well ;)

For what it's worth, I think you have parts of the US on your side with that one (I think the British pronunciation is common in the northeastern states).
> Using existing webhooks that inspect the Machine provider configuration, we can warn users if they create a MachineSet
> with a Spread placement group and attempt to scale it over seven instances.
> This however is not fool-proof, as the user may not have set the placement group type, or may have set it incorrectly
if we require everything to be set ahead of time, we could add this protection?
We don't have required fields within the provider specs, especially with optional features such as this, so I'm not sure we could reasonably enforce things to be set ahead of time. I can imagine customers frequently tweak their machineset definitions, delete all the current machines, and let MAPI create new ones based on the new configuration.
> We may also wish to extend the Cluster Autoscaler Operator to inspect MachineAutoscalers and their targets.
> If any MachineAutoscaler targeted a MachineSet with a spread placement group, it could warn if the maximum replica
> count is set higher than 7. This however is non-trivial as there are currently no webhooks for the Cluster Autoscaler
We create this resource via the API, right? Why can't we enforce it during creation?
We could if we had webhooks, but we don't today, and I'd rather not expand the scope of this enhancement to introduce them; it doesn't seem justified for a single validation like this.
The MachineAutoscaler is a separate CR to the MachineSet, so to check this, we need to have something look up the MachineSet based on the object reference, then check the spec of that and report back whether it's using a spread group or not. While that's entirely possible, there's nothing to stop the end user changing the definition of their machineset during the lifecycle of the cluster; they could create the MachineAutoscaler, then configure the placement groups later.
sdodson
left a comment
Aside from the new comments it looks fine to me, API review notwithstanding.
> // +optional
> GroupType AWSPlacementGroupType `json:"groupType,omitempty"`
>
> // PartitionCount specifies the number of partitions for a Partition placement
> I'm wondering if @deads2k was suggesting that the entire placement group policy should be a discriminated union?

I've updated based on the last couple of comments and the union suggestion.
> // When a Machine is deleted, if this leaves an empty Placement Group, the group is also
> // deleted.
> // +optional
> GroupName string `json:"groupName,omitempty"`
If I have three machinesets and each contains an AWSMachineProviderConfig and I specify a GroupName of "BestGroupEver" in each machineset, what happens?
That's allowed and is a valid use case. If you read through the partition placement groups section, it mentions having a partition placement group with multiple partitions and having each partition represented by the group number.
There isn't a negative effect to having multiple MachineSets with the same group name, if it already exists, it will just be used as is
> There isn't a negative effect to having multiple MachineSets with the same group name, if it already exists, it will just be used as is

Given three different definitions of the GroupName, which of those definitions is accurate? Based on what I'm seeing here, I gather that it is racy.

Using this as both the definition of the PlacementGroup and the reference to which PlacementGroup to use makes the usage of this API inconsistent depending on read order. The same resources played against two different clusters can have two different results. This is a problem with the API.

As an alternative, why not have one resource that describes the PlacementGroup and simply reference those PlacementGroups from here?
> Given three different definitions of the GroupName, which of those definitions is accurate? Based on what I'm seeing here, I gather that it is racy.

I think you're right, this would cause a problem for users as one of the MachineSets would come up successfully and the others, as we have it implemented now, would fail with errors.

> As an alternative, why not have one resource that describes the PlacementGroup and simply reference those PlacementGroups from here.

This is something we could do, though it would be a lot more effort to get done (will need to review team planning/product requirements) and does add complexity for the users. A concern that has come up before is proliferation of the number of CRDs within OpenShift; does this tiny feature warrant adding a new CRD? Is that acceptable?

I'll have a few conversations today and gather feedback. FWIW, to avoid a lot of the complexity here about preventing misconfiguration, having a new CRD is probably the best thing to do. I'm wondering if we should investigate the idea of the AWSCluster resource from Cluster API, which gathers a lot of information like this; if that's a singleton within the cluster and has a list of placement group configs, it could manage those and the other "cluster level" infrastructure.
> GroupName string `json:"groupName,omitempty"`
>
> // AWSPlacementGroupUnion embeds the placement group configuration within the placement.
> AWSPlacementGroupUnion `json:",inline"`
What if I specify the same group three times as in https://github.com/openshift/enhancements/pull/995/files#r798856729 and each one specifies a different GroupType?
We are planning (I will double check if this made it into the enhancement or just comments) to check the group type against the group we find on AWS. If the user specifies a cluster PG and we find a spread PG, we will return an error and tell them they need to fix the Machine configuration
> // +kubebuilder:validation:Minimum:=1
> // +kubebuilder:validation:Maximum:=7
> // +optional
> Count int32 `json:"count,omitempty"`
What happens if different machinesets specify this number differently for the same GroupName?
From the above statement, "This value is only observed when creating a placement group", I would infer that the parameters applied would be those specified in the first reconciled machineset that is configured to use the group.
The more I read and come to understand how placement groups work, the more it seems to me that placement groups really should be configured as separate config objects from machines/machinesets. Defining placement groups within machinesets makes the semantics extremely confusing. As it is, both the basic usage and behaviors of placement groups as well as edge cases (such as the ones David brings up) are difficult to understand.
> The more I read and come to understand how placement groups work, the more it seems to me that placement groups really should be configured as separate config objects from machines/machinesets. Defining placement groups within machinesets makes the semantics extremely confusing. As it is, both the basic usage and behaviors of placement groups as well as edge cases (such as the ones David brings up) are difficult to understand.

I could envisage a separate CRD for this that then gets created, though that's a lot of work and means more controllers and more complication from a user perspective, so I think there's a balance to make here.
The upstream community in Azure already have similar concepts being created by the Machine controller so we have followed that pattern for now.

> What happens if different machinesets specify this number differently for the same GroupName?

We can validate that this configuration is matching the existing placement group and error out if it isn't; this would enforce the count and type to be consistent across multiple MachineSets.
> // AWSPartitionPlacementGroupType is the "Partition" placement group type.
> // Partition placement groups reduce the likelihood of hardware failures
> // disrupting your application's availability.
> // Partition placement groups are recommended for use with large scale
> // distributed and replicated workloads.
It isn't clear to me from this godoc what "Partition" actually does. From my reading of the implementation details, if I understand them correctly, "Partition" defines two very different behaviors depending on whether number is specified or not:

- When `number` is specified, "Partition" behaves like "Cluster" except it places the instances within the `number`th partition.
- When `number` is not specified, "Partition" behaves like "Spread" except it restricts instances to racks belonging to partitions of the placement group.

Is that correct?
Sort of.

> When number is specified, "Partition" behaves like "Cluster" except it places the instances within the numberth partition.

The additional behaviour here is that if you had two cluster groups, they may overlap and instances in one group may be on the same physical hardware as the other. With partitions, each partition is distinct, so you are guaranteed that an instance in partition 1 is not on the same hardware as partition 2.

> When number is not specified, "Partition" behaves like "Spread" except it restricts instances to racks belonging to partitions of the placement group.

This is right, the only difference is that in spread you can only have 1 instance per hardware failure domain (the 7 partitions in each AZ), where in a partition it spreads them equally across however many partitions you choose.
> We will introduce new fields within the `Placement` section of the AWS Machine Provider Config which will allow users
> to specify which placement group to create instances within.
>
> If the placement group does not exist, it will be created based upon the group type and, if applicable partition count provided.
This sentence is missing either a comma or a conjunction. I would guess the former, but I want to make sure there isn't some missing information here.

Suggested change:
> - If the placement group does not exist, it will be created based upon the group type and, if applicable partition count provided.
> + If the placement group does not exist, it will be created based upon the group type and, if applicable, partition count provided.

or:
> + If the placement group does not exist, it will be created based upon the group type and [missing words here], if applicable partition count provided.
First suggestion, missing comma is what I had intended
> // GroupName specifies the name of an AWS placement group to create the Machine within.
> // If the group specified does not exist, it will be created based on the GroupType
> // paramenter.
Suggested change:
> - // paramenter.
> + // parameter.
> // +union
> type AWSPlacementGroupUnion struct {
> // GroupType specifies the type of AWS placement group to use for this Machine.
> // This parameter is only used when a Machine is being created and the named
"Machine or MachineSet", right? Or do I have a fundamental misunderstanding of this API?
Suggested change:
> - // This parameter is only used when a Machine is being created and the named
> + // This parameter is only used when a Machine or MachineSet is being created and the named
MachineSet is to a Machine as a ReplicaSet is to a Pod, so we don't actually create the PG until the first Machine in a MachineSet is created
> // GroupType specifies the type of AWS placement group to use for this Machine.
> // This parameter is only used when a Machine is being created and the named
> // placement group does not exist.
> // Valid values are "Cluster", "Partition", "Spread".
> // If no value is given and a placement group is required to be created, this is
> // considered an error.
> // Note: If the value of this field is "Spread", instanceTenancy may not be set
> // to "dedicated".
Note that the godoc on the AWSPlacementGroupType type definitions does not get included in the generated Swagger, so any documentation from the const definitions that you want the user to see in the generated Swagger should be repeated here for the field definition. That said, I don't know whether users really look to Swagger for API documentation, so maybe this isn't important. (Where do users turn for MAPI documentation? Do most users just read the Go definitions?)
We don't currently have the Machine ProviderSpecs documented very well, typically we have guides that instruct users to turn on a particular feature. I imagine there will be a new page or section for placement groups under which we will talk about the new fields and the way that the structure should be created within the MachineSet.
Our docs writer would normally grab this information from the enhancement IIUC (cc @jeana-redhat)
> In particular this means checking that the `groupType` has been set on the Machine configuration, if this is not set,
> it is likely that Machine API did not create the placement group (and wouldn't be able to create a new placement group)
I'm confused because you wrote the following above:

> Should the user not define the type, we will return an error and place the Machine into a Failed state.

So if groupName is specified and groupType is omitted, then the Machine API definitely didn't create the placement group, right?
> So if groupName is specified and groupType is omitted, then the Machine API definitely didn't create the placement group, right?

I believe so, yes. Though, having now talked about making sure that the placement group we are using is the correct kind for the customer, I think we want to require name and type and then perform validation that the group is of the type they are expecting. I think I missed this while updating from previous feedback.
> - You can create a maximum of 500 placement groups per account in each Region.
> - The name that you specify for a placement group must be unique within your AWS account for the Region.
> - An instance can be launched in one placement group at a time; it cannot span multiple placement groups.
> - You cannot launch AWS Dedicated Hosts in placement groups.
Above, you wrote the following:

> - A partition placement group with AWS Dedicated Instances can have a maximum of two partitions.

I don't know how to reconcile the "general limitation" on line 336 with the cited limitation for "Partition". Is there a distinction between "AWS Dedicated Hosts" and "AWS Dedicated Instances"? Is the user able to launch an AWS Dedicated Host and then put it in a placement group after it is launched?
This was copied from the AWS docs, which, having checked, of course contradict themselves. I think the general limitation is "You should assume this", and then, looking at partition in particular, "you can now use this but only for this type of placement group" (Dedicated hosts/instances are synonyms).
To thicken the plot, it looks like there are two different pages for Amazon EC2 Dedicated Hosts and Amazon EC2 Dedicated Instances. Quite confusing unfortunately; are we really sure they are synonyms? The instanceTenancy type also has a "dedicated" and a "host" value here, could they highlight this difference already?
Reading the two pages, they appear to me to describe the same thing
> This failure will prevent adding additional compute capacity to the cluster and in turn, will result in potentially
> pending workloads.
> As this feature is only supported on worker nodes, we do not expect any detrimental impact on the control plane.
It may not be supported, but is it possible for users to configure this on the control-plane Machines, or will we block this somehow?
Not supported for the moment, this is blocked purely on the fact that we aren't going to implement this in the installer. I've added some notes about control plane machines to the future implementation and drawbacks sections
> - Support for Placement Groups for Control Plane Machines
>   - Users may replace their control plane at a later date using future automation if they wish to have their control
>     plane within a placement group
> - Support for moving Machines between placement groups
Does this mean it is not possible to change the placement group of an instance after it has been launched?
You can, but not via Machine API. To change the PG, you have to shut down the instance on AWS, then move it, and then restart it. The AWS docs say that this can't be done, but I've been informed by AWS engineers that it can be done.
@Miciah @deads2k Thanks for the reviews, I've updated the enhancement to clarify some of the points you made and responded in the comments for each. The important change is we will now require group name and group type and validate these so that if users do share the config across multiple groups, they definitely get the correct type of group. Will add the group count for partition into that validation as well.
> // GroupName specifies the name of an AWS placement group to create the Machine within.
> // If the group specified does not exist, it will be created based on the GroupType
> // parameter.
> // When a Machine is deleted, if this leaves an empty Placement Group, the group is also
We should call out in this comment that only Placement Groups created by the cluster will be deleted.
> installer generated machineset manifests before the cluster is created.
> We do not intend to support placement groups for control plane machines at all at this point.
>
> We will however need to update the installer cluster deprovisioning logic, as currently it does not know how to remove
If the user adds an instance to a placement group created by the cluster, then the destroyer will never complete as it will continually fail to delete the placement group. That does not sound like a terrible problem to me, but I did want to call it out. It is a similar issue to not being able to delete a VPC to which the user has added an instance.
I've added a note under this to call that out, I guess the only thing we can do is try to make sure a useful error is returned in this case?
> - ec2:DescribePlacementGroups
> - ec2:CreatePlacementGroup
ec2:DeletePlacementGroup, too, right?
If so, then this permission also needs to be added to the permissions checked by the installer since it will be needed by the destroyer.
Good catch, I hadn't done any testing of the deletion (or implemented it) when this was written so it slipped through. It is needed though so will add a note.
Do you happen to have a link to roughly where the list of permissions the installer checks is sourced from? Can raise a PR
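Assuming that note lands, the permission list in the enhancement would presumably end up as follows (these are the standard EC2 IAM action names):

```yaml
- ec2:DescribePlacementGroups
- ec2:CreatePlacementGroup
- ec2:DeletePlacementGroup
```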
> ### Open Questions
>
> - Do we need to propagate user defined tags from the Infrastructure resource onto the placement groups we create?
I would suspect so.
I've already created a PR for this so this isn't actually an open question anymore, will remove
> ## Proposal
>
> We will introduce a new CRD, `AWSPlacementGroup` within the `machine.openshift.io/v1` API group and a new
> new object reference within the `Placement` section of the AWS Machine Provider Config which will allow users
Suggested change:
> - new object reference within the `Placement` section of the AWS Machine Provider Config which will allow users
> + object reference within the `Placement` section of the AWS Machine Provider Config which will allow users
> // provided by attached IAM role where the actuator is running.
> // +optional
> CredentialsSecret *corev1.LocalObjectReference `json:"credentialsSecret,omitempty"`
> // Note the partition count of a placement group cannot be changed after creation.
> // If unset, AWS will provide a default partition count.
> // This default is currently 2.
> // Note: If instanceTenancy is set to "dedicated", the maximum number of partitions
This is the Tenancy field under Placement, right?
https://github.com/openshift/api/blob/b632c5fc10c0ea86cb18577db9388109a43d465f/machine/v1beta1/types_awsprovider.go#L189
It would be nice if we could make it more obvious that it is that one. Maybe we could specify its full path?
> // PartitionNumber specifies the numbered partition in which instances should be launched.
> // It is recommended to only use this value if multiple MachineSets share
> // a single Placement Group, in which case, each MachineSet should represent an individual partition number.
> // If unset, when a Partition placement group is used, AWS will attempt to
> // distribute instances evenly between partitions.
> // +kubebuilder:validation:Minimum:=1
> // +kubebuilder:validation:Maximum:=7
> // +optional
> PartitionNumber int32 `json:"number,omitempty"`
> }
My understanding (but correct me if I'm wrong here) after reading the AWS docs is that the partition number is a concept only for Partition placement groups.
With that in mind, what happens if the user specifies a PartitionNumber but then references an AWSPlacementGroup of an AWSPlacementGroupType which doesn't have the concept of a partition number? What is the behaviour we intend to pursue? We could maybe add a note on that in the spec comment above.
We should return an error to the user in that case; they probably didn't mean to set the partition number, I would guess.
👍
> availabilityZone: us-east-1a
> group:
>   name: my-partition-placement-group
> partitionNumber: 1 | 2 | 3 # This value will vary across the three MachineSets, either 1, 2 or 3
So my understanding is that once a partition placement group is created (and its initial count specified), then machines can be created within the various logical partitions (correct me if I'm wrong here).

A couple of follow up questions to this:
- The count of the desired logical partitions must be specified at creation time, but where these partitions are within a region is not pin-pointed by the creation step. How does this work? Is a partition automatically assigned to an AZ depending on where the first instance that we put into the partition lives? Or does a single logical partition allow, for example, 2 instances that live in two different AZs to be in the same logical partition? And if the latter is true, how does this count towards the partitions per Availability Zone cap?
- Do you envision us allowing different MachineSets binding to the same logical partition number? If yes, what if the two MachineSets' instances are in different AZs? (This then ties back to the previous question.)
So you can have 7 partitions maximum, but that is 7 partitions per availability zone in actuality.

So there is a partition numbered 1 in AZ A and in AZ B, and you can put the machines into partition 1 on each of these, but because they are separate AZs, they are separate failure domains already. It's best to just think of placement groups as a way to organise machines within a single AZ.

Machines in AZ A P1 and P2 are protected from hardware failures by being in different redundancy zones within the AZ; AZ A P1 and AZ B P1 are protected because they are in separate AZs (data centers).
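As a sketch (names illustrative, shape matching the example quoted earlier), two MachineSets in different AZs can both target partition 1 of the same group and still sit in independent failure domains:

```yaml
# MachineSet for us-east-1a
placement:
  availabilityZone: us-east-1a
  group:
    name: my-partition-placement-group
  partitionNumber: 1
---
# MachineSet for us-east-1b
placement:
  availabilityZone: us-east-1b
  group:
    name: my-partition-placement-group
  partitionNumber: 1
```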
Ok I see. Makes sense, thank you.
> To ensure we leave user accounts clean, we should remove any placement group that is no longer required.
> If a user deletes a managed `AWSPlacementGroup`, we will remove it from AWS provided it has no instances registered
> within it. To enable this clean up logic a finalizer will need to be added to each `AWSPlacementGroup` to allow the
> controller to remove the instance.
By instance here you mean the "physical" instance of the AWSPlacementGroup, not an AWS EC2 instance within the group, correct?
I mean the placement group on AWS, will clarify
damdo
left a comment
Thanks for the answers and for including the suggestions @JoelSpeed
/lgtm
Fedosin
left a comment
/lgtm
Force-pushed da5f106 to bff8f1f.
I've squashed the commits here so that we can get this merged.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Fedosin, sdodson
@JoelSpeed: all tests passed!
This PR adds a new enhancement describing how we will integrate support for AWS Placement Groups into the Machine API.
Notably, as we have a specific customer requirement, we are not planning to introduce this into the installer initially as this is not required by the customer. This has been left as a future implementation and is noted within the enhancement.