refactor: separate CAPA resources from cluster #706
Conversation
Hi @sethp-nr. Thanks for your PR. I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
> Resources that are managed by the controllers/actuators should be tagged with: `kubernetes.io/cluster/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws=true`. The latter tag being used to differentiate from resources managed by other tools/components that make use of the common tag.
> Resources handled by these components fall into one of three categories:
>
> 1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.
Typo? sigs.k8s.io/cluster-api-provider/
should be sigs.k8s.io/cluster-api-provider-aws/ throughout I think?
I would think that we could shorten this down to sigs.k8s.io/cluster-api/<name or id>=owned, since there should not be a case where a resource is owned by two separate providers.
As mentioned in the office hours, `sigs.k8s.io/cluster-api-provider-aws/managed=true` becomes redundant here and can be removed.
I have not done a great job with consistency. The doc changes say one thing, and the code another 😄 In code, I went with `sigs.k8s.io/cluster-api-provider-aws/cluster/<id>`, as a riff on the existing prefix: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/pkg/cloud/aws/tags/types.go#L87
Currently, that's used to define both sigs.k8s.io/cluster-api-provider-aws/managed (which this PR would supersede) and sigs.k8s.io/cluster-api-provider-aws/role, so adding a /cluster/ felt fairly natural.
That said, I'm happy to adjust the tagging scheme to be sigs.k8s.io/cluster-api/{role,cluster} if that makes more sense to y'all.
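For illustration, here's a minimal sketch of how a per-cluster key could be derived from the existing prefix; the `ClusterKey` helper and the constant name are hypothetical, not the actual code in this PR:

```go
package tags

import "fmt"

// NameAWSProviderPrefix mirrors the existing CAPA tag prefix referenced above;
// ClusterKey is a hypothetical helper sketching the per-cluster scheme under
// discussion, e.g. "sigs.k8s.io/cluster-api-provider-aws/cluster/my-cluster".
const NameAWSProviderPrefix = "sigs.k8s.io/cluster-api-provider-aws/"

// ClusterKey builds the ownership tag key for a given cluster name or id.
func ClusterKey(nameOrID string) string {
	return fmt.Sprintf("%scluster/%s", NameAWSProviderPrefix, nameOrID)
}
```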
I think cluster-api-provider-aws is fine. It might make sense to standardize more as we refine the v1alpha2 and beyond milestones for cluster-api proper.
Happy with changing the tag over unless there's a good reason not to. Technically, should we add a breaking change release note as existing resources will be missing tags?
detiber left a comment
I'm wondering if it would make sense to have some type of transition plan to avoid orphaning clusters that were created with <= 0.2.0. I don't think it's a hard requirement at this point, but it would be nice to have.
> Resources handled by these components fall into one of three categories:
>
> 1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.
> 2. Resources whose management is shared with the in-cluster aws cloud provider, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.
Is there a particular use case for this?
We've found two:
- Persistent volume support via EBS (requires tagging the instance)
- Security group rules for ELBs (requires tagging the security group)
This PR covers those cases by adding the "cluster-owned" tag back, but I expect there will be more cases that this change misses.
Ah, gotcha. In that case, would it make more sense to apply kubernetes.io/cluster/<name or id>=shared instead? Since the cloud provider would not actually own the resource? It should still be able to query and make use of the resource.
Hmm, maybe? The comment in the cloud provider gives me pause: https://github.com/kubernetes/kubernetes/blob/d7103187a37dcfff79077c80a151e98571487628/pkg/cloudprovider/providers/aws/tags.go#L45-L52
// ResourceLifecycleOwned is the value we use when tagging resources to indicate
// that the resource is considered owned and managed by the cluster,
// and in particular that the lifecycle is tied to the lifecycle of the cluster.
...
// ResourceLifecycleShared is the value we use when tagging resources to indicate
// that the resource is shared between multiple clusters, and should not be destroyed
// if the cluster is destroyed.
And at least as far as I can see, the provider never reads the value of the tag (it only checks for its existence: https://github.com/kubernetes/kubernetes/blob/d7103187a37dcfff79077c80a151e98571487628/pkg/cloudprovider/providers/aws/tags.go#L133 ).
Maybe the thing to do is tag them with a new value, like provided, and PR the cloud provider to adopt those semantics (since it'd just be a const and a comment at this point)?
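To make that proposal concrete, this is roughly the shape it could take on the cloud-provider side; `ResourceLifecycleOwned` and `ResourceLifecycleShared` are quoted above from upstream, while `ResourceLifecycleProvided` is purely the hypothetical new value being floated here:

```go
package aws

// ResourceLifecycle matches the upstream cloud provider's string type for the
// kubernetes.io/cluster/<name> tag value.
type ResourceLifecycle string

const (
	// Owned: the resource's lifecycle is tied to the lifecycle of the cluster.
	ResourceLifecycleOwned ResourceLifecycle = "owned"
	// Shared: the resource is shared between multiple clusters and should not
	// be destroyed when the cluster is destroyed.
	ResourceLifecycleShared ResourceLifecycle = "shared"
	// Provided (hypothetical): the resource was created for the cluster by an
	// external tool (e.g. CAPA) and handed off; the in-cluster provider may
	// use it but should not manage its lifecycle.
	ResourceLifecycleProvided ResourceLifecycle = "provided"
)
```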
> 1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.
> 2. Resources whose management is shared with the in-cluster aws cloud provider, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.
> 3. Unmanaged resources that are provided by config (such as a common VPC). These resources should be tagged with neither `sigs.k8s.io/cluster-api-provider-aws/managed=true` nor `sigs.k8s.io/cluster-api-provider/<name or id>=owned`. It is expected that the provider will avoid changing these resources as much as is possible.
I think we could probably support sigs.k8s.io/cluster-api/<name or id>=shared to provide an analog to the cloud-provider tags.
I think that makes sense, though I'm not quite sure what the semantics would be. Currently this case covers user-supplied infrastructure (e.g. we have a VPC that's been around since the dawn of the account that we're deploying into). Whose responsibility would it be to keep the shared tag in sync?
We could adopt the semantics of the cloud-provider here. The shared ownership tag should indicate that it is a cluster resource, but ownership is external. Users (or higher level tooling interacting with cluster-api) would be responsible for managing this tag.
The semantics of the cloud provider are a lot fuzzier than I'd like. It also plays it fairly fast and loose: e.g. if you have no properly tagged subnets it'll fall back to using the current one, and if you've got exactly one security group (tagged or not), that's the one it'll pick.
I'm also feeling a bit of tension around the workflow that a per-cluster shared tag introduces: maybe I've gotten spoiled, but I'm kind of used to just throwing new clusters into my VPC whenever I want 'em 😆 I'd feel better about that if we picked a more static tag for shared infra, something like `sigs.k8s.io/cluster-api-provider-aws/shared`?
Oh, thanks for bringing up the backwards-compatibility question! I meant to raise that during today's meeting, but it slipped my mind. I had to manually clean up some resources to get the code in its existing state to function properly, because CAPA wouldn't "see" them, would try to re-create them, and would hit a conflict error. Technically, it's simple enough to read both sets of tags and write just the new set, but I deferred work on that because: a) testing that flow was a bit of a pain, and b) I wasn't even sure how many CAPA-managed clusters existed in the world that would be affected by the change
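For anyone retagging existing resources by hand in the meantime, here is a rough sketch of the "write the new tag alongside the old one" step with aws-sdk-go; the resource IDs, cluster name, and exact tag key/value are assumptions for illustration, not part of this PR:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// retag adds the new per-cluster ownership tag to resources that already carry
// the old tags; it deliberately leaves the old tags in place so that both the
// pre- and post-upgrade controllers keep recognizing the resources.
func retag(svc *ec2.EC2, clusterName string, resourceIDs []*string) error {
	_, err := svc.CreateTags(&ec2.CreateTagsInput{
		Resources: resourceIDs,
		Tags: []*ec2.Tag{{
			Key:   aws.String("sigs.k8s.io/cluster-api-provider-aws/cluster/" + clusterName),
			Value: aws.String("owned"),
		}},
	})
	return err
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	// The IDs would normally come from DescribeTags filtered on the old tag set.
	if err := retag(svc, "my-cluster", []*string{aws.String("vpc-0123456789abcdef0")}); err != nil {
		log.Fatal(err)
	}
}
```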
> input.SecurityGroupIDs = append(input.SecurityGroupIDs,
>     s.scope.SecurityGroups()[v1alpha1.SecurityGroupControlPlane].ID,
>     s.scope.SecurityGroups()[v1alpha1.SecurityGroupNode].ID,
>     s.scope.SecurityGroups()[v1alpha1.SecurityGroupLB].ID,
Should we include the LB group on the Control Plane nodes by default? It's probably fine, but it might be good to add a comment that we might want to reconsider this in the future.
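Something along these lines could record that decision next to the append; the wording is just a suggestion, not what this PR currently says:

```go
// The LB security group is attached to control-plane instances so that the
// in-cluster cloud provider can inject load balancer ingress rules that reach
// them; we may want to revisit whether every control-plane node needs it.
input.SecurityGroupIDs = append(input.SecurityGroupIDs,
	s.scope.SecurityGroups()[v1alpha1.SecurityGroupControlPlane].ID,
	s.scope.SecurityGroups()[v1alpha1.SecurityGroupNode].ID,
	s.scope.SecurityGroups()[v1alpha1.SecurityGroupLB].ID,
)
```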
> // ReconcileBastion ensures a bastion is created for the cluster
> func (s *Service) ReconcileBastion() error {
> -    if s.scope.VPC().IsProvided() {
> +    if s.scope.VPC().IsProvided(s.scope.Name()) {
Is there any way to avoid the repetition here?
What did you have in mind? This happened because the "do I own this?" question now implicitly depends on the name of the cluster, and the VPCSpec doesn't know its own cluster name.
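To spell out what changed, here's a trimmed-down sketch of the check; the `Tags` field and the exact key format are assumptions used for illustration:

```go
package v1alpha1

// VPCSpec is reduced here to the fields needed for the sketch.
type VPCSpec struct {
	ID   string
	Tags map[string]string
}

// IsProvided reports whether the VPC was supplied by the user rather than
// created for this cluster. With per-cluster ownership tags, the spec can no
// longer answer that on its own, so the cluster name has to be passed in.
func (v *VPCSpec) IsProvided(clusterName string) bool {
	_, owned := v.Tags["sigs.k8s.io/cluster-api-provider-aws/cluster/"+clusterName]
	return !owned
}
```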
/ok-to-test
> SecurityGroupControlPlane = SecurityGroupRole("controlplane")
>
> // SecurityGroupLB defines a container for the cloud provider to inject its load balancer ingress rules
> SecurityGroupLB = SecurityGroupRole("lb")
Suggest creating package constants for lb, controlplane, node, and bastion. I'd be happy to take that as a follow-up PR.
I started down this path, but I think it's large enough and self-contained enough to make more sense as a separate PR
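For reference, a sketch of what that follow-up could look like; the constant names are guesses, not existing identifiers:

```go
package v1alpha1

// SecurityGroupRole is repeated here only to keep the sketch self-contained.
type SecurityGroupRole string

// Proposed follow-up: define the role strings once so "lb", "controlplane",
// "node" and "bastion" are not repeated as literals at each call site.
const (
	RoleBastion      = "bastion"
	RoleControlPlane = "controlplane"
	RoleNode         = "node"
	RoleLB           = "lb"

	SecurityGroupBastion      = SecurityGroupRole(RoleBastion)
	SecurityGroupControlPlane = SecurityGroupRole(RoleControlPlane)
	SecurityGroupNode         = SecurityGroupRole(RoleNode)
	SecurityGroupLB           = SecurityGroupRole(RoleLB)
)
```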
/lgtm
Force-pushed from d417571 to d7f3d48
…der-aws into refactor/aws-resource-ownership
Try and please the linter.
Use gofmt -s
Previously, these would only be applied on create. Try and keep them in sync, even on running clusters.
Expand the notion of "what security groups should exist" to span multiple ENIs.
…der-aws into refactor/aws-resource-ownership
Here are the steps I followed to upgrade a 0.2.0 cluster:
/lgtm
/assign @detiber
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: detiber, sethp-nr
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This patch tries to separate the ownership concerns between CAPA and the
in-cluster cloud integration.
Flagging it as a WIP for now; it seems to work for me, but I'd like to discuss the approach in more detail at next week's meeting.
What this PR does / why we need it:
As it stands, the in-cluster AWS integration tries to add security group rules when it reconciles a load balancer that CAPA then removes.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #704
Special notes for your reviewer:
This is fairly dependent on #697 or #701, else CAPA will crash trying to read the rules created by the LB integration. We could decouple them by adding a check before reading the rules from an "owned" security group, but I'm expecting one of those will land before this PR anyway.
Release note: