refactor: separate CAPA resources from cluster by sethp-nr · Pull Request #706 · kubernetes-sigs/cluster-api-provider-aws

sethp-nr · 2019-04-05T17:59:40Z

This patch tries to separate the ownership concerns between CAPA and the
in-cluster cloud integration.

Flagging it as a WIP for now, it seems to work for me but I'd like to discuss the approach in more detail at next week's meeting.

What this PR does / why we need it:

As it stands, the in-cluster AWS integration tries to add security group rules when it reconciles a load balancer that CAPA then removes.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #704

Special notes for your reviewer:

This is fairly dependent on #697 or #701, else CAPA will crash trying to read the rules created by the LB integration. We could decouple them by adding a check before reading the rules from an "owned" security group, but I'm expecting one of those will land before this PR anyway.

Release note:

BREAKING CHANGE: improve handling of resources shared with the in-cluster AWS integration. See docs/upgrade-to-0.3.0.md for migration details.

k8s-ci-robot · 2019-04-05T17:59:53Z

Hi @sethp-nr. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

randomvariable · 2019-04-08T17:30:07Z

-Resources that are managed by the controllers/actuators should be tagged with: `kubernetes.io/cluster/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws=true`. The latter tag being used to differentiate from resources managed by other tools/components that make use of the common tag.
+Resources handled by these components fall into one of three categories:
+
+1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.


Typo? sigs.k8s.io/cluster-api-provider/
should be sigs.k8s.io/cluster-api-provider-aws/ throughout I think?

I would think that we could shorten this down to sigs.k8s.io/cluster-api/<name or id>=owned, since there should not be a case where a resource is owned by two separate providers.

As mentioned in the office hours sigs.k8s.io/cluster-api-provider-aws/managed=true becomes redundant here and can be removed.

I have not done a great job with consistency. The doc changes say one thing, and the code another 😄In code, I went with sigs.k8s.io/cluster-api-provider-aws/cluster/<id>, as a riff on the existing prefix: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/pkg/cloud/aws/tags/types.go#L87

Currently, that's used to define both sigs.k8s.io/cluster-api-provider-aws/managed (which this PR would supersede) and sigs.k8s.io/cluster-api-provider-aws/role, so adding a /cluster/ felt fairly natural.

That said, I'm happy to adjust the tagging scheme to be sigs.k8s.io/cluster-api/{role,cluster} if that makes more sense to y'all.

I think cluster-api-provider-aws is fine. It might make sense to standardize more as we refine the v1alpha2 and beyond milestones for cluster-api proper.

randomvariable · 2019-04-08T17:31:55Z

Happy with changing the tag over unless there's a good reason not to.

Technically, should we add a breaking change release note as existing resources will be missing tags?

detiber

I'm wondering if it would make sense to have some type of a transition plan to avoid orphaning clusters that where created with <= 0.2.0. I don't think it's a hard requirement at this point, but it would be nice to have.

detiber · 2019-04-08T17:57:07Z

-Resources that are managed by the controllers/actuators should be tagged with: `kubernetes.io/cluster/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws=true`. The latter tag being used to differentiate from resources managed by other tools/components that make use of the common tag.
+Resources handled by these components fall into one of three categories:
+
+1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.


I would think that we could shorten this down to sigs.k8s.io/cluster-api/<name or id>=owned, since there should not be a case where a resource is owned by two separate providers.

detiber · 2019-04-08T17:58:21Z

-Resources that are managed by the controllers/actuators should be tagged with: `kubernetes.io/cluster/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws=true`. The latter tag being used to differentiate from resources managed by other tools/components that make use of the common tag.
+Resources handled by these components fall into one of three categories:
+
+1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.


As mentioned in the office hours sigs.k8s.io/cluster-api-provider-aws/managed=true becomes redundant here and can be removed.

detiber · 2019-04-08T17:59:07Z

+Resources handled by these components fall into one of three categories:
+
+1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.
+2. Resources whose management is shared with the in-cluster aws cloud provider,, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.


Is there a particular use case for this?

We've found two:

Persistent volume support via EBS (requires tagging the instance)

Security group rules for ELBs (requires tagging the security group)

This PR covers those cases by adding the "cluster-owned" tag back, but I expect there will be more cases that this change misses.

Ah, gotcha. In that case, would it make more sense to apply kubernetes.io/cluster/<name or id>=shared instead? Since the cloud provider would not actually own the resource? It should still be able to query and make use of the resource.

Hmm, maybe? The comment in the cloud provider gives me pause: https://github.com/kubernetes/kubernetes/blob/d7103187a37dcfff79077c80a151e98571487628/pkg/cloudprovider/providers/aws/tags.go#L45-L52

// ResourceLifecycleOwned is the value we use when tagging resources to indicate
// that the resource is considered owned and managed by the cluster,
// and in particular that the lifecycle is tied to the lifecycle of the cluster.
...
// ResourceLifecycleShared is the value we use when tagging resources to indicate
// that the resource is shared between multiple clusters, and should not be destroyed
// if the cluster is destroyed.

And at least as far as I can see, the provider never reads the value of the tag (it only checks for its existence: https://github.com/kubernetes/kubernetes/blob/d7103187a37dcfff79077c80a151e98571487628/pkg/cloudprovider/providers/aws/tags.go#L133 ).

Maybe the thing to do is tag them with a new value, like provided, and PR the cloud provider to adopt those semantics (since it'd just be a const and a comment at this point)?

detiber · 2019-04-08T18:00:46Z

+
+1. Fully-managed resources whose lifecycle is tied to the cluster. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `sigs.k8s.io/cluster-api-provider-aws/managed=true` and the actuator is expected to keep these resources as closely in sync with the spec as possible.
+2. Resources whose management is shared with the in-cluster aws cloud provider,, such as a security group for load balancer ingress rules. These resources should be tagged with `sigs.k8s.io/cluster-api-provider/<name or id>=owned` and `kubernetes.io/cluster/<name or id>=owned`, with the latter being the tag defined by the cloud provider. These resources are create/delete only: that is to say their ongoing management is "handed off" to the cloud provider.
+3. Unmanaged resources that are provided by config (such as a common VPC). These resources should be tagged with neither `sigs.k8s.io/cluster-api-provider-aws/managed=true` nor `sigs.k8s.io/cluster-api-provider/<name or id>=owned`. It is expected that the provider will avoid changing these resources as much as is possible.


I think we could probably support sigs.k8s.io/cluster-api/<name or id>=shared to provide an analog to the cloud-provider tags.

I think that makes sense, though I'm not quite sure what the semantics would be. Currently this case covers user-supplied infrastructure (e.g. we have a VPC that's been around since the dawn of the account that we're deploying into). Whose responsibility would it be to keep the shared tag in sync?

We could adopt the semantics of the cloud-provider here. The shared ownership tag should indicate that it is a cluster resource, but ownership is external. Users (or higher level tooling interacting with cluster-api) would be responsible for managing this tag.

The semantics of the cloud provider are a lot fuzzier than I'd like. It also plays it fairly fast and loose: e.g. if you have no properly tagged subnets it'll fall back to using the current one, and if you've got exactly one security group (tagged or not), that's the one it'll pick.

I'm also feeling a bit of tension around the workflow that a per-cluster shared tag introduces: maybe I've gotten spoiled, but I'm kind of used to just throwing new clusters into my VPC whenever I want 'em 😆I'd feel better about that if we picked a more static tag for shared infra, something like sigs.k8s.io/cluster-api-provider-aws/shared?

sethp-nr · 2019-04-08T20:55:41Z

Oh, thanks for bringing up the backwards-compatibility question! I meant to raise that during today's meeting, but it slipped my mind.

I had to manually clean up some resources to get the code in its existing state to function properly because CAPA wouldn't "see" them, try to re-create them, and get a conflict error.

Technically, it's simple enough to read both sets of tags and write just the new set, but I deferred work on that because:

a) Testing that flow was a bit of a pain, and I wasn't even sure how many CAPA-managed clusters existed in the world that would be affected by the change
b) I didn't know how or when to drop support for the legacy scheme. v1alpha2?

detiber · 2019-04-09T14:54:48Z

+		input.SecurityGroupIDs = append(input.SecurityGroupIDs,
+			s.scope.SecurityGroups()[v1alpha1.SecurityGroupControlPlane].ID,
+			s.scope.SecurityGroups()[v1alpha1.SecurityGroupNode].ID,
+			s.scope.SecurityGroups()[v1alpha1.SecurityGroupLB].ID,


Should we include the LB group on the Control Plane nodes by default? It's probably fine, but it might be good to add a comment that we might want to reconsider this in the future.

vincepri · 2019-04-09T15:32:41Z

 // ReconcileBastion ensures a bastion is created for the cluster
 func (s *Service) ReconcileBastion() error {
-	if s.scope.VPC().IsProvided() {
+	if s.scope.VPC().IsProvided(s.scope.Name()) {


Is there any way to avoid the repetition here?

What did you have in mind? This happened because the "do I own this?" question now implicitly depends on the name of the cluster, and the VPCSpec doesn't know its own cluster name.

detiber · 2019-04-10T16:13:21Z

/ok-to-test

ashish-amarnath · 2019-04-16T16:04:58Z

 	SecurityGroupControlPlane = SecurityGroupRole("controlplane")
+
+	// SecurityGroupLB defines a container for the cloud provider to inject its load balancer ingress rules
+	SecurityGroupLB = SecurityGroupRole("lb")


suggest creating package constants for lb, controlplane, node and bastion. I'll happy to take that as a follow-up PR.

I started down this path, but I think it's large enough and self-contained enough to make more sense as a separate PR

detiber · 2019-04-17T18:52:14Z

/lgtm
/assign @vincepri

…der-aws into refactor/aws-resource-ownership

Try and please the linter.

Use gofmt -s

Previously, these would only be applied on create. Try and keep them in sync, even on running clusters.

Expand the notion of "what security groups should exist" to span multiple ENIs.

…der-aws into refactor/aws-resource-ownership

sethp-nr · 2019-04-29T23:41:47Z

Here's the steps I followed to upgrade a 0.2.0 cluster:

kubectl scale statefulset -n aws-provider-system aws-provider-controller-manager --replicas=0
AWS_PROFILE=my-profile clusterawsadm migrate -n cluster-name 0.3.0
Update the image for the aws-provider-controller-manager to point to something with this fix
kubectl scale statefulset -n aws-provider-system aws-provider-controller-manager --replicas=1
Wait ~2 minutes for the security group changes to all settle
kubectl exec -n kube-system -it kube-controller-manager-ip-10-11-0-233.ec2.internal -- sh -c 'kill 1' as a workaround for AWS cloud provider caches security group tags forever kubernetes/kubernetes#77019

chuckha · 2019-05-09T20:21:19Z

/lgtm

/assign @detiber

detiber · 2019-05-09T21:03:28Z

/approve

k8s-ci-robot · 2019-05-09T21:03:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: detiber, sethp-nr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [detiber]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 5, 2019

k8s-ci-robot requested review from ashish-amarnath and krousey April 5, 2019 17:59

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 5, 2019

randomvariable reviewed Apr 8, 2019

View reviewed changes

detiber reviewed Apr 8, 2019

View reviewed changes

Comment thread pkg/apis/awsprovider/v1alpha1/types.go Outdated

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 8, 2019

detiber reviewed Apr 9, 2019

View reviewed changes

vincepri reviewed Apr 9, 2019

View reviewed changes

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 10, 2019

justaugustus mentioned this pull request Apr 10, 2019

[WIP] Add support for an existing resource group kubernetes-sigs/cluster-api-provider-azure#174

Closed

sethp-nr changed the title ~~WIP: refactor: separate CAPA resources from cluster~~ refactor: separate CAPA resources from cluster Apr 12, 2019

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 12, 2019

ashish-amarnath reviewed Apr 16, 2019

View reviewed changes

Comment thread pkg/apis/awsprovider/v1alpha1/awsclusterproviderconfig_types.go Outdated

ashish-amarnath reviewed Apr 16, 2019

View reviewed changes

k8s-ci-robot assigned detiber and vincepri Apr 17, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2019

rudoi mentioned this pull request Apr 22, 2019

AWS LoadBalancer doesn't like multiple Security Groups on instances #729

Closed

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2019

Andrew Rudoi added 2 commits April 23, 2019 13:18

chore: add new deps for migration tool

f4532f5

feat: add 0.3.0 migration tool

d7f3d48

sethp-nr force-pushed the refactor/aws-resource-ownership branch from d417571 to d7f3d48 Compare April 23, 2019 20:18

k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 23, 2019

sethp-nr added 5 commits April 23, 2019 13:27

Merge branch 'master' of github.com:kubernetes-sigs/cluster-api-provi…

23c58cf

…der-aws into refactor/aws-resource-ownership

chore: preallocate arns slice

46e103d

Try and please the linter.

fix: simplify code

d21ae49

Use gofmt -s

fix: actuate core security groups

d4808f4

Previously, these would only be applied on create. Try and keep them in sync, even on running clusters.

fix: reconcile core security groups across ENIs

0a1739c

Expand the notion of "what security groups should exist" to span multiple ENIs.

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2019

Merge branch 'master' of github.com:kubernetes-sigs/cluster-api-provi…

96c19b6

…der-aws into refactor/aws-resource-ownership

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2019

sethp-nr and others added 5 commits April 29, 2019 16:57

fix: preallocate slice as per the linter

49b6b90

chore: paginate migration

e28b52a

chore: use var for maxARNs

631ea66

fix: clean up end check

8a68b1d

Create upgrade-to-0.3.0.md

8ad92e9

detiber reviewed May 1, 2019

View reviewed changes

Comment thread Gopkg.toml Outdated

fix: relax aws-sdk-go version constraint

c10318d

k8s-ci-robot assigned chuckha May 9, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 9, 2019

k8s-ci-robot merged commit 2c093fd into kubernetes-sigs:master May 9, 2019

sethp-nr deleted the refactor/aws-resource-ownership branch May 10, 2019 16:39

rudoi mentioned this pull request May 30, 2019

REQUEST: New membership for rudoi kubernetes/org#862

Closed

6 tasks

detiber mentioned this pull request Dec 30, 2019

⚠️ ELB uses separate security group #1456

Merged

Conversation

sethp-nr commented Apr 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Apr 5, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

randomvariable commented Apr 8, 2019

Uh oh!

detiber left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sethp-nr commented Apr 8, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

detiber commented Apr 10, 2019

Uh oh!

Uh oh!

ashish-amarnath Apr 16, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

detiber commented Apr 17, 2019

Uh oh!

sethp-nr commented Apr 29, 2019

Uh oh!

Uh oh!

chuckha commented May 9, 2019

Uh oh!

detiber commented May 9, 2019

Uh oh!

k8s-ci-robot commented May 9, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

sethp-nr commented Apr 5, 2019 •

edited

Loading

ashish-amarnath Apr 16, 2019 •

edited

Loading