Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ support ability to move custom objects in clusterctl #3337

Merged
merged 1 commit into from
Aug 10, 2020

Conversation

rsmitty
Copy link
Contributor

@rsmitty rsmitty commented Jul 14, 2020

This PR will add the ability for anything in clusterctl's object graph
that is part of a CRD with the label clusterctl.cluster.x-k8s.io/move: "" to also be eligible for moving.
This allows for providers to explicitly tag their own custom resources
and ensure that they're also processed as part of clusterctl move

Signed-off-by: Spencer Smith [email protected]

Fixes #3329

@k8s-ci-robot k8s-ci-robot requested review from benmoss and vincepri July 14, 2020 20:32
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 14, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @rsmitty. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 14, 2020
@rsmitty
Copy link
Contributor Author

rsmitty commented Jul 14, 2020

@wfernandes @vincepri @fabriziopandini I'm still working through some tests and validating locally, but here's a first cut for you to look at on the "clusterctl move" feature we talked about yesterday.

@vincepri
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 14, 2020
@rsmitty rsmitty force-pushed the move branch 2 times, most recently from 5262434 to d92bc61 Compare July 14, 2020 23:24
@vincepri
Copy link
Member

/assign @fabriziopandini @wfernandes

@vincepri
Copy link
Member

/milestone v0.3.x

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rsmitty thanks for this new version of this PR!
The PR looks good apart from some nits.

However, there are two points I would like to clarify with you:

  1. It is possible that the external CRD is defining global objects? Because if this is the case there is a set of considerations that should be addressed: What happens if they are used by two namespaces? Should we leave them in the original cluster? What happens if they get moved two times?
  2. What is the contract around moving objects which are not related to Cluster API provider? What happens if we are moving an object while it is being reconciled?

cmd/clusterctl/client/cluster/mover_test.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
@rsmitty
Copy link
Contributor Author

rsmitty commented Jul 16, 2020

@fabriziopandini I've fixed all the naming nits in an amended commit. Gonna resolve those.

To talk about questions:

It is possible that the external CRD is defining global objects? 
Because if this is the case there is a set of considerations that should be addressed: 
What happens if they are used by two namespaces? Should we leave them in the original cluster? 
What happens if they get moved two times?

It's definitely possible that the external CRD is global. That's what we're doing. Servers and ServerClasses are cluster-wide and can be associated with one or more clusters. We handle setting "in-use" labels on these resources to keep them from getting used my clusters at the same time. That said, I think we agreed the other day that cluster-wide objects shouldn't get deleted from the source cluster. I'll look into making that happen, as it seems like the safest option.

Getting moved two times I don't think should happen unless I'm misunderstanding. We do checks to de-dup the nodes in the object graph.

What is the contract around moving objects which are not related to Cluster API provider? 
What happens if we are moving an object while it is being reconciled?

Great question and I'm not sure I have a good answer. Do we have any kind of way to know a resource is being actively reconciled? If not, I'm not sure we can provide anything other than best effort moving of these resources.

Copy link
Contributor

@wfernandes wfernandes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rsmitty! I noticed that when running the tests locally I would see some interesting failures. In clusterctl/client/cluster when I run go test ./... -count=3 I see failures such as

--- FAIL: Test_getMoveSequence (0.01s)
    --- FAIL: Test_getMoveSequence/A_ClusterResourceSet_applied_to_a_cluster (0.00s)
        mover_test.go:447:
            Expected
                <[]string | len:1, cap:1>: [
                    "addons.cluster.x-k8s.io/v1alpha3, Kind=ClusterResourceSet, ns1/crs1",
                ]
            to consist of
                <[]interface {} | len:1, cap:1>: [
                    [
                        "cluster.x-k8s.io/v1alpha3, Kind=Cluster, ns1/cluster1",
                        "addons.cluster.x-k8s.io/v1alpha3, Kind=ClusterResourceSet, ns1/crs1",
                    ],
                ]
            the missing elements were
                <[]interface {} | len:1, cap:1>: [
                    "cluster.x-k8s.io/v1alpha3, Kind=Cluster, ns1/cluster1",
                ]
--- FAIL: Test_objectMover_move (0.02s)
    --- FAIL: Test_objectMover_move/A_ClusterResourceSet_applied_to_a_cluster (0.00s)
        mover_test.go:506: ns1/cluster1 not deleted in source cluster
        mover_test.go:510: error = Object 'Kind' is missing in 'unstructured object has no kind' when checking for ns1/cluster1 deleted in source cluster
        mover_test.go:506: ns1/cluster1 not deleted in source cluster
        mover_test.go:506: ns1/cluster1-ca not deleted in source cluster
        mover_test.go:506: ns1/cluster1-kubeconfig not deleted in source cluster
FAIL
FAIL    sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster   1.099s
FAIL

Not sure what's going on here though.

continue
}
if !apierrors.IsNotFound(err) {
if oFrom.GetNamespace() != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if oFrom.GetNamespace() != "" {
if err== nil && oFrom.GetNamespace() != ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can quite do this. I tried it at first. If err is nil and there's no namespace for the object, it won't meet this condition and will fail in the condition below with the error mover_test.go:508: error = <nil> when checking for /externalTest deleted in source cluster

@rsmitty
Copy link
Contributor Author

rsmitty commented Jul 16, 2020

@wfernandes I know, I've been seeing that error in local testing too. It's weird because it's flaky. I can run it multiple times and get diff behavior.

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rsmitty thanks for updating for global object management.

Getting moved two times I don't think should happen unless I'm misunderstanding.

Sorry for my poor explanation. The use case is:

  • I have a source cluster with CAPI objects in ns1 and ns2, both referencing a global object Server1
  • move --namespace ns1; Server1 get copied in target cluster, but it remains in the source Cluster as well (Server1 now exists both in the source and in the target )
  • Server1 is changed in source (or in target), so the two version of Server1 are dis-aligned
  • move --namespace ns1; Server1 should be copied in target cluster, but the version are dis-aligned.

How do we manage conflict resolution?

cmd/clusterctl/client/cluster/mover.go Show resolved Hide resolved
if err != nil {
return err
}

// Discovery the object graph for the selected types:
// - Nodes are defined the Kubernetes objects (Clusters, Machines etc.) identified during the discovery process.
// - Edges are derived by the OwnerReferences between nodes.
if err := objectGraph.Discovery(namespace, types); err != nil {
if err := objectGraph.Discovery(namespace); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a topic for a follow-up PR/something to be tracked in a issue, but once we determine the external CRD to move, it would be great to check if they exist on the target cluster (we are already doing this for the providers in checkTargetProviders)

cmd/clusterctl/client/cluster/mover_test.go Show resolved Hide resolved
cmd/clusterctl/client/cluster/mover.go Show resolved Hide resolved
cmd/clusterctl/client/cluster/objectgraph.go Outdated Show resolved Hide resolved
Comment on lines 228 to 249
if _, ok := crd.Labels[clusterctlv1.ClusterctlMoveLabelName]; ok {
o.types[getKindAPIString(typeMeta)].forceMove = true
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit
move this above L227 and change it into

Suggested change
if _, ok := crd.Labels[clusterctlv1.ClusterctlMoveLabelName]; ok {
o.types[getKindAPIString(typeMeta)].forceMove = true
}
if _, ok := crd.Labels[clusterctlv1.ClusterctlMoveLabelName]; ok {
typeMeta.forceMove = true
}

cmd/clusterctl/client/cluster/objectgraph_test.go Outdated Show resolved Hide resolved
@rsmitty
Copy link
Contributor Author

rsmitty commented Jul 17, 2020

Updated to address @fabriziopandini's nits. Still working on the test flakiness, as well as pondering on your question about conflict resolution. Not sure we've got a good answer for that yet.

@rsmitty
Copy link
Contributor Author

rsmitty commented Jul 17, 2020

I'm not sure there's a great way to manage divergence between two global resources in two different clusters. We do, at least, already check to see if a resource already exists in the target cluster during the move step. Maybe we could refuse to update global resources during that check?

But then again, it may be that I'm misunderstanding the longer term goals of the clusterctl move command. We're really wanting to use it specifically for bootstrap => initial management plane porting, as well as the occasional need to move the management plane to new servers or something. It's something I've not really thought about happening regularly and I've been assuming the source cluster essentially gets torn down after the move happens, so divergence wouldn't really matter.

@vincepri
Copy link
Member

@rsmitty The long term vision for move would be that it's going to move the entire management cluster all the time, and you wouldn't need to init before move is initiated (which is a pre-req today). These changes would greatly simplify move supporting more than just cluster api native resources, and cleanup a lot of the code we have today.

Hope that helps!

@fabriziopandini
Copy link
Member

fabriziopandini commented Aug 3, 2020

@rsmitty @vincepri I have opened #3354 to discuss the log term vision for clusterctl move
In the meantime, wrt to corner cases related to moving global resources, I think we should just define an approach and document it; IMO, the only use case we should ensure is the pivot to an empty cluster, and everything else should be supported at best effort for now.

In the future, a possible solution to properly handle external resources (not CAPI resources) is to implement a set of hooks/a plugin mechanism so clusterctl can trigger resource specific tasks, but this requires some more design

@vincepri
Copy link
Member

vincepri commented Aug 3, 2020

/milestone v0.3.9

@rsmitty
Copy link
Contributor Author

rsmitty commented Aug 4, 2020

@fabriziopandini I think that makes a lot of sense and I'd like to see this landed as soon as we're all comfortable with it. I added a note in the docs that specifically point out that global resources are copied, but not deleted from the source cluster.

@rsmitty rsmitty force-pushed the move branch 2 times, most recently from 1256057 to 76e813f Compare August 4, 2020 15:59
@rsmitty
Copy link
Contributor Author

rsmitty commented Aug 5, 2020

There's something flaky going on with the mover_test and the ClusterResourceSet that I can not figure out for the life of me. I don't think it's necessarily an issue with this PR, but I'd love another set of eyes on it if you get some free time @fabriziopandini.

@fabriziopandini
Copy link
Member

I'm trying to reproduce errors locally

@fabriziopandini
Copy link
Member

@rsmitty I have sent #3463 to fix CRS test flakes

This PR will add the ability for anything in clusterctl's object graph
that is part of a CRD with the label `clusterctl.cluster.x-k8s.io/move: ""` to also be eligible for moving.
This allows for providers to explicitly tag their own custom resources
and ensure that they're also processed as part of `clusterctl move`

Signed-off-by: Spencer Smith <[email protected]>
@rsmitty
Copy link
Contributor Author

rsmitty commented Aug 6, 2020

Cool, looks like we're passing tests in CI now (as well as locally). Thanks @fabriziopandini! Hopefully we're ready to go with this PR now.

@fabriziopandini
Copy link
Member

/lgtm
over to @vincepri @wfernandes for a final check

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2020
@wfernandes
Copy link
Contributor

I ran all the tests locally and nothing else stood out.
/approve
/lgtm

@wfernandes
Copy link
Contributor

Unfortunately, I don't have approver access. I'll defer to @vincepri 🙂

@fabriziopandini
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, wfernandes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit ef5a9e7 into kubernetes-sigs:master Aug 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support moving custom resources in clusterctl
5 participants