Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement upgrade procedure for SMI TrafficSplit v1alpha2 #3962

Closed
stefanprodan opened this issue Jan 22, 2020 · 9 comments · Fixed by #6003
Closed

Implement upgrade procedure for SMI TrafficSplit v1alpha2 #3962

stefanprodan opened this issue Jan 22, 2020 · 9 comments · Fixed by #6003
Assignees
Milestone

Comments

@stefanprodan
Copy link

stefanprodan commented Jan 22, 2020

The upgrade from TrafficSplit v1alpha1 to v1alpha2 can't be done in place by Kubernetes since the weight type changed from resource.Quantity to int.

In order to ship #3951 the Linkerd CLI could implement the following upgrade procedure:

  • fetch all TrafficSplit objects v1alpha1
  • remove creationTimestamp, generation, resourceVersion, selfLink and uid from the TrafficSplit objects
  • change the version to v1alpha2
  • transform the weights field from resource.Quantity (YAML string) to int (YAML number)
  • apply the TrafficSplit v1alpha2 CRD on the cluster (at this point the old Linkerd controller with enter in crash loop)
  • apply all TrafficSplit v1alpha2 objects
  • proceed with the control plane upgrade

PS. The upgrade procedure describe here is just a wild guess, I haven't tested/implemented this so some experimentation with #3951 is required to validate my assumptions.

@adleong
Copy link
Member

adleong commented Jan 29, 2020

Linkerd does not manage traffic split resources. In other words, it reads them but doesn't create, update, or delete them. Therefore, I don't think Linkerd should upgrade traffic split resources in the cluster from v1alpha1 to v1alpha2: that should be the operator's responsibility.

I think we probably need a conversion webhook that can serve v1alpha2 traffic splits which are stored as v1alpha1 traffic splits. In other words:

  • The Linkerd manifests should include a conversion webhook for serving v1alpha2 traffic splits
  • The linkerd check command should warn if you have any trafficsplits stored as v1alpha1 and recommend that you migrate them
  • In a later version:
    • Stop serving v1alpha1 traffic splits and remove the conversion webhook

@stefanprodan
Copy link
Author

A conversion webhook is the best option but this implies switching to apiextensions.k8s.io/v1 see servicemeshinterface/smi-sdk-go#37

@adleong
Copy link
Member

adleong commented Feb 20, 2020

I spent a bunch of time thinking about this and coming up with what I think the next steps should be here and wrote them in a comment on this issue. However, it appears that the comment never posted. 😭

I'll try to remember what I came up with...

In order to accomplish this migration without downtime, I think we need to go through a deprecation cycle:

phase 1:

  • Linkerd should create both v1alpha1 and v1alpha2 traffic split clients. All of Linkerd's trafficsplit lookups are first attempted with the v1alpha2 client. If parsing of the object fails (because it was stored in the v1alpha1 format) then we catch that error and try again with the v1alpha1 client. In this case we internally convert the v1alpha1 object into a v1alpha2 object for internal use. Note that we do not write the object back to kubernetes so no modification of the stored object takes place.
  • Linkerd check should look up all traffic splits and warn if any of them are in the v1alpha1 format. It should advise users to upgrade them to the v1alpha2 format.

phase 2:

  • Attempting to do a linkerd upgrade to this version should fail if any traffic splits are still in v1alpha1 format
  • After upgrade, all lookups are done with the v1alpha2 client.
  • The v1alpha1 client can be removed.

@adleong
Copy link
Member

adleong commented Apr 7, 2020

I implemented phase 1 as described here. For reference the code is available at https://github.com/linkerd/linkerd2/compare/alex/ts-v1alpha2?expand=1

Unfortunately, I don't think this approach will work. Linkerd fails to start because the TrafficSplit v1alpha3 and v1alpha2 clients fail to sync their caches. This is because when they attempt to fetch TrafficSplits at versions v1alpha2 or v1alpha3 respectively, the traffic split that was previously stored at version v1alpha1 is returned at the newer version. However, this TrafficSplit object is still in the v1alpha1 format and cannot be parsed by these clients. Thus the caches can't be synced and Linkerd fails to start.

This is a fundamental property of how CRD versioning and client-go work. If a CRD exists that has been stored at a previous (backwards incompatible) version, newer informers cannot be started.

I really don't see any path forward here other than to use conversion webhooks. Conversion webhooks became stable in Kuberenetes 1.16 so we must wait until 1.16 is Linkerd's minimum Kubernetes version before we can upgrade onto the newer version of trafficsplit.

@stale
Copy link

stale bot commented Jul 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 6, 2020
@adleong
Copy link
Member

adleong commented Jul 6, 2020

Another option would be to abandon the typed code-generated clients for TrafficSplit and instead use an untyped dynamic client (as we do for the Link CRD, see: #4710). This would allow us to write custom deserialization logic that can account for the weight field being either an int or a Quantity.

@sbvitok
Copy link

sbvitok commented Sep 9, 2020

any news?

@stale
Copy link

stale bot commented Dec 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@SVronskiy
Copy link

SVronskiy commented Apr 6, 2021

Hello!

Disabling of using header l5d-dst-override in linkerd 2.9 makes it impossible to get in to green/canary service by header for running complience tests.

We definetly need some alternative way how to get in to the green/canary service for testing it, so implementing TrafficSplit.specs.smi-spec.io/v1alpha4 is requred for running A/B, B/G, Canary complience tests.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants