Namespace cannot be deleted after upgrading antrea from 0.4.0 to 0.5.0-dev #494

Closed
tnqn opened this issue Mar 11, 2020 · 8 comments
Labels
area/provider/azure Issues or PRs related to azure provider. kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@tnqn
Member

tnqn commented Mar 11, 2020

Describe the bug
After upgrading my cluster from 0.4.0 to 0.5.0-dev, no namespace can be deleted. The namespace status shows the errors below:

status:
  conditions:
  - lastTransitionTime: "2020-03-11T03:32:45Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
      complete list of server APIs: system.antrea.tanzu.vmware.com/v1beta1: the server
      is currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-03-11T03:32:45Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-03-11T03:32:45Z"
    message: All content successfully deleted
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  phase: Terminating

This was most likely introduced by #356, which removed v1beta1.system.antrea.tanzu.vmware.com, so antrea-controller no longer serves that API. The APIService was not unregistered, however, and the stale registration blocks all namespace deletion.
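
On a cluster that has already been upgraded, the stale registration can be located by hand. The following is a minimal sketch, assuming `kubectl` access to the cluster. The helper name `unavailable_apiservices` is hypothetical, and the sketch relies on the default `kubectl get apiservice` column order (NAME, SERVICE, AVAILABLE, AGE).

```shell
# Hypothetical helper: print the names of APIServices whose AVAILABLE
# column is anything other than "True".
unavailable_apiservices() {
  kubectl get apiservice --no-headers | awk '$3 != "True" { print $1 }'
}
```

On an affected cluster, `unavailable_apiservices` should list `v1beta1.system.antrea.tanzu.vmware.com`. Removing it with `kubectl delete apiservice v1beta1.system.antrea.tanzu.vmware.com` should let discovery succeed again, so that Terminating namespaces can finish deleting.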

To Reproduce

  1. Deploy antrea v0.4.0
  2. Upgrade to master
  3. Create and delete a namespace
    The namespace stays in the Terminating phase forever.

Expected
Upgrading antrea shouldn't cause this error.
Either clear upgrade instructions should be provided, or applying the newer yaml should be compatible.

Actual behavior
After upgrading antrea, namespaces in the K8s cluster can no longer be deleted.

Versions:

  • Antrea version (Docker image tag): v0.4.0 -> v0.5.0-dev-25cf655
  • Kubernetes version: v1.16.2
@tnqn tnqn added bug priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 11, 2020
@tnqn tnqn added this to the Antrea v0.5.0 release milestone Mar 11, 2020
@tnqn
Member Author

tnqn commented Mar 11, 2020

Applying a newer version of the yaml is the most common way to upgrade, but without extra arguments this kind of issue occurs whenever resources are removed from the yaml.
To clean up resources that are no longer needed when applying yaml, kubectl provides the "--prune" argument.

--prune=false: Automatically delete resource objects, including the uninitialized ones, that do not 
appear in the configs and are created by either apply or create --save-config. Should be used with either -l or --all.

However, APIService is not in the default prune whitelist, so we have to specify which resources need to be cleaned up:

kubectl apply -f antrea-0.5.0.yaml --prune -l app=antrea --prune-whitelist=apiregistration.k8s.io/v1/APIService

I think we could add this instruction to the release notes and to a new section in getting-started.md after "Installation".
What do you think? @jianjuns @antoninbas @salv-orlando @weiqiangt

@jianjuns
Contributor

Documenting the solution in 0.5.0 sounds good to me.
Not sure how to handle this with AKS Engine. Let us discuss with cloud folks.

@McCodeman McCodeman added area/provider/azure Issues or PRs related to azure provider. kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Mar 11, 2020
@antoninbas
Contributor

We could also take care of this internally in the init phase of the Antrea controller through the apiserver client: delete all the apiregistration.k8s.io/v1/APIService resources with the app=antrea label which we know are no longer required. Otherwise, this may create issues for people who upgrade without reading the documentation (not to mention we have to keep this around in the documentation indefinitely).
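
The init-time cleanup proposed here could look roughly like the following, shown as a shell sketch rather than actual controller code. `STALE_APISERVICES` and `cleanup_stale_apiservices` are hypothetical names; the `app=antrea` label and the APIService name are taken from this thread.

```shell
# Space-separated APIService names known to be obsolete (hypothetical list;
# the single entry is the one reported in this issue).
STALE_APISERVICES="v1beta1.system.antrea.tanzu.vmware.com"

# Delete every app=antrea APIService whose name appears in STALE_APISERVICES.
cleanup_stale_apiservices() {
  for obj in $(kubectl get apiservice -l app=antrea -o name); do
    name="${obj#*/}"  # strip the "<resource>.<group>/" prefix printed by -o name
    case " $STALE_APISERVICES " in
      *" $name "*) kubectl delete apiservice "$name" ;;
    esac
  done
}
```

Matching against an explicit list, rather than deleting everything with the label, leaves APIServices that are still in use untouched.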

@tnqn
Member Author

tnqn commented Mar 12, 2020

@antoninbas it would require granting more permissions to the antrea-controller account, and I'm not sure it's worth it (the permission would also need to be kept indefinitely to guarantee that users can upgrade from 0.4.0). If we set aside the long-term possibility that we may have to deprecate some resources in the future, another option is to keep a dummy API so that the registration keeps working, if this group "system.antrea" is likely to be used again soon.
But I think we will need users to read the upgrade/release doc sooner or later, e.g. https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#urgent-upgrade-notes.
Of course, we should be very careful when adding resources and making API changes to avoid this.

@antoninbas
Contributor

I did consider the dummy API case. If the goal is to ultimately remove "system.antrea", then now is probably the right time to do it, as the number of users is bound to grow (hopefully) and it will just mean more users will run into this issue when we eventually remove it.

We should be more careful in the future, as the impact of this issue is very large IMO. With the antctl framework refactor, it seems we are no longer using apiserver aggregation, at least for now?

Finally, will we run into a similar issue if we decide to update the version number for one of our API services (e.g. v1beta1.clusterinformation.antrea.tanzu.vmware.com -> v1beta2.clusterinformation.antrea.tanzu.vmware.com / v1.clusterinformation.antrea.tanzu.vmware.com)?

@tnqn
Member Author

tnqn commented Mar 13, 2020

@antoninbas we still need apiserver aggregation for the networking.antrea group; system.antrea was removed because system info is now retrieved from a CRD.
Your concern about API versions is valid. That's also why I thought we may have to ask users to read the upgrade/release doc and remove specific resources when their existence would cause trouble.

@antoninbas
Contributor

I am a bit concerned with always requiring --prune -l app=antrea --prune-whitelist=apiregistration.k8s.io/v1/APIService from now on. That makes Antrea more "complicated" to upgrade, compared to say Calico: https://docs.projectcalico.org/v3.5/getting-started/kubernetes/upgrade/upgrade and I wonder if we will get support issues from users who didn't go through the docs carefully and assumed they could do a simple apply. But maybe it is acceptable until we release Antrea 1.0, at which time we can drop the --prune in our upgrade instructions to subsequent minor releases. Maybe @McCodeman has some thoughts.

It seems that K8s is keeping old API versions around (e.g. there is both a v1.networking.k8s.io and a v1beta1.networking.k8s.io registered), although that may be for other reasons (it seems they do not handle object conversion from old version to the new one automatically yet). That's a bit different in their case, because I imagine that even if they decide to remove an API group / version for some reason between 2 releases, the API service will be removed altogether and it will not block namespace deletion.

@tnqn
Member Author

tnqn commented Mar 25, 2020

Fixed by #527
