Deleting projects is leaving projects on zombie state #18125

Closed
gbaufake opened this issue Jan 16, 2018 · 46 comments
Assignees: jboyd01
Labels
component/service-catalog kind/bug Categorizes issue or PR as related to a bug. priority/P2

Comments

gbaufake commented Jan 16, 2018

After deleting a project via the OpenShift UI, the project is not actually deleted. Trying again via the oc command gives:

Error from server (Conflict): Operation cannot be fulfilled on namespaces "istio-system": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system

Version

oc v3.7.23
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
openshift v3.7.18
kubernetes v1.7.6+a08f5eeb62

Steps To Reproduce
  1. Create a project
  2. Deploy something in this project
  3. Delete this project via the UI
  4. If it doesn't get deleted, try deleting via the oc command
Current Result
  • Projects stuck in a zombie (Terminating) state
Expected Result
  • Projects are deleted
Additional Information

Screenshots: (screenshot attached)

@php-coder (Contributor)

Is it only reproducible when you remove the project via the UI? In other words, does it go to a zombie state if you remove it from the CLI (oc delete project <project_name>)?

gbaufake (Author) commented Jan 16, 2018

@php-coder

I checked with the oc command and the result is the same as with the UI.

Do you know what the cause of this issue is?

gbaufake changed the title from "Deleting projects via UI is leaving projects on zombie state" to "Deleting projects is leaving projects on zombie state" Jan 16, 2018
@php-coder (Contributor)

@gbaufake No, I don't. I hope that @juanvallejo knows or at least could know who knows :)

Meanwhile, did you check logs? Is there something that could be related to the issue?

juanvallejo (Contributor) commented Jan 16, 2018

Was able to reproduce using a 3.9 client against a 3.9 cluster.
Steps I took:

# create a new project 'deleteme'
$ oc new-project deleteme
Now using project "deleteme" on server "https://127.0.0.1:8443".
...

# deploy an application on that project
$ oc new-app <path/to/app>
--> Found image d5b68e7 (3 weeks old) in image stream ...
...
--> Success
...
    Run 'oc status' to view your app.

# immediately delete project after `oc new-app` finishes running
$ oc delete project deleteme
project "deleteme" deleted

# try deleting project once more
$ oc delete project deleteme
Error from server (Conflict): Operation cannot be fulfilled on namespaces "deleteme": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

# check to see if project can still be listed
$ oc projects
You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
  * deleteme

Using project "deleteme" on server "https://127.0.0.1:8443".

The project is finally deleted after a minute or so, and no longer appears in the output of $ oc projects.

@gbaufake I suspect that maybe one or more resources that are created as part of deploying an application in your project are taking a bit longer than normal to be deleted (or maybe there are a lot of resources to delete in the first place). Since all resources belonging to a project must be deleted before the project itself can be deleted, the project will continue to exist until everything in it is gone.

However, since the project has already been marked for deletion (when you deleted it through the web console), attempting to delete it a second time (as seen in my example above) will produce the (Conflict) error that you are seeing.

Can you confirm that you are no longer able to list the deleted project (through oc projects) after deleting it, and waiting a minute or two?
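
(As a side note, a quick way to confirm the namespace really is still winding down is to inspect it directly; a minimal sketch, assuming cluster-admin access and the project name from my example:)

# Sketch: inspect the stuck namespace directly.
oc get namespace deleteme -o jsonpath='{.status.phase}'   # should print "Terminating"
oc get namespace deleteme -o yaml                         # look at metadata.deletionTimestamp and spec.finalizers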

gbaufake (Author) commented Jan 16, 2018

@juanvallejo
Yes, some projects have been in the "Terminating" state for more than 2 days.

The projects can still be listed with oc.

@juanvallejo (Contributor)

cc @soltysh

@gbaufake any chance you could list the resources that remain in the project while it is in the "Terminating" state? After you get the (Conflict) error message when deleting it, run oc get all on the project. (Feel free to redact anything / just post the resource kinds.)

@hhovsepy (Member)

"oc get all" returns "No resources found." for the "terminating" state projects.

juanvallejo (Contributor) commented Jan 16, 2018

@deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

liggitt (Contributor) commented Jan 16, 2018

@deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

No. oc get all will not list every resource in the project.

Check the controller logs... the namespace controller will indicate the resources it could not delete
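
(If you want to enumerate everything in the namespace yourself rather than relying on oc get all, something along these lines works on newer clients that have api-resources; this is a sketch, not part of the original comment:)

# Sketch: list every namespaced resource type in the project.
# Requires a client with `api-resources` (oc 3.10+ / kubectl 1.11+); <project> is a placeholder.
oc api-resources --verbs=list --namespaced -o name \
  | xargs -n1 oc get -n <project> -o name --ignore-not-found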

@gbaufake (Author)

@liggitt Would service atomic-openshift-master-controllers status -l -f do the job?

soltysh (Contributor) commented Jan 16, 2018

@gbaufake yes, that should do it. If there's nothing in the logs, you can also try increasing the log levels and grepping for namespace_controller.go or namespaced_resources_deleter.go. These come from the namespace controller @liggitt mentioned.
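
(A sketch of what that grep could look like on a systemd-based install, using the unit name quoted earlier in this thread:)

# Follow the controller logs and keep only the namespace-controller lines.
journalctl -u atomic-openshift-master-controllers -f \
  | grep -E 'namespace_controller\.go|namespaced_resources_deleter\.go'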

soltysh (Contributor) commented Jan 16, 2018

@ironcladlou since you're the GC expert, any ideas what might be getting stuck when removing a project on 3.7?

@ironcladlou (Contributor)

The controller logs already requested should help reveal the problem.

henning-cg commented Jan 17, 2018

Not the original poster, but we are having the same problem. The controller logs obtained via service atomic-openshift-master-controllers status -l -f show:

ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396421    1992 glusterfs.go:647] glusterfs: error when deleting the volume :
ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396494    1992 goroutinemap.go:166] Operation for "delete-pvc-c7db9d3a-f973-11e7-a8d9-000c29f66ce4[cba7fb1f-f973-11e7-a8d9-000c29f66ce4]" failed. No retries permitted until 2018-01-17 13:48:05.396213661 +0100 CET (durationBeforeRetry 2m2s). Error:

gbaufake (Author) commented Jan 17, 2018

Some logs from service atomic-openshift-master-controllers status -l -f:

https://paste.fedoraproject.org/paste/QgJ3S1QTiGRVvhREEEnoDQ

liggitt (Contributor) commented Jan 17, 2018

@gbaufake if you have an API group that is unresponsive (as you do), the namespace controller cannot guarantee it has cleaned up all the resources in the namespace.

It is expected that the namespace will remain in Terminating state until the controller can ensure it has discovered and removed all the resources in that namespace.
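
(One way to spot the unresponsive group is to look at the aggregated APIService objects; a hedged sketch, since the column output differs between versions:)

# List aggregated API services and inspect the failing one.
oc get apiservices
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # check status.conditions for Available=False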

@gbaufake (Author)

@liggitt Is there a way to restart that API group specifically?

@louyihua (Contributor)

This is a problem with the 'Service Catalog' API group, which is served from the kube-service-catalog namespace.
Please check the state of the two pods in this namespace.
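
(For example, assuming cluster-admin access; the pod names below are placeholders and will differ on your cluster:)

# Check the Service Catalog pods and their logs.
oc get pods -n kube-service-catalog
oc logs -n kube-service-catalog <apiserver-pod-name>
oc logs -n kube-service-catalog <controller-manager-pod-name>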

@louyihua (Contributor)

@gbaufake

Jan 17 08:37:21  atomic-openshift-master-controllers[4416]: E0117 08:37:21.347636    4416 namespace_controller.go:148] unable to retrieve the complete list of server APIs: istio.io/v1alpha1: the server could not find the requested resource, servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"service-catalog-signer\")'\nTrying to reach: 'https://172.30.231.104:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding

Your log shows there is a certificate problem with the Service Catalog API group. Please fix that issue first.
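
(To double-check the certificate itself, you can inspect what that endpoint presents; a sketch using the service IP from the log line above:)

# Show the issuer/subject of the certificate the aggregated API serves; compare
# the issuer with the CA the aggregator expects ("service-catalog-signer").
echo | openssl s_client -connect 172.30.231.104:443 -showcerts 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates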

soltysh (Contributor) commented Jan 18, 2018

Seems like the cert issue is related to #17952. From https://bugzilla.redhat.com/show_bug.cgi?id=1525014#c14 one possible solution was to re-create the service catalog.

@gbaufake (Author)

@soltysh Could using the workaround you mentioned lead to openshift/openshift-ansible#6572?

pweil- added the kind/bug and priority/P2 labels Jan 18, 2018
gbaufake (Author) commented Jan 28, 2018

After correcting the certs, I brought a new cluster up

oc version

oc v3.7.27
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip:8443
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62

and still faced the same problem when deleting projects.

I used @soltysh's workaround (oc delete apiservices.apiregistration.k8s.io/v1beta1.servicecatalog.k8s.io -n kube-service-catalog) and then ran the service-catalog playbook again.

The only problem is the ServiceBinding, which stays behind.

oc get servicebinding

NAME                             AGE
jenkins-persistent-7fhmj-7wg7q   1h
jenkins-persistent-dbjdt-ts8g5   21m

I also tried to delete the first ServiceBinding with --force=true:

oc delete servicebindings jenkins-persistent-7fhmj-7wg7q --force=true

servicebinding "jenkins-persistent-7fhmj-7wg7q" deleted

In the controller-manager I saw this log:

I0128 21:21:58.854041 1 controller_binding.go:190] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Processing
I0128 21:21:58.854139 1 controller_binding.go:218] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
I0128 21:21:58.854265 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Setting condition "Ready" to False
I0128 21:21:58.854292 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Updating status
I0128 21:21:58.854363 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceBinding", Namespace:"jenkins", Name:"jenkins-persistent-7fhmj-7wg7q", UID:"325f296f-0464-11e8-ba34-0a580a820006", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"89365", FieldPath:""}): type: 'Warning' reason: 'ErrorAsyncOperationInProgress' trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
I0128 21:21:58.860746 1 controller.go:232] Error syncing ServiceBinding jenkins/jenkins-persistent-7fhmj-7wg7q: Ongoing Asynchronous operation

I also tried to delete the other ServiceBinding (oc delete servicebindings jenkins-persistent-dbjdt-ts8g5 --force=true) and saw a different log in the controller-manager than for the first one:

I0128 21:24:41.659239 1 controller_binding.go:842] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Deleting Secret "jenkins/jenkins-persistent-dbjdt-credentials-yyqnh"
I0128 21:24:41.662509 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Setting condition "Ready" to False
I0128 21:24:41.662546 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Updating status
E0128 21:24:41.671371 1 controller_binding.go:929] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Error updating status: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal
I0128 21:24:41.671406 1 controller.go:237] Dropping ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5" out of the queue: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal

soltysh (Contributor) commented Jan 29, 2018

This looks like a problem that @openshift/team-service-catalog should look into.

jboyd01 (Contributor) commented Jan 29, 2018

jboyd01 assigned jboyd01 and unassigned pmorie Jan 29, 2018
jboyd01 (Contributor) commented Feb 6, 2018

"Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal" looks to be the same issue that is causing https://bugzilla.redhat.com/show_bug.cgi?id=1535902 (try to delete an instance or binding while it is being provisioned async).

jboyd01 (Contributor) commented Mar 5, 2018

Fixed in 3.9 via upstream kubernetes-retired/service-catalog#1708 and re-vendored into OpenShift with #18633.

nemonik commented Jun 1, 2018

I'm seeing the same thing:

➜  ~ oc delete project nginx-ingress
Error from server (Conflict): Operation cannot be fulfilled on namespaces "nginx-ingress": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

The project is denoted as

This project marked for deletion

in the web console.

@laurafitzgerald

I'm seeing this in Minishift 3.11.
oc get all returns "No resources found."

pyates86 (Member) commented Sep 3, 2019

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

  finalizers:
  - kubernetes

I cleared up 1000's of projects by following these steps:

  1. Collect the projects stuck in Terminating:
    oc get projects |grep Terminating |awk '{print $1}' > mylist

  2. Create and run this script to create a json file for each terminating project (while removing the kubernetes finalizer line):

#!/bin/bash
# Writes <project>.json for every project in 'mylist', with the line containing
# "kubernetes" (the finalizer entry) filtered out.
filename='mylist'
while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
done < $filename

  3. Run:
    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

#!/bin/bash
# PUTs each modified JSON back to the namespace's finalize subresource via the proxy.
filename='mylist'
while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
done < $filename

  5. Verify:
    oc get projects |grep Terminating

Terminating projects should be gone.
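
(For a single stuck namespace, the same idea can be done by hand. This is a sketch, not part of the original steps — it empties spec.finalizers with python and PUTs the object to the finalize subresource through the local proxy; the namespace name here is made up:)

#!/bin/bash
# Sketch: clear the finalizers of one stuck namespace via the finalize subresource.
# Assumes cluster-admin and `kubectl proxy --port=8080` already running.
NS=my-stuck-project
oc get namespace "$NS" -o json \
  | python -c 'import json,sys; o=json.load(sys.stdin); o["spec"]["finalizers"]=[]; print(json.dumps(o))' \
  > "/tmp/$NS-finalize.json"
curl -H "Content-Type: application/json" -X PUT \
  --data-binary "@/tmp/$NS-finalize.json" \
  "http://localhost:8080/api/v1/namespaces/$NS/finalize"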

greg-pendlebury commented Oct 1, 2019

We got hit by this today too, and were quite stumped until we found this post. The solution from @pyates86 resolved it for us.

oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth SSPI Kerberos SPNEGO

Server ....
openshift v3.11.117
kubernetes v1.11.0+d4cacc0

@greg-pendlebury

Spoke too soon... our team tried reusing that project name today and it immediately went back into the same Terminating state after it was created.

FWIW, it is almost exactly the same issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1507440#c45

Right down to it being a persistent Jenkins serviceinstance and reporting:

Error polling last operation: Status: 500; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "{...ID goes here...}" not found; ResponseError: <nil>

I have now read a number of reports indicating that the 'fix' from @pyates86 above will just hide the issue, not resolve it.

fvaleri commented Oct 18, 2019

The cleanup procedure from @pyates86 works fine with minishift v1.34.1+c2ff9cb (oc v3.11.0+0cbc58b), but you need to be cluster-admin, use oc proxy --port=8080 &, and make the following JSON replacements before running the 2nd script (see the sed sketch after the list):

  • "kind": "Project" --> "kind": "Namespace"
  • v1 --> project.openshift.io/v1
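
(If you want to script those two replacements instead of editing by hand, a sed pair along these lines should work against the $p.json files produced by the earlier script; it assumes the default "key": "value" spacing that oc get -o json emits:)

# Sketch: apply the two replacements to each generated JSON file.
sed -i 's/"kind": "Project"/"kind": "Namespace"/' "$p.json"
sed -i 's#"apiVersion": "v1"#"apiVersion": "project.openshift.io/v1"#' "$p.json"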

@vtlrazin

For OCP 4.1 this works with:
"kind": "Project" --> "kind": "Namespace"
apiVersion: "project.openshift.io/v1" --> apiVersion: "v1"

apastel commented Oct 31, 2019

@vtlrazin Thanks, your comment helped when the original suggestion was giving me:
"the API version in the data (project.openshift.io/v1) does not match the expected API version (v1)"

trumbaut commented Nov 6, 2019

Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

Bengrunt commented Jan 20, 2020

Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

FYI, that article is not accessible; I only have a Red Hat developer account. :/
I'd gladly take the solution though, since it affects our 3.11 cluster as well.

Thanks!

apastel commented Jan 22, 2020

Someone made a script to help with this, using the solution mentioned by @pyates86 above.
I forked it and modified it to remove the Authorization header since that was causing a problem for me.
https://github.com/apastel/useful-scripts/blob/master/openshift/force-delete-openshift-project

saikaushik-itsmyworld commented Feb 3, 2020

Similar issue and workaround described at https://access.redhat.com/solutions/4165791.
FYI, that issue is not accessible. I only have a redhat developer account. :/
I'd gladly get the solution though since it affects our 3.11 cluster as well.

How can I get access to this link? I'm facing the same issue with one of my projects in a 3.11 cluster as well.

Thanks!

sarvjeetrajvansh commented Feb 25, 2020

I am also facing the same issue; the project is stuck in the Terminating state:

kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: icp4iapic2
  uid: 1d33c67d-4e74-11ea-bc04-0a826dbb1b51
  resourceVersion: '7631358'
  creationTimestamp: '2020-02-13T15:18:40Z'
  deletionTimestamp: '2020-02-25T09:32:53Z'
  annotations:
    mcm.ibm.com/accountID: id-mycluster-account
    mcm.ibm.com/type: System
    openshift.io/description: ''
    openshift.io/display-name: ''
    openshift.io/requester: admin
spec:
  finalizers:
    - kubernetes
status:
  phase: Terminating

apastel commented Feb 25, 2020

I am also facing the same issue; the project is stuck in the Terminating state.

A solution is already in this thread.

sarvjeetrajvansh commented Feb 26, 2020

If anyone is still facing this issue, I have formalized the above steps into a shell script:
https://github.com/sarvjeetrajvansh/publiccode/blob/shell/cleanprojectopenshift.sh

Pass your namespace as an argument to the script.
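
(Hypothetical invocation, based on the description above; the namespace name is made up:)

./cleanprojectopenshift.sh my-stuck-project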

rrw commented Mar 31, 2020

splatas commented Apr 16, 2020

Here are the instructions from @pyates86, updated (pay attention to step 5):

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

finalizers:
  - kubernetes

I cleared up 1000's of projects by following these steps:

  1. Create a file with the projects in 'Terminating' state:

    oc get projects |grep Terminating |awk '{print $1}' > mylist_project_terminating

  2. Create and run this script to create a json file for each terminating project (while removing the kubernetes finalizer):

    script_create_json.sh:

    #!/bin/bash
    filename='mylist_project_terminating'
    while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
    done < $filename

  3. Run a proxy to the cluster:

    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

    script_remove_finalizer.sh:

    #!/bin/bash
    filename='mylist_project_terminating'
    while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
    done < $filename

  5. If it fails, check the generated .json files:
    {
    "apiVersion": "project.openshift.io/v1",
    "kind": "Project",
    ...

    Replace "project.openshift.io/v1" with "v1" in that file:
    "apiVersion": "v1",

    ... and run the script again.

  6. Run validation:
    oc get projects |grep Terminating

Terminating projects should be gone.

@sgremyachikh

https://raw.githubusercontent.com/sarvjeetrajvansh/publiccode/shell/cleanprojectopenshift.sh
