
Insufficient regional quota to satisfy request and Katib job is blocked. #749

Closed
Jeffwan opened this issue Aug 20, 2020 · 24 comments

@Jeffwan
Member

Jeffwan commented Aug 20, 2020

Prow status: https://prow.k8s.io/?repo=kubeflow%2Fkatib

ERROR: (gcloud.beta.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request: resource "CPUS": request requires
'48.0' and is short '40.0'. project has a quota of '500.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage
=USED&project=kubeflow-ci.

Reported by @andreyvelich

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.77
kind/bug 0.89


@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.54



@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

It seems the CPU quota is almost used up, which matches the error message.

gcloud compute regions describe us-east1 --project=kubeflow-ci         
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: us-east1
id: '1230'
kind: compute#region
name: us-east1
quotas:
- limit: 500.0
  metric: CPUS
  usage: 484.0
- limit: 200000.0
  metric: DISKS_TOTAL_GB
  usage: 13886.0
....

@andreyvelich
Member

@Jeffwan Thank you for creating the issue.
Can we check which pods are currently running on the cluster and using CPUs?
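
For instance, something along these lines could list per-pod CPU requests across all namespaces (a sketch; assumes kubectl is pointed at the cluster in question):

# Sketch: show CPU requests for every pod in every namespace
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'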

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

I kicked off a one-off job to clean up deployments and release some resources.

@andreyvelich
Member

I will try to re-run tests.

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

@andreyvelich

There are a few clusters under this project, and I am trying to clean up the kubeflow-ci cluster first since all workflows run there. We don't seem to have a utility that groups CPU usage by cluster. Once the cleanup is done, I will check the total CPU usage at the project level again.
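
Something like this could approximate CPU usage per cluster (a sketch; the field names are a best guess, and node counts would still need to be multiplied by each machine type's vCPUs):

# Sketch: list clusters in the project with node counts and machine types
gcloud container clusters list --project=kubeflow-ci \
  --format='table(name,location,currentNodeCount,nodeConfig.machineType)'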

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

kubeflow-periodic-0-3-branch-tf-serving-3367-e0d1      Active   486d
kubeflow-periodic-0-5-branch-tf-serving-3856-4443      Active   335d
kubeflow-periodic-master-deployapp-878-a4a0            Active   536d
kubeflow-periodic-master-tf-serving-353-b738           Active   623d
kubeflow-periodic-master-tf-serving-721-fd6b           Active   562d
kubeflow-periodic-master-tf-serving-913-2737           Active   530d
kubeflow-periodic-release-branch-tf-serving-227-9734   Active   617d
kubeflow-presubmit-deployapp-1817-e39b1d3-3922-6f80    Active   672d
kubeflow-presubmit-deployapp-1817-f1c14ea-3928-4442    Active   672d
kubeflow-presubmit-tf-serving-2338-9316696-5673-b788   Active   554d
kubeflow-presubmit-tf-serving-2449-e9ea4dd-5627-7ec9   Active   555d
kubeflow-presubmit-tf-serving-2474-1720719-5700-fc62   Active   553d
kubeflow-presubmit-tf-serving-2784-3488c99-6610-fdef   Active   516d
kubeflow-presubmit-tf-serving-2991-7732038-6736-f518   Active   497d
kubeflow-presubmit-tf-serving-3464-2de5dd8-6288-31d2   Active   433d
kubeflow-presubmit-tf-serving-3464-7c4ef28-7168-3901   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-1152-5c01   Active   432d
kubeflow-presubmit-tf-serving-3464-9165fce-2688-141d   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-8256-a10f   Active   432d

I think these are leaked resources; I will delete them as well (a cleanup sketch follows the listing below).

k get all -n kubeflow-presubmit-tf-serving-2338-9316696-5673-b788
NAME                            READY   STATUS    RESTARTS   AGE
pod/mnist-cpu-bc4ddfd96-ssmtv   1/1     Running   33         512d

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/mnist-cpu   ClusterIP   10.39.255.61    <none>        9000/TCP,8500/TCP   554d
service/mnist-gpu   ClusterIP   10.39.244.227   <none>        9000/TCP,8500/TCP   554d

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mnist-cpu   1/1     1            1           554d

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/mnist-cpu-bc4ddfd96   1         1         1       554d
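
A cleanup along these lines should work (a sketch; assumes every kubeflow-presubmit-* and kubeflow-periodic-* namespace is disposable):

# Sketch: delete the leaked presubmit/periodic test namespaces
kubectl get ns -o name \
  | grep -E 'kubeflow-(presubmit|periodic)-' \
  | xargs -r -n1 kubectl delete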

@andreyvelich
Member

Do we have any idea how this pod was deployed?
It has been running for 512d, which is strange.

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

Not sure, it was a long time ago. :D kubeflow-testing doesn't have a resource leak; in total it uses fewer than 100 CPUs (only 7 nodes; I summed all the requests). It could be the other clusters. I need some time to figure it out.
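
A listing like the one below can be produced with something along these lines (a sketch):

# Sketch: list all clusters in the project, oldest first
gcloud container clusters list --project=kubeflow-ci --sort-by=createTime \
  --format='table(name,createTime,location,status)'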

kubeflow-testing                          2018-03-29T17:46:26+00:00  us-east1-d     RUNNING
kf-vmaster-n00                            2019-04-02T12:15:07+00:00  us-east1-b     RUNNING
kf-ci-v1                                  2020-02-03T23:14:27+00:00  us-east1-d     RUNNING
fairing-ci                                2020-03-09T17:08:54+00:00  us-central1-a  RUNNING
ztor-presubmit-v1-1150-21e7089-1683-8f6b  2020-04-08T12:34:44+00:00  us-east1-d     RUNNING
kf-ci-management                          2020-04-28T21:22:24+00:00  us-central1    RUNNING
ztor-presubmit-v1-1175-2f86c79-0370-4672  2020-06-26T03:42:13+00:00  us-east1-d     RUNNING
zmit-e2e-v1alpha3-1235-c772f95-9616-2f11  2020-06-28T02:52:57+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-6400-180a  2020-07-24T14:17:23+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2272-dcce  2020-07-25T01:20:32+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-1840-a12b  2020-07-25T11:21:13+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7808-9113  2020-07-26T15:35:59+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2752-cd98  2020-07-27T06:11:29+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-8096-3703  2020-07-29T08:55:21+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7216-b6d9  2020-08-05T09:20:25+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1175-56c27aa-5600-5bc3  2020-08-11T03:50:57+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1303-30e3e23-2896-853c  2020-08-19T20:29:44+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1305-9179667-8816-238b  2020-08-19T20:51:50+00:00  us-east1-d     RUNNING

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

Hmm. The usage of kf-ci-v1, fairing-ci, and kubeflow-testing seems reasonable. @jinchihe @jlewi Any other clues?

@jlewi
Contributor

jlewi commented Aug 21, 2020

It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.

e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

Some background:
Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.

To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.

Not all of the tests have been migrated to this model.

To remediate this, I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.

I'm removing a bunch of permissions from the [email protected] GSA. Notably:

  • Kubernetes Engine Admin
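
The removal presumably looks something like this (a sketch; assumes Kubernetes Engine Admin corresponds to the role ID roles/container.admin):

# Sketch: revoke Kubernetes Engine Admin from the test GSA
gcloud projects remove-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:[email protected]' \
  --role='roles/container.admin'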

I'm attaching the full policy before the modifications
kubeflow-ci.policy.iam.txt

I deleted all the ephemeral clusters. This should free up significant CPU.

Tests that were using kubeflow-ci for ephemeral infrastructure will need to migrate to creating that infra in different projects. My initial guess is that this primarily impacts Katib (@andreyvelich @johnugeorge).

Each WG should probably use its own GCP project for this to allow better isolation and quota management.

WG"s can projects using GitOPS by creating the project here:
https://github.com/kubeflow/community-infra/tree/master/prod

As part of #737 it would be nice to document this so that other WGs could follow a similar approach.

Related to: #650 - Organize projects into folder.

@jlewi
Contributor

jlewi commented Aug 21, 2020

@kubeflow/kfserving-owners @andreyvelich @gaocegege @johnugeorge @Bobgy @rmgogogo @terrytangyuan see the previous comment, as it's possible tests for your WG were impacted.

@andreyvelich
Member

> It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.
>
> e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

In the Katib test infra we create an individual cluster for each presubmit and clean that cluster up afterwards, all under the kubeflow-ci project.
For example, one of our workflows: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-katib-presubmit-e2e-v1beta1-1299-b2d713c-3236-d712?tab=workflow.

> Some background:
> Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.
>
> To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.
>
> Not all of the tests have been migrated to this model.

I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

> To remediate this, I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.
>
> I'm removing a bunch of permissions from the [email protected] GSA. Notably:

Does this affect the TF or PyTorch operator test infra @johnugeorge @terrytangyuan @Jeffwan ?

@johnugeorge
Member

Yes, it should be the same case with the operators as well.

@yuzisun
Member

yuzisun commented Aug 22, 2020

@jlewi this is affecting KFServing CI and currently blocking multiple PRs. I have created PR kubeflow/community-infra#10 to set up a GCP project for KFServing; please help review, thanks!

@yuzisun
Member

yuzisun commented Aug 23, 2020

/priority p0

@gaocegege
Member

> I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

SGTM.

@Jeffwan
Member Author

Jeffwan commented Aug 26, 2020

I filed a PR for the training projects: kubeflow/community-infra#13

@jlewi
Contributor

jlewi commented Aug 27, 2020

As I mentioned in yesterday's community meeting, another remediation would be to revert the IAM policy changes in
#749 (comment)

That would grant an extension to the existing projects (e.g. Katib and KFServing) so that they can continue to create clusters in kubeflow-ci until they have successfully set up and migrated to WG-specific projects.

[email protected] should have sufficient privileges to do this. So anyone in
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt
Could do this.
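
Restoring the role removed earlier might look like this (a sketch; again assumes Kubernetes Engine Admin corresponds to roles/container.admin):

# Sketch: temporarily re-grant Kubernetes Engine Admin to the test GSA
gcloud projects add-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:[email protected]' \
  --role='roles/container.admin'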

It looks like the membership of ci-team@ is outdated; a bunch of those folks are likely no longer active in the project. It might make sense to replace them with members from the respective WGs that depend on kubeflow-ci so that they can help administer and contribute.

cc @kubeflow/wg-automl-leads @kubeflow/wg-serving-leads @kubeflow/wg-training-leads

@animeshsingh

@jlewi do you have a view on who, of those with @google.com addresses, is active here?
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt

Additionally, I don't recognize who else is active from these names:

[email protected] [Don't think he is active anymore]
[email protected] [Who?]
[email protected] [Who?]

@jlewi
Contributor

jlewi commented Aug 27, 2020

@animeshsingh you should be able to use the file history to see who committed the changes:
@scottilee = scottleehello@
@harshad16 = harshad@

@stale

stale bot commented Nov 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@Bobgy Bobgy closed this as completed Nov 27, 2020