
Insufficient regional quota to satisfy request and Katib job is blocked. #749

Closed
Jeffwan opened this issue Aug 20, 2020 · 24 comments

@Jeffwan
Member

Jeffwan commented Aug 20, 2020

Prow status: https://prow.k8s.io/?repo=kubeflow%2Fkatib

ERROR: (gcloud.beta.container.clusters.create) ResponseError: code=403, message=Insufficient regional quota to satisfy request: resource "CPUS": request requires
'48.0' and is short '40.0'. project has a quota of '500.0' with '8.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage
=USED&project=kubeflow-ci.

Reported by @andreyvelich

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.77
kind/bug 0.89


@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.54



@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

It seems the CPU quota is almost used up, which matches the error message.

gcloud compute regions describe us-east1 --project=kubeflow-ci         
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: us-east1
id: '1230'
kind: compute#region
name: us-east1
quotas:
- limit: 500.0
  metric: CPUS
  usage: 484.0
- limit: 200000.0
  metric: DISKS_TOTAL_GB
  usage: 13886.0
....

@andreyvelich
Member

@Jeffwan Thank you for creating the issue.
Can we check which pods are currently running on the cluster and using CPUs?
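
For instance, something along these lines could list per-pod CPU requests across all namespaces (a sketch; assumes kubectl is pointed at the cluster in question):

# Sketch: show CPU requests for every pod in every namespace
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'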

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

I kicked off a one-off job to clean up deployments and release some resources.

@andreyvelich
Member

I will try to re-run tests.

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

@andreyvelich

There are a few clusters under this project, and I am trying to clean up the kubeflow-ci cluster first since all workflows run there. We don't seem to have a utility that groups CPU usage by cluster. Once the cleanup is done, I will check the total CPU usage at the project level again.
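
Something like this could approximate CPU usage per cluster (a sketch; the field names are a best guess, and node counts would still need to be multiplied by each machine type's vCPUs):

# Sketch: list clusters in the project with node counts and machine types
gcloud container clusters list --project=kubeflow-ci \
  --format='table(name,location,currentNodeCount,nodeConfig.machineType)'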

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

kubeflow-periodic-0-3-branch-tf-serving-3367-e0d1      Active   486d
kubeflow-periodic-0-5-branch-tf-serving-3856-4443      Active   335d
kubeflow-periodic-master-deployapp-878-a4a0            Active   536d
kubeflow-periodic-master-tf-serving-353-b738           Active   623d
kubeflow-periodic-master-tf-serving-721-fd6b           Active   562d
kubeflow-periodic-master-tf-serving-913-2737           Active   530d
kubeflow-periodic-release-branch-tf-serving-227-9734   Active   617d
kubeflow-presubmit-deployapp-1817-e39b1d3-3922-6f80    Active   672d
kubeflow-presubmit-deployapp-1817-f1c14ea-3928-4442    Active   672d
kubeflow-presubmit-tf-serving-2338-9316696-5673-b788   Active   554d
kubeflow-presubmit-tf-serving-2449-e9ea4dd-5627-7ec9   Active   555d
kubeflow-presubmit-tf-serving-2474-1720719-5700-fc62   Active   553d
kubeflow-presubmit-tf-serving-2784-3488c99-6610-fdef   Active   516d
kubeflow-presubmit-tf-serving-2991-7732038-6736-f518   Active   497d
kubeflow-presubmit-tf-serving-3464-2de5dd8-6288-31d2   Active   433d
kubeflow-presubmit-tf-serving-3464-7c4ef28-7168-3901   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-1152-5c01   Active   432d
kubeflow-presubmit-tf-serving-3464-9165fce-2688-141d   Active   433d
kubeflow-presubmit-tf-serving-3464-9165fce-8256-a10f   Active   432d

I think these are leaked resources; I will delete them as well (a cleanup sketch follows the listing below).

k get all -n kubeflow-presubmit-tf-serving-2338-9316696-5673-b788
NAME                            READY   STATUS    RESTARTS   AGE
pod/mnist-cpu-bc4ddfd96-ssmtv   1/1     Running   33         512d

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/mnist-cpu   ClusterIP   10.39.255.61    <none>        9000/TCP,8500/TCP   554d
service/mnist-gpu   ClusterIP   10.39.244.227   <none>        9000/TCP,8500/TCP   554d

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mnist-cpu   1/1     1            1           554d

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/mnist-cpu-bc4ddfd96   1         1         1       554d
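
A cleanup along these lines should work (a sketch; assumes every kubeflow-presubmit-* and kubeflow-periodic-* namespace is disposable):

# Sketch: delete the leaked presubmit/periodic test namespaces
kubectl get ns -o name \
  | grep -E 'kubeflow-(presubmit|periodic)-' \
  | xargs -r -n1 kubectl delete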

@andreyvelich
Member

Do we have any idea how this pod was deployed?
It has been running for 512d, which is strange.

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

Not sure, it was a long time ago. :D kubeflow-testing doesn't have a resource leak; in total it uses fewer than 100 CPUs (only 7 nodes; I summed all the requests). It could be the other clusters. I need some time to figure it out.
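
A listing like the one below can be produced with something along these lines (a sketch):

# Sketch: list all clusters in the project, oldest first
gcloud container clusters list --project=kubeflow-ci --sort-by=createTime \
  --format='table(name,createTime,location,status)'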

kubeflow-testing                          2018-03-29T17:46:26+00:00  us-east1-d     RUNNING
kf-vmaster-n00                            2019-04-02T12:15:07+00:00  us-east1-b     RUNNING
kf-ci-v1                                  2020-02-03T23:14:27+00:00  us-east1-d     RUNNING
fairing-ci                                2020-03-09T17:08:54+00:00  us-central1-a  RUNNING
ztor-presubmit-v1-1150-21e7089-1683-8f6b  2020-04-08T12:34:44+00:00  us-east1-d     RUNNING
kf-ci-management                          2020-04-28T21:22:24+00:00  us-central1    RUNNING
ztor-presubmit-v1-1175-2f86c79-0370-4672  2020-06-26T03:42:13+00:00  us-east1-d     RUNNING
zmit-e2e-v1alpha3-1235-c772f95-9616-2f11  2020-06-28T02:52:57+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-6400-180a  2020-07-24T14:17:23+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2272-dcce  2020-07-25T01:20:32+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-1840-a12b  2020-07-25T11:21:13+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7808-9113  2020-07-26T15:35:59+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-2752-cd98  2020-07-27T06:11:29+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-8096-3703  2020-07-29T08:55:21+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1171-232e11d-7216-b6d9  2020-08-05T09:20:25+00:00  us-east1-d     RUNNING
ztor-presubmit-v1-1175-56c27aa-5600-5bc3  2020-08-11T03:50:57+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1303-30e3e23-2896-853c  2020-08-19T20:29:44+00:00  us-east1-d     RUNNING
zbmit-e2e-v1beta1-1305-9179667-8816-238b  2020-08-19T20:51:50+00:00  us-east1-d     RUNNING

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2020

Hmm. The usage of kf-ci-v1, fairing-ci, and kubeflow-testing seems reasonable. @jinchihe @jlewi Any other clues?

@jlewi
Contributor

jlewi commented Aug 21, 2020

It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.

e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

Some background:
Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.

To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.

Not all of the tests have been migrated to this model.

To remediate this, I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.

I'm removing a bunch of permissions from the [email protected] GSA. Notably:

  • Kubernetes Engine Admin
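
The removal presumably looks something like this (a sketch; assumes Kubernetes Engine Admin corresponds to the role ID roles/container.admin):

# Sketch: revoke Kubernetes Engine Admin from the test GSA
gcloud projects remove-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:[email protected]' \
  --role='roles/container.admin'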

I'm attaching the full policy before the modifications
kubeflow-ci.policy.iam.txt

I deleted all the ephemeral clusters. This should free up significant CPU.

Tests that were using kubeflow-ci for ephemeral infrastructure will need to migrate to creating that infra in different projects. My initial guess is that this primarily impacts Katib (@andreyvelich @johnugeorge).

Each WG should probably use its own GCP project for this to allow better isolation and quota management.

WG"s can projects using GitOPS by creating the project here:
https://github.com/kubeflow/community-infra/tree/master/prod

As part of #737 it would be nice to document this so that other WGs could follow a similar approach.

Related to: #650 - Organize projects into folder.

@jlewi
Contributor

jlewi commented Aug 21, 2020

@kubeflow/kfserving-owners @andreyvelich @gaocegege @johnugeorge @Bobgy @rmgogogo @terrytangyuan see the previous comment, as it's possible tests for your WG were impacted.

@andreyvelich
Member

> It looks like some pretty large clusters are still being created inside project: kubeflow-ci for individual presubmits.
>
> e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created at 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

In the Katib test infra we create an individual cluster for each presubmit and clean that cluster up afterwards, all under the kubeflow-ci project.
For example, one of our workflows: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-katib-presubmit-e2e-v1beta1-1299-b2d713c-3236-d712?tab=workflow.

> Some background:
> Originally all of our test infrastructure ran in project kubeflow-ci. This included ephemeral infrastructure such as GKE clusters spun up for the lifetime of the tests.
>
> To enable better management of ephemeral infrastructure we started moving ephemeral clusters into separate projects e.g. kubeflow-ci-deployments. The thinking was this would make it easier to deal with resource leaks because everything in the ephemeral project could just be deleted.
>
> Not all of the tests have been migrated to this model.

I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

> To remediate this, I'm going to disable the ability of tests to create infrastructure in project kubeflow-ci. This will break any tests that are still doing that, acting as a forcing function for them to fix things.
>
> I'm removing a bunch of permissions from the [email protected] GSA. Notably:

Does this affect the TF or PyTorch operator test infra @johnugeorge @terrytangyuan @Jeffwan ?

@johnugeorge
Member

Yes, it should be the same case with the operators as well.

@yuzisun
Member

yuzisun commented Aug 22, 2020

@jlewi this is affecting KFServing CI and currently blocking multiple PRs. I have created PR kubeflow/community-infra#10 to set up a GCP project for KFServing; please help review, thanks!

@yuzisun
Member

yuzisun commented Aug 23, 2020

/priority p0

@gaocegege
Member

> I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge ?

SGTM.

@Jeffwan
Member Author

Jeffwan commented Aug 26, 2020

I filed a PR for the training projects: kubeflow/community-infra#13

@jlewi
Contributor

jlewi commented Aug 27, 2020

As I mentioned in yesterday's community meeting, another remediation would be to revert the IAM policy changes in
#749 (comment)

That would grant an extension to the existing projects (e.g. Katib and KFServing) so that they can continue to create clusters in kubeflow-ci until they have successfully set up and migrated to WG-specific projects.

[email protected] should have sufficient privileges to do this. So anyone in
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt
Could do this.
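
Restoring the role removed earlier might look like this (a sketch; again assumes Kubernetes Engine Admin corresponds to roles/container.admin):

# Sketch: temporarily re-grant Kubernetes Engine Admin to the test GSA
gcloud projects add-iam-policy-binding kubeflow-ci \
  --member='serviceAccount:[email protected]' \
  --role='roles/container.admin'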

It looks like the membership of ci-team@ is outdated; a bunch of those folks are likely no longer active in the project. It might make sense to replace them with members from the respective WGs that depend on kubeflow-ci so that they can help administer and contribute.

cc @kubeflow/wg-automl-leads @kubeflow/wg-serving-leads @kubeflow/wg-training-leads

@animeshsingh

@jlewi do you have a view on who, of those with @google.com addresses, is active here?
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt

Additionally, I don't recognize who else is active from these names:

[email protected] [Don't think he is active anymore]
[email protected] [Who?]
[email protected] [Who?]

@jlewi
Contributor

jlewi commented Aug 27, 2020

@animeshsingh you should be able to use the file history to see who committed the changes:
@scottilee = scottleehello@
@harshad16 = harshad@

@stale

stale bot commented Nov 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@Bobgy Bobgy closed this as completed Nov 27, 2020