Insufficient regional quota to satisfy request and katib job is blocked. #749
It seems the CPU quota is nearly exhausted, which matches the error message.
@Jeffwan Thank you for creating the issue. I kicked off a one-off job to clean up deployments and release some resources. I will try to re-run the tests.
There are a few clusters under this project, and I am trying to clean up the kubeflow-ci cluster first; all workflows run there. It doesn't seem we have a utility to group CPU usage by cluster. Once the cleanup is done, I can check the total CPU usage at the project level again.
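Since no "group by <cluster, cpu>" utility exists in the repo, here is a minimal sketch of what one could look like. It assumes you have already parsed node VM descriptions into dicts (e.g. from `gcloud compute instances list --format=json`); the `goog-k8s-cluster-name` label key and the `cpus` field are assumptions about that parsed shape, not a documented API.

```python
from collections import defaultdict

def cpu_by_cluster(instances):
    """Sum vCPUs per GKE cluster.

    `instances` is a list of dicts, each with an optional `labels` dict
    (assumed to carry the cluster name under `goog-k8s-cluster-name`)
    and a `cpus` count. Instances without a cluster label are grouped
    under "<none>".
    """
    totals = defaultdict(int)
    for inst in instances:
        cluster = inst.get("labels", {}).get("goog-k8s-cluster-name", "<none>")
        totals[cluster] += inst.get("cpus", 0)
    return dict(totals)
```

Summing the resulting dict's values then gives the project-level CPU total mentioned above.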
I think these are some leaked resources; I will delete them as well.
Do we have any idea how this pod was deployed?
Not sure. It was a long time ago. :D
It looks like some pretty large clusters are still being created inside the kubeflow-ci project for individual presubmits; e.g. zbmit-e2e-v1beta1-1305-9179667-8816-238b was created on 08-19. I'm not sure which presubmit this is coming from (maybe Katib).

Some background: to enable better management of ephemeral infrastructure, we started moving ephemeral clusters into separate projects, e.g. kubeflow-ci-deployments. The thinking was that this would make it easier to deal with resource leaks, because everything in the ephemeral project could simply be deleted. Not all of the tests have been migrated to this model.

To remediate this, I'm going to disable the ability of tests to create infrastructure in the project; I'm removing a bunch of permissions from
I'm attaching the full policy before the modifications. I deleted all the ephemeral clusters; this should free up significant CPU. Tests that were using kubeflow-ci for ephemeral infrastructure will need to migrate to creating ephemeral infra in different projects. My initial guess is that this primarily impacts Katib (@andreyvelich @johnugeorge). Each WG should probably use its own GCP project for this, to allow better isolation and quota management. WGs can create projects using GitOps by creating the project here: As part of #737 it would be nice to document this so that other WGs could follow a similar approach. Related to: #650 - Organize projects into folders.
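For the cleanup described above, one could flag leaked presubmit clusters by name pattern and age. The sketch below is a hypothetical helper, not part of the test infra: the `-e2e-` name pattern is inferred from the example cluster name in this thread, and the 24-hour staleness threshold is an arbitrary assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# Presubmit clusters observed here look like zbmit-e2e-v1beta1-<pr>-<sha>-...
E2E_PATTERN = re.compile(r".*-e2e-.*")

def is_stale_ephemeral(name, created, now, max_age=timedelta(hours=24)):
    """Return True if a cluster looks like a leaked presubmit cluster:
    its name matches the e2e naming pattern and it is older than max_age."""
    return bool(E2E_PATTERN.match(name)) and (now - created) > max_age
```

A cleanup job could list clusters (e.g. via `gcloud container clusters list`), run each through this predicate, and delete the matches; long-lived clusters like kubeflow-ci would never match the pattern.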
@kubeflow/kfserving-owners @andreyvelich @gaocegege @johnugeorge @Bobgy @rmgogogo @terrytangyuan see the previous comment, as it's possible tests for your WG were impacted.
In the Katib test infra we create an individual cluster for each presubmit, and afterwards we clean up that cluster under the kubeflow-ci project.
I am fine with migrating the Katib test infra to an independent GCP project. What do you think @gaocegege @johnugeorge?
Does it affect the TF or PyTorch operators test infra @johnugeorge @terrytangyuan @Jeffwan?
Yes, it should be the same case with the operators as well.
@jlewi this is affecting KFServing CI and currently blocking multiple PRs. I have created the PR kubeflow/community-infra#10 to set up a GCP project for KFServing; please help review, thanks!
/priority p0
SGTM.
I filed a PR for training projects. kubeflow/community-infra#13 |
As I mentioned in yesterday's community meeting, another remediation would be to revert the IAM policy changes in order to grant an extension to the existing projects, e.g. Katib and KFServing, so that they can continue to create clusters in kubeflow-ci until they have successfully set up and migrated to WG-specific projects. [email protected] should have sufficient privileges to do this, so anyone in that group could help. It looks like the membership of ci-team@ is outdated; a number of those folks are likely no longer active in the project. It might make sense to replace them with members from the respective WGs that depend on kubeflow-ci, so that they can help administer and contribute. cc @kubeflow/wg-automl-leads @kubeflow/wg-serving-leads @kubeflow/wg-training-leads
@jlewi do you have a view on who with @google.com addresses is active here? Additionally, I don't recognize who else is active from these names: [email protected] [I don't think he is active anymore]
@animeshsingh you should be able to use the history to see who committed the changes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions. |
Prow status: https://prow.k8s.io/?repo=kubeflow%2Fkatib
Reported by @andreyvelich