Test cluster `default` reports unauthorized error #29003
Comments
/sig k8s-infra
This cluster is part of the Google infrastructure. I would advise moving to the community-owned infrastructure by adding the …
We are hitting this issue as well in https://github.com/kubernetes-sigs/kernel-module-management for all our PRs.
Hitting this issue as well in kubernetes-sigs/descheduler#937
One more report from @strongjz here:
Is the error coming from the Prow reconciler? Specifically: test-infra/prow/plank/reconciler.go, line 735 in d6acd10
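For context, here's a minimal sketch of the kind of lookup a controller-runtime reconciler does against a named build cluster. It is not the actual plank code (the function, the client map, and the error text are assumptions), but it shows how a stale kubeconfig would surface as an Unauthorized error on the ProwJob:

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// getBuildPod is a hypothetical sketch, not the plank implementation: it looks
// up the test pod for a job in the named build cluster and wraps any failure,
// so expired credentials propagate as an "Unauthorized" (HTTP 401) error.
func getBuildPod(ctx context.Context, buildClients map[string]ctrlruntimeclient.Client, cluster, namespace, name string) (*corev1.Pod, error) {
	c, ok := buildClients[cluster]
	if !ok {
		return nil, fmt.Errorf("unknown build cluster %q", cluster)
	}
	pod := &corev1.Pod{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pod); err != nil {
		// With a stale kubeconfig the API server rejects the request, and the
		// Unauthorized error gets wrapped into the job's sync error.
		return nil, fmt.Errorf("failed to get pod %s/%s in cluster %q: %w", namespace, name, cluster, err)
	}
	return pod, nil
}
```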
In case it's helpful, we're getting this for multiple PRs in https://github.com/kubernetes-sigs/gateway-api as well this morning.
Running into this over at https://github.com/kubernetes-sigs/aws-ebs-csi-driver as well.
Feature-changing PRs in k/k are affected via …
CAPZ PR tests are also affected. I see some folks tried to switch to the community cluster, but that PR itself is hitting the issue in test-infra #29008 :(
Please don't switch everything to the community cluster: we're still very, very tight on GCP budget this year, and that cluster has already had capacity issues of late. We don't want to resolve them by increasing autoscaling capacity given the tight budget (we're still on track for at least 3.4M on 3M credits this year and are actively working to cut costs). There is an EKS cluster coming online that workloads could switch to in the near future. Hopefully we'll have this resolved before then anyhow.
The kubeconfig for the default build cluster doesn't seem to be sufficient for accessing the build cluster anymore. I've reproduced this locally using the kubeconfig from the cluster. I noticed a single failure of the gencred job, but the timing doesn't seem to align with when the issue first started, and a successful rerun of the job did not resolve the issue despite indicating the …
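For anyone who wants to double-check the credentials themselves, here's a minimal client-go sketch of that local reproduction; the kubeconfig path and the test-pods namespace are assumptions, so adjust them to whatever you exported:

```go
package main

import (
	"context"
	"fmt"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed path to a kubeconfig exported from the build cluster secret.
	cfg, err := clientcmd.BuildConfigFromFlags("", "./default-build-cluster.kubeconfig")
	if err != nil {
		log.Fatalf("load kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build client: %v", err)
	}
	// A single pod list is enough to tell whether the credentials still work.
	_, err = client.CoreV1().Pods("test-pods").List(context.Background(), metav1.ListOptions{Limit: 1})
	if apierrors.IsUnauthorized(err) {
		log.Fatalf("kubeconfig is stale, server returned 401 Unauthorized: %v", err)
	}
	if err != nil {
		log.Fatalf("list pods: %v", err)
	}
	fmt.Println("kubeconfig is valid: pod list succeeded")
}
```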
Ah I see a lot of errors from the kubernetes-external-secrets deployment like the following. I think that could be the explanation for the kubeconfig going stale:
I've kicked over the pod and now it has synced the secret.
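For reference, "kicked over the pod" here just means deleting it so the deployment recreates it and re-syncs the secret on startup. A minimal client-go sketch of that step (the namespace and label selector are assumptions for illustration):

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("load kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("build client: %v", err)
	}
	// Delete the kubernetes-external-secrets pod(s); the Deployment's
	// ReplicaSet recreates them and the fresh pod re-syncs the secrets.
	// Namespace and label selector below are assumed, not confirmed.
	err = client.CoreV1().Pods("kubernetes-external-secrets").DeleteCollection(
		context.Background(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/name=kubernetes-external-secrets"},
	)
	if err != nil {
		log.Fatalf("delete pods: %v", err)
	}
	log.Println("pods deleted; deployment will restart them")
}
```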
Jobs are running again for Gateway API. Thank you @cjwagner 🖖
Thanks @cjwagner!
Things should be fixed now. It seems that the root cause of this outage was the KES deployment getting stuck on some internal error that resulted in neither the pod crashing nor the metrics indicating a failed secret sync (for which we already have an alert).
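A general way to catch this failure mode (a syncer that is neither crashing nor reporting failures) is to alert on the staleness of a last-success timestamp rather than on a failure counter. This is only an illustrative sketch, not the metrics KES actually exposes:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// lastSyncSuccess is a hypothetical metric, not part of KES: it records the
// Unix time of the most recent successful secret sync. Alerting on
// time() - secret_sync_last_success_timestamp_seconds > threshold fires even
// when the syncer is stuck and producing neither successes nor failures.
var lastSyncSuccess = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "secret_sync_last_success_timestamp_seconds",
	Help: "Unix time of the last successful secret sync.",
})

func init() {
	prometheus.MustRegister(lastSyncSuccess)
}

// recordSuccessfulSync would be called at the end of each successful sync loop.
func recordSuccessfulSync() {
	lastSyncSuccess.SetToCurrentTime()
}
```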
What happened:
We're not able to run jobs across multiple projects; it always seems to affect the cluster `default`.
Example: kubernetes-sigs/release-sdk#169 (comment)
What you expected to happen:
Being able to run the jobs.
How to reproduce it (as minimally and precisely as possible):
Right now it reproduces across multiple repositories, including k/k.
Please provide links to example occurrences, if any:
Anything else we need to know?:
cc @kubernetes/sig-k8s-infra