Test cluster default reports unauthorized error #29003

Closed
saschagrunert opened this issue Mar 13, 2023 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.

Comments

@saschagrunert (Member)

What happened:
We're not able to run jobs across multiple projects; the failures always seem to affect the cluster default.

Example: kubernetes-sigs/release-sdk#169 (comment)

Pod can not be created: create pod test-pod ... 5bca66f65b in cluster default: Unauthorized BaseSHA:8a85aa260e42313a68b0ad487b537b2b616641fc

What you expected to happen:
Being able to run the jobs.

How to reproduce it (as minimally and precisely as possible):
Right now it reproduces across multiple repositories, including k/k.

Please provide links to example occurrences, if any:

Anything else we need to know?:
cc @kubernetes/sig-k8s-infra

@saschagrunert added the kind/bug label Mar 13, 2023
@k8s-ci-robot added the needs-sig label Mar 13, 2023
@saschagrunert (Member, Author)

/sig k8s-infra

@k8s-ci-robot added the sig/k8s-infra label and removed the needs-sig label Mar 13, 2023
@saschagrunert changed the title from "Test cluster default is down" to "Test cluster default reports unauthorized error" Mar 13, 2023
@ameukam (Member) commented Mar 13, 2023

This cluster is part of the Google-owned infrastructure. If this is critical, I'd advise moving jobs to the community-owned infrastructure by adding `cluster: k8s-infra-prow-build` to the job config (see the sketch below). The on-call folks are in PST, so it will take some time before an intervention happens.
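
For illustration, a minimal sketch of what that change looks like in a Prow job definition, assuming a hypothetical presubmit (the job name, image, and command below are placeholders, not an actual config in test-infra):

    presubmits:
      kubernetes-sigs/release-sdk:
        - name: pull-release-sdk-unit-test   # hypothetical job name
          cluster: k8s-infra-prow-build      # run pods on the community-owned build cluster instead of "default"
          decorate: true
          spec:
            containers:
              - image: golang:1.20           # placeholder image
                command: ["make", "test"]    # placeholder command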

@ybettan (Contributor) commented Mar 13, 2023

We are hitting this issue as well in https://github.com/kubernetes-sigs/kernel-module-management for all our PRs.

@a7i (Contributor) commented Mar 13, 2023

Hitting this issue as well in kubernetes-sigs/descheduler#937

@ArangoGutierrez (Contributor)

@dims (Member) commented Mar 13, 2023

one more report from @strongjz here:
https://kubernetes.slack.com/archives/CCK68P2Q2/p1678714493121869

description: 'Pod can not be created: create pod test-pods/3ccff4bf-c171-11ed-97cb-9a5bca66f65b
    in cluster default: Unauthorized'

@dims (Member) commented Mar 13, 2023

error coming from prow reconciler?
https://cs.k8s.io/?q=Pod%20can%20not%20be%20created&i=nope&files=&excludeFiles=&repos=

specifically:

return "", "", fmt.Errorf("create pod %s in cluster %s: %w", podName.String(), pj.ClusterAlias(), err)

@shaneutt (Member)

In case it's helpful, we're getting this for multiple PRs in https://github.com/kubernetes-sigs/gateway-api as well this morning.

@torredil (Member)

Running into this over at https://github.com/kubernetes-sigs/aws-ebs-csi-driver as well.

@sanposhiho (Member) commented Mar 13, 2023

Feature-changing PRs in k/k are affected via pull-kubernetes-e2e-gce-cos-alpha-features.
See https://prow.k8s.io/?job=pull-kubernetes-e2e-gce-cos-alpha-features
No jobs have succeeded for a while.

@CecileRobertMichon (Member)

CAPZ PR tests are also affected

I see some folks tried to switch to the community cluster but that PR itself is hitting the issue in test-infra #29008 :(

@BenTheElder (Member)

Please don't switch everything to the community cluster: we're still very, very tight on GCP budget this year, and that cluster has already had capacity issues of late. We don't want to resolve those by increasing autoscaling capacity given the tight budget (we're still on track for at least $3.4M of spend against $3M in credits this year and are actively working to cut costs).

There is an EKS cluster coming online that workloads could switch to in the near future. Hopefully we'll have this resolved before then anyhow though.

@cjwagner (Member)

The kubeconfig for the default build cluster doesn't seem to be sufficient for accessing the build cluster any more. I've reproduced locally using the kubeconfig from the cluster.

I noticed a single failure of the gencred job, but the timing doesn't seem to align with when the issue first started. And a successful rerun of the job did not resolve the issue despite indicating the default cluster was successfully processed. https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-test-infra-gencred-refresh-kubeconfig
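
For reference, the local reproduction is essentially pointing kubectl at the same kubeconfig Prow uses for the default build cluster (the file name below is an assumption); stale credentials show up roughly as:

    $ kubectl --kubeconfig ./default-build-cluster.kubeconfig get pods -n test-pods
    error: You must be logged in to the server (Unauthorized)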

@cjwagner (Member)

Ah I see a lot of errors from the kubernetes-external-secrets deployment like the following. I think that could be the explanation for the kubeconfig going stale:

{"level":30,"message_time":"2023-03-13T19:25:54.105Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{},"msg":"starting poller for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}
{"level":50,"message_time":"2023-03-13T19:25:54.106Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{"err":{"type":"TypeError","message":"Cannot read property 'get' of undefined","stack":"TypeError: Cannot read property 'get' of undefined\n    at Poller._scheduleNextPoll (/app/lib/poller.js:361:30)\n    at Poller.start (/app/lib/poller.js:415:10)\n    at Daemon._addPoller (/app/lib/daemon.js:59:43)\n    at Daemon.start (/app/lib/daemon.js:89:16)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)"}},"msg":"status check went boom for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}

I've kicked over the pod and now it has synced the secret.
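
"Kicking over" the pod amounts to roughly the following (the namespace is an assumption; the pod name comes from the log lines above):

    # Restart the deployment so a fresh pod re-establishes its pollers:
    $ kubectl -n kubernetes-external-secrets rollout restart deployment/kubernetes-external-secrets
    # Or delete the stuck pod and let the Deployment recreate it:
    $ kubectl -n kubernetes-external-secrets delete pod kubernetes-external-secrets-5f98c9ff97-ngs9k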

@shaneutt (Member)

Jobs are running again for Gateway API, thank you @cjwagner 🖖

@dims (Member) commented Mar 13, 2023

thanks @cjwagner !

@cjwagner (Member)

Things should be fixed now. It seems the root cause of this outage was the KES (kubernetes-external-secrets) deployment getting stuck on an internal error that resulted in neither the pod crashing nor the metrics indicating a failed secret sync (for which we already have an alert).
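
To illustrate the monitoring gap described above (hypothetical metric and alert names, not the actual prow-monitoring rules): an alert keyed to explicit sync-error metrics never fires when the poller silently stops polling, which is what happened here.

    groups:
      - name: external-secrets
        rules:
          - alert: ExternalSecretSyncErrors                  # hypothetical alert name
            expr: increase(kes_sync_errors_total[30m]) > 0   # hypothetical metric name
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: kubernetes-external-secrets is reporting failed secret syncs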
