Test cluster default reports unauthorized error #29003

Closed
saschagrunert opened this issue Mar 13, 2023 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.

Comments

@saschagrunert (Member)

What happened:
We're not able to run jobs across multiple projects; the failures always seem to affect the cluster default.

Example: kubernetes-sigs/release-sdk#169 (comment)

Pod can not be created: create pod test-pod ... 5bca66f65b in cluster default: Unauthorized BaseSHA:8a85aa260e42313a68b0ad487b537b2b616641fc

What you expected to happen:
Being able to run the jobs.

How to reproduce it (as minimally and precisely as possible):
Right now it reproduces across multiple repositories, including k/k.

Please provide links to example occurrences, if any:

Anything else we need to know?:
cc @kubernetes/sig-k8s-infra

@saschagrunert added the kind/bug label Mar 13, 2023
@k8s-ci-robot added the needs-sig label Mar 13, 2023
@saschagrunert (Member, Author)

/sig k8s-infra

@k8s-ci-robot added the sig/k8s-infra label and removed the needs-sig label Mar 13, 2023
@saschagrunert changed the title from "Test cluster default is down" to "Test cluster default reports unauthorized error" Mar 13, 2023
@ameukam (Member) commented Mar 13, 2023

This cluster is part of the Google-owned infrastructure. If this is critical, I'd advise moving jobs to the community-owned infrastructure by adding `cluster: k8s-infra-prow-build` to the job config (see the sketch below). The on-call folks are in PST, so it will take some time before an intervention happens.
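
For illustration, a minimal sketch of what that change looks like in a Prow job definition, assuming a hypothetical presubmit (the job name, image, and command below are placeholders, not an actual config in test-infra):

    presubmits:
      kubernetes-sigs/release-sdk:
        - name: pull-release-sdk-unit-test   # hypothetical job name
          cluster: k8s-infra-prow-build      # run pods on the community-owned build cluster instead of "default"
          decorate: true
          spec:
            containers:
              - image: golang:1.20           # placeholder image
                command: ["make", "test"]    # placeholder command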

@ybettan (Contributor) commented Mar 13, 2023

We are hitting this issue as well in https://github.com/kubernetes-sigs/kernel-module-management for all our PRs.

@a7i (Contributor) commented Mar 13, 2023

Hitting this issue as well in kubernetes-sigs/descheduler#937

@ArangoGutierrez (Contributor)

@dims (Member) commented Mar 13, 2023

one more report from @strongjz here:
https://kubernetes.slack.com/archives/CCK68P2Q2/p1678714493121869

description: 'Pod can not be created: create pod test-pods/3ccff4bf-c171-11ed-97cb-9a5bca66f65b
    in cluster default: Unauthorized'

@dims (Member) commented Mar 13, 2023

error coming from prow reconciler?
https://cs.k8s.io/?q=Pod%20can%20not%20be%20created&i=nope&files=&excludeFiles=&repos=

specifically:

return "", "", fmt.Errorf("create pod %s in cluster %s: %w", podName.String(), pj.ClusterAlias(), err)

@shaneutt (Member)

In case it's helpful, we're getting this for multiple PRs in https://github.com/kubernetes-sigs/gateway-api as well this morning.

@torredil (Member)

Running into this over at https://github.com/kubernetes-sigs/aws-ebs-csi-driver as well.

@sanposhiho (Member) commented Mar 13, 2023

Feature-changing PRs in k/k are affected via pull-kubernetes-e2e-gce-cos-alpha-features.
See https://prow.k8s.io/?job=pull-kubernetes-e2e-gce-cos-alpha-features
No jobs have succeeded for a while.

@CecileRobertMichon (Member)

CAPZ PR tests are also affected

I see some folks tried to switch to the community cluster but that PR itself is hitting the issue in test-infra #29008 :(

@BenTheElder (Member)

Please don't switch everything to the community cluster: we're still very, very tight on GCP budget this year, and that cluster has already had capacity issues of late. We don't want to resolve those by increasing autoscaling capacity given the tight budget (we're still on track for at least $3.4M of spend against $3M in credits this year and are actively working to cut costs).

There is an EKS cluster coming online that workloads could switch to in the near future. Hopefully we'll have this resolved before then anyhow though.

@cjwagner (Member)

The kubeconfig for the default build cluster doesn't seem to be sufficient for accessing the build cluster any more. I've reproduced locally using the kubeconfig from the cluster.

I noticed a single failure of the gencred job, but the timing doesn't seem to align with when the issue first started. And a successful rerun of the job did not resolve the issue despite indicating the default cluster was successfully processed. https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-test-infra-gencred-refresh-kubeconfig
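
For reference, the local reproduction is essentially pointing kubectl at the same kubeconfig Prow uses for the default build cluster (the file name below is an assumption); stale credentials show up roughly as:

    $ kubectl --kubeconfig ./default-build-cluster.kubeconfig get pods -n test-pods
    error: You must be logged in to the server (Unauthorized)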

@cjwagner (Member)

Ah I see a lot of errors from the kubernetes-external-secrets deployment like the following. I think that could be the explanation for the kubeconfig going stale:

{"level":30,"message_time":"2023-03-13T19:25:54.105Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{},"msg":"starting poller for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}
{"level":50,"message_time":"2023-03-13T19:25:54.106Z","pid":18,"hostname":"kubernetes-external-secrets-5f98c9ff97-ngs9k","payload":{"err":{"type":"TypeError","message":"Cannot read property 'get' of undefined","stack":"TypeError: Cannot read property 'get' of undefined\n    at Poller._scheduleNextPoll (/app/lib/poller.js:361:30)\n    at Poller.start (/app/lib/poller.js:415:10)\n    at Daemon._addPoller (/app/lib/daemon.js:59:43)\n    at Daemon.start (/app/lib/daemon.js:89:16)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)"}},"msg":"status check went boom for prow-monitoring/prometheus-alert-slack-post-testing-ops-secret-url"}

I've kicked over the pod and now it has synced the secret.
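
"Kicking over" the pod amounts to roughly the following (the namespace is an assumption; the pod name comes from the log lines above):

    # Restart the deployment so a fresh pod re-establishes its pollers:
    $ kubectl -n kubernetes-external-secrets rollout restart deployment/kubernetes-external-secrets
    # Or delete the stuck pod and let the Deployment recreate it:
    $ kubectl -n kubernetes-external-secrets delete pod kubernetes-external-secrets-5f98c9ff97-ngs9k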

@shaneutt (Member)

Jobs are running again for Gateway API, thank you @cjwagner 🖖

@dims (Member) commented Mar 13, 2023

thanks @cjwagner !

@cjwagner (Member)

Things should be fixed now. It seems the root cause of this outage was the KES (kubernetes-external-secrets) deployment getting stuck on an internal error that resulted in neither the pod crashing nor the metrics indicating a failed secret sync (for which we already have an alert).
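
To illustrate the monitoring gap described above (hypothetical metric and alert names, not the actual prow-monitoring rules): an alert keyed to explicit sync-error metrics never fires when the poller silently stops polling, which is what happened here.

    groups:
      - name: external-secrets
        rules:
          - alert: ExternalSecretSyncErrors                  # hypothetical alert name
            expr: increase(kes_sync_errors_total[30m]) > 0   # hypothetical metric name
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: kubernetes-external-secrets is reporting failed secret syncs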
