Cache deployer fails if the cluster signer is not set #4505

Closed
davidspek opened this issue Sep 16, 2020 · 23 comments · Fixed by #4525

Comments

@davidspek
Contributor

davidspek commented Sep 16, 2020

What steps did you take:

When deploying Kubeflow using kfctl_istio_dex.v1.1.0.yaml on a Charmed Kubernetes 1.19 cluster, the cache-server and cache-deployer-deployment pods get stuck in PodInitializing and CrashLoopBackOff respectively. The cache-server pod shows the error MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found. Redeploying either or both of the pods does not fix the issue. The cache-deployer-deployment pod gives the following logs:

+ echo 'Start deploying cache service to existing cluster:'
+ NAMESPACE=kubeflow
+ MUTATING_WEBHOOK_CONFIGURATION_NAME=cache-webhook-kubeflow
+ WEBHOOK_SECRET_NAME=webhook-server-tls
Start deploying cache service to existing cluster:
+ kubectl get mutatingwebhookconfigurations cache-webhook-kubeflow --namespace kubeflow --ignore-not-found
+ kubectl get secrets webhook-server-tls --namespace kubeflow --ignore-not-found
+ webhook_config_exists=false
+ grep cache-webhook-kubeflow -w
+ webhook_secret_exists=false
+ grep webhook-server-tls -w
+ '[' false '==' true ]
+ '[' false '==' true ]
+ '[' false '==' true ]
+ export 'CA_FILE=ca_cert'
+ rm -f ca_cert
+ touch ca_cert
+ ./webhook-create-signed-cert.sh --namespace kubeflow --cert_output_path ca_cert --secret webhook-server-tls
+ [[ 6 -gt 0 ]]
+ case ${1} in
+ namespace=kubeflow
+ shift
+ shift
+ [[ 4 -gt 0 ]]
+ case ${1} in
+ cert_output_path=ca_cert
+ shift
+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kubeflow ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kubeflow
++ mktemp -d
+ tmpdir=/tmp/tmp.KGlEMA
+ echo 'creating certs in tmpdir /tmp/tmp.KGlEMA '
creating certs in tmpdir /tmp/tmp.KGlEMA 
+ cat
+ openssl genrsa -out /tmp/tmp.KGlEMA/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
.......................................................................................+++++
...................................................................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.KGlEMA/server-key.pem -subj /CN=cache-server.kubeflow.svc -out /tmp/tmp.KGlEMA/server.csr -config /tmp/tmp.KGlEMA/csr.conf
+ echo 'start running kubectl...'
start running kubectl...
+ kubectl delete csr cache-server.kubeflow
certificatesigningrequest.certificates.k8s.io "cache-server.kubeflow" deleted
+ cat
+ kubectl create -f -
++ cat /tmp/tmp.KGlEMA/server.csr
++ base64
++ tr -d '\n'
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow created
+ true
+ kubectl get csr cache-server.kubeflow
NAME                    AGE   SIGNERNAME                     REQUESTOR                                                             CONDITION
cache-server.kubeflow   0s    kubernetes.io/legacy-unknown   system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa   Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kubeflow
No resources found
error: no kind "CertificateSigningRequest" is registered for version "certificates.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/kubectl/scheme/scheme.go:28"

The cache-server.kubeflow csr is stuck in a Pending condition. However, manually running kubectl certificate approve cache-server.kubeflow does work.
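
To check the state of the CSR without reading the whole trace, the CONDITION column can be pulled out of the `kubectl get csr` table. This is a minimal sketch, not part of the deployer script; the helper name `csr_condition` is hypothetical, and it assumes CONDITION is the last column, as in the output above.

```shell
#!/usr/bin/env bash
# Print the CONDITION column for one CSR from `kubectl get csr` table output.
# The table is read from stdin so the helper can be tried without a cluster.
# Assumption: CONDITION is the last whitespace-separated field on the row.
csr_condition() {
  local name="$1"
  awk -v n="$name" '$1 == n { print $NF }'
}

# Hypothetical usage against a live cluster:
#   kubectl get csr | csr_condition cache-server.kubeflow
```

A value of `Pending` means the request was never approved; `Approved` without `Issued` means no signer has produced a certificate yet.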

The following pull requests seem to be related:
openshift/oc#501
openshift/installer#3943

Environment:

Charmed Kubernetes 1.19 running on Ubuntu 20.04.1.

How did you deploy Kubeflow Pipelines (KFP)?
full Kubeflow deployment

/kind bug
/area backend

@davidspek
Contributor Author

/area backend

@Ark-kun Ark-kun self-assigned this Sep 18, 2020
@Ark-kun
Contributor

Ark-kun commented Sep 18, 2020

I wonder what would be the best way to deal with this issue.
The request we send is v1beta1, not v1. This looks like a bug in some version of kubectl.

@davidspek
Contributor Author

@Ark-kun, could this have to do with the cluster being Kubernetes 1.19 and its changes with regard to beta APIs?

@Ark-kun
Contributor

Ark-kun commented Sep 20, 2020

Maybe the problem is related to the version mismatch between kubectl version in the container and the Kubernetes server version.
Kubectl is v1.16 in the deployer container:

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.9", GitCommit:"a17149e1a189050796ced469dbd78d380f2ed5ef", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:51Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
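
The mismatch described here can be checked mechanically. The sketch below assumes the usual Kubernetes support policy that kubectl should be within one minor version of the API server; the helper names `minor_of` and `within_skew` are hypothetical, and the major version is ignored for brevity.

```shell
#!/usr/bin/env bash
# Extract the minor version number from a string such as "v1.16.9" or "1.19+".
minor_of() {
  local v="${1#v}"               # drop a leading "v"
  v="${v#*.}"                    # drop the major version and the first dot
  v="${v%%.*}"                   # keep everything up to the next dot
  printf '%s\n' "${v%%[!0-9]*}"  # strip any non-numeric suffix such as "+"
}

# Succeed if client and server minor versions differ by at most one.
# Assumption: both versions share the same major version.
within_skew() {
  local c s d
  c="$(minor_of "$1")"
  s="$(minor_of "$2")"
  d=$(( c - s ))
  [ "${d#-}" -le 1 ]
}

# Example with the client version above and an assumed v1.19.0 server:
#   within_skew v1.16.9 v1.19.0 || echo "kubectl is too far from the server version"
```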

@Ark-kun Ark-kun changed the title from "webhook-server-tls secret not created" to "Cache deployer fails in Kubernetes 1.19" Sep 21, 2020
Ark-kun added a commit to Ark-kun/pipelines that referenced this issue Sep 21, 2020
k8s-ci-robot pushed a commit that referenced this issue Sep 30, 2020
…rver (#4525)

* Cache deployer - Using the same kubectl version as the server

Fixes #4505

* Changed the PATH precedence

* Unquoted the jq output

* Fixed the curl options
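
The commit bullets above suggest the deployer fetches a kubectl binary that matches the server. The following is only a sketch of that idea, not a copy of the code in #4525; the function names are hypothetical, and the `dl.k8s.io` release URL layout is an assumption about where upstream binaries are published.

```shell
#!/usr/bin/env bash
# Build the download URL for a kubectl binary matching a given server version.
# Assumption: upstream release binaries live under dl.k8s.io/release/<version>/.
kubectl_url() {
  local version="$1" os="${2:-linux}" arch="${3:-amd64}"
  printf 'https://dl.k8s.io/release/%s/bin/%s/%s/kubectl\n' "$version" "$os" "$arch"
}

# Read the server's gitVersion. Needs kubectl, jq, and cluster access.
# --raw-output prints the string without JSON quotes.
server_git_version() {
  kubectl version --output json | jq --raw-output '.serverVersion.gitVersion'
}

# Hypothetical usage inside a deployer container:
#   curl --fail --location --output /tmp/kubectl "$(kubectl_url "$(server_git_version)")"
```

Managed distributions often report a vendor suffix in gitVersion, so a matching upstream binary may not exist for every server version.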
Bobgy pushed a commit that referenced this issue Oct 10, 2020
…rver (#4525)

* Cache deployer - Using the same kubectl version as the server

Fixes #4505

* Changed the PATH precedence

* Unquoted the jq output

* Fixed the curl options
@davidspek
Contributor Author

@Ark-kun It doesn't seem like this issue has been resolved. I just deployed Kubeflow 1.2 on Kubernetes 1.19.4 and the cache-server and cache-deployer-deployment are still stuck with errors.

I have spotted two identical Certificate Signing Requests, one in namespace istio-system and the other in kubeflow. I remember there was an issue where the CSR was not being approved. It is approved now, but I don't think the certificate is getting issued.

kubectl describe csr cache-server.kubeflow -n istio-system

Name:               cache-server.kubeflow
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Fri, 20 Nov 2020 22:54:30 +0100
Requesting User:    system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa
Signer:             kubernetes.io/legacy-unknown
Status:             Approved
Subject:
  Common Name:    cache-server.kubeflow.svc
  Serial Number:  
Subject Alternative Names:
         DNS Names:  cache-server
                     cache-server.kubeflow
                     cache-server.kubeflow.svc
Events:  <none>

Cache-deployer-deployment logs:


+ shift
+ [[ 2 -gt 0 ]]
+ case ${1} in
+ secret=webhook-server-tls
+ shift
+ shift
+ [[ 0 -gt 0 ]]
+ '[' -z ']'
+ service=cache-server
+ '[' -z webhook-server-tls ']'
+ '[' -z kubeflow ']'
+ '[' -z ca_cert ']'
++ command -v openssl
+ '[' '!' -x /usr/bin/openssl ']'
+ csrName=cache-server.kubeflow
++ mktemp -d
+ tmpdir=/tmp/tmp.meigEj
+ echo 'creating certs in tmpdir /tmp/tmp.meigEj '
creating certs in tmpdir /tmp/tmp.meigEj 
+ cat
+ openssl genrsa -out /tmp/tmp.meigEj/server-key.pem 2048
Generating RSA private key, 2048 bit long modulus (2 primes)
............................................+++++
..........................................................................+++++
e is 65537 (0x010001)
+ openssl req -new -key /tmp/tmp.meigEj/server-key.pem -subj /CN=cache-server.kubeflow.svc -out /tmp/tmp.meigEj/server.csr -config /tmp/tmp.meigEj/csr.conf
+ echo 'start running kubectl...'
+ kubectl delete csr cache-server.kubeflow
start running kubectl...
certificatesigningrequest.certificates.k8s.io "cache-server.kubeflow" deleted
+ cat
+ kubectl create -f -
++ cat /tmp/tmp.meigEj/server.csr
++ base64
++ tr -d '\n'
Warning: certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow created
+ true
+ kubectl get csr cache-server.kubeflow
NAME                    AGE   SIGNERNAME                     REQUESTOR                                                             CONDITION
cache-server.kubeflow   0s    kubernetes.io/legacy-unknown   system:serviceaccount:kubeflow:kubeflow-pipelines-cache-deployer-sa   Pending
+ '[' 0 -eq 0 ']'
+ break
+ kubectl certificate approve cache-server.kubeflow
certificatesigningrequest.certificates.k8s.io/cache-server.kubeflow approved
++ seq 10
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ for x in $(seq 10)
++ kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
+ serverCert=
+ [[ '' != '' ]]
+ sleep 1
+ [[ '' == '' ]]
+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1
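
The trace above is a fixed-count polling loop: read `.status.certificate`, sleep, and give up after 10 tries. A compact sketch of the same pattern follows; the helper name `poll_nonempty` is hypothetical and this is not the deployer's actual code.

```shell
#!/usr/bin/env bash
# Run a command up to <attempts> times, sleeping <delay> seconds between tries,
# until it prints non-empty output. Prints that output and returns 0 on success,
# returns 1 if every attempt produced nothing.
poll_nonempty() {
  local attempts="$1" delay="$2" out i
  shift 2
  for i in $(seq "$attempts"); do
    out="$("$@")" || true   # a failing command counts as an empty result
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage mirroring the log:
#   poll_nonempty 10 1 kubectl get csr cache-server.kubeflow -o 'jsonpath={.status.certificate}'
```

Raising the attempt count only helps if a signer will eventually act on the request; with no signer configured, the field stays empty however long the loop waits.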

@davidspek
Contributor Author

I think the issue is caused by the fact that signerName is a required field that is not set, and kubernetes.io/legacy-unknown has been removed from Kubernetes 1.19. It will need to be replaced by kubernetes.io/kube-apiserver-client, kubernetes.io/kube-apiserver-client-kubelet or kubernetes.io/kubelet-serving.
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
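
One way to catch a missing or unexpected signer before submitting a request is to inspect the manifest locally. This is a rough sketch: both helper names are hypothetical, the accepted list is just the three signers named in this comment, and the manifest is assumed to have `signerName` on its own line.

```shell
#!/usr/bin/env bash
# Print the signerName value found in a CSR manifest read from stdin.
# Prints nothing if the field is absent.
signer_of_manifest() {
  sed -n 's/^[[:space:]]*signerName:[[:space:]]*//p'
}

# Succeed if $1 is one of the three signers suggested in the comment above.
is_suggested_signer() {
  case "$1" in
    kubernetes.io/kube-apiserver-client) return 0 ;;
    kubernetes.io/kube-apiserver-client-kubelet) return 0 ;;
    kubernetes.io/kubelet-serving) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical usage:
#   is_suggested_signer "$(signer_of_manifest < csr.yaml)" || echo "check signerName"
```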

@davidspek
Contributor Author

It would seem that it might also be because --cluster-signing-cert-file and --cluster-signing-key-file need to be set for kube-controller-manager. I'm not sure whether the docs mention this as a requirement, but if it is indeed required, it should be stated there, similar to how the JWT requirement for Istio is stated.

@davidspek
Contributor Author

As one would expect, the cause was that --cluster-signing-cert-file and --cluster-signing-key-file were not set for kube-controller-manager.
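
For anyone wanting to check their own cluster, a small test over the kube-controller-manager command line can confirm whether both flags are present. This is a sketch only; `has_cluster_signing_flags` is a hypothetical helper, and how the arguments are obtained depends on how the control plane is run.

```shell
#!/usr/bin/env bash
# Succeed only if both cluster-signing flags appear in the given arguments.
# Accepts either "--flag=value" or "--flag value" spellings.
has_cluster_signing_flags() {
  local args="$*"
  case "$args" in
    *--cluster-signing-cert-file*) ;;
    *) return 1 ;;
  esac
  case "$args" in
    *--cluster-signing-key-file*) ;;
    *) return 1 ;;
  esac
  return 0
}

# Hypothetical usage on a control-plane node where the process is visible:
#   has_cluster_signing_flags $(ps -o args= -C kube-controller-manager) \
#     || echo "cluster signer is not configured"
```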

@Bobgy
Contributor

Bobgy commented Nov 21, 2020

/reopen
Thanks @davidspek for the investigation

@k8s-ci-robot
Contributor

@Bobgy: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Nov 21, 2020
@Bobgy
Contributor

Bobgy commented Nov 21, 2020

To make the webhook setup process more stable, we should seriously consider #4695

@davidspek
Contributor Author

I would also suggest using cert-manager, as it seems the other applications are using that as well. Also, for my specific situation with Canonical's CDK, it is a manual multi-step process to copy the ca.key from the EasyRSA node to the master nodes due to the permissions on the file.

Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…rver (kubeflow#4525)

* Cache deployer - Using the same kubectl version as the server

Fixes kubeflow#4505

* Changed the PATH precedence

* Unquoted the jq output

* Fixed the curl options
@davidspek davidspek changed the title from "Cache deployer fails in Kubernetes 1.19" to "Cache deployer fails if the cluster signer is not set" Jan 13, 2021
@stale

stale bot commented Jun 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 26, 2021
@grmoktan

Hi @davidspek :

I am also getting the same error:

+ echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.
+ exit 1

And I have no way of setting the --cluster-signing-cert-file and --cluster-signing-key-file from my side as the rancher kubernetes deployment is managed elsewhere.

Is there an example of what the cert-manager approach entails?

I'm trying to deploy kubeflow v1.3-branch with kustomize.
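
As a rough illustration of what the cert-manager approach could look like, the sketch below prints a self-signed Issuer and a Certificate for the cache-server service. It is not taken from the Kubeflow manifests. It assumes cert-manager is installed and serves the `cert-manager.io/v1` API, and the resource names `cache-server-selfsigned` and `cache-server-cert` are made up for the example.

```shell
#!/usr/bin/env bash
# Print a cert-manager Issuer and Certificate that would write a TLS secret
# for the cache-server service. Arguments: namespace, service, secret name.
# Assumption: cert-manager with the cert-manager.io/v1 API is installed.
render_cert_manager_manifest() {
  local ns="${1:-kubeflow}" svc="${2:-cache-server}" secret="${3:-webhook-server-tls}"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: ${svc}-selfsigned
  namespace: ${ns}
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ${svc}-cert
  namespace: ${ns}
spec:
  secretName: ${secret}
  dnsNames:
  - ${svc}
  - ${svc}.${ns}
  - ${svc}.${ns}.svc
  issuerRef:
    kind: Issuer
    name: ${svc}-selfsigned
EOF
}

# Hypothetical usage:
#   render_cert_manager_manifest kubeflow cache-server webhook-server-tls | kubectl apply -f -
```

Two caveats: the mutating webhook configuration would still need the CA bundle (cert-manager can inject it through its `cert-manager.io/inject-ca-from` annotation), and the key names inside a cert-manager secret (`tls.crt`, `tls.key`) may differ from what the cache server expects from the deployer script.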

@stale stale bot removed the lifecycle/stale label Jul 23, 2021
@cavepopo

Getting hit by this very same behaviour.
Rancher 2.5.8, k8s v1.20.9.

Any workaround? I'm still too noobish to hack the cert-manager and other resources...

@Bobgy
Contributor

Bobgy commented Aug 21, 2021

I highly recommend checking out v2 caching now, it does not depend on any privilege.

https://www.kubeflow.org/docs/components/pipelines/caching-v2/

@cavepopo

I highly recommend checking out v2 caching now, it does not depend on any privilege.

https://www.kubeflow.org/docs/components/pipelines/caching-v2/

Hi @Bobgy ,
Thanks for the tips, excuse my noobiness but how should I use caching-v2 on an existing install or for a new install?

Thanks

@Bobgy
Contributor

Bobgy commented Aug 24, 2021

@cavepopo no worries. You'll need to either upgrade your existing install or make a new install.
Note the version requirement (actually latest release is KFP 1.7.0-rc.4):

Kubeflow Pipelines 1.7.0

@kaben

kaben commented Nov 17, 2021

Dunno whether this is solved yet. The problem might be in backend/src/cache/deployer/webhook-create-signed-cert.sh line 118, where a CertificateSigningRequest is created with usages including - server auth.

It might need to be replaced with - client auth. See Certificate Signing Request, Kubernetes signers, 1.4:

Permitted key usages - must include ["client auth"]. Must not include key usages beyond ["digital signature", "key encipherment", "client auth"].

The generated CertificateSigningRequest would then read something like this:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - digital signature
  - key encipherment
  - client auth

with suitable replacements in the metadata.name and spec.request fields.
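
The substitution described above could be wrapped in a small function that fills in the name, signer, and encoded request. This is a sketch of the idea rather than a patch to webhook-create-signed-cert.sh; `render_csr` is a hypothetical helper, and the usages list is copied from the manifest above.

```shell
#!/usr/bin/env bash
# Print a certificates.k8s.io/v1 CertificateSigningRequest manifest.
# Arguments: CSR object name, signer name, path to the PEM-encoded request.
render_csr() {
  local name="$1" signer="$2" csr_file="$3" request
  # The API expects the PEM request base64-encoded on a single line.
  request="$(base64 < "$csr_file" | tr -d '\n')"
  cat <<EOF
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${name}
spec:
  groups:
  - system:authenticated
  request: ${request}
  signerName: ${signer}
  usages:
  - digital signature
  - key encipherment
  - client auth
EOF
}

# Hypothetical usage with the variables from the deployer script:
#   render_csr "${csrName}" kubernetes.io/kube-apiserver-client "${tmpdir}/server.csr" | kubectl create -f -
```

Note that kubernetes.io/kube-apiserver-client issues client certificates, so whether the result is suitable for a webhook's serving endpoint is a separate question this thread does not settle.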

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@xubofei1983

Is this still an issue? I have EKS 1.26 and deployed from the master branch (2.x), and I still get no certificate on the CSR, and cache-deployer crashes in a loop.

@stale stale bot removed the lifecycle/stale label Nov 22, 2023
@rimolive
Member

rimolive commented Mar 7, 2024

Looks like it's not an issue anymore. I'll close it but feel free to reopen if the issue persists.

/close


@rimolive: Closing this issue.

