mount failure in Azure File CSI migration #96508

Closed
andyzhangx opened this issue Nov 12, 2020 · 10 comments · Fixed by #97877
Labels
area/provider/azure Issues or PRs related to azure provider kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@andyzhangx
Member

What happened:
After turning on CSIMigration=true,CSIMigrationAzureFile=true on 1.19, the Azure File CSI migration e2e test fails. The main error is as follows: fetching NodeStageSecretRef default/azure-storage-account-f5ffebf7f32ce4953afa073-secret failed: kubernetes.io/csi: failed to find the secret azure-storage-account-f5ffebf7f32ce4953afa073-secret in the namespace default with error: secrets "azure-storage-account-f5ffebf7f32ce4953afa073-secret" is forbidden: User "system:node:k8s-agentpool1-29483490-0" cannot get resource "secrets" in API group "" in the namespace "default": no relationship found between node 'k8s-agentpool1-29483490-0' and this object

  • detailed error msg:
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:06:48 +0000 UTC - event for pvc-rhnkx: {file.csi.azure.com_k8s-agentpool1-29483490-0_d0817931-ea0f-48bb-9067-870768068fc0 } Provisioning: External provisioner is provisioning volume for claim "azurefile-631/pvc-rhnkx"
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:06:48 +0000 UTC - event for pvc-rhnkx: {persistentvolume-controller } ExternalProvisioning: waiting for a volume to be created, either by external provisioner "file.csi.azure.com" or manually created by system administrator
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:07:06 +0000 UTC - event for pvc-rhnkx: {file.csi.azure.com_k8s-agentpool1-29483490-0_d0817931-ea0f-48bb-9067-870768068fc0 } ProvisioningSucceeded: Successfully provisioned volume pvc-d848d23d-347e-40f9-a3a9-0b8efa729b88
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:07:08 +0000 UTC - event for azurefile-volume-tester-965tj: {default-scheduler } Scheduled: Successfully assigned azurefile-631/azurefile-volume-tester-965tj to k8s-agentpool1-29483490-0
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:07:08 +0000 UTC - event for azurefile-volume-tester-965tj: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-d848d23d-347e-40f9-a3a9-0b8efa729b88" 
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:07:24 +0000 UTC - event for azurefile-volume-tester-965tj: {kubelet k8s-agentpool1-29483490-0} FailedMount: MountVolume.MountDevice failed for volume "pvc-d848d23d-347e-40f9-a3a9-0b8efa729b88" : fetching NodeStageSecretRef default/azure-storage-account-f5ffebf7f32ce4953afa073-secret failed: kubernetes.io/csi: failed to find the secret azure-storage-account-f5ffebf7f32ce4953afa073-secret in the namespace default with error: secrets "azure-storage-account-f5ffebf7f32ce4953afa073-secret" is forbidden: User "system:node:k8s-agentpool1-29483490-0" cannot get resource "secrets" in API group "" in the namespace "default": no relationship found between node 'k8s-agentpool1-29483490-0' and this object
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:09:11 +0000 UTC - event for azurefile-volume-tester-965tj: {kubelet k8s-agentpool1-29483490-0} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume-1], unattached volumes=[test-volume-1 default-token-np7s2]: timed out waiting for the condition
Nov  7 07:22:29.349: INFO: At 2020-11-07 07:13:47 +0000 UTC - event for azurefile-volume-tester-965tj: {kubelet k8s-agentpool1-29483490-0} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume-1], unattached volumes=[default-token-np7s2 test-volume-1]: timed out waiting for the condition

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_azurefile-csi-driver/430/pull-azurefile-csi-driver-e2e-migration/1324965466602475520/build-log.txt

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.19.0
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/priority important-soon
/sig cloud-provider
/area provider/azure
/triage accepted
/sig storage

@andyzhangx andyzhangx added the kind/bug Categorizes issue or PR as related to a bug. label Nov 12, 2020
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. area/provider/azure Issues or PRs related to azure provider triage/accepted Indicates an issue or PR is ready to be actively worked on. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Nov 12, 2020
@andyzhangx
Member Author

cc @msau42

@msau42
Member

msau42 commented Nov 12, 2020

@andyzhangx can you paste what the in-tree PV object looks like vs CSI PV object?

@msau42
Member

msau42 commented Nov 12, 2020

@kubernetes/sig-storage-bugs

@msau42
Member

msau42 commented Nov 13, 2020

cc @mattcary

@andyzhangx
Member Author

current issue:

  • secret is in namespace default

If the PVC and pod are not in the default namespace, the mount fails with the following error; if the PVC and pod are in the default namespace, it works.

  Normal   Scheduled               16s                default-scheduler                  Successfully assigned test/deployment-azurefile-6fb9979464-6p8zb to k8s-agentpool-28749203-0
  Normal   SuccessfulAttachVolume  16s                attachdetach-controller            AttachVolume.Attach succeeded for volume "pvc-01bd1d0b-b965-4d37-8331-817a296c8998"
  Warning  FailedMount             1s (x5 over 9s)    kubelet, k8s-agentpool-28749203-0  MountVolume.MountDevice failed for volume "pvc-01bd1d0b-b965-4d37-8331-817a296c8998" : fetching NodeStageSecretRef default/azure-storage-account-f97cecd00f4d4433ebd6809-secret failed: kubernetes.io/csi: failed to find the secret azure-storage-account-f97cecd00f4d4433ebd6809-secret in the namespace default with error: secrets "azure-storage-account-f97cecd00f4d4433ebd6809-secret" is forbidden: User "system:node:k8s-agentpool-28749203-0" cannot get resource "secrets" in API group "" in the namespace "default": no relationship found between node 'k8s-agentpool-28749203-0' and this object
  • PV config:
$ k get pv pvc-01bd1d0b-b965-4d37-8331-817a296c8998 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/azure-file-resource-group: andy-1191
    pv.kubernetes.io/provisioned-by: file.csi.azure.com
spec:
  accessModes:
  - ReadWriteMany
  azureFile:
    secretName: azure-storage-account-f97cecd00f4d4433ebd6809-secret
    secretNamespace: null
    shareName: pvc-01bd1d0b-b965-4d37-8331-817a296c8998
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-azurefile
    namespace: test
  persistentVolumeReclaimPolicy: Delete
  storageClassName: azurefile-456-kubernetes.io-azure-file-dynamic-sc-4hv88
  volumeMode: Filesystem
status:
  phase: Bound
  • PVC config
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-azurefile
  namespace: test
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: azurefile-456-kubernetes.io-azure-file-dynamic-sc-4hv88
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: deployment-azurefile
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      name: deployment-azurefile
    spec:
      nodeSelector:
        "kubernetes.io/os": linux
      containers:
        - name: deployment-azurefile
          image: mcr.microsoft.com/oss/nginx/nginx:1.17.3-alpine
          command:
            - "/bin/sh"
            - "-c"
            - while true; do echo $(date) >> /mnt/azurefile/outfile; sleep 1; done
          volumeMounts:
            - name: azurefile
              mountPath: "/mnt/azurefile"
              readOnly: false
      volumes:
        - name: azurefile
          persistentVolumeClaim:
            claimName: pvc-azurefile
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

@andyzhangx
Member Author

I have tried multiple times. Some symptoms:

  • it only happens when the PVC and pod are not in the default namespace while the secret is in the default namespace
  • kubelet seems able to access a secret created earlier but not another secret created later, even though the config is exactly the same, which is quite strange.

@msau42
Member

msau42 commented Nov 17, 2020

Does it work if pvc and pod and secret are all in the same namespace (not default)?

Looking at azure translation code, the inline to csi translation seems problematic, as it's hardcoding default namespace for the secret:

The PV to csi translation function looks ok though:

csiSource.NodeStageSecretRef.Namespace = *azureSource.SecretNamespace
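
The PV shown earlier in this thread has `secretNamespace: null`, and the failing lookup went to `default/...-secret`. A minimal sketch of a translation step that would produce that result is below. The types `AzureFilePVSource` and `SecretRef` and the function `translateSecretRef` are hypothetical, trimmed-down stand-ins written for illustration, not the real Kubernetes structs or the actual translation library code, and the fallback to `default` is an assumption inferred from the error message above.

```go
package main

import "fmt"

// AzureFilePVSource is a hypothetical, trimmed-down stand-in for the
// in-tree Azure File PV source. It is not the real k8s.io/api type.
type AzureFilePVSource struct {
	SecretName      string
	SecretNamespace *string // nil when the PV shows "secretNamespace: null"
	ShareName       string
}

// SecretRef is a hypothetical stand-in for a CSI NodeStageSecretRef.
type SecretRef struct {
	Name      string
	Namespace string
}

// Assumed fallback namespace, inferred from the error message in this thread.
const defaultSecretNamespace = "default"

// translateSecretRef builds the secret reference for the translated volume.
// It falls back to the default namespace when SecretNamespace is nil
// instead of dereferencing a nil pointer.
func translateSecretRef(src AzureFilePVSource) SecretRef {
	ns := defaultSecretNamespace
	if src.SecretNamespace != nil {
		ns = *src.SecretNamespace
	}
	return SecretRef{Name: src.SecretName, Namespace: ns}
}

func main() {
	pv := AzureFilePVSource{
		SecretName: "azure-storage-account-f97cecd00f4d4433ebd6809-secret",
		ShareName:  "pvc-01bd1d0b-b965-4d37-8331-817a296c8998",
	}
	ref := translateSecretRef(pv)
	fmt.Printf("%s/%s\n", ref.Namespace, ref.Name)
}
```

Under these assumptions, a PV without an explicit secret namespace resolves to a secret in `default` even when the claim lives in another namespace such as `test`.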

@andyzhangx
Member Author

andyzhangx commented Nov 26, 2020

> Does it work if pvc and pod and secret are all in the same namespace (not default)?
>
> Looking at azure translation code, the inline to csi translation seems problematic, as it's hardcoding default namespace for the secret:
>
> The PV to csi translation function looks ok though:
>
> csiSource.NodeStageSecretRef.Namespace = *azureSource.SecretNamespace

Regarding the code above: the inline-to-CSI translation code should be OK, since the inline struct has no namespace field, so the secret is always in default.

@msau42 I have verified that if the PVC, pod, and secret are all in the same namespace (not default), it works. I also found one odd thing: only the StatefulSet fails, while the Deployment always works even when the pod and secret are in different namespaces. In the example below, the StatefulSet and the Deployment use the same PVC, and only the StatefulSet fails. I am not sure whether there is special handling for StatefulSets.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-azurefile
  namespace: test
  labels:
    app: nginx
spec:
  serviceName: statefulset-azurefile
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        "kubernetes.io/os": linux
      containers:
        - name: statefulset-azurefile
          image: mcr.microsoft.com/oss/nginx/nginx:1.17.3-alpine
          volumeMounts:
            - name: azurefile
              mountPath: /mnt/azurefile
      volumes:
        - name: azurefile
          persistentVolumeClaim:
            claimName: persistent-storage-statefulset-azurefile-0
  selector:
    matchLabels:
      app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: deployment-azurefile
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      name: deployment-azurefile
    spec:
      nodeSelector:
        "kubernetes.io/os": linux
      containers:
        - name: deployment-azurefile
          image: mcr.microsoft.com/oss/nginx/nginx:1.17.3-alpine
          volumeMounts:
            - name: azurefile
              mountPath: "/mnt/azurefile"
              readOnly: false
      volumes:
        - name: azurefile
          persistentVolumeClaim:
            claimName: persistent-storage-statefulset-azurefile-0

@msau42
Member

msau42 commented Dec 1, 2020

> one funny thing, only statefulset does not work, and deployment always works even pod and secret are not in same namespace, in below example, statefulset and deployment are using same PVC, only statefulset does not work. Not sure whether there is special handling in statefulset.

That's really odd. The Node authorizer should only care about Pods, not the higher controllers. Do the pod specs look the same?
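
To illustrate the point about the Node authorizer, here is a deliberately simplified model of the check behind "no relationship found between node ... and this object". It is a sketch only: the real authorizer maintains a graph of API objects, while the `Pod` and `Key` types, the `pvSecretsByClaim` map, and the `nodeMayGetSecret` function here are hypothetical names invented for this example. The thing to notice is that the inputs are pods, claims, and volumes; the owning controller (StatefulSet or Deployment) does not appear anywhere.

```go
package main

import "fmt"

// Key identifies a namespaced object (hypothetical helper type).
type Key struct {
	Namespace string
	Name      string
}

// Pod is a hypothetical, trimmed-down view of a scheduled pod.
type Pod struct {
	Namespace string
	NodeName  string
	Secrets   []string // secrets referenced directly by the pod spec
	PVCs      []string // claims referenced by the pod's volumes
}

// nodeMayGetSecret reports whether some pod scheduled to the node references
// the secret, either directly or through a claim whose bound PV names it.
// pvSecretsByClaim maps a claim to the secrets referenced by its bound PV.
func nodeMayGetSecret(node string, secret Key, pods []Pod, pvSecretsByClaim map[Key][]Key) bool {
	for _, p := range pods {
		if p.NodeName != node {
			continue
		}
		// Direct references are always in the pod's own namespace.
		for _, s := range p.Secrets {
			direct := Key{Namespace: p.Namespace, Name: s}
			if direct == secret {
				return true
			}
		}
		// Indirect references: pod -> PVC -> PV -> secret.
		for _, c := range p.PVCs {
			claim := Key{Namespace: p.Namespace, Name: c}
			for _, s := range pvSecretsByClaim[claim] {
				if s == secret {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	node := "k8s-agentpool-28749203-0"
	secret := Key{Namespace: "default", Name: "azure-storage-account-f97cecd00f4d4433ebd6809-secret"}
	claim := Key{Namespace: "test", Name: "pvc-azurefile"}
	pods := []Pod{{Namespace: "test", NodeName: node, PVCs: []string{"pvc-azurefile"}}}

	// Relationship recorded through the PV: the node may read the secret.
	fmt.Println(nodeMayGetSecret(node, secret, pods, map[Key][]Key{claim: {secret}}))
	// No recorded relationship: the request is forbidden.
	fmt.Println(nodeMayGetSecret(node, secret, pods, map[Key][]Key{}))
}
```

In this model, two pods with the same node, namespace, and claim always get the same answer, which is why comparing the actual pod specs is the natural next step.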

@andyzhangx
Member Author

andyzhangx commented Jan 9, 2021

In the Azure File CSI migration scenario, the secret is created by the CSI driver, so kubelet does not set up a relationship between the node and the secret. PR #97877 passes the secretName and secretNamespace as CSI driver volume parameters instead of setting NodeStageSecretRef. I have verified that it works well in the CSI migration test.

There may be no straightforward fix (I have been trying for the last two months). #97877 is a workaround, but it also fixes this failure.
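
The shape of that workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #97877: the `CSISource` and `SecretRef` types, the function `translateWithAttributes`, and the attribute keys `secretName` and `secretNamespace` are hypothetical names chosen for this example. Only the driver name `file.csi.azure.com` comes from this thread.

```go
package main

import "fmt"

// SecretRef is a hypothetical stand-in for a CSI secret reference.
type SecretRef struct {
	Name      string
	Namespace string
}

// CSISource is a hypothetical, trimmed-down stand-in for a CSI PV source.
type CSISource struct {
	Driver             string
	VolumeAttributes   map[string]string
	NodeStageSecretRef *SecretRef
}

// Attribute keys are assumptions for this sketch, not confirmed driver keys.
const (
	attrSecretName      = "secretName"
	attrSecretNamespace = "secretNamespace"
)

// translateWithAttributes hands the secret location to the driver as volume
// attributes and leaves NodeStageSecretRef unset, so kubelet does not try to
// fetch the secret itself.
func translateWithAttributes(secretName, secretNamespace string) CSISource {
	return CSISource{
		Driver: "file.csi.azure.com",
		VolumeAttributes: map[string]string{
			attrSecretName:      secretName,
			attrSecretNamespace: secretNamespace,
		},
		NodeStageSecretRef: nil,
	}
}

func main() {
	src := translateWithAttributes("azure-storage-account-f97cecd00f4d4433ebd6809-secret", "default")
	fmt.Println(src.Driver, src.VolumeAttributes[attrSecretNamespace], src.NodeStageSecretRef == nil)
}
```

With no NodeStageSecretRef on the volume, the kubelet-side secret fetch that produced the "forbidden" error is skipped, and reading the secret becomes the driver's responsibility.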
