
Disk error when pods are mounting a certain amount of volumes on a node #201

Closed

gonzochic opened this issue Feb 21, 2018 · 19 comments

@gonzochic

We are currently running a 5-node cluster in AKS with 10 vCPUs and 35 GB of RAM in total. We noticed the following behavior: we have a couple of StatefulSets, each claiming an Azure Disk with some storage. During runtime, a pod goes into a CrashLoop because its volume suddenly becomes inaccessible (I/O error). This crashes the application running in the pod; the health probe notices, the pod restarts, and it crashes again. We managed to keep one container running, and we were no longer able to access the volume from inside it (although it was still mounted in the OS).

The solution to this problem was usually to delete the pod manually. After rescheduling, it suddenly worked again.

In the past this happened only a few times, until yesterday. Yesterday we had the same issue, and as soon as we deleted the failing pod and it was rescheduled into a running state, another pod started crashing. We always had 4 failing pods with I/O errors, which made us wonder whether it has something to do with the total number of mounted Azure Disks.

We have the following assumption:
If a new pod is scheduled on a node that already has 4 mounted Azure Disks, one of the running pods (which claims one of those volumes) "loses" access to its volume and therefore crashes. Additionally, we found the following link, which describes a limit on the number of Azure Disks that can be mounted on a VM (Link).
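
The per-VM-size data disk limit can be checked with the Azure CLI, for example (location and VM size taken from our node labels below; this command is an illustration, not part of the original report):

```
# maximum number of data disks supported by the node size used here (Standard_D2_v2)
az vm list-sizes --location westeurope \
  --query "[?name=='Standard_D2_v2'].maxDataDiskCount" --output tsv
```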

What we would expect:
If our assumption is correct, I would expect the following behaviour:

  • Pods with a PVC bound to an Azure Disk PV should not be scheduled onto a node that already has the maximum number of volumes mounted
  • If this is not possible: the new pod should fail to schedule on that node and report an error (instead of making an already running pod crash)

Have you observed something similar in the past?

Here is some information about our system (private information redacted):

Name:               aks-agentpool-(reducted)
Roles:              agent
Labels:             agentpool=agentpool
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D2_v2
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=westeurope
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.azure.com/cluster=(reducted)
                    kubernetes.io/hostname=aks-agentpool-(reducted)
                    kubernetes.io/role=agent
                    storageprofile=managed
                    storagetier=Standard_LRS
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Tue, 20 Feb 2018 17:07:16 +0100
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 20 Feb 2018 17:07:42 +0100   Tue, 20 Feb 2018 17:07:42 +0100   RouteCreated                 RouteController created a route
  OutOfDisk            False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:16 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready                True    Wed, 21 Feb 2018 09:49:47 +0100   Tue, 20 Feb 2018 17:07:36 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  (reducted)
  Hostname:    (reducted)
Capacity:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          7114304Ki
 pods:                            110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:  0
 cpu:                             2
 memory:                          7011904Ki
 pods:                            110
System Info:
 Machine ID:                          (reducted)
 System UUID:                         (reducted)
 Boot ID:                             (reducted)
 Kernel Version:                      4.13.0-1007-azure
 OS Image:                            Debian GNU/Linux 8 (jessie)
 Operating System:                    linux
 Architecture:                        amd64
 Container Runtime Version:           docker://1.12.6
 Kubelet Version:                     v1.8.7
 Kube-Proxy Version:                  v1.8.7
PodCIDR:                              10.244.4.0/24
ExternalID:                           (reducted)
@gonzochic
Author

Steps to reproduce
Provision an AKS cluster in the Azure portal with a single node (we used a D3_v2 as reference, because it should be able to handle 16 volumes, which could be the Kubernetes default). Then apply the following YAML on your cluster, which provisions a StatefulSet. We started with a replica count of 8; with 8 replicas, 2 pods were already failing with I/O errors. Here is the YAML:

apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: myvm
spec:
  selector:
    matchLabels:
      app: myvm # has to match .spec.template.metadata.labels
  serviceName: "myvm"
  replicas: 8 
  template:
    metadata:
      labels:
        app: myvm 
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: ubuntu
        image: ubuntu:xenial
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
        livenessProbe:
          exec:
            command:
            - ls
            - /mnt/data
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 250Mi
        volumeMounts:
        - name: mydata
          mountPath: /mnt/data
  volumeClaimTemplates:
  - metadata:
      name: mydata
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: default
      resources:
        requests:
          storage: 1Gi
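
Once the manifest is applied, the failing replicas can be observed with standard kubectl commands (the file name below is an assumption):

```
# apply the StatefulSet above
kubectl apply -f myvm-statefulset.yaml

# watch the replicas come up; with 8 replicas some pods eventually enter CrashLoopBackOff
kubectl get pods -l app=myvm -w

# the failing liveness probe ("ls /mnt/data" hitting I/O errors) shows up in the pod events
kubectl describe pod myvm-0
```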

@LeDominik

Testing this, we can also see in dmesg that when adding a new replica, the next existing device fails (here sde) while the incoming disk is sdl:

[50023.841688] EXT4-fs warning (device sdc): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50023.986177] EXT4-fs warning (device sdd): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50024.371780] EXT4-fs warning (device sdc): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[50024.373185] EXT4-fs warning (device sdd): htree_dirblock_to_tree:962: inode #2: lblock 0: comm updatedb.mlocat: error -5 reading directory block
[78545.257391] EXT4-fs warning (device sde): ext4_end_bio:313: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 32834)
[78545.257394] Buffer I/O error on device sde, logical block 32833
[78545.261633] Aborting journal on device sde-8.
[78545.265041] JBD2: Error -5 detected when updating journal superblock for sde-8.
[78545.270385] sd 3:0:0:2: [sde] Synchronizing SCSI cache
[78547.314632] scsi 3:0:0:7: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 4
[78547.341351] sd 3:0:0:7: Attached scsi generic sg5 type 0
[78547.341768] sd 3:0:0:7: [sdl] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
[78547.341809] sd 3:0:0:7: [sdl] Write Protect is off
[78547.341811] sd 3:0:0:7: [sdl] Mode Sense: 0f 00 10 00
[78547.341977] sd 3:0:0:7: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
[78547.359641] sd 3:0:0:7: [sdl] Attached SCSI disk

We went straight to 7 replicas first, then up to 8, and now it's:

  • OS Disk
  • Disk for myvm-0 on /dev/sdc (broken)
  • Disk for myvm-1 on /dev/sdd (broken)
  • Disk for myvm-2 on /dev/sde (broken after going to 8 replicas, see above)
  • Disk for myvm-3 on /dev/sdf (still ok)
  • and so on up to myvm-7

@andyzhangx
Contributor

@gonzochic @LeDominik Azure Disk only supports ReadWriteOnce, which means a disk can only be used on one node; you cannot share it across 8 pod replicas, since those pods could land on different nodes.
I would suggest using Azure File if you have more than one replica.

@gonzochic
Author

gonzochic commented Feb 23, 2018

@andyzhangx I know that the volume is ReadWriteOnce. If you look closely, you will see that we are creating a separate PersistentVolume and PersistentVolumeClaim for each replica.
In the Azure Portal you can also see that 8 disks are attached to the VM, but three of them are not accessible, as you can see from @LeDominik's log.

[Screenshot: Azure Portal showing the disks attached to the VM, 2018-02-23]
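
The attached data disks can also be listed without the portal, for example with the Azure CLI (resource group and VM name are placeholders):

```
# list the data disks attached to the agent VM, including LUN and caching mode
az vm show --resource-group <resource-group> --name <vm-name> \
  --query "storageProfile.dataDisks[].{lun:lun, name:name, caching:caching}" \
  --output table
```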

Here is the quote from the docs:

Note, however, that while scaling up creates new PersistentVolumeClaims automatically, scaling down does not automatically delete these PVCs. This gives you the choice to keep those initialized PVCs around to make scaling back up quicker, or to extract data before deleting them.

@andyzhangx
Contributor

@gonzochic you are right. I have tried this in my test environment 3 times; the root cause is that the device names (/dev/sd*) change after attaching the 6th data disk (it's always the 6th) on a D2_v2 VM, which allows 8 data disks at maximum. That is to say, 5 data disks are safe... Below is the evidence:

azureuser@k8s-agentpool-87187153-0:/tmp$ tree /dev/disk/azure/
/dev/disk/azure/
├── resource -> ../../sdb
├── resource-part1 -> ../../sdb1
├── root -> ../../sda
├── root-part1 -> ../../sda1
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    └── lun4 -> ../../../sdg

1 directory, 9 files
azureuser@k8s-agentpool-87187153-0:/tmp$ tree /dev/disk/azure/
/dev/disk/azure/
├── resource -> ../../sdb
├── resource-part1 -> ../../sdb1
├── root -> ../../sda
├── root-part1 -> ../../sda1
└── scsi1
    ├── lun0 -> ../../../sdi
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    └── lun5 -> ../../../sdh
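
On an affected node, the rename can be spotted by comparing the stable LUN symlinks with the block devices actually in use (these commands are a suggestion, not part of the original report):

```
# stable Azure LUN symlinks vs. the kernel device names currently in use;
# a LUN that suddenly points at a different sdX (e.g. lun0 -> sdi) while the
# pod still uses the old device indicates the rename problem
ls -l /dev/disk/azure/scsi1/
lsblk -o NAME,SIZE,MOUNTPOINT
```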

I will contact the Azure Linux VM team to check whether there is any solution for this device name change issue after attaching new data disks.

@andyzhangx
Contributor

Update:
The current solution is to add `cachingmode: None` in the Azure Disk storage class; that solves the device name change issue. I have proposed a PR to change the default cachingmode to None.

Here is an example:
https://github.com/andyzhangx/Demo/blob/master/pv/storageclass-azuredisk.yaml

@gonzochic
Author

gonzochic commented Feb 24, 2018

Thanks @andyzhangx. Alternatively, I saw the following PR, which makes it configurable how many disks a node can mount. I think it is planned for 1.10 and would make it possible to "catch" this error at the Kubernetes scheduler level. Additionally, this seems necessary anyway: what happens today if I try to mount more volumes than the Azure VM size allows?

@andyzhangx
Contributor

@gonzochic if a node has already reached its maximum disk number, a new pod with an Azure Disk mount will simply fail. In your case there is a "disk error" after another pod with an Azure Disk is mounted, so this issue is different: it is not due to the maximum number of disks a node can mount, it is due to the device name change caused by `cachingmode: ReadWrite`.
Making the disk number configurable per cloud provider is a new feature, and I don't think it will make 1.10 since next week is code freeze. The feature could land in v1.11; I will let you know when it's available.

So to fix your issue, you could use my proposed solution; I have verified that it works well: add `cachingmode: None` in the Azure Disk storage class, which solves the device name change issue. I have proposed a PR to change the default cachingmode to None.

Here is an example:
https://github.com/andyzhangx/Demo/blob/master/pv/storageclass-azuredisk.yaml

@andyzhangx
Contributor

@gonzochic Just realized you are using AKS, so my proposed solution is:

  1. Create a new Azure Disk storage class:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: hdd
provisioner: kubernetes.io/azure-disk
parameters:
  skuname: Standard_LRS
  kind: Managed
  cachingmode: None
  2. Change storageClassName: default to storageClassName: hdd in the StatefulSet config (sketched below)
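
For reference, the only change to the StatefulSet from the reproduction above is the storageClassName in the volumeClaimTemplates (a minimal sketch):

```
  volumeClaimTemplates:
  - metadata:
      name: mydata
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: hdd   # was: default
      resources:
        requests:
          storage: 1Gi
```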

I have already verified this in my test env: all 8 replicas are running, with no more crashes.

@LeDominik

Hey @andyzhangx — thanks for the tip! We’re going to test that out and post feedback!
(Will take till Monday, I was “encouraged” not to take my laptop with me for the weekend 😄 )

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Feb 25, 2018
Automatic merge from submit-queue (batch tested with PRs 60346, 60135, 60289, 59643, 52640). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

fix device name change issue for azure disk

**What this PR does / why we need it**:
Fix the device name change issue for Azure disks: the default host cache setting changed from None to ReadWrite in v1.7, while the default host cache setting in the Azure portal is `None`.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #60344, #57444
also fixes following issues:
Azure/acs-engine#1918
Azure/AKS#201

**Special notes for your reviewer**:
From v1.7, the default host cache setting changed from None to ReadWrite. This can lead to a device name change after attaching multiple disks on an Azure VM, and finally makes the disk inaccessible from the pod.
For example:
A statefulset with 8 replicas (each with an Azure disk) on one node will always fail; according to my observation, adding the 6th data disk always triggers a device name change, and some pods can no longer access their data disk after that.

I have verified this fix on v1.8.4.
Without this PR on one node (device names change):
```
azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

With this PR on one node (no device name change):
```
azureuser@k8s-agentpool2-40588258-1:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

In the following, `myvm-0` and `myvm-1` are crashing due to the device name change; after the controller manager replacement, the `myvm2-x` pods work well.

```
Every 2.0s: kubectl get po                                                                                                                                                   Sat Feb 24 04:16:26 2018

NAME      READY     STATUS             RESTARTS   AGE
myvm-0    0/1       CrashLoopBackOff   13         41m
myvm-1    0/1       CrashLoopBackOff   11         38m
myvm-2    1/1       Running            0          35m
myvm-3    1/1       Running            0          33m
myvm-4    1/1       Running            0          31m
myvm-5    1/1       Running            0          29m
myvm-6    1/1       Running            0          26m

myvm2-0   1/1       Running            0          17m
myvm2-1   1/1       Running            0          14m
myvm2-2   1/1       Running            0          12m
myvm2-3   1/1       Running            0          10m
myvm2-4   1/1       Running            0          8m
myvm2-5   1/1       Running            0          5m
myvm2-6   1/1       Running            0          3m
```

**Release note**:

```
fix device name change issue for azure disk
```
/assign @karataliu 
/sig azure
@feiskyer  could you mark it as v1.10 milestone?
@brendandburns @khenidak @rootfs @jdumars FYI

Since it's a critical bug, I will cherry-pick this fix to v1.7-v1.9; note that v1.6 does not have this issue since its default cachingmode is `None`.
@LeDominik

@andyzhangx: Thanks, it works like a charm. On the 1-node AKS test cluster we were able to go up to the node's advertised maximum of 16 supported disks with a stateful set; the 17th pod then had to wait due to No nodes are available that match all of the predicates: MaxVolumeCount (1)., so just what we expected.

All disks are working fine, Thanks 👍
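
For anyone verifying the same behaviour, the scheduling error for the extra replica shows up in the pending pod's events, e.g. (pod name assumes the StatefulSet from the reproduction above):

```
# the 17th replica stays Pending; its events show the MaxVolumeCount predicate failure
kubectl describe pod myvm-16 | grep -A 5 Events
```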

@andyzhangx
Contributor

My pleasure. Would you close this issue? Thanks.

@gonzochic
Author

So far we have had no more issues! Thanks for that :)

@4c74356b41

@andyzhangx is this still a thing?

@andyzhangx
Contributor

@andyzhangx is this still a thing?

no, it should work well now.

@4c74356b41

even with the default caching? or the default is none? @andyzhangx

@andyzhangx
Contributor

even with the default caching? or the default is none? @andyzhangx

correct

@4c74356b41

4c74356b41 commented Apr 23, 2020

are you sure @andyzhangx ?

"dataDisks": [
    {
        "lun": 0,
        "name": "kubernetes-dynamic-pvc-1aa96823-19ec-4d9c-a73a-4078eada7a37",
        "createOption": "Attach",
        "caching": "ReadOnly",
        "managedDisk": {
            "storageAccountType": "Premium_LRS",
            "id": "xxx"
        },
        "diskSizeGB": 1000
    },

this is without setting caching to anything

@andyzhangx
Contributor

are you sure @andyzhangx ?

"dataDisks": [
    {
        "lun": 0,
        "name": "kubernetes-dynamic-pvc-1aa96823-19ec-4d9c-a73a-4078eada7a37",
        "createOption": "Attach",
        "caching": "ReadOnly",
        "managedDisk": {
            "storageAccountType": "Premium_LRS",
            "id": "xxx"
        },
        "diskSizeGB": 1000
    },

this is without setting caching to anything

it should work

@ghost locked this conversation as resolved and limited it to collaborators on Aug 10, 2020