Disk error when pods are mounting a certain amount of volumes on a node #201
Steps to reproduce:
```
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: myvm
spec:
  selector:
    matchLabels:
      app: myvm # has to match .spec.template.metadata.labels
  serviceName: "myvm"
  replicas: 8
  template:
    metadata:
      labels:
        app: myvm
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: ubuntu
        image: ubuntu:xenial
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 30; done;" ]
        livenessProbe:
          exec:
            command:
            - ls
            - /mnt/data
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 250Mi
        volumeMounts:
        - name: mydata
          mountPath: /mnt/data
  volumeClaimTemplates:
  - metadata:
      name: mydata
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: default
      resources:
        requests:
          storage: 1Gi
```
Testing this we can also see in
We went straight to 7 replicas first, then up to 8, and now it's:
@gonzochic @LeDominik azure disk only supports
@andyzhangx I know that the volume is; here is the quote from the docs:
@gonzochic you are right. I have tried this in my testing env three times; the root cause is that the device name (/dev/sd*) changes after attaching the 6th data disk (it is always the 6th) on a D2_V2 VM, which allows 8 data disks at maximum. That is to say, 5 data disks are safe... Below is the evidence:
I will contact the Azure Linux VM team to check whether there is any solution for this device name change issue after attaching new data disks.
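For reference, a quick diagnostic sketch for inspecting that mapping on the node, assuming the udev rules shipped with the Azure Linux agent that create the `/dev/disk/azure` symlinks (the same tree that the fix PR output quotes later in this thread); this only observes the problem, it is not part of the fix:
```
# Diagnostic sketch (assumes the waagent udev rules that create /dev/disk/azure
# symlinks, as on acs-engine/AKS nodes): show how each data-disk LUN currently
# maps to a /dev/sd* device, and compare with the block devices the kernel sees.
ls -l /dev/disk/azure/scsi1/          # lunN -> ../../../sdX symlinks per data disk
lsblk -o NAME,SIZE,MOUNTPOINT         # actual devices and where they are mounted
```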
Update: Here is an example:
Thanks @andyzhangx. Alternatively, I saw the following PR, which makes it configurable how many disks a node can mount. I think it is planned for 1.10 and would make it possible to "catch" this error at the Kubernetes scheduler level. Additionally, something like this seems necessary anyway; otherwise, what currently happens if I try to mount more volumes than the Azure VM size allows?
@gonzochic if a node has reached its maximum disk number, a new pod with an azure disk mount would fail. In your case, there is a "disk error" after another pod with an azure disk is mounted; this issue is different. It is not due to the maximum number of disks a node can mount, it is due to the device name change caused by "cachingmode: ReadWrite". So to fix your issue, you could use my proposed solution; I have verified it works well. Here is an example:
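The concrete example is not included above, so here is a minimal sketch of what that kind of workaround could look like, assuming the in-tree `kubernetes.io/azure-disk` provisioner; the class name `managed-standard-nocache` and the `Standard_LRS` SKU are illustrative choices, the relevant setting is `cachingmode: None`:
```
# Hypothetical sketch, not the exact manifest from the comment above:
# create a StorageClass that pins the azure-disk host cache to None so that
# device names stay stable as more data disks are attached to the node.
kubectl apply -f - <<'EOF'
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: managed-standard-nocache     # illustrative name
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed                      # assumes managed disks, as on AKS
  storageaccounttype: Standard_LRS   # illustrative SKU
  cachingmode: None                  # avoid the ReadWrite default that triggers the rename
EOF
```
The StatefulSet's `volumeClaimTemplates` would then set `storageClassName: managed-standard-nocache` instead of `default`.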
@gonzochic Just realized you are using AKS, so my proposed solution is
I have already verified this in my test env: all 8 replicas are running, no more crashes.
Hey @andyzhangx, thanks for the tip! We're going to test that out and post feedback!
Automatic merge from submit-queue (batch tested with PRs 60346, 60135, 60289, 59643, 52640). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix device name change issue for azure disk

**What this PR does / why we need it**: fixes the device name change issue for azure disk, caused by the default host cache setting changing from None to ReadWrite in v1.7, while the default host cache setting in the azure portal is `None`.

**Which issue(s) this PR fixes**: Fixes #60344, #57444; also fixes Azure/acs-engine#1918 and Azure/AKS#201.

**Special notes for your reviewer**: From v1.7, the default host cache setting changed from None to ReadWrite. This leads to a device name change after attaching multiple disks on an azure vm, and finally leaves the disk inaccessible from the pod. For example, a statefulset with 8 replicas (each with an azure disk) on one node will always fail; according to my observation, adding the 6th data disk will always make the dev name change, and some pods can no longer access their data disk after that. I have verified this fix on v1.8.4.

Without this PR on one node (dev name changes):
```
azureuser@k8s-agentpool2-40588258-0:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdk
    ├── lun1 -> ../../../sdj
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun4 -> ../../../sdg
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

With this PR on one node (no dev name change):
```
azureuser@k8s-agentpool2-40588258-1:~$ tree /dev/disk/azure
...
└── scsi1
    ├── lun0 -> ../../../sdc
    ├── lun1 -> ../../../sdd
    ├── lun2 -> ../../../sde
    ├── lun3 -> ../../../sdf
    ├── lun5 -> ../../../sdh
    └── lun6 -> ../../../sdi
```

In the following, `myvm-0` and `myvm-1` are crashing due to the dev name change; after the controller manager replacement, the `myvm2-x` pods work well.
```
Every 2.0s: kubectl get po                     Sat Feb 24 04:16:26 2018

NAME      READY     STATUS             RESTARTS   AGE
myvm-0    0/1       CrashLoopBackOff   13         41m
myvm-1    0/1       CrashLoopBackOff   11         38m
myvm-2    1/1       Running            0          35m
myvm-3    1/1       Running            0          33m
myvm-4    1/1       Running            0          31m
myvm-5    1/1       Running            0          29m
myvm-6    1/1       Running            0          26m
myvm2-0   1/1       Running            0          17m
myvm2-1   1/1       Running            0          14m
myvm2-2   1/1       Running            0          12m
myvm2-3   1/1       Running            0          10m
myvm2-4   1/1       Running            0          8m
myvm2-5   1/1       Running            0          5m
myvm2-6   1/1       Running            0          3m
```

**Release note**:
```
fix device name change issue for azure disk
```

/assign @karataliu
/sig azure
@feiskyer could you mark it for the v1.10 milestone? @brendandburns @khenidak @rootfs @jdumars FYI
Since it's a critical bug, I will cherry-pick this fix to v1.7-v1.9; note that v1.6 does not have this issue since its default cachingmode is `None`.
@andyzhangx: Thanks, it works like a charm. On the 1-node AKS test cluster we were able to go up to the node's advertised maximum of 16 supported disks with a stateful set; the 17th pod then had to wait due to
All disks are working fine. Thanks 👍
My pleasure, would you close this issue? Thanks.
So far we have had no more issues! Thanks for that :)
@andyzhangx is this still a thing?
No, it should work well now.
Even with the default caching? Or is the default None? @andyzhangx
Correct.
Are you sure @andyzhangx?
This is without setting caching to anything.
It should work.
We are currently running a 5-node cluster in AKS with 10 vCPUs and 35 GB RAM in total. We noticed the following behavior: we have a couple of StatefulSets, each claiming an Azure Disk with some storage. During runtime, a pod goes into a CrashLoop because its volume is suddenly not accessible anymore (I/O error). This crashes the application running in the pod; the health probe recognizes that, restarts the container, and it crashes again. We managed to keep one container running, and we were still not able to access the volume (though it was still mounted in the OS).
The solution to this problem was usually to manually delete the pod. After rescheduling, it suddenly worked again.
In the past this happened only a few times, until yesterday. Yesterday we had the same issue, and as soon as we deleted a failing pod and it was rescheduled into a running state, another pod started crashing. We always had 4 failing pods with I/O errors, which made us wonder whether it has something to do with the total number of mounted Azure Disks.
We have the following assumption:
If a new pod is scheduled on a node that already has 4 mounted Azure Disks, one of the running pods (which is claiming one of those volumes) "loses" access to its volume and therefore crashes. Additionally, we found the following link, which documents the limit on the number of Azure Disks that can be mounted on a VM (Link).
What we would expect:
If our assumption is correct, I would expect the following behaviour:
Have you observed something similar in the past?
Here is some information about our system (private information redacted):