Bottlerocket under-reports Ephemeral Storage Capacity #2743
Comments
@jonathan-innis, thanks for reaching out! We're taking a deeper look into this.
@jpculp Any progress or updates on this issue?
Unfortunately not yet. We have to take a deeper look at the interaction between the host, containerd, and cAdvisor. Out of curiosity, do you see the same behavior with a newer version on K8s 1.24?
I haven't taken a look at the newer version on K8s 1.24. Let me take a look at a newer version of K8s and get back to you on that.
Hi @jonathan-innis, although I haven't fully root-caused the issue, I wanted to provide an update to offer some information. I took a deeper look into this, and the issue seems to stem from how kubelet updates the node's reported capacity. After the node becomes ready, if I query the metrics endpoint, both the cAdvisor stats and the node summary stats are reporting correctly:

cAdvisor:

Node summary:
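For reference, here's a sketch of how those two endpoints can be queried through the API-server proxy (`<node-name>` is a placeholder, and `jq` is assumed to be available):

```sh
# Node filesystem stats from kubelet's Summary API (capacityBytes / availableBytes)
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.fs'

# Raw cAdvisor metrics exposed by kubelet; filter for filesystem capacity
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep container_fs_limit_bytes
```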
But for some reason, the node object in the cluster does not reflect that in the K8s API: it only reports ~988 GB. What's interesting is that once you either reboot the worker node or restart the kubelet service, the stats sync up correctly.
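A quick way to see the value the Node object itself is reporting (`<node-name>` is a placeholder):

```sh
# Ephemeral-storage capacity as recorded on the Node object in the K8s API
kubectl get node <node-name> -o jsonpath='{.status.capacity.ephemeral-storage}'
```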
So it seems like kubelet holds on to a stale capacity value until it is restarted. If you want to work around this issue, you can reboot the nodes or restart kubelet to get the correct capacity reported.
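A sketch of that workaround on Bottlerocket, assuming you can reach the admin container (e.g., via SSM):

```sh
# From the admin container, drop into a root shell on the host
sudo sheltie

# Restart kubelet so it re-reads the filesystem capacity
systemctl restart kubelet.service

# Or reboot the node through the Bottlerocket API instead
apiclient reboot
```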
Wondering if you are still seeing this behavior. If so, do the stats eventually correct themselves, or once it's in this state does it keep reporting the wrong size indefinitely? There's a 10-second cache timeout for stats, so I wonder if we are hitting a case where the data in the cache needs to be invalidated before it actually checks again and gets the full storage space.
I didn't realize I needed to specify the device as `/dev/xvdb`.
Still seeing this behavior on EKS 1.25. Entering admin container > `sudo sheltie` > ...
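For anyone else checking from the host shell, a sketch of commands that surface the discrepancy (assuming the data volume is `/dev/xvdb`, as in the original report):

```sh
# Actual block device size as the host sees it
lsblk /dev/xvdb

# Filesystem capacity backing kubelet's ephemeral storage
df -h /var/lib/kubelet
```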
FWIW, recently upgraded to 1.26, and the behavior is there as well.
Hi @James-Quigley @jonathan-innis, I suspect this issue might be addressed by the changes to include monitoring of the container runtime cgroup by kubelet (#3804). Are you still seeing this issue on versions of Bottlerocket >= 1.19.5?
Image I'm using:
AMI Name: bottlerocket-aws-k8s-1.22-aarch64-v1.11.1-104f8e0f
What I expected to happen:
I expected the reported capacity on my worker node to be close to the actual size of the EBS volume attached at the `xvdb` mount for my node filesystem.

What actually happened:
cAdvisor or something in the Bottlerocket image appears to be under-reporting the storage capacity of this worker node.
The `ephemeral-storage` capacity here is approximately 1457383148Ki ≈ 1.35 Ti, which is not close to the ~4.3 Ti that `lsblk` is reporting.

How to reproduce the problem:
1. Launch a node with a large EBS volume attached at the `/dev/xvdb` mount.
2. Check the reported capacity with `kubectl get node` or `kubectl describe node` (see the sketch below).
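A minimal sketch of that comparison, assuming a node named `<node-name>` with its data volume on `/dev/xvdb`:

```sh
# What Kubernetes reports for the node, including ephemeral-storage
kubectl describe node <node-name> | grep -A 5 Capacity

# What the block device actually provides (run on the host)
lsblk /dev/xvdb
```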