1.13.0 possibly not resizing data disk #2915

Closed
nairb774 opened this issue Mar 21, 2023 · 19 comments · Fixed by #2920
Labels
status/in-progress (This issue is currently being worked on), type/bug (Something isn't working)

Comments

@nairb774 commented Mar 21, 2023

Image I'm using:

Bottlerocket OS 1.13.0 (aws-k8s-1.22) ami-033592485355290ae

What I expected to happen:

The disk that is used for containers and scratch storage is expanded to use the whole EBS volume.

What actually happened:

Containers are failing to unpack due to running out of disk space, and some containers are being evicted for ephemeral storage usage. Looking in journalctl, it does look like the resize happened, so it might be related to something else going on. This "bug" did impact a number of clusters with disparate sets of workloads; the only change was updating the base AMI through machine replacement.

Logs up to the point the machine booted are attached. They look fairly normal - could it be something other than a failed resize?

journal.txt

How to reproduce the problem:

Upgraded from 1.12.0 to 1.13.0.

nairb774 added the status/needs-triage (Pending triage or re-evaluation) and type/bug (Something isn't working) labels Mar 21, 2023
@etungsten (Contributor) commented Mar 21, 2023

Hi @nairb774, if you're able to reach the host and sudo sheltie via the admin container, can you please paste the output of the following?

  • lsblk -f
  • blkid
  • journalctl -u label-data-a -u label-data-b -u prepare-local-fs -u repart-local
  • df -h
  • df -i
  • mount

#2879 and #2807 are the recent changes we made to how the data disk is set up. These changes allow boot to continue in cases where the data volume is intentionally not attached, and where the goal is to use the additional space on the root volume for local storage.

One of the caveats of that change: if the root volume has extra space (beyond the default 2 GiB) and the data volume is also attached, the system ends up with two viable data partitions, and it is non-deterministic, undefined behavior which one actually gets used for local storage (where container images are stored). We therefore strongly advise against increasing the root volume size beyond the default unless it's guaranteed there isn't a second data volume at boot. We were planning to put out a notice along with the release, which we're just about finishing up. I know this is an early guess, but I think the aforementioned undefined behavior might be causing what you're seeing here.
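
For a quick check of which partition actually ended up backing local storage on an affected node, something like the following should work (a sketch: it assumes local storage is mounted at /local as on current Bottlerocket releases, and device names will vary by instance type):

  # Show all partitions with their sizes, labels, and mount points.
  lsblk -o NAME,SIZE,FSTYPE,PARTLABEL,MOUNTPOINT
  # Show which source device backs the local storage filesystem.
  findmnt /local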

@nairb774 (Author)

I've rolled things back to 1.12, but as soon as I can, I'll try to bring up a few test nodes with 1.13 and get the requested information.

@bcressey (Contributor) commented Mar 21, 2023

From the journal.txt you provided:

Mar 21 00:24:27 localhost kernel: nvme nvme1: 2/0/0 default/read/poll queues
Mar 21 00:24:27 localhost kernel: nvme nvme0: 2/0/0 default/read/poll queues
Mar 21 00:24:27 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:4194303 != 8388607
Mar 21 00:24:27 localhost kernel: GPT:Alternate GPT header not at the end of the disk.  
Mar 21 00:24:27 localhost kernel: GPT:4194303 != 8388607
Mar 21 00:24:27 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Mar 21 00:24:27 localhost kernel:  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13  
Mar 21 00:24:27 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:2097151 != 157286399
Mar 21 00:24:27 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:2097151 != 157286399
Mar 21 00:24:27 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Mar 21 00:24:27 localhost kernel:  nvme1n1: p1

(The GPT numbers are reported in sectors, where each sector is 512 bytes, so multiplying by 512 and dividing by 1024 * 1024 * 1024 gives the GiB figure.)
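
As a quick sanity check on that arithmetic (the logged values are the last addressable sector, so add one before converting):

  echo $(( (8388607 + 1) * 512 / 1024 / 1024 / 1024 ))    # 4  -> nvme0n1 is 4 GiB
  echo $(( (157286399 + 1) * 512 / 1024 / 1024 / 1024 ))  # 75 -> nvme1n1 is 75 GiB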

These GPT errors are saying, in essence:

  • nvme0n1 is 4 GiB (up from a default size of 2 GiB)
  • nvme1n1 is 75 GiB (up from a default size of 1 GiB)

This is the failure case @etungsten mentioned, where the behavior is non-deterministic and the failed systems end up with 2 GiB of mutable local storage instead of 75 GiB.

We're extremely sorry about this behavior change and the impact you experienced.

I know it's not much comfort, but previously the "extra" 2 GiB on the first EBS volume would have gone unused, since the OS ignored that space - extra cost for EBS storage with no benefit. Adjusting the launch template or tool you're using to reduce the first EBS volume back to the default 2 GiB should avoid the problem and save a modest amount of money.
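
If it helps with auditing a fleet, something like this (the instance ID is a placeholder) lists each EBS volume attached to an instance along with its size, which makes it easy to spot nodes launched with an oversized root volume:

  aws ec2 describe-volumes \
    --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
    --query 'Volumes[].{Device:Attachments[0].Device,SizeGiB:Size}' \
    --output table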

Could you let us know what provisioning tool you're using? If this convention - allocating more space to the first EBS volume - is very common or the default behavior in some tool, then the impact here could be quite widespread.

@misterek

Do we know when this will be picked up by the Bottlerocket updater? Do I need to worry tonight?

bcressey added the status/in-progress (This issue is currently being worked on) label and removed the status/needs-triage (Pending triage or re-evaluation) label Mar 21, 2023
@nairb774 (Author)

Yeah, I can confirm that we had inadvertently set the root volume to 4 GiB. We create our launch template with Terraform, and the 4 GiB was a carryover from when we migrated from the "default" EKS managed node image. For some reason I swore you needed to specify volume_size in the launch template config, but looking at the documentation, as well as the underlying AWS API, it seems that parameter is indeed optional.

It's late enough in the evening that I can't poke around at this without risking paging someone, and I can't review PRs for the needed system changes, but the given explanation seems quite plausible at this point - between what looks to be extraneous configuration on my part and the fact that this event only impacted some, but not all, nodes.

On the upside, for our clusters at least, the impact was just a few sad pods and a few extra evictions. I mostly wanted to get this issue filed quickly since the release was quite fresh.

I'll try to report back tomorrow morning (PT) with more information if it's still needed at that time.

@misterek

Fairly certain we have a 10 GB root disk -- I don't quite remember the reason. @bcressey, will this be released to be picked up by the updater tonight?

@dschaaff

Karpenter picked up the update and we were bitten by this. Our root volumes were sized at 4 GiB instead of 2 GiB. I think this was a carryover from the example using 4 here: https://github.com/aws/karpenter/blob/main/examples/provisioner/bottlerocket.yaml#L26. We've updated ours to 2 GiB now.

dschaaff added a commit to dschaaff/karpenter that referenced this issue Mar 21, 2023
@bcressey (Contributor)

@misterek it would be picked up by the updater tonight.

However, for an in-place update, this guard in label-data-a.service and label-data-b.service should suppress the new behavior:

ConditionPathIsSymbolicLink=!/dev/disk/by-partlabel/BOTTLEROCKET-DATA

The BOTTLEROCKET-DATA link will already be present on any existing node, so that condition will fail and the new units will be skipped.
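
To double-check on a node that updated in place (via the admin container and sudo sheltie), you can verify that the link exists:

  ls -l /dev/disk/by-partlabel/BOTTLEROCKET-DATA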

@misterek

Ugh. Ok. Thanks. I've deleted that for now. That was handy but seems like it's too risky to use now.

@bcressey (Contributor)

@dschaaff nice catch!

We're rolling back the SSM parameter change for 1.13.0 to try to stop the bleeding for anyone else using the Karpenter default values.

@bcressey (Contributor)

With the SSM parameters rolled back to v1.12.0, new launches via Karpenter, EKS MNG, and eksctl should no longer be affected.

@etungsten is working on a fix to skip the resize unless the new partition would grow to at least 20 GiB, which should avoid this failure mode.

Plan is to release v1.13.1 once that's ready and tested.

@FernandoMiguel

The reason folks using Karpenter set 4 GiB is to support both GPU and non-GPU Bottlerocket AMIs, since the GPU ones are a tad bigger and would not fit in 2 GiB.
I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

@misterek

Thanks @nairb774 for reporting this so quickly, and thanks @bcressey and @etungsten for addressing it just as quickly!

@FernandoMiguel

Karpenter's prior change, for reference: aws/karpenter-provider-aws#2400

@etungsten (Contributor) commented Mar 21, 2023

I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

Yes, that would be ideal. But I also understand the value in providing an example of how to adjust the data volume size in the block device mapping if one wants to do so. In that case, the user has to know how much space to give the root volume for the host to boot up correctly.

Fortunately, with the fix we're adding, if the root volume size is below 22 GiB, Bottlerocket skips creating the data partition on the root volume, so there is no ambiguity as long as the data volume is present (with at least 20 GiB of space). This restores the previous behavior for people using the default Karpenter block device mapping example with the 4 GiB root volume.
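
As a rough illustration of that threshold (just a sketch of the decision logic, not the actual implementation; the device name is hypothetical):

  # Hypothetical root device; querying the size requires root.
  ROOT_DISK=/dev/nvme0n1
  ROOT_GIB=$(( $(blockdev --getsize64 "$ROOT_DISK") / 1024 / 1024 / 1024 ))
  if [ "$ROOT_GIB" -lt 22 ]; then
    echo "root volume is ${ROOT_GIB} GiB (< 22 GiB): no data partition is created on it"
  else
    echo "root volume is ${ROOT_GIB} GiB: the spare space would become a data partition of at least 20 GiB"
  fi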

@bcressey (Contributor)

Yes, that would be ideal. But I also understand the value in providing an example of how to adjust the data volume size in the block device mapping if one wants to do so. In that case, the user has to know how much space to give the root volume for the host to boot up correctly.

Ideally it would be possible to just specify xvdb in the block device mapping, for example:

  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeType: gp3
        volumeSize: 20Gi # replace with your required disk size
        deleteOnTermination: true

That ought to use the AMI default for xvda, which is either 2 GiB or 4 GiB depending on the variant, while resizing xvdb to the desired size.

@joebowbeer

@bcressey wrote:

Ideally it would be possible to just specify xvdb in the block device mapping [...] That ought to use the AMI default for xvda, which is either 2 GiB or 4 GiB depending on the variant, while resizing xvdb to the desired size.

@bcressey Can you clarify "ideally"? Is this what will happen if xvdb is the only device configured?

@joebowbeer

@FernandoMiguel wrote:

I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

Do you happen to know if this default sizing will work?

In our case, we want to adjust some of the xvda parameters, but not the size, for example:

  block_device_mappings {
    # This is the root volume.
    device_name = "/dev/xvda"

    ebs {
      # Let AMI use its default volume_size (2Gi); do not increase:
      # https://github.com/bottlerocket-os/bottlerocket/issues/2915
      volume_type           = "gp3"
      delete_on_termination = true
      encrypted             = true
    }
  }

@FernandoMiguel

@joebowbeer I recall there was a bug in Karpenter where this didn't work.
