1.13.0 possibly not resizing data disk #2915

Closed
nairb774 opened this issue Mar 21, 2023 · 19 comments · Fixed by #2920
Labels
status/in-progress (This issue is currently being worked on), type/bug (Something isn't working)

Comments

@nairb774 commented Mar 21, 2023

Image I'm using:

Bottlerocket OS 1.13.0 (aws-k8s-1.22) ami-033592485355290ae

What I expected to happen:

The disk that is used for containers and scratch storage is expanded to use the whole EBS volume.

What actually happened:

Containers are failing to unpack due to running out of disk space, and some containers are being evicted for ephemeral storage usage. Looking in journalctl, it does look like the resize happened, so it might be related to something else going on. This "bug" did impact a number of clusters with disparate sets of workloads; the only change was updating the base AMI through machine replacement.

Logs up to the point the machine booted are attached. They look fairly normal - could it be something other than a failed resize?

journal.txt

How to reproduce the problem:

Upgraded from 1.12.0 to 1.13.0.

nairb774 added the status/needs-triage (Pending triage or re-evaluation) and type/bug (Something isn't working) labels Mar 21, 2023
@etungsten (Contributor) commented Mar 21, 2023

Hi @nairb774, if you're able to reach the host and sudo sheltie via the admin container, can you please paste the output of the following?

  • lsblk -f
  • blkid
  • journalctl -u label-data-a -u label-data-b -u prepare-local-fs -u repart-local
  • df -h
  • df -i
  • mount

#2879 and #2807 are the recent changes we made to how the data disk is set up. These changes allow boot to continue in cases where the data volume is intentionally not attached, and where the goal is to use the additional space on the root volume for local storage.

One of the caveats of that change: if the root volume has extra space (beyond the default 2 GiB) and the data volume is also attached, the system ends up with two viable data partitions, and it is non-deterministic, undefined behavior which one actually gets used for local storage (where container images are stored). We therefore strongly advise against increasing the root volume size beyond the default unless it's guaranteed there isn't a second data volume at boot. We were planning to put out a notice along with the release, which we're just about finishing up. I know this is an early guess, but I think the aforementioned undefined behavior might be causing what you're seeing here.
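
For a quick check of which partition actually ended up backing local storage on an affected node, something like the following should work (a sketch: it assumes local storage is mounted at /local as on current Bottlerocket releases, and device names will vary by instance type):

  # Show all partitions with their sizes, labels, and mount points.
  lsblk -o NAME,SIZE,FSTYPE,PARTLABEL,MOUNTPOINT
  # Show which source device backs the local storage filesystem.
  findmnt /local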

@nairb774 (Author)

I've rolled things back to 1.12, but as soon as I can, I'll try to bring up a few test nodes with 1.13 and get the requested information.

@bcressey (Contributor) commented Mar 21, 2023

From the journal.txt you provided:

Mar 21 00:24:27 localhost kernel: nvme nvme1: 2/0/0 default/read/poll queues
Mar 21 00:24:27 localhost kernel: nvme nvme0: 2/0/0 default/read/poll queues
Mar 21 00:24:27 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:4194303 != 8388607
Mar 21 00:24:27 localhost kernel: GPT:Alternate GPT header not at the end of the disk.  
Mar 21 00:24:27 localhost kernel: GPT:4194303 != 8388607
Mar 21 00:24:27 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Mar 21 00:24:27 localhost kernel:  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13  
Mar 21 00:24:27 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:2097151 != 157286399
Mar 21 00:24:27 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Mar 21 00:24:27 localhost kernel: GPT:2097151 != 157286399
Mar 21 00:24:27 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Mar 21 00:24:27 localhost kernel:  nvme1n1: p1

(The GPT numbers are reported in sectors, where each sector is 512 bytes, so multiplying by 512 and dividing by 1024 * 1024 * 1024 gives the GiB figure.)
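
As a quick sanity check on that arithmetic (the logged values are the last addressable sector, so add one before converting):

  echo $(( (8388607 + 1) * 512 / 1024 / 1024 / 1024 ))    # 4  -> nvme0n1 is 4 GiB
  echo $(( (157286399 + 1) * 512 / 1024 / 1024 / 1024 ))  # 75 -> nvme1n1 is 75 GiB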

These GPT errors are saying, in essence:

  • nvme0n1 is 4 GiB (up from a default size of 2 GiB)
  • nvme1n1 is 75 GiB (up from a default size of 1 GiB)

This is the failure case @etungsten mentioned, where the behavior is non-deterministic and the failed systems end up with 2 GiB of mutable local storage instead of 75 GiB.

We're extremely sorry about this behavior change and the impact you experienced.

I know it's not much comfort, but previously the "extra" 2 GiB on the first EBS volume would have gone unused, since the OS ignored that space - extra cost for EBS storage with no benefit. Adjusting the launch template or tool you're using to reduce the first EBS volume back to the default 2 GiB should avoid the problem and save a modest amount of money.
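
If it helps with auditing a fleet, something like this (the instance ID is a placeholder) lists each EBS volume attached to an instance along with its size, which makes it easy to spot nodes launched with an oversized root volume:

  aws ec2 describe-volumes \
    --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
    --query 'Volumes[].{Device:Attachments[0].Device,SizeGiB:Size}' \
    --output table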

Could you let us know what provisioning tool you're using? If this convention - allocating more space to the first EBS volume - is very common or the default behavior in some tool, then the impact here could be quite widespread.

@misterek

Do we know when this will be picked up by the Bottlerocket updater? Do I need to worry tonight?

bcressey added the status/in-progress (This issue is currently being worked on) label and removed the status/needs-triage (Pending triage or re-evaluation) label Mar 21, 2023
@nairb774 (Author)

Yeah, I can confirm that we had inadvertently set the root volume to 4 GiB. We create our launch template with Terraform, and the 4 GiB was a carryover from when we migrated from the "default" EKS managed node image. For some reason I swore you needed to specify volume_size in the launch template config, but looking at the documentation, as well as the underlying AWS API, it seems that parameter is indeed optional.

It's late enough in the evening that I can't poke around at this without risking paging someone, and I can't review PRs for the needed system changes, but the given explanation seems quite plausible at this point - between what looks to be extraneous configuration on my part and the fact that this event only impacted some, but not all, nodes.

On the upside, for our clusters at least, the impact was just a few sad pods and a few extra evictions. I mostly wanted to get this issue filed quickly since the release was quite fresh.

I'll try to report back tomorrow morning (PT) with more information if it's still needed at that time.

@misterek

Fairly certain we have a 10 GB root disk -- I don't quite remember the reason. @bcressey, will this be released to be picked up by the updater tonight?

@dschaaff

Karpenter picked up the update and we were bitten by this. Our root volumes were sized at 4 GiB instead of 2 GiB. I think this was a carryover from the example using 4 here: https://github.com/aws/karpenter/blob/main/examples/provisioner/bottlerocket.yaml#L26. We've updated ours to 2 GiB now.

dschaaff added a commit to dschaaff/karpenter that referenced this issue Mar 21, 2023
@bcressey (Contributor)

@misterek it would be picked up by the updater tonight.

However, for an in-place update, this guard in label-data-a.service and label-data-b.service should suppress the new behavior:

ConditionPathIsSymbolicLink=!/dev/disk/by-partlabel/BOTTLEROCKET-DATA

The BOTTLEROCKET-DATA link will already be present on any existing node, so that condition will fail and the new units will be skipped.
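
To double-check on a node that updated in place (via the admin container and sudo sheltie), you can verify that the link exists:

  ls -l /dev/disk/by-partlabel/BOTTLEROCKET-DATA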

@misterek

Ugh. Ok. Thanks. I've deleted that for now. That was handy but seems like it's too risky to use now.

@bcressey (Contributor)

@dschaaff nice catch!

We're rolling back the SSM parameter change for 1.13.0 to try to stop the bleeding for anyone else using the Karpenter default values.

@bcressey (Contributor)

With the SSM parameters rolled back to v1.12.0, new launches via Karpenter, EKS MNG, and eksctl should no longer be affected.

@etungsten is working on a fix to skip the resize unless the new partition would grow to at least 20 GiB, which should avoid this failure mode.

Plan is to release v1.13.1 once that's ready and tested.

@FernandoMiguel

The reason folks using Karpenter set 4 GiB is to support both GPU and non-GPU Bottlerocket AMIs, since the GPU ones are a tad bigger and would not fit in 2 GiB.
I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

@misterek

Thanks @nairb774 for reporting this so quickly, and thanks @bcressey and @etungsten for addressing it just as quickly!

@FernandoMiguel

Karpenter's prior change, for reference: aws/karpenter-provider-aws#2400

@etungsten (Contributor) commented Mar 21, 2023

I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

Yes, that would be ideal. But I also understand the value in providing an example of how to adjust the data volume size in the block device mapping if one wants to do so. In that case, the user has to know how much space to give the root volume for the host to boot up correctly.

Fortunately, with the fix we're adding, if the root volume size is below 22 GiB, Bottlerocket skips creating the data partition on the root volume, so there is no ambiguity as long as the data volume is present (with at least 20 GiB of space). This restores the previous behavior for people using the default Karpenter block device mapping example with the 4 GiB root volume.
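
As a rough illustration of that threshold (just a sketch of the decision logic, not the actual implementation; the device name is hypothetical):

  # Hypothetical root device; querying the size requires root.
  ROOT_DISK=/dev/nvme0n1
  ROOT_GIB=$(( $(blockdev --getsize64 "$ROOT_DISK") / 1024 / 1024 / 1024 ))
  if [ "$ROOT_GIB" -lt 22 ]; then
    echo "root volume is ${ROOT_GIB} GiB (< 22 GiB): no data partition is created on it"
  else
    echo "root volume is ${ROOT_GIB} GiB: the spare space would become a data partition of at least 20 GiB"
  fi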

@bcressey (Contributor)

Yes, that would be ideal. But I also understand the value in providing an example of how to adjust the data volume size in the block device mapping if one wants to do so. In that case, the user has to know how much space to give the root volume for the host to boot up correctly.

Ideally it would be possible to just specify xvdb in the block device mapping, for example:

  blockDeviceMappings:
    - deviceName: /dev/xvdb
      ebs:
        volumeType: gp3
        volumeSize: 20Gi # replace with your required disk size
        deleteOnTermination: true

That ought to use the AMI default for xvda, which is either 2 GiB or 4 GiB depending on the variant, while resizing xvdb to the desired size.

@joebowbeer

@bcressey wrote:

Ideally it would be possible to just specify xvdb in the block device mapping [...] That ought to use the AMI default for xvda, which is either 2 GiB or 4 GiB depending on the variant, while resizing xvdb to the desired size.

@bcressey Can you clarify "ideally"? Is this what will happen if xvdb is the only device configured?

@joebowbeer

@FernandoMiguel wrote:

I guess the correct approach would be to not pick a size at all and let the AMI use its own default, which depends on whether or not it's a GPU-enabled AMI?

Do you happen to know if this default sizing will work?

In our case, we want to adjust some of the xvda parameters, but not the size, for example:

  block_device_mappings {
    # This is the root volume.
    device_name = "/dev/xvda"

    ebs {
      # Let AMI use its default volume_size (2Gi); do not increase:
      # https://github.com/bottlerocket-os/bottlerocket/issues/2915
      volume_type           = "gp3"
      delete_on_termination = true
      encrypted             = true
    }
  }

@FernandoMiguel

@joebowbeer I recall there was a bug in Karpenter where this didn't work.
