-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add udev rule to create symlinks using EBS volumes' device names #3977
Conversation
packages/os/ebs-volumes.rules
Outdated
|
||
# Create symlinks for the device name configured in EC2, for both block devices and partitions | ||
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result" | ||
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result$number" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set ENV{XVD_DEVICE_NAME}
in the parent, you should be able to import it here via IMPORT{parent}="XVD_DEVICE_NAME"
to avoid the second binary invocation.
packages/os/ebs-volumes.rules
Outdated
ENV{DEVTYPE}!="disk", GOTO="ebs_volumes_end" | ||
|
||
# Create symlinks for the device name configured in EC2, for both block devices and partitions | ||
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd sort of prefer to do this as part of the ghostdog scan
operation, but it's OK to have a different subcommand.
I'd like it to take similar arguments though. It should also be set up with similar guards, and run after the io_timeout rule:
# add these guards to the top, after ACTION=="remove"
KERNEL!="nvme*", GOTO="ebs_volumes_end"
ATTRS{model}!="Amazon Elastic Block Store", GOTO="ebs_volumes_end"
...
# Set XVD_DEVICE_NAME for disk
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/bin/ghostdog ebs-device-name $devnode"
# Set XVD_DEVICE_NAME for partition
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", IMPORT{parent}="{XVD_DEVICE_NAME}"
# Use XVD_DEVICE_NAME
XVD_DEVICE_NAME=="xvd*", SYMLINK+="$env{XVD_DEVICE_NAME}"
sources/ghostdog/src/nvme.rs
Outdated
#[derive(Debug)] | ||
/// Structure returned by the NVME_IDENTIFY_CTRL command | ||
/// See: https://github.com/linux-nvme/libnvme/blob/1ec2432ad0f71a8fc7e37548405d52923dd3dc6e/src/nvme/types.h#L1365 | ||
struct NvmeIdCtrl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooray, this struct has the same number of members as the types.h struct.
sources/ghostdog/src/nvme.rs
Outdated
|
||
impl From<NvmeIdCtrl> for EBSNvmeDevice { | ||
fn from(value: NvmeIdCtrl) -> Self { | ||
// The device name is within the first 32 bytes of the vendor specific data, which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the name start at byte 0 of vendor specific data? If the name is shorter than 32 bytes, are the remaining bytes guaranteed to be whitespace (including, as Rust helpfully points out, zero bytes)?
Nit: if the name starts at byte 0, one could use trim_end() rather than trim(). Is it worth testing for a non-blank device name here (just to make for a more informative error message)?
sources/ghostdog/src/nvme.rs
Outdated
// This is required, otherwise the syscall always returns the same device; this | ||
// value corresponds to what Amazon Linux does. | ||
// See: https://github.com/amazonlinux/amazon-ec2-utils/blob/09b53f2a20c3e4d227acfb452138fdb672baaa1a/ebsnvme-id#L114 | ||
cdw10: 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is defined in NVME 1.1, Figure 81 (page 84). The value 1 means "The Identify Controller data structure is returned to
the host."
sources/ghostdog/src/nvme.rs
Outdated
// Test that the size of the structure matches the expected size, in case the | ||
// structure shape is changed. This is forces developers to update the expected | ||
// size of the struct. | ||
fn test_nvme_id_ctrl_size() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I was so relieved when I saw this test. Yes, by all means, let us validate that the huge lump of repr(C) definition actually added up to 4k bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct to me.
Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might wish to update the lookaside cache and rerun the Github CI. If you're happy with appending the partition number to XVD_DEVICE_NAME, I'm happy (see comment on ebs-volumes.rules).
/// Print the device type in the environment key format udev expects. | ||
fn emit_device_type(device_type: &str) { | ||
println!("BOTTLEROCKET_DEVICE_TYPE={}", device_type); | ||
} | ||
|
||
/// Print the device name in the environment key format udev expects. | ||
fn emit_device_name(device_name: &str) { | ||
println!("XVD_DEVICE_NAME={}", device_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per https://docs.aws.amazon.com/ebs/latest/userguide/nvme-ebs-volumes.html I believe this is fine because:
- Device names are at most 32 bytes long, and unused trailing bytes are space characters.
- Device names do not contain any characters that would require quoting or escaping.
Therefore this will give us a valid environment key assignment of the form XVD_DEVICE_NAME=/some/stuff and udev will be happy.
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/bin/ghostdog ebs-device-name $devnode", SYMLINK+="$env{XVD_DEVICE_NAME}" | ||
|
||
# Add symlink for partition | ||
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", IMPORT{parent}="XVD_DEVICE_NAME", SYMLINK+="$env{XVD_DEVICE_NAME}$number" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What form do you expect XVD_DEVICE_NAME to take? Is appending the partition number to that going to be unambiguous? I ask because if XVD_DEVICE_NAME is something like /dev/sda1, you could get a symlink for /dev/sda11 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is exactly what AL2023 does, so That Will Be Fine (probably, not provably, but what do you want). The XVD_DEVICE_NAME from the controller is of the form xvda, and the partitions get named xvda1, xvda128, and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For attached-after-boot EBS, /dev/sdb and /dev/sdb1 appear. Also fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. The udev rules can be tweaked a bit for simplicity.
For readability, I prefer udev rules that bail out and skip to the end immediately if they don't match what the rule is trying to cover. And any conditions checked at top don't need to be checked again in the body of the rules file.
packages/release/release.spec
Outdated
@@ -122,6 +122,7 @@ Requires: %{_cross_os}kexec-tools | |||
Requires: %{_cross_os}keyutils | |||
Requires: %{_cross_os}makedumpfile | |||
Requires: %{_cross_os}netdog | |||
Requires: %{_cross_os}nvme-cli |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can omit the dependency in "release" since it will come in via ghostdog
sources/ghostdog/src/main.rs
Outdated
let device_name = | ||
&device_info[NVME_IDENTIFY_DATA_SIZE - 1024..NVME_IDENTIFY_DATA_SIZE - 1024 + 32]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let device_name = | |
&device_info[NVME_IDENTIFY_DATA_SIZE - 1024..NVME_IDENTIFY_DATA_SIZE - 1024 + 32]; | |
let offset = NVME_IDENTIFY_DATA_SIZE - 1024; | |
let device_name = &device_info[offset..offset + 32]; |
packages/os/ebs-volumes.rules
Outdated
|
||
# Follow AWS recommendation of never timing out IO on EBS volumes attached via NVMe | ||
KERNEL=="nvme*", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295" | ||
KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: restore the previous guard and omit the redundant KERNEL
and ATTRS{model}
checks:
KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295" | |
SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{queue/io_timeout}="4294967295" |
(Is it OK to keep the `SUBSYSTEM!="block" guard at the top? Or did that cause partitions to be skipped? Not sure why you removed it. If it works it's better to have it at the top and omit it here also.)
Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
The new command uses nvme-cli to query the vendor data from the EBS volumes, which contains the device name. Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
This adds two new rules to create symlinks for both the disks and the partitions, using the device name configured for the EBS volume Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
(Forced push addresses nits left in previous reviews.) |
Hey @bcressey @arnaldo2792, Thanks for working on this! Just so I fully understand, this introduces native support for instance store local NVME volumes without having to build a custom bootstrap container, right? Will this work with Karpenter's RAID0 feature out of the box? Even with the bootstrap container workaround, I believe Karpenter had issues allocating the volumes for use, as described here: |
Hey @ikevinc thanks for the kind words! Just a clarification, this does not solve the problem you are pointing out. This PR introduced a udev rule that allows users to identify NVME devices using EBS device mappings (see this). The feature that you want is currently being worked in bottlerocket-os/bottlerocket-core-kit#15 and bottlerocket-os/bottlerocket-core-kit#62. With these two changes, you will be able to use ephemeral storage and configure a RAID 0 array, without requiring you to have bootstrap containers. |
I see. Thanks for the clarifications. Will those make it into the 1.21.0 release? |
Issue number:
Closes #3975
Description of changes:
V2
The first two commits in the series add tools to interact with NVMe devices. They were used to query the device info instead of issuing
ioctl
syscalls, like in the first version of the PR.The third commit extends
ghostdog
and adds theebs-device-name
command to emit the device name from the vendor data returned bynvme-cli
.The last commit adds two
udev
rules that use the value emitted byghostdog
to create the required symlinks.V1
The first commit in the series extends
ghostdog
with a command to query the device usingioctl
. The implementation was inspired by the script used in Amazon Linux andlibnvme
. Since theioctl
syscall is an actualC
API, the structs used to interact with it includerepr(C)
so that the struct is in a compatible format forC
. The Rustonomicon suggests the usage ofrust-bindgen
to generate rust bindings to interact withC
APIs. However, adding the entirelibnvme
would be overkill so I added just the structures that are needed to query the device information we want.For the created structures, I used
kvm-bindings
as a reference to understand how the structures should be configured, as well as how the Rust types are mapped toC
types.The second commit in the series updates the existing udev rules to create symlinks to the EBS volumes, for both the devices and the partitions, using the device name configured for the volume.
Testing done:
With
aws-ecs-2
, x86_64, I confirmed that the symlinks are created pointing at the correct locations. In my instance,xvda
corresponds to the root volume (nvme0n1
) andxvdb
corresponds to the data volume (nvme1n1
):Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.