3.7.2: Lustre kmod modprobe breaks custom AMI based on RHEL8 #5913

Open
nyetsche opened this issue Dec 4, 2023 · 4 comments

nyetsche commented Dec 4, 2023

My organization requires using RHEL8 (a supported OS) from the privately shared, Red Hat-licensed base image. We then use pcluster build-image to make it ready for ParallelCluster.

The pcluster build-image task has started failing for us recently. The initial AMI starts with RHEL-8.8 (I also tried 8.7), but is updated to RHEL 8.9 by the redhat-release RPM during the build:

EVENTS  1700589295187   Step UpdateOS   1700589294393
EVENTS  1700589295187   ExecuteBash: STARTED EXECUTION  1700589294395

[...]

EVENTS  1700589326128   Stdout:  redhat-release                           x86_64  8.9-0.1.el8                    rhel-8-baseos-rhui-rpms       45 k 1700589326002

That comes from the UpdateOS section of the playbook:

- name: UpdateOS
  action: ExecuteBash
  inputs:
    commands:
      - |
        set -v
        OS='{{ build.OperatingSystemName.outputs.stdout }}'
        PLATFORM='{{ build.PlatformName.outputs.stdout }}'

        if [[ ${!PLATFORM} == RHEL ]]; then
          yum -y update
[...]

The yum -y update brings every package to its most recent version, including redhat-release and kernel-*.

The failure occurs later, when the kernel_module 'lnet' resource runs: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.7.2/cookbooks/aws-parallelcluster-environment/resources/lustre/partial/_install_lustre_centos_redhat.rb#L36

EVENTS  1700590745016   Stdout: [2023-11-21T18:19:01+00:00] INFO: dnf_package[kmod-lustre-client, lustre-client, dracut] installed ["kmod-lustre-client", "lustre-client", nil] at ["0:2.12.8-1.fsx7.el8.x86_64", "0:2.12.8-1.fsx7.el8.x86_64", nil]    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package kmod-lustre-client    1700590741488
EVENTS  1700590745016   Stdout:       - install version 0:2.12.8-1.fsx7.el8.x86_64 of package lustre-client 1700590741488
EVENTS  1700590745016   Stdout:     * kernel_module[lnet] action install[2023-11-21T18:19:04+00:00] INFO: Processing kernel_module[lnet] action install ((eval) line 36)    1700590744740
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Error executing action `install` on resource 'kernel_module[lnet]'    1700590744770
EVENTS  1700590745016   Stdout:       ================================================================================  1700590744770
EVENTS  1700590745016   Stdout:       Mixlib::ShellOut::ShellCommandFailed  1700590744770
EVENTS  1700590745016   Stdout:       ------------------------------------  1700590744770
EVENTS  1700590745016   Stdout:       Expected process to exit with [0], but received '1'   1700590744770
EVENTS  1700590745016   Stdout:       ---- Begin output of modprobe lnet ----   1700590744770
EVENTS  1700590745016   Stdout:       STDOUT:   1700590744770
EVENTS  1700590745016   Stdout:       STDERR: modprobe: FATAL: Module lnet not found in directory /lib/modules/4.18.0-513.5.1.el8_9.x86_64  1700590744770

That is, there's no lnet module in /lib/modules/4.18.0-513.5.1.el8_9.x86_64.
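
One way to confirm this kind of mismatch before the cookbook reaches modprobe is to look for the module file under the target kernel's directory. The following is a minimal sketch, not part of the ParallelCluster cookbook: the function name module_available and the MODULES_ROOT override are hypothetical and exist only for illustration. It checks for a module file only and does not verify that modprobe's dependency index is up to date.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a module file exists for the given kernel release.
# MODULES_ROOT is a hypothetical override (useful for testing);
# it defaults to the standard /lib/modules location.
module_available() {
  local kver="$1" module="$2" root="${MODULES_ROOT:-/lib/modules}"
  # No directory for this kernel release means no modules at all.
  [ -d "$root/$kver" ] || return 1
  # Module files may be compressed (.ko, .ko.xz, ...), hence the trailing glob.
  find "$root/$kver" -name "${module}.ko*" 2>/dev/null | grep -q .
}
```

For example, `module_available "$(uname -r)" lnet || echo "lnet missing for $(uname -r)"` should report the module as missing on a kernel such as the one in the log above.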

The kernel compatibility matrix in this document https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html indeed doesn't mention 4.18.0-513, and the upstream at https://downloads.whamcloud.com/public/lustre/latest-2.12-release/el8/client/ doesn't include it either. So I realize this is actually a Lustre packaging issue, but I'm not sure how to get in touch with the FSx for Lustre team. Even so, it'd be great to have a workaround. Right now we can't use new AMIs for compute nodes.

I'm unsure of the best way forward here - blacklist redhat-release* and/or kernel-* from the build-image process? Ignore errors from modprobe lnet?
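
To illustrate the first option: yum and dnf accept --exclude globs on the command line. The sketch below only shows what such an update command could look like. The names update_excluding and DRY_RUN are hypothetical, and whether the build-image UpdateOS step can be made to run a command like this is not established in this thread.

```shell
#!/usr/bin/env bash
# Sketch: update all packages except those matching the given globs.
# DRY_RUN=1 (a hypothetical switch for this example) prints the command
# instead of executing it.
update_excluding() {
  local cmd=(yum -y update) pattern
  for pattern in "$@"; do
    cmd+=("--exclude=${pattern}")
  done
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Quoted so the globs are printed literally, not expanded by the shell.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Calling `DRY_RUN=1 update_excluding 'kernel*' 'redhat-release*'` prints the command that would hold back the kernel and release packages while updating everything else.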

nyetsche added the 3.x label Dec 4, 2023
hgreebe (Contributor) commented Dec 5, 2023

A possible workaround is to skip the OS upgrade during the build-image process by setting this config option to false: https://docs.aws.amazon.com/parallelcluster/latest/ug/Build-v3.html#Build-v3-UpdateOsPackages
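
In configuration terms, this suggestion corresponds to something like the sketch below, which writes a build-image configuration with the update step disabled. The helper name write_image_config, the instance type, and the AMI ID passed in are placeholders, not values taken from this issue.

```shell
#!/usr/bin/env bash
# Sketch: write a build-image configuration with OS package updates disabled.
# InstanceType and the parent AMI ID are placeholders for illustration.
write_image_config() {
  local path="$1" parent_image="$2"
  cat > "$path" <<EOF
Build:
  InstanceType: c5.xlarge
  ParentImage: ${parent_image}
  UpdateOsPackages:
    Enabled: false
EOF
}
```

After `write_image_config image-config.yaml <parent-ami-id>`, the file can be passed to `pcluster build-image --image-id <image-id> --image-configuration image-config.yaml`.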

coderforlife commented

@hgreebe The documentation says that option is false by default.

coderforlife commented

Can an option be added to the image builder to NOT include lustre/fsx support at all? Many setups do not require it and it would make it way easier to support many custom AMIs as it is the biggest sticking point in version compatibility.

enrico-usai (Contributor) commented Jan 15, 2024

Hi @coderforlife,
you're correct: UpdateOsPackages is set to false by default.

@hgreebe suggested that @nyetsche set it to false because @nyetsche said:

The initial AMI starts with RHEL-8.8 (I also tried 8.7), but is updated to RHEL 8.9 by the redhat-release RPM during the build

and the UpdateOS step would be executed ONLY when UpdateOsPackages is set to true. So this should have solved the issue for @nyetsche.

In any case, we have internally tracked a feature request to skip installing the FSx for Lustre drivers and to support updated kernels for which the client is not yet available.

Enrico
