
refactor NVHPC easyblock into generic NvidiaBase easyblock, and custom easyblocks for nvidia-compilers + NVHPC (#3788)

Merged

boegel merged 25 commits into easybuilders:develop from lexming:nvhpc on Dec 14, 2025

Conversation

@lexming
Contributor

@lexming lexming commented Jun 18, 2025

Fixes easybuilders/easybuild-framework#4853

Depends on:

The existing NVHPC easyblock becomes NvidiaBase, similar to what we do on Intel side with IntelBase. Two new easyblocks are added on top of NvidiaBase:

  • nvidia-compilers: provides support for (only) the compilers in NVHPC, analogous to intel-compilers
  • NVHPC: sits on top of nvidia-compilers and adds the other packages bundled in NVHPC to make a full toolchain: HPC-X for MPI and NVBLAS for the math libraries.

I also added two extra features to NVHPC:

  • Setting a default cuda_compute_capabilities is now optional. If set, the generated module will define the environment variable $EBNVHPCCUDACC, which is then used to set cuda_compute_capabilities automatically for every installation in that toolchain.
  • Setting a default CUDA version continues to be mandatory, unless CUDA is an external dependency. In all cases, the generated module will define the environment variable $EBNVHPCCUDAVER, which is used to ensure consistency between the CUDA versions of nvidia-compilers and NVHPC, and also to generate the CUDA templates (e.g. %(cudaver)s)
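As an illustration of how the $EBNVHPCCUDACC propagation described above could be consumed, here is a minimal sketch; only the environment variable name comes from this PR, the helper function itself is hypothetical:

```python
import os

def cuda_ccs_from_env(env_var='EBNVHPCCUDACC'):
    """Hypothetical helper: derive cuda_compute_capabilities from the
    environment variable exported by the generated NVHPC module,
    assuming a comma-separated value such as '8.0,9.0'."""
    raw = os.environ.get(env_var, '')
    return [cc.strip() for cc in raw.split(',') if cc.strip()]

# example: the NVHPC module has set EBNVHPCCUDACC='8.0,9.0'
os.environ['EBNVHPCCUDACC'] = '8.0,9.0'
print(cuda_ccs_from_env())  # ['8.0', '9.0']
```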

Notes:

  • the changes to FFTW broaden a check for old versions of OpenMPI to also cover NVHPC

@lexming
Contributor Author

lexming commented Jul 25, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS NVHPC-22.11-CUDA-11.7.0.eb
  • SUCCESS NVHPC-23.7-CUDA-12.1.1.eb
  • SUCCESS NVHPC-24.7-CUDA-12.6.0.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
node250.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA Tesla P100-PCIE-16GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/0f04145f88e30abf70ff640ae6e6f103 for a full test report.

Comment on lines +482 to +483
    if LooseVersion(self.version) >= LooseVersion('25.0'):
        remove(os.path.join(abs_install_subdir, 'comm_libs', 'nccl'))
Collaborator

@Thyre Thyre Oct 22, 2025

We tried to use the easyblock on our JSC systems to install nvidia-compilers. While it worked on almost all machines, this particular remove failed on JUWELS. I honestly don't know why, but for some reason this folder doesn't exist. The one above (nccl_dir_glob) does exist and is removed.

Running the install command again in a shell results in the following directory structure:

eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs
12.9  13.0  hpcx  mpi
eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs/12.9
hpcx  nccl  nccl-2.26  nccl-2.27  nvshmem
eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs/13.0/
hpcx  nccl  nvshmem

I would suggest not failing if those paths (i.e. the generic nccl, mpi and nvshmem ones) do not exist to begin with...

Collaborator

@Thyre Thyre Oct 22, 2025

Removing the math libraries could also pose problems, since not everything is provided by CUDA, NCCL and NVSHMEM.
With NVHPC 25.9, one could provide most things with the new environment variables. The math libraries could be a bit more complicated though, as e.g. cuFFTmp is a separate package not included in CUDA. However, we only have one environment variable to pass the path 🙈

(ignoring that cuFFTMp doesn't support CUDA 13 yet, which would also cause issues if we e.g. want to include that in the CUDA installation).

Contributor Author

@lexming lexming Nov 19, 2025

I would suggest not failing if those paths (i.e. the generic nccl, mpi and nvshmem ones) do not exist to begin with...

That makes sense; several remove calls were already guarded by a glob expansion to ensure those paths exist. I have extended that approach to all removes so that none can fail. See 195e18a

    ]
    for nvhpc_opt in disabled_nvhpc_options:
        if self.cfg[nvhpc_opt]:
            self.log.warning(f"Option '{nvhpc_opt}' forced to disabled in {self.name}-{self.version}")
Collaborator

@Thyre Thyre Oct 22, 2025

I'd prefer that this immediately aborts the build instead of "silently" changing the option.
This only shows up in the logs, so people might not be aware of the changes (like I just did).

I'm also confused why we remove NCCL and the math libraries, but keep NVSHMEM?
I'd prefer to allow all three of them, but disable them by default.

Contributor Author

Good point, changed to an error in a78668e.
Regarding NVSHMEM, that was just an oversight: nvidia-compilers does not provide anything other than the compilers. If any user needs extra stuff, NVHPC should be used.
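The warning-to-error change discussed here could look roughly like this (a self-contained sketch: the option name and the stand-in exception class are illustrative, not the easyblock's actual code):

```python
class EasyBuildError(Exception):
    """Stand-in for easybuild.tools.build_log.EasyBuildError."""

def check_disabled_options(cfg, name, version, disabled_options):
    """Abort the build (instead of silently disabling) when the easyconfig
    enables an option this easyblock does not support."""
    for opt in disabled_options:
        if cfg.get(opt):
            raise EasyBuildError(f"Option '{opt}' is not supported by {name}-{version}")

# hypothetical option name, for illustration only
try:
    check_disabled_options({'module_nvhpc_own_mpi': True}, 'nvidia-compilers', '25.3',
                           ['module_nvhpc_own_mpi'])
except EasyBuildError as err:
    print(err)  # Option 'module_nvhpc_own_mpi' is not supported by nvidia-compilers-25.3
```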

Collaborator

@Thyre Thyre Dec 8, 2025

If any user needs extra stuff, NVHPC should be used.

Well, this is not compatible with the approach chosen in the framework, where toolchains now build on top of nvidia-compilers. So one would need to maintain a custom set of toolchains to keep the options available right now.

Until we're certain that we can provide all the libraries NVHPC ships internally, I'm reluctant to completely remove them all, giving users only the choice between none of the libraries (nvidia-compilers) and all of them, including HPC-X (NVHPC), with nothing sensible in between.

Just to give an overview of what is missing from CUDA (the math libraries alone):

There are applications which rely on these for certain functionality; take e.g. GROMACS, which uses cuFFTMp as an optional feature.

Collaborator

Keeping any of the math_libs also requires keeping nccl and nvshmem, since those are directly linked. Alternatively, one would need to provide the libraries via dependencies, which will be complicated if we want to build NVSHMEM (which needs the compiler we're just installing).

Comment on lines +70 to +72
    # also include the location where libm & co live on Debian-based systems
    # cfr. https://github.com/easybuilders/easybuild-easyblocks/pull/919
    append LDLIBARGS=-L/usr/lib/x86_64-linux-gnu;
Collaborator

Should we be careful with this, in the case we ever want to include NVHPC (or at least modules built on top of it) in EESSI in some way?

Contributor Author

This is actually very old code and one of the few things I have not changed in this PR, so I suggest leaving this potential issue for another PR.

Member

Indeed, let's deal with this in a follow-up PR.

It should boil down to making this easyblock aware of the sysroot EasyBuild configuration setting.

May be of interest to @adammccartney

    for filename in ["libnuma.so", "libnuma.so.1"]:
        path = os.path.join(compilers_subdir, "lib", filename)
        if os.path.islink(path):
            os.remove(path)
Collaborator

Is there a reason why we use os.remove here, and remove below?

Contributor Author

Because this is old code from the current nvhpc.py; at the time, our remove() didn't support symlinks. Fixed in d9a3adb

adammccartney added a commit to adammccartney/easybuild-easyconfigs that referenced this pull request Oct 31, 2025
Based on [easybuilders#23125](easybuilders#23125),
this tries out one possible variation of an easyconfig for the
nvidia-compilers. This is to make the module compatible with the
defaults in EESSI/2023.06. It swaps out the SYSTEM toolchain for
GCCcore-13.2.0 and uses that same gcccore version in the other deps.
Also default to CUDA-12.4.

Requires:
+ easybuilders/easybuild-easyblocks#3788
+ easybuilders/easybuild-framework#4927
@lexming
Contributor Author

lexming commented Nov 20, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS NVHPC-22.11-CUDA-11.7.0.eb
  • SUCCESS NVHPC-23.7-CUDA-12.1.1.eb
  • SUCCESS NVHPC-24.7-CUDA-12.6.0.eb

Build succeeded for 3 out of 3 (total: 1 hour 59 mins 32 secs) (3 easyconfigs in total)
node801.hydra.os - Linux Rocky Linux 9.6 (Blue Onyx), x86_64, AMD EPYC 9275F 24-Core Processor, 1 x NVIDIA NVIDIA H200 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/lexming/f13a13fde4ec827acfb1cfb091642526 for a full test report.

@lexming
Contributor Author

lexming commented Nov 21, 2025

Test report by @lexming
Using easyblocks from PR(s) #3788

Overview of tested easyconfigs from easybuilders/easybuild-easyconfigs#23125 (in order)

  • SUCCESS nvidia-compilers-25.1-CUDA-12.6.0.eb
  • SUCCESS nvidia-compilers-25.1.eb
  • SUCCESS nvidia-compilers-25.3.eb
  • SUCCESS NVHPC-25.1-CUDA-12.6.0.eb
  • SUCCESS NVHPC-25.1.eb
  • SUCCESS NVHPC-25.3.eb
  • SUCCESS CUDA-12.8.0.eb
  • SUCCESS nvidia-compilers-25.3-CUDA-12.8.0.eb
  • SUCCESS NVHPC-25.3-CUDA-12.8.0.eb

Build succeeded for 9 out of 9 (total: 46 mins 29 secs) (8 easyconfigs in total)
node400.hydra.os - Linux Rocky Linux 9.6 (Blue Onyx), x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 580.95.05, Python 3.9.21
See https://gist.github.com/lexming/26f604699123b2ff0e67d228ab5e740c for a full test report.

@lexming
Contributor Author

lexming commented Nov 21, 2025

Test report by @lexming
Using easyblocks from PR(s) #3788

Overview of tested easyconfigs from easybuilders/easybuild-easyconfigs#23125 (in order)

  • SUCCESS nvidia-compilers-25.1-CUDA-12.6.0.eb
  • SUCCESS nvidia-compilers-25.1.eb
  • SUCCESS nvidia-compilers-25.3.eb
  • SUCCESS NVHPC-25.1-CUDA-12.6.0.eb
  • SUCCESS NVHPC-25.1.eb
  • SUCCESS NVHPC-25.3.eb
  • SUCCESS nvidia-compilers-25.3-CUDA-12.8.0.eb
  • SUCCESS NVHPC-25.3-CUDA-12.8.0.eb

Build succeeded for 8 out of 8 (total: 32 mins 55 secs) (8 easyconfigs in total)
node800.hydra.os - Linux Rocky Linux 9.6 (Blue Onyx), x86_64, AMD EPYC 9275F 24-Core Processor, 1 x NVIDIA NVIDIA H200 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/lexming/5f38d13b06fd79c6a99eed7bc31de87e for a full test report.

@Thyre
Collaborator

Thyre commented Nov 27, 2025

To correctly set the default CUDA version, we need to change the command in install_components/install. This is especially important with CUDA 13 (NVHPC 25.9+).

By default, the installer runs:

    CC=$INSTALL_DIR/$arch/$release/compilers/bin/nvc
    DESIREDCUDA=$($CC -printcudaversion 2>&1 | grep -i "selected cuda version" | cut -d'=' -f2)

which determines e.g. the paths for the symlinks created by NVHPC.

Without a GPU present, this will likely result in the maximum CUDA version. We should replace $CC -printcudaversion with e.g. nvc -printcudaversion -acc -gpu=sm_70, where sm_70 is the lowest GPU architecture specified. I need to look into this to fix our installation on e.g. JUSUF...

@jhmeinke
Contributor

If CUDA is already loaded, we could do something like this:

DESIREDCUDA=$(nvcc --version | sed -n 's/^.*release \([0-9]\+\.[0-9]\+\).*$/\1/p')

(c.f. https://stackoverflow.com/questions/9727688/how-to-get-the-cuda-version)
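The same extraction can be done in Python, e.g. when an easyblock already has the `nvcc --version` output in hand; a sketch mirroring the sed expression above (the sample output is abbreviated):

```python
import re

def cuda_release_from_nvcc(nvcc_output):
    """Extract the CUDA release (e.g. '12.4') from `nvcc --version` output."""
    match = re.search(r'release (\d+\.\d+)', nvcc_output)
    return match.group(1) if match else None

sample = ("nvcc: NVIDIA (R) Cuda compiler driver\n"
          "Cuda compilation tools, release 12.4, V12.4.131\n")
print(cuda_release_from_nvcc(sample))  # 12.4
```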

@lexming
Contributor Author

lexming commented Nov 28, 2025

We can do this in a follow-up PR. CUDA 13 is not covered here.

@Thyre
Collaborator

Thyre commented Nov 28, 2025

A diff like this could work (here for our JSC repo):

diff --git a/Custom_EasyBlocks/generic/nvidiabase.py b/Custom_EasyBlocks/generic/nvidiabase.py
index 855286e5e..ced354e1d 100644
--- a/Custom_EasyBlocks/generic/nvidiabase.py
+++ b/Custom_EasyBlocks/generic/nvidiabase.py
@@ -48,7 +48,7 @@ from easybuild.framework.easyconfig import CUSTOM
 from easybuild.tools import LooseVersion
 from easybuild.tools.build_log import EasyBuildError, print_warning
 from easybuild.tools.config import build_option
-from easybuild.tools.filetools import adjust_permissions, remove, symlink, write_file
+from easybuild.tools.filetools import adjust_permissions, remove, symlink, write_file, apply_regex_substitutions
 from easybuild.tools.modules import MODULE_LOAD_ENV_HEADERS, get_software_root, get_software_version
 from easybuild.tools.run import run_shell_cmd
 from easybuild.tools.systemtools import AARCH64, X86_64, get_cpu_architecture, get_shared_lib_ext
@@ -443,6 +443,15 @@ class NvidiaBase(PackedBinary):
                 'NVHPC_STDPAR_CUDACC': self.default_compute_capability[0].replace('.', ''),
             })

+        # Before installing, make sure that NVHPC chooses the CUDA version we desire
+        # By default, NVHPC calls 'nvc -printcudaversion', which completely ignores our set
+        # version, and only cares about the supported GPUs and found CUDA driver.
+        # On a system without GPUs, this may return an incompatible CUDA version to the one
+        # we define in active_cuda_version.
+        desired_cuda_version = self.cfg['default_cuda_version'] or self.active_cuda_version
+        desired_cuda_var_regex = [(r'DESIREDCUDA=\$(.*)', f'DESIREDCUDA={str(desired_cuda_version)}')]
+        apply_regex_substitutions('./install_components/install', desired_cuda_var_regex, on_missing_match='error')
+
         cmd_env = ' '.join([f'{name}={value}' for name, value in sorted(nvhpc_env_vars.items())])
         run_shell_cmd(f"{cmd_env} ./install")

Depending on whether we want to allow users to use CUDA as a dependency and set default_cuda_version, we should keep desired_cuda_version or remove it.

@Thyre
Collaborator

Thyre commented Nov 28, 2025

We can do this in a follow-up PR. CUDA 13 is not covered here.

Given that this already breaks NVHPC-25.9-CUDA-12.9.1.eb in easybuilders/easybuild-easyconfigs#23989, we should fix this. The same issue applies to all other multi-CUDA NVHPC versions where one tries to use the older CUDA. See e.g. this old forum thread:

https://forums.developer.nvidia.com/t/how-to-install-multi-cuda-versions-hpc-sdk-why-nvhpc-default-cuda-does-not-take-effect/223351

@Thyre
Collaborator

Thyre commented Dec 8, 2025

I'm fine with postponing fixes for NVHPC 25.9+ / CUDA 13.0 to not hold up the PR being included in an EasyBuild release. However, we should fix these things as soon as possible after merging.

There are other things we should look at as well, e.g. providing the NVCOMPILER_[...] options in our external easyconfigs, so that NVHPC finds our external Nsight Systems and so on, and prefers them over the internal installation, which we remove.

@boegel boegel changed the title refactor NVHPC easyblock into NvidiaBase, nvidia-compilers and NVHPC refactor NVHPC easyblock into generic NvidiaBase easyblock, and custom easyblocks for nvidia-compilers + NVHPC Dec 14, 2025
@boegel
Member

boegel commented Dec 14, 2025

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS FFTW.MPI-3.3.10-gompi-2023b.eb

  • SUCCESS FFTW.MPI-3.3.10-lompi-2025b.eb

Build succeeded for 2 out of 2 (total: 56 mins 14 secs) (2 easyconfigs in total)
node4245.shinx.os - Linux RHEL 9.6, x86_64, AMD EPYC 9654 96-Core Processor (zen4), Python 3.9.21
See https://gist.github.com/boegel/c2d2b767764aeb742f1b5345324970d1 for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel boegel merged commit 908eb30 into easybuilders:develop Dec 14, 2025
22 checks passed
@lexming lexming deleted the nvhpc branch December 15, 2025 10:55

Successfully merging this pull request may close these issues:

Provide a full toolchain purely based on NVHPC