refactor NVHPC easyblock into generic NvidiaBase easyblock, and custom easyblocks for nvidia-compilers + NVHPC (#3788)
…s cannot be determined in NvidiaBase
Test report by @lexming. Overview of tested easyconfigs (in order):
Build succeeded for 3 out of 3 (3 easyconfigs in total)
    if LooseVersion(self.version) >= LooseVersion('25.0'):
        remove(os.path.join(abs_install_subdir, 'comm_libs', 'nccl'))
We tried to use the EasyBlock on our JSC systems to install nvidia-compilers. While this EasyBlock worked on almost all machines, this particular remove failed on JUWELS. I don't know why, to be honest, but for some reason this folder doesn't exist. The one above (nccl_dir_glob) does exist and is removed.
Running the install command again in a shell results in the following directory structure:
eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs
12.9 13.0 hpcx mpi
eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs/12.9
hpcx nccl nccl-2.26 nccl-2.27 nvshmem
eb-shell> ls /p/project1/cjsc/reuter1/EasyBuild/Next/easybuild/juwels/software/nvidia-compilers/25.9-CUDA-12/Linux_x86_64/25.9/comm_libs/13.0/
hpcx nccl nvshmem

I would suggest to not fail if those paths (i.e. the generic nccl, mpi and nvshmem ones) do not exist to begin with...
Removing the math libraries could also pose problems, since not everything is provided by CUDA, NCCL and NVSHMEM.
With NVHPC 25.9, one could provide most things with the new environment variables. The math libraries could be a bit more complicated though, as e.g. cuFFTmp is a separate package not included in CUDA. However, we only have one environment variable to pass the path 🙈
(ignoring that cuFFTMp doesn't support CUDA 13 yet, which would also cause issues if we e.g. want to include that in the CUDA installation).
I would suggest to not fail if those paths (i.e. the generic nccl, mpi and nvshmem ones) do not exist to begin with...
That makes sense; several removes were already guarded by a glob expansion to ensure those paths exist. I have extended that to all removes so that none can fail. See 195e18a
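The glob-guarded removal described here can be sketched with plain stdlib calls (a minimal standalone sketch; `remove_if_present` is a hypothetical helper, not EasyBuild's actual `remove()`):

```python
import glob
import os
import shutil

def remove_if_present(pattern):
    """Remove every path matching the glob pattern.

    A pattern with no matches expands to an empty list, so a missing
    path (e.g. comm_libs/nccl on some systems) is simply skipped
    instead of making the whole installation fail.
    """
    removed = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # directory tree
        else:
            os.remove(path)  # regular file or symlink
        removed.append(path)
    return removed
```

Running every cleanup through a glob expansion like this makes the per-machine layout differences (such as the JUWELS case above) harmless.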
    ]
    for nvhpc_opt in disabled_nvhpc_options:
        if self.cfg[nvhpc_opt]:
            self.log.warning(f"Option '{nvhpc_opt}' forced to disabled in {self.name}-{self.version}")
I'd prefer that this immediately aborts the build instead of "silently" changing the option.
This only shows up in the logs, so people might not be aware of the change (as I just wasn't).
I'm also confused why we remove NCCL and the math libraries, but keep NVSHMEM?
I'd prefer to allow all three of them, but disable them by default.
Good point, changed to an error in a78668e
Regarding NVSHMEM, that is just an oversight; nvidia-compilers does not provide anything other than the compilers. If any user needs extra stuff, NVHPC should be used.
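The warning-to-error change could look roughly like this (a standalone sketch; the option name used in the example is made up, and the real easyblock raises `easybuild.tools.build_log.EasyBuildError` rather than this stand-in class):

```python
class EasyBuildError(Exception):
    """Stand-in for easybuild.tools.build_log.EasyBuildError."""

def check_disabled_options(cfg, name, version, disabled_options):
    """Abort the build if the user enabled an option this easyblock
    does not support, instead of silently disabling it and only
    mentioning that in the log."""
    for opt in disabled_options:
        if cfg.get(opt):
            raise EasyBuildError(
                f"Option '{opt}' is not supported by {name}-{version}, "
                "use the NVHPC easyconfig instead"
            )
```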
If any user needs extra stuff, NVHPC should be used.
Well, this is not compatible with the approach chosen in framework, where toolchains now build on top of nvidia-compilers. So one would need to maintain a custom set of toolchains to keep the options one has right now.
Until we're certain that we are able to provide all the libraries NVHPC ships internally, I'm reluctant to simply remove all the libraries completely, giving users only the option to use none (nvidia-compilers) or all (NVHPC) of the libraries (including HPCX), with nothing sensible in between.
Just to give an overview what is missing in CUDA (just the math libraries alone):
- cuBLASMp -- {math}[gompi/2025b] cuBLASMp v0.7.0 w/ CUDA 12.9.1 easybuild-easyconfigs#24772
- cuFFTMp -- available as EC, but not for nvidia-compilers
- cuSOLVERMp -- {math}[GCCcore/14.3.0] cuSOLVERMp v0.7.2 w/ CUDA 12.9.1 easybuild-easyconfigs#24773
- cuTENSOR & cuTENSORMg -- available as EC
There are applications which rely on these for certain functionality. Take e.g. GROMACS, using cuFFTMp as an optional feature.
Keeping any of the math_libs also requires keeping nccl and nvshmem, since those are directly linked. Alternatively, one would need to provide the libraries via dependencies, which will be complicated if we want to build NVSHMEM (which needs the compiler we're just installing).
    # also include the location where libm & co live on Debian-based systems
    # cfr. https://github.com/easybuilders/easybuild-easyblocks/pull/919
    append LDLIBARGS=-L/usr/lib/x86_64-linux-gnu;
Should we be careful with this, in the case we ever want to include NVHPC (or at least modules built on top of it) in EESSI in some way?
this is actually very old code and one of the few things I have not changed in this PR, so I suggest to leave this potential issue for another PR
Indeed, let's deal with this in a follow-up PR.
It should boil down to making this easyblock aware of the sysroot EasyBuild configuration setting.
May be of interest to @adammccartney
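A sysroot-aware version of that LDLIBARGS line could be derived along these lines (a sketch; in the real easyblock the sysroot would come from EasyBuild's `build_option('sysroot')` configuration setting rather than a function argument):

```python
import os

def ldlib_args(sysroot=None):
    """Build the LDLIBARGS line for makelocalrc, prefixing the Debian
    multiarch library directory with the configured sysroot (if any)."""
    libdir = '/usr/lib/x86_64-linux-gnu'
    if sysroot:
        # rebase the absolute path inside the sysroot
        libdir = os.path.join(sysroot, libdir.lstrip('/'))
    return f'append LDLIBARGS=-L{libdir};'
```

With no sysroot configured this reproduces the existing hardcoded line, so the change would be backwards-compatible.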
    for filename in ["libnuma.so", "libnuma.so.1"]:
        path = os.path.join(compilers_subdir, "lib", filename)
        if os.path.islink(path):
            os.remove(path)
Is there a reason why we use os.remove here, and remove below?
Because this is old code from the current nvhpc.py, and at the time our remove() didn't support symlinks. Fixed in d9a3adb
Based on [easybuilders#23125](easybuilders#23125), this tries out one possible variation of an easyconfig for the nvidia-compilers. This is to make the module compatible with the defaults in EESSI/2023.06. It swaps out the SYSTEM toolchain for GCCcore-13.2.0 and uses that same gcccore version in the other deps. Also default to CUDA-12.4. Requires: + easybuilders/easybuild-easyblocks#3788 + easybuilders/easybuild-framework#4927
Test report by @lexming. Overview of tested easyconfigs (in order):
Build succeeded for 3 out of 3 (total: 1 hour 59 mins 32 secs) (3 easyconfigs in total)
Test report by @lexming. Overview of tested easyconfigs from easybuilders/easybuild-easyconfigs#23125 (in order):
Build succeeded for 9 out of 9 (total: 46 mins 29 secs) (8 easyconfigs in total)
Test report by @lexming. Overview of tested easyconfigs from easybuilders/easybuild-easyconfigs#23125 (in order):
Build succeeded for 8 out of 8 (total: 32 mins 55 secs) (8 easyconfigs in total)
To correctly set the default CUDA version, we need to change the command in install_components/install. By default, the installer determines the CUDA version dynamically (via nvc -printcudaversion), which determines e.g. the paths for the symlinks created by NVHPC. Without a GPU present, this will likely result in the maximum CUDA version. We should replace the dynamic DESIREDCUDA assignment with the CUDA version we actually want.
If we already loaded CUDA, we could do something like this:

    DESIREDCUDA=$(nvcc --version | sed -n 's/^.*release \([0-9]\+\.[0-9]\+\).*$/\1/p')

(c.f. https://stackoverflow.com/questions/9727688/how-to-get-the-cuda-version)
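The sed expression extracts the 'major.minor' release number from the `nvcc --version` banner; the same extraction in Python, for illustration (the sample banner text below is illustrative, not captured output):

```python
import re

def cuda_release(nvcc_output):
    """Extract 'major.minor' from the 'release X.Y' part of nvcc --version output."""
    match = re.search(r'release (\d+\.\d+)', nvcc_output)
    return match.group(1) if match else None

# Illustrative nvcc banner (not captured from a real system)
SAMPLE_BANNER = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 12.4, V12.4.131\n"
)
```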
We can do this in a follow-up PR. CUDA 13 is not covered here.
A diff like this could work (here for our JSC repo):

diff --git a/Custom_EasyBlocks/generic/nvidiabase.py b/Custom_EasyBlocks/generic/nvidiabase.py
index 855286e5e..ced354e1d 100644
--- a/Custom_EasyBlocks/generic/nvidiabase.py
+++ b/Custom_EasyBlocks/generic/nvidiabase.py
@@ -48,7 +48,7 @@ from easybuild.framework.easyconfig import CUSTOM
 from easybuild.tools import LooseVersion
 from easybuild.tools.build_log import EasyBuildError, print_warning
 from easybuild.tools.config import build_option
-from easybuild.tools.filetools import adjust_permissions, remove, symlink, write_file
+from easybuild.tools.filetools import adjust_permissions, remove, symlink, write_file, apply_regex_substitutions
 from easybuild.tools.modules import MODULE_LOAD_ENV_HEADERS, get_software_root, get_software_version
 from easybuild.tools.run import run_shell_cmd
 from easybuild.tools.systemtools import AARCH64, X86_64, get_cpu_architecture, get_shared_lib_ext
@@ -443,6 +443,15 @@ class NvidiaBase(PackedBinary):
             'NVHPC_STDPAR_CUDACC': self.default_compute_capability[0].replace('.', ''),
         })
+        # Before installing, make sure that NVHPC chooses the CUDA version we desire.
+        # By default, NVHPC calls 'nvc -printcudaversion', which completely ignores our set
+        # version, and only cares about the supported GPUs and found CUDA driver.
+        # On a system without GPUs, this may return a CUDA version incompatible with the one
+        # we define in active_cuda_version.
+        desired_cuda_version = self.cfg['default_cuda_version'] or self.active_cuda_version
+        desired_cuda_var_regex = [(r'DESIREDCUDA=\$(.*)', f'DESIREDCUDA={str(desired_cuda_version)}')]
+        apply_regex_substitutions('./install_components/install', desired_cuda_var_regex, on_missing_match='error')
+
         cmd_env = ' '.join([f'{name}={value}' for name, value in sorted(nvhpc_env_vars.items())])
         run_shell_cmd(f"{cmd_env} ./install")

Depending on whether we want to allow users to use CUDA as a dependency and set default_cuda_version...
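The core of that patch, rewriting the dynamic `DESIREDCUDA=$( ... )` assignment to a fixed value, can be shown without the EasyBuild helper (a standalone sketch; unlike `apply_regex_substitutions`, this version operates on a string and escapes the parentheses in the pattern explicitly):

```python
import re

def pin_desired_cuda(script_text, cuda_version):
    """Replace the dynamic DESIREDCUDA=$( ... ) assignment in the
    installer script text with a fixed CUDA version."""
    new_text, count = re.subn(
        r'DESIREDCUDA=\$\(.*\)',
        f'DESIREDCUDA={cuda_version}',
        script_text,
    )
    if count == 0:
        # mirror on_missing_match='error': fail loudly if the script changed
        raise ValueError("DESIREDCUDA assignment not found in install script")
    return new_text
```

Failing when the pattern is missing matters here: if NVIDIA restructures the installer script, a silent no-op would reintroduce the GPU-detection behavior unnoticed.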
Given that this already breaks NVHPC-25.9-CUDA-12.9.1.eb in easybuilders/easybuild-easyconfigs#23989, we should fix this. The same issue applies to all other multi-CUDA NVHPC versions, where one tries to use the older CUDA. See e.g. this old forum thread: ...
I'm fine with postponing fixes for NVHPC 25.9+ / CUDA 13.0 to not hold up the PR being included in an EasyBuild release. However, we should fix these things as soon as possible after merging. There are other things we should take a look at as well, e.g. providing the ...
Co-authored-by: Jan André Reuter <jan.andre.reuter@hotmail.de>
…NvidiaBase._get_active_cuda
…en building with NVHPC, see also open-mpi/ompi#12470
NvidiaBase easyblock, and custom easyblocks for nvidia-compilers + NVHPC
Test report by @boegel. Overview of tested easyconfigs (in order):
Build succeeded for 2 out of 2 (total: 56 mins 14 secs) (2 easyconfigs in total)
Fixes easybuilders/easybuild-framework#4853
Depends on:
- …$PATH environment variable with module_load_environment in init of Binary easyblock #3787

The existing NVHPC easyblock becomes NvidiaBase, similar to what we do on the Intel side with IntelBase. Two new easyblocks are added on top of NvidiaBase:
- nvidia-compilers: provides support for (only) the compilers in NVHPC, analogue to intel-compilers
- NVHPC: sits on top of nvidia-compilers and adds the other packages bundled in NVHPC to make a full toolchain: HPCX for MPI and NVBLAS for the math libraries

I also added two extra features to NVHPC:
- cuda_compute_capabilities is now optional. If set, the generated module will set the environment variable $EBNVHPCCUDACC, which will then be used to set cuda_compute_capabilities for every installation in that toolchain automatically.
- CUDA can now be a dependency of nvidia-compilers and NVHPC, and is also used to generate the CUDA templates (e.g. %(cudaver)s)

Notes:
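The $EBNVHPCCUDACC fallback described above could behave roughly like this (a sketch under the assumption that the toolchain module exports the capabilities as a comma-separated list; the helper name is made up):

```python
import os

def resolve_compute_capabilities(cfg_value):
    """Use cuda_compute_capabilities from the easyconfig if set;
    otherwise fall back to $EBNVHPCCUDACC as exported by the
    generated toolchain module."""
    if cfg_value:
        return cfg_value
    env_value = os.environ.get('EBNVHPCCUDACC')
    if env_value:
        # assume a comma-separated list like '8.0,9.0'
        return env_value.split(',')
    return []
```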