@boegel
Member

@boegel boegel commented Mar 18, 2023

(created using eb --new-pr)

WIP since we're using release candidates here, not final releases.

I had to strip out the CUDA-related patches we are using for OpenMPI 4.1.5 to get the build working; we'll need to figure out how to move forward there (cc @Micket, @bartoldeman)

requires:

@boegel boegel added the update label Mar 18, 2023
@boegel boegel marked this pull request as draft March 18, 2023 11:51
@boegel boegel added this to the release after 4.7.1 milestone Mar 18, 2023
@Micket
Contributor

Micket commented Mar 18, 2023

I don't think there is really anything new to do with regard to CUDA. We can just continue to patch in support for the internal header.

@shahzebsiddiqui
Contributor

Is this PR going to be merged soon? I would be interested in using this version of OpenMPI.

@boegel boegel changed the title from "{mpi}[GCC/12.2.0] OpenMPI v5.0.0rc10, PMIx v5.0.0rc1" to "{mpi}[GCC/13.2.0] OpenMPI v5.0.1, PMIx v5.0.1" Jan 22, 2024
@boegel boegel marked this pull request as ready for review January 22, 2024 07:46
@SebastianAchilles
Member

My remaining question is whether we want to add the CUDA-related patches first, or merge this PR as is and add them in a follow-up PR.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3446

Test results coming soon (I hope)...

- notification for comment with ID 1904058315 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/3d70d3547c216b3078c551c4d30c96b1 for a full test report.

@bartoldeman
Contributor

I can have a look this week to see how hard it is to port over the internal CUDA patches...

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3447

Test results coming soon (I hope)...

- notification for comment with ID 1904168663 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/46572ea7ce6477a7eeb12017f74d3963 for a full test report.

This patch has changed since libcuda is no longer dlopen()'ed by Open
MPI. Instead we can generate a stub library, and at runtime the
CUDA-dependent DSOs (but not the main libmpi.so library) load
libcuda.so. This is consistent with
https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html
(but --enable-mca-dso=<comma-delimited-list-of-cuda-components> is
already done by default).
@bedroge
Contributor

bedroge commented May 23, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 22.04, x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.10.12
See https://gist.github.com/bedroge/39c00f2060f7955ddff352f8e8a37954 for a full test report.

@bedroge
Contributor

bedroge commented May 23, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
interactive1 - Linux Rocky Linux 8.9, x86_64, AMD EPYC-Milan Processor (zen2), Python 3.6.8
See https://gist.github.com/bedroge/9c8395b2fdf5fcdc9c1518e21d6bd667 for a full test report.

@bartoldeman
Contributor

Perhaps we should set PSM3_DEVICES=self instead of self,shm, to also disable shm on Generoso? If nic is already disabled, there is no point in using PSM3 at all: there are other shm mechanisms that do not need to go via PSM3, or ofi for that matter.

#18925
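
As a rough illustration of the two settings being compared, here is a minimal Python sketch that builds an environment with PSM3_DEVICES set (or unset) for a child process. The helper name `env_with_psm3_devices` is hypothetical and is not part of EasyBuild or Open MPI; it only shows the mechanics of changing the variable.

```python
import os


def env_with_psm3_devices(devices, base_env=None):
    """Return a copy of the environment with PSM3_DEVICES set to `devices`.

    Hypothetical helper for illustration only. Passing devices=None
    removes PSM3_DEVICES from the copy instead of setting it.
    """
    env = dict(os.environ if base_env is None else base_env)
    if devices is None:
        env.pop('PSM3_DEVICES', None)
    else:
        env['PSM3_DEVICES'] = devices
    return env


# setting discussed as the current one: keep self and shm
current = env_with_psm3_devices('self,shm', base_env={})
# proposed setting: keep only self
proposed = env_with_psm3_devices('self', base_env={})
# variable not set at all
unset = env_with_psm3_devices(None, base_env={'PSM3_DEVICES': 'self,shm'})
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when launching a test command.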

@bartoldeman
Contributor

We should probably also pass
--with-show-load-errors=no
or
--with-show-load-errors=^accelerator/cuda,rcache/gpusm,rcache/rgpusm,btl/smcuda
to configure, to avoid the following warnings when libcuda.so doesn't exist:

[cns1:148604] mca_base_component_repository_open: unable to open mca_accelerator_cuda: libcuda.so: cannot open shared object file: No such file or directory (ignored)
[cns1:148604] mca_base_component_repository_open: unable to open mca_rcache_gpusm: libcuda.so: cannot open shared object file: No such file or directory (ignored)
[cns1:148604] mca_base_component_repository_open: unable to open mca_rcache_rgpusm: libcuda.so: cannot open shared object file: No such file or directory (ignored)
[cns1:148604] mca_base_component_repository_open: unable to open mca_btl_smcuda: libcuda.so: cannot open shared object file: No such file or directory (ignored)

See:
https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/installation.html#installation-options
https://docs.open-mpi.org/en/v5.0.x/mca.html#label-mca-common-parameters
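
For illustration, a minimal sketch of how either variant could be expressed in an easyconfig (easyconfig files use Python syntax). The initial `configopts` value is a placeholder and the variable names `cuda_components` and `configopts_selective` are assumptions made for this example, not the actual contents of the easyconfig in this PR.

```python
# Hypothetical easyconfig fragment, not the actual file from this PR.

# placeholder for whatever configure options are already set
configopts = '--enable-shared '

# variant 1: do not show any component load errors
configopts += '--with-show-load-errors=no '

# variant 2 (alternative): only silence the CUDA-dependent components
cuda_components = ['accelerator/cuda', 'rcache/gpusm', 'rcache/rgpusm', 'btl/smcuda']
configopts_selective = '--with-show-load-errors=^' + ','.join(cuda_components)

print(configopts)
print(configopts_selective)
```

The first variant is simpler, while the second keeps load errors visible for all components other than the four CUDA-related ones listed in the log above.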

@SebastianAchilles
Member

SebastianAchilles commented May 23, 2024

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cnx4 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/SebastianAchilles/4f752a3df0852c8ddf49c60482299303 for a full test report.

Note: tested with PSM3_DEVICES unset on generoso

Comment on lines 40 to 41
# to enable SLURM integration (site-specific)
# configopts += '--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr'
Contributor

This is obsolete now: Open MPI 5 only supports PMIx, and no longer PMI-1 or PMI-2.
The --with-slurm option now applies to PMIx instead, and it is selected by default on all OSes that Slurm supports, so there is no need to set it manually.

Member

Done in boegel#97
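
For sites that carried the commented-out line over from older easyconfigs, the following is a small sketch of how the PMI-1/PMI-2 options could be filtered out of a configure options string. The helper `drop_obsolete_pmi_opts` and the prefix list are hypothetical, written only for this example; they are not part of EasyBuild.

```python
import shlex

# options tied to PMI-1/PMI-2, taken from the commented-out line under review
OBSOLETE_PREFIXES = ('--with-pmi=', '--with-pmi-libdir=')


def drop_obsolete_pmi_opts(configopts):
    """Return configopts without the PMI-1/PMI-2 related options (hypothetical helper)."""
    kept = [opt for opt in shlex.split(configopts)
            if not opt.startswith(OBSOLETE_PREFIXES)]
    return ' '.join(kept)


old_opts = '--with-slurm --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr'
new_opts = drop_obsolete_pmi_opts(old_opts)
print(new_opts)
```

Per the review comment above, the remaining --with-slurm option would not need to be set manually either.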

SebastianAchilles and others added 3 commits May 28, 2024 19:00
…penMPI503

remove outdated comment about Slurm support and add --with-show-load-errors=no in OpenMPI-5.0.3-GCC-13.3.0.eb
@bedroge
Contributor

bedroge commented May 28, 2024

@boegelbot please test @ jsc-zen3

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4238

Test results coming soon (I hope)...

- notification for comment with ID 2136006743 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0207u28a.bear.cluster - Linux RHEL 8.6, x86_64, AMD EPYC 9554 64-Core Processor (zen4), Python 3.6.8
See https://gist.github.com/branfosj/107a058b7d8da9f501e38c3a77f21683 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0211u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/branfosj/884ea50dd9596d2e859ebdf3eef8b895 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/dc0290462c345dc2c2abeccc32cd8964 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/35bdb344134637cc855edfad42f181af for a full test report.

@bedroge
Contributor

bedroge commented May 28, 2024

@boegelbot please test @ generoso

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on login1

PR test command 'EB_PR=17561 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_17561 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13587

Test results coming soon (I hope)...

- notification for comment with ID 2136035919 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0207u20a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8480CL (sapphirerapids), Python 3.6.8
See https://gist.github.com/branfosj/94daae94012c90839d382719c80244c8 for a full test report.

@boegelbot
Collaborator

boegelbot commented May 28, 2024

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/b3737058e26d688e6ab1f08965066a09 for a full test report.

edit: ah, I see that's expected on generoso.

@bedroge
Contributor

bedroge commented May 28, 2024

Test report by @bedroge
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 22.04, x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.10.12
See https://gist.github.com/bedroge/ad99581790c2e2728d746716127b6c14 for a full test report.

@bedroge
Contributor

bedroge commented May 28, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
interactive2 - Linux Rocky Linux 8.9, x86_64, AMD EPYC-Milan Processor (zen2), Python 3.6.8
See https://gist.github.com/bedroge/1bacd9c3494f37223271400589b30946 for a full test report.

@bedroge
Contributor

bedroge commented May 28, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
starfive - Linux Debian GNU/Linux n/a, RISC-V-64, UNKNOWN, Python 3.10.9
See https://gist.github.com/bedroge/0ed05cd54c6a438ec57ad003f6f662f6 for a full test report.

Contributor

@bartoldeman bartoldeman left a comment

LGTM

@bartoldeman bartoldeman merged commit c7b7d1b into easybuilders:develop May 29, 2024
@bedroge
Contributor

bedroge commented May 29, 2024

Test report by @bedroge
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 22.04, x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.10.12
See https://gist.github.com/bedroge/0fbeb0c510b8901deb97ed9fa635c1b6 for a full test report.

Labels

2024a issues & PRs related to 2024a common toolchains update

9 participants