Skip to content

Upgrade mpi4py to >4 #184

Merged
FlorianDeconinck merged 4 commits into
NOAA-GFDL:developfrom
FlorianDeconinck:update/mpi4py_4x
Jul 22, 2025
Merged

Upgrade mpi4py to >4 #184
FlorianDeconinck merged 4 commits into
NOAA-GFDL:developfrom
FlorianDeconinck:update/mpi4py_4x

Conversation

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator

The current work to run at scale on Grace Hopper unified silicon ARM64 hardware as showcased the need to move mpi4py above 4x.

The original restriction to 3.1.5 was done out of an abundance of caution in a moment where halo exchange were questioned.

Alas, the new allocator that ships with cuda 12.5 for GH boxes especially trips the old mpi4py. We need to move up.

The risk is limited as 4.1 showcase a very large amount of fixes to main version 4.

fmalatino
fmalatino previously approved these changes Jul 21, 2025
Copy link
Copy Markdown
Collaborator

@romanc romanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. NDSL now logs a warning

   <frozen importlib._bootstrap>:241: RuntimeWarning: suspicious MPI execution environment
  Your environment has OMPI_COMM_WORLD_SIZE=6 set, but mpi4py was built with MPICH.
  You may be using `mpiexec` or `mpirun` from a different MPI implementation.

NDSL is not running mpi test anymore (all of them are skipped). Might be related to issue #178.

pace tests complain about the wrong number of ranks. Something seems off ... Might be in the CI configuration, but suspicious nevertheless

Also, pyFV3 is not running "Orchestrated dace-cpu Accoustics" anymore (and thus really, really fast)

@romanc
Copy link
Copy Markdown
Collaborator

romanc commented Jul 21, 2025

It sounds like mpiexec is bound to an OpenMPI implementation of MPI while mpi4py is bound to an MPICH implementation of MPI. We'd need to dig around in the image to know for sure, but this tracks with the error message and if mpi4py is bound to the MPICH version and -np6 doesn't applies to the OpenMPI version, then - if we do import MPI from mpi4py, we get a rank size of 0, which would explain why all parallel tests are skipped (due to how we skip tests, as explained in #178).

mpi4py now ships wheel packages for mpi, as explained on pypi. We could thus do something like

pip install mpich

somewhere in the pipeline and start from a default ubuntu runner. As mentioned on pypi, these wheels are designed for ease of use rather than maximal speed, but we are talking about GitHub Actions here, not performance runs on the cluster. Imo worth a shot as a potential simplification step.

@fmalatino
Copy link
Copy Markdown
Contributor

It sounds like mpiexec is bound to an OpenMPI implementation of MPI while mpi4py is bound to an MPICH implementation of MPI. We'd need to dig around in the image to know for sure, but this tracks with the error message and if mpi4py is bound to the MPICH version and -np6 doesn't applies to the OpenMPI version, then - if we do import MPI from mpi4py, we get a rank size of 0, which would explain why all parallel tests are skipped (due to how we skip tests, as explained in #178).

mpi4py now ships wheel packages for mpi, as explained on pypi. We could thus do something like

pip install mpich

somewhere in the pipeline and start from a default ubuntu runner. As mentioned on pypi, these wheels are designed for ease of use rather than maximal speed, but we are talking about GitHub Actions here, not performance runs on the cluster. Imo worth a shot as a potential simplification step.

@FlorianDeconinck and @romanc , I too think we should try to move away from the container for now and just use an Ubuntu runner. If we agree this is a solution I will make a quick PR in pace to change the workflow environment.

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator Author

Go for it

@romanc
Copy link
Copy Markdown
Collaborator

romanc commented Jul 21, 2025

A quick test with the mpi tests in NDSL (see here) shows that this seems to be valuable approach. I choose to install the MPICH flavor (as was in the image before). Interestingly, there's apparently no oversubscribe flag for that flavor of mpiexec (or that changed between 3 and 4, but I didn't see anything in the changelog), so that's weird ...

@FlorianDeconinck FlorianDeconinck merged commit 8bf072b into NOAA-GFDL:develop Jul 22, 2025
6 of 8 checks passed
@FlorianDeconinck FlorianDeconinck deleted the update/mpi4py_4x branch July 22, 2025 16:05
romanc added a commit to romanc/PyFV3 that referenced this pull request Jul 22, 2025
romanc added a commit to NOAA-GFDL/pyFV3 that referenced this pull request Jul 22, 2025
* ci: remove container, install mpi from python wheel

This is a follow-up from NOAA-GFDL/NDSL#184
similar to NOAA-GFDL/pace#131.

* Delete weird/unrecongnized mpiexec argument
jjuyeonkim pushed a commit to jjuyeonkim/NDSL that referenced this pull request Sep 8, 2025
* Upgrade `mpi4py` to >4 to allow new unified allocator on GH box to not die

* Restrict further to bug fixed 4.1+

* A tad too strict

* Use bare metal Ubuntu for NDSL unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants