Parallelize surface increment regridding #1109

Merged: BrianCurtis-NOAA merged 8 commits into ufs-community:develop from DavidNew-NOAA:feature/multi_regrid on Nov 4, 2025

Conversation

@DavidNew-NOAA
Contributor

DavidNew-NOAA commented Oct 14, 2025

DESCRIPTION OF CHANGES:

This PR, a companion to NOAA-EMC/global-workflow#4193, parallelizes the surface increment regridding code over multiple ensemble members. The namelist now accepts the number of ensemble members and multiple input directories, one per ensemble member.

TESTS CONDUCTED:

If there are changes to the build or source code, the tests below must be conducted. Contact a repository manager if you need assistance.

  • Compile branch on all Tier 1 machines using Intel
    • Orion
    • Jet
    • Ursa
    • Hercules
    • WCOSS2 (908bbaa)
  • Compile branch on Ursa using GNU.
  • Compile branch in 'Debug' mode on WCOSS2.
  • Compile with Doxygen on any machine with no errors.
  • Run unit tests locally on any Tier 1 machine.
  • Run relevant consistency tests locally on all Tier 1 machines.
    • Orion
    • Jet
    • Ursa
    • Hercules
    • WCOSS2 (908bbaa)

Optional test.

  • Run full set of chgres_cube consistency tests on Ursa.

Describe any additional tests performed.

DEPENDENCIES:

  • None

DOCUMENTATION:

All new and updated source code must be documented with Doxygen.

  • Doxygen is updated.

@BrianCurtis-NOAA
Collaborator

@DavidNew-NOAA Could you also make the corresponding changes in https://github.com/DavidNew-NOAA/UFS_UTILS/blob/feature/multi_regrid/tests/regrid_sfc/data/regrid.nml to match?

Just posting FYI. Here is the unit test failure:

57/57 Testing: regrid_sfc-ftst_read_namelist
57/57 Test: regrid_sfc-ftst_read_namelist
Command: "/scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/build/tests/regrid_sfc/ftst_read_namelist"
Directory: /scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/build/tests/regrid_sfc
"regrid_sfc-ftst_read_namelist" start time: Oct 30 15:23 UTC
Output:
----------------------------------------------------------
 Starting test of readin_setup.
 Read input namelist.
 - FATAL ERROR: unknown namel in readin_setup
 - IOSTAT IS:            1
Attempting to use an MPI routine (PMPI_Abort) before initializing or after finalizing MPICH

@BrianCurtis-NOAA
Collaborator

The following is from running the driver script for the regrid_sfc reg-tests:

+ srun /scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/reg_tests/regrid_sfc/../../exec/regridStates.x
 ** pets: local, total:            1           6
 ** pets: local, total:            2           6
 ** pets: local, total:            5           6
 ** pets: local, total:            0           6
 ** pets: local, total:            3           6
 ** pets: local, total:            4           6
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
 - FATAL ERROR:
Abort(999) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 1
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 3
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 4
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 0
 - FATAL ERROR:

@DavidNew-NOAA
Contributor Author

Thanks @BrianCurtis-NOAA. I will work on getting that unit test fixed.

@DavidNew-NOAA
Contributor Author

Hi @BrianCurtis-NOAA, I think I made the necessary change to the unit test namelist.

@BrianCurtis-NOAA
Collaborator

> Hi @BrianCurtis-NOAA I think I made the necessary change to the unit test namelist

Still not working. Neither are my regression tests.

+ srun /scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/reg_tests/regrid_sfc/../../exec/regridStates.x
 ** pets: local, total:            1           6
 ** pets: local, total:            2           6
 ** pets: local, total:            5           6
 ** pets: local, total:            0           6
 ** pets: local, total:            3           6
 ** pets: local, total:            4           6
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
 - FATAL ERROR:
Abort(999) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 1
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 3
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 4
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 0
 - FATAL ERROR:
 number of processor divided by number of tiles must equal number of ensemble me
 mbers
 - IOSTAT IS:            1
Abort(999) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 5
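
The constraint named in the error message can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the check in regridStates.x itself: the function name `check_decomposition` and its arguments are hypothetical, and the tile count of 6 used in the usage note is an assumption (the log reports 6 PETs in total but does not state the number of tiles).

```python
def check_decomposition(npets: int, ntiles: int, nmem_ens: int) -> int:
    """Return the member count implied by the processor layout.

    Hypothetical helper mirroring the rule quoted in the log:
    number of processors / number of tiles must equal the number
    of ensemble members.
    """
    if ntiles <= 0 or nmem_ens <= 0:
        raise ValueError("ntiles and nmem_ens must be positive")
    # The division must be exact and must match the requested member count.
    if npets % ntiles != 0 or npets // ntiles != nmem_ens:
        raise ValueError(
            "number of processors divided by number of tiles "
            "must equal number of ensemble members"
        )
    return npets // ntiles
```

Under the 6-tile assumption, a 6-PET run such as the one above is only consistent with one ensemble member, and two members would need 12 PETs.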

@DavidNew-NOAA
Contributor Author

Thanks @BrianCurtis-NOAA. Are these regression tests easy to run? Can you provide instructions to get me started? I really need to play around with this myself to get a handle on what's going wrong.

@BrianCurtis-NOAA
Collaborator

> Thanks @BrianCurtis-NOAA . Are these regression tests easy to run? Can you provide instructions to get me started? I really need to play around with this myself to get a handle on what's going wrong.

cd reg_tests/regrid_sfc
Edit the driver.sh script for the system you're on and your directory setup (mainly WORK_DIR and PROJECT_CODE; leave HOMEreg alone).
Make sure you have run link_fixdirs.sh in the fix dir and built using build_all.sh.
Run driver.sh.
Look for consistency.log01.

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA Awesome, thanks. I'll take a look this afternoon

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA I'm running into the following error when running driver.sh. Is there an obvious solution to this?

/scratch3/NCEPDEV/da/David.New/global-workflow-parallel/sorc/ufs_utils.fd/reg_tests/regrid_sfc/../../exec/regridStates.x: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /scratch3/NCEPDEV/da/David.New/global-workflow-parallel/sorc/ufs_utils.fd/reg_tests/regrid_sfc/../../exec/regridStates.x)

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Nov 3, 2025

> @BrianCurtis-NOAA I'm running into the following error upon running driver.sh. Is there an obvious solution to this?:
>
> /scratch3/NCEPDEV/da/David.New/global-workflow-parallel/sorc/ufs_utils.fd/reg_tests/regrid_sfc/../../exec/regridStates.x: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /scratch3/NCEPDEV/da/David.New/global-workflow-parallel/sorc/ufs_utils.fd/reg_tests/regrid_sfc/../../exec/regridStates.x)

It's got to do with the build/linking for that executable. There could be many ways it's broken.

I would try building again with intelllvm using the build_all.sh script, maybe try removing the build dir for a fresh build from src.

If it still persists, try checking what you have loaded in lua on a fresh terminal session to see if it's getting in the way of what's shown in modulefiles/build.<machine>.<compiler>.lua
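
The loader message says the executable needs the symbol version GLIBCXX_3.4.32, which the libstdc++ found at run time does not provide. One way to see which versions a given library file advertises is to scan it for its version tags, much like `strings <lib> | grep GLIBCXX`. The sketch below does that in Python; the helper names `glibcxx_versions` and `provides` are hypothetical, and it assumes the tags appear as plain ASCII strings in the library file.

```python
import re
from pathlib import Path

# Matches tags such as GLIBCXX_3.4 or GLIBCXX_3.4.32 in raw bytes.
_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX version tuples found in a library file."""
    data = Path(lib_path).read_bytes()
    tags = {m.group(1).decode("ascii") for m in _TAG.finditer(data)}
    return sorted(tuple(int(part) for part in tag.split(".")) for tag in tags)


def provides(lib_path, wanted):
    """True if the library file advertises the wanted version, e.g. '3.4.32'."""
    target = tuple(int(part) for part in wanted.split("."))
    return target in glibcxx_versions(lib_path)
```

Calling `provides("/usr/lib64/libstdc++.so.6", "3.4.32")` on the machine where the run fails and getting `False` would be consistent with the error above, and would point to an executable built against a newer libstdc++ than the one picked up at run time.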

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA Dumb mistake on my part. I built Global Workflow on Hera and was trying to run reg tests on Ursa because driver.sh didn't have settings for Hera. I rebuilt on Ursa and it works now

@BrianCurtis-NOAA
Collaborator

> @BrianCurtis-NOAA Dumb mistake on my part. I built Global Workflow on Hera and was trying to run reg tests on Ursa because driver.sh didn't have settings for Hera. I rebuilt on Ursa and it works now

Been there, done that. 👍

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA That last commit did the trick

@BrianCurtis-NOAA
Collaborator

> @BrianCurtis-NOAA That last commit did the trick

Reg tests ran successfully! One last thing: can you edit tests/regrid_sfc/ftst_read_namelist.F90 (for both grid_setup_in and grid_setup_out) to match your changes to the nml?

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA It looks like tests/regrid_sfc/ftst_read_namelist.F90 is only checking values for the &input and &output blocks for the namelist, and the only change I made to the namelist structure is in the &config block (nmem_ens). The other changes to parm/regrid_sfc/regrid.nml_tmpl are to the templating and not the namelist structure itself

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Nov 3, 2025

> @BrianCurtis-NOAA It looks like tests/regrid_sfc/ftst_read_namelist.F90 is only checking values for the &input and &output blocks for the namelist, and the only change I made to the namelist structure is in the &config block (nmem_ens). The other changes to parm/regrid_sfc/regrid.nml_tmpl are to the templating and not the namelist structure itself

OK, I see. Maybe the problem is in readin_setup.F90 in sorc/regrid_sfc.fd; the unit test errors out with "unknown namel in readin_setup" (line 69). Something in there possibly got messed up by the nml changes.

@BrianCurtis-NOAA
Collaborator

> @BrianCurtis-NOAA It looks like tests/regrid_sfc/ftst_read_namelist.F90 is only checking values for the &input and &output blocks for the namelist, and the only change I made to the namelist structure is in the &config block (nmem_ens). The other changes to parm/regrid_sfc/regrid.nml_tmpl are to the templating and not the namelist structure itself
>
> OK, I see. Maybe it's with the readin_setup.F90 in sorc/regrid_sfc.fd, the unit test errors out with "unknown namel in readin_setup" (line 69) . There's something in there that got messed up with the nml changes, possibly.

I think I see it. In the ftest, the call to readin_setup has added variables that are not present in the test.

@DavidNew-NOAA
Contributor Author

Ok, will take a look shortly

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA Is the unit test different from the regression test? How do you run the unit test?

@BrianCurtis-NOAA
Collaborator

> @BrianCurtis-NOAA Is the unit test different from the regression test? How do you run the unit test?

It is, yes.

(edit CMakeLists.txt and uncomment ctest (near the end))
rm -rf build
export BUILD_TESTING=ON && ./build_all.sh

This should run all the unit tests. They run quickly, and all but the last one should pass.

@DavidNew-NOAA
Contributor Author

@BrianCurtis-NOAA CTest fixed

@CatherineThomas-NOAA
Collaborator

Thanks for your help @BrianCurtis-NOAA. Do you have any other concerns with this PR?

FYI we are going to include this PR in the GFSv17 retrospectives, which will be starting as soon as possible.

@BrianCurtis-NOAA
Collaborator

> Thanks for your help @BrianCurtis-NOAA. Do you have any other concerns with this PR?
>
> FYI we are going to include this PR in the GFSv17 retrospectives, which will be starting as soon as possible.

No other concerns. I'll get to merge testing ASAP (in between meetings today, probably).

@DavidNew-NOAA
Contributor Author

Thanks @BrianCurtis-NOAA

@BrianCurtis-NOAA
Collaborator

Generating docs for file /scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/sorc/regrid_sfc.fd/grids_IO.F90...
/scratch4/NCEPDEV/nems/Brian.Curtis/git/DavidNew-NOAA/UFS_UTILS/sorc/regrid_sfc.fd/grids_IO.F90:366: error: The following parameter of grids_io::create_grid_fv3(integer, intent(in) res_atm, character(*), intent(in) dir_fix, integer, intent(in) npets, integer, intent(in) localpet, integer, intent(in) imem_ens, type(esmf_grid), intent(out) fv3_grid) is not documented:
  parameter 'imem_ens' (warning treated as error, aborting now)
Exiting...

Comment thread sorc/regrid_sfc.fd/grids_IO.F90
Collaborator

BrianCurtis-NOAA left a comment

Found all the rest of the places where you'll need to document the new inputs to functions.

Comment thread sorc/regrid_sfc.fd/grids_IO.F90
Comment thread sorc/regrid_sfc.fd/grids_IO.F90
Comment thread sorc/regrid_sfc.fd/grids_IO.F90
Comment thread sorc/regrid_sfc.fd/readin_setup.F90
@BrianCurtis-NOAA
Collaborator

Still missing an edit at line 254.

@BrianCurtis-NOAA
Collaborator

and sorc/regrid_sfc.fd/readin_setup.F90 line 12

BrianCurtis-NOAA merged commit 154a367 into ufs-community:develop Nov 4, 2025
1 of 2 checks passed
@CatherineThomas-NOAA
Collaborator

Thanks @BrianCurtis-NOAA and @DavidNew-NOAA!

@DavidNew-NOAA
Contributor Author

Thanks @BrianCurtis-NOAA ! It's been a useful education on how to do PRs for ufs_utils (reg tests, unit tests, etc)
