update scrip_remap_conservative to use MPI #1268
|
@ErickRogers pinging you as you were suggested as a possible reviewer. |
ErickRogers
left a comment
This looks safe to me, since it's put behind a switch. I'm not proficient with MPI programming, so I can't comment on that, but I expect that it is good, since John says it works. Let me know if you want an MPI-fluent reviewer.
NB: We have some rudimentary MPI in WMGHGH. However, it splits things up like "one process per remapping", e.g., if you have 2 grid-pairs, it only splits into 2 processes. So this "within grid" parallelism should be significantly faster.
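To make the comparison above concrete, here is a rough, idealized sketch of how many processes can do useful work under each scheme. The function names and the numbers (2 grid-pairs, 64 processes, 100000 cells) are hypothetical, and the model ignores communication cost and load imbalance, so it only bounds the possible gain.

```python
import math


def busy_procs_per_remapping(n_remappings, n_procs):
    """'One process per remapping': at most one process works on each grid-pair."""
    return min(n_remappings, n_procs)


def cells_per_proc_within_grid(n_cells, n_procs):
    """'Within grid' parallelism: each process handles a share of one grid's cells."""
    return math.ceil(n_cells / n_procs)


# Hypothetical setup: 2 grid-pairs, 64 MPI processes, 100000 cells in one grid.
n_remappings, n_procs, n_cells = 2, 64, 100000

busy = busy_procs_per_remapping(n_remappings, n_procs)
share = cells_per_proc_within_grid(n_cells, n_procs)

print(f"per-remapping split: {busy} of {n_procs} processes busy")
print(f"within-grid split:   about {share} cells per process instead of {n_cells}")
```

With only 2 grid-pairs, the per-remapping split cannot use more than 2 processes however many are available, which is why the within-grid split has more room to scale.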
|
Thanks for submitting @jcwarner-usgs. I'll start reviewing. |
Thanks for reviewing @ErickRogers. |
|
I think the MPI could be re-written to be more WW3 centric. And my switches in there are not active. |
Hi @jcwarner-usgs, thanks for this follow up. I had two thoughts on the switches you mentioned. Currently they are Regarding re-writing to be more WW3 centric. Being as WW3 centric as possible is obviously best for the code base, though I can say in my initial review nothing jumped out as being problematic in that regard. It seemed clean and well commented which is great. We do have a Code Style Standards page though that may help out here. I would currently just make the suggestion of renaming some of these to something more descriptive if possible. |
|
@jcwarner-usgs I'm not sure I understand what you mean re: WW3 centric. What was your motivation for adding this feature? For you to use in your WW3 applications? Or some other applications? |
|
oh. I just meant like I used: but in other parts of the WW3 the mpi calls look like: should i go thru and make the variables all capitals? |
|
@jcwarner-usgs OK, I see. Matt and Jessica may disagree, but from my point of view, it is fine to use the conventions of the original file, rather than the conventions of WW3. I think of the stuff in the /SCRIP/ subdirectory as not strictly part of the WW3 code proper, but rather as a semi-external library that is adapted (with as light a touch as possible) to work with WW3, to be compiled into WW3. We maintain it in the same version control with WW3 because maintaining it separately would just result in extra work. (2 cents) |
|
All that aside, renaming the My* variables wouldn't hurt, for clarity purposes. |
|
I agree that keeping styles matching to what is in the routine is okay. It's more of the using your initials or "my" variable names versus just making them generic/descriptive to what they are. That's really great to hear from @aliabdolali that it sped things up, we'll have to see if that can help us in some set-ups here. I don't see any regression tests that are using these updates. @jcwarner-usgs Can you tell me a little more about your tests that you've run on your end so we can help advise how to best put in a test for the regression tests so we ensure this feature continues to work? Also, did you obtain identical results with and without this update? |
@jcwarner-usgs - I realized I wasn't clear that this question was to you so I updated my comment and am pinging you on this. |
MatthewMasarik-NOAA
left a comment
@jcwarner-usgs I wanted to touch base following the discussion on code style. I only have one request here, that is to rename the switch W3_SCRIP_JCW to something without your initials (maybe W3_SCRIP_MPI or other).
Also, I wanted to mention that a couple Github CI jobs failed. I'm re-running these manually now to see if they will resolve themselves.
|
Thanks for checking back.
|
SCRIPMPI is fine with me. |
|
Hi @jcwarner-usgs, just a note to say we are taking a short pause on our PR processing due to some temporary changes for our group. I just posted our group on the Commit Queue: Temporary PR Policy. Look forward to picking back up after our short break. |
|
Getting back onto this pull request now.
Can you try this merge again? |
|
Hi @jcwarner-usgs, thank you for addressing the previous comments with these updates, and including the description of the changes. Much appreciated. A few weeks ago we posted a temporary PR Policy. It is in response to temporary resource reductions, where we don't have the manpower to review PRs, and so are limiting what we take on to just what is needed for the next GFS/GEFS releases being worked on. This is just till mid-November. I really apologize because I know you submitted this prior to that. We will pick this back up from our end just as soon as our temporary status ends sometime in November. We'll look forward to completing the review then. Thanks for understanding. |
|
All good! no worries. |
Awesome. thanks so much for understanding. having some more time to ensure it's working is a side benefit! |
|
@jcwarner-usgs just to let you know we are back from our temporary PR policy, and I've started working on the review for this. I'll be back in touch again next week. |
|
Ok. So i made as much progress as i can. This includes two updates in scrip_remap_conservative.F in my branch:
These 2 fixes correct the time out errors in: However there is still an error for this one: thanks |
|
hi @jcwarner-usgs, thanks for the work you've been doing on this. yes, I can try testing that on our end. I have a prioritized issue I'm working on right now, but give me a few days and I'll report back. |
|
@jcwarner-usgs I have been able to reproduce the same results you reported: that is matrix04(timeout) and matrix14(seg fault) are successful, but the matrix03(timeout) still hangs. This was for a Release build where the |
|
@jcwarner-usgs I finished testing your updates last week for the following: 1) Debug runs w/o SCRIPMPI to confirm no errors. 2) Runs w/ SCRIPMPI to confirm functionality after the updates. 3) Run of the matrix03 regtest with current develop to confirm the hanging. From the last test it's clear that the hanging was not introduced in this PR, and has likely been slipping past our testing checks because it fails silently at some point, then moves on to the next section. I plan to post an issue for separate follow up and will just note that this is already present in the PR review. The only thing I have remaining is reporting the speedup. I was able to do a subset of regtest runs for timing the PR branch against develop. I need to look into this a little more, but this is the last step and I believe it can be wrapped up this week. |
|
Sweet! |
run_cmake_test: sed SCRIPMPI switch
|
Hi @jcwarner-usgs I have a status update. This follows up on the grid conversation we had on the recent sub-pr. I haven’t been able to find a heavier multi-grid setup than ww3_ufs1.3 to use for this performance testing, and I’m not seeing the speedup with that test that you're seeing in your multi-grid testcase. |
|
Because SCRIP is only used at initialization, a test for the whole model duration is not necessary. But it is hard to have the regtest only time the init part. I was able to run the ww3_ufs1.3 test case here, but only looked at the creation of the rmp_src_dest_conserv* files. All these tests had SCRIP and SCRIPNC. For our realistic case, we had 4 grids, unstructured and nested, with these numbers of grid points: 301056, 426496, 407624, and 420172. So these are larger, there are more grids, and since they are UNST (or even CURV) it might take SCRIP longer to find gridded areas for each point. Once you have created the rmp* files, the code will use them next time (this is from SCRIPNC) and so it is no longer an issue, but that first instance is very time consuming. I am happy to provide more info as needed. |
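The "compute once, reuse next time" behaviour described above can be sketched roughly as follows. This is only an illustration of the caching pattern, not the SCRIPNC implementation: the file name, the plain-text format, and the `load_or_compute_weights` helper are all hypothetical.

```python
import os
import tempfile


def load_or_compute_weights(path, compute):
    """Return (weights, was_cached): read `path` if it exists, else compute and save."""
    if os.path.exists(path):
        with open(path) as f:
            return [float(line) for line in f], True
    weights = compute()
    with open(path, "w") as f:
        for w in weights:
            # repr() keeps enough digits for the float to round-trip exactly.
            f.write(repr(w) + "\n")
    return weights, False


calls = []  # records how often the expensive step actually runs


def expensive_compute():
    """Hypothetical stand-in for the slow first-time remap-weight computation."""
    calls.append(1)
    return [0.25, 0.5, 0.25]


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical file name; plain text is an assumption made for this sketch.
    path = os.path.join(workdir, "rmp_example_weights.txt")
    first, first_cached = load_or_compute_weights(path, expensive_compute)
    second, second_cached = load_or_compute_weights(path, expensive_compute)

print(first_cached, second_cached, len(calls))
```

Only the first call pays the computation cost, which is why the timing that matters is that first creation of the remap files.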
|
@jcwarner-usgs that's extremely helpful! Thanks for providing the timing results and description. It helps me understand better, and I agree it's hard to have a regtest capture just the scrip init portion. I plan to have a regtest to add by Thursday, then it should be ready for final review on our end. |
|
@jcwarner-usgs I have a branch ready to go that adds a regtest to exercise SCRIPMPI. The testing is nearly finished. I've been trying to open a pull request to your PR branch, but I haven't been able to find your fork in the drop-down menu to compare against. It's kind of strange because it's worked before. The only thing I can think is you may need to add me as a collaborator. You can always add me temporarily. Could you see if you can add me as a collaborator? Here's the branch I'll make the PR from: https://github.com/MatthewMasarik-NOAA/WW3/tree/regtest/scrip-mpi. |
|
I am not quite sure i did that correctly, but i accepted you to be a collaborator?? can u check? |
|
hi @jcwarner-usgs, sure thing. I just tried to find your fork to submit the PR but was still unable to. That's OK, we can resolve it. I suspect it might be because I sent a Collaborator request from my side while looking into this, but I afterwards remembered the request needs to come from your end. I can tell you the steps to send the request. Start by clicking your user icon in the upper right corner. This will drop down a list, then select the gear icon 'Settings'. This pulls up a new page, and in the left-hand side navigation bar there is a section 'Code, planning, and automation'. Click the first item under that, 'Repositories'. For WW3 click on the 'Collaborators' link, then use the search box to 'Find a collaborator'. Enter my username there, then it will allow you to click a button to send me an invite. After I get the request I should be able to find your WW3 fork when trying to make a PR. Let me know how that goes. |
|
Excellent, that did the trick! I'm going to fill the PR header and will post it momentarily. |
MatthewMasarik-NOAA
left a comment
Code review
Pass
Testing
Pass
Regtest matrix output reproduced here (from jcwarner-usgs/pull/3) for completeness. The only changes were the known non-b4b differences.
develop vs. pr
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (16 files differ)
mww3_test_03/./work_PR1_MPI_d2 (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (13 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (16 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (9 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (7 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (3 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
dev.matrixCompSummary.txt
dev.matrixCompFull.txt
dev.matrixDiff.txt
pr vs. pr
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (16 files differ)
mww3_test_03/./work_PR1_MPI_d2 (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (13 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (14 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2 (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (15 files differ)
mww3_test_09/./work_MPI_ASCII (0 files differ)
ww3_tp2.10/./work_MPI_OMPH (6 files differ)
ww3_tp2.16/./work_MPI_OMPH (4 files differ)
ww3_tp2.6/./work_ST4_ASCII (0 files differ)
ww3_ufs1.3/./work_a (2 files differ)
**********************************************************************
************************ identical cases *****************************
**********************************************************************
pr.matrixCompSummary.txt
pr.matrixCompFull.txt
pr.matrixDiff.txt
Performance
Users can expect the speedup to grow as the number of grids and the number of points per grid increase. See 1268#issuecomment for reference.
Approved.
|
@jcwarner-usgs thank you for making this performance enhancement for SCRIP, and your input throughout the process! |
|
[like] Warner, John C reacted to the message sent Wednesday, March 26, 2025 8:37:28 PM.
|
Pull Request Summary
When using nesting, the SCRIP routines are called to compute remapping weights. But each processor computes all the weights. Here we updated SCRIP/scrip_remap_conservative.F to use all the MPI processors to compute separate portions of the remap weights, and then distribute a full set of weights to all the nodes.
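The idea in the summary can be sketched as below. This is a minimal pure-Python simulation, not the code in SCRIP/scrip_remap_conservative.F: the ranks are simulated with a loop instead of real MPI calls, and `block_range`, `toy_weight`, and `gather_all` are hypothetical names introduced only for illustration.

```python
def block_range(n_cells, n_ranks, rank):
    """Return the half-open [start, stop) range of cells owned by `rank`."""
    base, extra = divmod(n_cells, n_ranks)
    # The first `extra` ranks each take one additional cell.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def toy_weight(cell):
    """Hypothetical stand-in for the per-cell remap-weight computation."""
    return 1.0 / (cell + 1)


def compute_partial(n_cells, n_ranks, rank):
    """Each rank computes weights only for its own block of cells."""
    start, stop = block_range(n_cells, n_ranks, rank)
    return [toy_weight(c) for c in range(start, stop)]


def gather_all(partials):
    """Stand-in for an all-gather: join the blocks in rank order so that
    every rank ends up holding the full set of weights."""
    full = []
    for part in partials:
        full.extend(part)
    return full


n_cells, n_ranks = 10, 4
partials = [compute_partial(n_cells, n_ranks, r) for r in range(n_ranks)]
weights = gather_all(partials)

# Reference: what a single process computing everything would produce.
serial = [toy_weight(c) for c in range(n_cells)]
print(weights == serial)
```

Because the blocks are disjoint and joined in rank order, the combined result in this sketch matches what one process would compute alone.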
Description
The results should be identical with or without this update.
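One rough way to check the "identical results" expectation is a byte-for-byte comparison of the output files from a run with the update and a run without it. The sketch below assumes both runs write same-named files into two flat directories; the directory names, file names, and the `count_differing_files` helper are hypothetical, and this is not the WW3 regtest comparison tooling.

```python
import filecmp
import os
import tempfile


def count_differing_files(dir_a, dir_b):
    """Compare files present in both directories byte-for-byte.

    Returns (number_of_differences, names), where files that could not be
    compared are counted as differences as well.
    """
    names = sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))
    # shallow=False forces a content comparison rather than a stat() check.
    _match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, names, shallow=False)
    differing = mismatch + errors
    return len(differing), differing


with tempfile.TemporaryDirectory() as root:
    dir_a = os.path.join(root, "without_update")  # hypothetical run directories
    dir_b = os.path.join(root, "with_update")
    os.mkdir(dir_a)
    os.mkdir(dir_b)
    samples = [
        ("same.txt", "1.0\n", "1.0\n"),
        ("diff.txt", "1.0\n", "1.0000001\n"),
    ]
    for name, text_a, text_b in samples:
        with open(os.path.join(dir_a, name), "w") as f:
            f.write(text_a)
        with open(os.path.join(dir_b, name), "w") as f:
            f.write(text_b)
    n_diff, differing = count_differing_files(dir_a, dir_b)

print(n_diff, differing)
```

A byte-for-byte check is strict: any change in floating-point rounding or in file metadata written into the output counts as a difference, so a nonzero count calls for a closer look rather than proving the results are wrong.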
Reviewer could be: Erick Rogers, NRL
Issue(s) addressed
Related to Discussion #1252
Commit Message
Update scrip_remap_conservative to be more efficient on first use.
Check list
Testing