Features/jet partitions#253
This had to be updated to run on Jet in the baseline config, too. The change that was required to get the default test to run from develop was this:
+ LAYOUT_X="20"
+ LAYOUT_Y="11"
+ BLOCKSIZE="10"
Needed for new processor decomposition.
JeffBeck-NOAA
left a comment
Changes look good to me.
mkavulich
left a comment
These changes look good; @christinaholtNOAA ping me when you have resolved the merge conflicts and I will do a quick re-test and merge (presuming no problems).
Side note: Ran a 6-hour forecast of the GSD_HRRR25km domain on Cheyenne; the model forecast step showed a very significant runtime improvement from 880 seconds to 165 seconds...it's even faster than the make_ics task now!
Very nice result! To be clear, I didn't change any submit directives at all for Cheyenne. The only thing that should be affecting that speedup is the layout of the grid! I should get to the conflicts and re-tests tomorrow.
Need some feedback on #245 before I can resolve conflicts. @JeffBeck-NOAA @gsketefian
@christinaholtNOAA This looks good to me. An indirectly related question: it checks whether rem is 0. I think Dom mentioned during yesterday's deliverables meeting that the code is faster if this test is satisfied (but it won't crash if it's not, like it used to). Is that your understanding as well?
@gsketefian I cannot speak to the stability of those types of changes. I think it could help to leave it and provide a bit more guidance on how to choose these numbers for efficiency. Is there some need to remove the check?
Well, we originally put in that check because if BLOCKSIZE wasn't set in that specific way, the forecast would stop but the executable would keep going and waste all your core-hours. Then Dom fixed things so that the code can run without BLOCKSIZE satisfying that condition, so the test was no longer relevant. But I think he mentioned on Wednesday that the code runs a lot faster if the condition on BLOCKSIZE is satisfied. I won't change anything for now, but somewhere in the docs we should mention the condition you mentioned in this PR [i.e.
@mkavulich I have resolved my conflicts. What testing should be done now?
@christinaholtNOAA I'm running a test on Cheyenne now; I'll merge once that test is complete.
mkavulich
left a comment
The suite of Hera tests mostly completed (including 3 and 4, which previously failed!), and the Cheyenne test was also successful.
This is wrong: The inequality for the x direction is backwards. I don't know where the inequality for y comes from, nor whether it is true.
Yep! Nice catch on the x-direction. Thanks, @SamuelTrahanNOAA. I believe I should have left it as:
For the y-direction, I was referencing our emails (thread: how not to blow up GSL physics):
blocksize <= grid_size_x/layout_x/2
You get warnings if grid_size_x/layout_x/2 is not divisible by the blocksize. I've seen FV3 crash in those cases too. Small blocksizes tend to be extremely slow and very small ones will sometimes crash.
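For reference, a minimal Python sketch of the check described in the quoted email. This is not code from the workflow: the function name `check_blocksize` and the grid size of 400 points used in the usage line are hypothetical, and integer division is assumed (the grid splits evenly across `layout_x`).

```python
def check_blocksize(grid_size_x, layout_x, blocksize):
    """Return a list of issues found; an empty list means nothing was flagged.

    Hypothetical helper that encodes only the x-direction guidance quoted
    in the review: blocksize <= grid_size_x/layout_x/2, with a warning
    when that quantity is not divisible by blocksize.
    """
    issues = []
    # Assumption: integer division, i.e. the grid divides evenly by layout_x.
    half_subdomain = grid_size_x // layout_x // 2
    if blocksize > half_subdomain:
        issues.append(
            f"blocksize {blocksize} exceeds grid_size_x/layout_x/2 = {half_subdomain}"
        )
    elif half_subdomain % blocksize != 0:
        issues.append(
            f"grid_size_x/layout_x/2 = {half_subdomain} is not divisible by "
            f"blocksize {blocksize}"
        )
    return issues


# Hypothetical usage: a 400-point x dimension with LAYOUT_X=20, BLOCKSIZE=10.
print(check_blocksize(400, 20, 10))
```

The helper only reports problems and leaves it to the caller to decide whether a divisibility warning should stop the run.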
* Minor fixes for EnKF cycling
* Get model specific nlevs for EnKF namelist
* More fixes for running EnKF on CONUS_3km domain
DESCRIPTION OF CHANGES:
Add the ability for forecast jobs on Jet and Hera to use resources more efficiently, and to allow for Jet jobs to run on more partitions, which will be helpful during the HFIP Allocation Season. This involved additional Slurm tags in the rocoto xml, and changing pre-defined grids' layouts and blocksize to be consistent with
Other changes:
The concept adopted here is based on tests performed by Sam Trahan on the AVID real-time setup of HRRR. He found that a more efficient way to run the forecast model is by using the following settings:
along with
OMP_NUM_THREADS=4
To adopt these settings in a more general way, he also provided two rules to follow in order to avoid crashes when choosing layouts and blocksize:
blocksize <= grid_size_y / layout_y / 2
blocksize >= grid_size_x / layout_x / 2
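As a rough illustration of how these rules could be used to pick a BLOCKSIZE, here is a Python sketch. It assumes the `<=` form in both directions, since the review discussion flags the `>=` in the x rule as backwards, and it takes the y rule as stated even though it was not confirmed there. The function name `candidate_blocksizes` and the grid dimensions in the usage line are hypothetical. Values that divide grid_size_x / layout_x / 2 evenly are preferred, per the warning mentioned in review, and larger values are listed first because small blocksizes were reported to be slow.

```python
def candidate_blocksizes(grid_size_x, grid_size_y, layout_x, layout_y):
    """List blocksizes under both upper bounds, largest first.

    Hypothetical helper; assumes integer division and the <= form of
    both rules (the x-direction form follows the review discussion).
    """
    limit_x = grid_size_x // layout_x // 2
    limit_y = grid_size_y // layout_y // 2
    upper = min(limit_x, limit_y)
    # Keep only values that divide the x-direction limit evenly.
    return [b for b in range(upper, 0, -1) if limit_x % b == 0]


# Hypothetical usage: a 400 x 220 grid with LAYOUT_X=20 and LAYOUT_Y=11.
print(candidate_blocksizes(400, 220, 20, 11))
```

An empty result would suggest that the layout is too fine for the grid under these assumptions.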
TESTS CONDUCTED:
I ran the test suite (except for nco tests) on both Jet and Hera. Applying the above techniques, I fixed a couple of the tests that were already failing for develop -- regional_003, regional_004, and new_JPgrid. (After merging with develop, I realized that new_JPgrid may have already been fixed.)
CONTRIBUTORS:
@SamuelTrahanNOAA, Dom Heinzeller, and several colleagues in GSL/ATD.