[develop] Fixed issue #649 and tested on cheyenne. #650
chan-hoo
left a comment
@padhrigmccarthy, with this change: 1) the run time of the aqm_lbcs task increased significantly on WCOSS2 (before: <1 min; after: >15 min); 2) the task failed on Hera with the following error:
[h24c53:300358:0:300358] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300358:0:300358] ib_mlx5_log.c:139 RC QP 0x86f9 wqe[0]: SEND --e [va 0x2b609edef280 len 34 lkey 0x3eff03]
[h24c53:300357:0:300357] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300357:0:300357] ib_mlx5_log.c:139 RC QP 0x86eb wqe[0]: SEND --e [va 0x2b2d3dbed200 len 34 lkey 0x1cba0f]
chan-hoo
left a comment
@padhrigmccarthy, the run on WCOSS2 seems to be stuck. It is not making any progress.
|
@ytangnoaa is the original developer of the code and script. @ytangnoaa, do you have any idea how to fix this issue? |
|
@chan-hoo Thank you for testing on Hera and WCOSS2! I don't fully understand how this runs without this change on those platforms, though they must use something other than mpirun for RUN_CMD_UTILS. In any case, your findings indicate that the change I've proposed needs to be a local modification when running online-cmaq on cheyenne. I'm open to suggestions on how to accomplish this without having each cheyenne user edit exregional_aqm_lbcs.sh before using RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS. Thank you again! |
|
@padhrigmccarthy, I think you should change the machine files
|
|
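@ytangnoaa, I had a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct? The -n flag is repeated. What do you think about it? |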
You are right. We should only keep '-n ${NUMTS}' |
|
I am about to suggest changes that clean up a few details in Chan-Hoo's
suggestion. The issue is that mpirun (cheyenne) uses -np, while the launchers
on the other hosts take a single -n flag.
On Tue, Mar 7, 2023 at 2:46 PM Youhua Tang wrote:
@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?
You are right. We should only keep '-n ${NUMTS}'
|
mpirun -np ${NUMTS} on cheyenne
mpiexec -n ${NUMTS} on wcoss2
srun --export=ALL -n ${NUMTS} on hera
I broke these out into RUN_CMD_AQMLBC, defined in the ush/machine/<machine>.yaml files.
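To make the per-machine difference concrete, here is a minimal sketch of how the three launch prefixes above could be composed with the executable. The helper `aqmlbc_launch_cmd` and the `./gefs2lbc_para` path are hypothetical and only for illustration; in the workflow itself the value would come from RUN_CMD_AQMLBC in the machine file rather than from a case statement. The loop prints the command lines instead of running them.

```shell
# Hypothetical helper (not part of the workflow): maps a machine name ($1)
# and a task count ($2) to the launch prefix listed in the comment above.
aqmlbc_launch_cmd() {
  case "$1" in
    cheyenne) echo "mpirun -np $2" ;;
    wcoss2)   echo "mpiexec -n $2" ;;
    hera)     echo "srun --export=ALL -n $2" ;;
    *)        echo "unknown machine: $1" >&2; return 1 ;;
  esac
}

NUMTS=4
for machine in cheyenne wcoss2 hera; do
  # Stand-in for the value that would be read from the machine YAML file.
  RUN_CMD_AQMLBC="$(aqmlbc_launch_cmd "${machine}" "${NUMTS}")"
  # Print the full command line rather than executing it; the executable
  # path is an assumption made for this sketch.
  echo "${machine}: ${RUN_CMD_AQMLBC} ./gefs2lbc_para"
done
```

Keeping the whole prefix, flag included, in one per-machine variable means the script never has to append a launcher flag that one of the launchers does not accept.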
|
@chan-hoo When I follow your suggestion, I get the following error when running on cheyenne. It seems that adding RUN_CMD_AQMLBC to ush/machine/cheyenne.yaml is not enough to define the variable. Does it also need to be added to config_defaults.yaml?
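Independent of where the variable ends up being declared, a guard near the top of the script can make a missing setting fail early with a readable message. This is only a sketch: `require_var` is a hypothetical helper, and the assignment below is a stand-in for the value the workflow would load from the machine file.

```shell
# Hypothetical guard: returns non-zero if the variable named in $1 is
# unset or empty, and prints a hint to stderr.
require_var() {
  eval "val=\${$1-}"
  if [ -z "${val}" ]; then
    echo "ERROR: $1 is not set; check the machine file and config defaults" >&2
    return 1
  fi
}

# Stand-in for the value the workflow would normally provide.
RUN_CMD_AQMLBC="mpirun -np 4"
require_var RUN_CMD_AQMLBC && echo "RUN_CMD_AQMLBC='${RUN_CMD_AQMLBC}'"
```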
|
|
@padhrigmccarthy, you should define a new parameter in |
|
@padhrigmccarthy, in addition, can you add |
|
@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR. |
@chan-hoo The small changes I just pushed run successfully on Cheyenne. Thank you for all of your help! |
MichaelLueken
left a comment
@padhrigmccarthy The Jenkins tests passed on all machines except Orion, which is a known issue. Manual testing of the WE2E tests on Orion shows that all tests pass. Since these changes look good to me, I will now approve them!
DESCRIPTION OF CHANGES:
Please review this one-line change, which I believe resolves issue #649. As far as I can tell, the problem was a typo that places the '-n ${NUMTS}' argument before the gefs2lbc_para executable instead of after it. This causes mpirun to fail on cheyenne because '-n' is not a valid argument for that mpirun.
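A note on why the position of the flag matters: MPI launchers generally treat tokens before the executable as their own options and pass everything after it to the program. The sketch below uses a toy stand-in, `toy_launch`, that only understands `-n N`; it is not mpirun, mpiexec, or srun, and the executable path is assumed for illustration.

```shell
# Toy stand-in for a launcher: consumes "-n N" pairs until the first other
# token, treats that token as the executable, and forwards the rest.
toy_launch() {
  nprocs=1
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -n) nprocs="$2"; shift 2 ;;
      *)  break ;;
    esac
  done
  exe="$1"; shift
  echo "launcher: nprocs=${nprocs} exe=${exe} program_args='$*'"
}

# Flag before the executable: the launcher sees it.
toy_launch -n 4 ./gefs2lbc_para
# Flag after the executable: it becomes a program argument instead.
toy_launch ./gefs2lbc_para -n 4
```

In the second call the toy launcher falls back to its default task count, so moving the flag changes how many tasks are started on launchers that do accept `-n`.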
Type of change
TESTS CONDUCTED:
Now runs on cheyenne with RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS both set to true. I have extensive local configuration changes that allow the overall workflow to run on cheyenne. I have no access to Hera, Orion, WCOSS2, etc.
ISSUE:
#649