Skip to content

[develop] Fixed issue #649 and tested on cheyenne.#650

Merged
MichaelLueken merged 3 commits into
ufs-community:developfrom
padhrigmccarthy:aqm_lbcs_bug_649
Mar 10, 2023
Merged

[develop] Fixed issue #649 and tested on cheyenne.#650
MichaelLueken merged 3 commits into
ufs-community:developfrom
padhrigmccarthy:aqm_lbcs_bug_649

Conversation

@padhrigmccarthy
Copy link
Copy Markdown
Contributor

DESCRIPTION OF CHANGES:

Please review this one-line change, that I believe resolves issue #649. As far as I can tell, the problem was a typo that places the '-n ${NUMTS}' argument before the gefs2lbc_para executable instead of after. This causes mpirun to fail on cheyenne because it's an invalid mpirun argument.

Type of change

  • [ x] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

Now runs on cheyenne with RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS both set to true. I have extensive local configuration changes that allow the overall workflow to run on cheyenne. I have no access to Hera, Orion, WCOSS3, etc.

ISSUE:

#649

Copy link
Copy Markdown
Collaborator

@chan-hoo chan-hoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy, with this change, 1) the run time of this task aqm_lbcs increased significantly on WCOSS2. (before: <1min, after: >15mins) 2) the task failed on Hera with the following error: [h24c53:300358:0:300358] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300358:0:300358] ib_mlx5_log.c:139 RC QP 0x86f9 wqe[0]: SEND --e [va 0x2b609edef280 len 34 lkey 0x3eff03]
[h24c53:300357:0:300357] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[h24c53:300357:0:300357] ib_mlx5_log.c:139 RC QP 0x86eb wqe[0]: SEND --e [va 0x2b2d3dbed200 len 34 lkey 0x1cba0f]

Copy link
Copy Markdown
Collaborator

@chan-hoo chan-hoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy, the run on wcoss2 seems to be stuck. It doesn't go forward either.

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 7, 2023

@ytangnoaa is the original developer of the code and script. @ytangnoaa, do you have any idea to fix this issue?

@padhrigmccarthy
Copy link
Copy Markdown
Contributor Author

@chan-hoo Thank you for testing on Hera and WCOSS2! I don't fully understand how this runs without this change on those platforms, though they must use something other than mpirun for RUN_CMD_UTILS. In any case, your findings indicate that the change I've proposed needs to be a local mod when running online-cmaq on cheyenne.

I'm open to suggestions on how to accomplish this without having each cheyenne user edit exregional_aqm_lbcs.sh before using RUN_TASK_AQM_LBCS and DO_AQM_GEFS_LBCS. Thank you again!

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 7, 2023

@padhrigmccarthy, I think you should change the machine files ush/machine/[machine].yaml and the command in the script.

  1. scripts/exregional_aqm_lbcs.sh: ${RUN_CMD_UTILS} -n ${NUMTS} ---> ${RUN_CMD_AQMLBC}
  2. add the following line between Ln 19-20 in ush/machine/hera.yaml: RUN_CMD_AQMLBC: srun --export=ALL -np $nprocs -n ${NUMTS}
  3. add the following line between Ln 19-20 in ush/machine/wcoss2.yaml: RUN_CMD_AQMLBC: mpirun -n ${nprocs} -n ${NUMTS}
  4. add the following line between Ln 19-20 in ush/machine/cheyenne.yaml: RUN_CMD_AQMLBC: put the command here

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 7, 2023

@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?

@ytangnoaa
Copy link
Copy Markdown

@ytangnoaa, I got a question when I wrote the above comment for wcoss2.yaml. Is the command mpirun -n ${nprocs} -n ${NUMTS} correct?? The -n flag is repeated. What do you think about it?

You are right. We should only keep '-n ${NUMTS}'

@padhrigmccarthy
Copy link
Copy Markdown
Contributor Author

padhrigmccarthy commented Mar 7, 2023 via email

mpirun -np ${NUMTS} on cheyenne
mpiexec -n ${NUMTS} on wcoss2
srun --export=ALL -n ${NUMTS} on hera
Broke these out into RUN_CMD_AQMLBC, defined in the machine/machine.yaml files.
@padhrigmccarthy
Copy link
Copy Markdown
Contributor Author

@chan-hoo When I follow your suggestion, I get the following error when running on cheyenne. It seems that adding RUN_AQMLBC to ush/machine/cheyenne.yaml is not enough to define the variable. Does it also need to be added to config_defaults.yaml?
log/aqm_lbcs_2020081700.log:

  • PREP_STEP
  • :
    /glade/work/paddy/online-cmaq/ufs-srweather-app/scripts/exregional_aqm_lbcs.sh: line 209: RUN_CMD_AQMLBC: unbound variable
    FATAL ERROR:

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, you should define a new parameter in ush/config_defaults.yaml: add it to Ln 201~213 as follows:

  # RUN_CMD_AQM:
  # The run command for some AQM tasks.
  #
  # RUN_CMD_AQMLBC:
  # The run command for AQM_LBCS task.
  #
  #-----------------------------------------------------------------------
  #   
  RUN_CMD_SERIAL: ""
  RUN_CMD_UTILS: ""
  RUN_CMD_FCST: ""
  RUN_CMD_POST: "" 
  RUN_CMD_AQM: "" 
  RUN_CMD_AQMLBC: ""

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, in addition, can you add RUN_CMD_AQMLBC to ush/machine/orion.yaml too? It should be the same as that of Hera:

RUN_CMD_AQMLBC: srun --export=ALL -n ${NUMTS}

Comment thread ush/machine/hera.yaml Outdated
Comment thread ush/machine/cheyenne.yaml
@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Mar 8, 2023

@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR.

@padhrigmccarthy
Copy link
Copy Markdown
Contributor Author

@padhrigmccarthy, I confirm that your change works well on Hera as well as WCOSS2. Once you confirm it works correctly on Cheyenne, I'll approve this PR.

@chan-hoo The small changes I just pushed run successfully on Cheyenne. Thank you for all of your help!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 8, 2023
Copy link
Copy Markdown
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@padhrigmccarthy The Jenkins tests passed for all machines, with the exception of Orion, which is a known issue. Manual testing of the WE2E tests on Orion shows that all tests successfully pass. Since these changes look good to me, I will now approve of these changes!

@MichaelLueken MichaelLueken merged commit 86c3eb0 into ufs-community:develop Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AQM exregional_aqm_lbcs.sh specifies '-n ${NUMTS}' before gefs2lbc_para executable

4 participants