Fix hang bug by updating MPI bcast to be from 0 not IAPROC-1 + Restart configurable path debugging for global-workflow#1426

Merged
JessicaMeixner-NOAA merged 7 commits into NOAA-EMC:dev/ufs-weather-model from JessicaMeixner-NOAA:bug/initsavepoints
May 27, 2025

JessicaMeixner-NOAA commented on May 12, 2025

Pull Request Summary

When trying to use the save points, we discovered a hang on WCOSS2. The hang was caused by a bug in the MPI broadcast calls in w3iopo.

Description

This PR updates the MPI broadcast calls in w3iopo to broadcast from rank 0 instead of IAPROC-1. These calls are used when saved weight files are available in the unstructured grid case.
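
To illustrate why the root argument matters, here is a minimal Python sketch. It is a toy model, not WW3 or MPI code. It assumes that IAPROC is each process's own 1-based rank (the usual WW3 convention, not stated in this PR), so IAPROC-1 would evaluate to a different root on every rank. An MPI broadcast is a collective call, and all ranks are required to pass the same root. The helper names `root_arguments` and `roots_agree` are hypothetical and exist only for this illustration.

```python
# Toy model of the root argument each rank passes to a broadcast.
# Assumption (not from the PR): IAPROC is the calling process's 1-based rank.


def root_arguments(nproc, root_of):
    """Return the root that each rank (IAPROC = 1..nproc) would pass."""
    return [root_of(iaproc) for iaproc in range(1, nproc + 1)]


def roots_agree(roots):
    """A collective broadcast is only well formed if all ranks name one root."""
    return len(set(roots)) == 1


nproc = 4
before = root_arguments(nproc, lambda iaproc: iaproc - 1)  # root = IAPROC-1
after = root_arguments(nproc, lambda iaproc: 0)            # root = 0

print("root = IAPROC-1:", before, "consistent:", roots_agree(before))
print("root = 0:       ", after, "consistent:", roots_agree(after))
```

Under that assumption, the first case names a different root on each rank, which MPI does not allow for a collective and which can surface as a hang. The second case names rank 0 everywhere.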

A PR to develop was made - #1430

Closes #1420, since those changes were added here.

Issue(s) addressed

Refs #1350
Refs ufs-community/ufs-weather-model#2709

Commit Message

Fix hang bug by updating MPI bcast to be from 0 not IAPROC-1

Check list

Testing

  • How were these changes tested? These were tested in standalone ww3 regtests and in ufs-weather-model RT tests
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Yes, although the hang was seen mostly on larger tests.
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)? hera intel
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)
    No expected answer changes.
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):
**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2                     (17 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (13 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (17 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)
mww3_test_09/./work_MPI_ASCII                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)
ww3_tp2.14/./work_OASACM5                     (1 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)
ww3_tp2.6/./work_ST4_ASCII                     (0 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)

ww3_tp2.14/./work_OASACM5 differs in OUTPUT_TOY.txt, which is unexpected but, from what I can tell, not meaningful:

7c7,8
<  ===========================================================================
---
>  APPLE partitioning
> ========================================================

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

UFS PR: ufs-community/ufs-weather-model#2737

sbanihash left a comment

I have reviewed this PR and approve the changes. Matrix comparison files are attached.
1426_matdiff

My ww3_tp2.14/./work_OASACM5 test shows no file differences, so I am guessing something unusual happened in Jessica's test.

matrixDiff.txt
matrixCompSummary.txt
matrixCompFull.txt

JessicaMeixner-NOAA changed the title from "Fix hang bug by updating MPI bcast to be from 0 not IAPROC-1" to "Fix hang bug by updating MPI bcast to be from 0 not IAPROC-1 + Restart configurable path debugging for global-workflow" on May 20, 2025

JessicaMeixner-NOAA merged commit bc43396 into NOAA-EMC:dev/ufs-weather-model on May 27, 2025
3 of 6 checks passed