wmmdatmd.F90: update MTAG parameters for 2022 compilers #825

Merged
JessicaMeixner-NOAA merged 3 commits into NOAA-EMC:develop from MatthewMasarik-NOAA:fix/itags on Nov 1, 2022

Conversation

@MatthewMasarik-NOAA
Contributor

@MatthewMasarik-NOAA MatthewMasarik-NOAA commented Oct 19, 2022

Pull Request Summary

Modifies the MTAG1/2 parameters which determine itag values available for MPI communication between multiple grids.

Description

In wmmdatmd.F90, the parameters MTAG1 and MTAG2 let you adjust the total number of itag values, which serve as IDs for MPI messages between multiple grids. Tuning MTAG1/2 effectively splits the available IDs (itag values) between IDs used for (1) grids of equal rank and (2) grids of non-equal rank.
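
As a rough illustration of the splitting idea, here is a minimal Python sketch that divides one contiguous tag range into two pools at a tunable boundary. The class `TagBudget`, its field names, and the numbers are hypothetical and chosen for illustration only; they are not the layout or the values used in wmmdatmd.F90.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagBudget:
    """Hypothetical split of one tag range into two contiguous pools."""

    first: int     # lowest tag in the shared range (assumed)
    boundary: int  # first tag of the second pool (the tunable value)
    last: int      # highest usable tag, e.g. the MPI tag upper bound

    def __post_init__(self):
        if not (self.first <= self.boundary <= self.last + 1):
            raise ValueError("boundary must lie inside [first, last + 1]")

    @property
    def pool_a(self) -> int:
        """Number of tags in [first, boundary)."""
        return self.boundary - self.first

    @property
    def pool_b(self) -> int:
        """Number of tags in [boundary, last]."""
        return self.last - self.boundary + 1


# Illustrative numbers only; NOT the values in wmmdatmd.F90.
budget = TagBudget(first=1000, boundary=6000, last=9999)
print(budget.pool_a, budget.pool_b)  # 5000 4000
```

Moving `boundary` up or down trades IDs between the two pools while the total stays fixed by `first` and `last`, which is the trade-off described above.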

  • What bug does it fix, or what feature does it add?

    • The itag values must be updated for newer (2022) compiler versions.
  • Is a change of answers expected from this PR?

    • Some change in answers can be expected from a change in compiler.
    • Testing and comparison of answers for this PR will be explained below.

Please also include the following information:

  • Add any suggestions for a reviewer

  • Mention any labels that should be added:

    • bug.
  • Are answer changes expected from this PR? Please describe the changes and the reason why in addition to which of the following labels would apply:

    • out_grd change, out_pnt change, restart file change, Regression test

Issue(s) addressed

Commit Message

wmmdatmd.F90: update MTAG1/2 parameters for 2022 compilers
Co-authors: @JessicaMeixner-NOAA, @aliabdolali

Check list

Testing

  • How were these changes tested?

    • A number of WW3 standalone and UFS weather model coupled regression tests were performed.

    • WW3 standalone

      • Given that the new itag values are intended for new (2022) compilers, we did a few different comparisons to help
        ensure consistency in a way that is apples-to-apples.
      • (1) itags:curr/comp:curr -vs- itags:new/comp:curr
      • (2) itags:new/comp:new -vs- itags:new/comp:new (identical, separate runs compared)
      • (3) itags:curr/comp:new -vs- itags:new/comp:new (multi-grid cases expected to not pass)
    • UFS WM coupled

      • The current UFS RTs were run on hera, orion, and wcoss2 to ensure the current regtests aren't broken by the new
        values. Additionally, to confirm that control_c384gdas_wav can be turned back on, this test was run on hera, and
        a baseline was created and successfully matched.
  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?)

    • Yes, the existing regression tests are sufficient for testing these changes.
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)?

    • Yes. hera.intel.
  • Please indicate the expected changes in the regression test output, (Note the list of known non-identical tests.)

(1) Only known non-identicals (and unstructured mod_defs).

**********************************************************************     
********************* non-identical cases ****************************     
**********************************************************************     
mww3_test_03/./work_PR2_UNO_MPI_d2                     (12 files differ)   
mww3_test_03/./work_PR1_MPI_d2                     (16 files differ)       
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (15 files differ) 
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)  
mww3_test_03/./work_PR3_UNO_MPI_d2                     (15 files differ)   
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)    
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)    
ww3_ta1/./work_UPD0F_U                     (0 files differ)                
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)            
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)            
ww3_tp2.17/./work_ma                     (1 files differ)                  
ww3_tp2.17/./work_a                     (1 files differ)                   
ww3_tp2.17/./work_mc1                     (1 files differ)                 
ww3_tp2.17/./work_mb                     (1 files differ)                  
ww3_tp2.17/./work_mc                     (1 files differ)                  
ww3_tp2.17/./work_ma1                     (1 files differ)                 
ww3_tp2.17/./work_c                     (1 files differ)                   
ww3_tp2.17/./work_b                     (1 files differ)                   
ww3_tp2.6/./work_ST0                     (1 files differ)                  
ww3_tp2.6/./work_ST4                     (1 files differ)                  
ww3_tp2.6/./work_pdlib                     (1 files differ)                
ww3_ufs1.3/./work_a                     (1 files differ)                                                  
**********************************************************************     
************************ identical cases *****************************     
**********************************************************************

(2) Only known non-identicals (and unstructured mod_defs).

**********************************************************************     
********************* non-identical cases ****************************     
**********************************************************************     
mww3_test_03/./work_PR1_MPI_e                     (1 files differ)         
mww3_test_03/./work_PR3_UQ_MPI_e_c                     (1 files differ)    
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)     
mww3_test_03/./work_PR2_UNO_MPI_d2                     (15 files differ)   
mww3_test_03/./work_PR1_MPI_d2                     (17 files differ)       
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (15 files differ) 
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (16 files differ)  
mww3_test_03/./work_PR3_UNO_MPI_d2                     (16 files differ)   
mww3_test_03/./work_PR2_UQ_MPI_d2                     (13 files differ)    
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)      
mww3_test_03/./work_PR3_UNO_MPI_e_c                     (1 files differ)   
mww3_test_03/./work_PR3_UQ_MPI_d2                     (16 files differ)    
ww3_ta1/./work_UPD0F_U                     (0 files differ)                
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)            
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)            
ww3_tp2.17/./work_ma                     (1 files differ)                  
ww3_tp2.17/./work_a                     (1 files differ)                   
ww3_tp2.17/./work_mc1                     (1 files differ)                 
ww3_tp2.17/./work_mb                     (1 files differ)                  
ww3_tp2.17/./work_mc                     (1 files differ)                  
ww3_tp2.17/./work_ma1                     (1 files differ)                 
ww3_tp2.17/./work_c                     (1 files differ)                   
ww3_tp2.17/./work_b                     (1 files differ)                   
ww3_tp2.6/./work_ST0                     (1 files differ)                  
ww3_tp2.6/./work_ST4                     (1 files differ)                  
ww3_tp2.6/./work_pdlib                     (1 files differ)                
ww3_ts4/./work_ug_MPI                     (1 files differ)                 
ww3_ufs1.3/./work_a                     (3 files differ)                                                                               
**********************************************************************     
************************ identical cases *****************************     
**********************************************************************

(3) For this set of tests we would not expect multi-grid tests to pass without differences. For the itags:curr/comp:new runs that finished without error, the changes are: the known non-identicals, unstructured mod_defs, and additional multi-grid tests (mww3) as mentioned.

**********************************************************************     
********************* non-identical cases ****************************     
**********************************************************************     
mww3_test_03/./work_PR3_UQ_MPI_a                     (25 files differ)     
mww3_test_03/./work_PR1_MPI_d                     (42 files differ)        
mww3_test_03/./work_PR3_UNO_MPI_b_c                     (32 files differ)  
mww3_test_03/./work_PR3_UQ_MPI_c                     (32 files differ)     
mww3_test_03/./work_PR2_UNO_MPI_d                     (42 files differ)    
mww3_test_03/./work_PR2_UQ_MPI_a                     (25 files differ)     
mww3_test_03/./work_PR3_UQ_MPI_a_c                     (25 files differ)   
mww3_test_03/./work_PR3_UNO_MPI_d_c                     (42 files differ)  
mww3_test_03/./work_PR3_UNO_MPI_e                     (1 files differ)     
mww3_test_03/./work_PR2_UQ_MPI_e                     (1 files differ)      
mww3_test_03/./work_PR3_UNO_MPI_d                     (42 files differ)    
mww3_test_03/./work_PR2_UNO_MPI_e                     (1 files differ)     
mww3_test_03/./work_PR2_UQ_MPI_b                     (32 files differ)     
mww3_test_03/./work_PR3_UNO_MPI_c_c                     (32 files differ)  
mww3_test_03/./work_PR3_UNO_MPI_a_c                     (25 files differ)  
mww3_test_03/./work_PR2_UNO_MPI_d2                     (37 files differ)   
mww3_test_03/./work_PR3_UNO_MPI_b                     (32 files differ)    
mww3_test_03/./work_PR3_UQ_MPI_c_c                     (32 files differ)   
mww3_test_03/./work_PR3_UNO_MPI_a                     (25 files differ)    
mww3_test_03/./work_PR1_MPI_c                     (32 files differ)        
mww3_test_03/./work_PR1_MPI_d2                     (37 files differ)       
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (37 files differ) 
mww3_test_03/./work_PR1_MPI_b                     (32 files differ)        
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (37 files differ)  
mww3_test_03/./work_PR3_UNO_MPI_d2                     (37 files differ)   
mww3_test_03/./work_PR3_UQ_MPI_b_c                     (32 files differ)   
mww3_test_03/./work_PR2_UQ_MPI_d2                     (37 files differ)    
mww3_test_03/./work_PR1_MPI_a                     (25 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d_c                     (42 files differ)   
mww3_test_03/./work_PR3_UQ_MPI_d                     (42 files differ)     
mww3_test_03/./work_PR3_UQ_MPI_e                     (1 files differ)      
mww3_test_03/./work_PR2_UNO_MPI_b                     (32 files differ)    
mww3_test_03/./work_PR3_UNO_MPI_c                     (32 files differ)    
mww3_test_03/./work_PR2_UQ_MPI_d                     (42 files differ)     
mww3_test_03/./work_PR2_UQ_MPI_c                     (32 files differ)     
mww3_test_03/./work_PR2_UNO_MPI_c                     (32 files differ)    
mww3_test_03/./work_PR2_UNO_MPI_a                     (25 files differ)    
mww3_test_03/./work_PR3_UQ_MPI_b                     (32 files differ)     
mww3_test_03/./work_PR3_UQ_MPI_d2                     (37 files differ)    
mww3_test_09/./work_MPI                     (20 files differ)              
ww3_ta1/./work_UPD0F_U                     (0 files differ)                
ww3_tp2.10/./work_MPI_OMPH                     (7 files differ)            
ww3_tp2.16/./work_MPI_OMPH                     (4 files differ)            
ww3_tp2.17/./work_ma                     (1 files differ)                  
ww3_tp2.17/./work_a                     (1 files differ)                   
ww3_tp2.17/./work_mc1                     (1 files differ)                 
ww3_tp2.17/./work_mb                     (1 files differ)                  
ww3_tp2.17/./work_mc                     (1 files differ)                  
ww3_tp2.17/./work_ma1                     (1 files differ)                 
ww3_tp2.17/./work_c                     (1 files differ)                   
ww3_tp2.17/./work_b                     (1 files differ)                   
ww3_tp2.6/./work_ST0                     (1 files differ)                  
ww3_tp2.6/./work_ST4                     (1 files differ)                  
ww3_tp2.6/./work_pdlib                     (1 files differ)                
ww3_ts4/./work_ug_MPI                     (1 files differ)                 
ww3_ufs1.3/./work_a                     (3 files differ)                   
**********************************************************************     
************************ identical cases *****************************     
**********************************************************************
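
To compare listings such as (1) to (3) without reading them line by line, one option is to parse the `(N files differ)` entries and report cases that appear in one listing but not in another. The sketch below is a minimal Python example; the helper names `parse_cases` and `new_cases` are hypothetical and are not part of the WW3 matrix tools. The sample lines are excerpts from listings (1) and (3).

```python
import re

# Matches lines such as "ww3_tp2.6/./work_ST4      (1 files differ)".
CASE_RE = re.compile(r"^(\S+)\s+\((\d+) files differ\)\s*$")


def parse_cases(text):
    """Map each listed case to its count of differing files."""
    cases = {}
    for line in text.splitlines():
        match = CASE_RE.match(line.strip())
        if match:
            cases[match.group(1)] = int(match.group(2))
    return cases


def new_cases(baseline, candidate):
    """Cases listed in candidate but absent from baseline, sorted."""
    return sorted(set(parse_cases(candidate)) - set(parse_cases(baseline)))


# Excerpts from listings (1) and (3).
listing_1 = """
ww3_tp2.6/./work_ST4                     (1 files differ)
ww3_ufs1.3/./work_a                     (1 files differ)
"""
listing_3 = """
mww3_test_09/./work_MPI                     (20 files differ)
ww3_tp2.6/./work_ST4                     (1 files differ)
ww3_ufs1.3/./work_a                     (3 files differ)
"""

print(new_cases(listing_1, listing_3))  # ['mww3_test_09/./work_MPI']
```

Separator lines made of asterisks do not match the pattern, so the listings can be passed in as they are.
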
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

(1)

(2)

(3)

UFS RT Logs

@JessicaMeixner-NOAA
Collaborator

@benoitp-cmc can you try this out to see if this helps for the case you reported in #711 ?

@sbrus89
Collaborator

sbrus89 commented Oct 19, 2022

We're not using multi-grid for E3SM, so this won't affect our configurations.

@benoitp-cmc
Contributor

@JessicaMeixner-NOAA I will.

@MatthewMasarik-NOAA
Contributor Author

We're not using multi-grid for E3SM, so this won't affect our configurations.

Thank you for reporting this, @sbrus89.

@MatthewMasarik-NOAA
Contributor Author

Note: I wanted to clarify this is currently a draft PR only because I am still adding text and log files for the testing performed. The code fix itself (and my testing of it) is complete.

@ukmo-ccbunney
Collaborator

All code managers are encouraged to test this on their own use cases to help ensure consistency.
We don't run a multigrid setup operationally, but we can test this change on one of @ukmo-jianguo-li 's multigrid SMC configurations.

@MatthewMasarik-NOAA
Contributor Author

All code managers are encouraged to test this on their own use cases to help ensure consistency.
We don't run a multigrid setup operationally, but we can test this change on one of @ukmo-jianguo-li 's multigrid SMC configurations.

That would be great, @ukmo-ccbunney. Thanks!

@benoitp-cmc
Contributor

It's not working for my real case (560 cpu).

I've revisited the simplified test case I provided in #711 by adding the missing 3rd grid:
https://hpfx.collab.science.gc.ca/~bpo001/WW3/issue_711/same_rank_3_grids.tar.gz

With 10 CPU, this test case works but with 80 CPU it gives me:

Abort(201959684) on node 50 (rank 50 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x12a2b0e0, count=1296, MPI_REAL, dest=76, tag=1048613, MPI_COMM_WORLD, request=0x47751a4) failed
PMPI_Isend(95).: Invalid tag, value is 1048613
Abort(67741956) on node 66 (rank 66 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x125689e0, count=1296, MPI_REAL, dest=74, tag=1048635, MPI_COMM_WORLD, request=0x42ae510) failed
PMPI_Isend(95).: Invalid tag, value is 1048635
Abort(805939460) on node 70 (rank 70 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x119edb00, count=1296, MPI_REAL, dest=78, tag=1048683, MPI_COMM_WORLD, request=0x37341b0) failed
PMPI_Isend(95).: Invalid tag, value is 1048683

Without the patch, I have:

Abort(1007266052) on node 27 (rank 27 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x13159680, count=1296, MPI_REAL, dest=7, tag=1500059, MPI_COMM_WORLD, request=0x48526e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500059
Abort(201959684) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x12658b60, count=1296, MPI_REAL, dest=7, tag=1500016, MPI_COMM_WORLD, request=0x3d556e0) failed
PMPI_Isend(95).: Invalid tag, value is 1500016
Abort(67741956) on node 71 (rank 71 in comm 0): Fatal error in PMPI_Isend: Invalid tag, error stack:
PMPI_Isend(162): MPI_Isend(buf=0x120da0a0, count=1296, MPI_REAL, dest=3, tag=1500079, MPI_COMM_WORLD, request=0x387cbc0) failed
PMPI_Isend(95).: Invalid tag, value is 1500079
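
For reference, the rejected tag values can be pulled out of logs like these with a short script. The following is a minimal Python sketch; the helper name `rejected_tags` is hypothetical, and the sample strings are the "value is" lines copied from the two logs above. The MPI standard only guarantees that the tag upper bound (`MPI_TAG_UB`) is at least 32767, so the actual limit depends on the MPI library and the run.

```python
import re

# Matches the final line of each error stack, e.g.
# "PMPI_Isend(95).: Invalid tag, value is 1048613".
TAG_RE = re.compile(r"Invalid tag, value is (\d+)")


def rejected_tags(log):
    """Return the sorted, de-duplicated tag values that MPI rejected."""
    return sorted({int(value) for value in TAG_RE.findall(log)})


patched = """
PMPI_Isend(95).: Invalid tag, value is 1048613
PMPI_Isend(95).: Invalid tag, value is 1048635
PMPI_Isend(95).: Invalid tag, value is 1048683
"""
unpatched = """
PMPI_Isend(95).: Invalid tag, value is 1500059
PMPI_Isend(95).: Invalid tag, value is 1500016
PMPI_Isend(95).: Invalid tag, value is 1500079
"""

for name, log in (("patched", patched), ("unpatched", unpatched)):
    tags = rejected_tags(log)
    print(name, min(tags), max(tags))
# patched 1048613 1048683
# unpatched 1500016 1500079
```

The smallest rejected value in each run is an upper limit on what the MPI library accepted there: in the patched run, the tag upper bound must have been below 1048613. All three patched values sit just above 2^20 = 1048576.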

@JessicaMeixner-NOAA
Collaborator

@benoitp-cmc :( We did know that this would not work for every case, but we were certainly hoping it'd work for most. Have you had any success changing the MTAG parameters to get successful runs? Or have you just used the MPI environment variables to succeed with the newest intel?

We're trying to figure out a near-term path forward that will allow us to move to the latest Intel for our WW3 regression testing, which this PR would allow, but we don't want to break other people's cases. So I guess what I'm asking is: does this break things "more" for you? Could this be another interim state until we figure out a longer-term solution, which might require some new MPI route handles so that we're not using the same ones and going over the itag limits? (That idea was from @ukmo-ccbunney.)

@MatthewMasarik-NOAA MatthewMasarik-NOAA marked this pull request as ready for review October 20, 2022 21:20
@benoitp-cmc
Contributor

@JessicaMeixner-NOAA I have tested this change and it doesn't affect our multigrid configurations.

I don't know how the tag system works so I'll pass my turn on playing with MTAG values. For now, using the MPI environment variables is a suitable workaround.

@MatthewMasarik-NOAA
Contributor Author

@JessicaMeixner-NOAA I have tested this change and it doesn't affect our multigrid configurations.

I don't know how the tag system works so I'll pass my turn on playing with MTAG values. For now, using the MPI environment variables is a suitable workaround.

@benoitp-cmc, thank you for these confirmations.

@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Oct 21, 2022

@MatthewMasarik-NOAA I believe for this PR, we're still waiting on @ukmo-ccbunney / @ukmo-jianguo-li to test their SMC multi-grid case, and word from @thesser1 and @mickaelaccensi to confirm there aren't any unintended consequences for them. Given that this does not completely address #711, let's update the PR description to remove "fix" from the issue mention as we move forward. Hopefully we can have reviews from others by the end of next week at the latest before moving forward on this PR.

@MatthewMasarik-NOAA
Contributor Author

@MatthewMasarik-NOAA I believe for this PR, we're still waiting on @ukmo-ccbunney / @ukmo-jianguo-li to test their SMC multi-grid case, and word from @thesser1 and @mickaelaccensi to confirm there aren't any unintended consequences for them. Given that this does not completely address #711, let's update the PR description to remove "fix" from the issue mention as we move forward. Hopefully we can have reviews from others by the end of next week at the latest before moving forward on this PR.

@JessicaMeixner-NOAA, I copy, thanks for the status update. I have removed 'fixes' #711 from the description (and added ufs-community/ufs-weather-model/issues/1237).

Collaborator

@JessicaMeixner-NOAA JessicaMeixner-NOAA left a comment

**********************************************************************
********************* non-identical cases ****************************
**********************************************************************
mww3_test_03/./work_PR2_UNO_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR1_MPI_d2                     (16 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c                     (12 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c                     (15 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2                     (12 files differ)
mww3_test_03/./work_PR2_UQ_MPI_d2                     (15 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2                     (15 files differ)
ww3_ta1/./work_UPD0F_U                     (0 files differ)
ww3_tp2.10/./work_MPI_OMPH                     (5 files differ)
ww3_tp2.16/./work_MPI_OMPH                     (3 files differ)
ww3_tp2.17/./work_a                     (1 files differ)
ww3_tp2.17/./work_c                     (1 files differ)
ww3_tp2.17/./work_b                     (1 files differ)
ww3_tp2.6/./work_ST0                     (1 files differ)
ww3_tp2.6/./work_ST4                     (1 files differ)
ww3_tp2.6/./work_pdlib                     (1 files differ)
ww3_ufs1.3/./work_a                     (1 files differ)

matrixCompFull.txt
matrixCompSummary.txt
matrixDiff.txt

Just waiting on @ukmo-ccbunney @mickaelaccensi and @thesser1 to see if they have any objections at this point.

@MatthewMasarik-NOAA
Contributor Author

Thanks for the update @JessicaMeixner-NOAA. Glad to see your tests passed as well.

@mickaelaccensi
Collaborator

All tests passed and the differences seem as usual.


********************* non-identical cases ****************************


mww3_test_03/./work_PR2_UQ_MPI_d2 (9 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2 (8 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2_c (8 files differ)
mww3_test_03/./work_PR3_UNO_MPI_d2_c (8 files differ)
mww3_test_03/./work_PR3_UNO_e_c (1 files differ)
mww3_test_03/./work_PR1_MPI_e (1 files differ)
mww3_test_03/./work_PR3_UQ_MPI_d2 (8 files differ)
mww3_test_03/./work_PR1_MPI_d2 (8 files differ)
mww3_test_03/./work_PR2_UNO_MPI_d2 (8 files differ)
mww3_test_03/./work_PR2_UNO_MPI_e (1 files differ)
ww3_ta1/./work_UPD0F_U (0 files differ)
ww3_tp2.14/./work_OASOCM (0 files differ)
ww3_tp2.14/./work_OASACM (0 files differ)
ww3_tp2.14/./work_OASACM4 (0 files differ)
ww3_tp2.14/./work_OASICM (0 files differ)
ww3_tp2.14/./work_OASACM5 (0 files differ)
ww3_tp2.14/./work_OASACM6 (0 files differ)
ww3_tp2.14/./work_OASACM2 (0 files differ)
ww3_tp2.17/./work_ma (1 files differ)
ww3_tp2.17/./work_mc1 (1 files differ)
ww3_tp2.17/./work_mc (1 files differ)
ww3_tp2.17/./work_ma1 (1 files differ)
ww3_tp2.17/./work_mb (1 files differ)
ww3_tp2.21/./work_ma (4 files differ)
ww3_tp2.6/./work_ST0 (1 files differ)
ww3_tp2.6/./work_ST4 (1 files differ)
ww3_ts4/./work_ug_MPI (1 files differ)

@MatthewMasarik-NOAA
Contributor Author

Thank you for providing your results, @mickaelaccensi!

@JessicaMeixner-NOAA
Collaborator

Although we have not gotten affirmative answers from everyone we were waiting for, at this point we plan to merge this PR at the end of the workday here unless we hear something in the meantime.

@JessicaMeixner-NOAA JessicaMeixner-NOAA merged commit 43e95b4 into NOAA-EMC:develop Nov 1, 2022
@MatthewMasarik-NOAA MatthewMasarik-NOAA deleted the fix/itags branch November 1, 2022 23:54
@ukmo-ccbunney
Collaborator

ukmo-ccbunney commented Nov 2, 2022

Sorry I am a bit late on reporting this, but for the record this ran fine on our Cray XC40 and EXZ systems using the Cray compiler.

@MatthewMasarik-NOAA
Contributor Author

Great to hear @ukmo-ccbunney, thanks for the report.

@MatthewMasarik-NOAA
Contributor Author

orion UFS RT control_c384gdas_wav create/match baseline was successful.
orion-run_001_control_c384gdas_wav.log.txt
orion-rt_001_control_c384gdas_wav.log.txt

@MatthewMasarik-NOAA
Contributor Author

wcoss2 (cactus) UFS RT control_c384gdas_wav create/match baseline was successful.
cactus-run_001_control_c384gdas_wav.log.txt
cactus-rt_001_control_c384gdas_wav.log.txt


6 participants