Add/fix build capability for Gaea-C5, Gaea-C6, and container #800
RussTreadon-NOAA
left a comment
Looks OK to me.
Two comments:
- I can't test the container.
- Additional changes are needed to run GSI/EnKF ctests on Gaea-C6. @DavidBurrows-NCO, do you plan to add these changes to this PR, or will a new issue and PR be opened to activate GSI/EnKF ctests on Gaea-C6?
target_link_libraries(gsi_fortran_obj PUBLIC w3emc::w3emc_d)
target_link_libraries(gsi_fortran_obj PUBLIC sp::sp_d)
target_link_libraries(gsi_fortran_obj PUBLIC bufr::bufr_d)
if(DEFINED ENV{USE_BUFR4})
Looks OK to me. Cross-checking with @DavidHuber-NOAA: PR #791 upgrades to bufr/12.1.0. I'm not sure how the bufr logic added here might affect Dave's PR.
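For context on the `if(DEFINED ENV{USE_BUFR4})` guard: per the CMake documentation, `DEFINED` tests only whether the variable exists, and its value does not matter. A minimal shell sketch of the same set/unset distinction is below. The helper name `is_env_defined` is hypothetical and not part of this PR.

```shell
# Report whether a variable is present in the environment, whatever its value.
# This mirrors the set/unset distinction tested by if(DEFINED ENV{...}).
# is_env_defined is a hypothetical helper name used only for illustration.
is_env_defined() {
    # printenv exits 0 only when the named variable exists in the environment
    printenv "$1" >/dev/null 2>&1
}
```

For example, after `export USE_BUFR4=1` (the value `1` is an arbitrary choice here), `is_env_defined USE_BUFR4` succeeds. After `unset USE_BUFR4` it fails. What the build does in each case depends on the CMake logic in this PR.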
|
@RussTreadon-NOAA Thanks for taking a look
|
Hello @RussTreadon-NOAA. I've made good progress on the GSI regression tests. I'm currently using the same walltime/processor configuration for C6 as for C5. This can be adjusted, but here are the current results: rrfs_3denvar_rdasens_loproc_updat keeps hitting the wall clock limit, even after I increased it to 60 minutes. It freezes in the same spot. I've attached a text file of the output log. I don't see anything useful in the working directory. Please let me know your thoughts. Thanks!
|
Thank you @DavidBurrows-NCO for the update. This looks good. We've had problems with the |
|
C6 ctest results
@DavidBurrows-NCO, I obtained similar C6 ctest results
C5 ctest results
Ctest behavior on C5 is very different. The following tests ran and failed: rtma global_enkf Line 134 of global_4denvar I'm not sure what's going on here. Do we need to (un)set certain environment variables on C5? The remaining tests were all killed by the system after reaching the specified wall clock limit. Interestingly, the low task count The C5 |
|
@RussTreadon-NOAA I vaguely recall something like this previously, like 6-9 months ago, where GSI would run but crash at the very end of execution. Do you recall this, or am I imagining it? |
|
@CoryMartin-NOAA , this sounds vaguely familiar. Let me check GSI issues and PRs for clues. The rrfs failure is a known problem. |
|
Hi @RussTreadon-NOAA. I know it's not the solution you want, but I adjusted the node/processor configuration on C6 to match Hera, and rrfs was successful: I'm working on C5 right now.
|
@DavidBurrows-NCO : Changing the task count is consistent with the regional DA team recommendation. Thank you for looking at the C5 failures. |
|
@TingLei-NOAA informed me that he will be on leave and is unable to update the task count for @DavidBurrows-NCO , please commit the modified |
|
@DavidBurrows-NCO , we need two peer reviews for GSI PRs. My review doesn't count as a peer review. Who would you like to review this PR? |
ofile=$DATA/subout$$
>$ofile
chmod 777 $ofile
export FI_VERBS_PREFER_XRC=0
Does this setting resolve what appears to be mpi_finalize problems on C5?
> Does this setting resolve what appears to be mpi_finalize problems on C5?
It appears so. Here is the notice from Seth Underwood with Gaea C5: "After the C5 update, users reported that some jobs failed during the MPI_Finalize call. We have alerted ORNL and HPE. HPE has suggested setting the environment variable FI_VERBS_PREFER_XRC=0 in the run script (setenv FI_VERBS_PREFER_XRC 0, for csh; export FI_VERBS_PREFER_XRC=0). This has resolved the error in our tests. Please add this variable to your run script(s) if you also hit this error. Please note that we do not see any issues preemptively setting this environment variable."
Now that I think the MPI_Finalize issue is resolved, I am going to adjust the resources and test a little more. I'll let you know when I have my final changes in place for you to look over.
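As an illustration of the workaround quoted above, here is a minimal sketch of a run-script fragment that applies the setting only on Gaea C5. The function name `set_c5_mpi_workaround`, its `machine` argument, and the platform strings it matches are assumptions for illustration. The diff in this PR sets the variable directly in the submit script.

```shell
# Apply the HPE-suggested setting only when running on Gaea C5.
# The function name, the machine argument, and the matched platform
# strings are hypothetical; real scripts may detect the platform differently.
set_c5_mpi_workaround() {
    machine=$1
    case "$machine" in
        gaeac5|gaea-c5|GAEAC5)
            # Setting quoted from the Gaea C5 notice for the MPI_Finalize failures
            export FI_VERBS_PREFER_XRC=0
            ;;
    esac
}
```

Since the notice reports no issues with setting the variable preemptively, an unconditional `export FI_VERBS_PREFER_XRC=0` in the C5 submit script is the simpler choice. The conditional form is mainly useful when one script is shared across machines.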
|
Excellent! Thank you @DavidBurrows-NCO for working through various issues. |
|
@RussTreadon-NOAA Quick question...if a particular test fails, but I check all the stdout, and they return rc=0...does that typically mean the job took too long to run? I assume there are set run time values for each test? Thanks |
|
Unfortunately, the checks in GSI ctests are not very robust. Some of the timing and memory usage checks can yield false positives: a test is reported as Failed, but a check of the results does not indicate a problem. Since GSI has no code manager and we are transitioning to JEDI, it's unlikely the GSI ctests will be cleaned up to yield more consistent results. If there's a particular failure you'd like me to look at, give me the path or the rundir and I'll take a look.
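To rule out a genuine job failure before treating a Failed test as a false positive, the stdout files can be scanned for nonzero return codes. The sketch below assumes each stdout file reports status on a line containing `rc=<number>`. That pattern and the helper name `find_nonzero_rc` are assumptions, so adjust the pattern to match the actual logs.

```shell
# Print the name of every file that contains a nonzero "rc=<n>" entry.
# The "rc=<n>" pattern is an assumed log format, not confirmed by the GSI scripts.
find_nonzero_rc() {
    for f in "$@"; do
        # Match rc= at a word start, followed by a number beginning with 1-9
        if grep -Eq '(^|[^[:alnum:]_])rc=[[:space:]]*[1-9][0-9]*' "$f"; then
            echo "$f"
        fi
    done
}
```

If this prints nothing for a failed test's stdout files, the failure more likely comes from one of the timing or memory checks than from the job itself.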
3b98cf4
|
Gaea C5 and C6 ctests
Gaea C5
The Here are the Indeed the loproc_updat wall time is considerably greater than the loproc_contrl wall time. Note, however, that the updat and contrl tests use the same
Gaea C6
The Here are the wall times from the various tests For both tests the hiproc_updat wall time is notably greater than the hiproc_contrl wall time. This is interesting. The updat and contrl runs use the same executables. The wall time differences reflect differences in system load, I/O speed, or other aspects of the system. This is not a fatal failure.
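To make the point about timing checks concrete, here is a minimal sketch of a tolerance-based wall time comparison. It is not the check used by the GSI regression scripts. The function name and the tolerance argument are hypothetical, and all values are whole seconds.

```shell
# Return 0 (pass) when the updat wall time is within a tolerance of contrl.
# Arguments are whole seconds; the tolerance is a hypothetical choice and
# this is not the actual GSI regression-test logic.
walltime_within_tolerance() {
    contrl_secs=$1
    updat_secs=$2
    tol_secs=$3
    [ "$updat_secs" -le $((contrl_secs + tol_secs)) ]
}
```

With identical executables, a call such as `walltime_within_tolerance 300 400 30` still fails if system load or I/O speed slows the updat run. That is why a failure of this kind is not treated as fatal.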
RussTreadon-NOAA
left a comment
Looks good. Please reduce the Gaea C6 wall clock limit for rrfs_3denvar_rdasens from 0:60:00 to 0:15:00.
|
This PR is awaiting the return of WCOSS2 to developers so WCOSS2 ctests can be run. Assuming reduction of the Gaea C6 |
RussTreadon-NOAA
left a comment
Looks good to me. Approve.
8cf6434
|
@DavidBurrows-NCO, NCO said the Cactus upgrade encountered some issues, which they are working through. I'm not sure when the development WCOSS2 machine will come back online. While we wait, I installed this PR on both Dogwood and Cactus. I'm ready to run as soon as either machine becomes dev.
|
@RussTreadon-NOAA Thanks for the info, and thanks for your quick back and forth with this PR. Have a good weekend! |
|
WCOSS2 ctests Install |
Resolves #799
Type of change
How Has This Been Tested?
Cloned and built on Gaea-C5, Gaea-C6, and in a container.
Checklist