oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack #2847
Conversation
|
@rickgrubin-noaa Is this PR ready for review, or is there further work to do? |
It's been ready since August 1 (initial filing). Branch is synced with |
Fix typo in MODULEPATH
Fix stack compiler type to load
Require libfabric/1.20.1
Fix stack name, force libfabric/1.20.1
Fixes for gaeac6 OS upgrade
Remove module reset for gaeac6
Updates for new OS
|
what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6. |
I was actually able to run with the old version of spack-stack after changing an environment variable. |
|
@rickgrubin-noaa Sorry, that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now? |
Yes; done and successfully tested last week. |
would be great to merge friday or shortly after. it's fine to combine this with another PR if that helps expedite the process and reduce resource usage. thank you! |
|
The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using LLVM-based compilers for both Fortran and C/C++? |
@DusanJovic-NOAA there are some known bugs in |
|
@rickgrubin-noaa When I run the If this is a baseline changing PR, we normally need you to run the full RT suite ( |
@gspetro-NOAA did control_c48 fail in the baseline comparison step or elsewhere? this is really only a compiler/lib change, but baselines could be altered. we can run the full suite without -c and share the logs if that will help. |
|
|
@ulmononian Yes, it failed in the comparison stage, which is expected for a compiler change, as @RatkoVasic-NOAA said. However, there was some confusion because this is listed as a non-baseline-changing PR. The reason the baselines change doesn't matter; if they change for any reason, it's a baseline-changing PR. On the CM side, Ratko's right that before merging, we would run with the -c option to regenerate baselines. Then there are a few other steps we take. However, this only occurs after the developer has run the full RT suite (
In short, what we need is for @rickgrubin-noaa to run the full RT suite without |
|
I ran the RTs on Gaea C6, and the tests that fail are expected failures. Failures are either:
Note that |
|
This has been combined into #2882 |
|
This has been uncombined with #2882. |
|
I have been examining the reproducibility issue w/ the DATM+GEFS configuration, seen when tested as part of PR #2882. I have this branch checked out and I've compiled the NG-GODAS app twice, using compile.sh, yielding two different "identical" executables. I have two identical sandboxes (created from the I can verify that:
EDIT: Note the relevant mapping was updated in PR #2733, which was merged July 29. |
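The check described above (two builds of the same source, run in two identical sandboxes) amounts to comparing the outputs of the two runs. A minimal sketch of one way to do that follows. It assumes a strict byte-for-byte comparison of every file under two run directories; the function names and directory layout are hypothetical, and this is not how rt.sh itself compares results against baselines.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_run_dirs(dir_a, dir_b):
    """Compare two run directories file by file (hypothetical helper).

    Returns relative paths that exist in only one directory and paths
    that exist in both but whose contents differ.
    """
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(dir_a) for p in dir_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(dir_b) for p in dir_b.rglob("*") if p.is_file()}
    differing = sorted(
        p.as_posix()
        for p in files_a & files_b
        if file_digest(dir_a / p) != file_digest(dir_b / p)
    )
    return {
        "only_in_a": sorted(p.as_posix() for p in files_a - files_b),
        "only_in_b": sorted(p.as_posix() for p in files_b - files_a),
        "differing": differing,
    }
```

Calling `compare_run_dirs` on the two sandbox run directories would list any output files that are not bitwise identical. Byte equality is stricter than a field-by-field numerical comparison, so files that embed timestamps or other run metadata may be flagged even when the model fields agree.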
|
@rickgrubin-noaa My tests indicate that the issue is real. The change in the WM which seems to trigger the failure was merged at the end of July. If you tested prior to that, then you would not have seen the issue w/ reproducibility. Do you have a case where you've reproduced a baseline for these tests after updating to the latest UWM? There could be an issue w/ ESMF's implementation of patch mapping with this compiler version. I haven't checked which other platforms might use this exact intel compiler. It may be an issue we need to raise w/ the ESMF team. |
I can re-run tests / generate a baseline; it seems that's something other folks have done as well, narrowing down the problem(s) to the small number of tests that fail to demonstrate numerically comparable results. To my understanding, all other platforms -- or at least orion / hercules, hera (for what that's worth), ursa, derecho -- have stacks built with Given that you state
it seems that WM folks are aware of the problem and will address a fix. |
|
From @rickgrubin-noaa: Note that the stack on which this PR was based remains in place and unchanged. Any and all testing to uncover why one host (gaea-c6) -- with the same versions of compiler and the same versions of stack components -- doesn't play nice can still be done, and branch https://github.com/rickgrubin-noaa/ufs-weather-model/tree/gaeac6-oneapi remains available. A new PR can be created when the model-side/reproducibility issue is figured out. @jkbk2004 Who should we notify to look into the reproducibility issue? Do you want me to open an issue w/ ESMF? |
|
Additional information from @rickgrubin-noaa : After a multitude of attempts with a oneAPI stack on gaea-c6, the best result has been two tests failing: That reduces the set of failures stated here; the two failures above are a subset of those in the noted comment. @jkbk2004 has said he will run these cases in DEBUG to see if he can find anything else. Then we will create an issue (once we know more specifically what the issue is). |
|
@gspetro-NOAA @rickgrubin-noaa can you confirm this pr was meant to be closed? if not, can we reopen? |
|
@rickgrubin-noaa can you keep syncing up the branch so I can test a bit? |
|
@jkbk2004 Rick is going to be out most of today, but he has suggested that we test by making the appropriate changes manually: The test env (stack) is: The changes to modulefiles/ufs_gaeac6*.lua and tests/{rt.sh, run_test.sh, compile.sh} from the original PR are valid to use once paths for the location of the env are substituted. He ran regression tests for ufs-weather-model@develop against the env noted above, and against All tests pass against ue-intel-2023.2.0 and there are two failures (mentioned above) against test4-ue-oneapi-2024.2.1: |
Commit Queue Requirements:
Description:
This PR updates the gaeac6 Intel modulefiles for spack-stack oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Documentation:
No documentation change is necessary, as the documentation does not specifically reference making changes to host-specific modulefiles, only how to load them, e.g. 3.5.1. Loading the Required Modules
Changes
Regression Test Changes (Please commit test_changes.list):
See attached file RegressionTests_weekly_gaeac6.log generated via ./rt.sh -a epic -r -w
Note that file test_changes.list was length=0 for ./rt.sh -a epic -r -c and ./rt.sh -a epic -r -w
./rt.sh -a epic -r -c followed by ./rt.sh -a epic -r -m generated 100% successful comparisons.
RegressionTests_weekly_gaeac6.log
Input data Changes:
Library Changes/Upgrades:
Required
Testing Log: