oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack by rickgrubin-noaa · Pull Request #2847 · ufs-community/ufs-weather-model

rickgrubin-noaa · 2025-08-01T20:36:38Z

Commit Queue Requirements:

Fill out all sections of this template.
All sub component pull requests have been reviewed by their code managers.
Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
Commit 'test_changes.list' from previous step

Description:

This PR updates the gaeac6 Intel modulefiles for spack-stack oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0

Commit Message:

* UFSWM - update gaeac6 Intel modulefiles for oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack

Priority:

Normal

Git Tracking

UFSWM:

Closes Update gaea-c6 Intel modulefiles to support spack-stack@1.9.2 oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 #2846

Sub component Pull Requests:

None

UFSWM Blocking Dependencies:

None

Documentation:

No documentation update is required for this PR (please explain).

No documentation change is necessary, as the documentation does not specifically reference making changes to host-specific modulefiles, only how to load them (e.g., 3.5.1. Loading the Required Modules).

Changes

Regression Test Changes (Please commit test_changes.list):

No Baseline Changes.

See attached file RegressionTests_weekly_gaeac6.log generated via ./rt.sh -a epic -r -w

Note that file test_changes.list was length=0 for ./rt.sh -a epic -r -c and ./rt.sh -a epic -r -w

./rt.sh -a epic -r -c followed by ./rt.sh -a epic -r -m generated 100% successful comparisons.

RegressionTests_weekly_gaeac6.log

Input data Changes:

None.

Library Changes/Upgrades:

Required
- Git Stack PR: Update gaea-c6 config for new CPE PrgEnv-intel/8.6.0 and oneapi@2024.2.1 compilers #1713

Testing Log:

gspetro-NOAA · 2025-09-18T00:36:24Z

@rickgrubin-noaa Is this PR ready for review, or is there further work to do?

rickgrubin-noaa · 2025-09-26T15:37:36Z

> @rickgrubin-noaa Is this PR ready for review, or is there further work to do?

It's been ready since August 1 (initial filing).

Branch is synced with HEAD of develop.

ulmononian · 2025-10-03T20:09:54Z

what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

JessicaMeixner-NOAA · 2025-10-03T20:12:43Z

> what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

I was actually able to run with the old version of spack-stack after changing an environment variable.

gspetro-NOAA · 2025-10-06T13:50:13Z

@rickgrubin-noaa Sorry-that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

rickgrubin-noaa · 2025-10-06T14:28:59Z

> @rickgrubin-noaa Sorry-that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

Yes; done and successfully tested last week.

ulmononian · 2025-10-06T22:17:54Z

> @ulmononian I was going to test on Ursa, but then it looked like @rickgrubin-noaa was adding several more commits, and the control_c48 I ran failed, likely because he was in the midst of updating. If the PR is complete, then I will get back to testing it, and we can move it to "Schedule" if everything passes. We have four PRs lined up already for this week, so it would go in on Friday at the earliest unless it can be combined with another PR. Let me know your thoughts on that.

would be great to merge friday or shortly after. it's fine to combine this with another PR if that helps expedite the process and reduce resource usage. thank you!

DusanJovic-NOAA · 2025-10-07T14:39:20Z

The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM-based compilers?

rickgrubin-noaa · 2025-10-07T15:43:06Z

> The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM-based compilers?

@DusanJovic-NOAA there are some known bugs in ifx@2025.2 that will be fixed in the next release; however, creating host-specific stack configurations for oneapi@2025.x is underway.

gspetro-NOAA · 2025-10-07T23:54:06Z

@rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?

If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

ulmononian · 2025-10-08T15:45:01Z

> @rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?
>
> If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA did control_c48 fail in the baseline comparison step or elsewhere? this is really only a compiler/lib change, but baselines could be altered. we can run the full suite without -c and share the logs if that will help.

RatkoVasic-NOAA · 2025-10-08T15:47:06Z

> @rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?
>
> If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA

Yes, this is a baseline-changing PR (a different compiler is expected to give different results).
Please use test_changes.list as a whole (replace every baseline).
I ran ./rt.sh -c followed by ./rt.sh -m, and it passed ALL regression tests. I believe the assigned CM will do the same.
You can disregard the -w option; it is used when you don't want to compare results. The -m option did the comparison.

gspetro-NOAA · 2025-10-08T16:26:49Z

@ulmononian Yes, it failed in the comparison stage, which is expected for a compiler change, as @RatkoVasic-NOAA said. However, there was some confusion because this is listed as a non-baseline changing PR. It doesn't matter the reason the baselines change; if they change for any reason, it's a baseline changing PR.

On the CM side, Ratko's right that before merging, we would run with the -c command to regenerate baselines. Then there are a few other steps we take. However, this only occurs after the developer has run the full RT suite (./rt.sh -a <account> usually w/-e or -r options, too) on a relevant RDHPCS (usually Ursa, but here, Gaea) and pushed the log and the test_changes.list file.

test_changes.list allows us to have a record of what baselines were changed in the PR, but it will only have the full list of changed tests if the full RT suite is run.
Pushing the failed log lets us see what the results of the developer's testing were (and the commands used). For example, here, we would expect failures in the log, but we would only expect comparison failures, not failures for other reasons, and it is important to verify that.

In short, what we need is for @rickgrubin-noaa to run the full RT suite without -c and push the resulting test_changes.list file and RegressionTest_gaea.log file. Then we can schedule it for the commit queue.

gspetro-NOAA · 2025-10-13T16:54:36Z

I ran the RTs on Gaea C6, and the tests that fail are expected failures. Failures are either:

UNABLE TO COMPLETE COMPARISON, which is expected with compiler/baseline changes
UNABLE TO START TEST, which is expected for restart tests when the control failed due to a comparison error.

Note that cpld_control_gfsv17_iau_intel failed to start, but it is a control test that depends on another control (cpld_control_gfsv17_intel), so this is also expected, even though it is not called a restart test.
Given that the failures are expected, we should be able to proceed with this PR.
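The triage described above can be sketched as a small filter. This is only an illustration: the two reason strings are the ones quoted in the comment, while the layout of each failure line, the `split_failures` helper, and the sample lines (including the "FAILED TO COMPILE" text) are assumptions made up for the example and are not taken from the actual regression test log.

```python
# Reason strings quoted in the comment above; both are treated as expected
# outcomes of a compiler/baseline change.
EXPECTED_REASONS = (
    "UNABLE TO COMPLETE COMPARISON",
    "UNABLE TO START TEST",
)


def split_failures(failure_lines):
    """Split failure lines into (expected, unexpected) lists."""
    expected, unexpected = [], []
    for line in failure_lines:
        if any(reason in line for reason in EXPECTED_REASONS):
            expected.append(line)
        else:
            unexpected.append(line)
    return expected, unexpected


# Hypothetical failure lines; the real log format may differ.
sample = [
    "hypothetical_control_intel: UNABLE TO COMPLETE COMPARISON",
    "cpld_control_gfsv17_iau_intel: UNABLE TO START TEST",
    "hypothetical_other_intel: FAILED TO COMPILE",
]
expected, unexpected = split_failures(sample)
print(len(expected), "expected;", len(unexpected), "need a closer look")
```

Anything that lands in the second list would be a failure for some other reason, which is the case the code managers want to rule out before scheduling.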

grantfirl · 2025-10-14T20:29:37Z

This has been combined into #2882

grantfirl · 2025-10-16T16:41:33Z

This has been uncombined with #2882.

DeniseWorthen · 2025-10-16T18:13:37Z

I have been examining the reproducibility issue w/ the DATM+GEFS configuration, seen when tested as part of PR #2882.

I have this branch checked out and I've compiled the NG-GODAS app twice, using compile.sh, yielding two different "identical" executables. I have two identical sandboxes (created from the datm_cdeps_control_gefs_intel run directory). I've added mediator history files to examine the coupled fields. I run both executables for 4 hours in separate run directories.

I can verify that:

ATM sends identical fields at the first coupling timestep.
ICE receives different fields for Sa_u and Sa_v at a total of 4 points at the first coupling timestep. All points are on the tripole seam. These initial differences then propagate and the two runs fail to reproduce each other.
If I switch the mapping for these two fields from the current patch method to bilinear, the two runs are reproducible.

EDIT: Note the relevant mapping was updated in PR #2733, which was merged July 29.
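The comparison step in this experiment (finding the handful of grid points where a field differs between two runs) might look roughly like the sketch below. It assumes the fields have already been read from the mediator history files into plain 2-D lists; the `differing_points` helper and the small arrays are made up for illustration and do not reflect the real grid or the tooling actually used.

```python
def differing_points(field_a, field_b):
    """Return (row, col) indices where two same-shaped 2-D fields differ."""
    if len(field_a) != len(field_b):
        raise ValueError("fields have a different number of rows")
    diffs = []
    for j, (row_a, row_b) in enumerate(zip(field_a, field_b)):
        if len(row_a) != len(row_b):
            raise ValueError("row %d has a different length" % j)
        for i, (a, b) in enumerate(zip(row_a, row_b)):
            # Exact comparison: reproducibility means identical values,
            # not values that agree within a tolerance.
            if a != b:
                diffs.append((j, i))
    return diffs


# Made-up 3x4 fields standing in for Sa_u from the two runs; the last row
# plays the role of the tripole seam.
run1 = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0]]
run2 = [row[:] for row in run1]
run2[2][1] = 10.000001

print(differing_points(run1, run2))  # [(2, 1)]
```

An exact comparison is used on purpose, since the question here is whether two builds of the same code reproduce each other exactly.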

DeniseWorthen · 2025-10-17T16:01:37Z

@rickgrubin-noaa My tests indicate that the issue is real. The change in the WM which seems to trigger the failure was merged at the end of July. If you tested prior to that, then you would not have seen the issue w/ reproducibility. Do you have a case where you've reproduced a baseline for these tests after updating to the latest UWM?

There could be an issue w/ ESMF's implementation of patch mapping with this compiler version. I haven't checked which other platforms might use this exact intel compiler. It may be an issue we need to raise w/ the ESMF team.

rickgrubin-tomorrow · 2025-10-17T16:28:35Z

> @rickgrubin-noaa My tests indicate that the issue is real. The change in the WM which seems to trigger the failure was merged at the end of July. If you tested prior to that, then you would not have seen the issue w/ reproducibility. Do you have a case where you've reproduced a baseline for these tests after updating to the latest UWM?
>
> There could be an issue w/ ESMF's implementation of patch mapping with this compiler version. I haven't checked which other platforms might use this exact intel compiler. It may be an issue we need to raise w/ the ESMF team.

I can re-run tests / generate a baseline; it seems that's something other folks have done as well, narrowing down the problem(s) to the small number of tests that fail to demonstrate numerically comparable results.

To my understanding, all other platforms -- or at least orion / hercules, hera (for what that's worth), ursa, derecho -- have stacks built with oneapi@2024.2.1. This comment seems to indicate just that.

Given that you state

> the change in the WM ... seems to trigger the failure

it seems that WM folks are aware of the problem and will address a fix.

gspetro-NOAA · 2025-10-20T19:16:39Z

From @rickgrubin-noaa : Note that the stack on which this PR was based remains, and remains unchanged. Any and all testing to uncover why one host (gaea-c6) -- with the same versions of compiler and the same versions of stack components -- doesn't play nice can still be done, and branch https://github.com/rickgrubin-noaa/ufs-weather-model/tree/gaeac6-oneapi remains available. A new PR can be created when the model-side/reproducibility issue is figured out.

@jkbk2004 Who should we notify to look into the reproducibility issue? Do you want me to open an issue w/ESMF?

gspetro-NOAA · 2025-11-07T17:14:41Z

Additional information from @rickgrubin-noaa :

After a multitude of attempts with a oneAPI stack on gaea-c6, the best result has been two tests failing:

datm_cdeps_control_gefs_intel
datm_cdeps_mx025_gefs_intel

That reduces the set of failures stated here; the two failures above are a subset of those in the noted comment.
The Intel Classic stack (/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/ue-intel-2023.2.0) successfully ran regression tests against ufs-weather-model@develop. Given that, perhaps that's where folks can start digging around -- perhaps there are some clues.

@jkbk2004 has said he will run these cases in DEBUG to see if he can find anything else. Then we will create an issue (once we know more specifically what the issue is).

jkbk2004 · 2025-11-12T15:20:15Z

@gspetro-NOAA @rickgrubin-noaa can you confirm this pr was meant to be closed? if not, can we reopen?

jkbk2004 · 2025-11-12T15:37:47Z

@rickgrubin-noaa can you keep syncing up branch? so I can test a bit.

gspetro-NOAA · 2025-11-12T16:38:13Z

@jkbk2004 Rick is going to be out most of today, but he has suggested that we test by making the appropriate changes manually:

The test env (stack) is:
/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/test4-ue-oneapi-2024.2.1

The changes to modulefiles/ufs_gaeac6*.lua and tests/{rt.sh, run_test.sh, compile.sh} as were in the original PR are valid to use once paths for the location of the env are substituted.

He ran regression tests for ufs-weather-model@develop against the env noted above, and against
the current stack: /ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/ue-intel-2023.2.0

All tests pass against ue-intel-2023.2.0 and there are two failures (mentioned above) against test4-ue-oneapi-2024.2.1:

datm_cdeps_control_gefs_intel
datm_cdeps_mx025_gefs_intel
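The manual path substitution suggested above could be sketched as a plain text replacement. The two environment paths are the ones given in this thread; everything else is an assumption for illustration: the `substitute_env` helper is hypothetical, the sample modulefile line is invented, and which path string actually appears in modulefiles/ufs_gaeac6*.lua or the tests/ scripts is not shown here.

```python
# Environment paths quoted in the comment above.
OLD_ENV = "/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/ue-intel-2023.2.0"
NEW_ENV = "/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/test4-ue-oneapi-2024.2.1"


def substitute_env(text, old_env=OLD_ENV, new_env=NEW_ENV):
    """Return text with every occurrence of old_env replaced by new_env."""
    return text.replace(old_env, new_env)


# Hypothetical modulefile line; the real file contents are not shown in
# this thread, and the trailing subdirectory is a placeholder.
line = 'prepend_path("MODULEPATH", "' + OLD_ENV + '/hypothetical/modulefiles")'
print(substitute_env(line))
```

The same replacement would have to be applied to each file that references the environment, and the result checked by hand, since the compiler type and module names also differ between the two stacks.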

rickgrubin-noaa and others added 2 commits July 31, 2025 16:14

oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack

bc90377

Merge branch 'ufs-community:develop' into gaeac6-oneapi

bfce9c7

MichaelLueken mentioned this pull request Aug 4, 2025

ESMF 8.8.0 issues with spack-stack 1.9.2 on all machines when QUILTING = false JCSDA/spack-stack#1730

Closed

Merge branch 'ufs-community:develop' into gaeac6-oneapi

c0247f2

jkbk2004 mentioned this pull request Aug 6, 2025

Add two-way ocean-wave coupling feature to the HAFS applications #2584

Merged

14 tasks

rickgrubin-noaa added 5 commits August 11, 2025 14:50

Merge branch 'ufs-community:develop' into gaeac6-oneapi

c967705

Merge branch 'ufs-community:develop' into gaeac6-oneapi

60b5a30

Merge branch 'ufs-community:develop' into gaeac6-oneapi

e330840

Merge branch 'ufs-community:develop' into gaeac6-oneapi

1a839c4

Merge branch 'ufs-community:develop' into gaeac6-oneapi

28db3e2

gspetro-NOAA added this to PRs to Process Sep 18, 2025

gspetro-NOAA moved this to Evaluating in PRs to Process Sep 18, 2025

gspetro-NOAA added the No Baseline Change label Sep 18, 2025

Merge branch 'ufs-community:develop' into gaeac6-oneapi

ecba893

gspetro-NOAA moved this from Evaluating to Review in PRs to Process Sep 29, 2025

rickgrubin-noaa added 10 commits September 30, 2025 09:38

Merge branch 'ufs-community:develop' into gaeac6-oneapi

c61b9a0

Update ufs_gaeac6.intel.lua

3c49f43

Fix typo in MODULEPATH

Update ufs_gaeac6.intel.lua

34a5cbb

Fix stack compiler type to load

Update ufs_gaeac6.intel.lua

bde80ab

Require libfabric/1.20.1

Update ufs_gaeac6.intelllvm.lua

06696c9

Fix stack name, force libfabric/1.20.1

Update ufs_gaeac6.intelllvm.lua

7b021b6

Update rt.sh

333c416

Fixes for gaeac6 OS upgrade

Update compile.sh

939807e

Remove module reset for gaeac6

Update run_test.sh

04ca92f

Updates for new OS

Update rt.sh

ac7be13

RatkoVasic-NOAA mentioned this pull request Oct 1, 2025

[INSTALL]: Reinstall spack-stack 1.9.1, 1.9.2 and 1.9.3 on C6 JCSDA/spack-stack#1779

Closed

Merge branch 'develop' into gaeac6-oneapi

da73809

gspetro-NOAA added the Baseline Updates label (current baselines will be updated) and removed the No Baseline Change label Oct 8, 2025

add gaea c6 logs and test_changes.list

bb44b58

gspetro-NOAA moved this from Review to Schedule in PRs to Process Oct 13, 2025

gspetro-NOAA mentioned this pull request Oct 14, 2025

Sync from NCAR/main + Thompson params #2882

Merged

14 tasks

gspetro-NOAA moved this from Schedule to Not Ready in PRs to Process Oct 16, 2025

merge ufs-weather-model@develop

21c7a77

rickgrubin-noaa closed this Oct 20, 2025

rickgrubin-noaa deleted the gaeac6-oneapi branch October 20, 2025 18:36
