oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack#2847

Closed
rickgrubin-noaa wants to merge 22 commits into ufs-community:develop from rickgrubin-noaa:gaeac6-oneapi

Conversation

@rickgrubin-noaa
Collaborator

@rickgrubin-noaa rickgrubin-noaa commented Aug 1, 2025

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR updates the gaeac6 Intel modulefiles for spack-stack oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0

Commit Message:

* UFSWM - update gaeac6 Intel modulefiles for oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0 stack

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Documentation:

  • No documentation update is required for this PR (please explain).

No documentation change is necessary, as the documentation does not specifically reference making changes to host-specific modulefiles, only how to load them (e.g., section 3.5.1, Loading the Required Modules).


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

See attached file RegressionTests_weekly_gaeac6.log generated via ./rt.sh -a epic -r -w

Note that file test_changes.list was length=0 for ./rt.sh -a epic -r -c and ./rt.sh -a epic -r -w

./rt.sh -a epic -r -c followed by ./rt.sh -a epic -r -m generated 100% successful comparisons.

RegressionTests_weekly_gaeac6.log

Input data Changes:

  • None.

Library Changes/Upgrades:


Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • GaeaC6
    • Derecho
    • Ursa
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@gspetro-NOAA
Collaborator

@rickgrubin-noaa Is this PR ready for review, or is there further work to do?

gspetro-NOAA moved this to Evaluating in PRs to Process Sep 18, 2025
gspetro-NOAA added the No Baseline Change label Sep 18, 2025
@rickgrubin-noaa
Collaborator Author

> @rickgrubin-noaa Is this PR ready for review, or is there further work to do?

It's been ready since August 1 (initial filing).

Branch is synced with HEAD of develop.

gspetro-NOAA moved this from Evaluating to Review in PRs to Process Sep 29, 2025
Fix typo in MODULEPATH
Fix stack compiler type to load
Require libfabric/1.20.1
Fix stack name, force libfabric/1.20.1
Fixes for gaeac6 OS upgrade
Remove module reset for gaeac6
Updates for new OS
@ulmononian
Collaborator

what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

@JessicaMeixner-NOAA
Collaborator

> what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

I was actually able to run with the old version of spack-stack after changing an environment variable.

@gspetro-NOAA
Collaborator

@rickgrubin-noaa Sorry, that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

@rickgrubin-noaa
Collaborator Author

> @rickgrubin-noaa Sorry, that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

Yes; done and successfully tested last week.

@ulmononian
Collaborator

> @ulmononian I was going to test on Ursa, but then it looked like @rickgrubin-noaa was adding several more commits, and the control_c48 I ran failed, likely because he was in the midst of updating. If the PR is complete, then I will get back to testing it, and we can move it to "Schedule" if everything passes. We have four PRs lined up already for this week, so it would go in on Friday at the earliest unless it can be combined with another PR. Let me know your thoughts on that.

would be great to merge friday or shortly after. it's fine to combine this with another PR if that helps expedite the process and reduce resource usage. thank you!

@DusanJovic-NOAA
Collaborator

The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM-based compilers?

@rickgrubin-noaa
Collaborator Author

> The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM-based compilers?

@DusanJovic-NOAA there are some known bugs in ifx@2025.2 that will be fixed in the next release; however, creating host-specific stack configurations for oneapi@2025.x is underway.

@gspetro-NOAA
Collaborator

gspetro-NOAA commented Oct 7, 2025

@rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?

If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@ulmononian
Collaborator

> @rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?
>
> If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA did control_c48 fail in the baseline comparison step or elsewhere? this is really only a compiler/lib change, but baselines could be altered. we can run the full suite without -c and share the logs if that will help.

@RatkoVasic-NOAA
Collaborator

> @rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?
>
> If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA

  • Yes, this is a baseline-changing PR (a different compiler, so different results are expected).
  • Please use test_changes.list as a whole (replace every baseline).
  • I ran ./rt.sh -c followed by ./rt.sh -m, and it passed ALL regression tests. I believe the assigned CM will do the same.
  • You can disregard the -w option; it is used when you don't want to compare results. The -m option did the comparison.

gspetro-NOAA added the Baseline Updates label (current baselines will be updated) and removed the No Baseline Change label Oct 8, 2025
@gspetro-NOAA
Collaborator

@ulmononian Yes, it failed in the comparison stage, which is expected for a compiler change, as @RatkoVasic-NOAA said. However, there was some confusion because this is listed as a non-baseline-changing PR. The reason the baselines change doesn't matter; if they change for any reason, it's a baseline-changing PR.

On the CM side, Ratko's right that before merging, we would run with the -c command to regenerate baselines. Then there are a few other steps we take. However, this only occurs after the developer has run the full RT suite (./rt.sh -a <account> usually w/-e or -r options, too) on a relevant RDHPCS (usually Ursa, but here, Gaea) and pushed the log and the test_changes.list file.

  • test_changes.list allows us to have a record of what baselines were changed in the PR, but it will only have the full list of changed tests if the full RT suite is run.
  • Pushing the failed log lets us see what the results of the developer's testing were (and the commands used). For example, here, we would expect failures in the log, but we would only expect comparison failures, not failures for other reasons, and it is important to verify that.

In short, what we need is for @rickgrubin-noaa to run the full RT suite without -c and push the resulting test_changes.list file and RegressionTest_gaea.log file. Then we can schedule it for the commit queue.

@gspetro-NOAA
Collaborator

I ran the RTs on Gaea C6, and the tests that fail are expected failures. Failures are either:

  • UNABLE TO COMPLETE COMPARISON, which is expected with compiler/baseline changes
  • UNABLE TO START TEST, which is expected for restart tests when the control failed due to a comparison error.

Note that cpld_control_gfsv17_iau_intel failed to start, but it is a control test that depends on another control (cpld_control_gfsv17_intel), so this is also expected, even though it is not called a restart test.
Given that the failures are expected, we should be able to proceed with this PR.
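
The triage described above can be sketched as a small script. This is a minimal, hypothetical example: it assumes each failed test appears on one log line that contains the word FAIL and a failure reason. The log line layout, the `triage` function name, the "RUN DID NOT COMPLETE" reason, and the `hypothetical_*` test names are assumptions for illustration only, not the actual rt.sh log format.

```python
# Hypothetical sketch: the log line layout below is an assumption,
# not the real rt.sh log format.
EXPECTED_FAILURES = (
    "UNABLE TO COMPLETE COMPARISON",  # expected with compiler/baseline changes
    "UNABLE TO START TEST",           # expected when a prerequisite control failed
)


def triage(log_lines):
    """Split failure lines into expected and unexpected groups."""
    expected, unexpected = [], []
    for line in log_lines:
        if "FAIL" not in line:
            continue  # skip lines that do not report a failure
        if any(reason in line for reason in EXPECTED_FAILURES):
            expected.append(line)
        else:
            unexpected.append(line)
    return expected, unexpected


# Made-up sample lines; only the two quoted failure reasons and the two
# cpld_control_gfsv17 test names come from the discussion above.
sample_log = [
    "FAILED: UNABLE TO COMPLETE COMPARISON: TEST 'cpld_control_gfsv17_intel'",
    "FAILED: UNABLE TO START TEST: TEST 'cpld_control_gfsv17_iau_intel'",
    "FAILED: RUN DID NOT COMPLETE: TEST 'hypothetical_test_intel'",
    "PASS: TEST 'another_hypothetical_test_intel'",
]

expected, unexpected = triage(sample_log)
print(len(expected), "expected failures;", len(unexpected), "need a closer look")
```

Anything landing in the `unexpected` group would be the kind of failure that needs review before the PR can be scheduled.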

gspetro-NOAA moved this from Review to Schedule in PRs to Process Oct 13, 2025
@grantfirl
Collaborator

This has been combined into #2882

gspetro-NOAA moved this from Schedule to Not Ready in PRs to Process Oct 16, 2025
@grantfirl
Collaborator

This has been uncombined with #2882.

@DeniseWorthen
Collaborator

DeniseWorthen commented Oct 16, 2025

I have been examining the reproducibility issue w/ the DATM+GEFS configuration, seen when tested as part of PR #2882.

I have this branch checked out and I've compiled the NG-GODAS app twice, using compile.sh, yielding two different "identical" executables. I have two identical sandboxes (created from the datm_cdeps_control_gefs_intel run directory). I've added mediator history files to examine the coupled fields. I run both executables for 4 hours in separate run directories.

I can verify that:

  1. ATM sends identical fields at the first coupling timestep.
  2. ICE receives different fields for Sa_u and Sa_v at a total of 4 points at the first coupling timestep. All points are on the tripole seam. These initial differences then propagate and the two runs fail to reproduce each other.
  3. If I switch the mapping for these two fields from the current patch method to bilinear, the two runs are reproducible.

EDIT: Note the relevant mapping was updated in PR #2733, which was merged July 29.
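
The point-by-point comparison in step 2 can be illustrated with a minimal sketch. It assumes the same field from the two runs' mediator history files has already been read into plain 2-D lists; the `differing_points` function name, the toy grids, and all values are hypothetical and are not taken from the actual runs.

```python
def differing_points(field_a, field_b):
    """Return (row, col) indices where two same-shaped 2-D fields differ."""
    if len(field_a) != len(field_b):
        raise ValueError("fields have different numbers of rows")
    diffs = []
    for j, (row_a, row_b) in enumerate(zip(field_a, field_b)):
        if len(row_a) != len(row_b):
            raise ValueError(f"row {j} has different lengths")
        for i, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:  # exact comparison: reproducible runs must match exactly
                diffs.append((j, i))
    return diffs


# Toy 3x4 fields standing in for one coupled field from two runs (made-up values).
run1 = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0]]
run2 = [row[:] for row in run1]
run2[2][1] = 10.000001  # perturb a single point on the last row

print(differing_points(run1, run2))
```

The comparison is deliberately exact rather than tolerance-based, since the question here is whether two builds reproduce each other, not whether they are numerically close.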

@DeniseWorthen
Collaborator

@rickgrubin-noaa My tests indicate that the issue is real. The change in the WM which seems to trigger the failure was merged at the end of July. If you tested prior to that, then you would not have seen the issue w/ reproducibility. Do you have a case where you've reproduced a baseline for these tests after updating to the latest UWM?

There could be an issue w/ ESMF's implementation of patch mapping with this compiler version. I haven't checked which other platforms might use this exact intel compiler. It may be an issue we need to raise w/ the ESMF team.

@rickgrubin-tomorrow
Contributor

> @rickgrubin-noaa My tests indicate that the issue is real. The change in the WM which seems to trigger the failure was merged at the end of July. If you tested prior to that, then you would not have seen the issue w/ reproducibility. Do you have a case where you've reproduced a baseline for these tests after updating to the latest UWM?
>
> There could be an issue w/ ESMF's implementation of patch mapping with this compiler version. I haven't checked which other platforms might use this exact intel compiler. It may be an issue we need to raise w/ the ESMF team.

I can re-run tests / generate a baseline; it seems that's something other folks have done as well, narrowing down the problem(s) to the small number of tests that fail to demonstrate numerically comparable results.

To my understanding, all other platforms -- or at least orion / hercules, hera (for what that's worth), ursa, derecho -- have stacks built with oneapi@2024.2.1. This comment seems to indicate just that.

Given that you state

> the change in the WM ... seems to trigger the failure

it seems that WM folks are aware of the problem and will address a fix.

rickgrubin-noaa deleted the gaeac6-oneapi branch October 20, 2025 18:36
@gspetro-NOAA
Collaborator

From @rickgrubin-noaa : Note that the stack on which this PR was based remains, and remains unchanged. Any and all testing to uncover why one host (gaea-c6) -- with the same versions of compiler and the same versions of stack components -- doesn't play nice can still be done, and branch https://github.com/rickgrubin-noaa/ufs-weather-model/tree/gaeac6-oneapi remains available. A new PR can be created when the model-side/reproducibility issue is figured out.

@jkbk2004 Who should we notify to look into the reproducibility issue? Do you want me to open an issue w/ESMF?

@gspetro-NOAA
Collaborator

Additional information from @rickgrubin-noaa :

After a multitude of attempts with a oneAPI stack on gaea-c6, the best result has been two tests failing:

datm_cdeps_control_gefs_intel
datm_cdeps_mx025_gefs_intel

That reduces the set of failures stated here; the two failures above are a subset of those in the noted comment.
The Intel Classic stack (/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/ue-intel-2023.2.0) successfully ran regression tests against ufs-weather-model@develop. Given that, perhaps that's where folks can start digging around -- perhaps there are some clues.

@jkbk2004 has said he will run these cases in DEBUG to see if he can find anything else. Then we will create an issue (once we know more specifically what the issue is).

@jkbk2004
Collaborator

@gspetro-NOAA @rickgrubin-noaa can you confirm this pr was meant to be closed? if not, can we reopen?

@jkbk2004
Collaborator

@rickgrubin-noaa can you keep the branch synced so I can test a bit?

@gspetro-NOAA
Collaborator

gspetro-NOAA commented Nov 12, 2025

@jkbk2004 Rick is going to be out most of today, but he has suggested that we test by making the appropriate changes manually:

The test env (stack) is:
/ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/test4-ue-oneapi-2024.2.1

The changes to modulefiles/ufs_gaeac6*.lua and tests/{rt.sh, run_test.sh, compile.sh} that were in the original PR are valid to use once the paths to the location of the env are substituted.

He ran regression tests for ufs-weather-model@develop against the env noted above, and against the current stack: /ncrc/proj/epic/spack-stack/c6/spack-stack-1.9.2/envs/ue-intel-2023.2.0

All tests pass against ue-intel-2023.2.0 and there are two failures (mentioned above) against test4-ue-oneapi-2024.2.1:

datm_cdeps_control_gefs_intel
datm_cdeps_mx025_gefs_intel

Labels

Baseline Updates Current baselines will be updated.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Update gaea-c6 Intel modulefiles to support spack-stack@1.9.2 oneapi@2024.2.1 / CPE PrgEnv-intel/8.6.0