Skip to content

Faster, safer, compilation options targeted for HAFS operational upgrade#1554

Merged
jkbk2004 merged 36 commits into
ufs-community:developfrom
SamuelTrahanNOAA:merge-supafast
Jan 12, 2023
Merged

Faster, safer, compilation options targeted for HAFS operational upgrade#1554
jkbk2004 merged 36 commits into
ufs-community:developfrom
SamuelTrahanNOAA:merge-supafast

Conversation

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Dec 31, 2022

Description

Adds a set of faster, safer, optimization options planned for the HAFS operational upgrade whose code freeze is in January 2023. The FV3 uses the operational HRRR optimization options, without -ip, since that option doesn't speed up the model on WCOSS2, and it does slow down the compilation. To enable this, use -DFASTER=ON when running the ufs-weather-model cmake.

Also, regardless of cmake options, the HYCOM no longer uses -xSSE4.2 option that shouldn't be turned on in any WCOSS2 build.

One test, hafs_regional_storm_following_1nest_atm_ocn_wav, runs with an executable that uses the new compilation options. That means the results from that test will change. The default optimization options, and the actions of other -Dwhatever flags, are not changed at all. Therefore, no other tests should change results.

FV3 Compilation Option Change

The new FV3 compilation options use value-safe optimization via -fp-model precise, rather than -fp-model fast like the default FV3 optimization options. They are far faster because just about every optimization is turned on, other than the ones that are turned off automatically due to not being value safe. Most importantly, -fp-model consistent is turned off. That option is often misunderstood; the only place it serves any real purpose is Jet, a heterogeneous cluster, where it is possible to get the same results across several machines. The cost of that is a massive performance penalty, even with -fp-model fast. Any other floating-point consistency safety already comes in -fp-model precise.

Oh, what was that? You don't see -fp-model fast in the current optimization options?

It is on by default. The Intel compiler has four categories of fp-model options. One of them is the value safety vs. speed trade-off: precise, fast, and fast2. If you don't specify any of those options, you get -fp-model fast. The -fp-model consistent is another category, and really serves no purpose if fp-model precise is enabled. (If you're wondering, the often used -fp-model strict turns on several options, including -fp-model precise.)

These options speed up the HAFS by 12% on WCOSS2.

HYCOM Compilation Option Change

HYCOM uses AMD chips with the Intel compiler, and none of the Intel compiler settings do a great job of optimizing for AMD. The craype module already loads the best platform target settings. Specifying additional ones will slow down the model, or cause it to crash, if they have any effect at all.

This HYCOM compilation option change has only a small effect on the runtime since FV3 is the slow component in our current configuration.

Top of commit queue on: TBD

There are no sub-component PRs.

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • Changes are expected to the following tests: all tests that include "hafs" in their name.

The HYCOM compilation options have changed, so all tests that use HYCOM will have new results. That includes all tests with "hafs" in their name, but no other tests.

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Fixes #1553

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

It would be great if @BinLiu-NOAA @ZhanZhang-NOAA and @BijuThomas-NOAA could review this, as they're the HAFS developers that helped test these options on WCOSS2.

@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

It looks like you ran RT with new baselines. Since it does not look like baselines are changed can you please run against current baselines ./rt.sh -e > rt.out 2>&1 &. We would only expect the new RT to fail. You can attach that log to a comment instead of needing to push it to your repo if you'd like.

Also with any new RT, we need ORT to be run with ./opnReqTest -e -c thr,mpi,dcp,rst,bit,dbg,fhz -n hafs_regional_storm_following_1nest_atm_ocn_wav_supafast > ort.out 2>&1 & it will reject some of the tests as we don't support all ORT with regional applications. So you can keep removing those until it starts officially. Please put that log into a comment in this PR.

@BrianCurtis-NOAA BrianCurtis-NOAA added the No Baseline Change No Baseline Change label Jan 1, 2023
@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

It looks like you ran RT with new baselines. Since it does not look like baselines are changed can you please run against current baselines ./rt.sh -e > rt.out 2>&1 &. We would only expect the new RT to fail. You can attach that log to a comment instead of needing to push it to your repo if you'd like.

You are incorrect. I ran against old baselines by rsyncing the BL_DATE baseline directory into my directory. Only the new test has a new directory.

@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

It looks like you ran RT with new baselines. Since it does not look like baselines are changed can you please run against current baselines ./rt.sh -e > rt.out 2>&1 &. We would only expect the new RT to fail. You can attach that log to a comment instead of needing to push it to your repo if you'd like.

You are incorrect. I ran against old baselines by rsyncing the BL_DATE baseline directory into my directory. Only the new test has a new directory.

OK great. It didn't appear that way with your log, which confused me. Thanks.

@BrianCurtis-NOAA BrianCurtis-NOAA added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jan 1, 2023
@BrianCurtis-NOAA BrianCurtis-NOAA removed the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jan 1, 2023
@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

@jkbk2004 @DeniseWorthen Please review code before commit queue assignment.

@jkbk2004
Copy link
Copy Markdown
Collaborator

jkbk2004 commented Jan 1, 2023

So, performance impact is mostly expected on wcoss2, right?

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

So, performance impact is mostly expected on wcoss2, right?

No. It'll be faster everywhere.

Comment thread HYCOM-interface/CMakeLists.txt Outdated
Comment thread HYCOM-interface/CMakeLists.txt Outdated
Comment thread cmake/Intel.cmake
Comment thread tests/rt.conf
@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

Wasn't quite awake in my last post. There is no FV3 PR that I forgot to make. All of the build options are set at the ufs-weather-model level; nothing below that overrides them (I checked).

@junwang-noaa
Copy link
Copy Markdown
Collaborator

@SamuelTrahanNOAA Have your run the reproducibility tests with the fast compile options? Do we have run-to-run reproducibility? Will the results change with different threads? If I remember correctly, the hafs test does not reproduce with different MPI tasks or in restart runs yet.

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

Have your run the reproducibility tests with the fast compile options?

I did, and so did other people.

Do we have run-to-run reproducibility?

Yes.

If I remember correctly, the hafs test does not reproduce with different MPI tasks

No, it still doesn't. Several people tried several approaches to get this working, but we've run out of time for the implementation.

or in restart runs yet

I don't even know how to set up a coupled wave-ocean-atmosphere restart test.

Will the results change with different threads?

I don't think this has been tested with any optimization options (or lack thereof) because the main focus was MPI task decomposition. Someone else would have to do this test on Dogwood, since Cactus is unusable at the moment, and Acorn is too small.

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

The automatic github tests are failing due to an internal error:

cat: /home/runner/id_file: No such file or directory

I manually checked, and my repo is up-to-date with develop.

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

Have your run the reproducibility tests with the fast compile options?

@junwang-noaa Are you asking me to run all of the tests in rt.conf this way? I don't even know if they'll work. Some may fail to build, getting the dreaded "internal compiler error," which happens more often at higher optimization levels.

My only target for this was the two operational HAFS configurations, due to lack of time; the model took 50% more resources than it should, but now it runs in its window with a minute or three to spare. If people want to use this for other configurations, they'll have to test it first.

I'll try an "all suites" build with the SUPAFAST=ON and see what happens.

@DusanJovic-NOAA
Copy link
Copy Markdown
Collaborator

I also ran #1558 branch code on Gaea and cpld_control_c192_p8 test passed against the current baseline.

@jkbk2004
Copy link
Copy Markdown
Collaborator

Yeah, it's running on my side. Hopefully, we can finish today.

@jkbk2004
Copy link
Copy Markdown
Collaborator

@SamuelTrahanNOAA I like to confirm one thing. this pr only changes affect hafs_regional_storm_following_1nest_atm_ocn_wav. no imapct on all other hafs tests, right?

@jkbk2004
Copy link
Copy Markdown
Collaborator

@SamuelTrahanNOAA @BrianCurtis-NOAA All tests are done. the gaea issue was my mistake. All ok now. This pr affects only on hafs_regional_storm_following_1nest_atm_ocn_wav. We are ready to merge the pr.

Copy link
Copy Markdown
Collaborator

@BrianCurtis-NOAA BrianCurtis-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SamuelTrahanNOAA @BrianCurtis-NOAA All tests are done. the gaea issue was my mistake. All ok now. This pr affects only on hafs_regional_storm_following_1nest_atm_ocn_wav. We are ready to merge the pr.

@jkbk2004 We haven't confirmed from @SamuelTrahanNOAA why the hafs tests don't fail comparing to previous baselines. We need that to continue.

@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

Compared on WCOSS2 to previous baselines and hafs_regional_atm_ocn, hafs_regional_atm_ocn_wav, hafs_regional_storm_following_1nest_atm_ocn, and hafs_regional_storm_following_1nest_atm_ocn_wav all do not compare sucessfully. They all required new baselines.

@jkbk2004
Copy link
Copy Markdown
Collaborator

Compared on WCOSS2 to previous baselines and hafs_regional_atm_ocn, hafs_regional_atm_ocn_wav, hafs_regional_storm_following_1nest_atm_ocn, and hafs_regional_storm_following_1nest_atm_ocn_wav all do not compare sucessfully. They all required new baselines.

@BrianCurtis-NOAA cmake/Intel.cmake lines 23-27 same as develop branch. so all cases reproduce except hafs_regional_storm_following_1nest_atm_ocn_wav

@BinLiu-NOAA
Copy link
Copy Markdown
Contributor

@jkbk2004, @BrianCurtis-NOAA, there is a HYCOM build option change taking out "-xSSE4.2". So some other HAFS HYCOM coupled RTs could also get result changing.

@jkbk2004
Copy link
Copy Markdown
Collaborator

@jkbk2004, @BrianCurtis-NOAA, there is a HYCOM build option change taking out "-xSSE4.2". So some other HAFS HYCOM coupled RTs could also get result changing.

Let me confirm again with some hafs tests on hera. Weird thing is all other hafs cases not affacted.

@DeniseWorthen
Copy link
Copy Markdown
Collaborator

DeniseWorthen commented Jan 11, 2023

@BinLiu-NOAA Doesn't the removal of -xSSE4.2 only happen for Gaea? Meaning that testing on Hera won't tell you anything about how the tests will compare on Gaea?

@BinLiu-NOAA
Copy link
Copy Markdown
Contributor

@BinLiu-NOAA Doesn't the removal of -xSSE4.2 only happen for Gaea? Meaning that testing on Hera won't tell you anything about how the tests will compare on Gaea?

@DeniseWorthen, the removal of "-xSSE4.2" is meant to happen to at least on WCOSS2 (maybe also on Hera/Orion/Jet). I think @SamuelTrahanNOAA made Gaea to keep using "-xSSE4.2" (since it failed on Gaea after taking it out).

@DeniseWorthen
Copy link
Copy Markdown
Collaborator

@BinLiu-NOAA Thanks, I guess I had it reversed!

@SamuelTrahanNOAA
Copy link
Copy Markdown
Collaborator Author

What is the status of this PR?

@BinLiu-NOAA
Copy link
Copy Markdown
Contributor

@SamuelTrahanNOAA, I think we are waiting for @jkbk2004 to double check if the HAFS HYCOM coupling RTs also got result changing on Hera (which may or may not be the case). But, otherwise, it looks all the regression tests have gone through.

@jkbk2004
Copy link
Copy Markdown
Collaborator

Confirms hafs_regional_atm_ocn is not changing on hera. So all hafs cases on hera/orion/gaea/jet/cheyenne same results except hafs_regional_storm_following_1nest_atm_ocn_wav. So, we can move one at least if "-xSSE4.2" handling behaves as expected on wcoss2. @BinLiu-NOAA what do you think?

@BinLiu-NOAA
Copy link
Copy Markdown
Contributor

@jkbk2004, @BrianCurtis-NOAA, @SamuelTrahanNOAA, we would like to remove using "-xSSE4.2" option for all platforms (except on Gaea, because it fails). Even if it turns out to change results, we can just accept the fact of result changing for those HAFS-HYCOM coupled RTs. Thanks!

@jkbk2004
Copy link
Copy Markdown
Collaborator

I agree. we can investigate more across machines. I will create an issue.

@jkbk2004
Copy link
Copy Markdown
Collaborator

@BrianCurtis-NOAA let's go ahead to merge.

@BinLiu-NOAA
Copy link
Copy Markdown
Contributor

I agree. we can investigate more across machines. I will create an issue.

@jkbk2004, Thanks for confirming all HAFS HYCOM coupling RTs (except the one except hafs_regional_storm_following_1nest_atm_ocn_wav which modified to use the FASTER build option) can reproduce the previous baseline results on hera/orion/gaea/jet/cheyenne. And the HAFS HYCOM RTs are expected to change results on WCOSS2 (since on WCOSS2 we are not recommended to this -xSSE4.2 option). And that's why we are taking out this -xSSE4.2 option in this PR. So, for me, all the result changing the corresponding HAFS RTs are expected/understandable. Thus, I do not see any issue to be concerned/created. Thanks!

@jkbk2004
Copy link
Copy Markdown
Collaborator

created an issue #1565

@jkbk2004 jkbk2004 self-requested a review January 11, 2023 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HAFS planned operationational optimization options

10 participants