Skip to content

Compile CICE with safe CPU instructions#1563

Merged
jkbk2004 merged 13 commits into
ufs-community:developfrom
DavidHuber-NOAA:remove_xhost_cice
Jan 20, 2023
Merged

Compile CICE with safe CPU instructions#1563
jkbk2004 merged 13 commits into
ufs-community:developfrom
DavidHuber-NOAA:remove_xhost_cice

Conversation

@DavidHuber-NOAA
Copy link
Copy Markdown
Collaborator

@DavidHuber-NOAA DavidHuber-NOAA commented Jan 10, 2023

Description

The CICE component is currently compiled with Intel using the -xHOST flag, which specifies CPU instructions based on the node performing the compilation. In the case of systems like Jet, this causes an issue as the head nodes are comprised of Skylake architecture capable of interpreting certain instructions like AVX-512, while the xjet partition is comprised of Haswell cores that cannot interpret AVX-512 instructions. Thus, when attempting to run a coupled forecast on xjet after compiling on the head node, the UFS will crash.

Note that this could also be an issue for Hercules/Orion (if Hercules-compiled executables should run on Orion).

To fix this, -xHOST has been removed from the CICE interface CMake file so that the instructions specified in configure_jet.intel.cmake and Intel.cmake can be inherited. This does change regression test results for all coupled Intel tests, even on Hera, as -xHOST builds with, among other things, AVX-512 instructions while the new configuration will send AVX-2 instructions.

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

All GNU RTs should not see a change. The following Intel, coupled RTs will change due to new compiler instructions:
cpld_control_p8_mixedmode
cpld_control_gfsv17
cpld_control_p8
cpld_2threads_p8
cpld_esmfthreads_p8
cpld_decomp_p8
cpld_mpi_p8
cpld_control_ciceC_p8
cpld_control_c192_p8
cpld_bmark_p8
cpld_control_noaero_p8
cpld_control_nowave_noaero_p8
cpld_control_noaero_p8_agrid
cpld_control_c48
cpld_warmstart_c48
datm_cdeps_control_cfsr
datm_cdeps_control_gefs
datm_cdeps_iau_gefs
datm_cdeps_stochy_gefs
datm_cdeps_ciceC_cfsr
datm_cdeps_bulk_cfsr
datm_cdeps_bulk_gefs
datm_cdeps_mx025_cfsr
datm_cdeps_mx025_gefs
datm_cdeps_multiple_files_cfsr
datm_cdeps_3072x1536_cfsr
datm_cdeps_gfs

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE (Change to build only)
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Closes #1262

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@DeniseWorthen
Copy link
Copy Markdown
Collaborator

@DavidHuber-NOAA Are any performance impacts expected from this fix?

@DavidHuber-NOAA
Copy link
Copy Markdown
Collaborator Author

@DeniseWorthen I would expect some minor performance changes, but nothing terribly significant. Having run the full RT suite on xjet, kjet, and Hera, I can say that no tests failed due to time limit.

@DeniseWorthen DeniseWorthen added the Baseline Updates Current baselines will be updated. label Jan 10, 2023
@BrianCurtis-NOAA BrianCurtis-NOAA added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jan 17, 2023
@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

BrianCurtis-NOAA commented Jan 17, 2023

There are no changes in CICE repo, this is strictly building on UFSWM. So this should be ready for commit queue.

@DavidHuber-NOAA or @DeniseWorthen if you can double check on the reproducibility (create baselines then check against those baselines) on one of the changed tests that would be helpful to double check. We will attempt to start running RT's later this afternoon.

@DavidHuber-NOAA
Copy link
Copy Markdown
Collaborator Author

@BrianCurtis-NOAA Sure, I will test cpld_control_c48.

@DavidHuber-NOAA
Copy link
Copy Markdown
Collaborator Author

@BrianCurtis-NOAA I rebaselined cpld_control_c48 and then tested against it and results were identical.

@jkbk2004
Copy link
Copy Markdown
Collaborator

I am adding new BL_DATE.

@jkbk2004
Copy link
Copy Markdown
Collaborator

We will start testing from hera.

on-behalf-of @ufs-community <brian.curtis@noaa.gov>
on-behalf-of @ufs-community <brian.curtis@noaa.gov>
@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

Automated RT Failure Notification
Machine: jet
Compiler: intel
Job: BL
[BL] Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/1191748279/20230118180015/ufs-weather-model
Please make changes and add the following label back: jet-intel-BL

@BrianCurtis-NOAA
Copy link
Copy Markdown
Collaborator

Automated RT Failure Notification
Machine: gaea
Compiler: intel
Job: BL
[BL] Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/1191748279/20230118180008/ufs-weather-model
Please make changes and add the following label back: gaea-intel-BL

@jkbk2004
Copy link
Copy Markdown
Collaborator

We will skip cheyenne. It's under maintenance whole week. with that, all tests are done. Can we have final approvals to start merging in?

@jkbk2004
Copy link
Copy Markdown
Collaborator

FYI: we will resume to test with jenkins-ci again from next pr.

@DeniseWorthen
Copy link
Copy Markdown
Collaborator

@jkbk2004 CISL just sent notification that all services are returned.

@jkbk2004
Copy link
Copy Markdown
Collaborator

@jkbk2004 CISL just sent notification that all services are returned

let me check.

@jkbk2004 jkbk2004 merged commit c5badea into ufs-community:develop Jan 20, 2023
@DavidHuber-NOAA DavidHuber-NOAA deleted the remove_xhost_cice branch January 20, 2023 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Baseline Updates Current baselines will be updated. Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

S2SWA app crashes on xjet when initializing CICE

4 participants