Multiple domains quilting restart #1722

Merged: zach1221 merged 27 commits into ufs-community:develop from DusanJovic-NOAA:multiple_domains_quilting_restart on May 19, 2023

DusanJovic-NOAA commented Apr 24, 2023

Collaborator

Description

This PR updates the fv3atm write grid component to allow writing restart files for multiple domains (nests). Three new tests were added that compare the RESTART files written by the write grid component with the restart files written by FMS.
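
As a rough illustration of what comparing two sets of restart output involves, here is a minimal sketch. It is not the regression-test code: the function name `compare_restart_fields` is hypothetical, each file is assumed to have already been read into a plain dict mapping a variable name to a flat list of floats, and the exact-match default is an assumption rather than the criterion the regression tests apply.

```python
import math


def compare_restart_fields(quilt, fms, rel_tol=0.0):
    """Return a list of differences between two restart datasets.

    Hypothetical helper: `quilt` and `fms` are dicts mapping a variable
    name to a flat list of floats. Reading the NetCDF files themselves
    is outside the scope of this sketch.
    """
    diffs = []
    for name in sorted(set(quilt) | set(fms)):
        if name not in quilt:
            diffs.append(f"{name}: missing from write-component file")
            continue
        if name not in fms:
            diffs.append(f"{name}: missing from FMS file")
            continue
        a, b = quilt[name], fms[name]
        if len(a) != len(b):
            diffs.append(f"{name}: size {len(a)} != {len(b)}")
            continue
        # With rel_tol=0.0 and abs_tol=0.0 this is an exact comparison.
        bad = sum(
            1
            for x, y in zip(a, b)
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=0.0)
        )
        if bad:
            diffs.append(f"{name}: {bad} differing value(s)")
    return diffs
```

An empty list means every variable is present in both datasets with matching values; any other result names the variables that differ.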

Fixes: #1628

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

hafs_global_1nest_atm
hafs_global_multiple_4nests_atm
hafs_regional_1nest_atm
hafs_regional_storm_following_1nest_atm

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

FV3ATM: #650
GFDL Cubed Sphere: #268

Please link the related issues to be closed with this PR, whether in this repository, or in another repository.
EXAMPLE: Closes NOAA-EMC/fv3atm/issues/<issue_number>

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@BrianCurtis-NOAA

Collaborator

Please link the FV3 and atmos_cubed_sphere PRs. Please run the GNU/Intel RTs and note which tests change.

DusanJovic-NOAA added the "Baseline Updates" label (Current baselines will be updated.) May 4, 2023

github-actions Bot commented May 4, 2023


@DusanJovic-NOAA please bring these up to date with respective authoritative repositories

  • ufs-weather-model NOT up to date

zach1221 self-assigned this May 16, 2023

@jkbk2004

Collaborator

@DusanJovic-NOAA we can start working on this PR. Can you sync up branches?

@DusanJovic-NOAA

Collaborator Author

> @DusanJovic-NOAA we can start working on this PR. Can you sync up branches?

Merged.

@zach1221

Collaborator

@DusanJovic-NOAA do you have GNU/Intel RT logs from runs on either Hera or Cheyenne?

DusanJovic-NOAA commented May 17, 2023

Collaborator Author

> @DusanJovic-NOAA do you have GNU/Intel RT logs from runs on either Hera or Cheyenne?

RegressionTests_hera.intel.log

This is a Hera log from the test I ran yesterday. I do not have a GNU log, but since the GNU tests do not run HAFS, I do not expect changes in those baselines.

zach1221 added the "Ready for Commit Queue" label (The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.) May 17, 2023

@zach1221

Collaborator

@BrianCurtis-NOAA I'm going to start testing through this PR next.

zach1221 added the "jenkins-ci" label (Jenkins CI: ORT build/test on docker container) May 17, 2023

@zach1221

Collaborator

Jenkins CI logs attached. ORTs passed. I will now begin manually creating the new baselines for the tests below:

  • hafs_global_1nest_atm
  • hafs_global_multiple_4nests_atm
  • hafs_regional_1nest_atm
  • hafs_regional_storm_following_1nest_atm

ufs-weather-model » ort-docker-pipeline » PR-1722 #1 Console [Jenkins].pdf

@zach1221

Collaborator

@jkbk2004 All four of the new HAFS qr cases are failing on cheyenne.intel.

@zach1221

Collaborator

@jkbk2004 I can try to run these cases with hdf5 1.14.0

@jkbk2004

Collaborator

It sounds like the new cases run with -DDEBUG=ON. At least they do not crash, though they run very slowly: /glade/scratch/jongkim/pr-1722-intel/jongkim/FV3_RT/rt_60395/hafs_regional_1nest_atm_qr. The crash looks like a system/compiler/MPT issue. It is not practical to run the new cases on Cheyenne. @DusanJovic-NOAA @zach1221 can we agree to turn the new cases off on Cheyenne?

@DusanJovic-NOAA

Collaborator Author

> It sounds like the new cases run with -DDEBUG=ON. At least they do not crash, though they run very slowly: /glade/scratch/jongkim/pr-1722-intel/jongkim/FV3_RT/rt_60395/hafs_regional_1nest_atm_qr. The crash looks like a system/compiler/MPT issue. It is not practical to run the new cases on Cheyenne. @DusanJovic-NOAA @zach1221 can we agree to turn the new cases off on Cheyenne?

I agree; turn off the 4 new HAFS tests on Cheyenne.

@DeniseWorthen

Collaborator

@zach1221

Collaborator

I can create a new issue regarding the hafs qr error on cheyenne.intel.

@DusanJovic-NOAA

Collaborator Author

> @DusanJovic-NOAA Do you have any insight into the exact line it is complaining about? https://github.com/DusanJovic-NOAA/fv3atm/blob/0379fd48f24dd67cab6f8b88b2b77fabfa7afc71/io/module_write_restart_netcdf.F90#L452

I don't. That's the function that actually writes the array into a file. The error message is 'NetCDF: HDF error', so I assume something is wrong in the NetCDF or HDF5 library. @zach1221 Did you try to run one of these 4 tests using hdf5 1.14.0?

@zach1221

Collaborator

Hi, @DusanJovic-NOAA. Yes, I tried hdf5 1.14.0 with hafs_regional_1nest_atm_qr, but received the same error message posted above. Perhaps it would be worth it to try with netcdf updated to version 4.9.4 as well.

@zach1221

Collaborator

OK, I'll continue investigating the failure in the UFS-WM issues queue. In the meantime, @DusanJovic-NOAA, if you want to turn off the 4 new HAFS tests on Cheyenne, then I think we're ready to proceed to final review/approvals.

@DusanJovic-NOAA

Collaborator Author

> OK, I'll continue investigating the failure in the UFS-WM issues queue. In the meantime, @DusanJovic-NOAA, if you want to turn off the 4 new HAFS tests on Cheyenne, then I think we're ready to proceed to final review/approvals.

I disabled those 4 tests on cheyenne

DeniseWorthen previously approved these changes May 19, 2023

@zach1221

Collaborator

Apologies, I sent the reviews out before the fv3 submodule pointer was updated and gitmodules were reverted. Can you please provide your approval again? @DeniseWorthen @BrianCurtis-NOAA

zach1221 merged commit 52cee26 into ufs-community:develop May 19, 2023

DusanJovic-NOAA deleted the multiple_domains_quilting_restart branch August 31, 2023 20:43
Labels

  • Baseline Updates: Current baselines will be updated.
  • jenkins-ci: Jenkins CI: ORT build/test on docker container
  • Ready for Commit Queue: The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.

Development

Successfully merging this pull request may close these issues.

Using fv3atm write grid component to write out restart files for multiple domains

6 participants