Use nccmp to compare netcdf files in regression test on NOAA R&D machines. Update WW3 #1702

Merged
FernandoAndrade-NOAA merged 19 commits into ufs-community:develop from DusanJovic-NOAA:rt_nccmp on Apr 24, 2023

@DusanJovic-NOAA
Collaborator

DusanJovic-NOAA commented Apr 6, 2023

Description

Update rt_utils.sh to use the nccmp utility to compare netCDF outputs against the baseline instead of using a custom Python script. Currently this is enabled only on NOAA R&D machines (Hera, Orion, Gaea and Jet).
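
The change described above amounts to choosing the comparison tool per machine. A minimal sketch of that gating, not the actual rt_utils.sh code: the function name `select_nc_compare_tool` is hypothetical, and it assumes machine IDs are lowercase names that may carry a suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from rt_utils.sh): print which tool would be used
# to compare NetCDF output against the baseline on a given machine.
select_nc_compare_tool() {
  local machine_id="$1"
  case "${machine_id}" in
    # Machines named in the PR description as having nccmp enabled.
    hera*|orion*|gaea*|jet*)
      echo "nccmp" ;;
    # Everything else keeps the existing Python comparison script.
    *)
      echo "compare_ncfile.py" ;;
  esac
}
```

For example, `select_nc_compare_tool hera` prints `nccmp`. A typical nccmp call compares data and metadata without stopping at the first difference, e.g. `nccmp -d -m -f run.nc baseline.nc`; the exact flags used by this PR are not shown in the thread.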

Fixes #1657

Top of commit queue on: TBD

Input data additions/changes

  • No changes are expected to input data.
  • There will be new input data.
  • Input data will be updated.

Anticipated changes to regression tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Combined with PR's (If Applicable):

Commit Queue Checklist:

  • Link PR's from all sub-components involved
  • Confirm reviews completed in sub-component PR's
  • Add all appropriate labels to this PR.
  • Run full RT suite on either Hera/Cheyenne with both Intel/GNU compilers
  • Add list of any failed regression tests to "Anticipated changes to regression tests" section.

Linked PR's and Issues:

Testing Day Checklist:

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.

Testing Log (for CM's):

  • RDHPCS
    • Intel
      • Hera
      • Orion
      • Jet
      • Gaea
      • Cheyenne
    • GNU
      • Hera
      • Cheyenne
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

@DusanJovic-NOAA DusanJovic-NOAA added the No Baseline Change label Apr 6, 2023
@DeniseWorthen
Collaborator

@DusanJovic-NOAA On Cheyenne, nccmp/1.8.2.1 is available. Is there any reason not to include that platform also?

@DavidHuber-NOAA
Collaborator

S4 also has nccmp/1.8.9.0 and I'm happy to test it there.

@DusanJovic-NOAA
Collaborator Author

I do not have access to Cheyenne or S4, so I didn't know whether nccmp is available there, or which module should be loaded and how. I didn't want to break anything that currently works. Please feel free to make the required changes and test them. I'll then include your changes.

@junwang-noaa
Collaborator

@DusanJovic-NOAA do you see any RT test time change with this update?

@jkbk2004
Collaborator

jkbk2004 commented Apr 7, 2023

@FernandoAndrade-NOAA can you test this pr to confirm nccmp on cheyenne?

@DavidHuber-NOAA
Collaborator

S4 has nccmp installed under hpc-stack, which would require loading the compilers, etc. I have requested it to be installed in a more easily accessible location, but it may take some time. I will include it here before the PR closes if possible, but otherwise I will add it in at a later time.

@DusanJovic-NOAA
Collaborator Author

No significant change in RT test time. For example, the cpld_control_qr_p8 test, which does not produce bit-by-bit identical restart files and must use the alternative comparison of netCDF files, currently runs in about 355 seconds:

0: The total amount of wall time = 355.689583

while my test runs in about 350 seconds:

  0: The total amount of wall time                        = 350.140382
  0: The maximum resident set size (KB)                   = 3197440

Test 005 cpld_control_qr_p8 PASS

on Orion.

@FernandoAndrade-NOAA
Collaborator

@jkbk2004 adding Cheyenne to the checks seems fine:

0:The total amount of wall time                        = 327.542930
0:The maximum resident set size (KB)                   = 2719748

Test 001 cpld_control_qr_p8 PASS

@DusanJovic-NOAA
Collaborator Author

@FernandoAndrade-NOAA Did you run the test on Cheyenne using nccmp? What changes did you make? Can you open PR to my branch or point me to your branch. Or just post the diffs here in the PR. Or just tell what I need to change in order to use nccmp on Cheyenne. Thanks

@FernandoAndrade-NOAA
Collaborator

@DusanJovic-NOAA Yes, on Cheyenne and Jet. For Cheyenne, no major changes: I just added Cheyenne to the MACHINE_ID if statements in rt_utils.sh (line 361) and run_test.sh (line 109).

@DusanJovic-NOAA
Collaborator Author

Thanks

@jkbk2004
Collaborator

@DusanJovic-NOAA Can we combine #1717 into this PR? We just need to point to the WW3 branch https://github.com/jessicameixner-noaa/WW3/tree/feature/syncWW30419. Once combined, I'd like to ask you to modify the PR title to reflect #1717.

@DusanJovic-NOAA DusanJovic-NOAA changed the title from "Use nccmp to compare netcdf files in regression test on NOAA R&D machines" to "Use nccmp to compare netcdf files in regression test on NOAA R&D machines. Update WW3" Apr 21, 2023
@DusanJovic-NOAA
Collaborator Author

Done.

@FernandoAndrade-NOAA
Collaborator

@BrianCurtis-NOAA @JessicaMeixner-NOAA @MatthewMasarik-NOAA We're going to begin testing for this PR

@FernandoAndrade-NOAA FernandoAndrade-NOAA added the Ready for Commit Queue and jenkins-ci labels Apr 21, 2023
@jkbk2004
Collaborator

Gaea C4 is not available these days

The paths in the error message are not GAEA paths. They look like WCOSS2 or Acorn to me:

baseline dir = /lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20230418/INTEL/atmaero_control_p8
working dir = /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_66056/atmaero_control_p8

Yeah, @BrianCurtis-NOAA test error is from wcoss2.

@BrianCurtis-NOAA
Collaborator

Yes WCOSS2:

brian.curtis@clogin02:/lfs/h2/emc/nems/noscrub/brian.curtis/git/DusanJovic-NOAA/ufs-weather-model/tests> module load python/3.8.6
brian.curtis@clogin02:/lfs/h2/emc/nems/noscrub/brian.curtis/git/DusanJovic-NOAA/ufs-weather-model/tests> ./compare_ncfile.py /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_66056/atmaero_control_p8/RESTART/20210323.060000.phy_data.tile5.nc /lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20230418/INTEL/atmaero_control_p8/RESTART/20210323.060000.phy_data.tile5.nc
Traceback (most recent call last):
  File "./compare_ncfile.py", line 6, in <module>
    with Dataset(sys.argv[1]) as nc1, Dataset(sys.argv[2]) as nc2:
  File "src/netCDF4/_netCDF4.pyx", line 2330, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 1948, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -51] NetCDF: Unknown file format: b'/lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20230418/INTEL/atmaero_control_p8/RESTART/20210323.060000.phy_data.tile5.nc'

This is the second straight failure for this test.

@DusanJovic-NOAA
Collaborator Author

That file is empty:

clogin01:/u/dusan.jovic> ls -l /lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20230418/INTEL/atmaero_control_p8/RESTART/
total 1427328
-rw-r--r-- 1 emc.nems nems       300 Apr 19 19:23 20210323.060000.coupler.res
-rw-r--r-- 1 emc.nems nems     18784 Apr 19 19:23 20210323.060000.fv_core.res.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile1.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile2.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile3.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile4.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile5.nc
-rw-r--r-- 1 emc.nems nems  37610116 Apr 19 19:23 20210323.060000.fv_core.res.tile6.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile1.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile2.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile3.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile4.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile5.nc
-rw-r--r-- 1 emc.nems nems     92200 Apr 19 19:23 20210323.060000.fv_srf_wnd.res.tile6.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile1.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile2.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile3.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile4.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile5.nc
-rw-r--r-- 1 emc.nems nems 163883040 Apr 19 19:23 20210323.060000.fv_tracer.res.tile6.nc
-rw-r--r-- 1 emc.nems nems  38950016 Apr 19 19:23 20210323.060000.phy_data.tile1.nc
-rw-r--r-- 1 emc.nems nems  38950016 Apr 19 19:23 20210323.060000.phy_data.tile2.nc
-rw-r--r-- 1 emc.nems nems  38950016 Apr 19 19:23 20210323.060000.phy_data.tile3.nc
-rw-r--r-- 1 emc.nems nems  38950016 Apr 19 19:23 20210323.060000.phy_data.tile4.nc
-rw-r--r-- 1 emc.nems nems         0 Apr 19 19:23 20210323.060000.phy_data.tile5.nc
-rw-r--r-- 1 emc.nems nems  38950016 Apr 19 16:10 20210323.060000.phy_data.tile6.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile1.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile2.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile3.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile4.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile5.nc
-rw-r--r-- 1 emc.nems nems   9538756 Apr 19 16:10 20210323.060000.sfc_data.tile6.nc
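
A zero-size file like the one in this listing can be found without reading through the `ls` output. A minimal sketch, assuming a baseline directory tree of `.nc` files; the function name `find_empty_nc_files` is hypothetical and is not part of the regression test scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list zero-byte NetCDF files under a directory.
find_empty_nc_files() {
  local dir="$1"
  # -size 0 matches only files that occupy zero blocks, i.e. empty files.
  find "${dir}" -type f -name '*.nc' -size 0 | sort
}
```

Calling `find_empty_nc_files` on a test's baseline directory prints one path per empty file and prints nothing when every file has content.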

@BrianCurtis-NOAA
Collaborator

Why would one tile not write?

@BrianCurtis-NOAA
Collaborator

OK, well. I will redo that baseline again from develop branch. Then run the test again. Thanks for the catch @DusanJovic-NOAA

@SamuelTrahanNOAA
Collaborator

Three thoughts:

  1. Was that the baseline you generated, or was it copied from Dogwood? If it is copied, make sure you use rsync, and run rsync twice. The second run will confirm that there is nothing left to transfer. Don't ever use scp; it is not fault tolerant.
  2. Do you compare against the baseline that you generate for a PR? That should have caught a zero-sized baseline file.
  3. Do you have the logs from the baseline generation? It may give us clues as to why the copy failed. (Unless the baseline was copied from another machine.)
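
The "run rsync twice" advice in the first point can be sketched as a copy followed by a verification pass. This is one possible variation, not an existing script: the function name `sync_and_verify` is hypothetical, and the second pass is run as a dry run so that it only reports what would still be transferred.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a directory with rsync, then confirm with a
# second (dry-run) pass that nothing is left to transfer.
sync_and_verify() {
  local src="$1" dst="$2" leftover
  # First pass: the actual copy, preserving times and permissions.
  rsync -a "${src}/" "${dst}/" || return 1
  # Second pass: --itemize-changes prints one line per item still pending.
  leftover="$(rsync -a --dry-run --itemize-changes "${src}/" "${dst}/")" || return 1
  if [[ -n "${leftover}" ]]; then
    echo "transfer incomplete:" >&2
    echo "${leftover}" >&2
    return 1
  fi
}
```

A non-zero return from `sync_and_verify` means either rsync itself failed or the second pass still found differences between source and destination.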

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Apr 24, 2023

With all the switches, probably 1, since 2 is Yes and 3 is Yes. I'm not sure what @RatkoVasic-NOAA does, but it may be worth at least double checking his CRON/script.

@SamuelTrahanNOAA
Collaborator

You should also avoid "cp" or "mv" within the same machine. A large filesystem operation can be unreliable. Rsync is fault tolerant, and it will confirm the transfer was complete on a second run.

The only exception is a "mv" within the same fileset of the same filesystem in the same cluster. In that special case, "mv" is a unit operation. That makes it especially safe.

@RatkoVasic-NOAA
Collaborator

If you are talking about syncing data between two WCOSS2 machines, I'm using rsync in cron. There's a log file for each day in /u/emc.nems/ratko.vasic/202*

@RatkoVasic-NOAA
Collaborator

I see that file with zero size is on both machines.

@BrianCurtis-NOAA
Collaborator

@RatkoVasic-NOAA Thanks for checking on Dogwood! The new baseline was OK and passed testing, so hopefully it was a random issue and not a long term one.

DeniseWorthen previously approved these changes Apr 24, 2023
zach1221 previously approved these changes Apr 24, 2023
@jkbk2004
Collaborator

@JessicaMeixner-NOAA all tests are done. Please, go ahead to merge ww3 PR.

@JessicaMeixner-NOAA
Collaborator

WW3 has been merged. New hash is e026bcc

@DusanJovic-NOAA DusanJovic-NOAA dismissed stale reviews from zach1221 and DeniseWorthen via a887cce April 24, 2023 18:05
Labels

  • jenkins-ci: Jenkins CI: ORT build/test on docker container
  • No Baseline Change
  • Ready for Commit Queue: The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.

Development

Successfully merging this pull request may close these issues.

Use nccmp to compare netcdf files in regression testing