Use nccmp to compare netCDF files in regression tests on NOAA R&D machines. Update WW3 #1702

@DusanJovic-NOAA On Cheyenne, nccmp/1.8.2.1 is available. Is there any reason not to include that platform as well?

S4 also has nccmp/1.8.9.0, and I'm happy to test it there.

I do not have access to Cheyenne or S4, so I didn't know whether nccmp is available there, which module provides it, or how it should be loaded. I didn't want to break anything that currently works. Please feel free to make the required changes and test them. I'll then include your changes.

@DusanJovic-NOAA do you see any change in RT run time with this update?

@FernandoAndrade-NOAA can you test this PR to confirm that nccmp works on Cheyenne?

S4 has nccmp installed under hpc-stack, which would require loading the compilers, etc. I have requested that it be installed in a more easily accessible location, but that may take some time. I will add it here before the PR closes if possible; otherwise I will add it at a later time.

No significant change in RT run time. For example, the cpld_control_qr_p8 test, which does not produce bit-for-bit identical restart files and therefore must use the alternative netCDF comparison, currently runs in about 355 seconds on Orion, while with my changes it runs in about 350 seconds.
|
@jkbk2004 Adding Cheyenne to the checks seems fine.
@FernandoAndrade-NOAA Did you run the test on Cheyenne using nccmp? What changes did you make? Can you open a PR to my branch or point me to your branch? Or just post the diffs here in the PR, or tell me what I need to change in order to use nccmp on Cheyenne. Thanks!
@DusanJovic-NOAA Yes, on Cheyenne and Jet. For Cheyenne there were no major changes: I just added Cheyenne to the MACHINE_ID if statements in rt_utils.sh (line 361) and run_test.sh (line 109).
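The MACHINE_ID gating mentioned above can be sketched as a small bash helper. This is a hypothetical illustration, not the actual if statements in rt_utils.sh or run_test.sh: the function name is invented, the machine list is the PR description's list plus Cheyenne, and it assumes MACHINE_ID holds the machine name, possibly followed by a ".compiler" suffix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) when nccmp should be used
# for netCDF comparisons on the given machine. The machine list follows
# the PR description (hera, orion, gaea, jet) plus Cheyenne.
nccmp_enabled() {
  # Use the first argument if given, otherwise fall back to MACHINE_ID.
  local machine=${1:-${MACHINE_ID:-}}
  # Drop an optional ".<compiler>" suffix, e.g. "hera.intel" -> "hera".
  machine=${machine%%.*}
  case ${machine} in
    hera | orion | gaea | jet | cheyenne) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller could then write "if nccmp_enabled; then" to use nccmp, and keep the existing Python comparison script in the else branch for every other machine, so platforms that have not been tested are left unchanged.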
Thanks |
|
@DusanJovic-NOAA can we combine #1717 into this PR? We just need to point to the WW3 branch https://github.com/jessicameixner-noaa/WW3/tree/feature/syncWW30419. Once it is combined, I'd like to ask you to modify the PR title to reflect #1717.
Done. |
|
@BrianCurtis-NOAA @JessicaMeixner-NOAA @MatthewMasarik-NOAA We're going to begin testing this PR.
Yeah, @BrianCurtis-NOAA, the test error is from WCOSS2.
|
Yes, WCOSS2. This is the second consecutive failure for this test.
|
That file is empty: |
|
Why would one tile not write? |
|
OK. I will regenerate that baseline from the develop branch and then run the test again. Thanks for the catch, @DusanJovic-NOAA!
Three thoughts:
|
With all the switches, probably 1, since 2 is Yes and 3 is Yes. I'm not sure what @RatkoVasic-NOAA does, but it may be worth at least double-checking his cron job/script.
You should also avoid "cp" or "mv" within the same machine, because a large filesystem operation can be unreliable. Rsync is fault tolerant, and running it a second time will confirm that the transfer completed. The only exception is a "mv" within the same fileset of the same filesystem on the same cluster. In that special case, "mv" is an atomic operation (a rename), which makes it especially safe.
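The rsync-then-verify advice above, combined with a same-filesystem rename, can be sketched in bash as follows. It is an illustration under stated assumptions, not the actual cron script: the function names and the ".partial" staging suffix are hypothetical. The verification pass relies on rsync's default quick check (size and modification time); adding --checksum to that pass would compare file contents instead, at the cost of reading both trees.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: copy a directory with rsync, confirm the copy with
# a second dry-run pass, then publish it with a same-filesystem rename.

sync_and_verify() {
  local src=$1 dst=$2 pending
  # First pass: archive mode preserves times and permissions, so the
  # second pass can detect anything that was left incomplete.
  rsync -a "${src}/" "${dst}/" || return 1
  # Second pass: dry run that itemizes remaining changes. Empty output
  # means nothing is left to transfer.
  pending=$(rsync -a --dry-run --itemize-changes "${src}/" "${dst}/") || return 1
  [[ -z ${pending} ]]
}

publish_baseline() {
  local src=$1 final=$2
  # Staging dir in the same parent directory, hence the same filesystem.
  local staging="${final}.partial"
  if [[ -e ${final} ]]; then
    echo "refusing to overwrite ${final}" >&2
    return 1
  fi
  # On failure the staging copy is kept, so a rerun resumes from it.
  sync_and_verify "${src}" "${staging}" || return 1
  # A rename within one filesystem is atomic: readers see either no
  # directory or the complete one, never a half-copied tree.
  mv "${staging}" "${final}"
}
```

Because readers only ever see the final name after verification, a test that starts while a copy is still in progress cannot pick up a partially written file.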
|
If you are talking about syncing data between two WCOSS2 machines, I'm using rsync in cron. There's a log file for each day in /u/emc.nems/ratko.vasic/202* |
|
I see that file with zero size is on both machines. |
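A zero-size file like this one could be caught before a baseline is used or copied. The sketch below is a hypothetical pre-check, not part of rt_utils.sh; the function names are invented, and it simply lists regular files of size zero with find.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check: list regular files of size zero under a
# directory. "-size 0" counts 512-byte blocks rounded up, so only
# truly empty files match.
find_empty_files() {
  find "$1" -type f -size 0 -print
}

# Return 0 if the directory has no empty files, 1 otherwise (listing
# them on stderr), and 2 if find itself fails.
check_baseline_nonempty() {
  local empties
  empties=$(find_empty_files "$1") || return 2
  if [[ -n ${empties} ]]; then
    echo "empty files under $1:" >&2
    echo "${empties}" >&2
    return 1
  fi
  return 0
}
```

Running such a check on the source directory before syncing would flag the problem on both machines at once, since rsync faithfully copies an empty file that is already empty at the source.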
|
@RatkoVasic-NOAA Thanks for checking on Dogwood! The new baseline was OK and passed testing, so hopefully it was a random issue and not a long term one. |
|
@JessicaMeixner-NOAA all tests are done. Please go ahead and merge the WW3 PR.
|
WW3 has been merged. New hash is e026bcc |
a887cce
Description
Update rt_utils.sh to use the nccmp utility to compare netCDF outputs against the baseline instead of using a custom Python script. Currently only on NOAA R&D machines (hera, orion, gaea and jet).
Fixes #1657
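The comparison described above can be sketched as a bash function that tries a byte-for-byte cmp first and falls back to nccmp only for netCDF files that differ at the byte level. The function name, the *.nc suffix test and the nccmp flags (-d to compare data, -m to compare metadata, -f to keep checking after the first difference) are illustrative assumptions, not the options used in rt_utils.sh; the sketch also assumes nccmp exits with status 0 when it finds no differences.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: compare one output file against its baseline copy.
compare_with_baseline() {
  local baseline=$1 output=$2
  # Fast path: bit-for-bit identical files need no netCDF-aware check.
  if cmp -s "${baseline}" "${output}"; then
    return 0
  fi
  # Byte-level differences do not necessarily mean the stored data
  # differ, so let nccmp compare data and variable metadata.
  if [[ ${output} == *.nc ]]; then
    nccmp -d -m -f "${baseline}" "${output}"
    return $?
  fi
  # Any other file type that differs byte-wise counts as a failure.
  return 1
}
```

Only files that fail the cheap cmp check pay for the netCDF-aware comparison, which fits the observation that RT run time did not change noticeably.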
Top of commit queue on: TBD
Input data additions/changes
Anticipated changes to regression tests:
Subcomponents involved:
Combined with PR's (If Applicable):
Commit Queue Checklist:
Linked PR's and Issues:
Testing Day Checklist:
Testing Log (for CM's):