
Rename wave output and refactor some wave scripts to use MPMD, and fix some bugzillas along the way #3517

Merged
DavidHuber-NOAA merged 34 commits into NOAA-EMC:develop from aerorahul:feature/rename_wave_rm on Apr 3, 2025

@aerorahul
Contributor

@aerorahul aerorahul commented Mar 25, 2025

Description

[Copied and edited from #3406]
Wave products are renamed to be uniform with other components and follow NCO implementation standards. This means all output is in the format ${RUN}.wave.tCCz.product[.fHHH][.domain].suffix.
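
As a rough illustration of the template above, the following Python sketch assembles a name from its parts. The function `wave_product_name` and its parameters are hypothetical and not part of global-workflow; the order of the optional fields follows the template exactly as written, which may differ from what a given job actually emits.

```python
from typing import Optional


def wave_product_name(run: str, cycle: int, product: str, suffix: str,
                      fhr: Optional[int] = None,
                      domain: Optional[str] = None) -> str:
    """Build ${RUN}.wave.tCCz.product[.fHHH][.domain].suffix (hypothetical helper)."""
    parts = [run, "wave", f"t{cycle:02d}z", product]
    if fhr is not None:
        parts.append(f"f{fhr:03d}")  # fHHH: zero-padded forecast hour
    if domain is not None:
        parts.append(domain)         # optional domain/grid name
    parts.append(suffix)
    return ".".join(parts)


print(wave_product_name("gfs", 12, "points", "bin", fhr=0))
# gfs.wave.t12z.points.f000.bin
```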

Some wave scripts are refactored to make them consistent with other parts of the workflow. All scripts for those jobs now pass shellcheck without warnings. Point job scripts were only updated for the file rename, as those scripts will be redone in the near future anyway.

The biggest changes in the refactor, beyond those necessary for the file rename, include:
  • Using the run_mpmd.sh script
  • Removing redirection of output to logs that get deleted (#296)
  • The wave gempak job now only operates on one forecast hour at a time. The corresponding rocoto task has been made into a metatask, using the same grouping as wave gridded post. The gempak job must use the same fhr and max tasks settings to ensure the dependencies line up.
  • One functional change: raw interpolated grids are no longer copied to COM, only the resulting grib files. [Need to confirm with Jessica this is okay]
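
The dependency concern in the gempak item can be illustrated with a small sketch. It assumes, hypothetically, that forecast hours are split into at most `max_tasks` contiguous groups; the function `group_fhrs` and its splitting rule are illustrative only and are not taken from the workflow code. The point is that two metatasks built from the same forecast-hour list and the same `max_tasks` get identical groups, while a different setting moves the group boundaries.

```python
from typing import List


def group_fhrs(fhrs: List[int], max_tasks: int) -> List[List[int]]:
    """Split forecast hours into at most max_tasks contiguous groups (illustrative rule)."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    n_groups = min(max_tasks, len(fhrs))
    if n_groups == 0:
        return []
    base, extra = divmod(len(fhrs), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first groups
        groups.append(fhrs[start:start + size])
        start += size
    return groups


fhrs = list(range(0, 13, 3))                 # 0, 3, 6, 9, 12
post_groups = group_fhrs(fhrs, max_tasks=2)
gempak_groups = group_fhrs(fhrs, max_tasks=2)
print(post_groups)                           # [[0, 3, 6], [9, 12]]
print(post_groups == gempak_groups)          # True
print(group_fhrs(fhrs, max_tasks=3))         # [[0, 3], [6, 9], [12]]
```

With mismatched settings (2 versus 3 here), a group in one metatask would no longer correspond to a single group in the other, which is the situation the description warns against.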

Resolves #296
Resolves #3270

This work was largely done by @WalterKolczynski-NOAA

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? YES/NO
  • Does this change require a documentation update? YES/NO
  • Does this change require an update to any of the following submodules? YES/NO (If YES, please add a link to any PRs that are pending.)
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • C48mx500 forecast-only on WCOSS2.
    An experiment with develop at 563567d, dev1, at: /lfs/h2/emc/ptmp/rahul.mahajan/RUNTESTS/EXPDIR/dev1
    An experiment with this branch, wk1, at: /lfs/h2/emc/ptmp/rahul.mahajan/RUNTESTS/EXPDIR/wk1
    Links to the respective COM directories can be found in the experiment directories.
    Below is a list of wave products from wk1:
❯❯❯ tree COM/gfs.20210323/12/products/wave/
COM/gfs.20210323/12/products/wave/
├── gempak
│   ├── gfswaves200k_2021032312f000
│   ├── gfswaves200k_2021032312f001
│   ├── ...
│   ├── ...
│   └── gfswaves200k_2021032312f120
├── gridded
│   └── global.2p00
│       ├── gfs.wave.t12z.global.2p00.f000.grib2
│       ├── gfs.wave.t12z.global.2p00.f000.grib2.idx
│       ├── ...
│       ├── ...
│       ├── gfs.wave.t12z.global.2p00.f120.grib2
│       └── gfs.wave.t12z.global.2p00.f120.grib2.idx
└── station
    ├── gfs.wave.t12z.t12z.bull_tar
    ├── gfs.wave.t12z.t12z.cbull_tar
    ├── gfs.wave.t12z.t12z.ibpbull_tar
    ├── gfs.wave.t12z.t12z.ibpcbull_tar
    ├── gfs.wave.t12z.t12z.ibp_tar
    └── gfs.wave.t12z.t12z.spec_tar.gz

4 directories, 369 files
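
A listing like the one above can be sanity-checked mechanically. The sketch below flags file names in which the same dot-separated token appears twice in a row, as in the `t12z.t12z` station files; whether that repetition is intended is not stated here, and the helper `repeated_tokens` is hypothetical.

```python
from typing import List


def repeated_tokens(filename: str) -> List[str]:
    """Return dot-separated tokens that are immediately repeated in filename."""
    tokens = filename.split(".")
    return [a for a, b in zip(tokens, tokens[1:]) if a == b]


names = [
    "gfs.wave.t12z.global.2p00.f000.grib2",
    "gfs.wave.t12z.t12z.bull_tar",
]
for name in names:
    print(name, repeated_tokens(name))
# gfs.wave.t12z.global.2p00.f000.grib2 []
# gfs.wave.t12z.t12z.bull_tar ['t12z']
```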

The forecast model produced wave model output as:

❯❯❯ tree COM/gfs.20210323/12/model/wave/
├── history
│   ├── gfs.wave.t12z.log.uglo_100km.2021032312
│   ├── gfs.wave.t12z.points.f000.bin
│   ├── gfs.wave.t12z.points.f001.bin
│   ├── ...
│   ├── gfs.wave.t12z.points.f120.bin
│   ├── gfs.wave.t12z.uglo_100km.f000.bin
│   ├── ...
│   ├── ...
│   ├── ...
│   └── gfs.wave.t12z.uglo_100km.f120.bin
├── prep
│   ├── gfs.wave.t12z.mod_def.glo_200.bin
│   └── gfs.wave.t12z.mod_def.uglo_100km.bin
└── restart

3 directories, 245 files

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@aerorahul changed the title from "Feature/rename wave rm" to "Rename wave output and refactor some wave scripts to use MPMD, and fix some bugzillas along the way" on Mar 25, 2025
@JessicaMeixner-NOAA
Contributor

Please note that PR #3450 is a major update of the point jobs. Apologies, I know I talked to Walter about not updating that job until after this major update, but I did not properly pass this message on to you @aerorahul. That PR is waiting for the last sync from develop -> dev/ufs-weather-model for WW3 in UFS. It would have already been submitted, but jobs failed yesterday with stmp being full on Hera. I'll try to get it submitted today, but I'm not sure how long it'll take to get that through.

@aerorahul
Contributor Author

> Please note that PR #3450 is a major update of the point jobs. Apologies, I know I talked to Walter about not updating that job until after this major update, but I did not properly pass this message on to you @aerorahul. That PR is waiting for the last sync from develop -> dev/ufs-weather-model for WW3 in UFS. It would have already been submitted, but jobs failed yesterday with stmp being full on Hera. I'll try to get it submitted today, but I'm not sure how long it'll take to get that through.

Thank you for the note @JessicaMeixner-NOAA
I have been following the point jobs PR with interest as well.
I am going to open this PR to review today after my current tests complete.

@aerorahul
Contributor Author

This is now ready for review

@aerorahul
Contributor Author

I think this is now done, and I have checked that the arch job runs past the failure point (I haven't had it succeed on Gaea because of access to HPSS [ticket submitted]).
It is ready for more reviewer feedback and testing.

@aerorahul added the CI-Hera-Ready and CI-Gaeac6-Ready labels (CM use only: PR is ready for CI testing on Hera and Gaea C6) and removed the CI-Hera-Failed and CI-Gaeac6-Failed labels (Bot use only: CI testing on Hera and Gaea C6 for this PR has failed) on Apr 2, 2025
@emcbot added the CI-Hera-Building label (Bot use only: CI testing is cloning/building on Hera) and removed the CI-Hera-Ready label on Apr 2, 2025
@emcbot

emcbot commented Apr 2, 2025

Experiment C96C48_ufs_hybatmDA FAILED on Gaeac6 in Build# 6 with error logs:

/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/enkfgdas_atmensanlobs.log
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gdas_atmanlvar.log
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gfs_atmanlvar.log

Follow link here to view the contents of the above file(s): (link)

@emcbot

emcbot commented Apr 2, 2025

Experiment C96C48_ufs_hybatmDA FAILED on Gaeac6 in Build# 6 in
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/EXPDIR/C96C48_ufs_hybatmDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C48_S2SW FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C48_S2SW_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C48_S2SWA_gefs FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C48_S2SWA_gefs_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C96mx100_S2S FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C96mx100_S2S_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C48_ATM FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C48_ATM_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C48mx500_hybAOWCDA FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C48mx500_hybAOWCDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C96C48_ufs_hybatmDA FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C96C48_ufs_hybatmDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C96C48_hybatmaerosnowDA FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C96C48_hybatmaerosnowDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C48mx500_3DVarAOWCDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C96_atm3DVar FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C96_atm3DVar_8d714478

@emcbot

emcbot commented Apr 2, 2025

Experiment C96C48_hybatmDA FAILED on Hera in Build# 7 in
/scratch1/NCEPDEV/global/glopara/CI/3517/RUNTESTS/EXPDIR/C96C48_hybatmDA_8d714478

@emcbot

emcbot commented Apr 2, 2025

CI Failed on Hera in Build# 7
Built and ran in directory /scratch1/NCEPDEV/global/glopara/CI/3517

Experiment C96C48_ufs_hybatmDA_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:21:53 UTC 2025
Experiment C96C48_ufs_hybatmDA_8d714478 Terminated: **
Experiment C48_ATM_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:26:25 UTC 2025
Experiment C48_ATM_8d714478 Terminated: **
Experiment C48mx500_hybAOWCDA_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:26:50 UTC 2025
Experiment C48mx500_hybAOWCDA_8d714478 Terminated: **
Experiment C96mx100_S2S_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:26:51 UTC 2025
Experiment C96mx100_S2S_8d714478 Terminated: **
Experiment C48_S2SW_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:27:29 UTC 2025
Experiment C48_S2SW_8d714478 Terminated: **
Experiment C48_S2SWA_gefs_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:27:32 UTC 2025
Experiment C48_S2SWA_gefs_8d714478 Terminated: **
Experiment C96C48_hybatmDA_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:27:34 UTC 2025
Experiment C96C48_hybatmDA_8d714478 Terminated: **
Experiment C48mx500_3DVarAOWCDA_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:27:37 UTC 2025
Experiment C48mx500_3DVarAOWCDA_8d714478 Terminated: **
Experiment C96C48_hybatmaerosnowDA_8d714478 Terminated with  tasks failed and  dead at Wed Apr  2 23:27:39 UTC 2025
Experiment C96C48_hybatmaerosnowDA_8d714478 Terminated: **

This failure was due to resource limitations and/or anomalies within the shell that left the system unable to create enough processes to fork when rocotostat was run repeatedly by the rocotostat.py script within the CI framework. The root cause of this specific failure when repeatedly calling rocotostat, as this script does, is unknown and is being monitored in an attempt to reproduce the anomaly. ~T.McG

@emcbot

emcbot commented Apr 3, 2025

CI Failed on Gaeac6 in Build# 6
Built and ran in directory /gpfs/f6/drsa-precip3/world-shared/global/CI/3517


Experiment C96C48_ufs_hybatmDA_8d714478 Terminated with 0
FAIL
FAIL tasks failed and 3 dead at Wed 02 Apr 2025 06:27:28 PM EDT
Experiment C96C48_ufs_hybatmDA_8d714478 Terminated: *FAIL*
Error logs:
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/enkfgdas_atmensanlobs.log
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gdas_atmanlvar.log
/gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gfs_atmanlvar.log
Experiment C48_ATM_8d714478 Completed 1 Cycles: *SUCCESS* at Wed 02 Apr 2025 06:51:37 PM EDT
Experiment C48mx500_hybAOWCDA_8d714478 Completed 2 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:03:55 PM EDT
Experiment C48_S2SW_8d714478 Completed 1 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:16:03 PM EDT
Experiment C96C48_hybatmDA_8d714478 Completed 3 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:40:41 PM EDT
Experiment C48_S2SWA_gefs_8d714478 Completed 1 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:40:49 PM EDT
Experiment C48mx500_3DVarAOWCDA_8d714478 Completed 2 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:46:42 PM EDT
Experiment C96_atm3DVar_8d714478 Completed 3 Cycles: *SUCCESS* at Wed 02 Apr 2025 07:58:46 PM EDT
Experiment C96C48_hybatmaerosnowDA_8d714478 Completed 3 Cycles: *SUCCESS* at Wed 02 Apr 2025 08:05:16 PM EDT

…ccessfully and the logs are not helpful in diagnosing the error
@aerorahul
Contributor Author

> Experiment C96C48_ufs_hybatmDA FAILED on Gaeac6 in Build# 6 with error logs:
>
> /gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/enkfgdas_atmensanlobs.log
> /gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gdas_atmanlvar.log
> /gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gfs_atmanlvar.log
>
> Follow link here to view the contents of the above file(s): (link)

@RussTreadon-NOAA @DavidNew-NOAA
I have disabled the C96C48_ufs_hybatmDA test on Gaea C6. The error does not reveal any missing inputs or ill-configured yamls.
Would appreciate a look at /gpfs/f6/drsa-precip3/world-shared/global/CI/3517/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_8d714478/logs/2024022400/gdas_atmanlvar.log

Contributor

@DavidHuber-NOAA DavidHuber-NOAA left a comment

lgtm.

@aerorahul
Contributor Author

Changing the GaeaC6 label to passed and noting that the failure was from the failed C96C48_ufs_hybatmDA test. See this comment.
All other tests passed.

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Apr 3, 2025

All the Hera tests from build number 7 have been restarted where they left off from the above initial failure in rocotostat.py and are advancing nominally:

18:12:07 <(HEAD> scripts/$ cat ../../../RUNTESTS/ci-run_check.log 
Experiment C48_ATM_8d714478 Completed 1 Cycles: *SUCCESS* at Thu Apr  3 17:30:01 UTC 2025
Experiment C48_S2SW_8d714478 Completed 1 Cycles: *SUCCESS* at Thu Apr  3 17:36:15 UTC 2025
Experiment C96mx100_S2S_8d714478 Completed 1 Cycles: *SUCCESS* at Thu Apr  3 17:42:31 UTC 2025
Experiment C48mx500_hybAOWCDA_8d714478 Completed 2 Cycles: *SUCCESS* at Thu Apr  3 18:07:34 UTC 2025

18:12:09 <(HEAD> scripts/$ jobs
[1]  Stopped                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C48_ATM_8d714478 global-workflow
[3]   Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C48_S2SWA_gefs_8d714478 global-workflow &
[5]   Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C48mx500_3DVarAOWCDA_8d714478 global-workflow &
[7]   Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C96C48_hybatmDA_8d714478 global-workflow &
[8]   Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C96C48_hybatmaerosnowDA_8d714478 global-workflow &
[9]   Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C96C48_ufs_hybatmDA_8d714478 global-workflow &
[10]  Running                 ./run-check_ci.sh /scratch1/NCEPDEV/global/glopara/CI/3517/ C96_atm3DVar_8d714478 global-workflow &

Note: the run-check_ci.sh script running above uses the rocotostat.py utility, so that is also now behaving nominally.

Labels

  • CI-Hera-Passed (Bot use only): CI testing on Hera for this PR has completed successfully
  • CI-Hercules-Passed (Bot use only): CI testing on Hercules for this PR has completed successfully
  • CI-Wcoss2-Passed: CI testing on WCOSS for this PR has completed successfully

Development

Successfully merging this pull request may close these issues.

  • Rename wave files to standardized form
  • Wave scripts should not redirect standard output away from the log file

6 participants