Fixes to make build and run work in new merged directory structure by mkavulich · Pull Request #343 · ufs-community/ufs-srweather-app

mkavulich · 2022-09-08T16:39:13Z

DESCRIPTION OF CHANGES:

Removes regional_workflow from Externals.cfg
Replaces uses of "HOMErrfs" with "SR_WX_APP_TOP_DIR"
Removes or replaces other references to regional_workflow
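
One way to check that a rename like this is complete is to scan the checkout for leftover references. The following is a minimal sketch, not part of this PR: the function name `find_stale_references` and the list of file suffixes are assumptions chosen for illustration, and the match is case-insensitive because the commit message spells the old variable "homerrfs".

```python
import re
from pathlib import Path

# Names this PR is meant to retire; both come from the description above.
# IGNORECASE catches both the "HOMErrfs" and "homerrfs" spellings.
STALE_PATTERN = re.compile(r"HOMErrfs|regional_workflow", re.IGNORECASE)


def find_stale_references(root, suffixes=(".sh", ".py", ".yaml", ".cfg")):
    """Return (path, line number, stripped line) for each stale reference.

    The suffix list is an illustrative assumption; adjust it to the file
    types that actually exist in the repository being checked.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the top of a checkout returns an empty list when no scanned file still mentions either name.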

Type of change

Bug fix

TESTS CONDUCTED:

All tests completed successfully except for some expected failures and a few temporary problems on Hera.

hera.intel
- Comprehensive
- Several tests failed; most are expected failures, and the rest are being investigated to confirm that they were also failing before this PR
  - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio
    - Temporary data retrieval failure
  - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818
    - Temporary data retrieval failure
  - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2020022518
    - Temporary data retrieval failure
  - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100
    - Temporary data retrieval failure
  - get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio
    - Temporary data retrieval failure
  - grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
    - Wallclock limit reached; re-running succeeded
  - *grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
  - *subhourly_post
  - *subhourly_post_ensemble_2mems
- *expected failures
orion.intel
- Fundamental
- No unexpected failures
cheyenne.intel
- Fundamental
- No failures
cheyenne.gnu
- Fundamental
- No unexpected failures
jet.intel
- Fundamental
- Two tests failed; both are pre-existing failures
  - pregen_grid_orog_sfc_climo
  - nco_ensemble

DEPENDENCIES:

None

DOCUMENTATION:

In-line documentation has been updated; user documentation will need to be updated in a future PR.


…fs-community#343)

## DESCRIPTION OF CHANGES:
- Removes regional_workflow from Externals.cfg
- Replaces uses of "HOMErrfs" with "SR_WX_APP_TOP_DIR"
- Remove/replace other references to regional_workflow

### Type of change
- [ ] Bug fix

## TESTS CONDUCTED:
All build tests passed on Cheyenne (gnu), Hera (intel), Jet (intel), Orion (intel)
WE2E tests pending but all run so far have succeeded; see PR on GitHub for full list of tests

## DEPENDENCIES:
None

## DOCUMENTATION:
In-line documentation has been updated; user documentation will need to be updated in a future PR.

danielabdi-noaa · 2022-09-12T22:46:28Z

@mkavulich Did you figure out why the get_from_HPSS* tasks fail? If I run them individually they seem to work fine, but they fail when I run the whole comprehensive test suite, so I have always excluded those tests from the comprehensive list in the past.
Maybe there is a limit on how many processes can access HPSS files simultaneously?

mkavulich · 2022-09-13T17:00:22Z

@danielabdi-noaa When I re-ran the tests they all succeeded, so I had just assumed it was a temporary machine issue. These were the messages I found in the logs; apparently the failure was due to hitting the job wallclock time limit:

INFO: Running command
 htar -xvf /NCEPPROD/hpssprod/runhistory/rh2019/201907/20190701/gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190701_00.gfs_nemsioa.tar ./gfs.20190701/00/gfs.t00z.atmanl.nemsio ./gfs.20190701/00/gfs.t00z.sfcanl.nemsio

[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
slurmstepd: error: *** JOB 35521415 ON hfe12 CANCELLED AT 2022-09-08T17:09:30 DUE TO TIME LIMIT ***
_______________________________________________________________
Start Epilog v20.08.28 on node hfe12 for job 35521415 :: Thu Sep 8 17:09:30 UTC 2022
Job 35521415 (serial) finished for user Michael.Kavulich in partition service with exit code 0:15
_______________________________________________________________
End Epilogue v20.08.28 Thu Sep 8 17:09:30 UTC 2022

However, if what you're saying is correct and this is an ongoing issue, then maybe we need to think about a good way to throttle HPSS connections when running the comprehensive set of tests, or just extend the wallclock time for these tasks. Would you mind opening an issue detailing the problems you've seen?
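
The throttling idea can be sketched generically as a cap on how many retrieval commands run at once. This is only an illustration under stated assumptions, not how the workflow schedules its tasks: `run_throttled`, `max_concurrent`, `timeout_s`, and the injectable `runner` argument are hypothetical names, and the default limit of 2 is arbitrary.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_throttled(commands, max_concurrent=2, timeout_s=None,
                  runner=subprocess.run):
    """Run each command (an argv list) with at most max_concurrent in flight.

    Returns the return codes in the same order as `commands`. A command
    that exceeds timeout_s is reported as None instead of raising.
    `runner` defaults to subprocess.run and exists only so the function
    can be exercised without launching real processes.
    """
    def run_one(argv):
        try:
            return runner(argv, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            return None

    # The pool size is the throttle: no more than max_concurrent
    # commands are ever running at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))
```

Each entry in `commands` would be one retrieval command split into an argument list, such as the `htar -xvf ...` call in the log above. A `None` in the result marks a command that hit the timeout, which makes slow retrievals easy to tell apart from commands that exited with an error code.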

mkavulich added 2 commits September 8, 2022 09:07

Remove regional_workflow from Externals.cfg

0b193db

Modifications to allow experiments to run in new merged structure.

26531b4

- Replace all instances of "homerrfs" with "SR_WX_APP_TOP_DIR", and update location appropriately
- Remove references to "regional_workflow" and update paths accordingly

mkavulich merged commit 30adef2 into develop Sep 8, 2022

danielabdi-noaa mentioned this pull request Sep 21, 2022

Enable AWS Parallel Works platform and Add Comprehensive End-To-End Tests #333

Merged

mkavulich deleted the feature/post-merge_fixes branch February 6, 2023 22:54
