Fixes to make build and run work in new merged directory structure #343

Merged
mkavulich merged 2 commits into develop from feature/post-merge_fixes
Sep 8, 2022

@mkavulich
Collaborator

@mkavulich mkavulich commented Sep 8, 2022

DESCRIPTION OF CHANGES:

  • Removes regional_workflow from Externals.cfg
  • Replaces uses of "HOMErrfs" with "SR_WX_APP_TOP_DIR"
  • Remove/replace other references to regional_workflow
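
A rename like this is easy to leave half-finished, so a quick scan for leftover references can help. The following is only a minimal sketch, not the method used in this PR: the function name `find_stale_references`, the `STALE_PATTERNS` table, and the sample script lines in the usage note are hypothetical. Only the `HOMErrfs` to `SR_WX_APP_TOP_DIR` rename and the `regional_workflow` name come from the description above.

```python
import re

# Hypothetical table of stale names and suggested replacements. Only the
# HOMErrfs -> SR_WX_APP_TOP_DIR rename is taken from this PR's description.
STALE_PATTERNS = {
    "HOMErrfs": "SR_WX_APP_TOP_DIR",
    "regional_workflow": None,  # no single replacement; paths need manual review
}


def find_stale_references(text):
    """Return (line_number, name, suggestion) for each stale name found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, suggestion in STALE_PATTERNS.items():
            # Match the name as a whole word so longer identifiers are not flagged.
            if re.search(r"\b" + re.escape(name) + r"\b", line):
                hits.append((lineno, name, suggestion))
    return hits
```

For example, calling `find_stale_references('USHDIR="${HOMErrfs}/ush"\ncd regional_workflow/ush\n')` on those two made-up script lines reports one hit per line. The whole-word match is a design choice: it avoids flagging unrelated identifiers that merely contain one of the names.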

Type of change

  • Bug fix

TESTS CONDUCTED:

Tests all completed successfully, except for some expected failures and a few temporary problems on Hera

  • hera.intel
    • Comprehensive
    • Several tests failed; most are expected failures. The rest are being investigated to confirm whether they also failed before this change
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio
        • Temporary data retrieval failure
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019101818
        • Temporary data retrieval failure
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2020022518
        • Temporary data retrieval failure
      • get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021010100
        • Temporary data retrieval failure
      • get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio
        • Temporary data retrieval failure
      • grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
        • Wallclock limit reached; re-running succeeded
      • *grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
      • *subhourly_post
      • *subhourly_post_ensemble_2mems
    • * = expected failure
  • orion.intel
    • Fundamental
    • No unexpected failures
  • cheyenne.intel
    • Fundamental
    • No failures
  • cheyenne.gnu
    • Fundamental
    • No unexpected failures
  • jet.intel
    • Fundamental
    • Two tests failed; both are pre-existing failures
      • pregen_grid_orog_sfc_climo
      • nco_ensemble

DEPENDENCIES:

None

DOCUMENTATION:

In-line documentation has been updated; user documentation will need to be updated in a future PR.

 - Replace all instances of "HOMErrfs" with "SR_WX_APP_TOP_DIR", and update location appropriately
 - Remove references to "regional_workflow" and update paths accordingly
@mkavulich mkavulich merged commit 30adef2 into develop Sep 8, 2022
mkavulich added a commit to mkavulich/ufs-srweather-app that referenced this pull request Sep 9, 2022
…fs-community#343)

## DESCRIPTION OF CHANGES: 
 - Removes regional_workflow from Externals.cfg
 - Replaces uses of "HOMErrfs" with "SR_WX_APP_TOP_DIR"
 - Remove/replace other references to regional_workflow

### Type of change
- [ ] Bug fix

## TESTS CONDUCTED: 
All build tests passed on Cheyenne (gnu), Hera (intel), Jet (intel), Orion (intel)

WE2E tests pending but all run so far have succeeded; see PR on GitHub for full list of tests

## DEPENDENCIES:
None

## DOCUMENTATION:
In-line documentation has been updated; user documentation will need to be updated in a future PR.
@danielabdi-noaa
Collaborator

@mkavulich Did you figure out why the get_from_HPSS* tasks fail? If I run them individually they seem to work fine, but not when running the whole comprehensive test, so I have always excluded those tests from the comprehensive list in the past.
Maybe there is a limit on how many processes can access HPSS files simultaneously?

@mkavulich
Collaborator Author

@danielabdi-noaa When I re-ran the tests they all succeeded, so I had assumed it was a temporary machine issue. These are the messages I found in the logs; apparently the failure was due to hitting the job wallclock time limit:

INFO: Running command
 htar -xvf /NCEPPROD/hpssprod/runhistory/rh2019/201907/20190701/gpfs_dell1_nco_ops_com_gfs_prod_gfs.20190701_00.gfs_nemsioa.tar ./gfs.20190701/00/gfs.t00z.atmanl.nemsio ./gfs.20190701/00/gfs.t00z.sfcanl.nemsio

[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
slurmstepd: error: *** JOB 35521415 ON hfe12 CANCELLED AT 2022-09-08T17:09:30 DUE TO TIME LIMIT ***
_______________________________________________________________
Start Epilog v20.08.28 on node hfe12 for job 35521415 :: Thu Sep 8 17:09:30 UTC 2022
Job 35521415 (serial) finished for user Michael.Kavulich in partition service with exit code 0:15
_______________________________________________________________
End Epilogue v20.08.28 Thu Sep 8 17:09:30 UTC 2022

However, if what you're saying is correct and this is an ongoing issue then maybe we need to think about a good way to throttle HPSS connections when running the comprehensive set of tests, or just extend the wallclock time for these tasks. Would you mind opening an issue detailing the problems you've seen?
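
To make the throttling idea concrete, here is a rough sketch of capping how many retrievals run at once with a semaphore. It is an illustration under assumptions, not a proposal for how the workflow should do it: `run_throttled`, `max_concurrent`, and `max_workers` are hypothetical names, and each task is assumed to be a Python callable that wraps one retrieval (for instance, one that launches the `htar` command shown in the log above).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(tasks, max_concurrent=2, max_workers=8):
    """Run callables in a thread pool, letting at most max_concurrent run at once.

    Hypothetical helper: each task is assumed to wrap one data retrieval.
    Results are returned in the same order as the input tasks.
    """
    gate = threading.BoundedSemaphore(max_concurrent)

    def guarded(task):
        # Block here until one of the max_concurrent slots is free.
        with gate:
            return task()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(guarded, tasks))
```

The semaphore is kept separate from the pool size so that only the retrieval step is gated, while other work could still use the remaining workers. Whether a cap like this belongs in the test scripts or in the job scheduling itself is an open question, and extending the wallclock time remains the simpler alternative.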
