Skip to content

Changes towards making SRW app NCO compliant#348

Merged
danielabdi-noaa merged 106 commits into
ufs-community:developfrom
danielabdi-noaa:feature/nco_standard
Sep 25, 2022
Merged

Changes towards making SRW app NCO compliant#348
danielabdi-noaa merged 106 commits into
ufs-community:developfrom
danielabdi-noaa:feature/nco_standard

Conversation

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa danielabdi-noaa commented Sep 12, 2022

DESCRIPTION OF CHANGES:

Changes to make SRW app NCO complaint following the standard and review.

Here are some of the changes in this PR:

  • Follow new NCO directory structure and naming convention of output files, log files, temporary directory etc
  • Make any WE2E test case to be run in either community or nco mode
  • Also make verification tasks sort of nco complaint but this is not run in production
  • Use standard production environment variables listed in Table 1 of standard
  • Symlink fix files by default -- optionally can be turned off
  • Build to "exec" directory activated
  • Activate debugging mode by optionally setting set -eux
  • Make j-job and ex-script names consistent
  • Add versions of libraries specifically for wcoss2
  • Add a unique workflow ID for resolving clashes between different files, while keeping relation between different tasks in same workflow run
  • Adds unittest for end-to-end workflow generation using config.community.yaml and config.nco.yaml

TESTS CONDUCTED:

  • Run fundamental tests (9 test cases) on Hera, Jet and Orion manually using both community and nco mode.

  • Run fundamental tests on Gaea through Jenkins

  • Run the comprehensive test (82 test cases) on Hera using NCO mode. All of the test cases are successful.

    Some unexpected failures.

    • get_from_HPSS* tasks that fail when run in tandem with other tests, but not if run individually. Cause: Probably HPSS access issue with multiple processes? After a fix for conflicting ics/lbcs temporay working directories in NCO mode all test cases run to completion in one attempt. I suspect that I was very lucky to get this result since develop branch can sometimes fail all of them.
    • ensemble runs going into second day forecast not being run. Cause: make_grid_complete.txt is stored in first day's directory. Not relevant for operations since make_grid is not run, but necessary to run every test case in NCO mode. Find a better way to communicate grid generation completion than using dummy file?
      Solved by moving grid/orog completion files to a shared space for all cycle dates in EXPTDIR/grid(orog)(sfc_climo)
    • community_ensemble_2mems_stoch fails. Cause: This test case generates new input.nml per ensemble member. Solved by outputting input.nml to whatever is current working directory.

    The test directory and rocot stats files:

    /scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/benchmark-nco
    /scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/stat-nco.txt
    

    The NCO operations directory:

    /scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT
    

    The subdirectories com, output and tmp are the ones with content.

  • Run the comprehensive test (82 test cases) on Hera using COMMUNITY mode. All of the test cases are successful.

    The test directory and rocot stats files:

    /scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/benchmark-com
    /scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/stat-com.txt
    

DEPENDENCIES:

None

DOCUMENTATION:

None but would need documentation updates

ISSUE (optional):

#316

CONTRIBUTORS (optional):

@MatthewPyle-NOAA @christinaholtNOAA

@danielabdi-noaa danielabdi-noaa force-pushed the feature/nco_standard branch 4 times, most recently from 354a29b to bea89e6 Compare September 13, 2022 14:45
@chan-hoo
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa, thank you so much for this work!!! In this PR, you newly define some directories such as HOMErrfs, PARMrrfs, and USHrrfs. This kind of naming only works for RRFS. For example, Online-CMAQ (AQMv7) is an extension of the UFS SRW App. For this implementation, we should name them as HOMEaqm, PARMaqm, and USHaqm. The model name is defined with the "NET" variable in the app. For example, NET=rrfs for RRFS, and NET=aqm for Online-CMAQ. Could you change the way to define these directory names like "HOME{NET}", "PARM{NET}", and "USH{NET}"? Is this possible?

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

@chan-hoo Thanks! We can probably re-name HOMErrfs, USHrrfs etc to be HOME{NET}, USH{NET} in the ex-scripts and j-jobs, since all we have to do is modify what gets written to the var_defns.sh file accordingly. However, for the python workflow generation code itself, we would have to stick with one name. HOMErrfs would be Ok for that I guess. There would be a disparity between the workflow generation code and ex-scripts/j-job names if we decide to go this route. Let me know what you think.

@chan-hoo
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa, I agree with you! This requirement is only for the "J-job" (and its ex-script).

@chan-hoo
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa, what about naming it with "model" such as "HOMEmodel", "USHmodel", and "PARMmodel" in the python script?

@MatthewPyle-NOAA
Copy link
Copy Markdown
Collaborator

There probably isn't a great solution for this issue (with the SRW App intended for multiple operational systems), but NCO will really want a $HOMErrfs for RRFS and a $HOMEaqm for AQM. Maybe a generic placeholder like $HOMEmodel could be modified ahead of final code delivery to what it needs to be for the specific application.

@chan-hoo
Copy link
Copy Markdown
Collaborator

chan-hoo commented Sep 13, 2022

@danielabdi-noaa @MatthewPyle-NOAA, If so, the "rrfs" suffix would be fine for the python script. I'll modify it in my forked repo for Online-CMAQ when AQMv7 is delivered to NCO.

@MatthewPyle-NOAA
Copy link
Copy Markdown
Collaborator

MatthewPyle-NOAA commented Sep 13, 2022

@danielabdi-noaa I noticed a small issue with the build for WCOSS2. The version defining variable for wrf_io is inconsistently named between versions/build.ver.wcoss2 (export wrf_io_ver=1.2.0) and modulefiles/build_wcoss2_intel ( module load wrf_io/$::env(wrfio_ver) ) Could you modify build_wcoss2_intel to pull in wrf_io_ver?

It built for me on WCOSS2 after making this change.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

@MatthewPyle-NOAA Thanks, just fixed it in the repo.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

@chan-hoo After starting to work on the change, I am realizing it may not be the best idea.

+        f"HOME{NET}": HOMErrfs,
+        f"USH{NET}": USHrrfs,
+        f"SCRIPTS{NET}": SCRIPTSrrfs,
+        f"JOBS{NET}": JOBSrrfs,
+        f"SORC{NET}": SORCrrfs,
+        f"PARM{NET}": PARMrrfs,
+        f"MODULES{NET}": MODULESrrfs,
+        f"EXEC{NET}": EXECrrfs,
+        f"FIX{NET}": FIXrrfs,

In the ex-scripts/j-job scripts we would have to parametrize the variable name itself. If we had an m4 macro processor run over the scripts, maybe it would have been easier, but for now replacing HOMErrfs etc with the appropriate name before delivery sounds better.

@chan-hoo
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa, I got it. I'll replace them manually for AQM. Thank you!

@danielabdi-noaa danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Sep 14, 2022
@venitahagerty venitahagerty removed ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Sep 14, 2022
@MatthewPyle-NOAA
Copy link
Copy Markdown
Collaborator

MatthewPyle-NOAA commented Sep 14, 2022

@danielabdi-noaa One more thing jumped out at me when doing some testing. Would it be possible to make DATAROOT a distinctly specifiable variable, rather than having it specified as OPSROOT/tmp/? While production does run that way, development work on WCOSS generally has DATAROOT and COMROOT on different temporary disks that have different retention periods.

@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

danielabdi-noaa commented Sep 14, 2022

@MatthewPyle-NOAA It should be possible but I've had doubts on how NCO variables are specified. Currently the variables are in the var_defns.sh file under the nco section that the user can edit to his liking.

# [nco]
envir='para'
NET='rrfs'
RUN='nco_ensemble'
model_ver='we2e'
OPSROOT='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT'
COMIN_BASEDIR='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/com/rrfs/we2e'
COMOUT_BASEDIR='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/com/rrfs/we2e'
COMROOT='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/com'
PACKAGEROOT='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/packages'
DATAROOT='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/tmp'
DCOMROOT='/scratch1/BMC/zrtrr/Daniel.Abdi/rrfs/OPSROOT/dcom'
DBNROOT=''
SENDECF='FALSE'
SENDDBN='FALSE'
SENDDBN_NTC='FALSE'
SENDCOM='FALSE'
SENDWEB='FALSE'
KEEPDATA='TRUE'
MAILTO=''
MAILCC=''

If some of these variables are specified by an export from the environment, currently it won't take effect because both j-jobs and ex-scripts source the var_defns.sh file again and overwrite the specified values. So how are these variables specified in operations? We may need to completely remove them from var_defns.sh if they are exported from the environment.

@ufs-community ufs-community deleted a comment from venitahagerty Sep 14, 2022
@danielabdi-noaa danielabdi-noaa added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Sep 14, 2022
@venitahagerty venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Sep 14, 2022
@ufs-community ufs-community deleted a comment from venitahagerty Sep 14, 2022
@venitahagerty
Copy link
Copy Markdown
Collaborator

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1053825830/20220914192016/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE

@MatthewPyle-NOAA
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa Sorry about that - I saw something in setup.py, not thinking that I could fix it in var_defns.sh.

The highest level items like OPSROOT and DATAROOT (items listed as "job card" in Table 1 of the implementation standards) are specified by the ecflow jobs in production. For rocoto-driven jobs, they would ideally be defined somewhere at the rocoto or rocoto-driving script level. At the J-job and below, the expectation is that they would be available as environmental variables.

@dshawul
Copy link
Copy Markdown
Contributor

dshawul commented Sep 14, 2022

@MatthewPyle-NOAA I forgot to mention that you can actually export OPSROOT, DATAROOT etc before workflow generation and it will take those values, instead of the deffaults in setup.py so you don't really need to edit var_defns.sh. I was thinking about the situation after the workflow is generated, in which case the only way to specifiy it by editing var_defns.sh. If SENDDBN etc is something that is decided before workflow is generated and does not change afterwards, we are all good.

@venitahagerty
Copy link
Copy Markdown
Collaborator

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1053825830/20220914192010/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE

@danielabdi-noaa danielabdi-noaa force-pushed the feature/nco_standard branch 2 times, most recently from f9045a7 to 21f9c96 Compare September 25, 2022 13:33
@danielabdi-noaa
Copy link
Copy Markdown
Collaborator Author

I've tested the fundamental tests on Gaea through Jenkins since last time. I am fairly confident it works as intended so going to merge now.

@danielabdi-noaa danielabdi-noaa merged commit f6f1717 into ufs-community:develop Sep 25, 2022
@christinaholtNOAA
Copy link
Copy Markdown
Collaborator

@danielabdi-noaa I missed that you'd merged already. Thank you for being so quick and thorough on your responses to reviews! Great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Tested on Hera Tested successfully on Hera machine Tested on Jet Successfully tested on Jet machine Tested on Orion Tests ran successfully on MSU Orion machine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants