Add support for Rocoto with generic LINUX platform by christinaholtNOAA · Pull Request #617 · ufs-community/regional_workflow

christinaholtNOAA · 2021-10-15T15:55:30Z

DESCRIPTION OF CHANGES:

This modification allows for a user to run with Rocoto on their own Linux-based platform, which still requires modules to be installed.

TESTS CONDUCTED:

Ran with the corresponding linux_target branch on Hera. Passed the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GSD_SAR.

Tests for the new functionality were conducted on the RRFS Cloud AWS platform and within a Docker container. Currently no additional test was added to the test framework for this capability since we have no intent on supporting any defined linux configuration.

DEPENDENCIES:

Add support for Rocoto with generic LINUX platform ufs-srweather-app#184

DOCUMENTATION:

I will need guidance on where and how to add documentation for this addition, if it is necessary.

ISSUE (optional):

This PR addresses PR #610

CONTRIBUTORS (optional):

@christopherwharrop-noaa

christinaholtNOAA · 2021-10-15T15:57:08Z

+  source "${env_fp}" || print_err_msg_exit "\
+    Sourcing platform-specific environment file (env_fp) for
+  the workflow task failed :
+      env_fp = \"${env_fp}\""


A user will be defining this in their own wflow env file, anyway, so have the supported platforms do the same for consistency. I do not have enough familiarity with the WCOSS machines to make changes there.

That sounds good.

But the same comment I have below for BUILD_ENV_FN in load_modules_run_task.sh goes here for WFLOW_ENV_FN, i.e. WFLOW_ENV_FN should be (re)set as necessary in setup.sh, i.e. in setup.sh, include this:
WFLOW_EN_FN=${WFLOW_ENV_FN:-"wflow_${machine}.env"}
so that it shows the correct value in var_defns.sh. Then it can be used here instead of needing the new variable env_fn (since var_defns.sh is sourced at the top of launch_FV3LAM_wflow.sh).

christinaholtNOAA · 2021-10-15T16:26:39Z

@@ -1,3 +1,4 @@
+#!/bin/bash


Should be set explicitly.

This actually will not have an effect since setup.sh just defines the setup() function; it does not contain a script. So you can't call setup() directly from a bash shell (unless you source setup.sh first); it has to be called from a script that sources setup.sh and which includes the #!/bin/bash at the top. I intentionally omit the #!/bin/bash from functions since it is not needed and as a reminder that the file defines a function, not a script.

christinaholtNOAA · 2021-10-18T19:29:27Z

We would like to use this PR to test some of our CI automation work, too. Could someone with appropriate permissions add the ci-hera-intel-WE label to this PR? Thanks!

JeffBeck-NOAA · 2021-10-18T19:40:20Z

We would like to use this PR to test some of our CI automation work, too. Could someone with appropriate permissions add the ci-hera-intel-WE label to this PR? Thanks!

Done!

christinaholtNOAA · 2021-10-25T15:15:08Z

I saw that the corresponding ufs-srweather-app PR was merged. While I doubt it was a problem in this case, the two were definitely linked and should have been merged together. What is the status of this PR?

christinaholtNOAA · 2021-11-01T19:40:47Z

Will anyone be able to review this PR soon?

JeffBeck-NOAA · 2021-11-01T22:06:35Z

+#
+#-----------------------------------------------------------------------
+#
 # Continue appending variable definitions to the variable definitions 


Can we not just write the RUN_CMD_FCST field to var_defns.sh file after all other (related) fields are set?

I tried running the generate script (after commenting out the reset lines for the RUN_CMD_* fields), and I see the following in the var_defns.sh file:

PE_MEMBER01="12"
RUN_CMD_FCST="mpirun -np ${PE_MEMBER01}"

Is this not what we'd expect?

I don't think we want any variable representations that need to be expanded in the var_defns.sh file, do we?

At least I am currently working under that assumption in another PR given the current state of that config file.

So long as the variable representations can be expanded (i.e., the required definitions are provided earlier in the file), I would think it's OK. @gsketefian, thoughts? I agree that an eventual goal of explicitly defining all fields/variables in the var_defn.sh file would be useful, and less prone to error.

@christinaholtNOAA, minor typo in the comment block (occurance -> occurrence).

The reason I am bringing this up is that assuming that variables should not be expanded when the file is sourced means that we can include templates in this file. Not in this particular PR, but for example, if we wanted to set a template for a path to a file on a platform, we can leave it in the var_defns.sh file like this:

EXTRN_MDL_SYSBASEDIR_ICS='/scratch1/NCEPDEV/rstprod/com/gfs/prod/gfs.${yyyymmdd}/${hh}/atmos/'

And then fill it in when we're ready with run-time variables like this (where data variables are already set):

sysdir=$(eval echo ${EXTRN_MDL_SYSBASEDIR_ICS})

@christinaholtNOAA, yes, I understand how the config_defaults/config.sh scripts work. If we don't want to leave un-expanded variables in the var_defns.sh file, then we can't reference undefined variables in the config_defaults.sh or config.sh scripts.

Currently in config_defaults.sh, we see:

# RUN_CMD_FCST: # The run command for the model forecast step. This will be appended to the end # of the variable definitions file, so it can reference other variables.

Therefore, the original goal was to reference other variables and not expand anything. We should come up with another implementation to avoid having to replace an already defined field. I'll approve for now, but a goal of explicit variable definitions needs to be implemented to avoid a sed replacement call.

I'm curious if it's time to move away from bash configuration files....

Should I add this line below to ensure we don't have un-expanded variable references in the definition?

RUN_CMD_FCST=$(eval echo ${RUN_CMD_FCST})

That will allow the value that shows up in var_defns.sh to be:

NCORES_PER_NODE="40" PE_MEMBER01="12" RUN_CMD_FCST=mpirun -np 12

Yeah, I was wondering how it was being expanded, if at all, so this "eval" command looks like it'll do the trick.

Okay, I'll add that here, too.

JeffBeck-NOAA · 2021-11-01T22:09:15Z

    <command>&LOAD_MODULES_RUN_TASK_FP; "&RUN_FCST_TN;" "&JOBSDIR;/JREGIONAL_RUN_FCST"</command>
-  {%- if machine in ["JET", "HERA"]  %}
+  {%- if machine in ["JET", "HERA", "LINUX"]  %}
    <cores>{{ ncores_run_fcst }}</cores>


Can we assume that all LINUX-based systems will need <cores> and <native> in the workflow as follows?

'ncores_run_fcst': ${PE_MEMBER01}
'native_run_fcst': --cpus-per-task ${OMP_NUM_THREADS_RUN_FCST} --exclusive

I think we can assume that this will work for the Slurm-based linux systems. Let's see if we can get some input from @christopherwharrop-noaa on that conjecture.

OK, I'm wondering if we may want to have a user-defined, Linux-based job scheduler option to handle these templating blocks. If that's too much work, then a comment somewhere saying this is specifically for Slurm-based Linux systems is probably fine.

I believe @christinaholtNOAA is correct. That should work.

@christopherwharrop-noaa Is it a fair assumption that Slurm + linux would account for a majority of the installations on unsupported platforms? Is there any good reason to choose PBS on any CSP? Would this Rocoto flag still work appropriately in that case? Can we only support so much flexibility?

This is tricky to do in a portable and generic way across multiple batch system schedulers. Rocoto offers a tpp (threads per process) option in the <nodes> tag which should translate into this without using a <custom> tag. However, then you have to calculate the number of nodes needed, which in turn would require knowing how many cores per node there are, which is specific to a given machine. This is a hard problem to solve in a clean way given the machinery we have to work with.

@christinaholtNOAA - Yes. That is a pretty safe assumption. Slurm appears dominant in that arena. Moreover, both AWS and Google (not sure about Azure) use Slurm for their HPC clustering solutions in the cloud.

I think that the <cores> and <native> flags are the simplest direction to go for the generic Linux + Slurm + Rocoto setup.

We have the configurable options we need to support these: NCORES_PER_NODE and PE_MEMBER01 and OMP_NUM_THREADS_RUN_FCST.

OK, that works. Can you add a comment somewhere that these settings are a Slurm-specific implementation of Linux with Rocoto? Thanks.

Sure thing. Put it in config_default.sh.

These changes are all consistent with PR ufs-community#539 (merged) and PR ufs-community#617 (under review) for the develop branch of the NOAA-EMC:regional_workflow repository. This should enable the use of a generic linux machine with Rocoto for all jobs that exist in the current develop branch. More work is needed to tackle those tasks that do not yet exit in the develop branch, but exist here.

christinaholtNOAA · 2021-11-09T15:01:14Z

@JeffBeck-NOAA @gsketefian @mkavulich Can we move forward on this PR. It would be very helpful to get this set of changes into develop for a couple of different cloud porting activities.

christopherwharrop-noaa · 2021-11-09T15:36:28Z

@JeffBeck-NOAA @gsketefian @mkavulich - I'd like to emphasize the importance of @christinaholtNOAA 's request to move this along to resolve concerns so this can be merged as soon as possible. The LINUX target work is crucial and is needed for other projects to move forward. This is the long pole in the tent keeping other work from getting done. I am not suggesting we compromise on quality of the review process, but am requesting that this PR get the attention required to bring it to the finish line. We need this now.

JeffBeck-NOAA · 2021-11-09T16:12:07Z

@christinaholtNOAA @christopherwharrop-noaa, my review is more or less complete. I was waiting to hear from @mkavulich regarding the expansion of RUN_CMD_FCST directly within config.sh or when it's sourced to create var_defn.sh.

venitahagerty · 2021-11-09T16:22:50Z

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch2/BMC/zrtrr/rrfs_ci/autoci/pr/759317963/20211109160632/ufs-srweather-app
Build was Successful
Script /scratch2/BMC/zrtrr/rrfs_ci/autoci/pr/759317963/20211109160632/ufs-srweather-app/regional_workflow/tests/WE2E/end_to_end_tests.sh does not exist in repo
Cannot run WE2E tests
If test failed, please make changes and add the following label back:
ci-hera-intel-WE

JeffBeck-NOAA · 2021-11-09T16:28:10Z

+      You are running on an unknown platform! The default value of
+    NCORES_PER_NODE = ${NCORES_PER_NODE} will be used unless you set
+    this variable in your configuration file."
    ;;


@christinaholtNOAA, can you explain why this is necessary? If a default NCORES_PER_NODE is defined for LINUX above, then all supported platforms will have a value, and there will be no division by zero errors since a user can't choose an unsupported platform (they have to choose a value listed in the valid_param_vals.sh file, or the generation of the workflow will fail).

@JeffBeck-NOAA - Almost every project I've ever been involved with entails running on a machine that is not supported. Those of us doing pioneering work need default values such as this to make life easier. When we need to modify the workflow infrastructure to support the functionality we need to do our work, having fail safes and warnings such as this helps a lot. Strictly speaking "LINUX" isn't really a platform so much as a workaround to enable users to run outside the confines of the "supported platforms". Even when a user chooses LINUX a lot of customization may be necessary to all of the "LINUX" cases scattered throughout the code. I have a lot to say about that but will reserve those comments for the appropriate discussion forum.

@christinaholtNOAA, what I'm saying is this code will not be used. If a user defines anything other than a defined platform, the generation of the workflow will fail with:

MACHINE must be set to one of the following:
"WCOSS_CRAY" "WCOSS_DELL_P3" "HERA" "ORION" "JET" "ODIN" "CHEYENNE" "STAMPEDE" "LINUX"

I take your point, but it is usually considered a bug to have a case statement that does not include the default * section in it to handle unexpected conditions. The problem is that it absolutely will be executed (but not on purpose) if someone adds a new platform to the list, and forgets to add the appropriate cases to all the various places in the code. In the absence of the default clause, the behavior of this code relies on assumed behavior elsewhere, which is subject to change. This replaces an undefined, difficult to diagnose, behavior with a defined one.

OK, understand. Last question, I noticed you changed the NCORES_PER_NODE to be definable for each platform. What was the goal in doing that? I can see this would be an option for Jet, since the different partitions have different cores per node, and the LINUX target, but the NOAA HPC platforms are completely fixed. Thanks.

This is an interesting solution, but will only get us the core count for the processor where the configuration script is run.

That is an excellent point! But I think it's the only viable default. We may be splitting hairs here. The current code already prints a warning. If we feel that the warning is easily missed and will cause hard-to-find problems, we should fail.

I know my input wasn't requested here specifically but I feel the need to chime in regardless: having had time to think about it and after reading the discussion here, I believe that NCORES_PER_NODE should be a mandatory user-specified variable for custom environments (i.e. for the generic LINUX target, not including NCORES_PER_NODE would quit with a descriptive error message). We have seen this (even on officially supported platforms!) in the past that if NCORES_PER_NODE is set to some incorrect value for that machine, we will see varying levels of poor performance that can be quite pernicious and go unnoticed for a very long time, costing money and resources for no reason. For this and the other reasons mentioned above I think it is reasonable to require users to know and specify this value for their system.

I think that makes a lot of sense, @mkavulich and I fully support your recommendation. I was employing way too many gymnastics trying to come up with a viable strategy for a default that would work.

I will go with failure here, if the value isn't set! Thanks for the input, y'all!

christinaholtNOAA · 2021-11-10T18:00:57Z

I am working on addressing the NCORES_PER_NODE solution we discussed. Thanks for approving, but it's not ready to merge.

christinaholtNOAA · 2021-11-10T20:31:09Z

Okay, I think I've made all the requested changes. Please correct me if I'm wrong. If all is good, I'm happy with a merge as I've tested that the new changes result in the appropriate experiment directory changes (all mods were limited to configuration settings).

gsketefian · 2021-11-13T00:13:18Z

+
 machine=$(echo_lowercase $MACHINE)
-env_fn="build_${machine}_${COMPILER}.env"
+env_fn=${BUILD_ENV_FN:-"build_${machine}_${COMPILER}.env"}


Sorry again for the late input. Since BUILD_ENV_FN now appears in config_defaults.sh, it will also appear in var_defns.sh. If it is not explicitly specified by the user in config.sh, it will be set to its default value (a null string) in var_defns.sh because currently it does not get reset to "build_${machine}_${COMPILER}.env" anywhere. That should happen in setup.sh, i.e. include this in setup.sh:

BUILD_ENV_FN=${BUILD_ENV_FN:-"build_${machine}_${COMPILER}.env"}

This should be placed somewhere before the line

cp_vrfy $USHDIR/${EXPT_DEFAULT_CONFIG_FN} ${GLOBAL_VAR_DEFNS_FP}

in setup.sh. This will cause var_defns.sh to contain the correct name of the environment file. It will also make things easier in this file since now we don't need the env_fn variable; we can just use BUILD_ENV_FN (which will be available since var_defns.sh is sourced at the top of this file).

@christinaholtNOAA

## DESCRIPTION OF CHANGES: 1. Enhance ability to use template variables in the experiment configuration file (either in the default configuration file `config_defaults.sh` or the user configuration file `config.sh`). 2. Modify WE2E test system to include test of template variable use. 3. Fix bugs. ### Notes on template variables: A template variable (or simply a template) is an experiment variable that contains in its definition a reference to another variable(s). The referenced variable can be another experiment variable (i.e. one that is defined in `var_defns.sh`), or it can be a local variable (i.e. one that is not defined in `var_defns.sh` but in the script or function that sources `var_defns.sh` and uses the template). For example, a template named `TEMPL_VAR` my be defined in `config_defaults.sh` or `config.sh` as `TEMPL_VAR='cd ${some_dir}'` where `some_dir` may be an experiment variable or a local variable. `TEMPL_VAR` can then be evaluated using bash's `eval` built-in command in a script or function that first sources `var_defns.sh` and, if necessary, defines `some_dir`. Note that single quotes must be used on the right-hand side to avoid expansion of `${some_dir}` before run time (i.e. when `eval` is called on `TEMPL_VAR`). For details, see the documentation added in PR #[198](ufs-community/ufs-srweather-app#198). ### Changes to WE2E tests: * Modify the WE2E test configuration file `config.deactivate_tasks.sh` to include template variables. `deactivate_tasks` now serves as a test of both deactivating tasks and of using template variables. * Add `template_vars` as an alternate test name for `deactivate_tasks` (by creating a symlink named `config.template_vars.sh` that points to `config.deactivate_tasks.sh`). ### Bug fixes: * In `get_WE2Etest_names_subdirs_descs.sh`, change the variable `alt_test_subdirs` to `alt_test_names` at a single location. * In `setup.sh`, set `BUILD_ENV_FN` and `WFLOW_ENV_FN` (instead of in `load_modules_run_task.sh` and `launch_FV3LAM_wflow.sh`, respectively). This way, these variables will have the correct values in `var_defns.sh`. * In `get_expts_status.sh`, fix the way `homerrfs` is calculated. ## TESTS CONDUCTED: The WE2E tests `grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2` and `template_vars` were run on Hera. Both completed successfully. ## DOCUMENTATION: Documentation is added to the User's Guide via PR #[198](ufs-community/ufs-srweather-app#198) into the ufs-srweather-app repo. ## Dependencies: PR #[198](ufs-community/ufs-srweather-app#198) for the documentation. ## CONTRIBUTORS: @christinaholtNOAA and @mkavulich brought up the issue of templates as part of PR #[617](https://github.com/NOAA-EMC/regional_workflow/pull/617).

christinaholtNOAA and others added 3 commits October 6, 2021 16:47

Adding all changes. Untested at this point.

fba5bec

Chris's additions for running on AWS.

2897caf

Changes needed for AWS platform.

0c7ddf6

christinaholtNOAA requested review from BenjaminBlake-NOAA, JeffBeck-NOAA, gsketefian and mkavulich as code owners October 15, 2021 15:55

christinaholtNOAA mentioned this pull request Oct 15, 2021

Add support for Rocoto with generic LINUX platform ufs-community/ufs-srweather-app#184

Merged

christinaholtNOAA commented Oct 15, 2021

View reviewed changes

JeffBeck-NOAA added the ci-hera-intel-WE Kicks off automated workflow test on hera with intel label Oct 18, 2021

JeffBeck-NOAA reviewed Nov 1, 2021

View reviewed changes

Comment thread ush/launch_FV3LAM_wflow.sh

JeffBeck-NOAA reviewed Nov 1, 2021

View reviewed changes

Comment thread ush/launch_FV3LAM_wflow.sh

Clean up comment to match code.

a3c7cd0

JeffBeck-NOAA reviewed Nov 1, 2021

View reviewed changes

christinaholtNOAA added 2 commits November 1, 2021 22:27

Fix spelling error.

15cfe9a

More docs for choosing linux + rocoto.

429eb20

JeffBeck-NOAA reviewed Nov 9, 2021

View reviewed changes

JeffBeck-NOAA approved these changes Nov 10, 2021

View reviewed changes

Require NCORES_PER_NODE for all machines.

5532ef9

mkavulich mentioned this pull request Nov 10, 2021

Get changes to config_defaults.sh made in Ch. 5 of documentation back into config_defaults.sh in repo (both release and develop branches) #432

Closed

Ensure variables are expanded within RUN_CMD

3fb49fc

JeffBeck-NOAA merged commit 5fcddb4 into ufs-community:develop Nov 11, 2021

gsketefian mentioned this pull request Nov 12, 2021

To simplify the launch_FV3LAM_wflow.sh script, try to capture the output of the rocotorun command using bash's $(rocotorun ...) syntax #631

Closed

gsketefian reviewed Nov 13, 2021

View reviewed changes

gsketefian mentioned this pull request Dec 16, 2021

Enhance ability to use template variables #650

Merged

christinaholtNOAA mentioned this pull request Jan 7, 2022

Refactor supported input path handling. #653

Merged

christinaholtNOAA deleted the linux_target branch May 13, 2022 15:27

mkavulich mentioned this pull request Sep 21, 2022

Add support for Rocoto with generic LINUX platform #610

Closed

mkavulich mentioned this pull request Oct 4, 2022

To simplify the launch_FV3LAM_wflow.sh script, capture the output of the rocotorun command using bash's $(rocotorun ...) syntax ufs-community/ufs-srweather-app#393

Closed

Conversation

christinaholtNOAA commented Oct 15, 2021

DESCRIPTION OF CHANGES:

TESTS CONDUCTED:

DEPENDENCIES:

DOCUMENTATION:

ISSUE (optional):

CONTRIBUTORS (optional):

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

christinaholtNOAA commented Oct 18, 2021

Uh oh!

JeffBeck-NOAA commented Oct 18, 2021

Uh oh!

christinaholtNOAA commented Oct 25, 2021

Uh oh!

christinaholtNOAA commented Nov 1, 2021

Uh oh!

Uh oh!

Uh oh!

JeffBeck-NOAA Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JeffBeck-NOAA Nov 10, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JeffBeck-NOAA Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

christopherwharrop-noaa Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

christopherwharrop-noaa Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JeffBeck-NOAA Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

JeffBeck-NOAA Nov 1, 2021 •

edited

Loading

JeffBeck-NOAA Nov 10, 2021 •

edited

Loading

JeffBeck-NOAA Nov 1, 2021 •

edited

Loading

christopherwharrop-noaa Nov 1, 2021 •

edited

Loading

christopherwharrop-noaa Nov 1, 2021 •

edited

Loading

JeffBeck-NOAA Nov 1, 2021 •

edited

Loading

JeffBeck-NOAA Nov 9, 2021 •

edited

Loading

JeffBeck-NOAA Nov 9, 2021 •

edited

Loading