Add support for Rocoto with generic LINUX platform#617
Conversation
| source "${env_fp}" || print_err_msg_exit "\ | ||
| Sourcing platform-specific environment file (env_fp) for | ||
| the workflow task failed : | ||
| env_fp = \"${env_fp}\"" |
There was a problem hiding this comment.
A user will be defining this in their own wflow env file, anyway, so have the supported platforms do the same for consistency. I do not have enough familiarity with the WCOSS machines to make changes there.
There was a problem hiding this comment.
That sounds good.
But the same comment I have below for BUILD_ENV_FN in load_modules_run_task.sh goes here for WFLOW_ENV_FN, i.e. WFLOW_ENV_FN should be (re)set as necessary in setup.sh, i.e. in setup.sh, include this:
WFLOW_EN_FN=${WFLOW_ENV_FN:-"wflow_${machine}.env"}
so that it shows the correct value in var_defns.sh. Then it can be used here instead of needing the new variable env_fn (since var_defns.sh is sourced at the top of launch_FV3LAM_wflow.sh).
| @@ -1,3 +1,4 @@ | |||
| #!/bin/bash | |||
There was a problem hiding this comment.
Should be set explicitly.
There was a problem hiding this comment.
This actually will not have an effect since setup.sh just defines the setup() function; it does not contain a script. So you can't call setup() directly from a bash shell (unless you source setup.sh first); it has to be called from a script that sources setup.sh and which includes the #!/bin/bash at the top. I intentionally omit the #!/bin/bash from functions since it is not needed and as a reminder that the file defines a function, not a script.
|
We would like to use this PR to test some of our CI automation work, too. Could someone with appropriate permissions add the |
Done! |
|
I saw that the corresponding ufs-srweather-app PR was merged. While I doubt it was a problem in this case, the two were definitely linked and should have been merged together. What is the status of this PR? |
|
Will anyone be able to review this PR soon? |
| # | ||
| #----------------------------------------------------------------------- | ||
| # | ||
| # Continue appending variable definitions to the variable definitions |
There was a problem hiding this comment.
Can we not just write the RUN_CMD_FCST field to var_defns.sh file after all other (related) fields are set?
I tried running the generate script (after commenting out the reset lines for the RUN_CMD_* fields), and I see the following in the var_defns.sh file:
PE_MEMBER01="12"
RUN_CMD_FCST="mpirun -np ${PE_MEMBER01}"
Is this not what we'd expect?
There was a problem hiding this comment.
I don't think we want any variable representations that need to be expanded in the var_defns.sh file, do we?
At least I am currently working under that assumption in another PR given the current state of that config file.
There was a problem hiding this comment.
So long as the variable representations can be expanded (i.e., the required definitions are provided earlier in the file), I would think it's OK. @gsketefian, thoughts? I agree that an eventual goal of explicitly defining all fields/variables in the var_defn.sh file would be useful, and less prone to error.
There was a problem hiding this comment.
@christinaholtNOAA, minor typo in the comment block (occurance -> occurrence).
There was a problem hiding this comment.
The reason I am bringing this up is that assuming that variables should not be expanded when the file is sourced means that we can include templates in this file. Not in this particular PR, but for example, if we wanted to set a template for a path to a file on a platform, we can leave it in the var_defns.sh file like this:
EXTRN_MDL_SYSBASEDIR_ICS='/scratch1/NCEPDEV/rstprod/com/gfs/prod/gfs.${yyyymmdd}/${hh}/atmos/'
And then fill it in when we're ready with run-time variables like this (where data variables are already set):
sysdir=$(eval echo ${EXTRN_MDL_SYSBASEDIR_ICS})
There was a problem hiding this comment.
@christinaholtNOAA, yes, I understand how the config_defaults/config.sh scripts work. If we don't want to leave un-expanded variables in the var_defns.sh file, then we can't reference undefined variables in the config_defaults.sh or config.sh scripts.
Currently in config_defaults.sh, we see:
# RUN_CMD_FCST:
# The run command for the model forecast step. This will be appended to the end
# of the variable definitions file, so it can reference other variables.
Therefore, the original goal was to reference other variables and not expand anything. We should come up with another implementation to avoid having to replace an already defined field. I'll approve for now, but a goal of explicit variable definitions needs to be implemented to avoid a sed replacement call.
There was a problem hiding this comment.
I'm curious if it's time to move away from bash configuration files....
There was a problem hiding this comment.
Should I add this line below to ensure we don't have un-expanded variable references in the definition?
RUN_CMD_FCST=$(eval echo ${RUN_CMD_FCST})
That will allow the value that shows up in var_defns.sh to be:
NCORES_PER_NODE="40"
PE_MEMBER01="12"
RUN_CMD_FCST=mpirun -np 12
There was a problem hiding this comment.
Yeah, I was wondering how it was being expanded, if at all, so this "eval" command looks like it'll do the trick.
There was a problem hiding this comment.
Okay, I'll add that here, too.
| <command>&LOAD_MODULES_RUN_TASK_FP; "&RUN_FCST_TN;" "&JOBSDIR;/JREGIONAL_RUN_FCST"</command> | ||
| {%- if machine in ["JET", "HERA"] %} | ||
| {%- if machine in ["JET", "HERA", "LINUX"] %} | ||
| <cores>{{ ncores_run_fcst }}</cores> |
There was a problem hiding this comment.
Can we assume that all LINUX-based systems will need <cores> and <native> in the workflow as follows?
'ncores_run_fcst': ${PE_MEMBER01}
'native_run_fcst': --cpus-per-task ${OMP_NUM_THREADS_RUN_FCST} --exclusive
There was a problem hiding this comment.
I think we can assume that this will work for the Slurm-based linux systems. Let's see if we can get some input from @christopherwharrop-noaa on that conjecture.
There was a problem hiding this comment.
OK, I'm wondering if we may want to have a user-defined, Linux-based job scheduler option to handle these templating blocks. If that's too much work, then a comment somewhere saying this is specifically for Slurm-based Linux systems is probably fine.
There was a problem hiding this comment.
I believe @christinaholtNOAA is correct. That should work.
There was a problem hiding this comment.
@christopherwharrop-noaa Is it a fair assumption that Slurm + linux would account for a majority of the installations on unsupported platforms? Is there any good reason to choose PBS on any CSP? Would this Rocoto flag still work appropriately in that case? Can we only support so much flexibility?
There was a problem hiding this comment.
This is tricky to do in a portable and generic way across multiple batch system schedulers. Rocoto offers a tpp (threads per process) option in the <nodes> tag which should translate into this without using a <custom> tag. However, then you have to calculate the number of nodes needed, which in turn would require knowing how many cores per node there are, which is specific to a given machine. This is a hard problem to solve in a clean way given the machinery we have to work with.
There was a problem hiding this comment.
@christinaholtNOAA - Yes. That is a pretty safe assumption. Slurm appears dominant in that arena. Moreover, both AWS and Google (not sure about Azure) use Slurm for their HPC clustering solutions in the cloud.
There was a problem hiding this comment.
I think that the <cores> and <native> flags are the simplest direction to go for the generic Linux + Slurm + Rocoto setup.
We have the configurable options we need to support these: NCORES_PER_NODE and PE_MEMBER01 and OMP_NUM_THREADS_RUN_FCST.
There was a problem hiding this comment.
OK, that works. Can you add a comment somewhere that these settings are a Slurm-specific implementation of Linux with Rocoto? Thanks.
There was a problem hiding this comment.
Sure thing. Put it in config_default.sh.
These changes are all consistent with PR ufs-community#539 (merged) and PR ufs-community#617 (under review) for the develop branch of the NOAA-EMC:regional_workflow repository. This should enable the use of a generic linux machine with Rocoto for all jobs that exist in the current develop branch. More work is needed to tackle those tasks that do not yet exit in the develop branch, but exist here.
|
@JeffBeck-NOAA @gsketefian @mkavulich Can we move forward on this PR. It would be very helpful to get this set of changes into develop for a couple of different cloud porting activities. |
|
@JeffBeck-NOAA @gsketefian @mkavulich - I'd like to emphasize the importance of @christinaholtNOAA 's request to move this along to resolve concerns so this can be merged as soon as possible. The LINUX target work is crucial and is needed for other projects to move forward. This is the long pole in the tent keeping other work from getting done. I am not suggesting we compromise on quality of the review process, but am requesting that this PR get the attention required to bring it to the finish line. We need this now. |
|
@christinaholtNOAA @christopherwharrop-noaa, my review is more or less complete. I was waiting to hear from @mkavulich regarding the expansion of RUN_CMD_FCST directly within config.sh or when it's sourced to create var_defn.sh. |
|
Machine: hera |
| You are running on an unknown platform! The default value of | ||
| NCORES_PER_NODE = ${NCORES_PER_NODE} will be used unless you set | ||
| this variable in your configuration file." | ||
| ;; |
There was a problem hiding this comment.
@christinaholtNOAA, can you explain why this is necessary? If a default NCORES_PER_NODE is defined for LINUX above, then all supported platforms will have a value, and there will be no division by zero errors since a user can't choose an unsupported platform (they have to choose a value listed in the valid_param_vals.sh file, or the generation of the workflow will fail).
There was a problem hiding this comment.
@JeffBeck-NOAA - Almost every project I've ever been involved with entails running on a machine that is not supported. Those of us doing pioneering work need default values such as this to make life easier. When we need to modify the workflow infrastructure to support the functionality we need to do our work, having fail safes and warnings such as this helps a lot. Strictly speaking "LINUX" isn't really a platform so much as a workaround to enable users to run outside the confines of the "supported platforms". Even when a user chooses LINUX a lot of customization may be necessary to all of the "LINUX" cases scattered throughout the code. I have a lot to say about that but will reserve those comments for the appropriate discussion forum.
There was a problem hiding this comment.
@christinaholtNOAA, what I'm saying is this code will not be used. If a user defines anything other than a defined platform, the generation of the workflow will fail with:
MACHINE must be set to one of the following:
"WCOSS_CRAY" "WCOSS_DELL_P3" "HERA" "ORION" "JET" "ODIN" "CHEYENNE" "STAMPEDE" "LINUX"
There was a problem hiding this comment.
I take your point, but it is usually considered a bug to have a case statement that does not include the default * section in it to handle unexpected conditions. The problem is that it absolutely will be executed (but not on purpose) if someone adds a new platform to the list, and forgets to add the appropriate cases to all the various places in the code. In the absence of the default clause, the behavior of this code relies on assumed behavior elsewhere, which is subject to change. This replaces an undefined, difficult to diagnose, behavior with a defined one.
There was a problem hiding this comment.
OK, understand. Last question, I noticed you changed the NCORES_PER_NODE to be definable for each platform. What was the goal in doing that? I can see this would be an option for Jet, since the different partitions have different cores per node, and the LINUX target, but the NOAA HPC platforms are completely fixed. Thanks.
There was a problem hiding this comment.
This is an interesting solution, but will only get us the core count for the processor where the configuration script is run.
There was a problem hiding this comment.
That is an excellent point! But I think it's the only viable default. We may be splitting hairs here. The current code already prints a warning. If we feel that the warning is easily missed and will cause hard-to-find problems, we should fail.
There was a problem hiding this comment.
I know my input wasn't requested here specifically but I feel the need to chime in regardless: having had time to think about it and after reading the discussion here, I believe that NCORES_PER_NODE should be a mandatory user-specified variable for custom environments (i.e. for the generic LINUX target, not including NCORES_PER_NODE would quit with a descriptive error message). We have seen this (even on officially supported platforms!) in the past that if NCORES_PER_NODE is set to some incorrect value for that machine, we will see varying levels of poor performance that can be quite pernicious and go unnoticed for a very long time, costing money and resources for no reason. For this and the other reasons mentioned above I think it is reasonable to require users to know and specify this value for their system.
There was a problem hiding this comment.
I think that makes a lot of sense, @mkavulich and I fully support your recommendation. I was employing way too many gymnastics trying to come up with a viable strategy for a default that would work.
There was a problem hiding this comment.
I will go with failure here, if the value isn't set! Thanks for the input, y'all!
|
I am working on addressing the |
|
Okay, I think I've made all the requested changes. Please correct me if I'm wrong. If all is good, I'm happy with a merge as I've tested that the new changes result in the appropriate experiment directory changes (all mods were limited to configuration settings). |
|
|
||
| machine=$(echo_lowercase $MACHINE) | ||
| env_fn="build_${machine}_${COMPILER}.env" | ||
| env_fn=${BUILD_ENV_FN:-"build_${machine}_${COMPILER}.env"} |
There was a problem hiding this comment.
Sorry again for the late input. Since BUILD_ENV_FN now appears in config_defaults.sh, it will also appear in var_defns.sh. If it is not explicitly specified by the user in config.sh, it will be set to its default value (a null string) in var_defns.sh because currently it does not get reset to "build_${machine}_${COMPILER}.env" anywhere. That should happen in setup.sh, i.e. include this in setup.sh:
BUILD_ENV_FN=${BUILD_ENV_FN:-"build_${machine}_${COMPILER}.env"}
This should be placed somewhere before the line
cp_vrfy
in setup.sh. This will cause var_defns.sh to contain the correct name of the environment file. It will also make things easier in this file since now we don't need the env_fn variable; we can just use BUILD_ENV_FN (which will be available since var_defns.sh is sourced at the top of this file).
## DESCRIPTION OF CHANGES:
1. Enhance ability to use template variables in the experiment configuration file (either in the default configuration file `config_defaults.sh` or the user configuration file `config.sh`).
2. Modify WE2E test system to include test of template variable use.
3. Fix bugs.
### Notes on template variables:
A template variable (or simply a template) is an experiment variable that contains in its definition a reference to another variable(s). The referenced variable can be another experiment variable (i.e. one that is defined in `var_defns.sh`), or it can be a local variable (i.e. one that is not defined in `var_defns.sh` but in the script or function that sources `var_defns.sh` and uses the template). For example, a template named `TEMPL_VAR` my be defined in `config_defaults.sh` or `config.sh` as
`TEMPL_VAR='cd ${some_dir}'`
where `some_dir` may be an experiment variable or a local variable. `TEMPL_VAR` can then be evaluated using bash's `eval` built-in command in a script or function that first sources `var_defns.sh` and, if necessary, defines `some_dir`. Note that single quotes must be used on the right-hand side to avoid expansion of `${some_dir}` before run time (i.e. when `eval` is called on `TEMPL_VAR`). For details, see the documentation added in PR #[198](ufs-community/ufs-srweather-app#198).
### Changes to WE2E tests:
* Modify the WE2E test configuration file `config.deactivate_tasks.sh` to include template variables. `deactivate_tasks` now serves as a test of both deactivating tasks and of using template variables.
* Add `template_vars` as an alternate test name for `deactivate_tasks` (by creating a symlink named `config.template_vars.sh` that points to `config.deactivate_tasks.sh`).
### Bug fixes:
* In `get_WE2Etest_names_subdirs_descs.sh`, change the variable `alt_test_subdirs` to `alt_test_names` at a single location.
* In `setup.sh`, set `BUILD_ENV_FN` and `WFLOW_ENV_FN` (instead of in `load_modules_run_task.sh` and `launch_FV3LAM_wflow.sh`, respectively). This way, these variables will have the correct values in `var_defns.sh`.
* In `get_expts_status.sh`, fix the way `homerrfs` is calculated.
## TESTS CONDUCTED:
The WE2E tests `grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2` and `template_vars` were run on Hera. Both completed successfully.
## DOCUMENTATION:
Documentation is added to the User's Guide via PR #[198](ufs-community/ufs-srweather-app#198) into the ufs-srweather-app repo.
## Dependencies:
PR #[198](ufs-community/ufs-srweather-app#198) for the documentation.
## CONTRIBUTORS:
@christinaholtNOAA and @mkavulich brought up the issue of templates as part of PR #[617](https://github.com/NOAA-EMC/regional_workflow/pull/617).
DESCRIPTION OF CHANGES:
This modification allows for a user to run with Rocoto on their own Linux-based platform, which still requires modules to be installed.
TESTS CONDUCTED:
Ran with the corresponding linux_target branch on Hera. Passed the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GSD_SAR.
Tests for the new functionality were conducted on the RRFS Cloud AWS platform and within a Docker container. Currently no additional test was added to the test framework for this capability since we have no intent on supporting any defined linux configuration.
DEPENDENCIES:
DOCUMENTATION:
I will need guidance on where and how to add documentation for this addition, if it is necessary.
ISSUE (optional):
This PR addresses PR #610
CONTRIBUTORS (optional):
@christopherwharrop-noaa