Generalize single cycle script to also work for hofx #44

Merged
CoryMartin-NOAA merged 8 commits into develop from feature/hofx on Apr 11, 2022

@CoryMartin-NOAA
Contributor

Closes #43

This PR generalizes the work done in #42 so that a user can run just H(x) instead of the entire variational application.

It also simplifies the SLURM batch script generation to use multi-line strings where appropriate.
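
As a rough illustration of the multi-line string approach, the sketch below builds a batch script from a single f-string. The function name `build_slurm_script`, the exact `#SBATCH` directives, and the `module`/`srun` lines are assumptions chosen for the example, not the contents of the script in this PR; the dictionary keys follow the `job options` block of the sample YAML.

```python
from textwrap import dedent


def build_slurm_script(job, exe, yaml_file):
    """Return SLURM batch script text built from one multi-line f-string.

    `job` mirrors the `job options` mapping of the sample YAML. The
    directives and commands below are illustrative assumptions.
    """
    return dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name=GDASApp
        #SBATCH --output=GDASApp.o%j
        #SBATCH --account={job['account']}
        #SBATCH --qos={job['queue']}
        #SBATCH --partition={job['partition']}
        #SBATCH --time={job['walltime']}
        #SBATCH --ntasks={job['ntasks']}
        #SBATCH --cpus-per-task={job['cpus-per-task']}

        module purge
        module use {job['modulepath']}
        module load GDAS/{job['machine']}

        srun -n {job['ntasks']} {exe} {yaml_file}
        """)
```

Compared with appending one line at a time, the template reads like the script it produces, which makes missing directives easier to spot in review.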

The sample YAML changes slightly from #42, see:

working directory: /work2/noaa/stmp/cmartin/gdas_single_test_hofx
GDASApp home: /work2/noaa/da/cmartin/GDASApp/work/GDASApp
GDASApp mode: hofx
executable options:
  obs_yaml_dir: /work2/noaa/da/cmartin/GDASApp/work/GDASApp/parm/atm/obs/config
  yaml_template: /work2/noaa/da/cmartin/GDASApp/work/GDASApp/parm/atm/hofx/hofx_nomodel.yaml
  exe_path: /work2/noaa/da/cmartin/GDASApp/work/GDASApp/build/bin/fv3jedi_hofx_nomodel.x
  obs_list: /work2/noaa/da/cmartin/GDASApp/work/GDASApp/parm/atm/obs/lists/gdas_prototype.yaml
  gdas_fix_root: /work2/noaa/da/cmartin/GDASApp/fix
  atm: true
  layout_x: 1
  layout_y: 1
  atm_window_length: PT6H
  valid_time: 2021-12-21T06:00:00Z
  dump: gdas
  case: C96
  levs: 128
job options:
  machine: orion
  account: da-cpu
  queue: debug
  partition: debug
  walltime: '10:00'
  ntasks: 6
  cpus-per-task: 1
  modulepath: /work2/noaa/da/cmartin/GDASApp/work/GDASApp/modulefiles

Note that some entries are optional for hofx compared to var. The other changes from #42 are: there is now a GDASApp mode entry; the section is named executable options instead of analysis options; yaml_template replaces var_yaml; and exe_path is new and points to the executable to use at runtime.
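
For anyone holding a configuration written for #42, the renames above could be applied mechanically. The following is a minimal sketch that works on an already-parsed mapping; `migrate_config` is a hypothetical helper, not something this PR adds, and the mode string and executable path are passed in by the caller because the old format has no equivalent entries.

```python
import copy

# Renames described in the PR text: (old name, new name).
SECTION_RENAME = ("analysis options", "executable options")
KEY_RENAME = ("var_yaml", "yaml_template")


def migrate_config(old, mode, exe_path):
    """Return a copy of a #42-style config using the new key names.

    `mode` fills the new `GDASApp mode` entry and `exe_path` fills the new
    `exe_path` entry; existing values are never overwritten.
    """
    new = copy.deepcopy(old)
    if SECTION_RENAME[0] in new:
        new[SECTION_RENAME[1]] = new.pop(SECTION_RENAME[0])
    opts = new.setdefault(SECTION_RENAME[1], {})
    if KEY_RENAME[0] in opts:
        opts[KEY_RENAME[1]] = opts.pop(KEY_RENAME[0])
    opts.setdefault("exe_path", exe_path)
    new.setdefault("GDASApp mode", mode)
    return new
```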

@CoryMartin-NOAA CoryMartin-NOAA self-assigned this Apr 8, 2022
@CoryMartin-NOAA CoryMartin-NOAA added the orion-RT Queue for automated testing on Orion label Apr 8, 2022
@emcbot emcbot added orion-RT-Running Automated testing running on Orion and removed orion-RT Queue for automated testing on Orion labels Apr 8, 2022
@emcbot

emcbot commented Apr 8, 2022

Automated Pull Request Testing Results:
Machine: orion

Start: Fri Apr  8 12:30:19 CDT 2022 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Fri Apr  8 13:13:51 CDT 2022
---------------------------------------------------
Tests:                                 *SUCCESS*
Tests: Completed at Fri Apr  8 13:14:51 CDT 2022
Tests: 100% tests passed, 0 tests failed out of 11

@emcbot emcbot added orion-RT-Passed Automated testing successful on Orion and removed orion-RT-Running Automated testing running on Orion labels Apr 8, 2022
@RussTreadon-NOAA
Contributor

General comment: When I see exe_path in sample.yaml, I think of the path to the executable I want to run. exe_path actually includes both the path and the executable. Can we rename exe_path to reflect this? Doing so improves readability.

@CoryMartin-NOAA
Contributor Author

> General comment: When I see exe_path in sample.yaml, I think of the path to the executable I want to run. exe_path actually includes both the path and the executable. Can we rename exe_path to reflect this? Doing so improves readability.

@RussTreadon-NOAA do you have a suggestion on an alternative name? Perhaps just executable or fv3jedi_exe?

@RussTreadon-NOAA
Contributor

@CoryMartin-NOAA , executable or fv3jedi_exe work for me. I like fv3jedi_exe but it's possible (likely) GDASApp will execute more than just executables prefixed with fv3jedi_. I see ioda*.x, mom6.x, soca_*.x, etc. Given this, executable is the more generic variable. Another option is to leave exe_path as the path ... but then we need to add another variable to hold the executable name. Let's not go there.
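
To illustrate why a single entry holding the full path is enough, the directory and file name can both be recovered from it when needed. This is a small sketch; `split_executable` is a hypothetical helper name, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def split_executable(executable):
    """Split a full executable path into (directory, file name)."""
    path = PurePosixPath(executable)
    return str(path.parent), path.name
```

With the sample YAML value, this returns the `build/bin` directory and `fv3jedi_hofx_nomodel.x`, so no second variable for the executable name is required.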

Comment thread ush/run_fv3jedi_exe.py Outdated
Comment thread ush/run_fv3jedi_exe.py Outdated
Comment thread ush/ufsda/misc_utils.py
@RussTreadon-NOAA
Contributor

The variational test runs close to the specified 10-minute wall clock limit. Since this is a technical, not science, test, we can reduce ninner in 3dvar_dripcg.yaml from 50 and 100 to 5 and 10, or something similarly small.

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA which is a better 'medium-term' solution:

  • make ninner a variable configurable through YAML?
  • have 2 YAML files 3dvar_dripcg.yaml and 3dvar_dripcg_regtest.yaml or something like that?

I would say the former is 'better' but I worry about users having too many things to configure by default.

@RussTreadon-NOAA
Contributor

@CoryMartin-NOAA , will issue #45 add 3dvar_dripcg_regtest.yaml? If it does, then we can reduce ninner in 3dvar_dripcg.yaml. I, too, am leery of requiring users to set too many things in sample.yaml.

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA I was thinking the opposite, actually.
3dvar_dripcg.yaml would be the full 'experiment' and 3dvar_dripcg_regtest.yaml to be added in #45 would only contain 1/10 of the iterations just to verify the results are identical (or within a tolerance). Basically those two would be identical save for the iteration counts and perhaps other features we want to include or exclude. But I'm open to other solutions.
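
One way to keep the two files identical apart from the iteration counts would be to derive the regression-test configuration from the full one. The sketch below works on a parsed mapping and assumes the inner-loop counts live under `variational` -> `iterations` -> `ninner`; that layout and the helper name `make_regtest_config` are assumptions for illustration, not taken from 3dvar_dripcg.yaml.

```python
import copy


def make_regtest_config(config, divisor=10):
    """Return a copy of `config` with every ninner divided by `divisor`.

    Each outer iteration keeps at least one inner iteration.
    """
    reg = copy.deepcopy(config)
    for iteration in reg["variational"]["iterations"]:
        iteration["ninner"] = max(1, int(iteration["ninner"]) // divisor)
    return reg
```

With ninner values of 50 and 100 and the default divisor, the derived configuration would use 5 and 10.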

@RussTreadon-NOAA
Contributor

@CoryMartin-NOAA. Got it. This makes sense. #45 is a ctest. It should be fast. Therefore, fewer iterations. We can punt and leave ninner as 50 and 100 or reduce each by a factor of two, 25 and 50. My concern at present is that the variational job runs close to the specified 10 minute wall clock limit.

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA ok that works for me if it works for you, and I'll make sure to choose an appropriate wall clock time when I add the YAML file that will be used by the regression test. I will say that I am imagining something a bit more akin to the GSI ctest and not the JEDI ctest, in which this test in #45 will probably take on the order of minutes to execute, and not seconds.

@RussTreadon-NOAA
Contributor

Sounds good, @CoryMartin-NOAA . Here are run times I'm currently seeing for the variational job

OOPS_STATS Run end                                  - Runtime:    513.06 sec
OOPS_STATS Run end                                  - Runtime:    491.65 sec
OOPS_STATS Run end                                  - Runtime:    503.22 sec
OOPS_STATS Run end                                  - Runtime:    507.40 sec
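
To put those numbers next to the limit, here is a small sketch of the headroom calculation. It assumes the walltime of '10:00' in the sample YAML means 10 minutes (600 s) and compares it only against the OOPS-reported runtimes, so any job start-up time outside the executable is not counted.

```python
# Runtimes (seconds) copied from the OOPS_STATS lines above.
runtimes = [513.06, 491.65, 503.22, 507.40]
walltime = 10 * 60  # walltime '10:00' taken as 10 minutes


def headroom(runtimes, walltime):
    """Return (seconds to spare, fraction of limit used) for the slowest run."""
    worst = max(runtimes)
    return walltime - worst, worst / walltime


margin, used = headroom(runtimes, walltime)
print(f"slowest run leaves {margin:.2f} s, using {used:.1%} of the limit")
```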

On a different note, I still see

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

at the top of GDASApp.o*. If I manually execute sbatch submit_job.sh, the resulting GDASApp.o* file does not contain the above text.

Something odd happens when
subprocess.Popen(f"sbatch {job_script}", cwd=working_dir, shell=True)
submits the job in my environment. This is probably a user (me) problem.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Changes look good. Installed feature/hofx on orion. Exercised branch in both hofx and variational modes. Both configurations ran to completion.

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA I'm actually thinking we hold off on merging until I can investigate the subprocess vs shell submission differences (I looked at the beginning of my log files and I see them too, so it's not just an issue on your end). I will do some digging on Monday and see what I can find. Have a nice weekend!

@RussTreadon-NOAA
Contributor

Some observations regarding the message

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

in the GDASApp.o log file.

  1. Add "set -x" to submit_job.sh when built by misc_utils.py. The subsequent GDASApp.o log file shows that the above message is generated when module purge is executed.

  2. Comment out module purge in misc_utils.py and rerun run_jedi_exe.py. Resulting GDASApp.o contains

Due to MODULEPATH changes, the following have been reloaded:
  1) GDAS/orion


CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

The above appears to be generated following
module use /work/noaa/da/rtreadon/git/GDASApp/hofx/modulefiles

  3. Replace subprocess.Popen(f"sbatch {job_script}", cwd=working_dir, shell=True) with os.system(f"sbatch {job_script}") with module purge present in submit_job.sh. Once again see
CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

at the top of GDASApp.o.

In all cases submit_job.sh successfully runs to completion. The warning message at the top of GDASApp.o, however, is disturbing.

Do we need to examine orion.lua? Might the modules which are loaded or the order in which modules are loaded have something to do with the message in GDASApp.o? Would adding setenv or other commands in orion.lua help? Do users need to add something to their shell rc file? More questions than answers.

@RussTreadon-NOAA
Contributor

@CoryMartin-NOAA , some additional information.

If I manually sbatch submit_job.sh from a PuTTY session which has not loaded modulefile GDAS/orion, no error messages are written to the top of the resulting GDASApp.o log file. I only see the error messages when submitting from a PuTTY session in which modulefile GDAS/orion has been loaded.

Does this indicate an issue with either the conda gdasapp environment or miniconda3/4.6.14?

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA in fact, I know the issue is coming from the conda env, but not immediately sure why.
If I run module purge manually after running module load GDAS/orion, I get the same message.

So this is a problem with the conda env modulefile and not with the script added in this PR. Let me do some digging to see why this message appears.

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA I know what the problem is.

module gdasapp depends on miniconda3/4.6.14.

When module purge happens, it seems like some command in miniconda is now missing.
If you do:

module unload gdasapp
module purge

then there is no error.

I'm not sure if we can modify the modulefile to account for this behavior or not. In the meantime, we could add a module unload gdasapp to the script before module purge. Or we can not worry about the message for now. What is your preference?

@CoryMartin-NOAA
Contributor Author

@RussTreadon-NOAA I have a solution that (perhaps) is sufficient.

I've modified the modulefile on Orion:
/work2/noaa/da/python/opt/modulefiles/python/miniconda3/4.6.14/gdasapp/1.0.0.lua
and changed:

  if (mode() == "unload") then
    local unload_cmd = "conda deactivate"
    execute{cmd=unload_cmd, modeA={"unload"}}
  end

to

  if (mode() == "unload") then
    if (isloaded("miniconda3")) then
      local unload_cmd = "conda deactivate"
      execute{cmd=unload_cmd, modeA={"unload"}}
    end
  end

The warning goes away because it will only run conda deactivate if the module containing that command is still loaded. As far as I can tell, the module purge successfully removes this python env from the user's path.

@RussTreadon-NOAA
Contributor

@CoryMartin-NOAA , your fix works for me. I executed ./run_jedi_exe.py --config sample_hofx.yaml with GDAS/orion loaded (gdasapp env). The GDASApp.o log file for the resulting job no longer contains any conda messages. This is great!
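
A check like the one described could also be scripted. The sketch below looks for the conda message near the top of a job log's text; `has_conda_warning` and the 50-line window are hypothetical choices for illustration, and the marker string is the first line of the message quoted earlier in this thread.

```python
CONDA_MARKER = "CommandNotFoundError: Your shell has not been properly configured"


def has_conda_warning(log_text, head_lines=50):
    """Return True if the conda message appears in the first `head_lines` lines."""
    head = log_text.splitlines()[:head_lines]
    return any(CONDA_MARKER in line for line in head)
```

Reading a GDASApp.o* file and passing its contents to this function would flag the message if a later modulefile change brought it back.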

@CoryMartin-NOAA CoryMartin-NOAA merged commit 30e95a4 into develop Apr 11, 2022
@CoryMartin-NOAA CoryMartin-NOAA deleted the feature/hofx branch April 11, 2022 13:48
DavidNew-NOAA added a commit that referenced this pull request Jan 6, 2026
Does the same thing as jcb-algorithms PR
[#8](NOAA-EMC/jcb-algorithms#8)
RussTreadon-NOAA pushed a commit that referenced this pull request Jan 16, 2026
Does the same thing as jcb-algorithms PR
[#8](NOAA-EMC/jcb-algorithms#8)
Labels

orion-RT-Passed Automated testing successful on Orion

Development

Successfully merging this pull request may close these issues.

Stand alone hofx capability

3 participants