[develop] Enable workflow runs on single node linux/mac machine using rocoto. #508
@danielabdi-noaa - Here is my feedback:
@natalie-perlin Note that although you can now run any workflow from a WE2E test case on linux/mac, that doesn't mean it will run successfully on a single workstation, for several reasons.
About your points
@danielabdi-noaa From my perspective, any increase in unnecessary complications only "darkens" the SRW App, moving it from a "gray box" towards more of a "black box". It would seem a bad idea to anybody involved in community model development, and most definitely a step away from making it more accessible to the community (not only for developers). Yes, rocoto is simple to install, I agree on that. However, if abandoning stand-alone wrapper scripts is the only way to proceed with SRW development and "rocoto" needs to be made an additional prerequisite, using the "fake slurm" scripts could be a workaround to make a workflow functional on all systems.
christopherwharrop-noaa
left a comment
This is an interesting approach. I have two questions. First, I'm not seeing how the exit status is being propagated to the file for later retrieval. And second, how are entries in the .job_database being cleaned up?
    de=\$(date --utc -d '$SECS sec' +%Y-%m-%d:%H:%M:%S); \
    echo $JOBNAME pid \$$ started \$ds ends \$de >>.job_database; \
    \
    ${CTIM} ${CMD} &>$LOG; \
Don't you need to have an echo $? in here so that the status of the timeout is actually written to $LOG? I'm not seeing how the exit status is being written to $LOG for later retrieval.
The exit code is retrieved from the log file (SRW specific solution) in the line below. I did try $? at first but it was reporting 0 (success) for failed jobs -- didn't investigate further. I will try again since that is a generic solution.
I have now made it use $? directly. I think in my previous test I forgot to escape it as \$?; without the escape it always reports exit code = 0.
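For anyone following the thread, here is a minimal sketch of why the escape matters when a command line is assembled inside a double-quoted string. It is not the PR's script: `CMD` is a hypothetical stand-in for a failing task, and the inner `bash -c` stands in for whatever shell eventually runs the generated command.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a task that fails.
CMD="false"

# Unescaped: $? is expanded right here by the outer shell, before CMD ever
# runs, so the status of the previous outer command (0) is baked into the string.
true
unescaped="${CMD}; echo exit=$?"

# Escaped: \$? stays literal in the string and is only expanded by the
# inner shell, after CMD has actually run.
escaped="${CMD}; echo exit=\$?"

out_unescaped=$(bash -c "$unescaped")
out_escaped=$(bash -c "$escaped")

echo "unescaped: $out_unescaped"
echo "escaped:   $out_escaped"
```

The unescaped variant reports success even though the task failed, which matches the behavior described above.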
MichaelLueken
left a comment
@danielabdi-noaa Similar to @mark-a-potts's request for ush/machine/linux.yaml, should ulimit -s unlimited be added to ush/machine/macos.yaml as well? This seems to be done for the rest of the machine files.
@danielabdi-noaa I have added the DO_NOT_MERGE label until
Please update your feature/fake_slurm branch to the latest develop, address the conflicts, and update the linux and macos machine files, then this work will be ready to be merged.
One other issue I forgot to mention. Line 34 of etc/lmod-setup.sh should be this-- It currently is /usr/share/share/lmod/init/bash, which is wrong. Not sure where that bug got introduced.
@mark-a-potts Thanks for testing! It is good to get confirmation that it works on a system other than mine. @MichaelLueken I will update the branch and make the changes that you and Mark requested later in the afternoon - I am currently at AMS. Thanks!
@danielabdi-noaa Thank you very much for addressing @mark-a-potts's and my final concerns! I resubmitted the Jenkins tests yesterday evening and they passed. I have removed the DO_NOT_MERGE label and will now merge this work.
DESCRIPTION OF CHANGES:
This PR mainly addresses issues #473 and #507. Fake slurm commands are added for the linux/mac setup to make rocoto workflow runs possible. However, this is not the best solution. If a lightweight slurm can be installed on linux/mac to manage resources, or if the idea behind the fake slurm batch commands is incorporated back into rocoto, the fake commands will no longer be needed.
Edit: It looks like someone needed this kind of capability in rocoto and implemented the "NoBatchSystem" option in this issue
I am looking into that now. Specifying "no/none" as the batch system did not seem to do the trick. Edit2: Support through rocoto changes for NoBatchSystem is on hold as it needs work. For now the fake slurm commands can be used but once rocoto has a NoBatchSystem that supports SRW, this solution can be removed.
Detailed set of changes
- sacct, sbatch, scancel, squeue, sinfo, srun for use on single node linux/mac. The first four are used in the same way rocoto uses them here. The last two commands are not used by rocoto, so they are provided just for completeness.
- .job_database is created under the EXPTDIR to keep track of experiment tasks and their state: whether they are submitted or completed, their exit code, job submission/start/completion times, etc., i.e. whatever is needed to make squeue and sinfo work. Here is an example .job_database for the deactivate_tasks test case.
- ush/wrappers are removed because they are outdated and can't really do what the rocoto xml file does.
- miniconda3 initialization logic added to linux/mac wflow modulefiles. I was not able to use the hpc-stack miniconda3 installation since that would require loading hpc/1.2.0 first in the wflow_linux.lua file.

I have run the deactivate_tasks test from WE2E after setting USE_USER_STAGED_EXTRN_FILES: false, which generated the workflow and ran the tasks using cron + rocoto successfully.
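As a rough illustration of the bookkeeping idea (not the actual scripts in this PR), the sketch below records each task's start time and exit code in a plain-text database and looks the exit code up again by job name. The record layout and the function names `fake_submit` and `fake_status` are hypothetical, the task runs synchronously rather than in the background, and `timeout` is assumed to be the GNU coreutils command.

```shell
#!/usr/bin/env bash
# Sketch only: the record layout and function names are hypothetical.
DB="$(mktemp -d)/.job_database"

# Run a command under a time limit and append one record describing it.
fake_submit() {
  local jobname=$1 secs=$2
  shift 2
  local started status=0
  started=$(date -u +%Y-%m-%d:%H:%M:%S)
  # Capture the exit status of the timed command itself.
  timeout "$secs" "$@" >/dev/null 2>&1 || status=$?
  echo "$jobname started $started exit $status" >>"$DB"
}

# Print the recorded exit code for a job name (the last record wins).
fake_status() {
  awk -v job="$1" '$1 == job { code = $NF } END { print code }' "$DB"
}

fake_submit make_grid 5 true
fake_submit run_fcst 5 false

ok_code=$(fake_status make_grid)
fail_code=$(fake_status run_fcst)
echo "make_grid=$ok_code run_fcst=$fail_code"
```

Keeping the state in a flat file means the query commands only need to read and filter text, at the cost of having to clean up stale entries separately.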
Here is what it looks like to execute rocoto and fake slurm commands
Type of change
TESTS CONDUCTED:
DEPENDENCIES:
None
DOCUMENTATION:
Needs update on how to install rocoto and miniconda3 on linux/mac.
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):
@natalie-perlin