Add master scrontab runner to serialize CI experiment rocoto calls #58
Closed
Copilot wants to merge 5 commits into
When generate_workflows.sh is used on SLURM-managed scron systems (e.g.
Gaea), running all rocoto instances simultaneously can exhaust head-node
memory and cause OOM errors.
Changes:
- Collect individual .scron.sh paths in an array during experiment creation
instead of immediately writing per-experiment entries to tests.cron
- After all experiments are created, generate a single master runner script
(rocoto_master_run.sh) in ${RUNTESTS}/EXPDIR/ that calls each experiment's
.scron.sh sequentially
- Place a single scrontab entry pointing to the master script, with:
  - partition/account pulled from the first experiment's .crontab
  - wall-time computed as 10 min × number of experiments
  - --dependency=singleton to prevent overlapping runs
- Email notification behavior is unchanged: each individual .scron.sh
continues to send failure emails via its own monitoring logic
Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/051bf98c-4777-4976-8c4e-a69d634ab5a1
Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
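The approach in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the experiment names (exp_a, exp_b, exp_c), the temporary directory, and the variable names are hypothetical stand-ins, and the stub .scron.sh scripts only log their own name instead of calling rocoto.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for the real experiment directories.
workdir="$(mktemp -d)"
scron_sh_files=()

# Collect each experiment's .scron.sh path instead of writing a cron entry per experiment.
for pslot in exp_a exp_b exp_c; do
    script="${workdir}/${pslot}.scron.sh"
    printf '#!/usr/bin/env bash\necho "%s" >> "%s/order.log"\n' "${pslot}" "${workdir}" > "${script}"
    chmod +x "${script}"
    scron_sh_files+=("${script}")
done

# Generate one master runner that calls each script in turn.
# The master has no 'set -e', so a failing experiment does not stop
# the remaining ones (an assumption of this sketch).
master="${workdir}/rocoto_master_run.sh"
{
    echo '#!/usr/bin/env bash'
    for script in "${scron_sh_files[@]}"; do
        printf '%q\n' "${script}"
    done
} > "${master}"
chmod +x "${master}"

# Run the master once and show the order in which the experiments ran.
"${master}"
run_order="$(paste -sd ' ' "${workdir}/order.log")"
echo "${run_order}"
```

Because the scripts run one after another inside a single process, at most one experiment's rocoto call is active at any time, which is the property the change relies on to limit head-node memory use.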
… per experiment
- Add explicit guard that _yaml_list is non-empty before accessing _yaml_list[0]
- Add error check that #SCRON --partition= and #SCRON --account= are present in the first experiment's crontab before writing the master scrontab entry
- Make minutes-per-experiment configurable via the SCRON_MINUTES_PER_EXPERIMENT environment variable (default: 10) instead of hardcoding the value
Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/051bf98c-4777-4976-8c4e-a69d634ab5a1
Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
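The guards and the configurable wall time described in this commit might look something like the sketch below. The helper names get_scron_option and format_wall_time, the demo crontab contents, and the yaml_list values are assumptions made for illustration; only SCRON_MINUTES_PER_EXPERIMENT and the #SCRON directive names come from the commit message. A later commit in this PR replaces the computed wall time with a fixed 00:10:00.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the value of a "#SCRON --<name>=<value>" directive, or fail if it is absent.
get_scron_option() {
    local name="$1" crontab_file="$2" value
    value="$(sed -n "s/^#SCRON[[:space:]]*--${name}=\([^[:space:]]*\).*/\1/p" "${crontab_file}" | head -n 1)"
    if [[ -z "${value}" ]]; then
        echo "ERROR: no '#SCRON --${name}=' directive in ${crontab_file}" >&2
        return 1
    fi
    echo "${value}"
}

# Format HH:MM:00 for <count> experiments at <minutes> minutes each.
format_wall_time() {
    local total=$(( $1 * $2 ))
    printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

# Hypothetical per-experiment crontab, standing in for the first experiment's file.
demo_crontab="$(mktemp)"
cat > "${demo_crontab}" << 'EOF'
#SCRON --partition=batch
#SCRON --account=my_account
*/5 * * * * /path/to/exp_a.scron.sh
EOF

# Guard: refuse to continue with an empty experiment list.
yaml_list=(exp_a exp_b exp_c)
if (( ${#yaml_list[@]} == 0 )); then
    echo "ERROR: no experiments found" >&2
    exit 1
fi

partition="$(get_scron_option partition "${demo_crontab}")"
account="$(get_scron_option account "${demo_crontab}")"
wall_time="$(format_wall_time "${SCRON_MINUTES_PER_EXPERIMENT:-10}" "${#yaml_list[@]}")"
echo "partition=${partition} account=${account} time=${wall_time}"
```

Failing early when a directive is missing avoids writing a master scrontab entry with an empty partition or account, which Slurm would otherwise reject or misroute at submission time.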
Copilot AI changed the title from "[WIP] Modify scrontab scripts to cycle through CI tests" to "Add master scrontab runner to serialize CI experiment rocoto calls" on Apr 27, 2026
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Per reviewer request, replace the dynamic per-experiment wall-time calculation with a fixed 00:10:00; 10 minutes total is enough time to run rocotorun sequentially across all experiments.
Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/d5d8659b-071a-44bc-ad67-5c6464622085
Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
Description
On SLURM-managed scron systems (e.g. Gaea), generate_workflows.sh was writing one independent scrontab entry per experiment. All entries fired every 5 minutes simultaneously, launching multiple concurrent rocotorun processes that exhausted head-node memory (OOM).
Changes
dev/workflow/generate_workflows.sh
- Collect each .scron.sh path into _scron_sh_files[]
- Generate ${RUNTESTS}/EXPDIR/rocoto_master_run.sh, a sequential runner that calls each .scron.sh one at a time
- Write a single scrontab entry (--dependency=singleton) pointing to the master script, with --partition and --account sourced from the first experiment's .crontab
- Use a fixed wall time of 00:10:00
- Report an error if the #SCRON --partition= / #SCRON --account= directives are missing from the crontab
- Each .scron.sh still runs its own rocotostat-based monitoring and sends alerts independently
Resolves NOAA-EMC#2619
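For orientation, a single master scrontab entry of the kind described above could be composed as in the following sketch. The function name write_master_scrontab_entry, the partition and account values, and the master script path are placeholders, and the use of --time for the wall time is an assumption about how the fixed 00:10:00 is expressed; the actual entry written by generate_workflows.sh may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit one scrontab entry for the master runner.
# The #SCRON lines carry sbatch-style options for the cron line that follows them.
write_master_scrontab_entry() {
    local partition="$1" account="$2" master_script="$3"
    cat << EOF
#SCRON --partition=${partition}
#SCRON --account=${account}
#SCRON --time=00:10:00
#SCRON --dependency=singleton
*/5 * * * * ${master_script}
EOF
}

# Placeholder values; in practice these would come from the first experiment's crontab.
entry="$(write_master_scrontab_entry batch my_account /path/to/EXPDIR/rocoto_master_run.sh)"
echo "${entry}"
```

With --dependency=singleton, Slurm holds a new job until earlier jobs with the same name and user have finished, so a run that takes longer than the 5-minute cron interval does not overlap with the next one.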
Type of change
Change characteristics
How has this been tested?
*/5 timing line in scrontab, correct shebang, executable permissions, all experiment scripts referenced in master, static 10-minute wall time in generated scrontab entry
Checklist