Add master scrontab runner to serialize CI experiment rocoto calls#58

Closed
Copilot wants to merge 5 commits into develop from copilot/modify-scrontab-scripts

Copilot AI commented Apr 27, 2026

Description

On SLURM-managed scron systems (e.g. Gaea), generate_workflows.sh was writing one independent scrontab entry per experiment. Every entry fired on the same 5-minute schedule, so multiple rocotorun processes ran concurrently and exhausted head-node memory (OOM).

Changes

dev/workflow/generate_workflows.sh

  • Instead of writing a per-experiment scrontab entry in the loop, collect each experiment's .scron.sh path into _scron_sh_files[]
  • After all experiments are created, generate ${RUNTESTS}/EXPDIR/rocoto_master_run.sh — a sequential runner that calls each .scron.sh one at a time:
    if [[ -x "/path/to/exp1.scron.sh" ]]; then "/path/to/exp1.scron.sh"; fi
    if [[ -x "/path/to/exp2.scron.sh" ]]; then "/path/to/exp2.scron.sh"; fi
    ...
  • Write a single scrontab entry (with --dependency=singleton) pointing to the master script — --partition and --account sourced from the first experiment's .crontab
  • Wall-time for the master scrontab entry is fixed at 00:10:00
  • Error out clearly if required #SCRON --partition= / #SCRON --account= directives are missing from the crontab
  • Email failure notifications unchanged — each individual .scron.sh still runs its own rocotostat-based monitoring and sends alerts independently
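
The generation step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `write_master_runner` and `print_master_entry` are hypothetical names, the demo experiment scripts and the `demo_partition` / `demo_account` values are made up, and the exact layout of the scrontab entry is an assumption pieced together from the `#SCRON` directives, the `*/5` schedule, and the fixed 00:10:00 wall time mentioned in this description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: write a runner that calls each .scron.sh in order.
# Usage: write_master_runner <output_script> <scron_sh>...
write_master_runner() {
  local out="$1"
  shift
  local f
  {
    echo '#!/usr/bin/env bash'
    for f in "$@"; do
      # %q shell-quotes the path so unusual characters survive
      printf 'if [[ -x %q ]]; then %q; fi\n' "${f}" "${f}"
    done
  } > "${out}"
  chmod +x "${out}"
}

# Hypothetical helper: print one scrontab entry for the master script.
# The directive layout is an assumption, not copied from the PR diff.
print_master_entry() {
  local partition="$1" account="$2" script="$3"
  printf '#SCRON --partition=%s\n' "${partition}"
  printf '#SCRON --account=%s\n' "${account}"
  printf '#SCRON --time=00:10:00\n'
  printf '#SCRON --dependency=singleton\n'
  printf '*/5 * * * * %s\n' "${script}"
}

# Demo with throwaway experiment scripts in a temporary directory
demo_dir="$(mktemp -d)"
for name in exp1 exp2; do
  printf '#!/usr/bin/env bash\necho "running %s"\n' "${name}" \
    > "${demo_dir}/${name}.scron.sh"
  chmod +x "${demo_dir}/${name}.scron.sh"
done

_scron_sh_files=("${demo_dir}/exp1.scron.sh" "${demo_dir}/exp2.scron.sh")
write_master_runner "${demo_dir}/rocoto_master_run.sh" "${_scron_sh_files[@]}"

# Run the generated master script, then show the single scrontab entry
"${demo_dir}/rocoto_master_run.sh"
master_entry="$(print_master_entry demo_partition demo_account \
  "${demo_dir}/rocoto_master_run.sh")"
echo "${master_entry}"
```

In this sketch the generated runner does not use `set -e`, so a failure in one experiment's script does not prevent the remaining scripts from running; whether the real script behaves that way is not stated in this description.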

Resolves NOAA-EMC#2619

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this change expected to change outputs (e.g. value changes to existing outputs, new files stored in COM, files removed from COM, filename changes, additions/subtractions to archives)? NO
    • GFS
    • GEFS
    • SFS
    • GCAFS
  • Is this a breaking change (a change in existing functionality)? NO — regular crontab (non-scron) path is unchanged; scron behavior changes from N parallel entries to 1 master entry
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • End-to-end shell simulation verified: single */5 timing line in scrontab, correct shebang, executable permissions, all experiment scripts referenced in master, static 10-minute wall time in generated scrontab entry

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

Copilot AI and others added 2 commits April 27, 2026 13:44
When generate_workflows.sh is used on SLURM-managed scron systems (e.g.
Gaea), running all rocoto instances simultaneously can exhaust head-node
memory and cause OOM errors.

Changes:
- Collect individual .scron.sh paths in an array during experiment creation
  instead of immediately writing per-experiment entries to tests.cron
- After all experiments are created, generate a single master runner script
  (rocoto_master_run.sh) in ${RUNTESTS}/EXPDIR/ that calls each experiment's
  .scron.sh sequentially
- Place a single scrontab entry pointing to the master script, with:
  - partition/account pulled from the first experiment's .crontab
  - wall-time computed as 10 min × number of experiments
  - --dependency=singleton to prevent overlapping runs
- Email notification behavior is unchanged: each individual .scron.sh
  continues to send failure emails via its own monitoring logic
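
The wall-time rule in this first commit (10 minutes times the number of experiments) could be computed along these lines. This is only a sketch: `scron_wall_time` is a hypothetical function name, and the HH:MM:SS output format is an assumption based on the 00:10:00 value quoted in the PR description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: wall time for N experiments at M minutes each,
# printed as HH:MM:SS. M defaults to 10, matching the commit message.
scron_wall_time() {
  local n_experiments="$1"
  local minutes_each="${2:-10}"
  local total=$(( n_experiments * minutes_each ))
  printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

scron_wall_time 1     # prints 00:10:00
scron_wall_time 7     # prints 01:10:00
scron_wall_time 3 5   # prints 00:15:00
```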

Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/051bf98c-4777-4976-8c4e-a69d634ab5a1

Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
… per experiment

- Add explicit guard that _yaml_list is non-empty before accessing _yaml_list[0]
- Add error check that #SCRON --partition= and #SCRON --account= are present
  in the first experiment's crontab before writing master scrontab entry
- Make minutes-per-experiment configurable via SCRON_MINUTES_PER_EXPERIMENT
  environment variable (default: 10) instead of hardcoding the value
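
The two guards added in this commit might look something like the sketch below. It is illustrative only: `get_scron_directive` is a hypothetical helper, the `#SCRON --name=value` line format is assumed from the directives named above, and the demo crontab contents (including `demo_partition` and `demo_account`) are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the value of the first "#SCRON --<name>=<value>"
# line in a crontab file, or fail with a clear message if none is present.
get_scron_directive() {
  local name="$1" file="$2" line
  local re="^#SCRON[[:space:]]+--${name}=([^[:space:]]+)"
  while IFS= read -r line; do
    if [[ "${line}" =~ ${re} ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"
      return 0
    fi
  done < "${file}"
  echo "ERROR: no '#SCRON --${name}=' directive found in ${file}" >&2
  return 1
}

# Guard: refuse to index element 0 of an empty experiment list
_yaml_list=("exp1.yaml")   # demo contents, not real experiment names
if (( ${#_yaml_list[@]} == 0 )); then
  echo "ERROR: no experiments were requested" >&2
  exit 1
fi

# Demo crontab standing in for the first experiment's .crontab
demo_crontab="$(mktemp)"
cat > "${demo_crontab}" <<'EOF'
#SCRON --partition=demo_partition
#SCRON --account=demo_account
*/5 * * * * /path/to/exp1.scron.sh
EOF

partition="$(get_scron_directive partition "${demo_crontab}")"
account="$(get_scron_directive account "${demo_crontab}")"
echo "partition=${partition} account=${account}"
```

Because the helper returns non-zero when a directive is missing, a caller running under `set -e` stops before any scrontab entry is written.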

Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/051bf98c-4777-4976-8c4e-a69d634ab5a1

Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Modify scrontab scripts to cycle through CI tests" to "Add master scrontab runner to serialize CI experiment rocoto calls" Apr 27, 2026
Copilot AI requested a review from DavidHuber-NOAA April 27, 2026 13:48
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment thread dev/workflow/generate_workflows.sh Outdated
Per reviewer request, replace the dynamic per-experiment wall-time
calculation with a fixed 00:10:00 — 10 minutes total is enough time
to run rocotorun sequentially across all experiments.

Agent-Logs-Url: https://github.com/DavidHuber-NOAA/global-workflow/sessions/d5d8659b-071a-44bc-ad67-5c6464622085

Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Modify the scrontab scripts to cycle through all launched CI tests

2 participants