Avoid parallel restart I/O on WCOSS2 #3615

Merged
DavidHuber-NOAA merged 6 commits into NOAA-EMC:develop from DavidHuber-NOAA:fix/ufs_build
Apr 28, 2025

Conversation

@DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Apr 25, 2025

Description

This disables parallel restart reads on WCOSS2 by building the UFS via build.sh instead of tests/compile.sh. This is a temporary workaround to allow C384 forecasts to run to completion.

Resolves #3589
Refs ufs-community/ufs-weather-model#2716
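
As a rough illustration of the workaround described above, the choice between the two build entry points could be expressed as follows. This is a sketch only: the function name `select_ufs_builder`, its arguments, and the machine-name strings are hypothetical, and it assumes other platforms keep using `tests/compile.sh`. It is not the code changed in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide which UFS build entry point to use.
# The script names come from the PR description; the function name,
# its arguments, and the machine-name strings are illustrative assumptions.
select_ufs_builder() {
  local machine="$1"            # e.g. "wcoss2" or "hera" (assumed spelling)
  local parallel_restart="$2"   # "YES" to request parallel restart I/O

  if [ "${machine}" = "wcoss2" ] && [ "${parallel_restart}" != "YES" ]; then
    # Workaround path: build.sh leaves parallel restart I/O disabled
    echo "build.sh"
  else
    # Usual path: tests/compile.sh enables parallel restart I/O
    echo "tests/compile.sh"
  fi
}

select_ufs_builder wcoss2 NO    # prints: build.sh
select_ufs_builder wcoss2 YES   # prints: tests/compile.sh
select_ufs_builder hera NO      # prints: tests/compile.sh
```

Keeping the decision in one place would make it simple to drop the workaround once ufs-community/ufs-weather-model#2716 is addressed.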

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? YES (Parallel restart I/O will be disabled on WCOSS2, resulting in slower runtimes for forecasts)
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

Simple C384 forecast on WCOSS2.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added

@DavidHuber-NOAA
Contributor Author

FYI @DusanJovic-NOAA

@DavidHuber-NOAA
Contributor Author

Opened #3616 to reinstate parallel restart file reads once ufs-community/ufs-weather-model#2716 is addressed.

@JessicaMeixner-NOAA
Contributor

We are in the middle of scaling/timing tests for GFS & GEFS, so I'm pointing this PR out to @GeorgeVandenberghe-NOAA @DusanJovic-NOAA @bingfu-NOAA @NeilBarton-NOAA @CatherineThomas-NOAA @RuiyuSun so that they can make sure they have parallel restarts enabled for any timing tests. I don't think George or Dusan are using g-w builds, but others might be.

@GeorgeVandenberghe-NOAA

@CatherineThomas-NOAA
Contributor

Thanks for the PR @DavidHuber-NOAA. I did a quick clone/build to make sure this addressed my previous issue #3589. The model built successfully, but afterwards there was an error and the build script failed:

 [100%] Built target ufs_model
make[1]: Leaving directory '/lfs/h2/emc/da/noscrub/catherine.thomas/git/global-workflow/dave_fix/sorc/ufs_model.fd/build_fv3_1'
/apps/spack/cmake/3.20.2/intel/19.1.3.304/utnbptm3hrf7gppztidueu4jogfgemut/bin/cmake -E cmake_progress_start /lfs/h2/emc/da/noscrub/catherine.thomas/git/global-workflow/dave_fix/sorc/ufs_model.fd/build_fv3_1/CMakeFiles 0
+ mv /lfs/h2/emc/da/noscrub/catherine.thomas/git/global-workflow/dave_fix/sorc/ufs_model.fd/build_fv3_1/ufs_model tests/fv3_1.exe
+ [[ NO == \Y\E\S ]]
+ mv ./tests/fv3_1.exe ./tests/gfs_model.x
+ [[ ! -f ./tests/modules.ufs_model.lua ]]
+ mv ./tests/modules.fv3_1.lua ./tests/modules.ufs_model.lua
mv: cannot stat './tests/modules.fv3_1.lua': No such file or directory
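
The trace above shows an unconditional `mv` failing because the source module file was never created. One generic way to guard against that class of failure is to rename only when the source exists. The helper `move_if_present` below is hypothetical and is not the fix that was applied in this PR; it only sketches the pattern.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not the merged fix: rename a file only if it exists.
move_if_present() {
  local src="$1"
  local dest="$2"
  if [ -f "${src}" ]; then
    mv "${src}" "${dest}"
  else
    # Report the missing file on stderr instead of letting mv abort the script
    echo "WARNING: ${src} not found; skipping rename" >&2
  fi
}

# Demonstration in a scratch directory, using the file names from the trace
workdir="$(mktemp -d)"
mkdir -p "${workdir}/tests"
move_if_present "${workdir}/tests/modules.fv3_1.lua" \
                "${workdir}/tests/modules.ufs_model.lua"
```

Whether a missing module file should be a warning or a hard error depends on whether later steps need it; this sketch assumes a warning is enough.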

@CatherineThomas-NOAA
Contributor

Also, I can confirm that even though the build script failed, the executable was created beforehand and running with that executable allowed my previously failed test (C384_atm3DVar) to succeed.

@DavidHuber-NOAA
Contributor Author

@CatherineThomas-NOAA Thanks for the test and letting me know about my slop. I put in a fix just now and am testing the build presently.

@GeorgeVandenberghe-NOAA @JessicaMeixner-NOAA I will add in an option to compile using compile.sh (i.e. enabling parallel restart reads).

@DavidHuber-NOAA
Contributor Author

My fix did not work... Continuing to diagnose.

@DavidHuber-NOAA DavidHuber-NOAA changed the title from "Avoid parallel restart reads on WCOSS2" to "Avoid parallel restart I/O on WCOSS2" Apr 25, 2025
Comment thread sorc/build_ufs.sh Fixed
@DavidHuber-NOAA
Contributor Author

@GeorgeVandenberghe-NOAA @JessicaMeixner-NOAA I've added an option to build the UFS with parallel restart I/O enabled. This option will only work with the build_all.sh and build_ufs.sh scripts (not build_compute.sh, which has fixed build parameters). To use it, you will just need to pass -p to the script.
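
For readers unfamiliar with how a flag like `-p` is typically handled in a bash build script, a minimal `getopts` sketch follows. The function name `parse_build_opts` and the variable `PARALLEL_RESTART` are assumptions made for illustration; the actual option handling in `build_ufs.sh` is not shown in this thread and may differ.

```shell
#!/usr/bin/env bash
# Illustrative only: the real build_ufs.sh option handling is not shown here.
parse_build_opts() {
  local opt
  OPTIND=1                    # reset so the function can be called repeatedly
  PARALLEL_RESTART="NO"       # hypothetical variable name; default is disabled
  while getopts ":p" opt; do
    case "${opt}" in
      p) PARALLEL_RESTART="YES" ;;
      \?) echo "Unknown option: -${OPTARG}" >&2; return 1 ;;
    esac
  done
}

parse_build_opts -p
echo "parallel restart I/O: ${PARALLEL_RESTART}"
```

Defaulting to the disabled state matches the intent of the PR: the workaround applies unless the user explicitly opts in, for example for scaling and timing tests.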

@CatherineThomas-NOAA I have fixed the build issues now and my test build was successful.

I will go ahead and start a full suite of CI tests, including the weekly tests, on WCOSS2.

@DavidHuber-NOAA DavidHuber-NOAA added CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully CI-Wcoss2-Ready PR is ready for CI testing on WCOSS2. CI-Wcoss2-Building CI testing is cloning/building on WCOSS2 labels Apr 25, 2025
@DavidHuber-NOAA
Contributor Author

All tests passed except the weekly C384_S2SWA test due to missing initial conditions. This test has not been run in quite some time, so this is not a big surprise.

@KateFriedman-NOAA is it possible to have the C384_S2SWA ICs uploaded to WCOSS2?

@CatherineThomas-NOAA should this test be run to consider this PR a success?

@DavidHuber-NOAA DavidHuber-NOAA added CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running CI testing on WCOSS for this PR is in-progress labels Apr 26, 2025
@KateFriedman-NOAA
Contributor

> is it possible to have the C384_S2SWA ICs uploaded to WCOSS2?

@DavidHuber-NOAA I only have the ice IC file for this case. I don't have the wave ICs (link in older timestamp is to empty folder) and I don't have any atmos ICs for this case. If someone from CPC or elsewhere can provide the ICs I can add them to the staged sets.

@DavidHuber-NOAA DavidHuber-NOAA added CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully and removed CI-Wcoss2-Failed CI testing on WCOSS for this PR has failed labels Apr 28, 2025
@DavidHuber-NOAA DavidHuber-NOAA mentioned this pull request Apr 28, 2025
4 tasks
@DavidHuber-NOAA
Contributor Author

@KateFriedman-NOAA Thanks for checking! I opened up #3618 to track the addition of those ICs. Feel free to ping appropriate folks in that issue.

Setting the CI-Wcoss2-Passed label as all other tests passed.

I believe this PR is ready for final reviews.

@CatherineThomas-NOAA
Contributor

> @CatherineThomas-NOAA should this test be run to consider this PR a success?

@DavidHuber-NOAA - The test you mentioned is for the GEFS. Since that test hasn't been run in some time and this isn't a new error, I think it's fine to move on with this PR from my perspective so long as the original error is fixed and no regularly run tests are failing. Thanks for opening the issue to get that test back up and running!

@CatherineThomas-NOAA
Contributor

Confirming that I no longer get any errors from the build script with your latest fix @DavidHuber-NOAA. Thank you for adding the -p option so that our scaling work can continue and for opening the companion issue to reenable this option once the ufs issue is addressed.

Contributor

@aerorahul aerorahul left a comment

lgtm

@DavidHuber-NOAA DavidHuber-NOAA merged commit 7e16242 into NOAA-EMC:develop Apr 28, 2025
5 checks passed
tsga added a commit to tsga/global-workflow that referenced this pull request May 1, 2025
* develop:
  Update GSI hash and GSI fix version to resolve bugs (NOAA-EMC#3626)
  Add missing marine DA files to archiving  (NOAA-EMC#3596)
  Add a low resolution test to mimic GFSv17 cycling as much as possible (NOAA-EMC#3617)
  Add the setting to use the reject list for station t/q observations in GSI based soil DA (NOAA-EMC#3599)
  GitLab CI Framework for schedule PR cases and ctests on multi hosts (NOAA-EMC#3603)
  Avoid parallel restart I/O on WCOSS2 (NOAA-EMC#3615)
  Enables user toggling of GDASApp g-w ctests (NOAA-EMC#3587)
  COM variable updates for prep and some external downstream jobs (NOAA-EMC#3608)
  Remove MOS from system (NOAA-EMC#3612)
  Updates to enable soil DA  (NOAA-EMC#3452)
  Unexport SHELLOPTS when running htar (NOAA-EMC#3601)
  Fix check for netcdf wave restart (NOAA-EMC#3594)
  Call err_chk/err_exit for fatal errors in post JJobs/ex-scripts (NOAA-EMC#3571)
  Remove support for Jet and S4 (NOAA-EMC#3572)
  Hotfix in GitLab pipline for Nightly (env MACHINE breaks build on head node) (NOAA-EMC#3578)
  [hotfix] Missed a path during merging develop (NOAA-EMC#3577)
  Prepare for ops readiness - part 1 (NOAA-EMC#3557)
  Update UFS weather-model to 20250328 hash (NOAA-EMC#3528)
  Fix SFS fcst config (NOAA-EMC#3574)
  Use err_chk in GDAS j-jobs (NOAA-EMC#3570)
  Perform compute builds on Gaea head nodes (NOAA-EMC#3560)
  Add initial capability to produce JEDI-based observation space summary stat files (NOAA-EMC#3471)
  Spread epos over more nodes on Hera to increase allocated memory (NOAA-EMC#3567)
  Create separate gists when multiple files are published on GitHub (NOAA-EMC#3551)
  Use err_chk in GSI J-Jobs and scripts (NOAA-EMC#3549)
  Add unified jinja obs list to marine DA (NOAA-EMC#3530)
  Save snow and aerosol analysis increments (and logs and YAMLs) every cycle (NOAA-EMC#3537)
  Add Dependencies to SFS Cleanup Job (NOAA-EMC#3559)
  Updates archiving to reflect current naming of marine anl output (NOAA-EMC#3541)
  Temporarily disable compute builds on C6 (NOAA-EMC#3558)
  Update gdas.cd hash to resolve msu prod_util failure (NOAA-EMC#3556)
  COMIN/COMOUT updates for enkf chgres and downstream product jobs (NOAA-EMC#3518)
  Call err_chk in forecast scripts for fatal errors (NOAA-EMC#3515)
  Add Rocoto Jobs for the Missing Products of GEFS (NOAA-EMC#3466)
  Download subset fix data with python script (NOAA-EMC#3400)
  Check that partition should be set (NOAA-EMC#3543)
  Rename wave output and refactor some wave scripts to use MPMD, and fix some bugzillas along the way (NOAA-EMC#3517)
  Add support for dual batch partitions on AWS NOAA-EMC#3483
  Update CI build and run directories for GitLab Nightlies on C6 and added GitLab support on Hera (NOAA-EMC#3536)
  Hotfix path for CI in Jenkins on Gaea C6 to it's world-share path (NOAA-EMC#3532)
  Create single ocean grib2 product file (NOAA-EMC#3529)
  Scheduled Nightly CI/CD Pipeline Script in GitLab on Gaea C6 (NOAA-EMC#3493)
  make sure cold starts are handled correctly when DOIAU=YES (issue NOAA-EMC#3516) (NOAA-EMC#3520)
  Add check for DO_AERO_FCST before copying fv_tracer files (NOAA-EMC#3485)
  Use jinja templates instead of `@VARNAME@` in config files (NOAA-EMC#3411)
  Replace "status" (or comparable) with "err" in preparation for moving to err_chk/err_exit (NOAA-EMC#3507)
  Error in Java launch script for CI (NOAA-EMC#3465)
  Delete DATAROOT when running generate_workflows.sh (NOAA-EMC#3504)
  Fix 3244 garbled change (NOAA-EMC#3492)
  Enable ensemble archiving via Globus (NOAA-EMC#3479)
  Update MSU FIX_DIR paths (NOAA-EMC#3488)
  Updates for AOWCDA and hybatmaerosnowDA cases on Gaea C6 (NOAA-EMC#3487)
  Update GOCART path for GDAS/GFS/GCAFS implementations  (NOAA-EMC#3455)
  Make RUN Variables Explicit in `config.resources` (NOAA-EMC#3478)
  Remove unused key from enkfgdas_earc_vrfy (NOAA-EMC#3473)
  Bug fix to the failing early cycle marine DA ensemble re-centering (NOAA-EMC#3454)
  Make marine LETKF optional (NOAA-EMC#3462)
  When sourcing for RUN=enkf*, use CASE_ENS (NOAA-EMC#3475)
  Updates for Gaea: verif-global tag, tracker tag, Fit2Obs tag, and C768 analysis resources (NOAA-EMC#3463)
  Update gefswave glo_025 mesh file with new mask (NOAA-EMC#3457)
  Update MSU glopara paths to new role-global space (NOAA-EMC#3443)
  Enable CI testing on AWS (NOAA-EMC#3459)
  Enable Gaea C5 Jenkins CI (NOAA-EMC#3447)
  Job reference removal from WMO product names (NOAA-EMC#3460)
  Turn off aerosol prognostic radiative feedback for GDAS NOAA-EMC#2926 (NOAA-EMC#3445)
  Add DO_GEMPAK check to postsnd subtask (NOAA-EMC#3451)
  Add a force option to setup_xml to ignore unwritable directories (NOAA-EMC#3448)
  Remove the eomg job (NOAA-EMC#3331)
  Migration to role account for Jenkins on Orion (NOAA-EMC#3440)
  Eliminate `_gfs`, `_gdas`, etc, variables and add necessary if blocks (NOAA-EMC#3420)
  Update workflow staging for sfcanl tiles and waveinit (NOAA-EMC#3429)
  Improve messaging to display clear warning when missing snogrb file (NOAA-EMC#3317)
  JEDI-based ensemble recentering and analysis calculation (NOAA-EMC#3312)
  Enable HPSS archiving on C5/6 (NOAA-EMC#3437)
  Check if HOMEDIR STMP and PTMP are writable (NOAA-EMC#3430)
  Update UFS_Utils and GFS-utils hashes to update Gaea support and ocean/ice post products (NOAA-EMC#3433)
  Enable C1152 forecasts on gaea C6 (NOAA-EMC#3438)
  Migration to role account for Jenkins on Hercules (NOAA-EMC#3423)
  Remove Direct Linking to COM from DATA for `extractvars` Job (NOAA-EMC#3379)
  Enable HPSS via Globus on Hercules and Orion
  Remove job name from product files & update GEMPAK module. (NOAA-EMC#3415)
  `link` instead of `copy` in staging jobs (NOAA-EMC#3410)
  Migrate CI Jenkins to role account on Hera (NOAA-EMC#3414)
  Add rocotorc documentation when using scrontab (NOAA-EMC#3417)
  Update jgdas atmos verfozn and verfrad with COMIN/COMOUT prefix instead of COM (NOAA-EMC#3342)
  Add configuration for empirically-corrected ozone parameters (NOAA-EMC#3386)
  Enable global-workflow to run C768C384 GSI on Gaea-C6 (NOAA-EMC#3412)
  Move logical checks into if blocks (NOAA-EMC#3339)
  Adding Jenkins CI to GaeaC6 using role account (NOAA-EMC#3389)
  Enable GDASApp g-w CI cases to run on wcoss2 (NOAA-EMC#3399)
  CI/CD Test on Gaea C5- And update config.gaea under ci/platform (NOAA-EMC#3280)
  Enable cycling support for Gaea C6 (NOAA-EMC#3323)
  Update enkf archive jobs to use COMIN/COMOUT (NOAA-EMC#3393)
  Copy marine ensemble output observation diags and spread (NOAA-EMC#3407)
  Ci testing on aws 2 (NOAA-EMC#3408)
  Disable METplus jobs on Hera (NOAA-EMC#3403)
  Add the mean EnKF soil increment to the deterministic member (NOAA-EMC#3295)
  Add mpich/8.1.19 to the WCOSS2 LD_LIBRARY_PATH for GDASApp jobs (NOAA-EMC#3396)
  Change order of RUNs (NOAA-EMC#3335)
  CI testing on aws (NOAA-EMC#3391)
  Rename Gulf of Mexico in bufr station list in GFSv17 (NOAA-EMC#3384)
  Enabling AWS CI/testing (NOAA-EMC#3383)
  Update issue templates to use new issue type field (NOAA-EMC#3369)
  Replace WAVECUR_DID variable with "rtofs" (NOAA-EMC#3337)
  Allow for C1152 ATM-Aero cycled DA to run on WCOSS2 (NOAA-EMC#3309)
  Remove Direct Linking to COM from DATA for `wavepostsbs` Job (NOAA-EMC#3303)
  Update jgdas enkf update job with COMIN or COMOUT prefix instead of COM (NOAA-EMC#3333)
  Add capability to run diff resolutions for marine anl and background (NOAA-EMC#3238)
  Update high resolution tests and fix minor wave issues  (NOAA-EMC#3289)
  Add sfs as valid system (NOAA-EMC#3243)
  Add missing arch_tars dependencies (NOAA-EMC#3319)
  Fix the empty aerosol DA aerostat tar file issue (NOAA-EMC#3332)
  Add missing file safeguard for IMS prep in snow analysis tasks (NOAA-EMC#3329)
  Fix memory unsetting on Gaea (NOAA-EMC#3325)
  Fix error log parsing in compute build CI (NOAA-EMC#3301)
  Remove marineanlvrfy task from global-workflow (NOAA-EMC#3314)
  Add `gfs_wavepostpnt` dependencies to gfs_cleanup (NOAA-EMC#3313)
  Increase the GDASApp build wallclock (NOAA-EMC#3298)
  Capture build fail in Jenkins pipeline when no error logs are produced (NOAA-EMC#3297)
  Add/update config files for Gaea and check existence before sourcing config files in generate_workflows.sh (NOAA-EMC#3286)
  Fix ocean restarts when cold starting with DOIAU=YES (NOAA-EMC#3278)
  Splitting up the archive task (NOAA-EMC#3242)
  CTests extended validation for C48_ATM and staged C48_S2SW for gfs_fcst and gfs_atmos (NOAA-EMC#3256)
  Add esnowanl to enkfgfs cycle (NOAA-EMC#3283)
  Add gfs cycles to C48mx500_3DVarAOWCDA (NOAA-EMC#3249)
  Add fetch job and update stage_ic to work with fetched ICs (NOAA-EMC#3141)
  Remove WAFS files and references from `develop` (NOAA-EMC#3263)
  fix intel stack version number on c5 (NOAA-EMC#3258)
  Update gsi_monitor and ufs_utils hashes to recent hashes for C5/C6 build and run (NOAA-EMC#3252)
  Enable DA cycling on gaea C5/C6 (NOAA-EMC#3255)
  Copy post-processed sea ice increment for diagnostics (NOAA-EMC#3235)
@DavidHuber-NOAA DavidHuber-NOAA deleted the fix/ufs_build branch May 21, 2025 19:32
Labels

CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully

Development

Successfully merging this pull request may close these issues.

C384_atm3DVar CI case hangs during cold start read

7 participants