
CI self-test with KEEPDATA=YES #2734

Closed
TerrenceMcGuinness-NOAA wants to merge 8 commits into NOAA-EMC:develop from TerrenceMcGuinness-NOAA:ci_keepdata

Conversation

@TerrenceMcGuinness-NOAA
Collaborator

Description

This is a CI self-test with KEEPDATA=YES that saves off RUNDIRS in order to capture the disk cost of running the CI tests.

TerrenceMcGuinness-NOAA added the CI-Hera-Ready and CI/CD labels Jun 27, 2024
emcbot added the CI-Hera-Building and CI-Hera-Running labels and removed the CI-Hera-Ready and CI-Hera-Building labels Jun 28, 2024
@emcbot

emcbot commented Jun 28, 2024

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/logs/2021122018/gdassfcanl.log

Follow link here to view the contents of the above file(s): (link)

emcbot added the CI-Hera-Failed label and removed the CI-Hera-Running label Jun 28, 2024
@emcbot

emcbot commented Jun 28, 2024

Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_7e868a54

@emcbot

emcbot commented Jun 28, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_7e868a54

@RussTreadon-NOAA
Contributor

C48mx500_3DVarAOWCDA failure

The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts

21:  (abort_ice)ABORTED:
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21

As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.

This PR uses an older gdas.cd hash. PR #2700 uses a newer gdas.cd hash. The C48mx500_3DVarAOWCDA test fails with both hashes in the same manner. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.
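
When comparing failures across PRs, it can help to pull the abort code and rank out of each log mechanically. The following is a minimal sketch, assuming the abort line format shown in the excerpt above; the function name `find_mpi_aborts` is hypothetical and is not part of global-workflow.

```python
import re

# Matches lines such as:
#   21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(...)
ABORT_RE = re.compile(r"Abort\((\d+)\) on node (\d+) \(rank (\d+) in comm (\d+)\)")


def find_mpi_aborts(lines):
    """Return a list of (exit_code, rank) tuples, one per matching abort line."""
    aborts = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts.append((int(match.group(1)), int(match.group(3))))
    return aborts


# Log lines copied from the gdasfcst excerpt quoted in this comment.
sample = [
    "21:  (abort_ice)ABORTED:",
    "21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)",
    "21: Abort(128) on node 21 (rank 21 in comm 0): application called "
    "MPI_Abort(MPI_COMM_WORLD, 128) - process 21",
]
print(find_mpi_aborts(sample))
```

Lines that do not match the pattern are ignored, so the whole log file can be passed in line by line.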

@RussTreadon-NOAA
Contributor

C96_atmaerosnowDA failure

The C96_atmaerosnowDA failure in this PR differs from PR #2700 and #2729. The 20211220 18Z gdassfcanl fails in this PR with the error message

2:  FATAL ERROR: OPENING FILE: ./fnbgsi.003: NetCDF: Unknown file format
2:  STOP.
2: Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2

Local file fnbgsi.003 is a copy of 20211220.180000.sfc_data.tile3.nc from /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow. That file in COMROOT is zero length:

 /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow:
  total used in directory 109600 available 71308833872
  drwxrwsr-x 2 Terry.McGuinness global     4096 Jun 28 01:33 .
  drwxr-sr-x 5 Terry.McGuinness global     4096 Jun 28 01:33 ..
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile6.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global        0 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc

According to gdassnowanl.log, the COMROOT file 20211220.180000.sfc_data.tile3.nc was created by the following copy:

./gdassnowanl.log:2024-06-28 01:33:46,896 - INFO     - file_utils  : Copied /scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl/20211220.180000.sfc_data.tile3.nc to /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18//analysis/snow/20211220.180000.sfc_data.tile3.nc

The original file 20211220.180000.sfc_data.tile3.nc in the RUNDIRS anl directory is non-zero length:

Hera(hfe04):/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl$ ls -l 20211220.180000.sfc_data*
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc

I am not familiar with snow DA. Tagging @jiaruidong2017 . Jiarui, what are your thoughts on this failure?
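
A check like the one done by hand above (comparing file sizes in the source and destination directories) can be scripted. This is only a sketch under stated assumptions: `copy_and_verify` and `find_empty_files` are hypothetical helpers written with the Python standard library, and they do not represent how the workflow's own `file_utils` copy is implemented.

```python
import os
import shutil
import tempfile


def copy_and_verify(src, dst):
    """Copy src to dst and raise OSError if the sizes differ afterwards."""
    shutil.copyfile(src, dst)
    src_size = os.path.getsize(src)
    dst_size = os.path.getsize(dst)
    if dst_size != src_size:
        raise OSError(
            f"size mismatch after copy: {src} ({src_size} bytes) "
            f"-> {dst} ({dst_size} bytes)"
        )
    return dst_size


def find_empty_files(directory, suffix=".nc"):
    """Return the sorted names of zero-length files in directory with suffix."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(suffix)
        and os.path.getsize(os.path.join(directory, name)) == 0
    )


# Small demonstration in a temporary directory (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.nc")
    with open(src, "wb") as fh:
        fh.write(b"\0" * 1024)
    open(os.path.join(tmp, "empty.nc"), "wb").close()
    copied = copy_and_verify(src, os.path.join(tmp, "dst.nc"))
    print(copied, find_empty_files(tmp))
```

Running `find_empty_files` on the analysis/snow directory right after the copy step would flag a zero-length tile before a downstream job tries to open it.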

@jiaruidong2017
Contributor

Thanks @RussTreadon-NOAA for digging into this. I actually don't have any idea why this happened, and I did not encounter this issue in my previous tests. Rerunning this CI test may help to find the cause. @CoryMartin-NOAA do you have any thoughts on this?

@RussTreadon-NOAA
Contributor

Thank you @jiaruidong2017 for your reply. Do you routinely run C96_atmaerosnowDA as part of your development? If not, how do / how frequently do you test JEDI snow DA in g-w?

@jiaruidong2017
Contributor

jiaruidong2017 commented Jun 28, 2024

@RussTreadon-NOAA I actually didn't run the C96_atmaerosnowDA CI test for my development work; instead, I run my own JEDI snow DA test. I have run my tests four times over the past two weeks.

@RussTreadon-NOAA
Contributor

@jiaruidong2017 , to help with debugging, when did you make these runs, on which machine, and do you still have the log files online?

@jiaruidong2017
Contributor

@RussTreadon-NOAA You can find the log files for my three tests at:

/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory04/logs/ (Today)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory03/logs/ (June 26)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory02/logs/ (June 15)

@guillaumevernieres
Contributor

> C48mx500_3DVarAOWCDA failure
>
> The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts
>
> 21:  (abort_ice)ABORTED:
> 21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
> 21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
>
> As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.
>
> This PR uses an older gdas.cd hash. PR #2700 uses a newer gdas.cd hash. The C48mx500_3DVarAOWCDA test fails with both hashes in the same manner. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.

@JessicaMeixner-NOAA I just checked: the ocean and sea ice increments are all NaNs.

@JessicaMeixner-NOAA
Contributor

PR #2681 was not tested on Hera. I'm not sure why it was not (I know stmp was an issue, but that PR changes a lot for WCDA). I think it could be the cause of the WCDA failures we are seeing; perhaps some logic clean-up at the end, or oversights in non-CI testing, meant this was not seen. It also seems that #2719 may be causing issues for tests not related to WCDA, based on some other threads.

@aerorahul
Contributor

> PR #2681 was not tested on Hera. I'm not sure why it was not (I know stmp was an issue, but that PR changes a lot for WCDA). I think it could be the cause of the WCDA failures we are seeing; perhaps some logic clean-up at the end, or oversights in non-CI testing, meant this was not seen. It also seems that #2719 may be causing issues for tests not related to WCDA, based on some other threads.

@JessicaMeixner-NOAA
Thanks. I think it is the aggressive clean-up from #2719 that is likely the root cause.
I have left a comment for it in #2719 and #2700 to test that.

TerrenceMcGuinness-NOAA added the CI-Hera-Ready label and removed the CI-Hera-Failed label Jul 8, 2024
emcbot added the CI-Hera-Building, CI-Hera-Running, and CI-Hera-Failed labels and removed the CI-Hera-Ready, CI-Hera-Building, and CI-Hera-Running labels Jul 8, 2024
@emcbot

emcbot commented Jul 8, 2024

Experiment C48_ATM FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_ATM_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C96_atm3DVar FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atm3DVar_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_692341ed/logs/2021032412/gdasfcst.log

Follow link here to view the contents of the above file(s): (link)

@emcbot

emcbot commented Jul 8, 2024

Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48_S2SW FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_S2SW_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C96C48_hybatmDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96C48_hybatmDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48_S2SWA_gefs FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_S2SWA_gefs_692341ed

TerrenceMcGuinness-NOAA added the CI-Hera-Ready label and removed the CI-Hera-Failed label Jul 8, 2024
emcbot added the CI-Hera-Building, CI-Hera-Running, and CI-Hera-Passed labels and removed the CI-Hera-Ready, CI-Hera-Building, and CI-Hera-Running labels Jul 8, 2024
@emcbot

emcbot commented Jul 9, 2024

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2734


Experiment C48_ATM_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 22:11:41 UTC 2024
Experiment C48mx500_3DVarAOWCDA_0a498766 Completed 2 Cycles: *SUCCESS* at Mon Jul  8 22:23:49 UTC 2024
Experiment C96_atm3DVar_0a498766 Completed 3 Cycles: *SUCCESS* at Mon Jul  8 23:24:35 UTC 2024
Experiment C96C48_hybatmDA_0a498766 Completed 3 Cycles: *SUCCESS* at Mon Jul  8 23:24:39 UTC 2024
Experiment C48_S2SWA_gefs_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 23:36:50 UTC 2024
Experiment C48_S2SW_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 23:56:12 UTC 2024
Experiment C96_atmaerosnowDA_0a498766 Completed 3 Cycles: *SUCCESS* at Tue Jul  9 00:19:15 UTC 2024
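
The summary lines above follow a regular pattern, so they can be turned into structured records for later comparison between runs. This is a rough sketch, assuming the line format stays as shown; `parse_summary` is a hypothetical helper, and treating the trailing `_0a498766` as a per-run suffix is an assumption based on the experiment names in this thread.

```python
import re

# Matches lines such as:
#   Experiment C48_ATM_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 ...
SUMMARY_RE = re.compile(
    r"Experiment (\S+) Completed (\d+) Cycles: \*(\w+)\* at (.+)"
)


def parse_summary(lines):
    """Return one dict per summary line; lines that do not match are skipped."""
    results = []
    for line in lines:
        match = SUMMARY_RE.match(line.strip())
        if not match:
            continue
        name, cycles, status, finished = match.groups()
        # Assumption: the text after the last underscore is a per-run suffix.
        case, _, suffix = name.rpartition("_")
        results.append(
            {
                "case": case,
                "suffix": suffix,
                "cycles": int(cycles),
                "status": status,
                "finished": finished,
            }
        )
    return results


# Two lines copied from the bot comment above.
sample = [
    "Experiment C48_ATM_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 22:11:41 UTC 2024",
    "Experiment C48_S2SWA_gefs_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 23:36:50 UTC 2024",
]
for row in parse_summary(sample):
    print(row["case"], row["cycles"], row["status"])
```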

@emcbot

emcbot commented Jul 9, 2024

Disk requirements for RUNDIRS with KEEPDATA=YES: 437 G

Terry.McGuinness (hfe10) CI $ du -h --max-depth=1 "/scratch2/NCEPDEV/stmp/${USER}/global/CI/STMP/RUNDIRS"
75G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96_atm3DVar_0a498766
121G /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96C48_hybatmDA_0a498766
34G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_S2SW_0a498766
26G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48mx500_3DVarAOWCDA_0a498766
83G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_0a498766
73G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_S2SWA_gefs_0a498766
28G	 /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_ATM_0a498766
437G /scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIR

And the requirements for RUNTESTS (EXPDIR plus COMROOT) in a typical CI run: 432 G

Terry.McGuinness (hfe05) CI $ du -h --max-depth=1 2581/RUNTESTS
4.0M	2581/RUNTESTS/EXPDIR
432G	2581/RUNTESTS/COMROOT
432G	2581/RUNTESTS
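
For tracking these numbers over time, the `du -h` output can be summed programmatically. The sketch below assumes the size-then-path layout shown above and binary units as printed by `du -h`; `du_size_to_gib` and `total_gib` are hypothetical helper names, and the paths are shortened to the experiment directory names for readability. Because `du -h` rounds each figure, the sum of the per-experiment lines need not match the reported total exactly.

```python
# Multipliers that convert a du -h unit suffix to GiB.
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def du_size_to_gib(size):
    """Convert a du -h size string such as '75G' or '4.0M' to GiB."""
    return float(size[:-1]) * UNITS[size[-1]]


def total_gib(du_lines):
    """Sum the sizes in lines of the form '<size><whitespace><path>'."""
    total = 0.0
    for line in du_lines:
        size, _path = line.split(None, 1)
        total += du_size_to_gib(size)
    return total


# Per-experiment sizes from the RUNDIRS listing above (paths shortened).
rundirs = [
    "75G C96_atm3DVar_0a498766",
    "121G C96C48_hybatmDA_0a498766",
    "34G C48_S2SW_0a498766",
    "26G C48mx500_3DVarAOWCDA_0a498766",
    "83G C96_atmaerosnowDA_0a498766",
    "73G C48_S2SWA_gefs_0a498766",
    "28G C48_ATM_0a498766",
]
print(f"{total_gib(rundirs):.0f} GiB")
```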

@aerorahul
Contributor

@TerrenceMcGuinness-NOAA
Thanks for running this test and collecting the information.

bbakernoaa pushed a commit to bbakernoaa/global-workflow that referenced this pull request Mar 19, 2026
…NOAA-EMC#2734; Add option in GFS PBL to use liquid/ice potential temperature in local mixing NOAA-EMC#2763; Update MOM6 to its main repo. 20250527 commit NOAA-EMC#2757; Mom6 025 warmstart option NOAA-EMC#2761 (NOAA-EMC#2734)

* UFSWM - change to MOM6 namelist file
  * FV3 - 
    * ccpp-physics - Bring in the scale-aware 3DTKE related changes for GFS TKE EDMF PBL scheme
    * ccpp-physics - Add a namelist option in GFS TKE EDMF PBL scheme to apply an approximate way to represent local mixing/diffusion process in T-equation using liquid/ice water potential temperature. 
    * atmos_cubed_sphere - Enable the scale-aware 3DTKE capability for the GFS TKE EDMF PBL scheme
  * MOM6 - update MOM6 to its main repo. 20250527 commit (originally GFDL 20250423 candidate)

---------

Co-authored-by: jiandewang <jiande.wang@noaa.gov>
Co-authored-by: NeilBarton-NOAA <neil.barton@noaa.gov>

Labels

CI/CD, CI-Hera-Passed


7 participants