Safeguard to prevent NaNs from CRTM affecting minimisation#925

Merged: RussTreadon-NOAA merged 5 commits into NOAA-EMC:develop from ADCollard:develop on Sep 4, 2025

@ADCollard ADCollard commented Sep 2, 2025

Description

As noted in Issue #916 and also in PR #924 (the equivalent merge into the develop-v16 release branch), the CRTM can occasionally produce NaNs in the output brightness temperature and Jacobians. The existing QC tests do not always get triggered by this and so an explicit test for NaNs is required.
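
To illustrate why a threshold-style QC test can miss a NaN, here is a minimal Python sketch. It is not the GSI code: `passes_gross_check`, `passes_nan_check`, and the 3.0 limit are hypothetical names and values chosen for illustration. It assumes a check that rejects an observation only when the departure exceeds a limit; since any comparison involving NaN evaluates to false, such a check does not fire.

```python
import math


def passes_gross_check(obs_minus_sim: float, limit: float) -> bool:
    """Hypothetical gross check: reject only when |departure| exceeds the limit."""
    return not (abs(obs_minus_sim) > limit)


def passes_nan_check(values) -> bool:
    """Explicit NaN test: fail if any value (e.g. Tb or a Jacobian element) is NaN."""
    return not any(math.isnan(v) for v in values)


simulated_tb = float("nan")   # stand-in for a NaN brightness temperature
observed_tb = 250.0           # made-up observation value
departure = observed_tb - simulated_tb  # NaN propagates into the departure

# abs(NaN) > limit is False, so the threshold test does not reject the value.
print(passes_gross_check(departure, 3.0))   # True
# The explicit test catches it.
print(passes_nan_check([simulated_tb]))     # False
```

Whether a given QC test lets a NaN through depends on how its comparison is written, which is consistent with the tests being triggered only some of the time.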

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Standalone runs to ensure that cases with NaNs in the CRTM brightness temperature are screened out.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@ADCollard ADCollard requested a review from emilyhcliu September 2, 2025 19:09
@ADCollard ADCollard self-assigned this Sep 2, 2025
@ADCollard ADCollard added the bug Something isn't working label Sep 2, 2025

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Looks good. See comment in PR #294. Same comment applied to this PR.

@ADCollard

> Looks good. See comment in PR #294. Same comment applied to this PR.

Good idea! Change pushed.

@RussTreadon-NOAA

WCOSS2 ctests

Install ADCollard:develop at 7421e09 as updat and NOAA-EMC:develop at 054e4ee as contrl on Cactus. Run ctests with the following results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  735.23 sec
2/6 Test #6: global_enkf ......................   Passed  857.32 sec
3/6 Test #2: rtma .............................   Passed  1094.82 sec
4/6 Test #5: hafs_3denvar_hybens ..............***Failed  1462.83 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1523.67 sec
6/6 Test #1: global_4denvar ...................***Failed  2044.98 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 2045.00 sec

The four failed tests are due to non-reproducible results between updat and contrl. For each case, the initial total penalties are identical between updat and contrl; differences arise in the minimization.

rrfs_3denvar_rdasens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.601329800829752348E+05  1.284530366199247965E+03  2.180618270029247796E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.601329800829752348E+05  1.284530366199247965E+03  2.180618270029243799E+00  0.000000000000000000E+00  good

hafs_3denvar_hybens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720935194E-01  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720939635E-01  0.000000000000000000E+00  good

hafs_4denvar_glbens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834984345E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834985233E+00  0.000000000000000000E+00  good

global_4denvar
Initial gradient norm differs in the 17th digit. updat is the first line below. contrl is the second.

Initial gradient norm =  1.627364417040545504E+03
Initial gradient norm =  1.627364417040545732E+03

Interestingly, Cactus ctests all Passed for PR #924.
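
For reference, the digit positions quoted above can be checked with a short Python sketch. This is only an illustration and not part of the ctest tooling; `significant_digits` and `first_differing_digit` are hypothetical helpers, and they assume both values are printed in normalized scientific notation with the same sign, as in the minimization printout. Double precision carries roughly 15 to 17 significant decimal digits, so a first difference in the 16th or 17th digit is at round-off level.

```python
from typing import Optional, Tuple


def significant_digits(text: str) -> Tuple[str, int]:
    """Split a printed value such as '2.18E+00' into (mantissa digits, exponent)."""
    mantissa, _, exponent = text.strip().upper().partition("E")
    digits = mantissa.lstrip("+-").replace(".", "")
    return digits, int(exponent or 0)


def first_differing_digit(a: str, b: str) -> Optional[int]:
    """Return the 1-based position of the first differing significant digit."""
    digits_a, exp_a = significant_digits(a)
    digits_b, exp_b = significant_digits(b)
    if exp_a != exp_b:
        return 1  # different magnitudes: the values disagree from the first digit
    for position, (da, db) in enumerate(zip(digits_a, digits_b), start=1):
        if da != db:
            return position
    return None  # no difference in the printed digits


# rrfs_3denvar_rdasens step sizes (updat, contrl) from the printout above
print(first_differing_digit("2.180618270029247796E+00",
                            "2.180618270029243799E+00"))  # 16
# global_4denvar initial gradient norms (updat, contrl)
print(first_differing_digit("1.627364417040545504E+03",
                            "1.627364417040545732E+03"))  # 17
```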

@RussTreadon-NOAA

Ursa ctests

Install ADCollard:develop at 3613a4d and NOAA-EMC:develop at 054e4ee on Ursa. Run ctests with the following results

Test project /scratch3/NCEPDEV/da/Russ.Treadon/git/gsi/ursa/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  486.74 sec
2/6 Test #6: global_enkf ......................   Passed  490.31 sec
3/6 Test #2: rtma .............................   Passed  727.61 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  792.77 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  851.73 sec
6/6 Test #1: global_4denvar ...................   Passed  1082.43 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1082.45 sec

This result differs from WCOSS2 (Cactus).

The Cactus and Ursa builds use different versions of the Intel Fortran compiler.

  • Cactus: -- The Fortran compiler identification is Intel 19.1.3.20200925
  • Ursa: -- The Fortran compiler identification is Intel 2021.1.0.20240703

Also, Cactus uses hpc-stack for libraries and modules. Ursa uses spack-stack/1.9.2.

@ADCollard

@CatherineThomas-NOAA After discussion with @RussTreadon-NOAA , we are leaning towards allowing this change to proceed despite the small minimization differences. Do you concur?

@CatherineThomas-NOAA CatherineThomas-NOAA left a comment

The same comment from #924 applies here as well.

@RussTreadon-NOAA

Gaea C6 ctests

Install ADCollard:develop at 3613a4d and NOAA-EMC:develop at 054e4ee on Gaea C6. Run ctests with the following results

Test project /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  482.82 sec
2/6 Test #6: global_enkf ......................   Passed  484.54 sec
3/6 Test #2: rtma .............................   Passed  724.29 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  843.98 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  973.37 sec
6/6 Test #1: global_4denvar ...................   Passed  1201.77 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1201.79 sec

All tests Passed.

The Gaea C6 build reports: -- The Fortran compiler identification is Intel 2021.10.0.20230609

@RussTreadon-NOAA RussTreadon-NOAA self-requested a review September 3, 2025 19:22

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Approve.

@RussTreadon-NOAA

WCOSS2 tests

Test 1

Run NOAA-EMC:develop and ADCollard:develop gsi.x for the 2025090218 gfs case at operational resolution. The minimization printout differs in the 16th digit of the step size (and of b) on the second iteration.

NOAA-EMC:develop

Initial cost function =  1.261923950715688057E+06
Initial gradient norm =  2.091827177666849821E+04
cost,grad,step,b,step? =   1   0  1.261923950715688057E+06  2.091827177666849821E+04  1.233890950926028962E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  1.233412194524980849E+06  2.627439059878627086E+04  1.200440917279365749E+00  6.353503217882496834E-01  good

ADCollard:develop

Initial cost function =  1.261923950715688057E+06
Initial gradient norm =  2.091827177666849821E+04
cost,grad,step,b,step? =   1   0  1.261923950715688057E+06  2.091827177666849821E+04  1.233890950926028962E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  1.233412194524980849E+06  2.627439059878627086E+04  1.200440917279366415E+00  6.353503217882495724E-01  good

Below is a comparison of the minimum and maximum values, and the maximum absolute difference, for each field in the siginc.nc analysis increment file. Here 1 is NOAA-EMC:develop and 2 is ADCollard:develop.

u_inc min/max 1=-20.922054,23.294716 min/max 2=-20.922052,23.294716 max abs diff=0.0346442461
v_inc min/max 1=-18.751726,17.711506 min/max 2=-18.75081,17.711506 max abs diff=0.0374132395
delp_inc min/max 1=-11.960331,9.681469 min/max 2=-11.960377,9.681469 max abs diff=0.0010174811
delz_inc min/max 1=-126.771255,126.01985 min/max 2=-126.771355,126.01985 max abs diff=0.2762374878
T_inc min/max 1=-10.683932,9.500078 min/max 2=-10.683899,9.500078 max abs diff=0.0315157771
sphum_inc min/max 1=-0.0050469316,0.005694798 min/max 2=-0.0050469316,0.005694798 max abs diff=0.0000475537
liq_wat_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
o3mr_inc min/max 1=-1.6638726e-06,1.8373532e-06 min/max 2=-1.6638942e-06,1.8373532e-06 max abs diff=0.0000000012
icmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
rwmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
snmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
grle_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000

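The maximum absolute differences between the analysis increments from the two executables are small relative to the increments themselves. The largest is about 0.28 for delz_inc, whose increments span roughly -127 to 126.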
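
A summary of this kind could be produced along the lines of the Python sketch below. This is not the script used for the comparison above: `summarize` is a hypothetical helper, plain lists stand in for fields that would be read from siginc.nc, and the values are made up. It assumes the max abs diff is taken point by point, which would explain why it can be much larger than the difference between the two minima or maxima.

```python
def summarize(name, field1, field2):
    """Return a one-line summary in the style of the listing above."""
    if len(field1) != len(field2):
        raise ValueError("fields must have the same number of points")
    # Point-by-point difference, not a difference of the extremes.
    max_abs_diff = max(abs(a - b) for a, b in zip(field1, field2))
    return (
        f"{name} min/max 1={min(field1)},{max(field1)} "
        f"min/max 2={min(field2)},{max(field2)} "
        f"max abs diff={max_abs_diff:.10f}"
    )


# Made-up values, not taken from siginc.nc.
t_inc_1 = [-1.0, 0.25, 2.0]
t_inc_2 = [-1.0, 0.75, 2.0]
print(summarize("T_inc", t_inc_1, t_inc_2))
# T_inc min/max 1=-1.0,2.0 min/max 2=-1.0,2.0 max abs diff=0.5000000000
```

In this toy case the minima and maxima agree exactly while the fields still differ by 0.5 at an interior point.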

Test 2

ADCollard:develop was built in Debug mode and used to run the 2025090218 gfs case. The following changes were made to the GSI namelist in order to finish the debug run in a reasonable amount of time

  • all dmesh sizes in OBS_INPUT were increased by an order of magnitude
  • the time_window_max was decreased from 3 to 0.5 hours in OBS_INPUT
  • the number of first and second outer loop iterations was decreased to niter(1)=1,niter(2)=1 in SETUP

The ADCollard:develop debug gsi.x successfully ran the 2025090218 gfs case without any error in 7361.614930 seconds.
