Safeguard to prevent NaNs from CRTM affecting minimisation by ADCollard · Pull Request #925 · NOAA-EMC/GSI

ADCollard · 2025-09-02T19:08:01Z

Description

As noted in Issue #916 and also in PR #924 (the equivalent merge into the develop-v16 release branch), the CRTM can occasionally produce NaNs in the output brightness temperature and Jacobians. The existing QC tests do not always get triggered by this and so an explicit test for NaNs is required.
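The kind of explicit check described above can be sketched as follows. This is a minimal Python illustration, not the Fortran code changed in this PR; the function name `screen_nan_channels` and the data layout (one brightness temperature and one list of Jacobian values per channel) are assumptions made for the example.

```python
import math


def screen_nan_channels(tb, jacobians):
    """Return one flag per channel: True if usable, False if the
    brightness temperature or any Jacobian element is NaN.

    Hypothetical helper for illustration; names and layout are assumed.
    """
    usable = []
    for tb_chan, jac_chan in zip(tb, jacobians):
        has_nan = math.isnan(tb_chan) or any(math.isnan(v) for v in jac_chan)
        usable.append(not has_nan)
    return usable


# Three made-up channels: the second has a NaN brightness temperature,
# the third has a NaN in its Jacobian values.
tb = [250.3, float("nan"), 231.7]
jacobians = [[0.1, 0.2], [0.3, 0.1], [float("nan"), 0.4]]
flags = screen_nan_channels(tb, jacobians)
```

Any ordered comparison involving a NaN evaluates to false, so a threshold test of the form `abs(value) > limit` cannot flag one. That is one plausible way a NaN could pass a gross-error check, and it is why the sketch tests for NaN directly.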

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Standalone runs to ensure that cases with NaNs in the CRTM brightness temperature are screened out.

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
New and existing tests pass with my changes
Any dependent changes have been merged and published

RussTreadon-NOAA

Looks good. See comment in PR #294. Same comment applied to this PR.

ADCollard · 2025-09-03T14:33:05Z

Looks good. See comment in PR #294. Same comment applied to this PR.

Good idea! Change pushed.

RussTreadon-NOAA · 2025-09-03T16:52:46Z

WCOSS2 ctests

Install ADCollard:develop at 7421e09 as updat and NOAA-EMC:develop at 054e4ee as contrl on Cactus. Run ctests with the following results:

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  735.23 sec
2/6 Test #6: global_enkf ......................   Passed  857.32 sec
3/6 Test #2: rtma .............................   Passed  1094.82 sec
4/6 Test #5: hafs_3denvar_hybens ..............***Failed  1462.83 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1523.67 sec
6/6 Test #1: global_4denvar ...................***Failed  2044.98 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 2045.00 sec

The four failed tests are due to non-reproducible results between updat and contrl. For each case the initial total penalties are identical between updat and contrl. Differences arise in the minimization.

rrfs_3denvar_rdasens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.601329800829752348E+05  1.284530366199247965E+03  2.180618270029247796E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.601329800829752348E+05  1.284530366199247965E+03  2.180618270029243799E+00  0.000000000000000000E+00  good

hafs_3denvar_hybens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720935194E-01  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.522570550456362253E+05  5.089882891394057879E+03  3.159083965720939635E-01  0.000000000000000000E+00  good

hafs_4denvar_glbens
First step size calculation differs in the 16th digit. updat is the first line below. contrl is the second.

cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834984345E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   0  1.640321269846514042E+05  3.663464403379373380E+03  1.078684720834985233E+00  0.000000000000000000E+00  good

global_4denvar
Initial gradient norm differs in the 17th digit. updat is the first line below. contrl is the second.

Initial gradient norm =  1.627364417040545504E+03
Initial gradient norm =  1.627364417040545732E+03

Interestingly, Cactus ctests all Passed for PR #924.
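To put a difference in the 16th digit in context, the sketch below expresses the gap between the two rrfs_3denvar_rdasens step sizes as a relative difference and as an approximate count of double-precision ULPs (units in the last place). It is an illustration only and not part of the ctest comparison; the helper `ulp_distance` is a hypothetical name, and `math.ulp` requires Python 3.9 or later.

```python
import math


def ulp_distance(a, b):
    """Approximate number of double-precision ULPs separating a and b,
    measured in units of the spacing at the larger magnitude."""
    return abs(a - b) / math.ulp(max(abs(a), abs(b)))


# Step sizes for rrfs_3denvar_rdasens, copied from the output above.
step_updat = 2.180618270029247796E+00
step_contrl = 2.180618270029243799E+00

rel_diff = abs(step_updat - step_contrl) / abs(step_contrl)
ulps = ulp_distance(step_updat, step_contrl)
print(f"relative difference = {rel_diff:.3e}, about {ulps:.0f} ULPs")
```

A gap of only a few ULPs is at the level of floating-point round-off, which is consistent with the identical initial penalties reported above.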

RussTreadon-NOAA · 2025-09-03T16:57:34Z

Ursa ctests

Install ADCollard:develop at 3613a4d and NOAA-EMC:develop at 054e4ee on Ursa. Run ctests with following results

Test project /scratch3/NCEPDEV/da/Russ.Treadon/git/gsi/ursa/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  486.74 sec
2/6 Test #6: global_enkf ......................   Passed  490.31 sec
3/6 Test #2: rtma .............................   Passed  727.61 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  792.77 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  851.73 sec
6/6 Test #1: global_4denvar ...................   Passed  1082.43 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1082.45 sec

This result differs from WCOSS2 (Cactus).

The Cactus and Ursa builds use different versions of the Intel Fortran compiler.

Cactus: -- The Fortran compiler identification is Intel 19.1.3.20200925
Ursa: -- The Fortran compiler identification is Intel 2021.1.0.20240703

Also, Cactus uses hpc-stack for libraries and modules. Ursa uses spack-stack/1.9.2.

ADCollard · 2025-09-03T17:21:56Z

@CatherineThomas-NOAA After discussion with @RussTreadon-NOAA , we are leaning towards allowing this change to proceed despite the small minimization differences. Do you concur?

CatherineThomas-NOAA

The same comments from #924 apply here as well.

RussTreadon-NOAA · 2025-09-03T19:22:51Z

Gaea C6 ctests

Install ADCollard:develop at 3613a4d and NOAA-EMC:develop at 054e4ee on Gaea C6. Run ctests with following results

Test project /gpfs/f6/ira-sti/scratch/Russ.Treadon/git/gsi/pr925/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  482.82 sec
2/6 Test #6: global_enkf ......................   Passed  484.54 sec
3/6 Test #2: rtma .............................   Passed  724.29 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  843.98 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  973.37 sec
6/6 Test #1: global_4denvar ...................   Passed  1201.77 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1201.79 sec

All tests Passed.

The Gaea C6 build uses -- The Fortran compiler identification is Intel 2021.10.0.20230609.

RussTreadon-NOAA

Approve.

RussTreadon-NOAA · 2025-09-04T11:28:23Z

WCOSS2 tests

Test 1

Run NOAA-EMC:develop and ADCollard:develop gsi.x for 2025090218 gfs case at operational resolution. Minimization printout differs in the 16th digit of the step size (and of b) on the second iteration.

NOAA-EMC:develop

Initial cost function =  1.261923950715688057E+06
Initial gradient norm =  2.091827177666849821E+04
cost,grad,step,b,step? =   1   0  1.261923950715688057E+06  2.091827177666849821E+04  1.233890950926028962E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  1.233412194524980849E+06  2.627439059878627086E+04  1.200440917279365749E+00  6.353503217882496834E-01  good

ADCollard:develop

Initial cost function =  1.261923950715688057E+06
Initial gradient norm =  2.091827177666849821E+04
cost,grad,step,b,step? =   1   0  1.261923950715688057E+06  2.091827177666849821E+04  1.233890950926028962E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  1.233412194524980849E+06  2.627439059878627086E+04  1.200440917279366415E+00  6.353503217882495724E-01  good

Below is a comparison of the minimum and maximum values of each field in the siginc.nc analysis increment file, together with the maximum absolute difference between the two files. 1 is NOAA-EMC:develop; 2 is ADCollard:develop.

u_inc min/max 1=-20.922054,23.294716 min/max 2=-20.922052,23.294716 max abs diff=0.0346442461
v_inc min/max 1=-18.751726,17.711506 min/max 2=-18.75081,17.711506 max abs diff=0.0374132395
delp_inc min/max 1=-11.960331,9.681469 min/max 2=-11.960377,9.681469 max abs diff=0.0010174811
delz_inc min/max 1=-126.771255,126.01985 min/max 2=-126.771355,126.01985 max abs diff=0.2762374878
T_inc min/max 1=-10.683932,9.500078 min/max 2=-10.683899,9.500078 max abs diff=0.0315157771
sphum_inc min/max 1=-0.0050469316,0.005694798 min/max 2=-0.0050469316,0.005694798 max abs diff=0.0000475537
liq_wat_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
o3mr_inc min/max 1=-1.6638726e-06,1.8373532e-06 min/max 2=-1.6638942e-06,1.8373532e-06 max abs diff=0.0000000012
icmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
rwmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
snmr_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000
grle_inc min/max 1=0.0,0.0 min/max 2=0.0,0.0 max abs diff=0.0000000000

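The maximum absolute difference between the analysis increments from the two executables is very small relative to the increments themselves: for example, about 0.035 for u_inc against increments of roughly ±20, and about 0.28 for delz_inc against increments of roughly ±126.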
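A comparison of this kind can be sketched as below. This is a hedged illustration rather than the script used to produce the lines above: the function name `compare_increments` is hypothetical, the input values are made up, and a real comparison would read each field from the two siginc.nc files instead of using short Python lists.

```python
def compare_increments(name, field1, field2):
    """Format the min/max of each field and the largest absolute
    pointwise difference, in a layout similar to the lines above."""
    max_abs_diff = max(abs(a - b) for a, b in zip(field1, field2))
    return (f"{name} min/max 1={min(field1)},{max(field1)} "
            f"min/max 2={min(field2)},{max(field2)} "
            f"max abs diff={max_abs_diff:.10f}")


# Made-up values: the extremes agree but an interior point differs.
u1 = [-1.5, 0.25, 2.0]
u2 = [-1.5, 0.5, 2.0]
line = compare_increments("u_inc", u1, u2)
print(line)
```

The made-up fields share the same minimum and maximum yet differ at an interior point, which is why the maximum absolute difference is reported alongside the min/max values and can be larger than the change in either extreme.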

Test 2

ADCollard:develop was built in Debug mode and used to run the 2025090218 gfs case. The following changes were made to the GSI namelist in order to finish the debug run in a reasonable amount of time

all dmesh sizes in OBS_INPUT were increased by an order of magnitude
the time_window_max was decreased from 3 to 0.5 hours in OBS_INPUT
the number of first and second outer loop iterations was decreased to niter(1)=1,niter(2)=1 in SETUP

The ADCollard:develop debug gsi.x successfully ran the 2025090218 gfs case without any error in 7361.614930 seconds.

ADCollard and others added 4 commits August 27, 2025 18:23

Bugfix for saildrone/windborne obs errors

f34da3e

Comment out a line causing too many messages

99e6880

Merge branch 'NOAA-EMC:develop' into develop

f5ad258

Trap for NaNs from CRTM

7421e09

ADCollard requested a review from emilyhcliu September 2, 2025 19:09

ADCollard self-assigned this Sep 2, 2025

ADCollard added the bug Something isn't working label Sep 2, 2025

emilyhcliu approved these changes Sep 2, 2025

RussTreadon-NOAA mentioned this pull request Sep 3, 2025

Safeguard to prevent NaNs from CRTM affecting minimisation #924

Merged

RussTreadon-NOAA reviewed Sep 3, 2025

Add satid to error message

3613a4d

ADCollard requested a review from CatherineThomas-NOAA September 3, 2025 17:22

CatherineThomas-NOAA approved these changes Sep 3, 2025

RussTreadon-NOAA self-requested a review September 3, 2025 19:22

RussTreadon-NOAA approved these changes Sep 3, 2025

RussTreadon-NOAA merged commit 9ae4063 into NOAA-EMC:develop Sep 4, 2025

RussTreadon-NOAA mentioned this pull request Sep 4, 2025

NaNs produced in v16.3.26 - related to GMI #916

Closed

RussTreadon-NOAA merged 5 commits into NOAA-EMC:develop from ADCollard:develop
