
Add IMS stationIdentification to output #1654

Closed
yuanxue2870 wants to merge 10 commits into JCSDA-internal:develop from yuanxue2870:Feature/add_IMSid

Conversation

@yuanxue2870
Contributor

@yuanxue2870 yuanxue2870 commented May 12, 2025

Description

  1. Revise errsd from 40 to 80 to be consistent with offline workflow. Confirmed IMS errsd value via email exchanges on 05/12/2025.
  2. Following Youlong's recent PR (Add station IDs with txxxxyyyy for processed IMS snow DA, NOAA-EMC/land-SCF_proc#1), we now need to write the IMS station IDs (i.e., stationIdentification as a nine-digit string) to the IODA-formatted output so they can be used for follow-up QC.

Issue(s) addressed

Resolves #1642

Dependencies

List the other PRs that this PR is dependent on:

Impact

Expected impact on downstream repositories: No significant impacts are expected. More IMS QC can be done based on this added IMS ID feature. The input and output files for the ctest "test_iodaconv_imsfv3grid_scf" need to be updated; otherwise, the ctest will fail.

Checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have run the unit tests before creating the PR

@yuanxue2870
Contributor Author

Please review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!

@yuanxue2870 yuanxue2870 marked this pull request as ready for review May 12, 2025 15:19
@CoryMartin-NOAA
Contributor

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

@YoulongXia-NOAA
Contributor

@yuanxue2870, the test files include both input and output files. Please keep the filenames consistent with the existing ioda-converters input and output filenames. Otherwise, you will also need to modify the .txt file to run the ioda-converter test.

@yuanxue2870
Contributor Author

@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know

Thanks Youlong and Cory for your comments. I have already created the new test input and output files on Hera and included their paths in this PR. The administrator can grab these new files and replace the old ones with them directly (i.e., the file names are consistent). Or is there a public place where I can drop off these two files?

@CoryMartin-NOAA
Contributor

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

@yuanxue2870
Contributor Author

The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files

Oh, I see. That is why it is different from other repos... Thanks for your clarification! I will do that. Convert to draft while uploading new test files...

@yuanxue2870 yuanxue2870 marked this pull request as draft May 12, 2025 15:56
@yuanxue2870 yuanxue2870 marked this pull request as ready for review May 12, 2025 16:06
@yuanxue2870
Contributor Author

Please re-review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you!


lons = lons.astype('float32')
lats = lats.astype('float32')
stid = stid.astype('int64')
Contributor

stid = stid.astype('int32')

Contributor

@yuanxue2870 I recall that only dateTime uses a 64-bit integer. For a station ID, a 32-bit integer is sufficient.

Contributor Author

Hi Youlong, thanks for your comments. For now, I do not think int32 versus int64 makes a difference here, since "stid" is a nine-digit integer and is not itself written to the IODA-formatted output. However, int32 will not be enough if we go to a higher resolution (e.g., C1152), where we may need an 11- to 12-digit integer, which exceeds the int32 range. Hence, I would like to keep int64 here. - Thanks, Yuan
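
The capacity argument in this comment can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `fits_in_int32` and the two sample IDs are illustrative assumptions, not values taken from the converter.

```python
# Largest values representable by signed 32-bit and 64-bit integers.
INT32_MAX = 2**31 - 1   # 2147483647 (10 digits)
INT64_MAX = 2**63 - 1   # 9223372036854775807 (19 digits)


def fits_in_int32(value: int) -> bool:
    """Return True if value can be stored in a signed 32-bit integer."""
    return -(2**31) <= value <= INT32_MAX


# Hypothetical nine-digit ID (tile 6, x = 4608, y = 2406).
nine_digit_id = 646082406
# Hypothetical eleven-digit ID, as might arise with wider index fields.
eleven_digit_id = 60460802406

print(fits_in_int32(nine_digit_id))     # True
print(fits_in_int32(eleven_digit_id))   # False
print(eleven_digit_id <= INT64_MAX)     # True
```

Any nine-digit value stays below the int32 limit, while an ID of eleven or more digits only fits in int64.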

Contributor

Okay, keep it as suggested, @yuanxue2870. I am fine with it, but this is the first time I have seen anyone use int64 for a station ID, which seems a bit unusual.

Contributor

That is an excellent suggestion, @srherbener. Thank you for thinking it through; it sounds good to me.

Contributor

@YoulongXia-NOAA YoulongXia-NOAA left a comment

I approved this PR.

Collaborator

@srherbener srherbener left a comment

Thanks for this update!

self.varAttrs[('dateTime', metaDataName)]['_FillValue'] = long_missing_value
self.outdata[('latitude', metaDataName)] = lats
self.outdata[('longitude', metaDataName)] = lons
self.outdata[('stationIdentification', metaDataName)] = strstid
Collaborator

Am I understanding correctly that stid itself, as an integer, is not written to the output file, and that it is instead converted to a string which is written out as the stationIdentification value? If so, I think it's okay for stid to be a 64-bit integer.

Pardon my ignorance of this obs type, but I am curious why ~4 billion values is not enough unique station id values. If this obs type is going to have very large numbers of locations, we might want to consider using a numeric station id, or be very careful to use a character array instead of a variable length string, since the variable length string storage and access in the file is going to be very slow. Not for this PR but something to think about for the future.

Contributor

@srherbener, @ClaraDraper-NOAA would like to filter out some processed snow depth values at some IMS grid cells, so she asked me to add a station ID of the form txxxxyyyy, where t is the FV3 tile number (1-6), xxxx is the longitude index, and yyyy is the latitude index. For example, for C1152 with 4608x2406 grids, xxxx runs from 0001 to 4608 and yyyy from 0001 to 2406, so the ID is a 9-digit integer. In general, the ioda-converters use a string for stid. Are you thinking of using an integer instead?
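
The ID layout described in this comment could be sketched as follows. The function name `make_station_id` and its arguments are hypothetical, and the actual preprocessor in land-SCF_proc may build the ID differently; this only illustrates the tile, x-index, y-index concatenation with zero padding.

```python
def make_station_id(tile: int, x: int, y: int) -> str:
    """Build a nine-character ID laid out as txxxxyyyy.

    tile: FV3 tile number (1-6); x, y: 1-based grid indices (up to 4 digits).
    """
    if not 1 <= tile <= 6:
        raise ValueError(f"tile must be 1-6, got {tile}")
    if not (1 <= x <= 9999 and 1 <= y <= 9999):
        raise ValueError("x and y must each fit in four digits")
    # Zero-pad x and y to four digits so every ID has the same width.
    return f"{tile:d}{x:04d}{y:04d}"


print(make_station_id(1, 1, 1))        # 100010001
print(make_station_id(6, 4608, 2406))  # 646082406
```

Because the padding keeps every ID at exactly nine characters, the string can be split back into its three fields by position.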

Contributor Author

@yuanxue2870 yuanxue2870 May 12, 2025

@srherbener Yes, you are correct. Only strstid is written out.
Thanks for your suggestion. IMS is gridded obs, which is different from conventional "in-situ" (i.e., point-scale) obs. We want to give each grid an ID because we want to do some further QCs afterwards. The way we set up each ID is by concatenating its tile number, x-coordinate number, and y-coordinate number -- so we have super big values for stid. For now, int32 is sufficient. By using int64, I am trying to leave some wiggle room for higher resolution (in the long run) which is likely with an increased concatenated ID based on the current code logic.
I am happy to adjust if defining 'stid' as an int64 takes too much memory. Thanks!
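
As a rough illustration of how such IDs might be used in a later QC step, the sketch below splits an ID back into its parts and filters a list against a rejection set. The names `split_station_id` and `rejected_ids`, and the sample IDs, are hypothetical; the actual QC is not part of this PR.

```python
def split_station_id(station_id: str) -> tuple[int, int, int]:
    """Split a nine-character txxxxyyyy ID into (tile, x, y)."""
    if len(station_id) != 9 or not station_id.isdigit():
        raise ValueError(f"expected nine digits, got {station_id!r}")
    return int(station_id[0]), int(station_id[1:5]), int(station_id[5:9])


# Hypothetical set of grid cells to reject in a later QC step.
rejected_ids = {"100010001"}
ids = ["100010001", "646082406"]
kept = [sid for sid in ids if sid not in rejected_ids]

print(split_station_id("646082406"))  # (6, 4608, 2406)
print(kept)                           # ['646082406']
```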

Contributor

@srherbener, you suggest using an integer in the future. But for now, with a 9-digit string, do you know how slow it will be for a vector of size 6x4608x2406?

Collaborator

@YoulongXia-NOAA the vector you mention is ~66 million entries. If we are reading that many locations into a single ObsSpace, then a variable length string data type on a variable (e.g., stationIdentification) in the file could be problematic. However, if we need the 66 million values only to cover all possible grid cells, and we read just a subset of them (say tens or hundreds of thousands of locations) into a single ObsSpace, then we should be fine.
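
The size estimate can be reproduced with simple arithmetic. The byte count below is an illustrative assumption (nine ASCII bytes per ID, no per-string overhead), not a measurement of an actual IODA file.

```python
# Grid dimensions quoted in the discussion: 6 tiles of 4608 x 2406 cells.
n_tiles, nx, ny = 6, 4608, 2406
n_cells = n_tiles * nx * ny
print(f"{n_cells:,}")  # 66,521,088

# A nine-character ID needs 9 bytes per entry as fixed-width ASCII.
id_bytes = n_cells * 9
print(f"{id_bytes / 1e6:.0f} MB")  # 599 MB
```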

Contributor Author

@yuanxue2870 yuanxue2870 May 13, 2025

For C96, we output a total of 12252 location ID strings on December 01. The total number of land points in C96 is 18320, meaning ~67% of the total land points are being output in the IMS-IODA output. For C1152, we have a total number of land points of 2381853, similarly, say when 67% are being output on December 01 (for example), we will output 1595841 ID strings.
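
The scaling in this comment works out as follows. The variable names are illustrative, and the 67% figure is the rounded fraction used in the comment itself.

```python
# Counts quoted in the comment above.
c96_output, c96_land = 12252, 18320
c1152_land = 2381853

fraction = c96_output / c96_land
print(f"{fraction:.1%}")  # 66.9%

# The comment rounds the fraction to 67% before scaling to C1152.
c1152_estimate = int(c1152_land * 0.67)
print(c1152_estimate)  # 1595841
```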

Collaborator

@srherbener srherbener May 13, 2025

The issue is not necessarily memory, rather it is runtime. When an hdf5/netcdf file has a large array (vector) of strings, and those strings are the variable length string datatype (which is what stationIdentification is), it can sometimes take a long time to read those strings into memory. When in memory the strings are stored as a vector in C++ and processing (compare, sort, conversion to DateTime objects, etc) those can also be slow, but not as bad as the file I/O.

I think we are in the gray zone for runtime performance with a string vector that contains a million elements. It may still be okay but if so it's probably getting close to where the runtime starts blowing up. It might make sense to:

  1. Leave stationIdentification as is for now and see how it performs
  2. If the runtime is too slow, then consider using a new integer variable, say "gridIdentification"

If this sounds like a good approach with everyone, you could merge this PR as is (using stationIdentification) and then do some profiling to decide if a numeric id is warranted (and add that in a subsequent PR).
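
To make the storage trade-off concrete, here is a minimal sketch in plain Python that contrasts a fixed-width character layout with a numeric one. It does not use the HDF5 or IODA APIs and says nothing about their actual performance; the sample IDs and the helper `read_fixed` are hypothetical.

```python
from array import array

ids = ["100010001", "300120045", "646082406"]
WIDTH = 9  # every ID has exactly nine ASCII characters

# Fixed-width layout: one contiguous buffer, entry i starts at i * WIDTH.
buffer = b"".join(sid.encode("ascii") for sid in ids)


def read_fixed(buf: bytes, index: int) -> str:
    """Read entry `index` directly by offset, without scanning."""
    start = index * WIDTH
    return buf[start:start + WIDTH].decode("ascii")


# Numeric alternative: signed 64-bit integers, 8 bytes per entry.
numeric = array("q", (int(sid) for sid in ids))

print(read_fixed(buffer, 2))               # 646082406
print(len(buffer))                         # 27
print(numeric.itemsize * len(numeric))     # 24
```

Both layouts give each entry a known size and offset, which is the property a variable length string lacks.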

Contributor

What you propose sounds good to me, thank you @srherbener for the suggestions and insight! (as always!)

Contributor Author

Sounds good to me. Thanks for your help!

Collaborator

Sounds great. Happy to approve now.

@fcvdb fcvdb added the NWS (NOAA National Weather Service) label May 22, 2025
@yuanxue2870
Contributor Author

yuanxue2870 commented Jun 26, 2025

Dear admin team, this PR is required for GFSv17 for the correct use of IMS observation errors. It has been approved by four reviewers so far. Please consider merging the PR when you get a chance. Feel free to let me know if anything is needed on my end. Thank you! cc: @CoryMartin-NOAA @ClaraDraper-NOAA

@BenjaminRuston
Collaborator

@srherbener you still have commit privileges on this repository as well, correct?

I will watch this and get it merged pending CI and local tests. Thanks.

@BenjaminRuston BenjaminRuston added the ready for merge label Jun 26, 2025
@BenjaminRuston
Collaborator

@srherbener, @yuanxue2870 , @CoryMartin-NOAA and @ClaraDraper-NOAA

I guess the only remaining issue is that this is very bespoke, and there is no IODA convention for something like a tile number and grid coordinates.

@ClaraDraper-NOAA in particular: in the bigger picture, the goal is to IODA-ify the pre-processing, and then JEDI would implicitly know these values already. Am I correct in this assumption?

@CoryMartin-NOAA
Contributor

Bigger picture is that we have a preprocessor that provides the input to the ioda-converter here, but we need to refactor it so that it uses IODA directly to write out a compliant file.

@BenjaminRuston
Collaborator

the CI also is not going to pass unless this is an internal branch

at this point should I just make a parallel PR for this?

@yuanxue2870
Contributor Author

the CI also is not going to pass unless this is an internal branch

at this point should I just make a parallel PR for this?

Sure. Please feel free to make a parallel PR. Thank you for your help!

@BenjaminRuston BenjaminRuston added the duplicate label and removed the ready for merge label Jun 26, 2025

Labels

duplicate, NWS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add location IDs to IMS IODA files

8 participants