Add IMS stationIdentification to output#1654
Conversation
|
Please review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you! |
|
@yuanxue2870 please update the test files as part of this PR, if you need help with that, let me know |
|
@yuanxue2870, test files include both input and output files. please keep consistency with ioda converters input and output filenames. Otherwise, you also need to modify .txt file to do the ioda-converter test. |
Thanks Youlong and Cory for your comments. I have already created the new test input and output files on Hera, also included their paths on this PR. The administrator can grab these new files and replace with the old ones directly (i.e., file names are consistent). Unless there is a public place that I can drop off these two files? |
|
The IODA converters repository uses "git-lfs" so they should be committed to this PR and replace the existing files |
Oh, I see. That is why it is different from other repos... Thanks for your clarification! I will do that. Convert to draft while uploading new test files... |
|
Please re-review: @BenjaminRuston, @YoulongXia-NOAA, @CoryMartin-NOAA, Thank you! |
|
|
||
| lons = lons.astype('float32') | ||
| lats = lats.astype('float32') | ||
| stid = stid.astype('int64') |
There was a problem hiding this comment.
stid = stid.astype('int32')
There was a problem hiding this comment.
@yuanxue2870 I recalled that only datetime use the double precision. For station ID, single precision is sufficient.
There was a problem hiding this comment.
Hi Youlong, thanks for your comments. For now, I do not think int32 or int64 would make a difference here as the "stid" is a nine-digit integer, and not being output to the IODA formatted as well. However, int32 will not be enough if we goes to higher resolution (e.g., C1152), where we may need a 11 to 12-digit integer, which will exceed int32 storage capacity. Hence, I would like to keep int64 here. - Thanks,Yuan
There was a problem hiding this comment.
Okay, keep it as suggested, @yuanxue2870. I am fine with it but this is the first time I found that people uses int64 for station ID, a bit weird.
There was a problem hiding this comment.
That is an optimal suggestion, @srherbener, thank you for your deep thoughts. sounds good with me.
YoulongXia-NOAA
left a comment
There was a problem hiding this comment.
I approved this PR.
srherbener
left a comment
There was a problem hiding this comment.
Thanks for this update!
| self.varAttrs[('dateTime', metaDataName)]['_FillValue'] = long_missing_value | ||
| self.outdata[('latitude', metaDataName)] = lats | ||
| self.outdata[('longitude', metaDataName)] = lons | ||
| self.outdata[('stationIdentification', metaDataName)] = strstid |
There was a problem hiding this comment.
Am I understanding correctly that stid itself, as an integer, is not written to the output file. Rather it is converted to a string and the string value is written out in the stationIdentification value? If so, I think it's okay for stid to be a 64-bit integer.
Pardon my ignorance of this obs type, but I am curious why ~4 billion values is not enough unique station id values. If this obs type is going to have very large numbers of locations, we might want to consider using a numeric station id, or be very careful to use a character array instead of a variable length string, since the variable length string storage and access in the file is going to be very slow. Not for this PR but something to think about for the future.
There was a problem hiding this comment.
@srherbener, as @ClaraDraper-NOAA would like to filter out some processed snow depth values at some IMS grids, and she asked me to add a staion ID as txxxxyyy, t is tile number for FV3 grid, 1-6, xxxx is the number for the longitude, and yyyy is number of latitude, e.g., C1152, 4608x2406 grids, xxxx from 0001 to 4608, yyyy 0001 to 2406, therefore, it is a 9-digit integer. In general, stid use string for ioda-converter. Are you thinking to use an integer instead?
There was a problem hiding this comment.
@srherbener Yes, you are correct. Only strid is written out.
Thanks for your suggestion. IMS is gridded obs, which is different from conventional "in-situ" (i.e., point-scale) obs. We want to give each grid an ID because we want to do some further QCs afterwards. The way we set up each ID is by concatenating its tile number, x-coordinate number, and y-coordinate number -- so we have super big values for stid. For now, int32 is sufficient. By using int64, I am trying to leave some wiggle room for higher resolution (in the long run) which is likely with an increased concatenated ID based on the current code logic.
I am happy to adjust if defining 'stid' as an int64 takes too much memory. Thanks!
There was a problem hiding this comment.
@srherbener, In future. you suggest to use an integer. but for now, If it is 9 digit string, do you know how slow it will be for a vector 6x4608x2406?
There was a problem hiding this comment.
@YoulongXia-NOAA the vector you mention is ~66 million entries. If we are reading that many locations into a single ObsSpace, then a variable length string data type on a variable (eg stationIdentification) in the file could be problematic. However, if it's the case where we need the 66 million values to cover all the possible grid cells, but we only read a subset of those grids (say 10's or 100's of thousands of locations) into a single ObsSpace then we should be fine.
There was a problem hiding this comment.
For C96, we output a total of 12252 location ID strings on December 01. The total number of land points in C96 is 18320, meaning ~67% of the total land points are being output in the IMS-IODA output. For C1152, we have a total number of land points of 2381853, similarly, say when 67% are being output on December 01 (for example), we will output 1595841 ID strings.
There was a problem hiding this comment.
The issue is not necessarily memory, rather it is runtime. When an hdf5/netcdf file has a large array (vector) of strings, and those strings are the variable length string datatype (which is what stationIdentification is), it can sometimes take a long time to read those strings into memory. When in memory the strings are stored as a vector in C++ and processing (compare, sort, conversion to DateTime objects, etc) those can also be slow, but not as bad as the file I/O.
I think we are in the gray zone for runtime performance with a string vector that contains a million elements. It may still be okay but if so it's probably getting close to where the runtime starts blowing up. It might make sense to:
- Leave stationIdentification as is for now and see how it performs
- If the runtime is too slow, then consider using a new integer variable, say "gridIdentification"
If this sounds like a good approach with everyone, you could merge this PR as is (using stationIdentification) and then do some profiling to decide if a numeric id is warranted (and add that in a subsequent PR).
There was a problem hiding this comment.
What you propose sounds good to me, thank you @srherbener for the suggestions and insight! (as always!)
There was a problem hiding this comment.
Sounds good to me. Thanks for your help!
There was a problem hiding this comment.
Sounds great. Happy to approve now.
|
Dear admin team, this PR is required for GFSv17 for the correct use of IMS observation errors. It had been approved by four reviewers so far. Please consider merging the PR when you get a chance. Feel free to let me know if there is anything needed on my end. Thank you! cc: @CoryMartin-NOAA @ClaraDraper-NOAA |
|
@srherbener you still have commit privilege on this repository as well correct? I will watch and get this merged pending CI and local tests thanks |
|
@srherbener, @yuanxue2870 , @CoryMartin-NOAA and @ClaraDraper-NOAA I guess the only remaining issue is this is very bespoke and there is no IODA convention for such a thing as @ClaraDraper-NOAA in particular, in the bigger picture the goal is to IODA-ify the pre-processing and then JEDI would implicitly know these values already. Am I correct in this assumption? |
|
Bigger picture is that we have a preprocessor that provides the input to the ioda-converter here, but we need to refactor it so that it uses IODA directly to write out a compliant file. |
|
the CI also is not going to pass unless this is an internal branch at this point should I just make a parallel PR for this? |
Sure. Please feel free to make a parallel PR. Thank you for your help! |
Description
Issue(s) addressed
Resolves #1642
Dependencies
List the other PRs that this PR is dependent on:
Impact
Expected impact on downstream repositories: No significant impacts are expected. More IMS QCs can be done based on this added IMS ID feature. The input and output to the ctest of "test_iodaconv_imsfv3grid_scf" needs to be changed, otherwise, the ctest will fail.
Checklist