
scan_grib is not outputting all the pieces dimensions correctly. #150

Open
chiaral opened this issue Apr 14, 2022 · 13 comments
@chiaral

chiaral commented Apr 14, 2022

I am very new at this, so I might not comprehend the exact goal of scan_grib but here we go (this is a follow up from discourse):

What's the issue: scan_grib doesn't output all the "parts" I expect to see in the json files, hence when I read the file through xarray I am missing parts of it. Something is wrong in how the pieces coming out of _split_file are parsed and written. I explored a bit, but couldn't find what the issue is.

grib_file to download:
!wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2

versions I am using:
[screenshot: installed package versions]

import xarray as xr
import cfgrib

dsx = xr.open_dataset('acpcp_sfc_2000010100_c00.grib2', engine='cfgrib')
dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')

in both cases (which are essentially the same thing) the dimensions are recognized correctly, in particular:

dsx
[screenshot: dsx Dataset repr]

dscf

Dataset(
dimensions={'step': 80, 'latitude': 721, 'longitude': 1440}, 
variables={'number': Variable(dimensions=(), data=0), 
'time': Variable(dimensions=(), data=946684800), 
'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
                    ...condensed..for...clarity    
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 
'surface': Variable(dimensions=(), data=0.0),
'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
                   ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 
'longitude': Variable(dimensions=('longitude',), data=array([0.0000e+00, 2.5000e-01, 5.0000e-01, ..., 3.5925e+02, 3.5950e+02,
       3.5975e+02])), 'valid_time': Variable(dimensions=('step',), data=array([9.466956e+08, 9.467064e+08, 9.467172e+08, 9.467280e+08,
                  ...condensed..for...clarity    
       9.475164e+08, 9.475272e+08, 9.475380e+08, 9.475488e+08])), 
'acpcp': Variable(dimensions=('step', 'latitude', 'longitude'), 
data=OnDiskArray(index=FileIndex(fieldset=FileStream(path='acpcp_sfc_2000010100_c00.grib2', errors='warn'), index_keys=['centre', 'centreDescription', 'dataDate', 'dataTime', 'dataType', 'directionNumber', 'edition', 'endStep', 'frequencyNumber', 'gridType', 'level:float', 'number', 'numberOfPoints', 'paramId', 'step', 'stepType', 'stepUnits', 'subCentre', 'time', 'typeOfLevel'], filter_by_keys={'paramId': 3063}, computed_keys={}, index_protocol_version='2'), shape=(80, 721, 1440), missing_value=9999))}, attributes={'GRIB_edition': 2, 'GRIB_centre': 'kwbc', 'GRIB_centreDescription': 'US National Weather Service - NCEP', 'GRIB_subCentre': 2, 'Conventions': 'CF-1.7', 'institution': 'US National Weather Service - NCEP', 'history': '2022-04-14T18:50 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {"source": "acpcp_sfc_2000010100_c00.grib2", "filter_by_keys": {}, "encode_cf": ["parameter", "time", "geography", "vertical"]}'}, encoding={'source': 'acpcp_sfc_2000010100_c00.grib2', 'filter_by_keys': {}, 'encode_cf': ('parameter', 'time', 'geography', 'vertical')})

Note: step is a dimension, just like latitude and longitude. time is particular: it has no dimension and a single value. step, just like latitude and longitude, has a full array of values; in particular, I see no difference between:

'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
        ...condensed..for...clarity
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 

and

'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
            ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 

however, when I do:

!pip install kerchunk
import warnings
warnings.simplefilter("ignore")  # filter some warning messages
import xarray as xr
import hvplot.xarray
import datetime as dt
import pandas as pd
import dask
import panel as pn
import json
import fsspec
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

and

so = {"anon": True, "default_cache_type": "readahead"}
outscan = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = ['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so )

it takes a second, and outscan looks like:

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:03 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpq40ghn6agrib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',
  'time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<i8",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'time/0': 'base64:gENtOAAAAAA=',
  'time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "initial time of forecast",\n    "standard_name": "forecast_reference_time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',
  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',
  'longitude/0': ['{{u}}', 0, 325317],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'valid_time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'valid_time/0': 'base64:AAAA2LY2zEE=',
  'valid_time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "time",\n    "standard_name": "time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

Note the part regarding step:

 'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

what you notice is that these are all empty:

  • "chunks": [],
  • "shape": [],
  • "filters": null,\n
  • 'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  • "_ARRAY_DIMENSIONS": []
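As a sanity check (not part of kerchunk itself), those inline scalar values can be decoded by hand; they hold only the first message's values. This assumes the <f8/<i8 dtypes shown in the .zarray entries above:

```python
import base64
import struct

# 'step/0' from the refs above: raw little-endian float64 bytes
step_raw = b"\x00\x00\x00\x00\x00\x00\x08@"
(step,) = struct.unpack("<d", step_raw)
print(step)  # 3.0 -- only the first of the 80 step values

# 'time/0' from the refs above: base64-encoded little-endian int64
time_raw = base64.b64decode("gENtOAAAAAA=")
(time_val,) = struct.unpack("<q", time_raw)
print(time_val)  # 946684800, i.e. 2000-01-01T00:00:00 in seconds since epoch
```

So 'step/0' really is just the scalar 3.0 (the first step), not the 80-element coordinate.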

by contrast, for latitude or longitude:

 'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": 

you have:

  • "chunks": [\n 721\n ],\n
  • "filters": [\n {\n "id": "grib",\n "var": "latitude"\n }\n ],\n
  • "shape": [\n 721\n ],
  • 'latitude/0': ['{{u}}', 0, 325317],
  • "_ARRAY_DIMENSIONS": [\n "latitude"\n ],\n

Ok - so I went and looked into grib2.py and I isolated the part where the reading of the variables defined as common_vars happens (link):

if common is False:
    # done for first valid message
    logger.debug("Common variables")
    z.attrs.update(ds.attributes)
    for var in common_vars:
        # assume grid, etc is the same across all messages
        attr = ds.variables[var].attributes or {}
        attr['_ARRAY_DIMENSIONS'] = ds.variables[var].dimensions
        _store_array(store, z, ds.variables[var].data, var, inline_threashold, offset, size,
                     attr)

If I simply do as below (for simplicity I limit common_vars to two, step and latitude):

dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')
for var in ['step', 'latitude', ]:
                        # assume grid, etc is the same across all messages
    print(var)
    print(dscf.variables[var].attributes)
    print(dscf.variables[var].dimensions)
    print(dscf.variables[var].data)

I get:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
('step',)
[  3.   6.   9.  12.  15.  18.  21.  24.  27.  30.  33.  36.  39.  42.
        ...condensed..for...clarity
 213. 216. 219. 222. 225. 228. 231. 234. 237. 240.]
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]

So the problem is NOT the grib2 file (which oftentimes can be the culprit).
If I copy and paste the relevant code from grib2.py to run _split_file:

import tempfile
def _split_file(f, skip=0):
    if hasattr(f, "size"):
        size = f.size
    else:
        f.seek(0, 2)
        size = f.tell()
        f.seek(0)
    part = 0

    while f.tell() < size:
        # logger.debug(f"extract part {part + 1}")
        start = f.tell()
        f.seek(12, 1)
        part_size = int.from_bytes(f.read(4), "big")
        f.seek(start)
        data = f.read(part_size)
        assert data[:4] == b"GRIB"
        assert data[-4:] == b"7777"
        fn = tempfile.mktemp(suffix="grib2")
        with open(fn, "wb") as fo:
            fo.write(data)
        yield fn, start, part_size
        part += 1
        if skip and part > skip:
            break
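For what it's worth, the seek(12, 1)/read(4) step works because GRIB2's 16-byte indicator section stores the total message length as an 8-byte big-endian integer in bytes 8-15, so reading the low 4 bytes at offset 12 suffices for messages under 4 GiB. A toy demonstration with fake messages, just to illustrate the layout the splitter relies on:

```python
import io

def fake_message(payload=b"x" * 4):
    # Minimal stand-in for a GRIB2 message: "GRIB", 2 reserved bytes,
    # discipline, edition, 8-byte big-endian total length, payload, "7777"
    total = 16 + len(payload) + 4
    return (b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02"
            + total.to_bytes(8, "big") + payload + b"7777")

f = io.BytesIO(fake_message() + fake_message())
f.seek(0, 2)
size = f.tell()
f.seek(0)

parts = []
while f.tell() < size:
    start = f.tell()
    f.seek(12, 1)                                  # jump to bytes 12-15
    part_size = int.from_bytes(f.read(4), "big")   # low 4 bytes of the length
    f.seek(start)
    data = f.read(part_size)
    assert data[:4] == b"GRIB" and data[-4:] == b"7777"
    parts.append((start, part_size))

print(parts)  # [(0, 24), (24, 24)]
```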

and run:

import cfgrib
url = 'acpcp_sfc_2000010100_c00.grib2'
with fsspec.open(url, "rb") as f:
    for fn, offset, size in _split_file(f, skip=0):
        dscf = cfgrib.open_file(fn)
        for var in ['step', 'latitude', ]:
                        # assume grid, etc is the same across all messages
            print(var)
            print(dscf.variables[var].attributes)
            print(dscf.variables[var].dimensions)
            print(dscf.variables[var].data)

what I get is the same as in outscan above, BUT I get one printout for each instance of step:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
3.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
6.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
9.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]
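So each message carries step as a scalar, and the 80-element coordinate only exists once those scalars are gathered across all messages, which the common-variables branch quoted above never does (it stores the variable once, from the first message). A toy sketch of the missing concatenation, using the hypothetical 3-hourly values seen in the printouts:

```python
time = 946684800                        # seconds since epoch, from 'time/0'
per_message_steps = [3.0 * (i + 1) for i in range(80)]  # hours, one scalar per message

step = per_message_steps                # the length-80 coordinate scan_grib drops
valid_time = [time + 3600 * s for s in step]            # time + step, in seconds

print(len(step), step[0], step[-1])     # 80 3.0 240.0
print(valid_time[0])                    # 946695600, matching the first valid_time above
```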

however, once again, when I run scan_grib (let's use only step and latitude as common vars):

so = {"anon": True, "default_cache_type": "readahead"}
outscansl = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = [ 'step', 'latitude', ], storage_options=so )

it takes a while (almost as if it is indeed going through all 80 steps of step), but the output has only the first one:

outscansl

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:23 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpyycaxsp5grib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',

  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',

  'longitude/0': ['{{u}}', 29740164, 413329],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

so I thought the issue was _store_array overwriting things, but if I copy and paste the whole grib2.py file and add some print statements, it seems like _store_array prints only the first step value (which makes sense, since that's what I have in the outscan variable), even though it is within the for var in common_vars: loop 🤷🏼‍♀️

Bottom line, I have no idea where the issue is.
Clearly, _split_file decides to generate one "message" per step (why? and why not per latitude? the order of common_vars doesn't matter, I checked), but then it loses those parts somehow.

Sorry for the long issue, but I wanted to be as detailed as possible.
Thanks for your time,
Chiara

@martindurant (Member)

@TomAugspurger , you probably have done the most GRIB scanning, any immediate thoughts on this description? I personally have been meaning to come back to grib, and there's a good chance that the code in kerchunk right now is too specific to the examples I used when writing it.

@TomAugspurger (Contributor)

@TomAugspurger , you probably have done the most GRIB scanning, any immediate thoughts on this description?

Lucky me! :)

Let me take a look at this with cogrib. When I was prototyping it I wanted to match cfgrib's output. But I too was looking at specific files, so I don't know if it'll handle GEFS.

@chiaral (Author)

chiaral commented Apr 14, 2022

the big difference in these files (GEFS REFORECAST), compared to any REAL TIME forecast (GEFS included), is that the grib file has a dimension for the forecast step (or lead time). In REAL TIME forecast grib files it's very common to have one grib file per start date and step (lead) time, hence step always has length=1

What I don't understand is where the message is lost, because when I run the loop outside of scan_grib I do get the loop through all the steps.
I also couldn't figure out where the decision is made to split step into different parts, instead of just reading it as a separate dimension like latitude and longitude.

thanks guys!

@TomAugspurger (Contributor)

TomAugspurger commented Apr 14, 2022

This kinda works, maybe...

A bunch of hacky code to parse the .index file

!pip install -q git+https://github.com/TomAugspurger/cogrib

import adlfs
import cogrib
import urllib.request
import cfgrib
import requests
import re
import itertools

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

# XXX: No idea why this is necessary,
# but the inventory file uses different names than cfgrib / xarray.
PARAM_VARIABLE_MAPPING = {
    "tcdc": "tcc",
    "tmp": "t",
    "soilm": "ssw",
}
# 8:338750:d=2022022100:TMP:0-0.1 m below ground:anl:
def parse_index(index_text: str, length):
    lines = index_text.split("\n")
    names = ["row", "_offset", "date", "variable", "type_of_level", "thing"]
    records = [
        dict(zip(names, line.split(":")[:-1]))
        for line in lines
        if line
    ]

    range_xpr = re.compile(r"(\d*\.?\d*)-(\d*\.?\d*)")

    for record in records:
        record["_offset"] = int(record["_offset"])
        record["param"] = record.pop("variable").lower()  # is this right?
        # XXX: why does this name differ?
        if record["param"] in PARAM_VARIABLE_MAPPING:
            record["param"] = PARAM_VARIABLE_MAPPING[record["param"]]

        m = range_xpr.match(record["type_of_level"])
        if m:
            endpoints = list(map(float, m.groups()))
            record["levelist"] = endpoints[0]
        else:
            record["levelist"] = None
        start, stop = record["thing"].split(" ")[0].split("-")
        record["step"] = stop
            
    for a, b in pairwise(records):
        a["_length"] = b["_offset"] - a["_offset"]
    records[-1]["_length"] = length - records[-1]["_offset"]
    return records
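The pairwise trick above is what converts the consecutive byte offsets from the .idx file into per-message byte lengths; a small standalone illustration with made-up offsets:

```python
import itertools

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

# Hypothetical byte offsets of three messages in a grib file; each message's
# length is the gap to the next offset (the last needs the total file size).
offsets = [0, 338750, 677500]
lengths = [b - a for a, b in pairwise(offsets)]
print(lengths)  # [338750, 338750]
```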

Create the references

import urllib.request
import requests


url = 'https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2'
filename, message = urllib.request.urlretrieve(url)
r = requests.get(url + ".idx")
indices = parse_index(r.text, int(message["Content-Length"]))

dss = cfgrib.open_datasets(filename, backend_kwargs=dict(indexpath=""))
references = cogrib.make_references(dss[0], indices, url)

Use them with a reference file system

import fsspec
import xarray as xr

ds = xr.open_dataset(
    fsspec.get_mapper("reference://", fo=references),
    engine="zarr",
    chunks={}
)
ds

which outputs

<xarray.Dataset>
Dimensions:     (step: 80, latitude: 721, longitude: 1440)
Coordinates:
  * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude   (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
    number      int64 ...
  * step        (step) timedelta64[ns] 0 days 03:00:00 ... 10 days 00:00:00
    surface     float64 ...
    time        datetime64[ns] ...
    valid_time  (step) datetime64[ns] dask.array<chunksize=(80,), meta=np.ndarray>
Data variables:
    acpcp       (step, latitude, longitude) float32 dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_edition:            2
    GRIB_subCentre:          2
    institution:             US National Weather Service - NCEP

Note that this doesn't use kerchunk.grib2.scan_grib. But it should output an fsspec-compatible output.

@martindurant (Member)

(as noted elsewhere, kerchunk.grib is due to be updated based on advances in cogrib, when we can spare the effort)

@chiaral (Author)

chiaral commented Apr 14, 2022

Thank you both!

I am not surprised that this grib needs different handling compared to standard cfgrib usage.
Unfortunately the NOAA-generated grib files can be very, very different from the ECMWF ones. It's one of the biggest issues with grib.

I will take a look at cogrib! thanks!

@martindurant (Member)

I think this is solved?

@chiaral (Author)

chiaral commented Jul 12, 2023

Hi!

I just tried to re-run the code based on this example but modifying to match the directory organization and the file naming in this bucket here

And no, the step dimension is once again lost, as above. TL;DR: in these GEFS files, time is the dimension to concatenate along, while step is fixed (80 values) and identical across all the files I am trying to combine.

I haven't retraced my steps or the other issues, or looked at scan_grib again, but this is my working example now:

import os
import fsspec
from datetime import datetime, timedelta
import xarray as xr
import ujson
import kerchunk
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

fs_local = fsspec.filesystem('')
fs_s3 = fsspec.filesystem('s3', anon = True)

s3_so = {
    'anon': True, 
    'skip_instance_cache': True
}

# i need no filters

try:
    fs_local.mkdir('individualreforecast/')
except FileExistsError:
    print(FileExistsError)
def gen_json(file_url):
    out = scan_grib(file_url, storage_options= s3_so)
    out = out[0]
    name = f"individualreforecast/{file_url.split('/')[-1]}.json"
    with fs_local.open(name, 'w') as f:
        f.write(ujson.dumps(out))
        
try:
    fs_local.mkdir('combinedreforecast/')
except FileExistsError:
    print(FileExistsError)

def combine(json_urls):
    mzz = MultiZarrToZarr(json_urls,concat_dims = ['time']) #note the concat_dim is time
    name = f"combinedreforecast/{'.'.join(json_urls[0].split('/')[-1].split('.')[:-2])}.combined.json"
    print(name)
    with fs_local.open(name, 'w') as f:
        f.write(ujson.dumps(mzz.translate()))

for date in range(2000, 2002):
    for run in ['c00', 'p01','p02', 'p03', 'p04']:
        print(date, run, run)
        files = fs_s3.glob(f"noaa-gefs-retrospective/GEFSv12/reforecast/{str(date)}/*/{run}/Days:1-10/cape_sfc_*")
        files = [f for f in files if f.split('.')[-1] != 'idx']
        files = sorted(['s3://'+f for f in files])
#             files = files[:2] #just doing the first 2 files for now
        for file in files[:5]:
            gen_json(file)

        jsons = fs_local.glob(f"individualreforecast/cape_sfc_{str(date)}*_{run}.grib2.json")
        print(jsons)
        combine(jsons)

combined = fs_local.glob(f"combinedreforecast/cape*")
mzz = MultiZarrToZarr(combined, 
                    remote_protocol='s3',
                    remote_options={'anon':True},
                    concat_dims = ['number'],
                    identical_dims = [ 'longitude', 'latitude','time'])


out = mzz.translate()

fs_ = fsspec.filesystem("reference", fo=out, remote_protocol='s3', remote_options={'anon':True})
m = fs_.get_mapper("")
xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))

this works, but the output looks like this, with only one value for the step dimension:

[screenshot: combined dataset repr with step of length 1]

whereas when I open one file directly, it has 80 step values (I am OK with losing valid_time, because they are not stackable; they are equal to time+step):

[screenshot: single-file dataset repr with 80 step values]

@martindurant (Member)

Do I understand from this, that the logic in scan_grib, and how it assumes filter/keyword values to be coordinates is not correct for this dataset? Maybe we need to be more flexible. That doesn't surprise me, since (as is usual), the current code was based on a limited set of example data.

It is always possible to drop unused coordinates later and to hand-specify how to derive coordinates from each input, so you can always work around the default behaviour, but of course it would be better not to have to resort to that.

Do you think you will have time to examine the current code? https://github.com/fsspec/kerchunk/blob/main/kerchunk/grib2.py#L170

@chiaral (Author)

chiaral commented Jul 12, 2023

Thanks @martindurant for your time!

it appears that scan_grib makes a decision to look only at lat/lon and not at the other dimensions/coordinates.
In this case the data has three dimensions with length > 1 (lat/lon/step), whereas scan_grib seems to reduce everything to only lat/lon having length > 1, assuming everything else has length 1.

I will have a look at the code!

@dcherian (Contributor)

On #363 I get

[screenshot: combined dataset repr]

after combining with

scan = scan_grib(url)
combined_refs = MultiZarrToZarr(
    scan,
    concat_dims=["step"],
    identical_dims=['number', 'time', 'surface'],
).translate()

and relatively minor differences with the cfgrib dataset. I don't know what's happening with the dtypes, other than that cfgrib always chooses float64?

Left and right Dataset objects are not identical

Differing coordinates:
L   surface     int64 ...
R   surface     float64 ...
    long_name: original GRIB coordinate for key: level(surface)
    units: 1
L * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
R * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
    stored_direction: decreasing
Differing data variables:
L   acpcp       (step, latitude, longitude) float64 0.0 0.0 0.0 ... 0.0 0.0 0.0
R   acpcp       (step, latitude, longitude) float32 0.0 0.0 0.0 ... 0.0 0.0 0.0
Attributes only on the right object:
    Conventions: CF-1.7
    history: 2023-09-25T14:23 GRIB to CDM+CF via cfgrib-0.9.10.4/ecCodes-2.31...

@martindurant (Member)

other than cfgrib always chooses float64?

I believe it does

@martindurant (Member)

other than cfgrib always chooses float64?

I believe it does

Another reason why it would be really nice to be able to get the actual encoded buffer out rather than a whole message.
