scan_grib is not outputting all the pieces dimensions correctly. #150
@TomAugspurger, you probably have done the most GRIB scanning, any immediate thoughts on this description? I personally have been meaning to come back to grib, and there's a good chance that the code in kerchunk right now is too specific to the examples I used when writing it.
Lucky me! :) Let me take a look at this with cogrib. When I was prototyping it, I wanted to match cfgrib's output, but I too was looking at specific files, so I don't know if it'll handle GEFS.
The big difference in these files (GEFS reforecast) compared to any real-time forecast (GEFS included) is that the grib file has a dimension for the forecast step (or lead time). In real-time forecast grib files it's very common to have one grib file for each start date and step (lead) time, hence step always has length 1. What I don't understand is where the message is lost, because when I run the loop outside of the function, the steps are all there. Thanks guys!
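To make that layout difference concrete, here is a small sketch (the file names and the 3-hourly cadence are my assumptions, not taken from the thread): a real-time run spread over one file per lead time, versus a reforecast file that carries the whole `step` dimension.

```python
from datetime import timedelta

# Hypothetical real-time layout: one file per start date and lead time,
# so each file's "step" coordinate has length 1. (Naming is made up.)
realtime_files = {
    f"gefs.t00z.f{h:03d}.grib2": [timedelta(hours=h)]
    for h in range(3, 241, 3)
}

# Reforecast layout: a single file holding every lead time, so "step" is a
# real dimension. 3-hourly steps out to 240 h give the 80 steps seen in the
# GEFS reforecast "Days:1-10" files discussed below.
reforecast_steps = [timedelta(hours=h) for h in range(3, 241, 3)]

print(len(realtime_files), len(reforecast_steps))
```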
This kinda works, maybe... A bunch of hacky code to parse the `.idx` file:

```
!pip install -q git+https://github.com/TomAugspurger/cogrib
```

```python
import itertools
import re
import urllib.request

import cfgrib
import cogrib
import requests


def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)


# XXX: No idea why this is necessary,
# but the inventory file uses different names than cfgrib / xarray.
PARAM_VARIABLE_MAPPING = {
    "tcdc": "tcc",
    "tmp": "t",
    "soilm": "ssw",
}


# Example .idx row: 8:338750:d=2022022100:TMP:0-0.1 m below ground:anl:
def parse_index(index_text: str, length):
    lines = index_text.split("\n")
    names = ["row", "_offset", "date", "variable", "type_of_level", "thing"]
    records = [
        dict(zip(names, line.split(":")[:-1]))
        for line in lines
        if line
    ]
    range_xpr = re.compile(r"(\d*\.?\d*)-(\d*\.?\d*)")
    for record in records:
        record["_offset"] = int(record["_offset"])
        record["param"] = record.pop("variable").lower()  # is this right?
        # XXX: why does this name differ?
        if record["param"] in PARAM_VARIABLE_MAPPING:
            record["param"] = PARAM_VARIABLE_MAPPING[record["param"]]
        m = range_xpr.match(record["type_of_level"])
        if m:
            endpoints = list(map(float, m.groups()))
            record["levelist"] = endpoints[0]
        else:
            record["levelist"] = None
        start, stop = record["thing"].split(" ")[0].split("-")
        record["step"] = stop
    for a, b in pairwise(records):
        a["_length"] = b["_offset"] - a["_offset"]
    records[-1]["_length"] = length - records[-1]["_offset"]
    return records
```

Create the references:

```python
url = 'https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2'
filename, message = urllib.request.urlretrieve(url)
r = requests.get(url + ".idx")
indices = parse_index(r.text, int(message["Content-Length"]))
dss = cfgrib.open_datasets(filename, backend_kwargs=dict(indexpath=""))
references = cogrib.make_references(dss[0], indices, url)
```

Use them with a reference file system:

```python
import fsspec
import xarray as xr

ds = xr.open_dataset(
    fsspec.get_mapper("reference://", fo=references),
    engine="zarr",
    chunks={},
)
ds
```

which outputs:
Note that this doesn't use kerchunk.
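The offset/length bookkeeping in `parse_index` above can be checked in isolation: each `.idx` row records where a message starts, and a message's byte length is the next row's offset minus its own (the last message runs to the end of the file). A minimal sketch with a made-up three-row index:

```python
import itertools


def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)


# Synthetic .idx content (offsets are made up, the shape matches real files):
idx_text = """\
1:0:d=2022022100:TMP:surface:anl:
2:120:d=2022022100:TMP:2 m above ground:anl:
3:300:d=2022022100:ACPCP:surface:anl:
"""
records = []
for line in idx_text.strip().split("\n"):
    row, offset, date, variable, level = line.split(":")[:5]
    records.append({"row": int(row), "_offset": int(offset), "param": variable.lower()})

file_length = 450  # total file size, e.g. from the Content-Length header
for a, b in pairwise(records):
    a["_length"] = b["_offset"] - a["_offset"]
records[-1]["_length"] = file_length - records[-1]["_offset"]

print([r["_length"] for r in records])  # [120, 180, 150]
```

These `(offset, length)` pairs are exactly what a reference file system needs to fetch one message at a time.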
(As noted elsewhere, kerchunk.grib is due to be updated based on advances in cogrib, when we can spare the effort.)
Thank you both! I am not surprised that this grib needs different handling compared to standard cfgrib usage. I will take a look at cogrib, thanks!
I think this is solved?
Hi! I just tried to re-run the code based on this example, but modified to match the directory organization and the file naming in this bucket. And no, the issue is still there. I haven't traced my steps against these other issues, or looked at cogrib yet.

```python
import os
from datetime import datetime, timedelta

import fsspec
import ujson
import xarray as xr

import kerchunk
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

fs_local = fsspec.filesystem('')
fs_s3 = fsspec.filesystem('s3', anon=True)
s3_so = {
    'anon': True,
    'skip_instance_cache': True
}
# i need no filters

try:
    fs_local.mkdir('individualreforecast/')
except FileExistsError:
    print(FileExistsError)


def gen_json(file_url):
    out = scan_grib(file_url, storage_options=s3_so)
    out = out[0]
    name = f"individualreforecast/{file_url.split('/')[-1]}.json"
    with fs_local.open(name, 'w') as f:
        f.write(ujson.dumps(out))


try:
    fs_local.mkdir('combinedreforecast/')
except FileExistsError:
    print(FileExistsError)


def combine(json_urls):
    mzz = MultiZarrToZarr(json_urls, concat_dims=['time'])  # note the concat_dim is time
    name = f"combinedreforecast/{'.'.join(json_urls[0].split('/')[-1].split('.')[:-2])}.combined.json"
    print(name)
    with fs_local.open(name, 'w') as f:
        f.write(ujson.dumps(mzz.translate()))


for date in range(2000, 2002):
    for run in ['c00', 'p01', 'p02', 'p03', 'p04']:
        print(date, run)
        files = fs_s3.glob(f"noaa-gefs-retrospective/GEFSv12/reforecast/{str(date)}/*/{run}/Days:1-10/cape_sfc_*")
        files = [f for f in files if f.split('.')[-1] != 'idx']
        files = sorted(['s3://' + f for f in files])
        # files = files[:2]  # just doing the first 2 files for now
        for file in files[:5]:
            gen_json(file)
        jsons = fs_local.glob(f"individualreforecast/cape_sfc_{str(date)}*_{run}.grib2.json")
        print(jsons)
        combine(jsons)

combined = fs_local.glob("combinedreforecast/cape*")
mzz = MultiZarrToZarr(combined,
                      remote_protocol='s3',
                      remote_options={'anon': True},
                      concat_dims=['number'],
                      identical_dims=['longitude', 'latitude', 'time'])
out = mzz.translate()

fs_ = fsspec.filesystem("reference", fo=out, remote_protocol='s3', remote_options={'anon': True})
m = fs_.get_mapper("")
xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False))
```

This works, but the output looks like this, with only one value for `step`, whereas when I open one file directly it has 80 step values. (I am ok with losing valid times, because they are not stackable; they are equal to time + step.)
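The point about valid times being droppable can be made concrete: with the run start as `time` and the lead times as `step`, every valid time is recoverable as `time + step`, so nothing is lost by omitting them. A small sketch, assuming the 3-hourly, 240-hour cadence of these reforecast files:

```python
from datetime import datetime, timedelta

time = datetime(2000, 1, 1, 0)                           # model run start
steps = [timedelta(hours=h) for h in range(3, 241, 3)]   # 80 lead times
valid_times = [time + s for s in steps]                  # fully derived

print(valid_times[0], valid_times[-1])
```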
Do I understand from this that the logic in scan_grib, and how it assumes filter/keyword values to be coordinates, is not correct for this dataset? Maybe we need to be more flexible. That doesn't surprise me, since (as usual) the current code was based on a limited set of example data. It is always possible to drop unused coordinates later and to hand-specify how to derive coordinates from each input, so you can always work around the default behaviour, but of course it would be better not to have to resort to that. Do you think you will have time to examine the current code? https://github.com/fsspec/kerchunk/blob/main/kerchunk/grib2.py#L170
Thanks @martindurant for your time! It appears that scan_grib makes a decision to look only at lat/lon and not at other dimensions/coordinates. I will have a look at the code!
On #363, after combining with

```python
scan = scan_grib(url)
combined_refs = MultiZarrToZarr(
    scan,
    concat_dims=["step"],
    identical_dims=['number', 'time', 'surface'],
).translate()
```

I get relatively minor differences with the cfgrib dataset. I don't know what's happening with dtypes, other than cfgrib always chooses float64?
I believe it does
Another reason why it would be really nice to be able to get the actual encoded buffer out rather than a whole message.
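Fetching a single message rather than the whole file is what the reference approach does under the hood: issue an HTTP byte-range request using the `.idx` offset and length. A hedged sketch (the helper name and the byte values are mine, not from kerchunk):

```python
def range_header(offset: int, length: int) -> dict:
    # HTTP Range header selecting one GRIB message; the Range end is inclusive.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# e.g. a message at offset 338750 spanning 120000 bytes (made-up length):
hdr = range_header(338750, 120000)
print(hdr)  # {'Range': 'bytes=338750-458749'}
```

A GET with this header (e.g. `requests.get(url, headers=hdr)`) returns just that message's bytes, which is what the `[url, offset, length]` entries in a kerchunk reference set encode.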
I am very new at this, so I might not comprehend the exact goal of `scan_grib`, but here we go (this is a follow-up from Discourse).

What's the issue: `scan_grib` doesn't output all the "parts" that I expect to see in the json files, hence when I read the file through xarray I am missing parts of it. Something is wrong in how the output of `_split_file` is parsed and written. I explored a bit, but couldn't find what the issue is.

grib file to download:

```
!wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2
```
versions I am using:
In both cases (which are essentially the same thing) the dimensions are recognized correctly, in particular in `dsx` and `dscf`. Note: `step` is a dimension, just like `latitude`, `longitude` and `time`. `time` is particular: it has no `dimension` and one `data`. `step`, just like `latitude` and `longitude`, has values in all the fields; again, in particular, I see no difference between the two.
However, when I actually run the scan, it takes a second, and `outscan` looks like this: note the part regarding `step`; what you notice is that those entries are all empty. Differently, for `latitude` or `longitude`, you have actual values.
Ok, so I went and looked into `grib2.py` and isolated the part where the reading of the variables defined as `common_vars` happens (link). If I simply do as below (for simplicity I limit `common_vars` to two, `step` and `latitude`), I get all the values.
So the problem is NOT the grib2 file (which oftentimes can be).
If I copy and paste from `grib2.py` the relevant code to run `_split_file` and run it, what I get is the same as in `outscan` above, BUT I get one printout for each instance of `step`. However, once again, when I run `scan_grib` (let's use only `step` and `latitude` as common vars), it takes a while (almost as if it is indeed going through all the 80 values of `step`) but the output has only the first one.
_store_array
that was overwriting things, but if I copy and paste the wholegrib2.py
file and add someprint
statemetns, it seems like_store_array
prints only the firststep
value (which makes sense since it's what i have in the outscan variable) although it is within the loop offor var in common_vars:
🤷🏼♀️Bottom line, I have no idea where the issue is.
Clearly, the
_split_file
decides to generate one "message" perstep
(why? why not latitude? the order ofcommon_vars
doesn't matter I checked), but then it looses those parts somehow.Sorry for the long issue, but I wanted to be as detailed as possible.
Thanks for your time,
Chiara
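The behaviour described above (80 messages scanned, but only the first `step` kept) is consistent with a store that writes each coordinate under a fixed key and skips keys that already exist. A toy illustration of that suspected failure mode, not kerchunk's actual code:

```python
# Three fake GRIB messages, each carrying a scalar "step" value (hours).
messages = [{"step": s} for s in (3, 6, 9)]

# Keep-first storage: once "step" exists, later messages are ignored,
# so only the first value survives (matching what outscan shows).
store_first = {}
for msg in messages:
    store_first.setdefault("step", msg["step"])

# Accumulating storage: each message's value is appended, so "step"
# grows into a proper coordinate.
store_all = {"step": []}
for msg in messages:
    store_all["step"].append(msg["step"])

print(store_first["step"], store_all["step"])  # 3 [3, 6, 9]
```

Whether scan_grib's actual code path matches the keep-first pattern would need to be confirmed against `grib2.py` itself.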