potential performance improvements for GRIB files #127
Comments
@TomAugspurger , can you please link here your "cogrib" experiments? It indeed scans files without first downloading them, and we plan to upstream it here. I'm not sure if it answers all your points, @d70-t , perhaps you have gone further. Aside from this, it should also be possible to peek into the binary description of the data and directly find the buffers representing the main array of each message. This is assuming we can understand the encoding, which is a very likely yes. This would allow:
|
That's at https://github.com/TomAugspurger/cogrib.
That It'd be nice to avoid the temporary file for scanning too, but one of my desires was to match the output of cfgrib. |
And yes, the issue with |
Actually, can you please enlighten me what "compatibility" means here? I thought cfgrib was a pretty thin wrapper around eccodes. |
As far as I understand GRIB (I'm really bad at this), GRIB doesn't know about dimensions and coordinates which are shared between arrays. GRIB files consist of messages (which are chunks + per chunk metadata) and nothing shared by those messages. |
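To make the point about independent messages concrete, here is a minimal sketch of the kind of guesswork involved. It assumes each message has already been reduced to a small metadata dict; the keys (`shortName`, `level`, `step`) and the `offset`/`length` values are illustrative only, and `guess_variables` is a hypothetical helper, not something cfgrib or kerchunk provides.

```python
from collections import defaultdict

# Hypothetical per-message metadata, as a scanner might extract it.
# Nothing here is shared between messages; every entry stands alone.
messages = [
    {"shortName": "t", "level": 850, "step": 0, "offset": 0, "length": 100},
    {"shortName": "t", "level": 500, "step": 0, "offset": 100, "length": 100},
    {"shortName": "t", "level": 850, "step": 6, "offset": 200, "length": 100},
    {"shortName": "t", "level": 500, "step": 6, "offset": 300, "length": 100},
    {"shortName": "u", "level": 850, "step": 0, "offset": 400, "length": 100},
]


def guess_variables(messages, dims=("step", "level")):
    """Group messages by variable name and derive shared coordinates.

    The choice of `dims` is exactly the guesswork discussed above:
    a different choice yields a differently shaped dataset.
    """
    by_var = defaultdict(list)
    for msg in messages:
        by_var[msg["shortName"]].append(msg)

    variables = {}
    for name, msgs in by_var.items():
        # Coordinates are the sorted set of values seen across the messages.
        coords = {d: sorted({m[d] for m in msgs}) for d in dims}
        # Each message becomes one chunk, addressed by its coordinate indices.
        chunks = {}
        for m in msgs:
            index = tuple(coords[d].index(m[d]) for d in dims)
            chunks[index] = (m["offset"], m["length"])
        variables[name] = {"coords": coords, "chunks": chunks}
    return variables


variables = guess_variables(messages)
```

Here `t` ends up with two steps and two levels while `u` has only one of each, so the two variables do not share a level coordinate. That is the kind of mismatch that forces a tool either to split the file into several datasets or to apply some convention of its own.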
As always with guessing, there are multiple options on how you might want to do this and which kind of conventions are to be followed, so when rolling this guesswork on your own, you might end up with something different. |
That matches what I mean by compatibility too. The output of kerchunking a GRIB file should be a list of datasets equal to what you get from |
From working previously on GRIBs, I also want to add that for some files you cannot use open_datasets without supplying appropriate filters, because of coordinate mismatches between messages. |
Do you mean |
Yes, single |
We've been working a bit more on our |
Magician?? :) Do you intend to integrate any of the work into or even replace grib2 in this repo? Do you have any comments on how it compares with @TomAugspurger 's Note that with the latest version of numcodecs, you no longer need to import and register your codec, but can instead include install entrypoints. |
:-) yes, we call the customization points a Magician because that's the part where users have to put their guesswork of how to assemble datasets to "magically" stuff the GRIB messages together. That's also the biggest difference to The latest version of numcodecs isn't released yet... We've got the entrypoints set up, but they don't yet work 😬 |
Currently it works for some GRIBs, but is not really stable yet and we need to gain more experience... Thus we thought it might need a little time before we really want it in kerchunk. |
@TomAugspurger , I'm sure your thoughts on gribscan would be appreciated, if you have the time to look. |
The magician looks quite a lot like what happens in MultiZarrToZarr - if each of the messages of a grib were made into independent datasets, and combined with that, then maybe you wouldn't need your own mages. Sorry, sorcerers, ... er magicians. |
Probably it would be possible to try to stuff some of the magicians into something like the |
Initially we've had a design which built one dataset per grib-file and then put all of them into MultiZarrToZarr. We moved away from that design, because we needed something which looks at the collection of all individual messages. But we didn't come up with the idea to make datasets out of each individual message. |
I've been playing around a bit with reading GRIB files, but was quickly hit by the performance impact of the temporary files created by kerchunk/grib2.py, so I tried to find ways around this. As far as I understand so far, cfgrib requires access to entire files and also requires some file API, while eccodes is happy with in-memory GRIB messages as well. So I tried to read GRIB files using mostly eccodes and circumventing cfgrib where possible. This is orders of magnitude faster than the current method implemented in kerchunk but, sadly, it doesn't do all the magic cfgrib does in assembling proper datasets in all cases. This lack of generality is the reason why I'm not proposing a PR (yet?), but rather seeking further ideas on the topic:
Here's how I'd implement the "decompression", which I believe is relatively generic (but may still be incompatible with what the current kerchunk-grib does):
this gist shows how it may be possible to scan GRIB files without the need for temporary files
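For readers without access to the gist, the following is a minimal sketch of one way message boundaries could be located in a byte buffer without temporary files. It is not the code from the gist. It relies only on the GRIB indicator section: the `GRIB` magic, the edition number in octet 8, the total message length (octets 9-16 in edition 2, octets 5-7 in edition 1), and the trailing `7777`. The names `scan_messages` and `fake_grib2` are hypothetical, and the sketch does not handle the non-standard length encoding some producers use for very large edition 1 messages.

```python
import struct


def scan_messages(buf):
    """Yield (offset, length, edition) for each GRIB message found in buf."""
    pos = 0
    while True:
        start = buf.find(b"GRIB", pos)
        if start < 0 or start + 16 > len(buf):
            return
        edition = buf[start + 7]
        if edition == 2:
            # Edition 2: total length is a big-endian 64-bit integer.
            (length,) = struct.unpack_from(">Q", buf, start + 8)
        elif edition == 1:
            # Edition 1: total length is a big-endian 24-bit integer.
            length = int.from_bytes(buf[start + 4:start + 7], "big")
        else:
            # "GRIB" occurred by chance inside other data; keep searching.
            pos = start + 4
            continue
        end = start + length
        # A valid message must end with the "7777" end marker.
        if length < 16 or buf[end - 4:end] != b"7777":
            pos = start + 4
            continue
        yield start, length, edition
        pos = end


def fake_grib2(payload):
    """Build a synthetic edition-2 envelope around payload (for testing only)."""
    total = 16 + len(payload) + 4
    header = b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02" + struct.pack(">Q", total)
    return header + payload + b"7777"


buf = b"junk" + fake_grib2(b"a" * 10) + fake_grib2(b"b" * 20)
found = list(scan_messages(buf))
```

Each `(offset, length)` pair is the kind of byte range a kerchunk reference would point at. Decoding the values inside a message would still need something like eccodes, which is not shown here.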