potential performance improvements for GRIB files #127

Open
d70-t opened this issue Feb 21, 2022 · 18 comments
@d70-t

d70-t commented Feb 21, 2022

I've been playing around a bit with reading GRIB files, but quickly hit the performance impact of the temporary files created by kerchunk/grib2.py, so I tried to find ways around this. As far as I understand, cfgrib requires access to entire files through a file-like API, while eccodes is happy with in-memory GRIB messages as well. So I tried to read GRIB files using mostly eccodes, circumventing cfgrib where possible. This is orders of magnitude faster than the current method implemented in kerchunk, but sadly it doesn't do all the magic cfgrib does in assembling proper datasets in all cases. This lack of generality is why I'm not proposing a PR (yet?), but rather seeking further ideas on the topic:

  • Do others work on this as well?
  • Do you have ideas on how to do the dataset assembly more generically?

Here's how I'd implement the "decompression", which I believe is relatively generic (but may still be incompatible with what the current kerchunk-grib does):

import eccodes
import numcodecs.abc
from numcodecs.compat import ndarray_copy

class RawGribCodec(numcodecs.abc.Codec):
    """Numcodecs codec that stores raw GRIB messages and decodes them with eccodes."""

    codec_id = "rawgrib"

    def encode(self, buf):
        # The chunk already is a complete GRIB message; store it unchanged.
        return buf

    def decode(self, buf, out=None):
        # Build an in-memory eccodes handle from the message bytes and pull
        # out the decoded values; no file on disk is involved.
        mid = eccodes.codes_new_from_message(bytes(buf))
        try:
            data = eccodes.codes_get_array(mid, "values")
        finally:
            eccodes.codes_release(mid)

        # Some eccodes versions return a lazy wrapper instead of an ndarray.
        if hasattr(data, "build_array"):
            data = data.build_array()

        if out is not None:
            return ndarray_copy(data, out)
        return data

This gist shows how it may be possible to scan GRIB files without the need for temporary files.
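
For illustration, here's a minimal sketch (my own, not the gist's code) of locating message offsets without a temporary file, assuming edition-2 files only: each message starts with the magic bytes b"GRIB", and for edition 2 the total message length is stored as a big-endian 64-bit integer in bytes 8-15 of the indicator section.

import struct

def scan_grib2(f):
    """Yield (offset, length) for each message in a binary file-like object."""
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(16)  # the GRIB2 indicator section is 16 bytes
        if len(header) < 16:
            return  # end of file
        if header[:4] != b"GRIB":
            raise ValueError(f"no GRIB magic at offset {offset}")
        if header[7] != 2:
            raise ValueError("this sketch only handles GRIB edition 2")
        (length,) = struct.unpack(">Q", header[8:16])
        yield offset, length
        offset += length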

@martindurant
Member

@TomAugspurger , can you please link here your "cogrib" experiments? It indeed scans files without first downloading them, and we plan to upstream it here. I'm not sure if it answers all your points, @d70-t , perhaps you have gone further.

Aside from this, it should also be possible to peek into the binary description of the data and directly find the buffers representing the main array of each message. This assumes we can understand the encoding, which is very likely a yes (see the sketch after the list below). This would allow:

  • somewhat smaller downloads on read (the main array normally dominates a message's size)
  • no need to call cfgrib (or eccodes) to interpret the array and no need to create the codec. We may need a different codec, depending on how the array is actually encoded.
  • no creation of coordinate arrays for every message read. This is pretty fast, but can cause a big memory spike in eccodes and is wholly redundant
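
To make the idea concrete, here is a rough sketch (an assumption about the layout, not existing kerchunk code) of walking a GRIB2 message's sections to find the data section (section 7), which holds the packed main array. After the 16-byte indicator, every section starts with a 4-byte big-endian length and a 1-byte section number, and the message ends with the 4-byte "7777" terminator:

import struct

def find_data_section(message: bytes):
    """Return (offset, length) of the packed data within one GRIB2 message."""
    pos = 16  # skip the indicator section
    while message[pos:pos + 4] != b"7777":
        (sec_len,) = struct.unpack(">I", message[pos:pos + 4])
        if message[pos + 4] == 7:  # section 7 is the Data Section
            # everything after the 5-byte section header is packed values
            return pos + 5, sec_len - 5
        pos += sec_len
    raise ValueError("no data section found")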

@TomAugspurger
Contributor

That's at https://github.com/TomAugspurger/cogrib.

It indeed scans files without first downloading them, and we plan to upstream it here.

That cogrib experiment does need to download the whole file when it's being "kerchunked". Users accessing it through fsspec's reference filesystem don't need to download it, and there's no need for a temporary file.

It'd be nice to avoid the temporary file for scanning too, but one of my desires was to match the output of cfgrib.

@d70-t
Author

d70-t commented Feb 21, 2022

cogrib looks very nice 👍

And yes, the issue with cfgrib compatibility is what bothers me most in my current attempt as well (I chose to drop compatibility for speed). I'd really hope we can figure out a way to get both: no temporary files and cfgrib compatibility.

@martindurant
Member

Actually, can you please enlighten me what "compatibility" means here? I thought cfgrib was a pretty thin wrapper around eccodes.

@d70-t
Author

d70-t commented Feb 21, 2022

As far as I understand GRIB (I'm really bad at this), GRIB doesn't know about dimensions and coordinates shared between arrays. GRIB files consist of messages (chunks plus per-chunk metadata), with nothing shared between those messages. cfgrib guesses how to assemble those messages into a Dataset based on what it finds among the per-message metadata.
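
For example (a small illustration with a hypothetical file name, using only plain eccodes calls), the per-message metadata that this guessing is based on can be listed like so:

import eccodes

with open("example.grib2", "rb") as f:
    while (mid := eccodes.codes_grib_new_from_file(f)) is not None:
        # each message carries its own keys; nothing is shared between them
        print(
            eccodes.codes_get(mid, "shortName"),
            eccodes.codes_get(mid, "typeOfLevel"),
            eccodes.codes_get(mid, "level"),
            eccodes.codes_get(mid, "validityDate"),
        )
        eccodes.codes_release(mid)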

@d70-t
Author

d70-t commented Feb 21, 2022

As always with guessing, there are multiple options for how you might want to do this and which conventions should be followed, so when rolling this guesswork on your own, you might end up with something different.

@TomAugspurger
Contributor

That matches what I mean by compatibility too. The output of kerchunking a GRIB file should be a list of datasets equal to what you get from cfgrib.open_datasets(file). I'll want to stretch that definition a bit to handle concatenating data from many GRIB files along time, but the basic idea is that I don't want to guess how to assemble messages into datasets.
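
For concreteness, the compatibility target looks like this (file name hypothetical); cfgrib groups the messages into however many datasets are needed for consistent coordinates:

import cfgrib

datasets = cfgrib.open_datasets("example.grib2")
for ds in datasets:
    print(list(ds.data_vars), dict(ds.dims))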

@martindurant
Member

martindurant commented Feb 21, 2022

From working on GRIBs previously, I also want to add that for some files you cannot use open_datasets without supplying appropriate filters, because of coordinate mismatches between messages.
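
For example, a filter restricting cfgrib to messages that share one set of coordinates (file name and filter values are hypothetical):

import xarray as xr

ds = xr.open_dataset(
    "example.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)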

@TomAugspurger
Contributor

Do you mean open_datasets (plural) or open_dataset (singular)? I don't think I've run into files where open_datasets fails, but I haven't tried on too many different types of files.

@martindurant
Member

Yes, singular.

@d70-t
Author

d70-t commented May 19, 2022

We've been working a bit more on our gribscan, which is now also available at gribscan/gribscan. It's still very fragile, deliberately doesn't care about being compatible with the output of cfgrib, and may require users to implement their own Magician.

@martindurant
Member

Magician?? :)

Do you intend to integrate any of the work into or even replace grib2 in this repo? Do you have any comments on how it compares with @TomAugspurger 's cogrib?

Note that with the latest version of numcodecs, you no longer need to import and register your codec, but can instead include install entrypoints.

@d70-t
Author

d70-t commented May 19, 2022

:-) Yes, we call the customization points a Magician because that's where users put their guesswork about how to assemble datasets, to "magically" stitch the GRIB messages together.

That's also the biggest difference from cogrib: we don't try to have a universal tool that makes some dataset out of almost any GRIB. Instead we require customization to make the resulting dataset nicer, under the assumption that someone involved in creating the initial GRIBs can put valuable knowledge into it.

The latest version of numcodecs isn't released yet... We've got the entrypoints set up, but they don't work yet 😬

@d70-t
Author

d70-t commented May 19, 2022

Currently it works for some GRIBs, but it isn't really stable yet and we need to gain more experience... That's why we thought it might need a little more time before we really want it in kerchunk.

@martindurant
Member

@TomAugspurger , I'm sure your thoughts on gribscan would be appreciated, if you have the time to look.

@martindurant
Member

The Magician looks quite a lot like what happens in MultiZarrToZarr: if each of the messages of a GRIB were made into an independent dataset and combined with that, then maybe you wouldn't need your own mages. Sorry, sorcerers, ... er, magicians.

@d70-t
Author

d70-t commented May 19, 2022

It would probably be possible to stuff some of the magicians into something like the coo_map... I'll have to think more about that.
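
Something along these lines, perhaps (a rough sketch of assumed usage, with hypothetical per-message references and coordinate mappings):

from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    per_message_refs,  # hypothetical: one reference set per GRIB message
    concat_dims=["time", "level"],
    coo_map={"time": "cf:time", "level": "attr:level"},  # assumed mappings
)
combined = mzz.translate()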

@d70-t
Author

d70-t commented May 19, 2022

Initially we had a design that built one dataset per GRIB file and then put all of them into MultiZarrToZarr. We moved away from that design because we needed something that looks at the collection of all individual messages. But we hadn't come up with the idea of making datasets out of each individual message.
