Lazily reading GRIB from cloud object storage? #52
Comments
I can give two answers here, outlining how it currently works and how it could be improved, in the hope that this sparks some ideas and you can tell me if it doesn't fit your use case. Currently, the core API is that you feed in a byte array, and the Rust Message struct scans and gathers the GRIB sections, then uses those sections to do things. So the way I use it with cloud object storage is that I store the offsets and metadata provided by the IDX sidecars; then, when I need the data, I pull down only the byte range I need for a given message, parse that single message, and extract the data. This is not truly lazy, because it expects all of the data from the message and does not implement the reader interface like it should, but it keeps everything straightforward for now. I can imagine an API where you stream data and the sections are loaded on demand, or where only the sections relevant to the data need to be downloaded. All of that is to say that, in Python, this package currently works with xarray and kerchunk, and will work fine if only the bytes for a single message in a given GRIB file are provided. If you want to explain your use case and whether you think this can be improved, that would be useful. Thanks for reaching out!
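As a rough sketch of the offset-based approach described above (not gribberish's own API): the snippet below parses a wgrib2-style IDX sidecar into per-message byte ranges and builds the HTTP `Range` header needed to fetch just one message. The `IdxEntry`, `parse_idx` and `range_header` names and the sample inventory values are made up for illustration; the fetched bytes would then be handed to whatever GRIB parser you use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IdxEntry:
    number: int          # message number within the GRIB file
    start: int           # byte offset where the message begins
    end: Optional[int]   # last byte (inclusive); None for the final message
    variable: str
    level: str


def parse_idx(text: str) -> List[IdxEntry]:
    """Parse wgrib2-style inventory lines: number:offset:d=date:var:level:..."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split(":")
        rows.append((int(parts[0]), int(parts[1]), parts[3], parts[4]))
    entries = []
    for i, (num, start, var, level) in enumerate(rows):
        # A message ends one byte before the next message starts.
        end = rows[i + 1][1] - 1 if i + 1 < len(rows) else None
        entries.append(IdxEntry(num, start, end, var, level))
    return entries


def range_header(entry: IdxEntry) -> dict:
    """Build an HTTP Range header covering exactly one message."""
    end = "" if entry.end is None else str(entry.end)
    return {"Range": f"bytes={entry.start}-{end}"}


# Illustrative inventory only; these offsets are not from a real file.
SAMPLE_IDX = """\
1:0:d=2024010100:HGT:10 mb:anl:ENS=low-res ctl
2:50316:d=2024010100:TMP:10 mb:anl:ENS=low-res ctl
3:76120:d=2024010100:UGRD:10 mb:anl:ENS=low-res ctl
"""

entries = parse_idx(SAMPLE_IDX)
tmp = next(e for e in entries if e.variable == "TMP")
print(range_header(tmp))  # {'Range': 'bytes=50316-76119'}
```

The last message in a file has no following offset, so its range is left open-ended (`bytes=76120-`), which asks the server for everything up to the end of the object.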
Thanks loads for the quick & detailed reply!
That sounds good to me! (for now, at least)

My use-case

To zoom all the way out: For the last 5 years, we've been trying to train large ML models on NWP and satellite data at the non-profit that I co-founded, Open Climate Fix. To train our ML models, we need to feed thousands of ML training examples into the ML model per second. Each example is typically a fixed-size crop of NWP and satellite data. For example, we might take a crop with shape

My "dream" is to be able to train ML models directly from NWPs already on cloud object storage. And to do so as efficiently as possible (in terms of memory & CPU & network utilisation), so we have enough CPU cycles left over to do some simple transforms of the data on-the-fly.

As a data user, I'd like to be able to lazily open the entire NODD GEFS dataset with

Are existing tools (kerchunk + gribberish + xarray) already capable of this? (I must admit that I haven't tried yet!) If not, I'm excited to help make this a reality, and I'd love advice on how best to help (I'm comfortable in Rust & Python).

(BTW, here's a draft blog post which goes into more detail. But this blog post will almost certainly change! I'm still learning about recent developments in this field!)
So here is an example notebook showing how to use gribberish to create a kerchunked GEFS dataset that you can lazy-load with xarray and gribberish. I think this hits the basics of what you are looking for. Notably, this does not operate on the IDX files, though David's work on those workflows should be mostly compatible with gribberish. Let me know if you have any questions!
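For anyone unfamiliar with kerchunk, here is a minimal sketch of what a version 1 reference set looks like: each chunk key maps to a `[url, offset, length]` triple, so a reader only fetches those bytes. The `build_refs` helper, the URL, the variable names and the offsets are all hypothetical, and a real reference set also needs `.zarray` metadata with a codec that decodes GRIB bytes, which is left out here.

```python
import json
from typing import Dict, List, Tuple


def build_refs(url: str, messages: List[Tuple[str, int, int]]) -> Dict[str, object]:
    """Map one chunk key per GRIB message to [url, offset, length].

    `messages` holds (variable_name, byte_offset, byte_length) tuples.
    """
    refs: Dict[str, object] = {
        # Zarr group metadata is stored inline as a JSON string.
        ".zgroup": json.dumps({"zarr_format": 2}),
    }
    for name, offset, length in messages:
        # "0.0" is the key of the single chunk of a 2D array.
        refs[f"{name}/0.0"] = [url, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical object URL and byte ranges, for illustration only.
URL = "s3://example-bucket/gefs/example.grib2"
refs = build_refs(URL, [("hgt", 0, 50316), ("tmp", 50316, 25804)])
print(json.dumps(refs["refs"]["tmp/0.0"]))
```

Because the references are plain JSON, they can be generated once and reused, and opening the dataset later does not require scanning the GRIB files again.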
Awesome, thank you! Do you have a feel for how "near-optimal" this existing solution is? Does this solution achieve a throughput that's close to the hardware's theoretical max throughput? (No worries if not! I will do some experiments myself!)
In my experience, the worst part of this whole process is dealing with xarray not having async controls, but if you have an optimized dask pipeline you can overcome that. Zarr 3 will be a big deal for this. Beyond that I'm not totally sure; I mostly deal with building web services, so a lot of my performance knowledge is biased toward building those systems.
Hi! gribberish looks great! Please may I ask a naive question: Is it possible to lazily read a GRIB directly from cloud object storage? For example, if I only want to read a single message from a GRIB file that contains many messages? (If that question even makes sense! I'm quite new to the inner workings of GRIB!)