Serialising a File will also serialise the cache which can grow very large #1747
Comments
You are right, AbstractBufferedFile could do with a `__reduce__`. Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state; these are what should really be passed around. Additionally, I don't really understand exactly what you are pickling: the xarray object itself? I don't know that such a use is really anticipated.
No, not the xarray objects themselves. Xarray extracts some metadata for netCDF files from the opened file, and then this object is put into the graph (that happens in open_mfdataset, fwiw). This causes the cache to be populated. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
These don't appear to fit the open_mfdataset pattern in the OP.
Would someone like to write a reasonable `__reduce__` function for AbstractBufferedFile?
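A minimal sketch of what such a `__reduce__` could look like, using a hypothetical stand-in class (`BufferedFileLike`) rather than the real `AbstractBufferedFile`: reconstruct the file from its constructor arguments only, so the cache never travels with the pickle and is repopulated lazily on the other side.

```python
import pickle


class BufferedFileLike:
    """Hypothetical stand-in for AbstractBufferedFile, for illustration only."""

    def __init__(self, fs, path, mode="rb", blocksize=5 * 2**20):
        self.fs = fs
        self.path = path
        self.mode = mode
        self.blocksize = blocksize
        self.cache = bytearray()  # stands in for the read-ahead cache

    def __reduce__(self):
        # Rebuild from constructor arguments only; the cache is dropped and
        # will be repopulated lazily on the first read after unpickling.
        return type(self), (self.fs, self.path, self.mode, self.blocksize)


f = BufferedFileLike(fs=None, path="bucket/data.nc")
f.cache.extend(b"x" * 10_000_000)  # simulate a populated cache
g = pickle.loads(pickle.dumps(f))  # round trip: g.cache is empty again
```

An fsspec version would presumably reconstruct via the filesystem, path, and open arguments; the sketch only shows the shape of the idea, not the real signature.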
Yep, I'll take a look at this.
Btw, just following up here. If I add a small computation like `ds.hurs.mean(dim=["lon", "lat"]).compute()` to @phofl's original example, the graph is now much smaller with the changes in #1753. With the latest …
We've run into this when using Xarray together with Dask. The default pattern at the moment opens the files through fsspec and passes the file objects into open_mfdataset, which puts them into the task graph.
The files are accessed initially to get the metadata before they are serialised, and this initial access populates the cache with a lot of data. In our case this produced a very large cache for 4 of the 130 files, which is pretty bad when serialising things.
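The effect can be illustrated with a toy class (hypothetical, not the real fsspec code): once the read-ahead cache holds data, the pickled payload grows by roughly the cache size, which is what blows up the task graph.

```python
import pickle


class CachedFile:
    """Hypothetical toy mimicking a buffered file with a read-ahead cache."""

    def __init__(self, path):
        self.path = path
        self.cache = b""

    def read(self, n):
        # Simulate a read that pulls a whole block into the cache.
        self.cache = b"\x00" * (5 * 2**20)  # 5 MiB read-ahead block
        return self.cache[:n]


f = CachedFile("bucket/data.nc")
empty = len(pickle.dumps(f))  # small: essentially just the path
f.read(100)                   # metadata access populates the cache
full = len(pickle.dumps(f))   # now carries the entire cached block
```

Here a 100-byte read inflates the pickle by 5 MiB; with a few such files per graph, serialisation cost grows quickly.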
Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it rather something that hasn't been a problem so far?
Ideally, we could purge the cache before things are serialised, in fsspec and in the libraries that inherit from it. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.
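One way such purging could look, as a sketch of the general pickle-hook pattern rather than fsspec's actual API: drop the cache in `__getstate__` so it never enters the serialised state, and come back with an empty cache on the other side.

```python
import pickle


class PurgingFile:
    """Hypothetical sketch: purge the cache via the pickle state hooks."""

    def __init__(self, path):
        self.path = path
        self.cache = bytearray(b"\xff" * 1024)  # pretend cached data

    def __getstate__(self):
        state = self.__dict__.copy()
        state["cache"] = bytearray()  # purge before serialisation
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # fresh, empty cache after unpickling


f = PurgingFile("bucket/data.nc")
g = pickle.loads(pickle.dumps(f))  # g.path survives, g.cache is empty
```

Compared with a `__reduce__`, this keeps the rest of the instance state intact and only strips the cache, which may matter if subclasses carry extra attributes.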