Confusion about the use of dask/xarray in this project #25
It's a good series of questions to which I only have under-articulated answers! The inclusion of […] And you're also spot on about arbitrary image sizes meaning that we're passing them through the model one at a time, not in batches. (I don't think they're completely arbitrary, more likely a small set of sizes, but I'm not certain. This starts to veer into project/dataset-specific needs again, if we're trying to take a radical approach to prioritising any technical work by its reuse value.)
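As a side note on the batching point: here's a minimal sketch of why differing image sizes force one-at-a-time inference. The model here is a hypothetical stand-in, not anything from this project; only the tensor shapes matter.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model; only the shapes matter here.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 10))

imgs = [torch.rand(3, 224, 224), torch.rand(3, 200, 180)]  # two different sizes

# torch.stack(imgs) would raise a RuntimeError, because every tensor in a
# batch must share one shape - hence one image per forward pass:
preds = [model(img.unsqueeze(0)) for img in imgs]
```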
Hey y'all, I can attempt to answer some of these questions from my own experience of using xarray/dask.

The original idea behind xarray, as far as I understand it, was to enable metadata-aware calculations on netCDF data. The most basic example is selecting out datapoints based on coordinates such as time/lat/lon, instead of relying on the manually constructed indices you would need with the original netcdf4 Python library. It's expanded a whole heap since then, with support and add-ons for different data types, processing backends such as dask, and convenience functions.

My 'journey' with it has been similar: I started off using it as just a more convenient way of processing netCDF data, then when the data got too big to fit in memory, started exploring/using its dask capability. You are able to use dask naively, and given that xarray invokes it automatically in many contexts, this is often what happens. I saw 'dask can help you process larger-than-memory data' and used this off-the-shelf xarray implementation to process larger-than-memory data down to something that fit in memory or could be saved to disk. This worked fine for simple workflows, but for anything complicated you really need to start thinking carefully about how to implement the parallelism dask offers, or you end up with a workflow that in the best case takes far longer to complete than it needs to, and in the worst grinds to a halt completely.

I learnt this the hard way when, among other things, trying to rechunk data for my object storage work (for which I eventually found a better solution). For reasons that are now lost to me, certain rechunking operations were a particular fail case of dask: perhaps when a full reshuffle of the data was required, the way dask constructs the task graph (or 'DAG', I believe, though I've always found the term unnecessarily jargony) meant that it tried to read the whole dataset into memory at once, defeating the point of using it. There are many a Pangeo Discourse forum and GitHub issue thread on this, which I can probably dig out if of interest. I believe this issue has largely been fixed now, but I'd learnt my lesson not to trust the automatic task graph that xarray-with-dask constructs if you don't manually step in.

I think the TL;DR of all that is that you shouldn't just think 'oh, my data is now bigger than memory, I'll invoke the dask backend of xarray and hope for the best'. It might work, but there's a good chance it won't. (There's a short sketch of these usage patterns after the answers below.)

So, to actually answer your questions:
1. Yes.
2. Yes.
3. Probably, if you don't mind doing it and are not (and won't be) making use of any of the other features of xarray.
4. This is a good use case for xarray/dask; I have some small example notebooks somewhere on datalabs which I could share with you if useful.
5. For me, and speaking generally, it's to at minimum 'allow' researchers to access and analyse large datasets easily, and ideally to provide a semi-standardised workflow with some boilerplate code that abstracts away some of the complications of accessing and analysing large datasets. That's sort of my aim with my work in FDRI, anyway.
6. Knowing very little about this project I'm not certain, but my hunch would be that it's still possible, though it would require some thought to chunk/structure the data sensibly.
7. This is ultimately the trade-off with S3, right: tolerate the longer latency of having the data stored remotely instead of on local disk, to gain the scalability/parallelism and the ability to store and analyse gargantuan datasets. As you're hinting at, it's not always the appropriate solution; at the extreme end, for analysing data that fits perfectly fine in memory, it's definitely overkill.

Hope some of that semi-structured ramble was useful!!
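For concreteness, here is a minimal sketch of the patterns described above. The file name, variable name, bucket, and chunk sizes are all illustrative assumptions, not anything from this project:

```python
import xarray as xr

# 1. Metadata-aware selection: pick out data by coordinate values
#    rather than hand-computed integer indices.
ds = xr.open_dataset("temperature.nc")  # hypothetical netCDF with (time, lat, lon) dims
subset = ds["tas"].sel(
    time=slice("2000-01-01", "2000-12-31"),
    lat=slice(50, 60),
)

# 2. Larger-than-memory processing: passing `chunks` makes xarray back
#    the arrays with dask. Operations then build a lazy task graph (the
#    'DAG'); nothing is read until .compute() or a save triggers it.
ds_lazy = xr.open_dataset("temperature.nc", chunks={"time": 365})
annual_mean = ds_lazy["tas"].groupby("time.year").mean("time")
result = annual_mean.compute()  # this is where memory use actually bites

# 3. Rechunking: the kind of operation that could historically blow up
#    memory when the task graph implied a full shuffle of the data.
rechunked = ds_lazy["tas"].chunk({"time": -1, "lat": 100, "lon": 100})

# 4. The S3 trade-off from answer 7: the same lazy pattern works against
#    object storage, e.g. a (hypothetical) public Zarr store via s3fs:
# ds_s3 = xr.open_zarr("s3://example-bucket/data.zarr", storage_options={"anon": True})
```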
Thank you both! So it comes down to whether we are mainly using this project as a way to develop a more general pipeline that could be dropped in elsewhere with minimal modification, or whether we are focused on reaching the best solution for the specific problem posed by the plankton project?

Edit: that was intended to read as a question rather than a statement! (added ?)
Sounds about right - I'm interested to see where you take this, so keep me posted :)
Thank you both for the engaged and helpful commentary here! This small experimental project definitely doesn't have "scaling up big" as one of its intrinsic aims.

Following on from this comment, I've now completely removed the xarray/dask/intake components that came along wholesale when we tried reusing work from scivision, in #36. There's less to it and it's more readable now.

In my previous gig, the folks from Nvidia/MS were heavily pushing Triton Inference Server. I never dug into it, but it might be worth visiting for a future effort to "BioCLIP all the things" or for some of the LLM track of work that @matthewcoole is on. It might equally be "buy more GPUs from us" marketing material, so this isn't a recommendation!

Planning to close this once #36 is completed.
I am somewhat aware of the 'headline' uses of `dask` and `xarray` but have never used them myself, so forgive me for being a little confused.

First, I think I'm correct in saying that at present there is nothing implemented that would suffer at all from using pandas/numpy instead. So my simple mind is thinking: why not just use the simpler tools up until the point where they are no longer enough?

I note that the majority of current dask/xarray use originates from using `intake-xarray` to load the data, which in turn comes from `scivision`, so I don't know if that's the only reason it's there, or if I'm just not understanding what the near future looks like and how dask/xarray will help.

So I have some questions related to xarray...

- […] `.tif` files?
- […] `xarray` so that metadata can be 'carried around' with the image?
- […] `Dataset` that serves the same purpose?
- […] `xarray`?

and related to dask...

[…]
To be clear, I would expect dask/xarray to potentially make life much simpler down the line; I just want to get up to speed with others in understanding precisely how.
@mattjbr123 your wisdom would be very gratefully received!