Sparse array approach #23
Conversation
Codecov Report
@@ Coverage Diff @@
## main #23 +/- ##
===========================================
+ Coverage 69.40% 93.60% +24.20%
===========================================
Files 5 7 +2
Lines 134 219 +85
===========================================
+ Hits 93 205 +112
+ Misses 41 14 -27
Continue to review full report at Codecov.
@bjlittle the greetings task of the CI seems to fail and I have no idea why. Have you seen this problem before?
This looks interesting, I'll have to have a play around with sparse to see what it can do here.

My first concern is that I'll probably want to re-integrate and test against some level of back-end unstructured mesh support before changing too much about the regridding mechanics. Part of my thinking when creating the …

Secondly, I'm curious what advantages sparse gives us with respect to dask. I have seen some work in Iris to give our regridders more dask support (SciTools/iris#3701), and the Iris regridders are using scipy.sparse matrices. Is there anything sparse would give us that would go beyond this? Perhaps it would be worth comparing performance eventually.
Play away 😄

To your first concern, I did have that use case in mind, and I think it shouldn't matter. Suppose the source data comes in as a 1d vector, but we still want a 2d output. Then the weights we get back from ESMF will have three indices (two for the target grid, one for the source). What about 3D timeseries? The point is that all those right-hand sides amount to one and the same contraction: a single tensordot over the source dimension.

To your second point, first a point of terminology: as far as I understand, scipy.sparse only offers matrices, i.e. exactly two-dimensional objects. When speaking of arrays, we generally think of arbitrary-dimensional objects. That is precisely what sparse offers us in a, well, sparse way. That's also why it interacts nicely with dask: it's straightforward to have a dask array where the chunks are sparse arrays instead of numpy arrays. This doesn't work so nicely with (sparse) matrices because neither slicing nor stacking are clear when the dimensionality changes. Moreover, the semantics of matrices differ from those of ndarrays (see, e.g., multiplication, which is matrix multiplication for matrices but elementwise for ndarrays). If we put the weights into a dask sparse array using sparse, we simply call tensordot. This also seems to be the recommended approach to sparse arrays in dask, see here.
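To make the shape argument concrete, here is a minimal sketch (not code from this PR; the sizes, density, and chunking are made up) of a three-index sparse weights array contracting with right-hand sides of different dimensionality, and of wrapping the same weights in a dask array with sparse chunks:

```python
import dask.array as da
import numpy as np
import sparse

n_src, ny, nx = 1000, 25, 20  # 1d source, 2d target (hypothetical sizes)

# Hypothetical weights with three indices (target_y, target_x, source);
# real weights would come from ESMF rather than being random.
weights = sparse.random((ny, nx, n_src), density=0.001, random_state=0)

data_1d = np.ones(n_src)            # a single source field
data_3d = np.ones((n_src, 12, 10))  # e.g. a (space, time, level) stack

# One and the same contraction over the shared source axis handles both.
out_2d = sparse.tensordot(weights, data_1d, axes=([2], [0]))  # (ny, nx)
out_4d = sparse.tensordot(weights, data_3d, axes=([2], [0]))  # (ny, nx, 12, 10)

# Because sparse arrays duck-type as numpy arrays, they can also serve as
# the chunks of a dask array, and da.tensordot is called the same way.
d_weights = da.from_array(weights, chunks=(ny, nx, n_src))
d_data = da.from_array(data_3d, chunks=(n_src, 6, 10))
d_out = da.tensordot(d_weights, d_data, axes=([2], [0]))
```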
I've done a bit of experimenting with sparse/dask to see how well tensordot would work with chunked arrays. With the latest version of dask, code along the lines of the sketch below behaves as though there is a bug, probably with sparse. I've done some more experimenting, and it's worth noting that this bug does not seem to happen when only one of the dimensions of the sparse array is chunked. It's also worth noting that the bug also goes away if the chunked sparse matrix is constructed in a different way (one possible construction is shown in the sketch below), though I still have concerns about performance.
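Since the original snippets were not preserved here, the following is a hedged reconstruction of the kind of experiment described (array sizes and chunk shapes are guesses):

```python
import dask.array as da
import sparse

weights = sparse.random((500, 1000), density=0.01, random_state=0)
data = da.ones((1000, 20), chunks=(250, 20))

# Both dimensions of the sparse array chunked: the case that hit the bug.
w_both = da.from_array(weights, chunks=(250, 250))

# Only one dimension chunked: reported to work.
w_one = da.from_array(weights, chunks=(500, 250))

# Alternative construction that also avoids the bug: chunk a dense dask
# array first, then convert each chunk to sparse.
w_mapped = da.from_array(weights.todense(), chunks=(250, 250)).map_blocks(
    sparse.COO.from_numpy
)

result = da.tensordot(w_one, data, axes=([1], [0]))
print(result.compute().shape)  # (500, 20)
```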
I've raised an issue with dask about the above problem: dask/dask#6907.
Thanks for the testing! The apparent bug does concern me, though my own testing suggests that it has nothing to do with sparse and is a rather strange problem. Notably, replacing the line with … (i.e. adding …).

The timings do not concern me. The numbers as presented here suffer from at least four problems:
I agree that the timings are by no means rigorous, but I still think it's something we should figure out before committing to either package. With that said, it is somewhat tricky knowing what type of calculations to measure before having a complete design.

My concern with sparse is that it may be using a fundamentally slower algorithm for the necessary calculations. The fundamental object in sparse is a COO array; scipy.sparse also has COO matrices, but recommends CSR/CSC matrices for certain calculations because they are inherently faster (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html). The sparse documentation suggests that for some operations it first translates to an appropriate scipy.sparse matrix, though this is not necessarily the case for tensordot (the documentation suggests this calculation had to be reimplemented, though the language is not entirely clear to me - https://sparse.pydata.org/en/stable/#design). My suspicion is that while it may be slightly trickier to work with, there may well be a viable design involving scipy.sparse with some level of dask support (using …).

I think it would be good to have a discussion at some point about how we plan to design sparse/dask integration and how best to compare the performance of those designs.
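For reference, a minimal illustration of the format distinction mentioned above: scipy.sparse builds matrices conveniently in COO form but recommends converting to CSR (or CSC) before doing products:

```python
import numpy as np
from scipy import sparse as sp

rows = np.array([0, 1, 2, 2])
cols = np.array([1, 0, 2, 0])
vals = np.array([0.5, 1.0, 0.25, 0.75])

# COO is convenient for construction from (row, col, value) triplets ...
w_coo = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# ... but CSR is the format scipy recommends for fast matrix products.
w_csr = w_coo.tocsr()

field = np.ones(3)
print(w_csr @ field)  # the regrid-style matrix-vector product
```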
Interestingly, the same PR in dask (dask/dask#6846) which seems to fix the above bug also looks like it will offer some level of dask integration for scipy.sparse.

I've also done a little more experimentation with the new dask code, with data slightly closer to our use cases, and it seems as though, as suggested, the performance of sparse and scipy.sparse becomes more similar. It also seems as though the chunking of the sparse matrix may be a significant factor: from what I can tell, ensuring the first dimension is unchunked has a large impact on performance (see the sketch below). It will still probably be good to nail down some standard array/chunk sizes for proper performance testing, but I'm feeling better about the performance of sparse.
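As a sketch of that chunking observation (shapes assumed), keeping the first dimension of the weights in a single chunk looks like:

```python
import dask.array as da
import sparse

weights = sparse.random((500, 1000), density=0.01, random_state=0)

# Chunk only along the source dimension; -1 keeps the whole first (target)
# dimension in one chunk, which seemed to help tensordot performance.
w = da.from_array(weights, chunks=(-1, 250))

data = da.ones((1000, 120), chunks=(250, 120))
result = da.tensordot(w, data, axes=([1], [0]))
```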
Move from scipy.sparse matrices to sparse arrays using the sparse library. This simplifies the index gymnastics a bit, avoids the special numpy matrix interface (cf. the note at the top of here), and should allow us to work with dask a bit more easily.
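As a rough, hypothetical illustration of the index gymnastics in question (not code from this PR): with a scipy.sparse matrix the N-dimensional payload has to be flattened to 2-D around the product and reshaped back, whereas a sparse array contracts with the data directly:

```python
import numpy as np
import sparse
from scipy import sparse as sp

w_matrix = sp.random(500, 1000, density=0.01, format="csr", random_state=0)
data = np.ones((1000, 12, 10))

# scipy.sparse route: flatten trailing dimensions, multiply, reshape back.
flat = data.reshape(1000, -1)
out = (w_matrix @ flat).reshape(500, 12, 10)

# sparse route: a single tensordot over the source axis, no reshaping.
w_array = sparse.COO.from_scipy_sparse(w_matrix)
out2 = sparse.tensordot(w_array, data, axes=([1], [0]))
# Guard in case the result comes back sparse rather than dense.
out2 = out2.todense() if isinstance(out2, sparse.COO) else out2

np.testing.assert_allclose(out, out2)  # the two routes agree
```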
Draft PR for now since it is built on #22 and should be seen as a proof of concept.