Conversation

zklaus commented Nov 23, 2020

Move from scipy sparse matrices to sparse arrays using the sparse library. This simplifies the index gymnastics a bit, avoids the special numpy matrix interface (cf. the note at the top of here), and should allow us to work with dask a bit more easily.

Draft PR for now since it is built on #22 and should be seen as a proof of concept.

codecov bot commented Nov 23, 2020

Codecov Report

Merging #23 (3fd53a5) into main (08d5760) will increase coverage by 24.20%.
The diff coverage is 99.11%.


@@             Coverage Diff             @@
##             main      #23       +/-   ##
===========================================
+ Coverage   69.40%   93.60%   +24.20%     
===========================================
  Files           5        7        +2     
  Lines         134      219       +85     
===========================================
+ Hits           93      205      +112     
+ Misses         41       14       -27     

Impacted Files                                           Coverage Δ
esmf_regrid/esmf_regridder.py                            88.54% <96.77%> (+29.40%) ⬆️
esmf_regrid/tests/__init__.py                            94.44% <100.00%> (+3.53%) ⬆️
...tests/integration/esmf_regridder/test_Regridder.py    100.00% <100.00%> (ø)
...regrid/tests/unit/esmf_regridder/test_Regridder.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

zklaus (Author) commented Nov 23, 2020

@bjlittle the greetings task of the CI seems to fail, and I have no idea why. Have you seen this problem before?

stephenworsley (Contributor) commented

This looks interesting, I'll have to have a play around with sparse to see what it can do here.

My first concern is that I think I'll probably want to re-integrate and test against some level of back-end unstructured mesh support before changing too much about the regridding mechanics. Part of my thinking when creating the _flatten_array and _unflatten_array methods was to have the dimensions of the grid be solely the concern of the GridInfo class (or MeshInfo for unstructured meshes). I would want to be confident that this approach would also work if we tried it with meshes, where the data is a 1D array.

Secondly, I'm curious what advantages sparse gives us with respect to dask. I have seen some work in Iris to give our regridders more dask support (SciTools/iris#3701), and the Iris regridders are using scipy.sparse arrays. Is there anything sparse would give us that goes beyond this? Perhaps it would be worth comparing performance eventually.

zklaus (Author) commented Nov 23, 2020

Play away 😄

To your first concern, I did have that use case in mind, and I think it shouldn't matter.
I look at it like this: the source data comes in as a tensor with a bunch of indices, say S_{ij}, where i refers to latitudes and j to longitudes. Similarly, we want a target output T_{kl}. From ESMF we get weights w_{klij}, and we can write our regridding as T_{kl} = w_{klij} S_{ij} using the Einstein sum convention or, more explicitly, T_{kl} = \sum_{ij} w_{klij} S_{ij}.

Now suppose the source data comes in as a 1D vector, but we still want a 2D output. Then the weights we get back from ESMF will have three indices, w_{kli}, and the equation becomes T_{kl} = \sum_{i} w_{kli} S_{i}. Or maybe we want to regrid regular data to a mesh? T_{k} = \sum_{ij} w_{kij} S_{ij}. Mesh to mesh? T_{k} = \sum_{i} w_{ki} S_{i}.

What about 3D timeseries? T_{mnkl} = \sum_{ij} w_{klij} S_{mnij}. Perhaps with full 3D regridding in the future? T_{mnkl} = \sum_{pij} w_{nklpij} S_{mpij}. I'll stop short of time-dependent regridding.

The point is that all those right-hand sides amount to a single tensordot(weights, source, axes=axes) call each, with appropriate axes arguments that depend only on the shapes of S and T. The result of that operation always has the right shape.
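
To make the axes bookkeeping concrete, here is a minimal sketch for the timeseries case using plain numpy and made-up shapes (with sparse weights the call would be sparse.tensordot or dask.array.tensordot instead, as discussed below):

import numpy as np

# Made-up shapes: source grid 3x4 (i, j), target grid 5x6 (k, l), 2 time steps (m).
weights = np.random.random((5, 6, 3, 4))  # w_{klij}
source = np.random.random((2, 3, 4))      # S_{mij}

# T_{mkl} = sum_{ij} w_{klij} S_{mij}: contract the last two axes of the weights
# with the last two axes of the source.
target = np.tensordot(weights, source, axes=([2, 3], [1, 2]))
print(target.shape)  # (5, 6, 2); the untouched time axis ends up last,
                     # np.moveaxis(target, -1, 0) would bring it back to the front.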

To your second point, first a point of terminology: as far as I understand, scipy.sparse only offers matrices, i.e. exactly two-dimensional objects. When speaking of arrays, we generally mean arbitrary-dimensional objects. That is precisely what sparse offers us in a, well, sparse way. That's also why it interacts nicely with dask: it's straightforward to have a dask array whose chunks are sparse arrays instead of numpy arrays. This doesn't work so nicely with (sparse) matrices because neither slicing nor stacking is well defined when the dimensionality changes. Moreover, the semantics of matrices differ from those of ndarrays (see, e.g., multiplication with *). How would that carry through to a hypothetical dask sparse matrix?

If we put the weights into a dask sparse array using sparse, we simply call dask.array.tensordot instead of numpy.tensordot and dask will take care of any necessary rechunking to bring the requisite bits of weights to the right input data.

This also seems to be the recommended approach to sparse arrays in dask; see here.
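
As a rough sketch of that pattern with made-up shapes (and assuming a dask/sparse combination where the chunking issue discussed further down is not in the way):

import numpy as np
import sparse
import dask.array as da

# Sparse weights w_{klij} wrapped in a dask array (toy shapes, mostly zeros).
weights_dense = np.zeros((5, 6, 3, 4))
weights_dense[0, 0, 0, 0] = 0.5
weights_dense[4, 5, 2, 3] = 0.5
weights = da.from_array(sparse.COO(weights_dense), chunks=-1)

# Dense source data S_{mij}, chunked over the leading (time) dimension only.
source = da.random.random((10, 3, 4), chunks=(2, 3, 4))

# T_{mkl} = sum_{ij} w_{klij} S_{mij}, evaluated lazily chunk by chunk.
target = da.tensordot(weights, source, axes=([2, 3], [1, 2]))
print(target.compute().shape)  # (5, 6, 10)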

stephenworsley (Contributor) commented Nov 26, 2020

I've done a bit of experimenting with sparse/dask to see how well tensordot would work with chunked arrays. I tried the following code with the latest version of dask:

import sparse
import scipy.sparse
import numpy as np
import dask.array as da
from datetime import datetime

# Build a chunked dask array, zero out most entries, and convert each chunk to sparse.COO.
x = da.random.random((4, 4), chunks=(2, 2))
x[x < 0.65] = 0
chunked_sparse_data = x.map_blocks(sparse.COO)

print("random matrix:")
print(chunked_sparse_data.compute().todense())

# Matrix product computed lazily on the sparse-chunked dask array.
lazy_chunked_result = da.dot(chunked_sparse_data, chunked_sparse_data)

t = datetime.now()
chunked_result = lazy_chunked_result.compute()
print(f"Time taken for dask/sparse product: {datetime.now() - t}")

# The same product with a scipy.sparse CSR matrix for comparison.
scipy_sparse_data = scipy.sparse.csr_matrix(chunked_sparse_data.compute().todense())

t = datetime.now()
scipy_result = scipy_sparse_data * scipy_sparse_data
print(f"Time taken for scipy.sparse product: {datetime.now() - t}")

# And with plain (dense) numpy for reference.
t = datetime.now()
numpy_result = np.dot(x.compute(), x.compute())
print(f"Time taken for numpy product: {datetime.now() - t}")

print("numpy result = scipy.sparse result")
print(np.allclose(scipy_result.todense(), numpy_result))

print("dask/sparse result = scipy.sparse result")
print(np.allclose(scipy_result.todense(), chunked_result.todense()))

This gave the following results:

random matrix:
[[0.         0.         0.95641988 0.        ]
 [0.95865256 0.89930353 0.84667644 0.        ]
 [0.         0.         0.         0.        ]
 [0.82999434 0.         0.         0.        ]]
Time taken for dask/sparse product: 0:00:02.275874
Time taken for scipy.sparse product: 0:00:00.000224
Time taken for numpy product: 0:00:00.010662
numpy result = scipy.sparse result
True
dask/sparse result = scipy.sparse result
False

It looks as though there is a bug, probably in dask.array.tensordot (it seems to happen specifically when the data is chunked). It's also concerning that the sparse calculation takes considerably longer than the scipy.sparse calculation. It may well be that much of this is a one-time overhead and there is a tipping point at which using dask/sparse becomes more efficient, but at the moment it looks as though we might want to keep working with scipy.sparse until dask/sparse improves. I feel like it might still be possible, though perhaps slightly more difficult, to incorporate scipy.sparse into some kind of dask framework so long as chunking is done on dimensions other than latitude/longitude.


I've done some more experimenting and it's worth noting that this bug does not seem to happen when only one of the dimensions of the sparse array is chunked, e.g.

x = da.random.random((4, 4), chunks=(2, 4))

It's also worth noting that the bug goes away if the chunked sparse matrix is constructed as:

x = np.random.random((4, 4))
sparse_data = sparse.COO(x)
chunked_sparse_data = da.from_array(sparse_data, chunks=(2, 2))

though I still have concerns about performance.

stephenworsley (Contributor) commented

I've raised an issue with dask about the above problem: dask/dask#6907.

zklaus (Author) commented Nov 30, 2020

Thanks for the testing!

The apparent bug does concern me, though my own testing suggests that this has nothing to do with sparse and is a rather strange problem. Notably, replacing the line

chunked_sparse_data = x.map_blocks(sparse.COO)

with

chunked_sparse_data = x.map_blocks(sparse.COO).persist()

(i.e. adding .persist(), thereby computing the preceding graph so that we really end up with only the four constituent blocks) makes all calculations equivalent.
Tracing the calls, it seems dask passes the wrong submatrices to sparse. Of course, this must be addressed in dask.

The timings do not concern me. The numbers as presented here suffer from at least four problems:

  • We don't compare like with like. Only the scipy.sparse calculation gets the benefit of localizing all data before performing the calculation. That doesn't say anything about scipy.sparse vs sparse, or indeed about a matrix interface with specialized multiplication vs an array interface with an explicit (tensor)dot.
  • This is multiplying two sparse matrices. That's not the case we are interested in. Rather, we are looking at the application of one sparse, matrix-like thing to a large, full array.
  • The size and chunking are completely unrepresentative of our use case.
  • Timing is not a trivial task, and just doing it with a call to datetime.now before and after has to be taken with a huge grain of salt, doubly so if the times are in the milli- and microsecond range. I suspect that you at least ran the code a couple of times and got similar results, but when we want to look at timing more seriously, we should probably have a look at timeit and perhaps the corresponding IPython/Jupyter magics %time and %timeit (see the sketch below). Some more information can be found in Jake VanderPlas' Python Data Science Handbook.
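
For concreteness, a minimal timeit-based sketch of the like-for-like comparison I have in mind (applying sparse weights to a dense stack of fields; all sizes and the density are made up):

import timeit

import numpy as np
import scipy.sparse
import sparse

# Made-up sizes: roughly a 1-degree source grid, a 2-degree target grid, 10 time steps.
n_src, n_tgt, n_time = 180 * 360, 90 * 180, 10
weights_scipy = scipy.sparse.random(n_tgt, n_src, density=1e-4, format="csr")
weights_sparse = sparse.COO.from_scipy_sparse(weights_scipy)
source = np.random.random((n_time, n_src))

# Apply the weights to all time steps at once, with both packages.
t_scipy = timeit.timeit(lambda: weights_scipy @ source.T, number=10)
t_sparse = timeit.timeit(
    lambda: sparse.tensordot(weights_sparse, source, axes=([1], [1])), number=10
)
print(f"scipy.sparse: {t_scipy / 10:.4f} s per call")
print(f"sparse:       {t_sparse / 10:.4f} s per call")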

stephenworsley (Contributor) commented Nov 30, 2020

I agree that the timings are by no means rigorous, but I still think it's something we should figure out before committing to either package. With that said, it is somewhat tricky knowing what type of calculations to measure before having a complete design.

My concern with sparse is that it may be using a fundamentally slower algorithm for the necessary calculations. The fundamental object in sparse is a COO array; scipy.sparse also has COO matrices, but recommends CSR/CSC matrices for certain calculations because they are inherently faster (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html). The sparse documentation suggests that for some operations it first translates to an appropriate scipy.sparse matrix, though this is not necessarily the case for tensordot (the documentation suggests this calculation had to be reimplemented, though the language is not entirely clear to me: https://sparse.pydata.org/en/stable/#design). My suspicion is that, while it may be slightly trickier to work with, there may well be a viable design involving scipy.sparse with some level of dask support (using map_blocks perhaps) that gives better performance.
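
For the record, a rough sketch of the kind of scipy.sparse + map_blocks design I have in mind, with the horizontal grid flattened to one dimension and chunking only over the other dimensions (the helper name and all sizes are made up):

import numpy as np
import scipy.sparse
import dask.array as da

# Made-up sizes: 12 source points, 30 target points, 10 time steps.
n_src, n_tgt, n_time = 12, 30, 10
weights = scipy.sparse.random(n_tgt, n_src, density=0.2, format="csr")

# Dense source data chunked over time only; the grid dimension stays whole.
source = da.random.random((n_time, n_src), chunks=(2, n_src))

def apply_weights(block):
    # block has shape (time_chunk, n_src); apply the CSR weights to every time step.
    return np.asarray(weights @ block.T).T

target = source.map_blocks(apply_weights, chunks=(2, n_tgt), dtype=source.dtype)
print(target.compute().shape)  # (10, 30)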

I think it would be good to have a discussion at some point about how we plan to design sparse/dask integration and how best to compare the performance of those designs.

stephenworsley (Contributor) commented Dec 2, 2020

Interestingly, the same PR in dask (dask/dask#6846) which seems to fix the above bug also looks like it will offer some level of dask integration for scipy.sparse.

I've also done a little more experimentation with the new dask code and with data slightly closer to our use cases, and it seems that, as suggested, the performance of sparse and scipy.sparse becomes more similar. It also seems that the chunking of the sparse matrix may be a significant factor in performance (from what I can tell, ensuring the first dimension is unchunked has a large positive impact). It will still probably be good to nail down some standard array/chunk sizes for proper performance testing, but I'm feeling better about the performance of sparse.

zklaus closed this Feb 24, 2021