Conversation

zklaus commented Nov 23, 2020

Move from scipy sparse matrices to sparse arrays using the sparse library. This simplifies the index gymnastics a bit, avoids the special numpy matrix interface (cf. the note at the top of here), and should allow us to work with dask a bit more easily.

Draft PR for now since it is built on #22 and should be seen as a proof of concept.

codecov bot commented Nov 23, 2020

Codecov Report

Merging #23 (3fd53a5) into main (08d5760) will increase coverage by 24.20%.
The diff coverage is 99.11%.


@@             Coverage Diff             @@
##             main      #23       +/-   ##
===========================================
+ Coverage   69.40%   93.60%   +24.20%     
===========================================
  Files           5        7        +2     
  Lines         134      219       +85     
===========================================
+ Hits           93      205      +112     
+ Misses         41       14       -27     

Impacted Files                                           Coverage Δ
esmf_regrid/esmf_regridder.py                            88.54% <96.77%> (+29.40%) ⬆️
esmf_regrid/tests/__init__.py                            94.44% <100.00%> (+3.53%) ⬆️
...tests/integration/esmf_regridder/test_Regridder.py    100.00% <100.00%> (ø)
...regrid/tests/unit/esmf_regridder/test_Regridder.py    100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

zklaus (Author) commented Nov 23, 2020

@bjlittle the greetings task of the CI seems to fail, and I have no idea why. Have you seen this problem before?

stephenworsley (Contributor) commented

This looks interesting, I'll have to have a play around with sparse to see what it can do here.

My first concern is that I think I'll probably want to re-integrate and test against some level of back-end unstructured mesh support before changing too much about the regridding mechanics. Part of my thinking when creating the _flatten_array and _unflatten_array methods was to have the dimensions of the grid be solely the concern of the GridInfo class (or MeshInfo for unstructured meshes). I would want to be confident that this approach would also work if we tried it with meshes, where the data is a 1D array.

Secondly, I'm curious what advantages sparse gives us with respect to dask. I have seen some work in Iris to give our regridders more dask support (SciTools/iris#3701), and the Iris regridders are using scipy.sparse arrays. Is there anything sparse would give us that goes beyond this? Perhaps it would be worth comparing performance eventually.

zklaus (Author) commented Nov 23, 2020

Play away 😄

To your first concern, I did have that use case in mind, and I think it shouldn't matter.
I look at it like this: the source data comes in as a tensor with a bunch of indices, say S_{ij}, where i refers to latitudes and j to longitudes. Similarly, we want a target output T_{kl}. From ESMF we get weights w_{klij}, and we can write our regridding as T_{kl} = w_{klij} S_{ij} using the Einstein sum convention or, more explicitly, T_{kl} = \sum_{ij} w_{klij} S_{ij}.

Now suppose the source data comes in as a 1D vector, but we still want a 2D output. Then the weights we get back from ESMF will have three indices, w_{kli}, and the equation becomes T_{kl} = \sum_{i} w_{kli} S_{i}. Or maybe we want to regrid regular data to a mesh? T_{k} = \sum_{ij} w_{kij} S_{ij}. Mesh to mesh? T_{k} = \sum_{i} w_{ki} S_{i}.

What about 3D timeseries? T_{mnkl} = \sum_{ij} w_{klij} S_{mnij}. Perhaps with full 3D regridding in the future? T_{mnkl} = \sum_{pij} w_{nklpij} S_{mpij}. I'll stop short of time-dependent regridding.

The point is that all those right-hand sides amount to a single tensordot(weights, source, axes=axes) call each, with appropriate axes arguments that depend only on the shapes of S and T. The result of that operation always has the right shape.
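
To make the axes bookkeeping concrete, here is a minimal sketch for the timeseries case using plain numpy and made-up shapes (with sparse weights the call would be sparse.tensordot or dask.array.tensordot instead, as discussed below):

import numpy as np

# Made-up shapes: source grid 3x4 (i, j), target grid 5x6 (k, l), 2 time steps (m).
weights = np.random.random((5, 6, 3, 4))  # w_{klij}
source = np.random.random((2, 3, 4))      # S_{mij}

# T_{mkl} = sum_{ij} w_{klij} S_{mij}: contract the last two axes of the weights
# with the last two axes of the source.
target = np.tensordot(weights, source, axes=([2, 3], [1, 2]))
print(target.shape)  # (5, 6, 2); the untouched time axis ends up last,
                     # np.moveaxis(target, -1, 0) would bring it back to the front.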

To your second point, first a point of terminology: as far as I understand, scipy.sparse only offers matrices, i.e. exactly two-dimensional objects. When speaking of arrays, we generally mean arbitrary-dimensional objects. That is precisely what sparse offers us in a, well, sparse way. That's also why it interacts nicely with dask: it's straightforward to have a dask array whose chunks are sparse arrays instead of numpy arrays. This doesn't work so nicely with (sparse) matrices because neither slicing nor stacking is well defined when the dimensionality changes. Moreover, the semantics of matrices differ from those of ndarrays (see, e.g., multiplication with *). How would that carry through to a hypothetical dask sparse matrix?

If we put the weights into a dask sparse array using sparse, we simply call dask.array.tensordot instead of numpy.tensordot and dask will take care of any necessary rechunking to bring the requisite bits of weights to the right input data.

This also seems to be the recommended approach to sparse arrays in dask; see here.
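
As a rough sketch of that pattern with made-up shapes (and assuming a dask/sparse combination where the chunking issue discussed further down is not in the way):

import numpy as np
import sparse
import dask.array as da

# Sparse weights w_{klij} wrapped in a dask array (toy shapes, mostly zeros).
weights_dense = np.zeros((5, 6, 3, 4))
weights_dense[0, 0, 0, 0] = 0.5
weights_dense[4, 5, 2, 3] = 0.5
weights = da.from_array(sparse.COO(weights_dense), chunks=-1)

# Dense source data S_{mij}, chunked over the leading (time) dimension only.
source = da.random.random((10, 3, 4), chunks=(2, 3, 4))

# T_{mkl} = sum_{ij} w_{klij} S_{mij}, evaluated lazily chunk by chunk.
target = da.tensordot(weights, source, axes=([2, 3], [1, 2]))
print(target.compute().shape)  # (5, 6, 10)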

stephenworsley (Contributor) commented Nov 26, 2020

I've done a bit of experimenting with sparse/dask to see how well tensordot would work with chunked arrays. I tried the following code with the latest version of dask:

import sparse
import scipy.sparse
import numpy as np
import dask.array as da
from datetime import datetime

# Build a chunked dask array, zero out most entries, and convert each chunk to sparse.COO.
x = da.random.random((4, 4), chunks=(2, 2))
x[x < 0.65] = 0
chunked_sparse_data = x.map_blocks(sparse.COO)

print("random matrix:")
print(chunked_sparse_data.compute().todense())

# Matrix product computed lazily on the sparse-chunked dask array.
lazy_chunked_result = da.dot(chunked_sparse_data, chunked_sparse_data)

t = datetime.now()
chunked_result = lazy_chunked_result.compute()
print(f"Time taken for dask/sparse product: {datetime.now() - t}")

# The same product with a scipy.sparse CSR matrix for comparison.
scipy_sparse_data = scipy.sparse.csr_matrix(chunked_sparse_data.compute().todense())

t = datetime.now()
scipy_result = scipy_sparse_data * scipy_sparse_data
print(f"Time taken for scipy.sparse product: {datetime.now() - t}")

# And with plain (dense) numpy for reference.
t = datetime.now()
numpy_result = np.dot(x.compute(), x.compute())
print(f"Time taken for numpy product: {datetime.now() - t}")

print("numpy result = scipy.sparse result")
print(np.allclose(scipy_result.todense(), numpy_result))

print("dask/sparse result = scipy.sparse result")
print(np.allclose(scipy_result.todense(), chunked_result.todense()))

This gave the following results:

random matrix:
[[0.         0.         0.95641988 0.        ]
 [0.95865256 0.89930353 0.84667644 0.        ]
 [0.         0.         0.         0.        ]
 [0.82999434 0.         0.         0.        ]]
Time taken for dask/sparse product: 0:00:02.275874
Time taken for scipy.sparse product: 0:00:00.000224
Time taken for numpy product: 0:00:00.010662
numpy result = scipy.sparse result
True
dask/sparse result = scipy.sparse result
False

It looks as though there is a bug, probably in dask.array.tensordot (it seems to happen specifically when the data is chunked). It's also concerning that the sparse calculation takes considerably longer than the scipy.sparse calculation. It may well be that much of this is a one-time overhead and there is a tipping point at which using dask/sparse becomes more efficient, but at the moment it looks as though we might want to keep working with scipy.sparse until dask/sparse improves. I feel like it might still be possible, though perhaps slightly more difficult, to incorporate scipy.sparse into some kind of dask framework so long as chunking is done on dimensions other than latitude/longitude.


I've done some more experimenting and it's worth noting that this bug does not seem to happen when only one of the dimensions of the sparse array is chunked, e.g.

x = da.random.random((4, 4), chunks=(2, 4))

It's also worth noting that the bug goes away if the chunked sparse matrix is constructed as:

x = np.random.random((4, 4))
sparse_data = sparse.COO(x)
chunked_sparse_data = da.from_array(sparse_data, chunks=(2, 2))

though I still have concerns about performance.

stephenworsley (Contributor) commented

I've raised an issue with dask about the above problem: dask/dask#6907.

zklaus (Author) commented Nov 30, 2020

Thanks for the testing!

The apparent bug does concern me, though my own testing suggests that this has nothing to do with sparse and is a rather strange problem. Notably, replacing the line

chunked_sparse_data = x.map_blocks(sparse.COO)

with

chunked_sparse_data = x.map_blocks(sparse.COO).persist()

(i.e. adding .persist(), thereby computing the preceding graph so that we really end up with only the four constituent blocks) makes all calculations equivalent.
Tracing the calls, it seems dask passes the wrong submatrices to sparse. Of course, this must be addressed in dask.

The timings do not concern me. The numbers as presented here suffer from at least four problems:

  • We don't compare like with like. Only the scipy.sparse calculation gets the benefit of localizing all data before performing the calculation. That doesn't say anything about scipy.sparse vs sparse, or indeed about a matrix interface with specialized multiplication vs an array interface with an explicit (tensor)dot.
  • This is multiplying two sparse matrices. That's not the case we are interested in. Rather, we are looking at the application of one sparse, matrix-like thing to a large, full array.
  • The size and chunking are completely unrepresentative of our use case.
  • Timing is not a trivial task, and just doing it with a call to datetime.now before and after has to be taken with a huge grain of salt, doubly so if the times are in the milli- and microsecond range. I suspect that you at least ran the code a couple of times and got similar results, but when we want to look at timing more seriously, we should probably have a look at timeit and perhaps the corresponding IPython/Jupyter magics %time and %timeit (see the sketch below). Some more information can be found in Jake VanderPlas' Python Data Science Handbook.
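
For concreteness, a minimal timeit-based sketch of the like-for-like comparison I have in mind (applying sparse weights to a dense stack of fields; all sizes and the density are made up):

import timeit

import numpy as np
import scipy.sparse
import sparse

# Made-up sizes: roughly a 1-degree source grid, a 2-degree target grid, 10 time steps.
n_src, n_tgt, n_time = 180 * 360, 90 * 180, 10
weights_scipy = scipy.sparse.random(n_tgt, n_src, density=1e-4, format="csr")
weights_sparse = sparse.COO.from_scipy_sparse(weights_scipy)
source = np.random.random((n_time, n_src))

# Apply the weights to all time steps at once, with both packages.
t_scipy = timeit.timeit(lambda: weights_scipy @ source.T, number=10)
t_sparse = timeit.timeit(
    lambda: sparse.tensordot(weights_sparse, source, axes=([1], [1])), number=10
)
print(f"scipy.sparse: {t_scipy / 10:.4f} s per call")
print(f"sparse:       {t_sparse / 10:.4f} s per call")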

stephenworsley (Contributor) commented Nov 30, 2020

I agree that the timings are by no means rigorous, but I still think it's something we should figure out before committing to either package. With that said, it is somewhat tricky knowing what type of calculations to measure before having a complete design.

My concern with sparse is that it may be using a fundamentally slower algorithm for the necessary calculations. The fundamental object in sparse is a COO array; scipy.sparse also has COO matrices, but recommends CSR/CSC matrices for certain calculations because they are inherently faster (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html). The sparse documentation suggests that for some operations it first translates to an appropriate scipy.sparse matrix, though this is not necessarily the case for tensordot (the documentation suggests this calculation had to be reimplemented, though the language is not entirely clear to me: https://sparse.pydata.org/en/stable/#design). My suspicion is that, while it may be slightly trickier to work with, there may well be a viable design involving scipy.sparse with some level of dask support (using map_blocks perhaps) that gives better performance.
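
For the record, a rough sketch of the kind of scipy.sparse + map_blocks design I have in mind, with the horizontal grid flattened to one dimension and chunking only over the other dimensions (the helper name and all sizes are made up):

import numpy as np
import scipy.sparse
import dask.array as da

# Made-up sizes: 12 source points, 30 target points, 10 time steps.
n_src, n_tgt, n_time = 12, 30, 10
weights = scipy.sparse.random(n_tgt, n_src, density=0.2, format="csr")

# Dense source data chunked over time only; the grid dimension stays whole.
source = da.random.random((n_time, n_src), chunks=(2, n_src))

def apply_weights(block):
    # block has shape (time_chunk, n_src); apply the CSR weights to every time step.
    return np.asarray(weights @ block.T).T

target = source.map_blocks(apply_weights, chunks=(2, n_tgt), dtype=source.dtype)
print(target.compute().shape)  # (10, 30)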

I think it would be good to have a discussion at some point about how we plan to design sparse/dask integration and how best to compare the performance of those designs.

stephenworsley (Contributor) commented Dec 2, 2020

Interestingly, the same PR in dask (dask/dask#6846) which seems to fix the above bug also looks like it will offer some level of dask integration for scipy.sparse.

I've also done a little more experimentation with the new dask code and with data slightly closer to our use cases, and it seems that, as suggested, the performance of sparse and scipy.sparse becomes more similar. It also seems that the chunking of the sparse matrix may be a significant factor in performance (from what I can tell, ensuring the first dimension is unchunked has a large positive impact). It will still probably be good to nail down some standard array/chunk sizes for proper performance testing, but I'm feeling better about the performance of sparse.

zklaus closed this Feb 24, 2021