I think my intuition about non-contiguous memory was correct, because I found a pretty hacky but surprisingly fast way of dealing with this:

```python
out = [0, 0, 0, 0, 0]
for i, Mi in enumerate(M):
    for j, Mij in enumerate(Mi):
        # Mij has shape (512, 512)
        out[i] += Mij * L[j]
return [x.T for x in out]
```

Anyone looking at that code will probably wonder wth I am doing there, but it works and is fast. I don't know if there is already an elegant way of circumventing the issues I had, or how viable the idea would be, but something like a lambda-allocator would be very nice: basically, a function that takes a lambda expression as input and populates a contiguous-memory tensor whose entry at index `(i, j)` is `f(i, j)`:

```python
f = lambda i, j: (i == j) * 1.0
I = lambda_allocator(f, bounds=[(0, 10), (0, 10)])   # equivalent to jnp.eye(10)

f = lambda i, j: jnp.zeros((100,))
M = lambda_allocator(f, bounds=[(0, 13), (0, 10)])   # equivalent to jnp.zeros((13, 10, 100))
```
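For what it's worth, something in this spirit can be sketched today with nested `jax.vmap`s over index ranges; the result is a single contiguous array populated from an index lambda. The `lambda_allocator` name and signature here just mirror the hypothetical API above, it is not an existing JAX function:

```python
import jax
import jax.numpy as jnp

def lambda_allocator(f, bounds):
    # Sketch: evaluate f over the index grid given by `bounds`
    # (a list of (start, stop) pairs) and pack the results into
    # one contiguous array via nested vmaps.
    ii = jnp.arange(*bounds[0])
    jj = jnp.arange(*bounds[1])
    return jax.vmap(lambda i: jax.vmap(lambda j: f(i, j))(jj))(ii)

f = lambda i, j: (i == j) * 1.0
I = lambda_allocator(f, bounds=[(0, 10), (0, 10)])  # matches jnp.eye(10)
```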
---
I am experiencing undesired behavior which leads to a significant slowdown. I am not sure if I am on the right track in solving this issue, but I think it has something to do with the memory management of my function.

I have a function that gets called hundreds of times. It is very important for my application that this function is evaluated quickly. Essentially, what the function computes is a dot product of a matrix that I need to generate and another matrix that is provided, thus essentially `M(r, d) @ L`.
`M` and `L` have a fairly large leading dimension of `N` = 250,000 to 1,000,000 (but `N` is fixed and does not vary within a run). `r` and `d` have dimensions `(3, N)`, which in turn means `M` has shape `(N, 35, 5)` and `L` has dimensions `(N, 4, 35)`. The problem I have run into is that there is no "easy" way to express `M` in terms of `r` and `d` (i.e. as a sequence of matmuls or something), so I am currently generating `M` from a list of submatrices (a stand-in sketch follows below). This is very fast! About 1.2 ms, which is the kind of speed I need, but I now have a list of arrays that I unfortunately cannot shove into `dot`.
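A minimal sketch of this kind of construction, with a made-up placeholder entry formula (the real formulas are application-specific); only the nested-list structure matters, 35 × 5 entries that are each an `(N,)`-shaped array:

```python
import jax.numpy as jnp

def build_M(r, d):
    # Hypothetical stand-in entries: each M[a][b] is an (N,)-shaped
    # array computed elementwise from r (3, N) and d (3, N).
    return [[jnp.cos(a * r[0] + b * d[1]) * r[2]  # shape (N,)
             for b in range(5)]
            for a in range(35)]

# jnp.array(build_M(r, d)) has shape (35, 5, N);
# .transpose((2, 0, 1)) then yields the desired (N, 35, 5).
```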
As soon as I turn this sequence of arrays into a `jnp.array` (`jnp.asarray` does not help) and transpose it so I can use `M @ L`, i.e. if I do `return jnp.array(M).transpose((2, 0, 1))`, the function becomes 40x slower.

I've tried various things, like using `einsum` to avoid the `transpose`, but all of these strategies have failed. I believe my problem is that when I generate the tuple of tuples `M`, the respective blocks of memory are fragmented, and as soon as I call `transpose`, XLA starts copying memory around to make the array `M` contiguous. Am I interpreting this behavior correctly?
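For reference, the `einsum` variant would look something like this (index convention assumed from the shapes above); it avoids the explicit `transpose`, but it still has to materialize the stacked array first, which is presumably where the copy happens:

```python
import jax.numpy as jnp

def apply_M(M_blocks, L):
    M = jnp.array(M_blocks)  # (35, 5, N), stacked from the nested list
    # Contract the 35-axis of M with the last axis of L (N, 4, 35),
    # batching over N; the result has shape (N, 4, 5) and no transpose
    # of M is needed, but jnp.array(...) already copies the blocks.
    return jnp.einsum('abn,nca->ncb', M, L)
```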
Is there a way of directly generating the submatrices at the "correct" location in memory, such that the result is already contiguous and does not require copying?
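One pattern that at least expresses "write each block directly into its slot" is to preallocate the target array and fill slices with `.at[...].set(...)`, which JAX can turn into in-place updates under `jit`; whether XLA then actually avoids the extra copies is exactly the open question here. The entry formula is again a hypothetical placeholder:

```python
import jax
import jax.numpy as jnp

@jax.jit
def build_M_contiguous(r, d):
    N = r.shape[1]
    M = jnp.zeros((N, 35, 5))
    for a in range(35):
        for b in range(5):
            # Hypothetical stand-in for the real entry formula.
            M = M.at[:, a, b].set(jnp.cos(a * r[0] + b * d[1]) * r[2])
    return M
```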