I am implementing the Lattice Boltzmann method with JAX. The method requires operations similar to convolution (but different), i.e., each lattice node interacts with its neighbors. I am using `jax.vmap` to apply the per-node update to every lattice node. In the following example, `memory_test1` runs, but `memory_test2`, which only doubles the trailing dimension of the intermediate array, runs out of memory on a machine with 48 GB. How much memory does `vmap` need to solve a problem like this?

```python
import jax
import jax.numpy as np


def to_id_xyz(lattice_id):
    # E.g., lattice_id = N**2 + 2*N + 3 will be converted to (1, 2, 3).
    id_z = lattice_id % N
    lattice_id = lattice_id // N
    id_y = lattice_id % N
    id_x = lattice_id // N
    return id_x, id_y, id_z


def extract_3x3x3_grid(lattice_id, value_tensor):
    # Gather the values of the node and its 26 neighbors, with periodic wrap-around.
    id_x, id_y, id_z = to_id_xyz(lattice_id)
    grid_index = np.ix_(np.array([(id_x - 1) % N, id_x, (id_x + 1) % N]),
                        np.array([(id_y - 1) % N, id_y, (id_y + 1) % N]),
                        np.array([(id_z - 1) % N, id_z, (id_z + 1) % N]))
    grid_values = value_tensor[grid_index]
    return grid_values


def memory_test1(lattice_id, value_tensor):
    value_tensor_local = extract_3x3x3_grid(lattice_id, value_tensor)  # (3, 3, 3, dof)
    vel = np.ones((3, 3, 3, dof, 1))
    u_local = value_tensor_local[:, :, :, :, None] * vel  # (3, 3, 3, dof, 1)
    return np.sum(u_local)

memory_test1_vmap = jax.jit(jax.vmap(memory_test1, in_axes=(0, None)))


def memory_test2(lattice_id, value_tensor):
    value_tensor_local = extract_3x3x3_grid(lattice_id, value_tensor)  # (3, 3, 3, dof)
    vel = np.ones((3, 3, 3, dof, 2))
    u_local = value_tensor_local[:, :, :, :, None] * vel  # (3, 3, 3, dof, 2)
    return np.sum(u_local)

memory_test2_vmap = jax.jit(jax.vmap(memory_test2, in_axes=(0, None)))


N = 400
dof = 19
value_tensor = np.ones((N, N, N, dof))

result1 = memory_test1_vmap(np.arange(N * N * N), value_tensor)
print(f"max result1 = {np.max(result1)}")

result2 = memory_test2_vmap(np.arange(N * N * N), value_tensor)
```
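As a quick sanity check of the indexing, here is a minimal sketch on a tiny lattice, assuming the definitions above are in scope (the values of `N` and `dof` below are hypothetical, chosen only for illustration; the functions read them as module-level globals, so they are shrunk before the calls):

```python
# Minimal sanity check on a tiny lattice (hypothetical sizes, for illustration only).
# N and dof are read as module-level globals by the functions above, so shrink them first.
N, dof = 4, 2
small_tensor = np.arange(N * N * N * dof, dtype=np.float32).reshape((N, N, N, dof))

# lattice_id = 1*N**2 + 2*N + 3 should map back to (1, 2, 3), per the comment in to_id_xyz.
print(to_id_xyz(1 * N**2 + 2 * N + 3))            # (1, 2, 3)

# The 3x3x3 neighborhood of node (0, 0, 0) wraps around periodically to index N-1.
print(extract_3x3x3_grid(0, small_tensor).shape)  # (3, 3, 3, 2)
```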
Replies: 1 comment
There is no general answer to "how much memory `vmap` needs to solve a problem". `vmap` doesn't do computation; rather it transforms one abstract computation into another that is applicable to batched inputs. Consider this simple example:

```python
import jax
import jax.numpy as jnp

def f(x):
  return (x[:, None] * x[None, :]).sum()
```

We can get a sense for what operations this lowers to by printing its jaxpr:

```python
x = jnp.ones(100)
print(jax.make_jaxpr(f)(x))
```

This function does two broadcasts, one multiply, and then sums over the resulting (100, 100) array. Now what if we vmap it?

```python
f_vmap = jax.vmap(f)
x_batched = jnp.ones((10, 100))
print(jax.make_jaxpr(f_vmap)(x_batched))
```

Again it's two broadcasts and a multiply, followed by a sum over the resulting (10, 100, 100) array. The point is that `vmap` produces essentially the same operations, just applied to arrays with an extra batch dimension, so the memory a vmapped function needs is determined by the batched intermediate arrays it creates.

Your function is dealing with arrays whose size is measured in tens of gigabytes, on a machine with 48GB available, and you're finding that when you double the size of your arrays, you run out of memory. This on its face is not entirely surprising. If you're interested in the details of which operations and intermediate values are created by your vmapped functions, you can use `jax.make_jaxpr` as above to inspect them.

Does that make sense?
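For concreteness, here is a hedged sketch of that inspection applied to the functions from the question, assuming their definitions (and the question's `np` alias for `jax.numpy`) are in scope; the small `N` below is a hypothetical value chosen so that tracing stays cheap:

```python
# Sketch only: trace the question's vmapped function on a deliberately small lattice.
# make_jaxpr traces abstractly, so nothing large is allocated here.
N, dof = 8, 19                      # hypothetical small sizes; the functions read these globals
small_tensor = np.ones((N, N, N, dof))
lattice_ids = np.arange(N * N * N)
print(jax.make_jaxpr(jax.vmap(memory_test1, in_axes=(0, None)))(lattice_ids, small_tensor))
# The jaxpr shows gathers, broadcasts, and multiplies over arrays with a leading
# N**3 batch dimension; those batched intermediates are what grow when the trailing
# dimension of `vel` goes from 1 (memory_test1) to 2 (memory_test2).
```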
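And a rough upper-bound estimate of those intermediates, assuming float32 and that the full batched neighborhood were materialized (in practice XLA fusion can keep much less live, so treat this as a ceiling rather than a prediction):

```python
# Back-of-the-envelope ceiling: bytes needed if the batched (N**3, 3, 3, 3, dof, trailing)
# intermediate were stored in full as float32. Fusion usually keeps less than this live,
# but whatever is kept scales linearly with the trailing dimension.
def neighborhood_bytes(N, dof, trailing):
    return N**3 * 3 * 3 * 3 * dof * trailing * 4

for trailing in (1, 2):   # memory_test1 vs memory_test2
    print(trailing, neighborhood_bytes(400, 19, trailing) / 2**30, "GiB if fully materialized")
# Doubling the trailing dimension doubles the ceiling, which matches running out of
# memory when going from memory_test1 to memory_test2 on the same machine.
```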