
Unexpectedly high grad-of-scan memory usage #3186

Closed · jeffgortmaker opened this issue May 22, 2020 · 5 comments
Labels: question (Questions for the JAX team)


jeffgortmaker commented May 22, 2020

Consider the following function that sums x * (y + z) over all y in ys and then averages over the resulting matrix of sums:

```python
import jax.lax
import jax.numpy as jnp

def f(x, ys):
    z = jnp.ones((3000, 3000))

    def scanned(carry, y):
        return carry + x * (y + z), None

    summed, _ = jax.lax.scan(scanned, jnp.zeros_like(z), ys)
    return summed.mean()
```

Because I use lax.scan (instead of, e.g., vmap or lax.map followed by a sum over the first axis), memory usage doesn't significantly scale with the number of ys. The following code uses ~203MB regardless of whether n = 5 or n = 10:

```python
import resource

n = 5  # or n = 10; peak memory is ~203MB either way
print(f(1.0, jnp.ones(n)))
print(f"{1e-3 * resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}MB")
```

But the gradient uses 557MB for n = 5 and 908MB for n = 10:

```python
import jax

print(jax.grad(f)(1.0, jnp.ones(n)))
print(f"{1e-3 * resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}MB")
```

The story is similar when these functions are jitted.

My best guess about what's going on here is that grad is storing every (y + z) in memory. Is this intended? And is there some way to tell grad to be more economical about what it stores, so that computing the gradient gets a memory reduction similar to the one lax.scan gives the forward pass?

skye (Member) commented May 22, 2020

You're right that grad causes every (y + z) to be stored. Since the result of f is computed using x * (y + z), it needs to save the (y + z) values to compute the gradient. You can try using the new jax.remat, which causes values needed by the gradient computation to be recomputed instead of stored, thus saving memory. This probably makes sense for a scan like this, where you're creating a large amount of easy-to-compute values. See #1749 for examples of using remat. I think doing scan(remat(scanned), ...) should work in this case.
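Concretely, something like this (an untested sketch, reusing the names from your example):

```python
import jax
import jax.lax
import jax.numpy as jnp

def f(x, ys):
    z = jnp.ones((3000, 3000))

    def scanned(carry, y):
        return carry + x * (y + z), None

    # Wrap the scan body in remat so its intermediates (like y + z) are
    # recomputed during the backward pass instead of being stored.
    summed, _ = jax.lax.scan(jax.remat(scanned), jnp.zeros_like(z), ys)
    return summed.mean()
```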

cc @mattjj who created remat

skye self-assigned this May 22, 2020
jeffgortmaker (Author) commented:
This is perfect, thanks so much! I hadn't seen remat before -- looks like it's tailor-made for this type of problem.

For some reason rematifying scanned directly didn't seem to work; I found that I had to rematify the actual computation within the scan to get the desired memory reduction:

```python
def f(x, ys):
    z = jnp.ones((3000, 3000))

    @jax.remat
    def inner(y):
        return x * (y + z)

    def scanned(carry, y):
        return carry + inner(y), None

    summed, _ = jax.lax.scan(scanned, jnp.zeros_like(z), ys)
    return summed.mean()
```

mattjj (Collaborator) commented May 22, 2020

By the way, we're working on some other improvements that should make this work well even without remat by never instantiating the large ones((3000, 3000)) array. We'd still need remat in general, but in this case the memory savings can be had by avoiding the large constant.
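In the meantime, you can get a similar effect by hand by not materializing z at all and relying on broadcasting (a rough sketch, equivalent here only because z happens to be all ones):

```python
import jax
import jax.lax
import jax.numpy as jnp

def f(x, ys):
    # z was jnp.ones((3000, 3000)); since it is all ones, adding the scalar
    # 1.0 and letting broadcasting do the work gives the same values without
    # ever materializing a second 3000 x 3000 array.
    def scanned(carry, y):
        return carry + x * (y + 1.0), None

    summed, _ = jax.lax.scan(scanned, jnp.zeros((3000, 3000)), ys)
    return summed.mean()
```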

mattjj added the question (Questions for the JAX team) label May 22, 2020
jeffgortmaker (Author) commented:
Very cool, I'll keep my eyes peeled and keep updating the package. The work you all are doing here is really great.

jwnys commented Jun 7, 2022

Not sure if it's entirely relevant, but I'll mention what helped me instead.
If you don't want to trade compute time for memory (which is what remat does), what worked for me (bringing memory requirements down from >150GB to <32GB) was to unroll the scan with unroll=len(xs). I needed Hessians of a scan function, and this somehow resolved everything for me. I'm still not sure why it worked, so it would be good to get some information on this, @mattjj, just to know whether this is actually a good idea.
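For reference, this is the pattern I mean, applied to the example from this thread (a sketch on my end; jax.lax.scan accepts an integer unroll argument):

```python
import jax
import jax.lax
import jax.numpy as jnp

def f(x, ys):
    z = jnp.ones((3000, 3000))

    def scanned(carry, y):
        return carry + x * (y + z), None

    # unroll=len(ys) fully unrolls the scan at trace time, trading compilation
    # time and program size for the memory behavior described above.
    summed, _ = jax.lax.scan(scanned, jnp.zeros_like(z), ys, unroll=len(ys))
    return summed.mean()
```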
