Understanding vjp for scan and Remat #9730
Hi all. I am trying to understand the different ways to write a large kernel in Jax. The underlying code looks like this (assume lambd is a huge tensor and adding an extra dimension would blow up memory):

```python
@partial(np.vectorize, signature="(c),(),(c)->()")
def cauchy_dot(v, omega, lambd):
    return (v / (omega - lambd)).sum()
```

Running this code runs out of memory on the gradient step, but I can get it to fit with jax.remat.
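Concretely, the remat band-aid might look something like the sketch below; the exact decorator placement is an assumption, not quoted from the run above:

```python
from functools import partial

import jax
import jax.numpy as np  # the snippets here import jax.numpy as np

# Sketch: wrap the kernel in jax.remat so the big (omega - lambd) intermediate
# is recomputed on the backward pass instead of being stored.
@partial(np.vectorize, signature="(c),(),(c)->()")
@jax.remat
def cauchy_dot(v, omega, lambd):
    return (v / (omega - lambd)).sum()
```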
My understanding though is that this is only a band-aid, as remat just avoids keeping the memory but still materializes the matrix. So I tried writing a non-materialized version:

```python
@partial(np.vectorize, signature="(c),(),(c)->()")
def cauchy_dot(v, omega, lambd):
    def s(carry, x):
        v2, l = x
        return carry + (v2 / (omega - l)), None

    return jax.lax.scan(s, 0.0, (v, lambd))[0]
```

This however fails: even though the forward pass is non-materialized, my understanding is that the vjp for scan will need to materialize the intermediate steps.
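One way to see where that memory goes is to look at the gradient's jaxpr; the per-step residuals that scan saves for the backward pass should show up there as extra stacked outputs of the forward scan. A sketch with made-up small shapes (not from the post):

```python
import jax
import jax.numpy as jnp

# Hypothetical small shapes, just to keep the printout readable.
v = jnp.ones((8,))
lambd = jnp.ones((8,))
omega = jnp.ones(())

# Print the jaxpr of the gradient of the scan-based cauchy_dot defined above.
print(jax.make_jaxpr(jax.grad(lambda v: cauchy_dot(v, omega, lambd)))(v))
```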
But I can get around this by:

```python
@partial(np.vectorize, signature="(c),(),(c)->()")
def cauchy_dot(v, omega, lambd):
    @jax.remat
    def inner(v2, l):
        return v2 / (omega - l)

    def s(carry, x):
        v2, l = x
        return carry + inner(v2, l), None

    return jax.lax.scan(s, 0.0, (v, lambd))[0]
```

============

Update: Further benchmarking on this seems to now show that …
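For reference, a minimal usage sketch of this last version (the shapes are made up, not from the post):

```python
import jax
import jax.numpy as jnp

# Made-up shapes: one vector of v/lambd, many omegas.
v = jnp.ones((1000,))
lambd = jnp.ones((1000,))
omega = jnp.ones((10000,))

# Differentiate a scalar loss through the scan-based, per-step-remat kernel.
loss = lambda v: cauchy_dot(v, omega, lambd).sum()
grad_fn = jax.jit(jax.grad(loss))
print(grad_fn(v).shape)
```

As a side note, jax.remat/jax.checkpoint also takes a prevent_cse flag; the JAX docs suggest prevent_cse=False can help performance when the checkpointed function sits inside a jax.lax.scan.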
@srush My observation is: jax.jit can avoid OOM.

```python
@partial(jnp.vectorize, signature="(c),(),(c)->()")
def cauchy_dot(v, omega, lambd):
    return (v / (omega - lambd)).sum()

x = jnp.zeros((1000,))
omega = jnp.ones((10000, 10000))
print(jax.jit(jax.grad(lambda *args: jnp.sum(cauchy_dot(*args))))(x, omega, x))
```

Inspect the unoptimized IR in HLO form:

```python
lowered = jax.jit(jax.grad(lambda *args: jnp.sum(cauchy_dot(*args)))).lower(x, omega, x)
print(lowered.compiler_ir(dialect="mhlo"))
```
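To see what actually runs after XLA's optimization passes, one could also print the compiled module; the as_text() call below is an assumption about the jax.stages API and may differ across JAX versions:

```python
# Optimized HLO after XLA fusion; compare with the unoptimized IR above.
compiled = lowered.compile()
print(compiled.as_text())
```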
Maybe the reduce avoids materializing the huge matrix? I'm not sure. You maybe misunderstand the effect of jax.remat: it avoids saving the intermediates computed inside the checkpointed function, but it still saves that function's inputs so they can be replayed on the backward pass. Thus, if you checkpoint every step, the saved inputs still cost O(#steps) memory.
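One way to check what actually gets kept (assuming a JAX version that ships jax.ad_checkpoint.print_saved_residuals) is to print the residuals for whichever cauchy_dot variant you are testing:

```python
import jax
import jax.ad_checkpoint
import jax.numpy as jnp

# Small made-up shapes so the report stays readable.
v = jnp.ones((8,))
lambd = jnp.ones((8,))
omega = jnp.ones(())

# Prints a report of every value saved for the backward pass, which makes
# the saved-inputs point above concrete.
jax.ad_checkpoint.print_saved_residuals(cauchy_dot, v, omega, lambd)
```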