Replies: 2 comments 1 reply
-
Hi - it's not clear from your description which arrays have which shapes. Given the memory blow-up, I suspect you may be running into unexpected rank promotion due to the alignment of various array axes in your code. It would be helpful if you could put together a complete minimal example of the behavior you're seeing, so that it's clearer to the reader what array sizes you're working with.
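To illustrate the kind of rank promotion I mean (an illustrative sketch with made-up shapes, not necessarily your exact case):

```python
import jax.numpy as jnp

# Two modest arrays (hypothetical shapes, not taken from your code):
a = jnp.ones((2000, 1))  # shape (2000, 1)
b = jnp.ones((2000,))    # shape (2000,)

# Broadcasting silently promotes the pair to a full (2000, 2000) matrix,
# a 2000x memory blow-up relative to either input.
c = a + b
print(c.shape)  # (2000, 2000)
```

If one of your intermediate arrays picks up an extra axis like this, every downstream operation inherits the inflated shape.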
-
Hey @jakevdp thank you for your reply. I am trying to put together a minimal example, but of course the minimal example is working, so I am not sure exactly how to reproduce this unless I just send you my notebook. If you are willing to take a look at it, I would be happy to send it. Below is a minimal example of what I am trying to do (and what I think I am doing):

```python
import os
import jax
import jax.numpy as jnp
from jax import random

def getGradients(sks):
    ...

def getCost(grad_1, grad_2, sigmas, sks):
    ...

# gpu_device is defined elsewhere in the notebook
jit_getCost = jax.jit(getCost, device=gpu_device)

def run(sks):
    rng_key = random.key(123)
    num_t_points = 61
    new_key, *sks = random.split(rng_key, 6)
    ...

cost = run(sks)
```

In the actual code, the gradients and sigmas are calculated in another function, but they all share the same dimensions (61, 11, 2000). I am then trying to draw samples of size (61, 11, 2000, 100). It seems this operation is working fine, and is doing so also in the minimal example. I can even do operations (like taking the mean) on the samples. However, it is when I try to use the results of those operations that I get the error. Now it is saying:
It still seems to be memory related, but if I print the size of the array, it shows the correct dimensions (61, 11, 2000) after taking the mean along axis=-1.

Benjamin
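For reference, a scaled-down check of the mean step (random stand-ins, shapes reduced from the real (61, 11, 2000, 100)):

```python
import jax
import jax.numpy as jnp

# Scaled-down stand-in for the (61, 11, 2000, 100) sample array.
samples = jax.random.normal(jax.random.key(0), (61, 11, 200, 100))

means = jnp.mean(samples, axis=-1)
print(means.shape)  # (61, 11, 200) -- the trailing sample axis is gone
```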
-
I am working on porting a project from NumPy to JAX to make it, hopefully, real-time fast. My current algorithm is running at 100 Hz, which is plenty fast for what I need. However, I am trying to add one more piece, and it is leading to OOM errors no matter what size I change the inputs to. I am at a loss as to where to go from here. I have tried setting:
```python
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
```
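(For reference, this flag only takes effect if it is set before JAX is first imported — a minimal sketch of the required ordering:)

```python
import os

# Must run before JAX is first imported; by default XLA preallocates
# a large fraction (~75%) of GPU memory at startup.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

import jax  # now allocates on demand instead of preallocating
```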
The algorithm is generating a bunch of samples and then performing operations on them before returning my final result.
I have this line:
```python
grad_cost = jnp.sum(jnp.sum(jnp.square(grad_1),axis=1) + jnp.sum(jnp.square(grad_2),axis=1),axis=0)
```
which is operating on arrays of size (61, 11, 2000) with no problem.
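A self-contained sketch of that reduction, using random stand-ins for the real gradient arrays:

```python
import jax
import jax.numpy as jnp

# Random stand-ins for the (61, 11, 2000) gradient arrays.
k1, k2 = jax.random.split(jax.random.key(0))
grad_1 = jax.random.normal(k1, (61, 11, 2000))
grad_2 = jax.random.normal(k2, (61, 11, 2000))

# (61, 11, 2000) --sum over axis 1--> (61, 2000) --sum over axis 0--> (2000,)
grad_cost = jnp.sum(jnp.sum(jnp.square(grad_1), axis=1)
                    + jnp.sum(jnp.square(grad_2), axis=1), axis=0)
print(grad_cost.shape)  # (2000,)
```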
I then added these lines:
where these samples are now size (61, 11, 2000, 100). This too runs fine and quite quickly. However, if I try to pass these mean values into the original function like:
```python
grad_cost = jnp.sum(jnp.sum(jnp.square(grad_1_m),axis=1) + jnp.sum(jnp.square(grad_2_m),axis=1),axis=0)
```
Then I always get the resource-exhausted error:
RESOURCE_EXHAUSTED: Out of memory while trying to allocate 2899692453488 bytes.
I don't understand why it thinks it needs 2.9 TB. These arrays should not be anywhere near that big in memory.
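Some back-of-the-envelope arithmetic on those sizes (assuming float32):

```python
# Largest intended array in the pipeline: (61, 11, 2000, 100) in float32.
elements = 61 * 11 * 2000 * 100
expected_bytes = elements * 4          # float32 = 4 bytes per element
print(expected_bytes)                  # 536800000, i.e. ~0.5 GB

# The failed allocation asked for ~2.9 TB:
requested = 2_899_692_453_488
print(round(requested / expected_bytes, 1))  # 5401.8 -- ~5000x too large
```

A discrepancy that large usually points to an unintended broadcast creating a much higher-rank intermediate, rather than the named arrays themselves.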
I have also tried to implement a version using lax.fori_loop, thinking maybe it was really exhausting the memory, but that too failed immediately with the same error. Any suggestions on where I am going wrong? How can I get this to run with the additional samples? I can handle the speed coming down a little, but I really don't understand why it won't run at all at this point. Any input is very much appreciated.
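For reference, a shape-only trace with `jax.eval_shape` (using a hypothetical stand-in for the cost function) confirms the intended output shape without allocating any device memory:

```python
import jax
import jax.numpy as jnp

def cost_stub(grad_1, grad_2):
    # Hypothetical stand-in for the real cost computation.
    return jnp.sum(jnp.sum(jnp.square(grad_1), axis=1)
                   + jnp.sum(jnp.square(grad_2), axis=1), axis=0)

# jax.eval_shape traces with abstract values only, so it reports output
# shapes/dtypes without touching array memory.
g = jax.ShapeDtypeStruct((61, 11, 2000), jnp.float32)
out = jax.eval_shape(cost_stub, g, g)
print(out.shape)  # (2000,)
```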
Thanks!
Benjamin