HELP! PRNGKey with shard_map not working #22862
-
Basically, I was trying to port my code from using pmap to using shard_map, as suggested by the JAX docs. With pmap, considering a single host for now on a TPU v4-8 (4 JAX devices), I used to replicate state such as the train state (flax) and the random keys via flax.jax_utils.replicate, pass these replicated states to the pmapped functions, and it worked great! But with shard_map, I am unable to figure out how to make this work. I have the following code (example) that explains my issue:
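Roughly, the setup looks like this (a simplified sketch rather than my exact code; the names func, ones, rngs_mapped and the mesh axis 'i' match the snippet further down, while the array shapes and the PRNGKey seed are just for illustration):

import jax
import jax.numpy as jnp
from flax import jax_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

devices = jax.devices()                  # 4 devices on a single TPU v4-8 host
mesh = Mesh(devices, ('i',))

def func(x, key):
    # I expected each program to see its shard of x plus a single key of shape (2,)
    return x + jax.random.uniform(key, x.shape), key

ones = jnp.ones((len(devices), 8))
rng = jax.random.PRNGKey(0)
rngs_mapped = jax_utils.replicate(rng)   # shape (num_devices, 2), as I did for pmap

sharded_func = shard_map(func, mesh=mesh,
                         in_specs=(P('i'), P('i')),
                         out_specs=(P('i'), P('i')))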
Now, if I pass rngs_mapped to the shard_mapped function, I immediately get an error:
Printing the object seems to indicate that it has been assigned axes it wasn't supposed to have, I guess? Of course, I can fix this using reshapes in this example:
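Something along these lines (again a sketch, reusing the definitions from the snippet above; the key is reshaped back into a single key inside the mapped function):

def func_fixed(x, key):
    key = key.reshape(2)    # (1, 2) -> (2,): undo the extra leading axis
    out = x + jax.random.uniform(key, x.shape)
    return out, key[None]   # put the leading axis back to match the out_spec

out = shard_map(func_fixed, mesh=mesh,
                in_specs=(P('i'), P('i')),
                out_specs=(P('i'), P('i')))(ones, rngs_mapped)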
But this just seems like a hack. What's the proper way to do this? I am basically trying to train diffusion models, and I fear similar issues might also occur with the train states and whatnot. I am aware of issue #22860, but it was only marked as a 'bug' 8 hours ago and there has been no activity since. I just want to be sure it's not a mistake on my end.
-
A main difference between pmap and shard_map is that pmap also maps your function over the leading per-device axis, while shard_map hands each program its whole shard, leading axis included. So here you have a batch of keys of size one, and func sees a key array of shape (1, 2) rather than a single key; wrapping the function in jax.vmap maps that leading axis away again. Taking this into account (and fixing your sharding specifications to match the input and output shapes) looks like this:

out2 = jax.jit(shard_map(jax.vmap(func), mesh=mesh,
                         in_specs=(P('i'), P('i')),
                         out_specs=(P('i'), P('i'))))(ones, rngs_mapped)
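For reference, here is a self-contained sketch of this vmap-inside-shard_map pattern that runs on whatever devices are available (it uses one key per device via jax.random.split; the shapes and names are just illustrative):

import jax
import jax.numpy as jnp
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

devices = jax.devices()
mesh = Mesh(devices, ('i',))
n = len(devices)

def func(x, key):
    # under jax.vmap the leading axis is mapped away:
    # x is a single row and key is a single key of shape (2,)
    return x + jax.random.uniform(key, x.shape), key

ones = jnp.ones((n, 8))
rngs_mapped = jax.random.split(jax.random.PRNGKey(0), n)   # (n, 2): one key per device

out, keys = jax.jit(shard_map(jax.vmap(func), mesh=mesh,
                              in_specs=(P('i'), P('i')),
                              out_specs=(P('i'), P('i'))))(ones, rngs_mapped)
print(out.shape, keys.shape)   # (n, 8) and (n, 2)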
-
There's a nice strategy without requiring vmap here: https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX/data_parallel_fsdp.html

So here's a full answer to your question:

N_CPUS = 8
import os
os.environ["XLA_FLAGS"] = f'--xla_force_host_platform_device_count={N_CPUS}' # Use 8 CPU devices
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
P = jax.sharding.PartitionSpec
devices = mesh_utils.create_device_mesh((N_CPUS,), devices=jax.devices("cpu"))
mesh = jax.sharding.Mesh(devices, ("x",))
sharding = jax.sharding.NamedSharding(mesh, P("x"))
rng = jax.random.PRNGKey(0)
# replicate the rng (no need to split)
rng = jax.device_put(rng, jax.sharding.NamedSharding(mesh, P()))
arr = jnp.arange(N_CPUS)
def fold_rng_over_axis(rng: jax.random.PRNGKey, axis_name: str) -> jax.random.PRNGKey:
    """Folds the random number generator over the given axis.

    This is useful for generating a different random number for each device
    across a certain axis (e.g. the model axis).

    Copied from https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/scaling/JAX/data_parallel_fsdp.html

    Args:
        rng: The random number generator.
        axis_name: The axis name to fold the random number generator over.

    Returns:
        A new random number generator, different for each device index along the axis.
    """
    axis_index = jax.lax.axis_index(axis_name)
    return jax.random.fold_in(rng, axis_index)

@jax.jit
def f(rng, x):
    rng = fold_rng_over_axis(rng, axis_name="x")
    print("x shape", x.shape)
    print("rng shape", rng.shape)
    return jax.random.uniform(rng) + x

f_sh = shard_map(f, mesh=mesh, in_specs=(P(), P("x")), out_specs=P("x"))
print("f_sh output", f_sh(rng, arr))

Output:
Found something. This seems to work:
Basically, if I don't give the axis name at all, then there's no need to replicate anything and no reshaping etc. is required, and I am still able to use collectives inside. This works on multi-host too. I just want to confirm with you the caveats of this method. Is this alright?
Once again, thank you so much for this!
EDIT:
should be this actually: