
[NVIDIA] Support vmap usage of jax.nn.dot_product_attention #22830

Merged
merged 2 commits on Aug 7, 2024

Conversation

kaixih (Contributor) commented Aug 1, 2024

To address request #22760, this PR supports non-batched inputs.

@@ -853,9 +853,9 @@ def dot_product_attention(
query: ArrayLike,
key: ArrayLike,
value: ArrayLike,
*,
Collaborator:

Why this change?

Contributor Author:

It seems you can only specify the batch dims for the positional args in jax.vmap. For keyword arguments, vmap always uses the leading dim as the batch.
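A minimal sketch of this constraint (the function and shapes below are illustrative, not the PR's code): `in_axes` only applies to positional arguments, so an operand that should not be batched must be passed positionally and marked with `None`.

```python
import jax
import jax.numpy as jnp

def attn_like(q, bias):
    # Stand-in for an attention-style op: q is batched, bias is shared.
    return q + bias

qs = jnp.ones((4, 2, 3))   # batched over axis 0
bias = jnp.ones((2, 3))    # shared across the whole batch

# in_axes controls batching for positional args only; None marks
# bias as unbatched. A keyword arg could not be annotated this way.
out = jax.vmap(attn_like, in_axes=(0, None))(qs, bias)
assert out.shape == (4, 2, 3)
```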

@@ -618,11 +618,12 @@ def _dot_product_attention_fwd_batcher(
*_, S, _, _ = key.shape
B = math.prod(Bs)
has_bias, _ = variadic_args
original_shape = query.shape
Collaborator:

How about output_shape since you only use it for output?

Contributor Author:

Done.
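For context, the batcher's `B = math.prod(Bs)` collapses all leading batch dims into one. A sketch of that step (shapes and names here are assumptions, not the PR's exact code):

```python
import math
import jax.numpy as jnp

# All leading batch dims Bs fold into a single dim B = prod(Bs),
# leaving a canonical 4D (B, T, N, H) layout for the kernel.
query = jnp.ones((2, 3, 5, 2, 4))   # (B1, B2, T, N, H)
*Bs, T, N, H = query.shape
B = math.prod(Bs)
flat = query.reshape((B, T, N, H))
assert flat.shape == (6, 5, 2, 4)
```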

  if t is None:
    return t
  t = jnp.asarray(t)
  return t[None, ...] if t.ndim == 3 else t
Collaborator:

Does it make sense to assert that t.ndim is 4 if it's not 3?

Contributor Author:

Done.

@@ -912,19 +912,25 @@ def dot_product_attention(
Returns:
An array of the attention output with the same shape as :code:`query`.
"""
original_shape = jnp.asarray(query).shape
def _preprocess_array(t):
if t is None:
Collaborator:

Nit: I would personally move this out, since only bias and mask can be None.

Contributor Author:

Done.
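Moving the None check to the call site would look roughly like this (a sketch under the assumption that only bias and mask can be None; names are illustrative):

```python
import jax.numpy as jnp

def _ensure_4d(t):
    # Helper no longer needs to handle None itself.
    t = jnp.asarray(t)
    return t[None, ...] if t.ndim == 3 else t

# Only bias and mask may be None, so the check lives at the call site.
bias = None
mask = jnp.ones((5, 2, 4))
bias = _ensure_4d(bias) if bias is not None else None
mask = _ensure_4d(mask) if mask is not None else None
assert bias is None
assert mask.shape == (1, 5, 2, 4)
```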

@@ -912,19 +912,25 @@ def dot_product_attention(
Returns:
An array of the attention output with the same shape as :code:`query`.
"""
original_shape = jnp.asarray(query).shape
def _preprocess_array(t):
Collaborator:

Nit: how about _ensure_4d?

Contributor Author:

Done.

query, key, value, bias, mask, is_causal=is_causal, scale=scale_val,
)
case _:
raise ValueError(f"Unsupported implementation option: {implementation}")

return jnp.reshape(out, original_shape)
Collaborator:

I guess you really just squeeze here, so you can do that instead of reshaping?

Contributor Author:

I use reshape instead of squeeze because I want to make sure that if users make a batch-aware call with (1, T, N, H) inputs, they still get results of the same shape rather than the squeezed shape (T, N, H).
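The difference can be seen with illustrative shapes (a batch-aware call with B == 1):

```python
import jax.numpy as jnp

out = jnp.ones((1, 5, 2, 4))        # (B, T, N, H) with B == 1
original_shape = out.shape

# squeeze would silently drop the size-1 batch dim the caller passed in;
# reshaping back to original_shape preserves it exactly.
assert jnp.squeeze(out).shape == (5, 2, 4)
assert jnp.reshape(out, original_shape).shape == (1, 5, 2, 4)
```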

fn_ans = lambda q, k, v, b, m: sdpa_ans(q, k, v, bias=b, mask=m)
_, sdpa_vjp_ans = jax.vjp(fn_ans, Q, K, V, bias, causal_mask)
if use_vmap:
sdpa_ans = jax.vmap(sdpa_ans, in_axes=(0, 0, 0, None, None), out_axes=0)
Collaborator:

I think your current implementation will fail if vmapped more than once, since it requires either a 3D or a 4D array.

Wdyt about handling the N>4 case by collapsing any extra leading dimensions into B? @sbodenstein does this make sense?

Contributor Author:

Why doesn't this work? I tried the following to mimic a 5D tensor and it works fine. Or am I missing your point?

    Q = random.normal(keys[0], (B, B, T, N, H), dtype)                          
    K = random.normal(keys[1], (B, B, S, N // G, H), dtype)                     
    V = random.normal(keys[2], (B, B, S, N // G, H), dtype)                     
    if use_bias:                                                                
      bias = random.normal(keys[3], (1, N, T, S), dtype)                        
    else:                                                                       
      bias = None                                                               
                                                                                
    is_causal = causal_mode == 'is_causal'                                      
    causal_mask = _get_causal_mask(T, S) if causal_mode == 'is_mask' else None  
                                                                                
    sdpa_ref = partial(sdpa, is_causal=is_causal, implementation=None)          
    sdpa_ans = partial(sdpa, is_causal=is_causal, implementation=impl)                                
                                                                                
    if use_vmap:                                                                
      sdpa_ans = jax.vmap(sdpa_ans, in_axes=(0, 0, 0, None, None), out_axes=0)  
      sdpa_ans = jax.vmap(sdpa_ans, in_axes=(0, 0, 0, None, None), out_axes=0)  
    K_ref = (jnp.repeat(K, G, axis=2) if G != 1 else K).reshape(B*B, S, N, H)   
    V_ref = (jnp.repeat(V, G, axis=2) if G != 1 else V).reshape(B*B, S, N, H)   
    Q_ref = Q.reshape(B*B, T, N, H)                                             
    out_ref = sdpa_ref(Q_ref, K_ref, V_ref, bias, causal_mask).reshape(B,B,T,N,H)
                                                                                
    out_ans = sdpa_ans(Q, K, V, bias, causal_mask)                              
    self.assertAllClose(out_ref, out_ans, atol=.01, rtol=.01)   

Collaborator:

Don't we assert that ndim is 3 or 4 now? I would expect this to fail given a 5D input.

Contributor Author:

Yes, it will fail if we directly pass in a 5D tensor. The current behavior supports (1) 4D tensors for those who want to use the batch-aware API, and (2) 3D tensors for those who want to use the API in the context of vmap. That means users with a 5D tensor need to use nested vmap, as shown above.
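The nested-vmap pattern can be reduced to a minimal sketch (f3d is a stand-in, not the real attention function): a function restricted to 3D per-example slices consumes a 5D tensor through two vmap levels, each stripping one leading batch axis.

```python
import jax
import jax.numpy as jnp

def f3d(q):
    # Stand-in for the 3D (vmap-context) entry point of the API.
    assert q.ndim == 3              # (T, N, H) per-example slice
    return q * 2.0

# One vmap level per extra batch dim: the outer vmap strips B1,
# the inner strips B2, so f3d only ever sees a 3D slice.
q5 = jnp.ones((2, 3, 5, 2, 4))      # (B1, B2, T, N, H)
out = jax.vmap(jax.vmap(f3d))(q5)
assert out.shape == q5.shape
```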

kaixih (Contributor Author) commented Aug 7, 2024

Gentle ping @superbobry

@google-ml-butler google-ml-butler bot added kokoro:force-run pull ready Ready for copybara import and testing labels Aug 7, 2024
@copybara-service copybara-service bot merged commit cce7250 into jax-ml:main Aug 7, 2024
8 checks passed
@jakevdp jakevdp mentioned this pull request Aug 7, 2024