Native Dot product attention #22760
Unanswered
AakashKumarNain asked this question in General
Replies: 2 comments · 7 replies
-
Looking into this; I will make the API accept unbatched input so it works better with vmap.
-
Can you give this change a shot: #22830?
With the latest release, we now have a native implementation of dot product attention via `jax.nn.dot_product_attention(...)`. This is great except for one thing. When building neural networks in JAX with libraries like Equinox, we are used to implementing functionality that works on a single example (not a batch_size of 1), and then `vmap`-ping over the model to make it work on a batch. With the batch axis baked into the attention implementation, I have two options; one of them is to add a leading batch dimension of 1 to `qkv` and then `squeeze` the output of attention. I can do either, but it seems to have broken my mental model for using `vmap`, because in that case, what does `batch` even mean anymore?