Attention, Transposed Convolutions, Embeddings, LayerNorm#38

Merged
patrick-kidger merged 47 commits into main from
attn-convt-layernorm
Mar 15, 2022

Conversation

@patrick-kidger
Owner

@patrick-kidger patrick-kidger commented Mar 10, 2022

Merging #34 into main via this branch, so as to make some tweaks first.

Updates relative to #34:

  • Added new layers to the documentation.
  • Added the mathematical details to the docstrings for MultiheadAttention and LayerNorm
  • Substantially overhauled the MultiheadAttention implementation. The implementation in #34 (ConvTranspose layers, MultiheadAttention, lookup embeddings, LayerNorm) was basically mimicking PyTorch's implementation, which has a very bad API. The new API is much more consistent, and much more general. (For reference, I found the Haiku implementation of MultiheadAttention to be the cleanest of all previous implementations.)
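
For readers skimming the diff, here is a rough sketch of the "more general" attention pattern being referred to: independent query/key/value/output projections plus scaled dot-product attention over heads. This is only an illustration under assumed shapes and hypothetical parameter names (`wq`, `wk`, `wv`, `wo`); it is not the actual Equinox `MultiheadAttention` API.

```python
import jax
import jax.numpy as jnp

def multihead_attention(params, query, key_, value, num_heads, mask=None):
    # Hypothetical illustration: `params` holds four projection matrices
    # ("wq", "wk", "wv", "wo"); query/key_/value have shape (seq, size).
    def split_heads(x, w):
        out = x @ w                                    # (seq, num_heads * head_size)
        return out.reshape(x.shape[0], num_heads, -1)  # (seq, num_heads, head_size)

    q = split_heads(query, params["wq"])
    k = split_heads(key_, params["wk"])
    v = split_heads(value, params["wv"])
    # Scaled dot-product attention, computed per head.
    logits = jnp.einsum("shd,Shd->hsS", q, k) / jnp.sqrt(q.shape[-1])
    if mask is not None:
        # mask broadcasts against (num_heads, query_seq, key_seq)
        logits = jnp.where(mask, logits, -jnp.inf)
    weights = jax.nn.softmax(logits, axis=-1)      # softmax over key positions
    attn = jnp.einsum("hsS,Shd->shd", weights, v)  # (seq, num_heads, head_size)
    return attn.reshape(attn.shape[0], -1) @ params["wo"]
```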

Plus some other misc updates in support of these changes:

  • Dropout now takes a fast path when p=0 (see the sketch after this list).
  • Enabled MathJax in the documentation.
  • Bumped version number.
  • Updated custom_types.Array and custom_types.PyTree to follow the convention in Diffrax; namely that they're now subscriptable and the documentation handles this.
  • Standardised on the spelling "normalisation" (rather than "normalization") just because that's the convention I happen to follow.
  • Other misc doc tweaks.
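
On the dropout fast path mentioned above: the idea is simply that p=0 (and inference mode) can skip mask generation entirely. A minimal functional sketch, not the actual equinox.nn.Dropout code:

```python
import jax.numpy as jnp
import jax.random as jr

def dropout(x, p, key, inference=False):
    # Fast path: with p == 0 (or at inference time) dropout is the identity,
    # so avoid generating a Bernoulli mask and rescaling.
    if inference or p == 0:
        return x
    keep = jr.bernoulli(key, 1 - p, x.shape)
    return jnp.where(keep, x / (1 - p), 0)
```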

CC @andyehrenberg (and @lucidrains ?) for any commentary on these changes prior to merging them in.

@andyehrenberg
Contributor

I agree that your overhauled MultiheadAttention has a better API. When I was implementing it, I was trying to make the Haiku implementation fit in with the PyTorch API, which I agree is pretty bad (with somewhat misleading arguments). I've been putting more thought into ConvTranspose lately and should soon have some extensions ready that take inspiration from how jax-ml/jax#5772 computes output sizes (mainly to handle more output_padding and padding cases). I'm getting consistent outputs between this new implementation and Haiku's ConvTranspose, but I want to test things a bit more.
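
For reference, the output-size relation under discussion is the standard transposed-convolution one (this is just the usual textbook formula, not the code from jax-ml/jax#5772 or from this branch):

```python
def conv_transpose_output_size(in_size, kernel_size, stride=1, padding=0,
                               output_padding=0, dilation=1):
    # Inverse of the usual convolution size formula; output_padding resolves
    # the ambiguity that arises when stride > 1.
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# e.g. inverting a stride-2, kernel-3, padding-1 convolution that produced length 16:
assert conv_transpose_output_size(16, 3, stride=2, padding=1, output_padding=1) == 32
```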

@patrick-kidger
Owner Author

patrick-kidger commented Mar 10, 2022

Excellent. I'm assuming you implemented MultiheadAttention on the basis that you need it for your own work? If you could give it a try and check that it looks like it's working for your use cases -- just to be sure -- then that'd be great.

As for transposed convolutions: staring at the implementation I'm not feeling convinced by it. These things jump out at me:

  1. This implementation only supports padding = 0 or padding = 1 (perhaps this is the issue you're describing above).
  2. self.padding is set to be a sequence-of-ints. But in the (untransposed) Conv it's set to be a sequence-of-(int, int).
  3. Would it be simpler to ignore lax.conv_transpose and work directly with lax.conv_general_dilated? Looking at the implementation of lax.conv_transpose, it's just a thin wrapper around lax.conv_general_dilated and I don't think we really hit any of the meaningful functionality that that wrapper provides.
  4. We can probably compute dimension_numbers for all dimensions, just by working directly with lax.ConvDimensionNumbers instead of the simplified string representations. (It doesn't look that tricky.)

WDYT?

(EDIT: if you haven't seen it before, this is quite a nice visual for transposed convolutions, where you can see how stride > 1 corresponds to "fractional strides", i.e. the lhs_dilation argument to lax.conv_general_dilated.)
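
To make the "fractional strides" point concrete: a stride-s transposed convolution can be written as an ordinary lax.conv_general_dilated call with lhs_dilation=s. The following is a hand-rolled 1D sketch of the adjoint of a VALID, stride-`stride` cross-correlation; it is illustrative only, under the stated shape conventions, and not the implementation proposed for Equinox:

```python
import jax.numpy as jnp
from jax import lax

def conv_transpose_1d(g, kernel, stride):
    # g: (out_channels, length) -- the signal to "up-convolve".
    # kernel: (out_channels, in_channels, k) -- the *forward* convolution's kernel.
    k = kernel.shape[-1]
    # The adjoint of a stride-`stride` VALID cross-correlation: dilate the input
    # by `stride` ("fractional strides"), pad by k - 1 on both sides, and
    # cross-correlate with the spatially flipped kernel, channel axes swapped.
    rhs = jnp.flip(kernel, axis=-1).swapaxes(0, 1)  # (in_channels, out_channels, k)
    out = lax.conv_general_dilated(
        g[None],                 # add a batch dimension: (1, out_channels, L)
        rhs,
        window_strides=(1,),
        padding=[(k - 1, k - 1)],
        lhs_dilation=(stride,),  # this is the "fractional stride"
    )
    return out[0]                # (in_channels, (L - 1) * stride + k)
```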

@patrick-kidger
Owner Author

With the other branch merged in: let me know once you're satisfied that both MultiheadAttention and ConvTranspose work for your use cases and I'll merge this branch + do a new release.

@patrick-kidger
Owner Author

Thanks, both @lucidrains and @andyehrenberg. I'll merge + release this PR tomorrow.

@patrick-kidger patrick-kidger merged commit 3343e84 into main Mar 15, 2022
@patrick-kidger patrick-kidger deleted the attn-convt-layernorm branch March 17, 2022 15:47