
Implementation of base ViT model #105

Merged
17 commits merged into FluxML:master from the vit branch on Feb 11, 2022
Conversation

@theabhirath (Member)

This is an implementation of the base ViT model as detailed in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This is quite a finicky model to implement, and while I'm fairly sure I've done it correctly, I would appreciate it if someone proofread the code just in case 😅 There are additional deps in the form of

  1. TensorCast (once Implementation of MLPMixer #103 gets merged, this should not be a problem), and
  2. LinearAlgebra, because of a utility function I had to write for batched matrix multiplication (I poached it off a StackOverflow answer; I hope that's not plagiarism). The function might not be the fastest tool for the job, though, so that's an additional thing to discuss (see the sketch at the end of this comment).

cc: @DhairyaLGandhi for encouraging me to start working on a ViT model 😄
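
For context, here is a rough sketch of what such a batched matrix multiplication can look like using NNlib's batched_mul rather than a hand-rolled LinearAlgebra utility (illustrative only, with made-up array sizes; this is not the utility from the PR):

using NNlib: batched_mul, batched_transpose

# Batched matrix multiplication over 3-D arrays: for every batch index i,
# C[:, :, i] = A[:, :, i] * B[:, :, i], with no explicit loop.
A = rand(Float32, 4, 6, 32)
B = rand(Float32, 6, 5, 32)
C = batched_mul(A, B)                            # size(C) == (4, 5, 32)

# batched_transpose lazily transposes each slice, e.g. for qᵀk-style attention scores
scores = batched_mul(batched_transpose(A), A)    # size(scores) == (6, 6, 32)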

@DhairyaLGandhi (Member)

Great go! Really excited to see this come through. I was thinking Transformers.jl might be a nice home for this to keep things organized.

@darsnack (Member) left a comment

This is a great start. Why don't we tackle iterating the transformer layers first before moving on to ViT itself?

@CarloLucibello (Member)

It would be good to have the attention layer and the whole transformer model in Flux, since they are of general use. PyTorch added them a while ago.

@theabhirath (Member, Author)

So this code has been cleaned up a lot more, and it looks much tighter IMO (I shifted a lot of the reusable layers to layers.jl as discussed). I think the main ViT model itself looks perfect; of course, if there are any design changes you think can be made, do let me know 😅. But the main issue now is the MHAttention layer: it's still written in a fashion that is not AD-friendly and may well be slower than we want...

@theabhirath (Member, Author)

theabhirath commented Feb 4, 2022

It would be good to have the attention layer and the whole transformer model in Flux, since they are of general use. PyTorch added them a while ago.

I was trying to write this, but a look at some PyTorch code, especially transformer-based models, made me realise that since vanilla attention is not used as-is very often, people end up writing their own code anyway. There are a lot of fancy implementations of attention with changes in the MLP blocks, in the heads, etc., and writing a very general version that exposes a lot of options to the user would make it a little cluttered.

@theabhirath theabhirath changed the base branch from master to compathelper/new_version/2022-02-04-03-06-36-596-01515794626 February 4, 2022 10:20
@theabhirath theabhirath changed the base branch from compathelper/new_version/2022-02-04-03-06-36-596-01515794626 to master February 4, 2022 10:21
@theabhirath theabhirath requested a review from darsnack February 4, 2022 10:22

@darsnack (Member) left a comment
Awesome, this is indeed looking quite a bit cleaner. I think the only major design consideration left is MHA, as you mentioned. I haven't had the time to think about this, and I likely won't until this weekend has passed, since I have a paper deadline on Sunday.

@theabhirath (Member, Author)

Awesome, this is indeed looking quite a bit cleaner. I think the only major design consideration left is MHA, as you mentioned. I haven't had the time to think about this, and I likely won't until this weekend has passed, since I have a paper deadline on Sunday.

No issues; I'll look up possible approaches in the meanwhile. Good luck with your paper!

@darsnack (Member)

darsnack commented Feb 7, 2022

I had a chance to implement a Parallel-based approach, which looks much cleaner in my opinion. It also appears to be faster than the PR's MHAttention on my machine. We'd still want to test it on the GPU.

Parallel implementation

using Flux
using NNlib: batched_mul, batched_transpose
# `chunk` is the package's own utility for splitting an array along a dimension
# (see the note further down about upstreaming it to Flux)

struct Attention{T}
    qkv::T
end

# a single dense layer produces the concatenated query, key and value projections
Attention(in, out) = Attention(Dense(in, out * 3; bias = false))

@functor Attention

function (attn::Attention)(x::AbstractArray{T}) where T
    q, k, v = chunk(attn.qkv(x), 3; dim = 1)
    scale = convert(T, sqrt(size(q, 1)))
    score = softmax(batched_mul(batched_transpose(q), k) / scale)
    attention = batched_mul(v, score)

    return attention
end

struct MultiHead{T, S}
    heads::T
    projection::S
end

function MultiHead(in, out, nheads; dropout = 0.)
    inheads, outheads = chunk(1:in, nheads), chunk(1:out, nheads)
    heads = Parallel(vcat, [Attention(length(i), length(o)) for (i, o) in zip(inheads, outheads)]...)
    projection = Chain(Dense(out, out), Dropout(dropout))

    MultiHead(heads, projection)
end

@functor MultiHead

function (mha::MultiHead)(x)
    # split the feature dimension into one slice per attention head
    xhead = chunk(x, length(mha.heads.layers); dim = 1)

    return mha.projection(mha.heads(xhead...))
end

Benchmarks

This is the code I used to set up the benchmark:

using BenchmarkTools

dh = 64
nheads = 3
d, n, b = dh * nheads, 20, 32
x = rand(Float32, d, n, b)

mha = MultiHead(d, d, nheads)                           # Parallel-based approach above
mhapr = MHAttention(d; heads = nheads, headplanes = dh) # implementation from this PR

Here are the results

julia> @benchmark $(mha)($x)
BenchmarkTools.Trial: 303 samples with 1 evaluation.
 Range (min … max):  14.584 ms … 47.184 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.825 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.490 ms ±  2.918 ms  ┊ GC (mean ± σ):  0.43% ± 1.38%

   ▃▄▆█▄█▁▁   ▁                                                
  ▄████████▇▅██▄▄▄▅▅▅▃▃▃▁▂▃▂▁▃▁▁▂▁▁▁▁▁▁▁▂▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂ ▃
  14.6 ms         Histogram: frequency by time        25.8 ms <

 Memory estimate: 5.26 MiB, allocs estimate: 438.

julia> @benchmark $(mhapr)($x)
BenchmarkTools.Trial: 85 samples with 1 evaluation.
 Range (min … max):  55.089 ms … 62.719 ms  ┊ GC (min … max): 0.00% … 3.66%
 Time  (median):     59.550 ms              ┊ GC (median):    3.79%
 Time  (mean ± σ):   59.256 ms ±  1.670 ms  ┊ GC (mean ± σ):  3.17% ± 1.71%

         ▁              ▁ ▁▃  ▁▁   ▃ ▁▆ ▃ ▃█▁                  
  ▄▁▁▁▄▄▄█▁▁▄▁▇▁▁▄▁▁▇▁▄▁█▄██▇▁██▄▇▇█▇██▇█▇███▁▇▁▁▇▄▇▁▁▁▄▄▁▁▄▄ ▁
  55.1 ms         Histogram: frequency by time        62.7 ms <

 Memory estimate: 50.50 MiB, allocs estimate: 5307.

Some other notes:

I think the implementation in the PR is wrong. It seems to be returning all zeros, which I think is due to the last @cast operation.

The chunk utility here should be a PR to Flux or maybe renamed (a rough sketch of what it does is included below).

I am using Flux#master.

Perhaps we can look at speeding up Parallel separately from this PR.
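
For readers unfamiliar with it, chunk simply splits an array into roughly equal pieces along a given dimension. A minimal sketch of the idea (hypothetical; the actual utility in the package, and the version proposed for Flux, may differ in the details):

# Split `x` into `n` roughly equal views along dimension `dim`.
function chunk(x::AbstractArray, n::Integer; dim::Integer = ndims(x))
    step = cld(size(x, dim), n)   # ceiling division so every element lands in a chunk
    return [selectdim(x, dim, idx) for idx in Iterators.partition(axes(x, dim), step)]
end

chunk(rand(Float32, 192, 20, 32), 3; dim = 1)   # three 64×20×32 views, e.g. q, k, v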

@darsnack (Member)

darsnack commented Feb 7, 2022

A few more numbers with (dh, nheads, n, b) == (64, 8, 100, 32).

julia> @benchmark $(mha)($x)
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  210.888 ms … 217.349 ms  ┊ GC (min … max): 0.35% … 0.42%
 Time  (median):     213.661 ms               ┊ GC (median):    0.35%
 Time  (mean ± σ):   213.556 ms ±   1.737 ms  ┊ GC (mean ± σ):  0.41% ± 0.13%

                █ █            ▃                                 
  ▇▁▁▇▇▇▁▁▁▁▇▁▁▁█▁█▁▁▁▁▁▁▁▁▇▇▁▇█▁▇▁▁▁▇▁▇▁▁▇▁▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▇ ▁
  211 ms           Histogram: frequency by time          217 ms <

 Memory estimate: 92.31 MiB, allocs estimate: 1164.

julia> @benchmark $(mhapr)($x)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.881 s … 1.898 s  ┊ GC (min … max): 0.51% … 0.60%
 Time  (median):     1.886 s             ┊ GC (median):    0.51%
 Time  (mean ± σ):   1.888 s ± 8.711 ms  ┊ GC (mean ± σ):  0.54% ± 0.05%

  █               █                                      █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.88 s        Histogram: frequency by time         1.9 s <

 Memory estimate: 307.26 MiB, allocs estimate: 89820.

@darsnack darsnack mentioned this pull request Feb 7, 2022

@theabhirath (Member, Author)

I think the implementation in the PR is wrong. It seems to be returning all zeros, which I think is due to the last @cast operation.

Yeah, there were quite a few issues because I was trying to implement MHA in a rather Pythonic way. Parallel seems like a very nice Julian way to solve the same problem; I made slight tweaks, but overall your suggestion fits very well. It also solves the problem of taking on additional deps.

The chunk utility here should be a PR to Flux or maybe renamed.

FluxML/Flux.jl#1841 😅

src/layers.jl
SkipConnection(prenorm(planes, mlpblock(planes, mlpplanes, dropout)), +))
for _ in 1:depth]

Chain(layers...)

On Flux master, this might be a good candidate for Chain(layers) to reduce load time.
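
For context, a small sketch of the difference being suggested, assuming a Flux version where Chain accepts a vector of layers (illustrative layer sizes, not the PR's code):

using Flux

layers = [SkipConnection(Chain(LayerNorm(64), Dense(64, 64)), +) for _ in 1:12]

m_splat  = Chain(layers...)   # tuple-backed: one type parameter per layer, heavier on the compiler
m_vector = Chain(layers)      # vector-backed: same forward pass, lighter compile/load times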


Nice that layers has a concrete eltype here.

@darsnack (Member) left a comment

Okay, I think we're at the point of cleaning up the final API. In addition to Michael's comments, I've left my own notes.

@theabhirath (Member, Author)

I think I've covered all the main suggestions. There are a couple involving adding types to arguments that I'm not sure about: while I'm all for it, the other model APIs don't do the same. Likewise with the @assert vs. throw(ArgumentError(...)) cases (a quick illustration of the two styles is below); I think separate PRs to deal with those issues make more sense?
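
For reference, a minimal illustration of the two validation styles being discussed (the function and the checks are hypothetical, not code from the PR):

# Hypothetical constructor-style check contrasting the two approaches.
function check_heads(planes::Integer, nheads::Integer)
    # user-facing validation: always throws a descriptive error
    nheads > 0 || throw(ArgumentError("number of heads must be positive, got $nheads"))
    # internal invariant: `@assert` is intended for developer checks and may be elided
    @assert planes % nheads == 0 "planes ($planes) must be divisible by nheads ($nheads)"
    return planes ÷ nheads
end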

@darsnack (Member) left a comment

Looks great, mostly docstrings that need updating. (Btw, all of these suggestions on GitHub can be committed through the web interface; that makes it easier to make sure nothing is missed from a review.)

@theabhirath theabhirath requested a review from darsnack February 10, 2022 01:45

@theabhirath (Member, Author)

I ran gradtest locally and it passes, quite a bit faster than for the other models, in fact 😂. This should be good to go now.

@theabhirath (Member, Author)

theabhirath commented Feb 11, 2022

Whoops. Both Flux and MLUtils are exporting flatten, and that's causing the problem.
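
For context, a minimal sketch of the clash and one way around it, namely qualifying the call instead of relying on the export (illustrative sizes):

using Flux, MLUtils   # both export `flatten`, so the bare name becomes ambiguous

x = rand(Float32, 4, 4, 3, 2)
y = Flux.flatten(x)   # qualify the call to pick the method you mean; size(y) == (48, 2)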

@darsnack darsnack merged commit dfc9a64 into FluxML:master Feb 11, 2022

@darsnack (Member)

Thank you for all the hard work and patience @theabhirath!

@theabhirath theabhirath deleted the vit branch February 12, 2022 01:02
@darsnack darsnack mentioned this pull request Mar 18, 2022