Use of relative attention #27

florianleopold · 2023-03-01T10:00:03Z

Hey all!

While looking through the VPT code I noticed the use of "relative attention logits" in the Self-Attention layers, as seen here:
https://github.com/openai/Video-Pre-Training/blob/main/lib/xf.py#L342

My working hypothesis is that this R-stream acts as a learnable, data-dependent bias on the attention logits, which would fit with how it is used in the attention function together with the b_nd matrix.
I was also wondering about the choice of nbasis = 10 as its per-head dimensionality. I read it as a form of bottleneck, but I am not sure how different values of nbasis would affect the network.
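
To make the hypothesis concrete, here is a minimal NumPy sketch of how I currently read it. It is not the code from xf.py: the function names, the single-head shapes, and the way relative offsets are indexed are all my own assumptions, and the only names taken from the repository are nbasis and b_nd (called `basis` below).

```python
import numpy as np


def relative_bias_logits(x, w_r, basis):
    """Hypothetical single-head sketch of a data-dependent relative bias.

    x:     (t, e)            token features
    w_r:   (e, nbasis)       projection producing the "R" coefficients (assumed)
    basis: (nbasis, maxlen)  learned curves over relative offsets (stand-in for b_nd)
    Returns (t, t) extra logits; entry [i, j] depends on query i and offset i - j.
    """
    t = x.shape[0]
    r = x @ w_r                 # (t, nbasis): per-query mixing coefficients
    per_offset = r @ basis      # (t, maxlen): one bias value per query and offset
    i = np.arange(t)[:, None]
    j = np.arange(t)[None, :]
    # Future positions (j > i) are clipped to offset 0; the causal mask hides them anyway.
    offset = np.clip(i - j, 0, basis.shape[1] - 1)
    return np.take_along_axis(per_offset, offset, axis=1)


def attention_with_relative_bias(q, k, v, extra):
    """Causal scaled dot-product attention with `extra` added to the logits."""
    t, e = q.shape
    logits = q @ k.T / np.sqrt(e) + extra
    causal = np.tril(np.ones((t, t), dtype=bool))
    logits = np.where(causal, logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

In this formulation each query's bias profile over offsets is a linear combination of nbasis shared curves, so `r @ basis` has rank at most nbasis. That is the sense in which I think of nbasis as a bottleneck: a larger value would allow more varied offset profiles per query at the cost of more parameters. Whether this matches the authors' intent is exactly what I am unsure about.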

I would really appreciate any further insights, corrections and references to other resources regarding this.

Miffyli · 2023-03-06T15:41:39Z

Hey. Good catches! Unfortunately I do not have any insight into why exactly they went with this approach. Your best bet is to email the paper's corresponding author(s), or to ask on another forum where someone might know the reasoning behind it.

Miffyli added the "question" label (Further information is requested) · Mar 6, 2023
