Hey all!

While looking through the VPT code I noticed the use of "relative attention logits" in the Self-Attention layers, as seen here:

https://github.com/openai/Video-Pre-Training/blob/main/lib/xf.py#L342

My current hypothesis is that this R-stream is used as a learnable, data-dependent bias on the attention logits, as also suggested by the attention function and the b_nd matrix.

I was also wondering about the use of nbasis = 10 as the per-head dimensionality for it, and thought of it as a form of bottleneck, but I am not sure how different values of nbasis would affect the network.

I would really appreciate any further insights, corrections, or references to other resources regarding this.
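To make my reading concrete, here is a rough sketch of how I imagine such a data-dependent relative bias working. This is my own simplification under assumptions (the class name RelBiasSelfAttention, the r_proj/rel_emb parameterization, and all shapes are mine), not the actual lib/xf.py implementation:

```python
# Sketch only: a self-attention layer whose logits get a data-dependent
# relative-position bias b_nd, with nbasis as a small per-head bottleneck.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelBiasSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int, max_len: int, nbasis: int = 10):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.d_head = d_model // n_head
        self.max_len = max_len
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # R-stream: project each token into a small nbasis-dim space per head...
        self.r_proj = nn.Linear(d_model, n_head * nbasis)
        # ...and pair it with a learned nbasis-dim embedding per relative offset.
        self.rel_emb = nn.Parameter(torch.randn(n_head, 2 * max_len - 1, nbasis) * 0.02)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (B, T, d_model) -> (B, n_head, T, d_head)
            return t.view(B, T, self.n_head, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Content logits: (B, n_head, T, T)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # Data-dependent bias: query position n uses its nbasis coefficients to
        # weight the relative-offset embeddings, giving one bias per (n, m) pair.
        r = self.r_proj(x).view(B, T, self.n_head, -1).transpose(1, 2)  # (B, H, T, nbasis)
        pos = torch.arange(T, device=x.device)
        idx = pos[:, None] - pos[None, :] + self.max_len - 1            # (T, T), offsets shifted >= 0
        rel = self.rel_emb[:, idx]                                      # (H, T, T, nbasis)
        b_nd = torch.einsum("bhnc,hnmc->bhnm", r, rel)                  # (B, H, T, T)

        attn = F.softmax(logits + b_nd, dim=-1)
        y = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
        return self.out(y)


# Quick smoke test of the shapes.
layer = RelBiasSelfAttention(d_model=64, n_head=4, max_len=128, nbasis=10)
out = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

In this reading, nbasis caps how expressive the per-head bias can be (and how many parameters r_proj and rel_emb add), which is why it looks like a bottleneck to me; how sensitive the network actually is to that value is exactly what I am unsure about.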
Hey. Good catches! Unfortunately, I do not have any insight into why exactly they went with this approach. Your best bet is to email the paper's corresponding author(s), or to ask this question on other forums where someone might know.