Description
In some cases, masked Attention and QAttention do not work correctly. Below, we analyze this using Attention as an example (the same applies to QAttention).
Description of the problem:
With some combinations of mask_index and unidirectional, we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.
Layer initialization
First, we define the parameters of Attention used in the example:

- num_heads = 13
- head_size = 37
- sequence_length = 7
- input_hidden_size = 147

However, the specific values do not matter for reproducing the problem.
Input initialization (a NumPy sketch follows the list):

- unidirectional = 1
- input ~ $N(0, 1)$
- weights ~ $N(0, 1)$
- bias ~ $N(0, 1)$
- mask_index ~ $F(0, 2)$, a discrete uniform distribution on the interval $[0, 2)$
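For concreteness, these inputs can be generated with NumPy roughly as follows; the batch size and the exact weight/bias shapes expected by the contrib op are assumptions on our part:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 1                        # assumed; not specified in the description
seq_len, input_hidden_size = 7, 147
hidden_size = 13 * 37                 # num_heads * head_size = 481

inputs  = rng.standard_normal((batch_size, seq_len, input_hidden_size)).astype(np.float32)  # ~ N(0, 1)
weights = rng.standard_normal((input_hidden_size, 3 * hidden_size)).astype(np.float32)      # ~ N(0, 1)
bias    = rng.standard_normal((3 * hidden_size,)).astype(np.float32)                        # ~ N(0, 1)
mask_index = rng.integers(0, 2, size=(batch_size, seq_len)).astype(np.int32)                # ~ F(0, 2)
```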
Detailed analysis of problems
In our analysis, we will fix mask_index with the following value:

mask_index = [0, 1, 1, 0, 0, 1, 0]

This means that when calculating Attention, we should not take into account the embeddings for the following tokens (indices): 0, 3, 4, 6.

Mathematically, this means that these tokens should not affect the calculation of the attention weights, and their corresponding weights should be equal to $0$.
The implementation of Attention in ORT uses the following mathematical assumption:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad e^{-\infty} = 0$$

In this case, if we replace the $QK^T$ scores of the masked tokens by $-\infty$, their attention weights become exactly $0$.
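As a quick sanity check of this assumption, here is a small self-contained NumPy sketch; the softmax helper and the example numbers are illustrative and are not taken from the ORT code:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.5, 0.8, -1.2, 0.3])   # hypothetical row of QK^T
mask   = np.array([1, 0, 1, 0])            # 1 = keep the token, 0 = mask it out

masked_scores = np.where(mask == 1, scores, -np.inf)
print(softmax(masked_scores))
# -> approximately [0.9759 0.     0.0241 0.    ]: masked positions get exactly zero weight
print(softmax(scores[mask == 1]))
# -> approximately [0.9759 0.0241]: the same non-zero values as above
```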
ORT implementation:
1. The implementation uses $-10000$ instead of $-\infty$. However, this may be insufficient in cases where some elements of the $QK^T$ tensor are themselves of the order of $-10000$. This is the first point to pay attention to when improving the implementation.
In practice, we have the following masking tensor:
$$mask= \lbrack \matrix{-10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
2. Using the unidirectional parameter means that, in order to calculate the attention weights for each token, that token "cannot look into the future". In other words, all tokens following it should not affect its attention weights, i.e. they should be masked. With this approach (when we apply the mask to the $QK^T$ tensor), this means that we must replace all elements above the main diagonal with $-\infty$. The expected masking tensor is:
$$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
However, in the current implementation, the mask values are not replaced but summed, which leads to the mask being non-binary (see the numeric sketch after this list):
$$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
3. Mathematically, applying this masking tensor should look like this:
$$(QK^T) \& (mask)$$
And in practice it should look like this:
$$(QK^T) \& (mask)= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.5382 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 1.40028 & 1.95172 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.03383 & 4.56666 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.95622 & 3.41114 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.81855 & 3.35107 & -10000 & -10000 & 0.834866 & -10000 \cr -10000 & 1.93507 & 1.65966 & -10000 & -10000 & 2.37467 & -10000} \rbrack$$
However, in the current implementation we have $(QK^T) + (mask)$, and taking into account point 2, we get the following result:
$$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -9994.52 & -9997.99 & -19997.9 & -20000 & -9996.66 & -20000.2 \cr -9997.57 & 6.5382 & -9997.21 & -19999.1 & -19999.5 & -9999.22 & -20001.1 \cr -10000.6 & 1.40028 & 1.95172 & -20002.2 & -20003.8 & -9998.68 & -20003.3 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20002 & -10000.2 & -20001.9 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -9998.97 & -20001.4 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000.7 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$
4. And after the magic fix for the huggingface implementation, we get the following:
$$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -9997.57 & 6.5382 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000.6 & 1.40028 & 1.95172 & -20000 & -20000 & -10000 & -20000 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20000 & -10000 & -20000 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -10000 & -20000 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$
If we do not apply the magic fix from point 4, or make the mask homogeneous (replace $-20000$ with $-10000$), then in some cases we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.
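To make points 2-4 concrete, here is a small self-contained NumPy sketch that builds the key-padding and unidirectional masks for the example above and combines them both ways: by summation (the current behaviour, which produces the $-20000$ entries) and by replacement. The softmax helper, the random $QK^T$ row and the use of np.minimum for the "replace" semantics are our own illustrative choices, not the actual ORT kernel code:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 7
NEG = -10000.0
mask_index = np.array([0, 1, 1, 0, 0, 1, 0])     # key-padding mask from the description

# key-padding part of the mask, broadcast over all query rows
padding = np.where(mask_index == 1, 0.0, NEG) * np.ones((seq_len, 1))

# unidirectional (causal) part of the mask: NEG strictly above the main diagonal
causal = np.triu(np.full((seq_len, seq_len), NEG), k=1)

summed   = padding + causal               # current behaviour: -20000 entries appear
replaced = np.minimum(padding, causal)    # "replace" semantics: mask stays in {0, -10000}

print(np.unique(summed))                  # [-20000. -10000.      0.]
print(np.unique(replaced))                # [-10000.      0.]

# Query row 0: every key is masked, so no key should stand out.
# With the summed mask, exp(-20000 + x) vanishes relative to exp(-10000 + x),
# so all of the probability mass lands on the keys that received "only" -10000.
rng = np.random.default_rng(0)
qk_row = rng.standard_normal(seq_len)     # stand-in for row 0 of QK^T / sqrt(head_size)
print(softmax(qk_row + summed[0]))        # non-zero only at positions 0, 1, 2 and 5
```

With the replaced (binary) mask, all masked keys are at least treated uniformly; with the summed mask, the artificial distinction between $-10000$ and $-20000$ decides where the probability mass goes.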
Problems arising from this implementation:
1. Consider the first row of the $(QK^T) + (mask)$ tensor:

$$[-10006.6, -10000, -10000, -20000, -20000, -10000, -20000]$$
From a theoretical point of view, this means that all sequence tokens are masked and should have the following "equiprobable" weights:
$$softmax=[0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14]$$
However, in the current implementation, due to the heterogeneity introduced in point 2, we have the following probability distribution for the tokens:
$$softmax=[0.0005, 0.3332, 0.3332, 0, 0, 0.3332, 0]$$
The probability mass is concentrated on only three elements, which is very different from the theoretical expectation.
2. Also, based on points 1 and 3, we have "bad softmax conditionality" for similar rows of the $(QK^T) + (mask)$ tensor:
```
delta: -0.0
scores: [-10000. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -1.0
scores: [-10001. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.06 0.16 0.16 0.16 0.16 0.16 0.16]

delta: -2.0
scores: [-10002. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.02 0.16 0.16 0.16 0.16 0.16 0.16]

delta: -3.0
scores: [-10003. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.01 0.17 0.17 0.17 0.17 0.17 0.17]

delta: -4.0
scores: [-10004. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.   0.17 0.17 0.17 0.17 0.17 0.17]
```

It can be seen from this computational experiment that a small relative change in one of the elements leads to a complete change in the probability distribution for the tokens.
However, if we replace $-10000$ with a value of much larger magnitude, for example $-10^8$, the distribution remains stable under the same perturbations:

```
delta: -0.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -1.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -2.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -3.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -4.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]
```

To reproduce
Run Attention with the parameter and input initialization from the description.
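A sketch of one way to build and run such a model through the com.microsoft.Attention contrib op is shown below (with an onnx package version compatible with ORT 1.13). The tensor shapes (in particular the (input_hidden_size, 3 * hidden_size) weight layout), the 2D int32 raw-mask format and batch_size = 1 are assumptions on our part:

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper
import onnxruntime as ort

# Parameters from the description
batch_size, seq_len = 1, 7               # batch_size = 1 is an assumption
num_heads, head_size = 13, 37
input_hidden_size = 147
hidden_size = num_heads * head_size      # 481

rng = np.random.default_rng(0)
x = rng.standard_normal((batch_size, seq_len, input_hidden_size)).astype(np.float32)
w = rng.standard_normal((input_hidden_size, 3 * hidden_size)).astype(np.float32)
b = rng.standard_normal((3 * hidden_size,)).astype(np.float32)
mask = np.array([[0, 1, 1, 0, 0, 1, 0]], dtype=np.int32)   # raw key-padding mask from the analysis

node = helper.make_node(
    "Attention",
    inputs=["input", "weight", "bias", "mask_index"],
    outputs=["output"],
    domain="com.microsoft",
    num_heads=num_heads,
    unidirectional=1,
)

graph = helper.make_graph(
    [node],
    "attention_mask_repro",
    inputs=[
        helper.make_tensor_value_info("input", TensorProto.FLOAT,
                                      [batch_size, seq_len, input_hidden_size]),
        helper.make_tensor_value_info("mask_index", TensorProto.INT32,
                                      [batch_size, seq_len]),
    ],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT,
                                           [batch_size, seq_len, hidden_size])],
    initializer=[
        numpy_helper.from_array(w, name="weight"),
        numpy_helper.from_array(b, name="bias"),
    ],
)

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14), helper.make_opsetid("com.microsoft", 1)],
)

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
(out,) = sess.run(None, {"input": x, "mask_index": mask})
print(out.shape)                         # (1, 7, 481)
# Compare `out` against a reference attention implementation with the same
# weights and mask to observe the difference described above.
```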
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.13.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response