[Bug] Attention and QAttention don't work properly in some cases #14363

@KJlaccHoeUM9l

Description

In some cases, masked Attention and QAttention do not work correctly. Below we analyze this using Attention as the example; the same reasoning applies to QAttention.

Description of the problem:
With some combinations of mask_index and unidirectional, we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.

Layer initialization
First, we define the parameters that configure Attention in the example:

  • num_heads = 13
  • head_size = 37
  • sequence_length = 7
  • input_hidden_size = 147

However, the specific values do not matter for reproducing the problem.

Input initialization:

  • unidirectional = 1
  • input ~ $N(0,1)$
  • weights ~ $N(0,1)$
  • bias ~ $N(0,1)$
  • mask_index ~ $U\{0, 1\}$, a discrete uniform distribution over $\{0, 1\}$
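
A minimal NumPy sketch of this initialization (the batch size, RNG seed, and the packed Q/K/V weight layout are assumptions; shapes follow the com.microsoft.Attention contrib-op convention of hidden_size = num_heads * head_size):

```python
import numpy as np

batch_size = 1                                   # assumed; not specified above
num_heads, head_size = 13, 37
sequence_length = 7
input_hidden_size = 147
hidden_size = num_heads * head_size              # 481

rng = np.random.default_rng(0)                   # arbitrary seed

# input, weights, and bias ~ N(0, 1); mask_index ~ discrete uniform over {0, 1}
input_data = rng.standard_normal(
    (batch_size, sequence_length, input_hidden_size)).astype(np.float32)
weights = rng.standard_normal(
    (input_hidden_size, 3 * hidden_size)).astype(np.float32)   # packed Q/K/V projections
bias = rng.standard_normal((3 * hidden_size,)).astype(np.float32)
mask_index = rng.integers(0, 2, size=(batch_size, sequence_length)).astype(np.int32)
```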

Detailed analysis of problems
In our analysis, we fix mask_index to the following value:
$$[0, 1, 1, 0, 0, 1, 0]$$

This means that when computing Attention, the embeddings of the following tokens (indices) must not be taken into account: $[0, 3, 4, 6]$

Mathematically, this means that these tokens should not affect the calculation of attention weights, and their corresponding weights should be equal to $0$:
$$\text{attention weights} = \operatorname{softmax}(QK^T)$$
$$\operatorname{softmax}_i(z) = \frac{\exp(z_i)}{\sum_k \exp(z_k)}$$

The implementation of Attention in ORT uses the following mathematical assumption:
$$\exp(-\infty)=0$$

In this case, if we replace the elements of the $QK^T$ tensor that must be masked with $-\infty$, the final weights satisfy the masking requirement.
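
A minimal sketch of this exact-masking approach (NumPy, with random scores standing in for one head's $QK^T$ slice; this is an illustration, not the ORT kernel):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

mask_index = np.array([0, 1, 1, 0, 0, 1, 0])     # 1 = keep, 0 = mask
scores = np.random.randn(7, 7).astype(np.float32)  # stand-in for one head's QK^T

# Replace masked columns with -inf: exp(-inf) = 0, so the corresponding
# attention weights become exactly zero.
masked_scores = np.where(mask_index[None, :] == 1, scores, -np.inf)
weights = softmax(masked_scores)

print(weights[:, mask_index == 0].max())         # 0.0 -- masked tokens get zero weight
print(weights.sum(axis=-1))                      # every row still sums to 1
```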

ORT implementation:

  1. The implementation uses $-10000$ instead of $-\infty$. However, this may be insufficient when some elements of the $QK^T$ tensor are themselves on the order of $-10000$.
    This is the first point to pay attention to when improving the implementation.
    In practice, we have the following masking tensor:
    $$mask= \lbrack \matrix{-10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$

  2. Using the unidirectional parameter means that, when the attention weights for a given token are computed, that token "cannot look into the future". In other words, all tokens following it must not affect its attention weights, i.e. they must be masked.
    With this approach (applying the mask to the $QK^T$ tensor), all elements above the main diagonal must be replaced with $-\infty$.
    $$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
    However, in the current implementation the mask values are not replaced but summed, so the mask is no longer binary (see the NumPy sketch below):
    $$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$

  3. Mathematically, applying this masking tensor should look like this:
    $$(QK^T) \& (mask)$$
    And in practice it should look like this:
    $$(QK^T) \& (mask)= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.5382 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 1.40028 & 1.95172 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.03383 & 4.56666 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.95622 & 3.41114 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.81855 & 3.35107 & -10000 & -10000 & 0.834866 & -10000 \cr -10000 & 1.93507 & 1.65966 & -10000 & -10000 & 2.37467 & -10000} \rbrack$$
    However, the current implementation computes $(QK^T) + (mask)$ and, taking point 2 into account, we get the following result:
    $$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -9994.52 & -9997.99 & -19997.9 & -20000 & -9996.66 & -20000.2 \cr -9997.57 & 6.5382 & -9997.21 & -19999.1 & -19999.5 & -9999.22 & -20001.1 \cr -10000.6 & 1.40028 & 1.95172 & -20002.2 & -20003.8 & -9998.68 & -20003.3 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20002 & -10000.2 & -20001.9 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -9998.97 & -20001.4 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000.7 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$

  4. After the "magic fix" for the Hugging Face implementation, we get the following:
    $$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -9997.57 & 6.5382 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000.6 & 1.40028 & 1.95172 & -20000 & -20000 & -10000 & -20000 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20000 & -10000 & -20000 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -10000 & -20000 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$

If we do not apply the magic fix from point 4, or if we make it homogeneous (replace $-20000$ with $-10000$), then in some cases we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.
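
The following NumPy sketch models the masking arithmetic described in points 1-3 (a simplified illustration, not the actual ORT kernel code): summing the padding mask and the unidirectional mask yields the non-binary $-20000$ entries, whereas combining them with an element-wise minimum keeps every masked position at exactly $-10000$.

```python
import numpy as np

seq_len = 7
mask_index = np.array([0, 1, 1, 0, 0, 1, 0], dtype=np.int32)

# Additive padding mask from point 1: 0 where the token is kept, -10000 where masked.
pad_mask = ((1 - mask_index) * -10000.0).astype(np.float32)          # shape (seq,)
pad_mask = np.broadcast_to(pad_mask, (seq_len, seq_len)).copy()      # shape (seq, seq)

# Unidirectional (causal) part from point 2: -10000 above the main diagonal.
causal_mask = np.triu(np.full((seq_len, seq_len), -10000.0, dtype=np.float32), k=1)

# Behaviour described above: the two masks are summed, so positions masked by
# both become -20000 and the mask stops being binary.
summed = pad_mask + causal_mask
print(summed[0])       # [-10000. -10000. -10000. -20000. -20000. -10000. -20000.]

# A homogeneous alternative: element-wise minimum keeps every masked
# position at exactly -10000.
homogeneous = np.minimum(pad_mask, causal_mask)
print(homogeneous[0])  # [-10000. -10000. -10000. -10000. -10000. -10000. -10000.]
```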

Problems arising from this implementation:

  • Consider the first row of the $(QK^T) + (mask)$ tensor:
    $$[-10006.6, -10000, -10000, -20000, -20000, -10000, -20000]$$
    From a theoretical point of view, this means that all sequence tokens must be masked and have the following "equiprobable" weights:
    $$softmax=[0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14]$$
    However, in the current implementation, due to the heterogeneity introduced in point 2, we have the following probability distribution for tokens:
    $$softmax=[0.0005, 0.3332, 0.3332, 0, 0, 0.3332, 0]$$
    The probability mass is concentrated on only three elements, which is very different from the theoretical expectation.

  • Also, based on points 1 and 3, the softmax is badly conditioned for rows of the $(QK^T) + (mask)$ tensor similar to the ones below (a script reproducing this experiment is sketched after the output):

delta:  -0.0
scores:  [-10000. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta:  -1.0
scores:  [-10001. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax:  [0.06 0.16 0.16 0.16 0.16 0.16 0.16]

delta:  -2.0
scores:  [-10002. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax:  [0.02 0.16 0.16 0.16 0.16 0.16 0.16]

delta:  -3.0
scores:  [-10003. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax:  [0.01 0.17 0.17 0.17 0.17 0.17 0.17]

delta:  -4.0
scores:  [-10004. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax:  [0.   0.17 0.17 0.17 0.17 0.17 0.17]

It can be seen from this computational experiment that a small relative change in a single element completely changes the probability distribution over the tokens.
However, if we replace $-10^4$ with $-10^8$, this problem is solved (for the same small perturbations):

delta:  -0.0
scores:  [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta:  -1.0
scores:  [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta:  -2.0
scores:  [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta:  -3.0
scores:  [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta:  -4.0
scores:  [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax:  [0.14 0.14 0.14 0.14 0.14 0.14 0.14]
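
A sketch that reproduces both observations above (float32 arithmetic is assumed, matching the CPU kernel; at magnitude $10^8$, perturbations of a few units are absorbed by float32 rounding, which is why the second experiment stays uniform):

```python
import numpy as np

np.set_printoptions(precision=4, suppress=True)

def softmax(z):
    z = z - z.max()                    # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# First row of (QK^T) + (mask) from above: theoretically fully masked,
# but the -20000 entries concentrate the mass on three positions.
row = np.array([-10006.6, -10000, -10000, -20000, -20000, -10000, -20000],
               dtype=np.float32)
print(softmax(row))                    # ~[0.0005 0.333 0.333 0. 0. 0.333 0.]

# Conditioning experiment: perturb one element of a fully masked row.
for mask_value in (-1e4, -1e8):
    print('mask value:', mask_value)
    for delta in (-0.0, -1.0, -2.0, -3.0, -4.0):
        scores = np.full(7, mask_value, dtype=np.float32)
        scores[0] += delta             # at -1e8 this is lost to float32 rounding
        print('delta: ', delta)
        print('scores: ', scores)
        print('softmax: ', softmax(scores))
```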

To reproduce

Run Attention with the parameter and input initialization from the description.
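
A sketch of a possible reproduction script (the opset versions, tensor names, and RNG seed are assumptions; shapes follow the com.microsoft.Attention contrib-op spec, with a 2-D raw attention mask passed as mask_index):

```python
import numpy as np
from onnx import TensorProto, helper
import onnxruntime as ort

batch, seq, input_hidden = 1, 7, 147
num_heads, head_size = 13, 37
hidden = num_heads * head_size                   # 481

node = helper.make_node(
    "Attention",
    inputs=["input", "weight", "bias", "mask_index"],
    outputs=["output"],
    domain="com.microsoft",
    num_heads=num_heads,
    unidirectional=1,
)

graph = helper.make_graph(
    [node], "attention_mask_repro",
    inputs=[
        helper.make_tensor_value_info("input", TensorProto.FLOAT, [batch, seq, input_hidden]),
        helper.make_tensor_value_info("weight", TensorProto.FLOAT, [input_hidden, 3 * hidden]),
        helper.make_tensor_value_info("bias", TensorProto.FLOAT, [3 * hidden]),
        helper.make_tensor_value_info("mask_index", TensorProto.INT32, [batch, seq]),
    ],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [batch, seq, hidden])],
)

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14), helper.make_opsetid("com.microsoft", 1)],
)

rng = np.random.default_rng(0)
feeds = {
    "input": rng.standard_normal((batch, seq, input_hidden)).astype(np.float32),
    "weight": rng.standard_normal((input_hidden, 3 * hidden)).astype(np.float32),
    "bias": rng.standard_normal((3 * hidden,)).astype(np.float32),
    "mask_index": np.array([[0, 1, 1, 0, 0, 1, 0]], dtype=np.int32),  # mask from the description
}

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
(output,) = sess.run(None, feeds)
print(output.shape)                              # (1, 7, 481)
```

The resulting output can then be compared against a reference attention implementation to observe the differences described above.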

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

v1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response
