Description
In some cases, masked Attention and QAttention do not work correctly. Below, we analyze this using Attention as an example (the same applies to QAttention).
Description of the problem:
With some combinations of mask_index and unidirectional, we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.
Layer initialization
First, we define the parameters of Attention used in the example:

- num_heads = 13
- head_size = 37
- sequence_length = 7
- input_hidden_size = 147

However, the specific values do not matter for reproducing the problem.
Input initialization (a NumPy sketch follows the list):

- unidirectional = 1
- input ~ $N(0, 1)$
- weights ~ $N(0, 1)$
- bias ~ $N(0, 1)$
- mask_index ~ $F(0, 2)$, a discrete uniform distribution on the interval $[0, 2)$
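For concreteness, these inputs can be generated with NumPy roughly as follows; the batch size and the exact weight/bias shapes expected by the contrib op are assumptions on our part:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 1                        # assumed; not specified in the description
seq_len, input_hidden_size = 7, 147
hidden_size = 13 * 37                 # num_heads * head_size = 481

inputs  = rng.standard_normal((batch_size, seq_len, input_hidden_size)).astype(np.float32)  # ~ N(0, 1)
weights = rng.standard_normal((input_hidden_size, 3 * hidden_size)).astype(np.float32)      # ~ N(0, 1)
bias    = rng.standard_normal((3 * hidden_size,)).astype(np.float32)                        # ~ N(0, 1)
mask_index = rng.integers(0, 2, size=(batch_size, seq_len)).astype(np.int32)                # ~ F(0, 2)
```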
Detailed analysis of problems
In our analysis, we will fix mask_index with the following value:

mask_index = [0, 1, 1, 0, 0, 1, 0]

This means that when calculating Attention, we should not take into account the embeddings for the following tokens (indices): 0, 3, 4, 6.

Mathematically, this means that these tokens should not affect the calculation of the attention weights, and their corresponding weights should be equal to $0$.
The implementation of Attention in ORT uses the following mathematical assumption:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad e^{-\infty} = 0$$

In this case, if we replace the $QK^T$ scores of the masked tokens by $-\infty$, their attention weights become exactly $0$.
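As a quick sanity check of this assumption, here is a small self-contained NumPy sketch; the softmax helper and the example numbers are illustrative and are not taken from the ORT code:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.5, 0.8, -1.2, 0.3])   # hypothetical row of QK^T
mask   = np.array([1, 0, 1, 0])            # 1 = keep the token, 0 = mask it out

masked_scores = np.where(mask == 1, scores, -np.inf)
print(softmax(masked_scores))
# -> approximately [0.9759 0.     0.0241 0.    ]: masked positions get exactly zero weight
print(softmax(scores[mask == 1]))
# -> approximately [0.9759 0.0241]: the same non-zero values as above
```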
ORT implementation:
1. The implementation uses $-10000$ instead of $-\infty$. However, this may be insufficient in cases where some elements of the $QK^T$ tensor are themselves of the order of $-10000$. This is the first point to pay attention to when improving the implementation.
In practice, we have the following masking tensor:
$$mask= \lbrack \matrix{-10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
2. Using the unidirectional parameter means that, in order to calculate the attention weights for each token, that token "cannot look into the future". In other words, all tokens following it should not affect its attention weights, i.e. they should be masked. With this approach (when we apply the mask to the $QK^T$ tensor), this means that we must replace all elements above the main diagonal with $-\infty$. The expected masking tensor is:
$$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
However, in the current implementation, the mask values are not replaced but summed, which leads to the mask being non-binary (see the numeric sketch after this list):
$$mask= \lbrack \matrix{-10000 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -20000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -20000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & -10000 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -20000 \cr -10000 & 0 & 0 & -10000 & -10000 & 0 & -10000} \rbrack$$
3. Mathematically, applying this masking tensor should look like this:
$$(QK^T) \& (mask)$$
And in practice it should look like this:
$$(QK^T) \& (mask)= \lbrack \matrix{-10000 & -10000 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.5382 & -10000 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 1.40028 & 1.95172 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 6.03383 & 4.56666 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.95622 & 3.41114 & -10000 & -10000 & -10000 & -10000 \cr -10000 & 2.81855 & 3.35107 & -10000 & -10000 & 0.834866 & -10000 \cr -10000 & 1.93507 & 1.65966 & -10000 & -10000 & 2.37467 & -10000} \rbrack$$
However, in the current implementation we have $(QK^T) + (mask)$, and taking into account point 2, we get the following result:
$$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -9994.52 & -9997.99 & -19997.9 & -20000 & -9996.66 & -20000.2 \cr -9997.57 & 6.5382 & -9997.21 & -19999.1 & -19999.5 & -9999.22 & -20001.1 \cr -10000.6 & 1.40028 & 1.95172 & -20002.2 & -20003.8 & -9998.68 & -20003.3 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20002 & -10000.2 & -20001.9 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -9998.97 & -20001.4 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000.7 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$
4. And after the magic fix for the huggingface implementation, we get the following:
$$(QK^T) + (mask)= \lbrack \matrix{-10006.6 & -10000 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -9997.57 & 6.5382 & -10000 & -20000 & -20000 & -10000 & -20000 \cr -10000.6 & 1.40028 & 1.95172 & -20000 & -20000 & -10000 & -20000 \cr -10002.6 & 6.03383 & 4.56666 & -10001 & -20000 & -10000 & -20000 \cr -9999.71 & 2.95622 & 3.41114 & -9999.45 & -10000 & -10000 & -20000 \cr -10001.8 & 2.81855 & 3.35107 & -10001.4 & -10001.1 & 0.834866 & -20000 \cr -9999.03 & 1.93507 & 1.65966 & -10000.7 & -10001 & 2.37467 & -9999.72} \rbrack$$
If we do not apply the magic fix from point 4, or make the mask homogeneous (replace $-20000$ with $-10000$), then in some cases we get a 10-50% difference in the output tensor of Attention, and the relative difference of individual elements can reach 2000.
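To make points 2-4 concrete, here is a small self-contained NumPy sketch that builds the key-padding and unidirectional masks for the example above and combines them both ways: by summation (the current behaviour, which produces the $-20000$ entries) and by replacement. The softmax helper, the random $QK^T$ row and the use of np.minimum for the "replace" semantics are our own illustrative choices, not the actual ORT kernel code:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 7
NEG = -10000.0
mask_index = np.array([0, 1, 1, 0, 0, 1, 0])     # key-padding mask from the description

# key-padding part of the mask, broadcast over all query rows
padding = np.where(mask_index == 1, 0.0, NEG) * np.ones((seq_len, 1))

# unidirectional (causal) part of the mask: NEG strictly above the main diagonal
causal = np.triu(np.full((seq_len, seq_len), NEG), k=1)

summed   = padding + causal               # current behaviour: -20000 entries appear
replaced = np.minimum(padding, causal)    # "replace" semantics: mask stays in {0, -10000}

print(np.unique(summed))                  # [-20000. -10000.      0.]
print(np.unique(replaced))                # [-10000.      0.]

# Query row 0: every key is masked, so no key should stand out.
# With the summed mask, exp(-20000 + x) vanishes relative to exp(-10000 + x),
# so all of the probability mass lands on the keys that received "only" -10000.
rng = np.random.default_rng(0)
qk_row = rng.standard_normal(seq_len)     # stand-in for row 0 of QK^T / sqrt(head_size)
print(softmax(qk_row + summed[0]))        # non-zero only at positions 0, 1, 2 and 5
```

With the replaced (binary) mask, all masked keys are at least treated uniformly; with the summed mask, the artificial distinction between $-10000$ and $-20000$ decides where the probability mass goes.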
Problems arising from this implementation:
1. Consider the first row of the $(QK^T) + (mask)$ tensor:

$$[-10006.6, -10000, -10000, -20000, -20000, -10000, -20000]$$
From a theoretical point of view, this means that all sequence tokens are masked and should have the following "equiprobable" weights:
$$softmax=[0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14]$$
However, in the current implementation, due to the heterogeneity introduced in point 2, we have the following probability distribution for the tokens:
$$softmax=[0.0005, 0.3332, 0.3332, 0, 0, 0.3332, 0]$$
The probability mass is concentrated on only three elements, which is very different from the theoretical expectation.
2. Also, based on points 1 and 3, we have "bad softmax conditionality" for similar rows of the $(QK^T) + (mask)$ tensor:
```
delta: -0.0
scores: [-10000. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -1.0
scores: [-10001. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.06 0.16 0.16 0.16 0.16 0.16 0.16]

delta: -2.0
scores: [-10002. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.02 0.16 0.16 0.16 0.16 0.16 0.16]

delta: -3.0
scores: [-10003. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.01 0.17 0.17 0.17 0.17 0.17 0.17]

delta: -4.0
scores: [-10004. -10000. -10000. -10000. -10000. -10000. -10000.]
softmax: [0.   0.17 0.17 0.17 0.17 0.17 0.17]
```

It can be seen from this computational experiment that a small relative change in one of the elements leads to a complete change in the probability distribution for the tokens.
However, if we replace $-10000$ with a value of much larger magnitude, for example $-10^8$, the distribution remains stable under the same perturbations:

```
delta: -0.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -1.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -2.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -3.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]

delta: -4.0
scores: [-1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08 -1.e+08]
softmax: [0.14 0.14 0.14 0.14 0.14 0.14 0.14]
```

To reproduce
Run Attention with the parameter and input initialization from the description.
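A sketch of one way to build and run such a model through the com.microsoft.Attention contrib op is shown below (with an onnx package version compatible with ORT 1.13). The tensor shapes (in particular the (input_hidden_size, 3 * hidden_size) weight layout), the 2D int32 raw-mask format and batch_size = 1 are assumptions on our part:

```python
import numpy as np
from onnx import TensorProto, helper, numpy_helper
import onnxruntime as ort

# Parameters from the description
batch_size, seq_len = 1, 7               # batch_size = 1 is an assumption
num_heads, head_size = 13, 37
input_hidden_size = 147
hidden_size = num_heads * head_size      # 481

rng = np.random.default_rng(0)
x = rng.standard_normal((batch_size, seq_len, input_hidden_size)).astype(np.float32)
w = rng.standard_normal((input_hidden_size, 3 * hidden_size)).astype(np.float32)
b = rng.standard_normal((3 * hidden_size,)).astype(np.float32)
mask = np.array([[0, 1, 1, 0, 0, 1, 0]], dtype=np.int32)   # raw key-padding mask from the analysis

node = helper.make_node(
    "Attention",
    inputs=["input", "weight", "bias", "mask_index"],
    outputs=["output"],
    domain="com.microsoft",
    num_heads=num_heads,
    unidirectional=1,
)

graph = helper.make_graph(
    [node],
    "attention_mask_repro",
    inputs=[
        helper.make_tensor_value_info("input", TensorProto.FLOAT,
                                      [batch_size, seq_len, input_hidden_size]),
        helper.make_tensor_value_info("mask_index", TensorProto.INT32,
                                      [batch_size, seq_len]),
    ],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT,
                                           [batch_size, seq_len, hidden_size])],
    initializer=[
        numpy_helper.from_array(w, name="weight"),
        numpy_helper.from_array(b, name="bias"),
    ],
)

model = helper.make_model(
    graph,
    opset_imports=[helper.make_opsetid("", 14), helper.make_opsetid("com.microsoft", 1)],
)

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
(out,) = sess.run(None, {"input": x, "mask_index": mask})
print(out.shape)                         # (1, 7, 481)
# Compare `out` against a reference attention implementation with the same
# weights and mask to observe the difference described above.
```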
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.13.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response