
Falcon output with alibi bias is different when output_attentions=True #29946

@ssharpe42

Description


System Info


  • transformers version: 4.39.1
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.17
  • Huggingface_hub version: 0.22.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help?

@ArthurZucker and @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Running the same sample through a Falcon model with output_attentions=True vs. output_attentions=False returns different values. I think this implies something is off in the new SDPA implementation?

import torch
from transformers import FalconConfig, FalconForCausalLM

VOCAB_SIZE = 1000
HIDDEN_SIZE = 64
NUM_HIDDEN_LAYERS = 3
NUM_ATTENTION_HEADS = 4
INTERMEDIATE_SIZE = HIDDEN_SIZE * 4
MAX_POSITION_EMBEDDINGS = 2048
INPUT_IDS = torch.randint(0, VOCAB_SIZE, (5, 20))

config = FalconConfig(
    vocab_size=VOCAB_SIZE,
    hidden_size=HIDDEN_SIZE,
    num_hidden_layers=NUM_HIDDEN_LAYERS,
    num_attention_heads=NUM_ATTENTION_HEADS,
    new_decoder_architecture=True,
    alibi=True,
)

falcon = FalconForCausalLM(config)

falcon_output = falcon(INPUT_IDS, output_attentions=True)[0]
falcon_output2 = falcon(INPUT_IDS)[0]
print(torch.allclose(falcon_output, falcon_output2, atol=1e-3))  # False

In [30]: print(falcon_output[0])
    ...: print(falcon_output2[0])
tensor([[ 0.3474,  0.0397, -0.0542,  ...,  0.0310,  0.2182, -0.0476],
        [ 0.2459, -0.0962, -0.1354,  ...,  0.1950, -0.2991,  0.2416],
        [ 0.1041, -0.0469,  0.0851,  ..., -0.1261,  0.0160, -0.0514],
        ...,
        [ 0.1683, -0.0328, -0.0490,  ..., -0.1198,  0.2471,  0.3014],
        [ 0.1236, -0.1986, -0.1901,  ...,  0.0341, -0.0316,  0.1492],
        [-0.0388,  0.0754,  0.0067,  ..., -0.1129,  0.0227,  0.0597]],
       grad_fn=<SelectBackward0>)
tensor([[-0.0034,  0.0485, -0.1621,  ..., -0.2797,  0.1162,  0.0521],
        [ 0.1608, -0.1233, -0.1542,  ...,  0.0887, -0.1677,  0.2669],
        [-0.0077, -0.0758,  0.0148,  ..., -0.2351,  0.1145, -0.0721],
        ...,
        [ 0.1441, -0.0420, -0.0213,  ..., -0.1504,  0.2413,  0.3030],
        [ 0.1076, -0.1867, -0.1643,  ...,  0.0081, -0.0278,  0.1577],
        [-0.0588,  0.0900,  0.0219,  ..., -0.1190,  0.0211,  0.0505]],
       grad_fn=<SelectBackward0>)
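For context on why output_attentions=True could change the result at all: SDPA cannot return per-head attention weights, so requesting them makes the model fall back to the eager attention path, and the two paths only agree if both apply the alibi bias identically. The sketch below (plain PyTorch with assumed toy shapes, not the Falcon code itself) shows that SDPA with an additive float mask should be numerically equivalent to the manual softmax computation:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed toy shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 20, 16)
k = torch.randn(1, 4, 20, 16)
v = torch.randn(1, 4, 20, 16)

# Stand-in additive bias playing the role of alibi (+ causal mask).
bias = torch.randn(1, 4, 20, 20)

# SDPA path: a float attn_mask is added to the scaled logits.
out_sdpa = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

# Manual ("eager") path: the same computation written out.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
weights = torch.softmax(scores + bias, dim=-1)
out_manual = weights @ v

print(torch.allclose(out_sdpa, out_manual, atol=1e-5))  # True
```

Since the two formulations agree when the same bias is fed to both, the divergence reported above presumably comes from the alibi bias being built or applied differently in Falcon's SDPA branch versus its eager branch.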

Expected behavior

print(torch.allclose(falcon_output, falcon_output2, atol=1e-3)) should return True
