Closed
Description
System Info
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- transformers version: 4.39.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.17
- Huggingface_hub version: 0.22.1
- Safetensors version: 0.4.2
- Accelerate version: 0.28.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
Who can help?
@ArthurZucker and @younesbelkada
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Running a sample through a Falcon model with `output_attentions` turned on vs. off returns different values. I think this implies something is off with the new SDPA implementation?
import torch
from transformers import FalconConfig, FalconForCausalLM
VOCAB_SIZE = 1000
HIDDEN_SIZE = 64
NUM_HIDDEN_LAYERS = 3
NUM_ATTENTION_HEADS = 4
INTERMEDIATE_SIZE = HIDDEN_SIZE * 4
INPUT_IDS = torch.randint(0, VOCAB_SIZE, (5, 20))
MAX_POSITION_EMBEDDINGS = 2048
config = FalconConfig(
    vocab_size=VOCAB_SIZE,
    hidden_size=HIDDEN_SIZE,
    num_hidden_layers=NUM_HIDDEN_LAYERS,
    num_attention_heads=NUM_ATTENTION_HEADS,
    new_decoder_architecture=True,
    alibi=True,
)
falcon = FalconForCausalLM(config)
falcon_output = falcon(INPUT_IDS, output_attentions=True)[0]
falcon_output2 = falcon(INPUT_IDS)[0]
print(torch.allclose(falcon_output, falcon_output2, atol=1e-3)) # False
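For context on why I suspect SDPA: with `output_attentions=True`, the model has to fall back to the eager (explicit softmax) path, since the fused SDPA kernel cannot return attention weights. A correct eager fallback should be numerically equivalent to SDPA, as this minimal sketch in plain PyTorch (my own illustration, not the transformers code) shows:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Shapes chosen to mirror the repro: batch 1, 4 heads, 20 tokens, head dim 16
q = torch.randn(1, 4, 20, 16)
k = torch.randn(1, 4, 20, 16)
v = torch.randn(1, 4, 20, 16)

# Eager path: explicit softmax(QK^T / sqrt(d)) @ V, which can return the weights
scores = q @ k.transpose(-2, -1) / (16 ** 0.5)
weights = scores.softmax(dim=-1)
eager_out = weights @ v

# SDPA path: fused kernel, cannot expose attention weights
sdpa_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(eager_out, sdpa_out, atol=1e-5))
```

Since the two paths agree to within float tolerance here, the larger discrepancy in the Falcon outputs above suggests the eager fallback and the SDPA implementation diverge somewhere (perhaps in the alibi handling), rather than being ordinary float noise.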
print(falcon_output[0])
print(falcon_output2[0])
tensor([[ 0.3474,  0.0397, -0.0542,  ...,  0.0310,  0.2182, -0.0476],
        [ 0.2459, -0.0962, -0.1354,  ...,  0.1950, -0.2991,  0.2416],
        [ 0.1041, -0.0469,  0.0851,  ..., -0.1261,  0.0160, -0.0514],
        ...,
        [ 0.1683, -0.0328, -0.0490,  ..., -0.1198,  0.2471,  0.3014],
        [ 0.1236, -0.1986, -0.1901,  ...,  0.0341, -0.0316,  0.1492],
        [-0.0388,  0.0754,  0.0067,  ..., -0.1129,  0.0227,  0.0597]],
       grad_fn=<SelectBackward0>)
tensor([[-0.0034,  0.0485, -0.1621,  ..., -0.2797,  0.1162,  0.0521],
        [ 0.1608, -0.1233, -0.1542,  ...,  0.0887, -0.1677,  0.2669],
        [-0.0077, -0.0758,  0.0148,  ..., -0.2351,  0.1145, -0.0721],
        ...,
        [ 0.1441, -0.0420, -0.0213,  ..., -0.1504,  0.2413,  0.3030],
        [ 0.1076, -0.1867, -0.1643,  ...,  0.0081, -0.0278,  0.1577],
        [-0.0588,  0.0900,  0.0219,  ..., -0.1190,  0.0211,  0.0505]],
       grad_fn=<SelectBackward0>)
Expected behavior
`print(torch.allclose(falcon_output, falcon_output2, atol=1e-3))` should return `True`.
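As a possible workaround while this is investigated (untested against this exact version; assumes the `attn_implementation` argument available in recent transformers releases), both calls can be pinned to the eager implementation so they take the same code path:

```python
from transformers import FalconConfig, FalconForCausalLM

# Assumption: attn_implementation="eager" forces the non-SDPA path, so the
# outputs with and without output_attentions should then match.
config = FalconConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=3,
    num_attention_heads=4,
    new_decoder_architecture=True,
    alibi=True,
    attn_implementation="eager",
)
falcon = FalconForCausalLM(config)
```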