Conversation

@zucchini-nlp
Member

What does this PR do?

Fixes #32945. The root cause is that Phi3 previously prepared the 4D attention mask with the sliding window applied, while the new _update_causal_mask did not take the sliding window into account. This PR fixes that and adds a test.
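
For illustration (a toy sketch, not the code in this PR): with a sliding window, the causal mask must additionally block keys that sit more than sliding_window positions behind the query, which a plain causal mask would still allow.

```python
import torch


# Toy sketch, not the transformers implementation: build an additive causal
# mask and optionally restrict it to a sliding window. Exact boundary
# conventions (whether the window includes the current token) may differ.
def toy_causal_mask(seq_len, dtype=torch.float32, sliding_window=None):
    min_dtype = torch.finfo(dtype).min
    idx = torch.arange(seq_len)
    allowed = idx[None, :] <= idx[:, None]  # no attending to future keys
    if sliding_window is not None:
        # additionally forbid keys more than `sliding_window` steps in the past
        allowed &= (idx[:, None] - idx[None, :]) < sliding_window
    mask = torch.full((seq_len, seq_len), min_dtype, dtype=dtype)
    mask[allowed] = 0.0
    return mask


print(toy_causal_mask(5, sliding_window=3))  # lower-left corner stays masked
```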

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@gante left a comment

Thank you for opening the PR with the fix 💪 A few questions and comments

min_dtype = torch.finfo(dtype).min
sequence_length = input_tensor.shape[1]
if using_static_cache:
if using_sliding_window_cache:
Contributor

we now have a significant code block to determine target_length, but target_length is not used directly in _update_causal_mask

suggestion to avoid this disconnection (rough sketch after the list):

  1. make SlidingWindowCache store the config.sliding_window it receives at init time
  2. move the logic that computes target_length inside _prepare_4d_causal_attention_mask_with_cache_position
  3. _prepare_4d_causal_attention_mask_with_cache_position no longer receives target_length nor config, as they can be retrieved from past_key_values
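
A minimal sketch of what that flow could look like (hypothetical names and signature, just to illustrate the suggested direction; the real helper takes more arguments):

```python
# Hypothetical sketch of the suggestion above, not actual transformers code:
# the cache remembers its sliding window at init time, and the mask helper
# derives target_length from past_key_values instead of receiving it.
def _prepare_4d_causal_attention_mask_with_cache_position(
    attention_mask, input_tensor, cache_position, past_key_values
):
    sequence_length = input_tensor.shape[1]
    sliding_window = getattr(past_key_values, "sliding_window", None)
    if sliding_window is not None:
        # step 1: SlidingWindowCache stored config.sliding_window when created
        target_length = max(sequence_length, sliding_window)
    else:
        # other cache types expose their capacity directly
        target_length = past_key_values.get_max_length()
    # steps 2/3: target_length is computed here, so callers no longer pass it
    # (nor the config); the rest of the 4D mask construction would follow.
    ...
```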

Member Author

For the target length, maybe we should merge #32421 first, where the max length becomes easily accessible through the cache regardless of cache type. We earlier discussed returning the max cache shape, and the sliding-window cache class has a maximum capacity.

And then, yes, we can move that part into _prepare_4d_causal_attention_mask_with_cache_position, but I think we'd better do the move in all model classes for general consistency.

Contributor

Agreed with the plan -- shall we leave a TODO for us in this PR, linking to your comment?

Member Author

yes, I'll add a TODO for us so we can make the required change in all models at once, to not mix different updates in one PR :)

dtype = self.lm_head.weight.dtype
min_dtype = torch.finfo(dtype).min

if isinstance(past_key_values, SlidingWindowCache):
Contributor

(same comment as above here)

Collaborator

@ArthurZucker left a comment

awesome test!

@zucchini-nlp
Member Author

Done, I think this PR is ready to be reviewed/merged.

Depending on which PR is merged first, this one or the linked one, I will rebase and apply the necessary changes. Then I'll add the TODO comment about moving the target_length logic into _prepare_attention_mask.

Collaborator

@ArthurZucker left a comment

Thanks, let's try to abstract this a tad, as we generally want to avoid differentiating between cache classes in the modeling code!

Collaborator

this LGTM! thanks for the thorough test

Comment on lines 1189 to 1195
if isinstance(past_key_values, SlidingWindowCache):
    sliding_window = (
        self.config.sliding_window if self.config.sliding_window is not None else sequence_length
    )
    target_length = max(sequence_length, sliding_window)
else:
    target_length = past_key_values.get_max_length()
Collaborator

I don't understand why we need to do this: if there is a SlidingWindowCache, it was initialized from the config and thus has the correct sliding_window.

Then, SlidingWindowCache.get_max_length could take sequence_length as input to return the max, which would avoid having these checks here. WDYT?

Member Author

Yes, it should, but the PR that exposes the max length on SlidingWindowCache is on its way and not merged yet.

So, to bring you up to date on the discussions with @gante, which are currently spread across different PR comments: some cache classes now do not have a max_length (e.g. SlidingWindowCache). As noted in the code, a sliding-window cache technically has no max length and works on a rolling basis. But what we want to check in transformers is the "maximum capacity of the cache instance", independently of how the cache handles new tokens that go beyond that capacity.

So, in a different PR I renamed the method to get_max_cache_shape, which is more straightforward, and added get_max_cache_shape for SlidingWindowCache. We'll do a simple deprecation cycle, as we did for the static cache's "max_batch_size". Until the linked PR is merged, I am copying this piece of code from Mistral and using it in Phi3. I have it noted and will handle it depending on which PR gets merged first :)
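
For reference, a rough sketch of the idea (a hypothetical class, not the implementation in the linked PR): the cache remembers its maximum capacity at init time and exposes it through get_max_cache_shape, so the modeling code never needs isinstance checks.

```python
# Rough sketch only, assuming the cache stores its capacity at init time;
# the actual SlidingWindowCache in transformers may differ in the details.
class SlidingWindowCacheSketch:
    def __init__(self, sliding_window: int, max_cache_len: int):
        # A sliding-window cache can never hold more than `sliding_window`
        # tokens, even if a larger max_cache_len is requested.
        self.max_cache_len = min(sliding_window, max_cache_len)

    def get_max_cache_shape(self) -> int:
        # "Maximum capacity of the cache instance", regardless of how the
        # cache rolls over once that capacity is reached.
        return self.max_cache_len
```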

Collaborator

alright sounds good. I just don't want us to add too much complexity to the code! 🤗

Contributor

@gante left a comment

LGTM, but the changes will likely fail because of a recently merged PR (things need to be moved, see comment)

_CONFIG_FOR_DOC = "MistralConfig"


def _prepare_4d_causal_attention_mask_with_cache_position(
Contributor

Because of #33677, this function is part of the model class -- I think you will have to move the diff there, otherwise tests may fail on main

(see the diff in that PR for Llama, it should be similar to the changes you need to do here)
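
In other words (a bare-bones sketch; see the Llama diff in #33677 for the actual signature and body), the former module-level helper now lives on the model class as a static method:

```python
# Sketch only: the standalone function becomes a @staticmethod on the model,
# so per-model tweaks (like the sliding-window handling in this PR) live there.
class ModelSketch:
    @staticmethod
    def _prepare_4d_causal_attention_mask_with_cache_position(*args, **kwargs):
        # same 4D-mask construction as the old module-level function
        ...
```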

@zucchini-nlp
Member Author

Rebased on main and updated accordingly by moving prepare_causal_mask into XXXModel. I also noticed that Phi3Moe was added while this PR was in progress and it is the same as Phi3, so I propagated the changes there too.

Will merge tomorrow if no comments remain :)

@Cyrilvallez
Member

Cyrilvallez commented Oct 9, 2024

Hey @zucchini-nlp, while working on #33619 I had issues with the 4D masks and just found this PR - however, it is not only an issue for Phi3! From what I could see, the following models have exactly the same issue (AttentionMaskConverter._ignore_causal_mask_sdpa() does not check for the sliding_window, resulting in wrong masks, and neither does _prepare_4d_causal_attention_mask_with_cache_position()); a sketch of the missing check follows the list.

  • Mimi
  • Mixtral
  • PhiMoe
  • Qwen2
  • Qwen2Moe
  • Qwen2VL
  • Starcoder2

Let me know if you can fix it or if you want me to jump on it.
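
To make the failure mode concrete, here is a sketch of the kind of check that is missing (a hypothetical helper, not the AttentionMaskConverter API): once the attended length exceeds the sliding window, the mask cannot be skipped, because SDPA's is_causal=True alone would still let queries see tokens outside the window.

```python
from typing import Optional

import torch


# Hypothetical illustration of the missing check, not transformers code.
def can_skip_causal_mask(
    attention_mask: Optional[torch.Tensor],
    key_length: int,
    sliding_window: Optional[int],
) -> bool:
    if attention_mask is not None and not bool(attention_mask.all()):
        # real padding present: an explicit 4D mask is needed anyway
        return False
    if sliding_window is not None and key_length > sliding_window:
        # tokens beyond the window must be masked out explicitly
        return False
    # otherwise a plain causal mask (or is_causal=True) is enough
    return True
```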

@zucchini-nlp
Member Author

@Cyrilvallez oh I see, didn't know we had more models that support sliding window. I can propagate changes to other models, sure :)

@Cyrilvallez
Member

Yes! I think I listed them all but you can maybe double-check so that all of them get correctly fixed 🤗

Collaborator

@ArthurZucker left a comment

ping me when this is merged, that way I can put it in a patch!

@zucchini-nlp
Member Author

@ArthurZucker done! I had to change the tests for the Qwen2 models because otherwise we don't get the same results for a long padded input as for the base input; applying the sliding mask results in minor differences.

Collaborator

@ArthurZucker left a comment

Thanks, the test modifications look good to me; if run-slow was green, let's go! 🔥

@zucchini-nlp zucchini-nlp merged commit adea675 into huggingface:main Oct 10, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* fix phi3 attn fir sliding window

* fix tests

* address most comment

* style

* update after rebase

* add more models

* fix tests

Successfully merging this pull request may close these issues.

Regression in generating text with Phi-3-mini-4k-instruct with a long prompt (gibberish in v4.42+)
