Restore Gemma-4 AudioAttention patch for fp16 #577
Merged
The audio attention patch is needed for fp16 (Tesla T4). The config value attention_invalid_logits_value = -1e9 lies outside the fp16 range (largest finite magnitude 65504), causing a RuntimeError at masked_fill. This patch clamps the value to -65000.0 when running in fp16. The patch was incorrectly removed as part of the FORCE_FLOAT32 cleanup. It is not a FORCE_FLOAT32 workaround: it applies whenever the hidden_states dtype is float16, regardless of FORCE_FLOAT32.
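The range limit described above can be checked without a GPU. The sketch below uses only the standard library's half-precision `struct` format (`"e"`) as a stand-in for a torch fp16 tensor; the helper names `fits_in_fp16` and `fp16_round_trip` are hypothetical and are not part of the patch. Note that `struct` raises `OverflowError`, whereas the PR reports a `RuntimeError` from `masked_fill`; only the underlying limit is the same.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value: float) -> bool:
    """Return True if value can be packed as a half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        # struct refuses values whose magnitude is too large for fp16
        return False


def fp16_round_trip(value: float) -> float:
    """Pack value as fp16 and unpack it again to see what fp16 stores."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(fits_in_fp16(-1e9))          # the configured fill value
print(fits_in_fp16(-65000.0))      # the clamped fill value from the PR
print(fp16_round_trip(-65000.0))   # nearest representable fp16 value
```

The clamped value is not exactly representable in fp16 (the spacing between neighbouring values at that magnitude is 32), so it is stored as the nearest neighbour, which is still a large negative logit well inside the finite range.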
        if needs_clamp:
            self.config.attention_invalid_logits_value = original_value
        return result
    pass
    Gemma4AudioAttention.forward = forward
pass

# On Tesla T4, autocast can downcast attn_weights to fp16, causing masked_fill to fail.
# ============================================================================

def patch_Gemma4AudioAttention():

# Gemma-4 does not need FORCE_FLOAT32 or temporary patches.
# float16 and bfloat16 both work correctly without intervention.
import torch
import os
Summary
Restores patch_Gemma4AudioAttention, which was incorrectly removed in "Remove Gemma-4 temporary patches" (#576).
Problem
Gemma4AudioAttention uses config.attention_invalid_logits_value = -1e9 in a masked_fill call. On fp16 (Tesla T4), -1e9 overflows the fp16 max of 65504, causing a RuntimeError. The crash occurs at modeling_gemma4.py line 304.
Fix
The patch wraps Gemma4AudioAttention.forward to temporarily clamp attention_invalid_logits_value to -65000.0 when hidden_states.dtype == torch.float16. The original value is restored after the forward call. This only activates on fp16; bf16 supports up to ~3.4e38 and does not need clamping.
Why this was removed
This patch was bundled with the FORCE_FLOAT32 patches in #576. Unlike the other patches, this one is not a FORCE_FLOAT32 workaround: it activates based on dtype alone and is needed for fp16 audio inference on Tesla T4.
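The wrap-and-restore approach described under Fix can be sketched without torch. This is a minimal illustration, not the merged patch: `ToyAudioAttention`, `patch_toy_attention` and the string dtypes are hypothetical stand-ins, and the use of `try/finally` and the magnitude check in `needs_clamp` are assumptions. Only the names `needs_clamp`, `original_value`, `result` and the -65000.0 clamp come from the PR.

```python
from types import SimpleNamespace

FP16_SAFE_FILL = -65000.0  # clamp value named in the PR description


class ToyAudioAttention:
    """Hypothetical stand-in for Gemma4AudioAttention (not the real class)."""

    def __init__(self, config):
        self.config = config

    def forward(self, hidden_states):
        # The real forward passes this value to masked_fill; the toy version
        # just returns whatever fill value it sees during the call.
        return self.config.attention_invalid_logits_value


def patch_toy_attention(cls):
    original_forward = cls.forward

    def forward(self, hidden_states, *args, **kwargs):
        original_value = self.config.attention_invalid_logits_value
        # Assumption: clamp only for fp16 inputs with an out-of-range value.
        needs_clamp = (
            hidden_states.dtype == "float16" and original_value < FP16_SAFE_FILL
        )
        if needs_clamp:
            self.config.attention_invalid_logits_value = FP16_SAFE_FILL
        try:
            result = original_forward(self, hidden_states, *args, **kwargs)
        finally:
            # Restore the shared config even if the wrapped forward raises.
            if needs_clamp:
                self.config.attention_invalid_logits_value = original_value
        return result

    cls.forward = forward


patch_toy_attention(ToyAudioAttention)
layer = ToyAudioAttention(SimpleNamespace(attention_invalid_logits_value=-1e9))
fp16_seen = layer.forward(SimpleNamespace(dtype="float16"))
bf16_seen = layer.forward(SimpleNamespace(dtype="bfloat16"))
```

Because the config object is shared, restoring the value after each call keeps bf16 and fp32 callers on the original -1e9 fill value.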
Test plan