Allow Memory Efficient Attention Kernel to run when local window size is set #21310
Description
This PR adjusts how the Local Window Size parameter is handled when selecting the Memory Efficient Attention kernel. Previously, setting Local Window Size to any value other than -1 disabled Memory Efficient Attention. With this change, the kernel runs regardless of the Local Window Size setting (the window is ignored by this kernel).
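For illustration only, a minimal Python sketch of the selection behavior described above; the function name and structure are hypothetical and simplified, not the actual ONNX Runtime source:

```python
# Hypothetical, simplified sketch of the kernel-selection behavior described above.
# Names are illustrative assumptions, not ONNX Runtime identifiers.
def choose_attention_kernel(sm: int, local_window_size: int) -> str:
    if sm >= 80:
        # Flash Attention handles local window attention natively on sm >= 80.
        return "flash_attention"
    # Before this PR: a local window (local_window_size != -1) disabled the
    # memory-efficient path, leaving no fast kernel on sm < 80.
    # After this PR: memory-efficient attention is selected regardless;
    # its kernel ignores the local window setting.
    return "memory_efficient_attention"
```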
Motivation and Context
Models with local attention have been difficult to run on CUDA: Flash Attention supports local window attention, but only on hardware with CUDA compute capability sm_80 or higher, leaving lower-end GPUs unsupported. With this PR, such models can run on hardware below sm_80 via Memory Efficient Attention. Because that kernel disregards the Local Window Size setting, the output may not exactly match a model running local attention as intended, but the change broadens the range of hardware on which these models can execute.
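As a rough illustration of why outputs can diverge, the toy NumPy example below (not ONNX Runtime code; the tensor shapes and window size are arbitrary assumptions) compares full causal attention, which is effectively what is computed when the window is ignored, against sliding-window attention:

```python
import numpy as np

def attention(q, k, v, window=None):
    """Toy single-head causal attention; window=None means full causal attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))                    # causal: attend to j <= i
    if window is not None:
        mask &= np.triu(np.ones((n, n), dtype=bool), -window + 1)  # local: attend to j >= i - window + 1
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
full = attention(q, k, v)             # window ignored: every past token is visible
local = attention(q, k, v, window=4)  # what local attention would compute
print(np.abs(full - local).max())     # nonzero once the context exceeds the window
```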