Enable fast softmax mode in FusedSDPA#159

Merged: 3 commits merged into HabanaAI:habana-main from wszczurekhabana:fast_softmax on May 2, 2024

@wszczurekhabana

This PR adds support for setting fast softmax mode in the FusedSDPA operator. Fast softmax is a tradeoff: it improves throughput at the cost of a small loss in accuracy.

Data on performance:

| Ratio | Max input tokens | Max new tokens | Batch size | Throughput without fast softmax [tokens/s] | Throughput with fast softmax [tokens/s] | Improvement % |
|---|---|---|---|---|---|---|
| 97% | 31744 | 1042 | 12 | 139.08 | 147.97 | 6.4% |
| 75% | 24576 | 8192 | 16 | 431.09 | 437.95 | 1.6% |
| 50% | 16384 | 16384 | 24 | 653.39 | 656.38 | 0.5% |
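
For readers who want to check the arithmetic, here is a minimal sketch that recomputes the "Improvement %" column from the two throughput columns. It also assumes that "Ratio" means max input tokens divided by the sum of max input and max new tokens; the PR does not state this, but the assumption reproduces all three rows. The function names are illustrative only.

```python
# Rows copied from the performance table:
# (max input tokens, max new tokens, tokens/s without fast softmax, tokens/s with fast softmax)
rows = [
    (31744, 1042, 139.08, 147.97),
    (24576, 8192, 431.09, 437.95),
    (16384, 16384, 653.39, 656.38),
]


def improvement_pct(baseline: float, fast: float) -> float:
    """Relative throughput gain of fast softmax over the baseline, in percent."""
    return (fast / baseline - 1.0) * 100.0


def input_ratio_pct(max_input: int, max_new: int) -> float:
    """Assumed meaning of 'Ratio': share of input tokens in the total token budget."""
    return max_input / (max_input + max_new) * 100.0


for max_input, max_new, baseline, fast in rows:
    ratio = input_ratio_pct(max_input, max_new)
    gain = improvement_pct(baseline, fast)
    print(f"ratio={ratio:.0f}%  improvement={gain:.1f}%")
```

The gain shrinks as the share of generated tokens grows, which matches the trend in the table: the benefit is largest for the prompt-heavy 97% configuration.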

Data on accuracy (using mlperf test from: https://gerrit.habana-labs.com/plugins/gitiles/mlperf_inference/+/refs/heads/master_next/code/llama/llama_greedy.py
and https://gerrit.habana-labs.com/plugins/gitiles/mlperf_inference/+/refs/heads/master_next/code/llama/evaluation.py):

| | rouge1 | rouge2 | rougeL | rougeLsum | accuracy |
|---|---|---|---|---|---|
| without fast softmax | 44.4279 | 22.0536 | 28.6362 | 42.0044 | 99.99 |
| with fast softmax | 44.4065 | 22.0229 | 28.6156 | 41.9858 | 99.94 |
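
To put the accuracy cost in perspective, the following short sketch computes the absolute drop for each metric from the two rows above. The values are copied from the table and the drops are expressed in the table's own units; the helper name `drops` is illustrative.

```python
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum", "accuracy"]
without_fast = [44.4279, 22.0536, 28.6362, 42.0044, 99.99]
with_fast = [44.4065, 22.0229, 28.6156, 41.9858, 99.94]


def drops(baseline, candidate):
    """Absolute per-metric drop (baseline minus candidate), rounded to 4 decimals."""
    return [round(b - c, 4) for b, c in zip(baseline, candidate)]


for name, drop in zip(metrics, drops(without_fast, with_fast)):
    print(f"{name}: -{drop:.4f}")
```

Every metric drops by 0.05 or less in absolute terms.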

@ghost ghost requested a review from MrGeva April 15, 2024 07:27
Comment thread optimum/habana/transformers/models/llama/modeling_llama.py
@dudilester

This change to the ModuleFusedSDPA forward API will require changes in the HQT patched module for quantization, which means it will break nightly testing once merged. I'm not sure we support regular softmax for 8-bit, so we need to decide on the appropriate behavior when a user requests both quantization and regular softmax: should we ignore the quantization, ignore the softmax setting, or assert on that configuration?
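
The three options in the question above can be framed as a small policy function. This is a hypothetical sketch only: `ConflictPolicy` and `resolve_config` are names invented for illustration and are not part of optimum-habana or the quantization toolkit, and the sketch takes the comment's premise (regular softmax possibly unsupported for 8-bit) as an assumption.

```python
from enum import Enum
from typing import Tuple


class ConflictPolicy(Enum):
    """Hypothetical ways to handle quantization requested with regular softmax."""
    IGNORE_QUANTIZATION = "ignore_quantization"  # keep regular softmax, run unquantized
    IGNORE_SOFTMAX = "ignore_softmax"            # keep quantization, fall back to fast softmax
    ASSERT = "assert"                            # reject the configuration outright


def resolve_config(quantized: bool, fast_softmax: bool,
                   policy: ConflictPolicy) -> Tuple[bool, bool]:
    """Return the effective (quantized, fast_softmax) pair under the given policy."""
    # Assumption from the comment: only quantization plus regular softmax conflicts.
    if not (quantized and not fast_softmax):
        return quantized, fast_softmax
    if policy is ConflictPolicy.IGNORE_QUANTIZATION:
        return False, False
    if policy is ConflictPolicy.IGNORE_SOFTMAX:
        return True, True
    raise ValueError("quantization with regular softmax is not a supported configuration")
```

Silently overriding either setting is convenient but can hide a misconfiguration from the user, whereas asserting makes the unsupported combination explicit.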

@wszczurekhabana
Author

Discussed offline. Relevant change for quantization toolkit: https://gerrit.habana-labs.com/#/c/411008/ pushed by @dudilester is in review.

@wszczurekhabana
Author

Change in https://gerrit.habana-labs.com/#/c/411008/ is merged. @dvarshney-habana @puneeshkhanna @dudilester I think we can merge this PR now.

@dudilester

The change in https://gerrit.habana-labs.com/#/c/411008/ has not passed promotion yet; we need to wait until it passes before we merge this PR.

@dudilester

FYI, commit https://gerrit.habana-labs.com/#/c/411008/ has been promoted since my previous comment and is included in builds starting with the CD 1.16.0-328 release build.

@wszczurekhabana
Author

Thanks, I was not tracking it closely. @dvarshney-habana can we merge it?

@ghost ghost merged commit 8405798 into HabanaAI:habana-main May 2, 2024
astachowiczhabana pushed a commit that referenced this pull request May 6, 2024
* Enable fast softmax mode in FusedSDPA

* Add fast_softmax parameter to _gradient_checkpointing_func
wszczurekhabana added a commit that referenced this pull request May 10, 2024
* Enable fast softmax mode in FusedSDPA

* Add fast_softmax parameter to _gradient_checkpointing_func
@wszczurekhabana
Author

upstreamed in: huggingface#972

astachowiczhabana pushed a commit that referenced this pull request Feb 14, 2025
xinyu-intel pushed a commit that referenced this pull request Mar 4, 2025
This pull request was closed.