Add --use_multi_head_attention in transformers fusion #14198

tianleiwu · 2023-01-10T00:41:40Z

Description

Add an option --use_multi_head_attention that fuses the model with the MultiHeadAttention operator instead of the Attention operator, for testing purposes.

Note that MultiHeadAttention can be used for both self-attention and cross-attention, while the Attention operator is used for self-attention only. The Attention operator carries packed Q/K/V weights for the input projection, whereas the MatMul of that input projection is excluded from MultiHeadAttention.
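The difference between the two operators can be illustrated with a small NumPy sketch. This is not the ONNX Runtime kernel code; the function names (`attention_packed`, `mha_core`) and shapes are hypothetical and chosen only for illustration. It assumes a packed weight of shape `(hidden, 3 * hidden)` inside the Attention-style function, and assumes the MultiHeadAttention-style function receives Q, K and V that were already projected elsewhere, with the bias simply folded into that external projection.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def split_heads(x, num_heads):
    # (batch, seq, hidden) -> (batch, heads, seq, head_size)
    b, s, h = x.shape
    return x.reshape(b, s, num_heads, h // num_heads).transpose(0, 2, 1, 3)


def mha_core(q, k, v, num_heads):
    """MultiHeadAttention-style: Q, K, V arrive already projected.

    Q may have a different sequence length than K/V, so this also
    covers cross-attention.
    """
    qh, kh, vh = (split_heads(t, num_heads) for t in (q, k, v))
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(qh.shape[-1])
    out = softmax(scores) @ vh  # (batch, heads, seq_q, head_size)
    b, n, s, d = out.shape
    return out.transpose(0, 2, 1, 3).reshape(b, s, n * d)


def attention_packed(x, w_qkv, b_qkv, num_heads):
    """Attention-style: the packed Q/K/V input projection lives inside.

    Only one input tensor is accepted, hence self-attention only.
    """
    q, k, v = np.split(x @ w_qkv + b_qkv, 3, axis=-1)
    return mha_core(q, k, v, num_heads)


rng = np.random.default_rng(0)
batch, seq_q, seq_kv, hidden, heads = 2, 5, 7, 8, 2
x = rng.standard_normal((batch, seq_q, hidden))
w_qkv = rng.standard_normal((hidden, 3 * hidden))
b_qkv = rng.standard_normal(3 * hidden)

# Self-attention through the packed projection.
fused = attention_packed(x, w_qkv, b_qkv, heads)

# Same computation with the projection MatMuls kept outside the core.
w_q, w_k, w_v = np.split(w_qkv, 3, axis=1)
b_q, b_k, b_v = np.split(b_qkv, 3)
unfused = mha_core(x @ w_q + b_q, x @ w_k + b_k, x @ w_v + b_v, heads)

# Cross-attention: K and V come from a different sequence ("memory").
memory = rng.standard_normal((batch, seq_kv, hidden))
cross = mha_core(x @ w_q + b_q, memory @ w_k + b_k, memory @ w_v + b_v, heads)

print(np.allclose(fused, unfused), cross.shape)
```

For self-attention the two paths agree numerically, while only the second form accepts keys and values from another sequence.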

Motivation and Context

This reverts commit 012b34d.

### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py \
    --input <filename>.onnx \
    --output <filename>.onnx \
    --model_type bart \
    --num_heads <number of attention heads, depends on the size of the whisper model used> \
    --hidden_size <attention hidden size, depends on the size of the whisper model used> \
    --use_external_data_format \
    --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)
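When the optimizer is driven from a script rather than typed by hand, the command line from the Usage section can be assembled programmatically. The following is a minimal sketch that uses only the flags quoted above; the helper name `build_optimizer_argv`, the file names, and the divisibility check are hypothetical additions for illustration. The values 6 and 384 are assumed to match the `openai/whisper-tiny` checkpoint used in the example and should be replaced for other model sizes.

```python
import shlex


def build_optimizer_argv(input_path, output_path, num_heads, hidden_size,
                         model_type="bart", external_data=True,
                         use_multi_head_attention=True):
    """Build the argv list for optimizer.py (hypothetical helper)."""
    # Assumption: the hidden size is split evenly across attention heads.
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    argv = [
        "python3", "optimizer.py",
        "--input", input_path,
        "--output", output_path,
        "--model_type", model_type,
        "--num_heads", str(num_heads),
        "--hidden_size", str(hidden_size),
    ]
    if external_data:
        argv.append("--use_external_data_format")
    if use_multi_head_attention:
        argv.append("--use_multi_head_attention")
    return argv


# Illustrative file names; values assumed for whisper-tiny.
argv = build_optimizer_argv(
    "decoder_model.onnx", "decoder_model_opt.onnx",
    num_heads=6, hidden_size=384,
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from within the `transformers` tools directory; passing a list avoids shell quoting problems with unusual file names.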

add --use_cross_attention in transformers fusion

64f23c3

tianleiwu marked this pull request as draft January 10, 2023 00:41

tianleiwu added 2 commits January 10, 2023 13:46

Merge branch 'main' into tlwu/cross_attention_fusion

da5a244

change CrossAttention to MultiHeadAttention

72379cf

tianleiwu changed the title ~~Add --use_cross_attention in transformers fusion~~ Add --use_multi_head_attention in transformers fusion Jan 10, 2023

tianleiwu requested a review from wangyems January 11, 2023 03:33

tianleiwu marked this pull request as ready for review January 11, 2023 03:33

tianleiwu marked this pull request as draft January 11, 2023 03:35

add test case

cb27f6e

tianleiwu marked this pull request as ready for review January 11, 2023 06:04

tianleiwu requested a review from yufenglee January 11, 2023 18:40

wangyems approved these changes Jan 11, 2023

tianleiwu merged commit 012b34d into main Jan 11, 2023

tianleiwu deleted the tlwu/cross_attention_fusion branch January 11, 2023 21:20

mszhanyi added a commit that referenced this pull request Jan 12, 2023

Revert "Add --use_multi_head_attention in transformers fusion (#14198)"

1ae4f75

This reverts commit 012b34d.

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization #15473

Merged
