rename CrossAttention to MultiHeadAttention #14201

wangyems · 2023-01-10T02:36:56Z

Description

Rename the CrossAttention op to MultiHeadAttention, since this op can also be used for self-attention.
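As a rough illustration of why one name covers both cases, the sketch below runs the same attention routine once with query, key, and value taken from a single sequence (self-attention) and once with key and value taken from a different sequence (cross-attention). The function `multi_head_attention` and the toy inputs are hypothetical and written in plain Python. They do not reproduce the operator's actual inputs, projections, bias handling, or masking.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(query, key, value, num_heads):
    """Hypothetical reference routine, not the ONNX Runtime kernel.

    query: [q_len][hidden]; key and value: [kv_len][hidden].
    Inputs are assumed to be already projected. Returns [q_len][hidden].
    """
    hidden = len(query[0])
    assert hidden % num_heads == 0, "hidden size must divide evenly across heads"
    head_size = hidden // num_heads
    scale = 1.0 / math.sqrt(head_size)
    output = [[0.0] * hidden for _ in query]
    for h in range(num_heads):
        lo, hi = h * head_size, (h + 1) * head_size
        for i, q_row in enumerate(query):
            # Scaled dot product of this query row against every key row (one head).
            scores = [
                scale * sum(a * b for a, b in zip(q_row[lo:hi], k_row[lo:hi]))
                for k_row in key
            ]
            weights = softmax(scores)
            # Weighted sum of the value rows for this head's slice.
            for w, v_row in zip(weights, value):
                for j in range(lo, hi):
                    output[i][j] += w * v_row[j]
    return output


# Toy data: 2 decoder positions and 3 encoder positions, hidden size 4.
decoder_states = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
encoder_states = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

# Self-attention: query, key, and value all come from the same sequence.
self_out = multi_head_attention(decoder_states, decoder_states, decoder_states, num_heads=2)

# Cross-attention: key and value come from a different sequence.
cross_out = multi_head_attention(decoder_states, encoder_states, encoder_states, num_heads=2)

print(len(self_out), len(self_out[0]))
print(len(cross_out), len(cross_out[0]))
```

In both calls the output has one row per query position, regardless of how many key/value positions there are.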

Motivation and Context

### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)
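The "packed MatMul" item in the description can be pictured with a small sketch. It shows the general idea only: three projection weights that share an input are concatenated column-wise, so a single matrix multiply yields all three results, which are then split apart. The helper names (`pack_weights`, `split_packed`) and the toy weights are hypothetical, and the layout the actual fusion uses is not stated in the description and is assumed here.

```python
def matmul(a, b):
    # Plain-Python matrix multiply: a is [m][k], b is [k][n].
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pack_weights(w_q, w_k, w_v):
    # Concatenate three [hidden][hidden] weights column-wise into [hidden][3*hidden].
    return [rq + rk + rv for rq, rk, rv in zip(w_q, w_k, w_v)]


def split_packed(packed_out, hidden):
    # Split a [seq][3*hidden] result back into its three [seq][hidden] parts.
    q = [row[:hidden] for row in packed_out]
    k = [row[hidden:2 * hidden] for row in packed_out]
    v = [row[2 * hidden:] for row in packed_out]
    return q, k, v


# Toy input (2 positions, hidden size 2) and toy integer weights.
x = [[1, 2], [3, 4]]
w_q = [[1, 0], [0, 1]]
w_k = [[2, 0], [0, 2]]
w_v = [[0, 1], [1, 0]]

# Three separate MatMuls...
separate = (matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# ...versus one packed MatMul followed by a split.
packed = split_packed(matmul(x, pack_weights(w_q, w_k, w_v)), hidden=2)

print(packed == separate)
```

The two paths give identical results because each output column depends only on its own weight column. Packing changes how many MatMul calls are issued, not what they compute.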

Ubuntu added 2 commits January 10, 2023 02:36

init

8950212

docs

9b1881c

wangyems marked this pull request as ready for review January 10, 2023 07:16

wangyems requested a review from tianleiwu January 10, 2023 07:16

tianleiwu approved these changes Jan 10, 2023

wangyems merged commit a01bf8d into main Jan 10, 2023

wangyems deleted the wangye/rename branch January 10, 2023 18:18

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization #15473

Merged
