Add CrossAttention operator #14146

tianleiwu · 2023-01-05T19:51:34Z

Description

Move the code path for separate Q, K and V inputs (without input projection) from Attention to a new operator, CrossAttention.

The Attention operator is hard to maintain when a single class has to support both modes, with and without input projection. Following feedback, this PR adds a new operator instead.

Some changes might be needed in the future, but they are not part of this PR:
(1) Make bias optional. (We will not pursue this unless experiments show that fusing the Add bias with MatMul, instead of with this op, improves performance.)
(2) Support packed KV. There are two ways to support it: treat key and value as packed when they are the same tensor; or make value optional, and use packed mode when value is empty and key holds the packed K/V.
(3) Support cached key and value, other inputs (such as relative position bias), and more attention mask formats. These can be added easily without breaking backward compatibility.
(4) ROCm/CPU implementations of this op.
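
For readers unfamiliar with attention on separate Q, K and V inputs, here is a minimal NumPy sketch of the computation the description refers to. It is an illustration, not the kernel added in this PR. The function name `cross_attention`, the bias layout (Q, K and V parts concatenated in that order) and the absence of any mask are assumptions made for the example; only the value shape `(batch_size, kv_sequence_length, v_hidden_size)` comes from the operator schema quoted in the review below.

```python
import numpy as np


def cross_attention(query, key, value, bias, num_heads):
    """Reference scaled dot-product attention on separate Q, K, V inputs.

    query: (batch_size, sequence_length, hidden_size)
    key:   (batch_size, kv_sequence_length, hidden_size)
    value: (batch_size, kv_sequence_length, v_hidden_size)
    bias:  (hidden_size + hidden_size + v_hidden_size,)  # assumed layout: Q | K | V
    """
    batch, seq_len, hidden = query.shape
    kv_len = key.shape[1]
    v_hidden = value.shape[2]
    assert bias.shape == (2 * hidden + v_hidden,)
    assert hidden % num_heads == 0 and v_hidden % num_heads == 0

    # Only the bias is added here; there is no input projection (MatMul).
    q = query + bias[:hidden]
    k = key + bias[hidden:2 * hidden]
    v = value + bias[2 * hidden:]

    # Split into heads: (batch, num_heads, length, head_size).
    head = hidden // num_heads
    v_head = v_hidden // num_heads
    q = q.reshape(batch, seq_len, num_heads, head).transpose(0, 2, 1, 3)
    k = k.reshape(batch, kv_len, num_heads, head).transpose(0, 2, 1, 3)
    v = v.reshape(batch, kv_len, num_heads, v_head).transpose(0, 2, 1, 3)

    # Scaled dot-product scores, then a numerically stable softmax over keys.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Weighted sum of values, then merge the heads back together.
    out = probs @ v  # (batch, num_heads, seq_len, v_head)
    return out.transpose(0, 2, 1, 3).reshape(batch, seq_len, v_hidden)
```

Note that the query and key/value sequence lengths may differ, and the output takes its last dimension from value rather than from query.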

Motivation and Context

yufenglee

wangyems · 2023-01-06T22:01:50Z

onnxruntime/core/graph/contrib_ops/bert_defs.cc

+               "value",
+               "Value with shape (batch_size, kv_sequence_length, v_hidden_size)",
+               "T")
+        .Input(3,


optional? (if combine kv in future)

Packed kv will be supported later. Will update the interface accordingly at that time.

wangyems · 2023-01-06T22:15:32Z

onnxruntime/core/graph/contrib_ops/shape_inference_functions.cc

-  //    Input 6 (value) has shape (batch_size, kv_sequence_length, v_hidden_size)
-  //
+  // Input 0 has 3D shape (batch_size, sequence_length, input_hidden_size)
+  // INput 1 has 2D shape (input_hidden_size, hidden_size + hidden_size + v_hidden_size)


will fix it in next PR.
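
The two shape comments in the diff above are enough to sketch how an output shape could be derived for the Attention operator. The helper below is hypothetical and is not the code in `shape_inference_functions.cc`. It assumes the output shape is `(batch_size, sequence_length, v_hidden_size)` and that, when the individual Q/K/V hidden sizes are not supplied, the packed weight splits into three equal parts. Neither assumption is stated in this conversation.

```python
def infer_attention_output_shape(input_shape, weights_shape, qkv_hidden_sizes=None):
    """Hypothetical helper: derive the output shape from the two input shapes.

    input_shape:   (batch_size, sequence_length, input_hidden_size)
    weights_shape: (input_hidden_size, hidden_size + hidden_size + v_hidden_size)
    qkv_hidden_sizes: optional (q, k, v) hidden sizes; when omitted, the three
        parts of the packed weight are assumed to be equal.
    """
    if len(input_shape) != 3:
        raise ValueError("input 0 must be 3D")
    if len(weights_shape) != 2:
        raise ValueError("input 1 must be 2D")

    batch_size, sequence_length, input_hidden_size = input_shape
    if weights_shape[0] != input_hidden_size:
        raise ValueError("weights dim 0 must equal input_hidden_size")

    packed = weights_shape[1]
    if qkv_hidden_sizes is None:
        # Assumed default: Q, K and V share one hidden size.
        if packed % 3 != 0:
            raise ValueError("packed weight width is not divisible by 3")
        v_hidden_size = packed // 3
    else:
        q_hidden, k_hidden, v_hidden_size = qkv_hidden_sizes
        if q_hidden != k_hidden:
            raise ValueError("Q and K hidden sizes must match")
        if q_hidden + k_hidden + v_hidden_size != packed:
            raise ValueError("hidden sizes do not add up to weights dim 1")

    return (batch_size, sequence_length, v_hidden_size)
```

For example, an input of shape `(2, 4, 16)` with weights of shape `(16, 20)` and hidden sizes `(8, 8, 4)` would yield `(2, 4, 4)` under these assumptions.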

### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)

tianleiwu added 7 commits January 4, 2023 11:40

Add QkvToContext

77a629e

draft

d5e03cd

Merge branch 'tlwu/qkv_to_context' of https://github.com/Microsoft/onnxruntime into tlwu/qkv_to_context

1d43f6b

add key_padding_mask

b4239f3

update doc

00c1e6e

Merge branch 'main' into tlwu/qkv_to_context

4bb9a45

fix doc

4bf800b

tianleiwu requested review from wangyems and yufenglee January 5, 2023 19:51

tianleiwu marked this pull request as draft January 5, 2023 21:18

tianleiwu added 6 commits January 6, 2023 01:58

optional value in packed kv

e9ab1e2

value/bias required, remove packed kv

740daf3

fix build

7c09f01

fix dummy mask

f4b276c

Merge branch 'main' into tlwu/qkv_to_context

930bae8

update doc

0484d75

tianleiwu marked this pull request as ready for review January 6, 2023 17:56

update shape inference

8df7447

tianleiwu changed the title ~~[WIP] Add CrossAttention operator~~ Add CrossAttention operator Jan 6, 2023

tianleiwu added 2 commits January 6, 2023 10:27

fix build

6c0201e

fix rocm build

ae7f454

yufenglee approved these changes Jan 6, 2023

wangyems reviewed Jan 6, 2023

wangyems approved these changes Jan 6, 2023

tianleiwu merged commit 2cacb24 into main Jan 6, 2023

tianleiwu deleted the tlwu/qkv_to_context branch January 6, 2023 22:27

tianleiwu mentioned this pull request Jan 9, 2023

update convert_generation for Attention op change #14191

Merged

tianleiwu added a commit that referenced this pull request Jan 10, 2023

update convert_generation for Attention op change (#14191)

7e751ac

We removed the key and value inputs in #14146, so convert_generation needs to be updated as well.

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization #15473

Merged
