Petermca/beamsearch whisper #15339

petermcaughan · 2023-04-03T18:50:14Z

Description

Adjust several code paths so that the Whisper model can run with the BeamSearch op.

Approach: Add a new kModelType enum value in IGenerationParameters, as follows:

Old: 0 = GPT2, 1 = T5

New: 0 = GPT2, 1 = T5, 2 = Whisper

When the user sets this attribute to 2, various shape and type checks are changed to accommodate Whisper inputs.

Motivation and Context

BeamSearch is currently designed for BERT-based models that take vocab tokens as input, and it needs changes to work with Whisper inputs (3-D float values processed from audio data).
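The effect of the new enum value on input validation can be sketched in a few lines of Python. This is an illustration only, not the ONNX Runtime implementation: the `ModelType` class and the `check_primary_input` helper are hypothetical names, and the `int32`/`float32` dtypes are assumptions. The description only states that token inputs are vocab tokens and that Whisper inputs are 3-D float values.

```python
from enum import IntEnum


class ModelType(IntEnum):
    """Values of the model type attribute as listed in this PR."""
    GPT2 = 0
    T5 = 1
    WHISPER = 2  # added by this PR


def check_primary_input(model_type: int, shape: tuple, dtype: str) -> None:
    """Hypothetical validation helper (not an ONNX Runtime function).

    Assumption: token inputs are 2-D int32, Whisper features are 3-D float32.
    """
    kind = ModelType(model_type)  # raises ValueError for an unknown value
    if kind is ModelType.WHISPER:
        expected_rank, expected_dtype = 3, "float32"
    else:
        expected_rank, expected_dtype = 2, "int32"

    if len(shape) != expected_rank:
        raise ValueError(
            f"{kind.name}: expected a rank-{expected_rank} input, got rank {len(shape)}"
        )
    if dtype != expected_dtype:
        raise ValueError(
            f"{kind.name}: expected dtype {expected_dtype}, got {dtype}"
        )


# Token ids for GPT2/T5: (batch_size, sequence_length)
check_primary_input(ModelType.GPT2, (1, 8), "int32")
# Audio features for Whisper; the (1, 80, 3000) shape is illustrative
check_primary_input(ModelType.WHISPER, (1, 80, 3000), "float32")
```

With this layout, passing a 2-D token tensor while the attribute is set to 2 fails fast with a `ValueError`, which mirrors the idea of switching the checks on the attribute value rather than inferring the model from its inputs.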


### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)

Peter McAughan added 13 commits March 14, 2023 18:04

initial commit

2a2e6bb

Fix syntax errors & complete first iteration

a04594d

Merge branch 'main' of https://github.com/microsoft/onnxruntime into petermca/whisper_beamsearch

091de8b

Remove unrelated files

8cdfb79

Remove artefacts

dbcb84e

Changes to successfully inference

a58a653

Fix GPT2 unit tests

42121e1

Fix op definition for BeamSearch inputs

05be311

Merge branch 'main' of https://github.com/microsoft/onnxruntime into petermca/whisper_beamsearch

df22117

Merge branch 'main' of https://github.com/microsoft/onnxruntime into petermca/whisper_beamsearch

5671a15

Fix documentation for beamsearch inputs

d2b7d11

Fix documentation typing

a08c134

Merge branch 'main' of https://github.com/microsoft/onnxruntime into petermca/beamsearch_whisper

d04a649

yufenglee previously approved these changes Apr 3, 2023


Replace docs with auto-generated docs

12ca7ec

petermcaughan dismissed yufenglee’s stale review via 12ca7ec April 4, 2023 01:08

Peter McAughan added 2 commits April 4, 2023 01:09

Whitespace error

e72052e

Fix OperatorKernels doc

60aef22

hanbitmyths approved these changes Apr 4, 2023


hanbitmyths merged commit 1251964 into main Apr 4, 2023

hanbitmyths deleted the petermca/beamsearch_whisper branch April 4, 2023 16:09

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization #15473

Merged


4 participants
