Add FP16 support for Whisper model #15427

stevenlix · 2023-04-07T21:47:39Z

Currently, ORT can only run inference on the FP32 Whisper model. This PR adds FP16 support.

tianleiwu · 2023-04-08T00:16:28Z

Add a test case with hf-internal-testing/tiny-random-WhisperForConditionalGeneration model?

stevenlix · 2023-04-08T05:05:08Z

ContribOperators.md also needs to be updated since FP16 is added in contrib_defs.cc. Is there a specific command to run gen_contrib_doc.py?

tianleiwu · 2023-04-08T05:27:42Z

> ContribOperators.md also needs to be updated since FP16 is added in contrib_defs.cc. Is there a specific command to run gen_contrib_doc.py?

You can download ContribOperators.md from the published artifacts of the kernelDocumentation build pipeline (see the failed pipeline job): https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=952726&view=artifacts&pathAsName=false&type=publishedArtifacts

yufenglee · 2023-04-08T09:24:39Z

> ContribOperators.md also needs to be updated since FP16 is added in contrib_defs.cc. Is there a specific command to run gen_contrib_doc.py?

The Windows GPU CI pipeline generates and uploads them as artifacts. You can download the updated version there:
https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=952726&view=artifacts&pathAsName=false&type=publishedArtifacts

stevenlix · 2023-04-08T20:34:41Z

> Add a test case with hf-internal-testing/tiny-random-WhisperForConditionalGeneration model?

Sure. We may merge this PR first so people can run the Whisper FP16 model. The HF testing model will be added in a separate PR.
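For reference, a parity test of the kind discussed here usually compares FP16 outputs against an FP32 baseline under a tolerance, since half precision cannot reproduce FP32 values exactly. The following is a minimal, standard-library-only sketch of that idea. The helper names (`to_fp16`, `outputs_match`), the sample values, and the tolerances are illustrative assumptions, not taken from ORT's test suite.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def outputs_match(reference, candidate, rtol=1e-2, atol=1e-3):
    """Return True if every candidate value is within tolerance of the reference."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rtol, abs_tol=atol)
        for r, c in zip(reference, candidate)
    )


# Stand-ins for model outputs (hypothetical values, not real Whisper logits).
fp32_logits = [0.1234567, -3.1415927, 12.5, 1e-4]
fp16_logits = [to_fp16(v) for v in fp32_logits]

# Tolerant comparison, as a parity test would use.
print(outputs_match(fp32_logits, fp16_logits))
# Exact comparison, which FP16 rounding is expected to break.
print(outputs_match(fp32_logits, fp16_logits, rtol=0.0, atol=0.0))
```

A real test would replace the stand-in lists with the outputs of the FP32 and FP16 models on the same input and pick tolerances suited to the model.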



### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
  - Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper  # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt'  # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)
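The optimizer command in the Usage text above leaves `--num_heads` and `--hidden_size` as placeholders that depend on the Whisper model size. The sketch below shows one way to fill them in. Treat it as a sketch under assumptions: the head counts and widths follow the model table in the Whisper paper and should be checked against your checkpoint's `config.json`; the helper name `build_optimizer_command` and the file names are hypothetical; and the `--float16` flag is an assumption to confirm with `python3 optimizer.py --help`.

```python
import shlex

# (attention heads, hidden size) per Whisper size, per the Whisper paper's model table.
# Verify against the config.json of the checkpoint you actually export.
WHISPER_DIMS = {
    "tiny": (6, 384),
    "base": (8, 512),
    "small": (12, 768),
    "medium": (16, 1024),
    "large": (20, 1280),
}


def build_optimizer_command(size, input_path, output_path, float16=False):
    """Return the argv list for optimizer.py for the given Whisper model size."""
    num_heads, hidden_size = WHISPER_DIMS[size]
    cmd = [
        "python3", "optimizer.py",
        "--input", input_path,
        "--output", output_path,
        "--model_type", "bart",
        "--num_heads", str(num_heads),
        "--hidden_size", str(hidden_size),
        "--use_external_data_format",
        "--use_multi_head_attention",
    ]
    if float16:
        # Assumption: optimizer.py accepts --float16; confirm with --help first.
        cmd.append("--float16")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(build_optimizer_command("tiny", "decoder_model.onnx", "decoder_model_opt.onnx")))
```

Returning an argv list rather than a single string means it can be passed directly to `subprocess.run` from the `transformers` tools directory without shell quoting issues.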

stevenlix added 2 commits April 7, 2023 21:34

FP16 support for whisper

3995491

clean up

db8e4b0

stevenlix requested review from petermcaughan, tianleiwu and yufenglee April 7, 2023 21:47

fix template issue

4410013

update docs

db08e9c

yufenglee approved these changes Apr 9, 2023


tianleiwu approved these changes Apr 9, 2023


stevenlix merged commit 6d126f8 into main Apr 9, 2023

stevenlix deleted the stevenlix/whisper branch April 9, 2023 04:36

snnn pushed a commit that referenced this pull request Apr 10, 2023

Revert "Add FP16 support for Whisper model (#15427)"

86deba4

This reverts commit 6d126f8.

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization #15473

Merged

smk2007 pushed a commit that referenced this pull request Apr 14, 2023

Add FP16 support for Whisper model (#15427)

2853879

Current ORT can only run inference for Whisper FP32 model. This PR adds FP16 support.


4 participants
