Remove constant output in encoder-decoder ONNX models decoder with past #920

fxmarty · 2023-03-24T14:03:47Z

Remove the constant output encoder_last_hidden_state from decoders.

Breaking change:

encoder_last_hidden_state is no longer an output of the decoder ONNX models, with or without past.
encoder_hidden_states is no longer an input of the with-past decoder ONNX model, as it is unused there (except for t5).
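
For callers that build the decoder feed by hand, one way to stay compatible with exports made before and after this change is to filter the feed against the input names the model actually declares. This is a minimal sketch, not code from this PR: `filter_feed` is a hypothetical helper and the name lists are illustrative. With ONNX Runtime, the real names would come from `[i.name for i in session.get_inputs()]`.

```python
from typing import Any, Dict, Iterable


def filter_feed(feed: Dict[str, Any], declared_inputs: Iterable[str]) -> Dict[str, Any]:
    """Return a new feed containing only the inputs the model declares.

    Raises KeyError if a declared input is missing from the feed.
    """
    declared = list(declared_inputs)
    missing = [name for name in declared if name not in feed]
    if missing:
        raise KeyError(f"missing decoder inputs: {missing}")
    return {name: feed[name] for name in declared}


# Illustrative input names only (assumption); query the actual model for the real ones.
old_export_inputs = ["input_ids", "encoder_hidden_states"]
new_export_inputs = ["input_ids"]

feed = {"input_ids": [[0, 1]], "encoder_hidden_states": [[[0.0]]]}
print(sorted(filter_feed(feed, old_export_inputs)))
print(sorted(filter_feed(feed, new_export_inputs)))
```

Filtering on the declared names avoids hard-coding which decoder variant still accepts `encoder_hidden_states`.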

Fixes #869

HuggingFaceDocBuilderDev · 2023-03-24T14:23:59Z

The documentation is not available anymore as the PR was closed or merged.

michaelbenayoun

Not sure I understand everything here, but LGTM

michaelbenayoun · 2023-03-24T16:59:37Z

optimum/exporters/onnx/base.py

+        if self._behavior is ConfigBehavior.DECODER:
+            reference_model_inputs["input_ids"] = reference_model_inputs.pop("decoder_input_ids")
+
+            if self.use_past_in_inputs is False:
+                # ONNX without past uses encoder_hidden_states even when we don't output them
+                reference_model_inputs["encoder_hidden_states"] = reference_model_inputs.pop("encoder_outputs")[0]
+            else:
+                # ONNX with past does not use encoder_hidden_states when we don't output them
+                reference_model_inputs.pop("encoder_outputs")
+
+        return reference_model_inputs


In general I do not like a function to both mutate the input and return it, as it gives the impression that it creates a new output.

What would you prefer?
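
The two conventional ways to address the concern raised above can be sketched as follows. This is a simplified illustration, not the code merged in this PR: both function names are hypothetical, and plain dictionaries and lists stand in for the real reference model inputs.

```python
from typing import Any, Dict


def rename_inputs_copy(inputs: Dict[str, Any], use_past: bool) -> Dict[str, Any]:
    """Option 1: leave the caller's dict untouched and return a new one."""
    new_inputs = dict(inputs)  # shallow copy: keys are independent, values are shared
    new_inputs["input_ids"] = new_inputs.pop("decoder_input_ids")
    encoder_outputs = new_inputs.pop("encoder_outputs")
    if not use_past:
        # Only the decoder without past consumes the encoder hidden states.
        new_inputs["encoder_hidden_states"] = encoder_outputs[0]
    return new_inputs


def rename_inputs_in_place(inputs: Dict[str, Any], use_past: bool) -> None:
    """Option 2: mutate the caller's dict and return None to make that explicit."""
    inputs["input_ids"] = inputs.pop("decoder_input_ids")
    encoder_outputs = inputs.pop("encoder_outputs")
    if not use_past:
        inputs["encoder_hidden_states"] = encoder_outputs[0]


original = {"decoder_input_ids": [1, 2], "encoder_outputs": (["h"],)}
renamed = rename_inputs_copy(original, use_past=False)
```

Either option removes the ambiguity: a function that returns a value does not touch its argument, and a function that mutates its argument returns nothing.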

optimum/exporters/onnx/model_patcher.py

fxmarty · 2023-03-28T16:43:04Z

Until we support optional inputs, we will need an encoder_last_hidden_state dummy input (at least in ONNX export validation) so that the merge case works.

fxmarty · 2023-03-29T08:14:28Z

Merging as the failing test is unrelated (https://huggingface.co/hf-internal-testing/tiny-random-DebertaV2Model was updated)

### Description

This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source)
- Attention fusions
  - For Whisper's encoder and decoder
  - Modified symbolic shape inference for present output when no past input exists (for decoder)
- Multi-head attention fusions
  - For Whisper's decoder and decoder with past
  - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion
- Attention kernel changes
  - CPU:
    - Different Q and KV sequence lengths
    - Parallel memset for large sequence lengths
    - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add
    - Separate present key-value output into present key and present value (for multi-head attention spec)
  - CUDA:
    - Use memory efficient attention compute kernel with present state (for decoder)
- Multi-head attention kernel changes
  - CPU:
    - Introduction of multi-head attention CPU kernel (previously did not exist)
    - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past)
    - Different Q, K, V input shapes
    - Pass past key and past value directly as key and value
  - CUDA:
    - Use memory efficient attention compute kernel with past and/or present state (for decoder with past)

### Usage

To use the optimizations, run the ORT transformer optimizer script as follows:

```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```

Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum):

```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes:

- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context

This PR helps the following issues:

- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:

- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:

- #14198
- #14146
- #14201
- #14928 (this introduced the new multi-head attention spec)
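
The `--num_heads` and `--hidden_size` placeholders in the optimizer command depend on the Whisper checkpoint size. The sketch below fills them in from a small lookup table. It is an illustration under stated assumptions, not part of either PR: `WHISPER_DIMS` and `optimizer_args` are hypothetical names, and the dimensions are the commonly published Whisper values, which should be checked against the `config.json` of the checkpoint being exported.

```python
from typing import Dict, List

# Attention head count and hidden size per Whisper checkpoint size.
# Assumption: values taken from the published Whisper configurations;
# verify them against the config of the model you actually export.
WHISPER_DIMS: Dict[str, Dict[str, int]] = {
    "tiny": {"num_heads": 6, "hidden_size": 384},
    "base": {"num_heads": 8, "hidden_size": 512},
    "small": {"num_heads": 12, "hidden_size": 768},
    "medium": {"num_heads": 16, "hidden_size": 1024},
    "large": {"num_heads": 20, "hidden_size": 1280},
}


def optimizer_args(size: str, input_path: str, output_path: str) -> List[str]:
    """Build the optimizer.py argument list shown in the usage section."""
    dims = WHISPER_DIMS[size]
    return [
        "python3", "optimizer.py",
        "--input", input_path,
        "--output", output_path,
        "--model_type", "bart",
        "--num_heads", str(dims["num_heads"]),
        "--hidden_size", str(dims["hidden_size"]),
        "--use_external_data_format",
        "--use_multi_head_attention",
    ]


# Print the command for a whisper-tiny decoder; file names are placeholders.
print(" ".join(optimizer_args("tiny", "decoder_model.onnx", "decoder_model_opt.onnx")))
```

The resulting list can be passed to `subprocess.run` from the `onnxruntime/python/tools/transformers/` directory.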

fxmarty added 3 commits March 24, 2023 14:41

wip

17a7134

wip 2

abf0cd3

nit

0a8b970

fxmarty added 3 commits March 24, 2023 15:52

fix

4ac3711

remove prints

1e07385

work for all?

ff39e77

fxmarty marked this pull request as ready for review March 24, 2023 15:50

fxmarty requested review from JingyaHuang, echarlaix, mht-sharma, michaelbenayoun and regisss and removed request for michaelbenayoun March 24, 2023 15:50

dead code

8ba54c7

michaelbenayoun approved these changes Mar 24, 2023

fxmarty and others added 10 commits March 28, 2023 10:23

Update optimum/exporters/onnx/model_patcher.py

e7b30bf

Co-authored-by: Michael Benayoun <[email protected]>

Merge branch 'master' into remove-constant-output-encoder-hidden-states

5202f4c

Merge branch 'remove-constant-output-encoder-hidden-states' of https:…

d32d641

…//github.com/fxmarty/optimum into remove-constant-output-encoder-hidden-states

fix tests

efcbf79

fix

e9f74bf

last fix, hopefully

1ceb58f

Merge branch 'master' into remove-constant-output-encoder-hidden-states

0baa29f

merge mess

d876679

hoepfully pass

a8369db

fix merge

71b2705

getting more hacky day by day

0f5a578

fxmarty added 2 commits March 28, 2023 18:49

fix broken longt5

c2c1700

fix again longt5

527fc36

fxmarty merged commit b4a83e2 into huggingface:main Mar 29, 2023

fxmarty mentioned this pull request Mar 29, 2023

ORT whisper on CUDAExecutionProvider is slower than PyTorch #869

Open

4 tasks

kunal-vaishnavi mentioned this pull request Apr 11, 2023

Whisper Model Optimization microsoft/onnxruntime#15473

Merged

fxmarty mentioned this pull request Apr 21, 2023

Raise a warning at ONNX export if input -> Identity -> output patterns are detected #1000

Open
