-
Notifications
You must be signed in to change notification settings - Fork 615
Remove constant output in encoder-decoder ONNX models decoder with past #920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove constant output in encoder-decoder ONNX models decoder with past #920
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
michaelbenayoun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand everything here, but LGTM
optimum/exporters/onnx/base.py
Outdated
| if self._behavior is ConfigBehavior.DECODER: | ||
| reference_model_inputs["input_ids"] = reference_model_inputs.pop("decoder_input_ids") | ||
|
|
||
| if self.use_past_in_inputs is False: | ||
| # ONNX without past uses encoder_hidden_states even when we don't outputing them | ||
| reference_model_inputs["encoder_hidden_states"] = reference_model_inputs.pop("encoder_outputs")[0] | ||
| else: | ||
| # ONNX with past does not use encoder_hidden_states when we don't output them | ||
| reference_model_inputs.pop("encoder_outputs") | ||
|
|
||
| return reference_model_inputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I do not like a function to both mutate the input and return it, as it gives the impression that it creates a new output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you prefer?
Co-authored-by: Michael Benayoun <[email protected]>
…//github.com/fxmarty/optimum into remove-constant-output-encoder-hidden-states
|
Until we support optional inputs, we will need |
|
Merging as the failing test is unrelated (https://huggingface.co/hf-internal-testing/tiny-random-DebertaV2Model was updated) |
### Description This PR contains fusion-level and kernel-level optimizations for [OpenAI's Whisper](https://github.com/openai/whisper). Some of the added optimizations include: - Pruning of duplicate/unnecessary inputs and outputs - Fusion support for Whisper models with or without these inputs/outputs (e.g. with these inputs/outputs if exporting with an older official Optimum version, without these inputs/outputs if exporting with Optimum from source) - Attention fusions - For Whisper's encoder and decoder - Modified symbolic shape inference for present output when no past input exists (for decoder) - Multi-head attention fusions - For Whisper's decoder and decoder with past - Packed MatMul for the 3 MatMuls excluded in multi-head attention fusion - Attention kernel changes - CPU: - Different Q and KV sequence lengths - Parallel memset for large sequence lengths - Convert broadcast add after MatMul of Q and K (add_qk) to element-wise add - Separate present key-value output into present key and present value (for multi-head attention spec) - CUDA: - Use memory efficient attention compute kernel with present state (for decoder) - Multi-head attention kernel changes - CPU: - Introduction of multi-head attention CPU kernel (previously did not exist) - Use AddBiasReshape instead of AddBiasTranspose when sequence length = 1 (for decoder with past) - Different Q, K, V input shapes - Pass past key and past value directly as key and value - CUDA: - Use memory efficient attention compute kernel with past and/or present state (for decoder with past) ### Usage To use the optimizations, run the ORT transformer optimizer script as follows: ``` $ cd onnxruntime/onnxruntime/python/tools/transformers/ $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention ``` Once optimized, here's an example of how to run Whisper with [Hugging Face's Optimum](https://github.com/huggingface/optimum): ``` from transformers.onnx.utils import get_preprocessor from optimum.onnxruntime import ORTModelForSpeechSeq2Seq from optimum.pipelines import pipeline as ort_pipeline import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/ directory = './whisper_opt' # Where the optimized ONNX models are located model_name = 'openai/whisper-tiny' device = 'cpu' # Get pipeline processor = get_preprocessor(model_name) model = ORTModelForSpeechSeq2Seq.from_pretrained( directory, use_io_binding=(device == 'cuda'), provider='CPUExecutionProvider', ).to(device) pipe = ort_pipeline( "automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=(-1 if device == 'cpu' else 0), ) # Load audio file and run pipeline audio = whisper.load_audio('tests/jfk.flac') audio = whisper.pad_or_trim(audio) outputs = pipe([audio]) print(outputs) ``` Note: In order to use these changes with Optimum, it is recommended to use Optimum from source to have the following changes: - huggingface/optimum#872 - huggingface/optimum#920 ### Motivation and Context This PR helps the following issues: - #15100 - #15235 - huggingface/optimum#869 (work in progress) This PR can be used with the other currently merged Whisper PRs: - #15247 - #15339 - #15362 - #15365 - #15427 This PR uses changes from the following merged PRs: - #14198 - #14146 - #14201 - #14928 (this introduced the new multi-head attention spec)
Remove the constant output
encoder_last_hidden_statefrom decoders.Breaking change:
Fixes #869