Contributor

@fxmarty fxmarty commented Mar 24, 2023

Remove the constant output encoder_last_hidden_state from decoders.

Breaking change:

  • encoder_last_hidden_state is no longer an output of the decoder ONNX models (with or without past).
  • encoder_hidden_states is no longer an input of the without-past decoder ONNX model, as it is unused (except for t5).
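
As a rough way to tell whether an already exported decoder predates this change, one can look for the two names listed above in its input/output signature. The helper below is a hypothetical sketch (the function and constant names are not part of Optimum); it only compares name lists. Note that for t5 the encoder_hidden_states input is still expected, so a hit on that name alone does not prove the export is old.

```python
from typing import Iterable, List

# Names taken from the breaking-change list above.
REMOVED_DECODER_INPUT = "encoder_hidden_states"
REMOVED_DECODER_OUTPUT = "encoder_last_hidden_state"


def legacy_io_names(input_names: Iterable[str], output_names: Iterable[str]) -> List[str]:
    """Return the pre-change names still present in a decoder's I/O signature.

    Hypothetical helper: it only inspects names, not the graph itself.
    """
    inputs, outputs = set(input_names), set(output_names)
    found = []
    if REMOVED_DECODER_INPUT in inputs:
        found.append(REMOVED_DECODER_INPUT)
    if REMOVED_DECODER_OUTPUT in outputs:
        found.append(REMOVED_DECODER_OUTPUT)
    return found


# The name lists can be read from a session, for example:
#   sess = onnxruntime.InferenceSession("decoder_model.onnx")
#   input_names = [i.name for i in sess.get_inputs()]
#   output_names = [o.name for o in sess.get_outputs()]
old_export = legacy_io_names(
    ["input_ids", "encoder_hidden_states"],
    ["logits", "encoder_last_hidden_state"],
)
new_export = legacy_io_names(["input_ids"], ["logits"])
```

The file name `decoder_model.onnx` in the comment is a placeholder, not a path guaranteed by the exporter.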

Fixes #869

HuggingFaceDocBuilderDev commented Mar 24, 2023

The documentation is not available anymore as the PR was closed or merged.

@fxmarty fxmarty marked this pull request as ready for review March 24, 2023 15:50
Member

@michaelbenayoun michaelbenayoun left a comment

Not sure I understand everything here, but LGTM

Comment on lines 745 to 755
if self._behavior is ConfigBehavior.DECODER:
    reference_model_inputs["input_ids"] = reference_model_inputs.pop("decoder_input_ids")

if self.use_past_in_inputs is False:
    # ONNX without past uses encoder_hidden_states even when we don't output them
    reference_model_inputs["encoder_hidden_states"] = reference_model_inputs.pop("encoder_outputs")[0]
else:
    # ONNX with past does not use encoder_hidden_states when we don't output them
    reference_model_inputs.pop("encoder_outputs")

return reference_model_inputs
Member

In general I do not like a function to both mutate the input and return it, as it gives the impression that it creates a new output.
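
To make the concern concrete, here is a minimal sketch with hypothetical function names (neither is taken from the PR). The first function mutates its argument and returns it, so the caller's dict changes even though the return value suggests a fresh object. The second works on a shallow copy and leaves the caller's dict alone.

```python
from typing import Any, Dict


def rename_in_place(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Mutates `inputs` and also returns it (the pattern questioned above)."""
    inputs["input_ids"] = inputs.pop("decoder_input_ids")
    return inputs


def rename_copy(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Leaves `inputs` untouched and returns a new dict."""
    renamed = dict(inputs)  # shallow copy: keys are independent, values are shared
    renamed["input_ids"] = renamed.pop("decoder_input_ids")
    return renamed


original = {"decoder_input_ids": [1, 2, 3]}
result = rename_in_place(original)
aliased = result is original  # the "new" output is the caller's own dict

untouched = {"decoder_input_ids": [1, 2, 3]}
fresh = rename_copy(untouched)
```

A third option, not shown, is to mutate in place and return None, which makes the side effect explicit at the call site.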

Contributor Author

What would you prefer?

Contributor Author

fxmarty commented Mar 28, 2023

Until we support optional inputs, we will need a dummy encoder_last_hidden_state input (at least in ONNX export validation) so that the merge case works.

Contributor Author

fxmarty commented Mar 29, 2023

Merging as the failing test is unrelated (https://huggingface.co/hf-internal-testing/tiny-random-DebertaV2Model was updated)

@fxmarty fxmarty merged commit b4a83e2 into huggingface:main Mar 29, 2023
hanbitmyths pushed a commit to microsoft/onnxruntime that referenced this pull request Apr 19, 2023
### Description
This PR contains fusion-level and kernel-level optimizations for
[OpenAI's Whisper](https://github.com/openai/whisper).

Some of the added optimizations include:

- Pruning of duplicate/unnecessary inputs and outputs
- Fusion support for Whisper models with or without these inputs/outputs
  (e.g. with these inputs/outputs if exporting with an older official
  Optimum version, without these inputs/outputs if exporting with Optimum
  from source)
- Attention fusions
   - For Whisper's encoder and decoder
   - Modified symbolic shape inference for present output when no past
     input exists (for decoder)
- Multi-head attention fusions
   - For Whisper's decoder and decoder with past
   - Packed MatMul for the 3 MatMuls excluded in multi-head attention
     fusion
- Attention kernel changes
   - CPU:
      - Different Q and KV sequence lengths
      - Parallel memset for large sequence lengths
      - Convert broadcast add after MatMul of Q and K (add_qk) to
        element-wise add
      - Separate present key-value output into present key and present
        value (for multi-head attention spec)
   - CUDA:
      - Use memory efficient attention compute kernel with present state
        (for decoder)
- Multi-head attention kernel changes
   - CPU:
      - Introduction of multi-head attention CPU kernel (previously did
        not exist)
      - Use AddBiasReshape instead of AddBiasTranspose when sequence
        length = 1 (for decoder with past)
      - Different Q, K, V input shapes
      - Pass past key and past value directly as key and value
   - CUDA:
      - Use memory efficient attention compute kernel with past and/or
        present state (for decoder with past)
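
The "Different Q and KV sequence lengths" item can be illustrated with a small reference computation. This is a pure-Python sketch of single-head scaled dot-product attention, not the ONNX Runtime kernel; the function name and the toy values are assumptions made for illustration. With a decoder that uses past state, the query typically covers one new token while the keys and values cover the whole cached sequence, so the two lengths differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention.

    q has shape (q_len, d); k has shape (kv_len, d); v has shape (kv_len, d_v).
    q_len and kv_len are allowed to differ.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        # One score per key position.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax over the key positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]                          # q_len = 1 (one new token)
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # kv_len = 3 (cached positions)
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context = attention(q, k, v)              # one output row per query row
```

The output has one row per query row regardless of how many key/value positions there are.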

### Usage
To use the optimizations, run the ORT transformer optimizer script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type bart --num_heads <number of attention heads, depends on the size of the whisper model used> --hidden_size <attention hidden size, depends on the size of the whisper model used> --use_external_data_format --use_multi_head_attention
```
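
The `--num_heads` and `--hidden_size` values depend on the Whisper size. The sketch below fills them in from the published Whisper architecture sizes and builds the argument list for the command above. The helper name and the table are assumptions made for illustration; if in doubt, check the values against the model's `config.json` (`decoder_attention_heads` and `d_model` in the Hugging Face Whisper config).

```python
from typing import Dict, List, Tuple

# (number of attention heads, hidden size) per Whisper size.
# Assumption: taken from the published Whisper model sizes; verify against config.json.
WHISPER_DIMS: Dict[str, Tuple[int, int]] = {
    "tiny": (6, 384),
    "base": (8, 512),
    "small": (12, 768),
    "medium": (16, 1024),
    "large": (20, 1280),
}


def build_optimizer_args(size: str, input_path: str, output_path: str) -> List[str]:
    """Build the optimizer.py command line shown above for a given Whisper size."""
    num_heads, hidden_size = WHISPER_DIMS[size]
    return [
        "python3", "optimizer.py",
        "--input", input_path,
        "--output", output_path,
        "--model_type", "bart",
        "--num_heads", str(num_heads),
        "--hidden_size", str(hidden_size),
        "--use_external_data_format",
        "--use_multi_head_attention",
    ]


args = build_optimizer_args("tiny", "decoder_model.onnx", "decoder_model_opt.onnx")
command = " ".join(args)
```

The resulting list can be passed to `subprocess.run(args, check=True)` from the `transformers` tools directory; the file names here are placeholders.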

Once optimized, here's an example of how to run Whisper with [Hugging
Face's Optimum](https://github.com/huggingface/optimum):
```
from transformers.onnx.utils import get_preprocessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from optimum.pipelines import pipeline as ort_pipeline

import whisper # Installed from OpenAI's repo - setup instructions at https://github.com/openai/whisper/

directory = './whisper_opt' # Where the optimized ONNX models are located
model_name = 'openai/whisper-tiny'
device = 'cpu'

# Get pipeline
processor = get_preprocessor(model_name)
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    directory,
    use_io_binding=(device == 'cuda'),
    provider='CPUExecutionProvider',
).to(device)
pipe = ort_pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=(-1 if device == 'cpu' else 0),
)

# Load audio file and run pipeline
audio = whisper.load_audio('tests/jfk.flac')
audio = whisper.pad_or_trim(audio)
outputs = pipe([audio])
print(outputs)
```

Note: In order to use these changes with Optimum, it is recommended to
use Optimum from source to have the following changes:
- huggingface/optimum#872
- huggingface/optimum#920

### Motivation and Context
This PR helps with the following issues:
- #15100
- #15235
- huggingface/optimum#869 (work in progress)

This PR can be used with the other currently merged Whisper PRs:
- #15247
- #15339
- #15362
- #15365
- #15427

This PR uses changes from the following merged PRs:
- #14198
- #14146
- #14201
- #14928 (this introduced
the new multi-head attention spec)

Development

Successfully merging this pull request may close these issues.

ORT whisper on CUDAExecutionProvider is slower than PyTorch
