Add Weight-Only Support To Whisper #794
Conversation
@Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab. We'll add your name to the co-author list and credit your work in the release notes for Whisper int8 support.
Thanks a lot! I will keep working on supporting int8 KV cache and SmoothQuant for Whisper.
Hello.
Yeah, Ampere cards, e.g. A100 and A10, should work very well with int8. I am curious about the perf stats file under results_dir. Would you mind pasting the fp16 vs. int8 RTF and batch_size info here? The results above in the PR are from an RTX 4060 Ti 16 GB. You may also try increasing batch_size for int8, since it saves a lot of VRAM.
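(For reference, RTF, the real-time factor, is typically computed as decode wall time divided by total audio duration. A minimal sketch with hypothetical numbers, not measured values:)
total_audio_seconds = 4 * 6.3   # hypothetical: batch of 4 clips, ~6.3 s each
decode_wall_seconds = 0.9       # hypothetical: measured end-to-end decode time
rtf = decode_wall_seconds / total_audio_seconds
print(f"RTF = {rtf:.3f}  (lower is better; < 1.0 means faster than real time)")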
I used the 1221-135766-0002.wav file from the example and the default batch_size of 4.
Also, I suggest benchmarking with a whole dataset, e.g. https://huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy
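(A sketch of pulling that dataset with the Hugging Face datasets library for a benchmark loop; the "clean" config and "validation" split are assumptions about the dummy dataset's layout:)
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
for sample in ds:
    audio = sample["audio"]["array"]       # float waveform
    sr = sample["audio"]["sampling_rate"]  # 16 kHz for LibriSpeech
    ref = sample["text"]                   # reference transcript
    # feed `audio` to the fp16 and int8 engines here and accumulate timing / WER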
@Eddie-Wang1120 This is great work! While we're looking at optimized Whisper performance, are there any plans to support distil-whisper?
Thanks! Currently I'm working on int8_kv_cache support for Whisper, and I will think about support for other models after I finish Whisper.
Hi @kaiyux, would you mind adding @Eddie-Wang1120's name in the next release notes? I have imported and merged this PR into GitLab. Thanks.
@Eddie-Wang1120 Thanks very much for your great contribution. The changes will be included in the next main branch update on GitHub, and we will credit you as a co-author. Thanks!
Thanks a lot!
@Eddie-Wang1120, thank you for your collaboration. I just tested this quantization on the latest release of TRT, I mean c896530, but I'm not experiencing any improvement in terms of performance or memory usage reduction. In terms of inference speed, it seems to be three times slower. I'm building the model using the following command:
But I see the following results, which I think are not expected. I guess this might be related to the A10 GPU, and it might not perform well in INT8 mode, whereas it is faster in float16. On the other hand, I should see a significantly lower memory footprint, right? large-v2:
v2 int8 WOQ:
The current weight-only quant solution for Whisper has a large speed/throughput regression, and we're investigating it now. However, you should see a significantly lower memory footprint with the current solution. What's your VRAM usage in the above cases? @robosina
@yuekaizhang I see, thanks for the feedback. For the normal model with a batch size of 8 and a beam size of 5, the memory usage is approximately 19,862 MiB. For the WOQ8 model, it is around 19,044 MiB.
Would you mind trying batch_size 4 and beam_size 1? It's weird, since I got about 16,000 MB for fp16 and about 7,000 MB for weight-only int8 on an A10 GPU. @robosina
@yuekaizhang Yes, it's weird to me as well; with this config, the memory usage is 8,730 MiB for the normal model and 7,912 MiB for the WOQ8 model.
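(For cross-checking these numbers, a minimal sketch of reading GPU memory use with NVML, assuming the pynvml bindings are installed; it reads whole-device usage, so run it while the engine is loaded:)
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU index 0, e.g. the A10
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used: {mem.used / 1024**2:.0f} MiB / total: {mem.total / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()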
Yeah, with this config the WOQ8 results are the same. Your fp16 memory usage is much lower than mine. I'm using the large-v3 model, which seems to be the only difference between us.
@yuekaizhang I see, thanks. I will check this in more detail and get back to you.
@yuekaizhang Any updates on the reason the quantized models behave slower than the unquantized models?
Hi @yuekaizhang,
@aramfaghfouri @Bhuvanesh09 Please see #992 (comment). All issues are now fixed, and the relationship between memory usage and speed matches the conclusions in the linked comment: using int8 weight-only results in lower memory usage and faster speed. You can wait for our code update, or directly use the PR corresponding to the link above.
Supports weight-only quantization for the Whisper model.
Uses the default hf-internal-testing/librispeech_asr_dummy dataset.
Only a single build command is needed:
python3 build.py --output_dir whisper_large_weight_only --use_gpt_attention_plugin --use_gemm_plugin --use_layernorm_plugin --use_bert_attention_plugin --use_weight_only
Results:
The GPU memory usage is reduced by more than 2x, inference is about 1.5x faster, and accuracy is maintained.
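(To illustrate where the weight-memory saving comes from, a minimal numpy sketch of per-channel int8 weight-only quantization; this is an illustration only, not the TensorRT-LLM kernel implementation, and the layer shape is hypothetical:)
import numpy as np

w_fp16 = np.random.randn(4096, 4096).astype(np.float16)             # hypothetical fp16 layer weight

scale = (np.abs(w_fp16).max(axis=0) / 127.0).astype(np.float16)      # one scale per output channel
w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)

# At GEMM time the int8 weights are dequantized on the fly: w ≈ w_int8 * scale
w_deq = w_int8.astype(np.float16) * scale

print("fp16 weight bytes:", w_fp16.nbytes)                            # 32 MiB
print("int8 weight + scale bytes:", w_int8.nbytes + scale.nbytes)     # ~16 MiB, roughly a 2x saving
print("max abs dequantization error:", np.abs(w_fp16 - w_deq).max())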
Looking forward to good news!
Eddie-Wang