add whisper

wenet-e2e · Feb 28, 2024 · 68e4e8c
1 parent 422f114

Showing 2 changed files with 14 additions and 0 deletions.
diff --git a/examples/wenetspeech/s0/README.md b/examples/wenetspeech/s0/README.md
@@ -43,6 +43,7 @@
 * Feature info: using fbank feature, with dither 1.0, with cmvn
 * Training info: lr 0.001, batch size dynamic36000, 8 gpus on 3090, acc_grad 4, 130k steps, 4.6 days
 * Decoding info: ctc_weight 0.5, reverse_weight 0.0, average_num 5, blank penalty 0.0, length penalty 0.0
+* PR link: https://github.com/wenet-e2e/wenet/pull/2371
 
 | Decoding mode - Chunk size    | Dev  | Test\_Net | Test\_Meeting |
 |:-----------------------------:|:----:|:---------:|:-------------:|

diff --git a/examples/wenetspeech/whisper/README.md b/examples/wenetspeech/whisper/README.md
@@ -55,6 +55,19 @@ python local/modify_ckpt.py \
 |      attention      | 7.27 % N=328207 C=308016 S=11392 D=8799 I=3672  |  7.90 % N=414097 C=383382 S=18954 D=11761 I=2018    |   13.00 % N=220358 C=194417 S=11788 D=14153 I=2705     |
 | attention_rescoring | 8.95 % N=328207 C=305892 S=16696 D=5619 I=7056  |    10.83 % N=414097 C=371515 S=30229 D=12353 I=2269    |    15.64 % N=220358 C=193717 S=18669 D=7972 I=7812     |
 
+## Whisper-largev3 (conv1d2, full-parameter tuning) Result (text\_fixed, see https://github.com/wenet-e2e/WenetSpeech/discussions/54)
+
+* Feature info: using log_mel_spectrogram feature, no cmvn
+* Training info: bf16, deepspeed stage1, activation checkpointing, batch dynamic12000, acc_grad 8, 8 * 3090 gpu, 48k steps (about 6 days), conf/finetune_whisper_largev3.yaml
+* Decoding info: ctc_weight 0.0, average_num 5
+* PR link: https://github.com/wenet-e2e/wenet/pull/2371
+
+|   decoding_method   |  Dev | Test\_Net | Test\_Meeting |
+|:-------------------:|:----:|:---------:|:-------------:|
+|  ctc_greedy_search  | 7.09 % N=328207 C=308643 S=16976 D=2588 I=3709  | 10.98 % N=414092 C=373301 S=33375 D=7416 I=4697 | 12.84 % N=220358 C=194928 S=18398 D=7032 I=2862 |
+|      attention      | 4.66 % N=328207 C=315591 S=10352 D=2264 I=2692  | 6.54 % N=414092 C=389523 S=19101 D=5468 I=2513 | 8.84 % N=220358 C=202722 S=11296 D=6340 I=1839  |
+| attention_rescoring | 5.99 % N=328207 C=311106 S=14807 D=2294 I=2547  | 9.27 % N=414092 C=378406 S=28993 D=6693 I=2715 | 11.47 % N=220358 C=197013 S=16716 D=6629 I=1923 |
+
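Review note: each result cell in the tables above pairs a percentage with raw counts (`N`, `C`, `S`, `D`, `I`). The README does not spell out the formula, so the sketch below assumes the usual definition, error rate = (S + D + I) / N, with `N` the number of reference tokens, `C` correct, `S` substitutions, `D` deletions and `I` insertions. The helper names `parse_counts` and `error_rate` are hypothetical and not part of WeNet. Under that assumption, the reported attention-decoding figures for the new Whisper-largev3 rows are reproduced to two decimals.

```python
import re


def parse_counts(cell: str) -> dict:
    """Extract the N/C/S/D/I integer counts from one result cell."""
    return {key: int(value) for key, value in re.findall(r"\b([NCSDI])=(\d+)", cell)}


def error_rate(cell: str) -> float:
    """Error rate in percent, assuming (S + D + I) / N."""
    counts = parse_counts(cell)
    errors = counts["S"] + counts["D"] + counts["I"]
    return 100.0 * errors / counts["N"]


# Attention-decoding cells copied from the table added in this commit.
cells = {
    "Dev": "4.66 % N=328207 C=315591 S=10352 D=2264 I=2692",
    "Test_Net": "6.54 % N=414092 C=389523 S=19101 D=5468 I=2513",
    "Test_Meeting": "8.84 % N=220358 C=202722 S=11296 D=6340 I=1839",
}

for name, cell in cells.items():
    counts = parse_counts(cell)
    # Sanity check: correct + substituted + deleted tokens cover the reference.
    assert counts["C"] + counts["S"] + counts["D"] == counts["N"]
    print(f"{name}: {error_rate(cell):.2f} %")
```

Because insertions are counted in the numerator but not in `N`, a row with many insertions can show a noticeably higher error rate even when `C` is large, which is relevant to the FAQ entry that follows.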
 # Frequently Asked Questions
 
 - Q: Why are there so many insertion errors in the decoding results of CTC and attention_rescoring?