
Decoding speed and accuracy on the transformed onnx model #42

Open
yangyi0818 opened this issue Aug 25, 2022 · 8 comments

Comments

@yangyi0818

Hi, thanks for sharing the espnet_onnx system!

I ran into two problems when I tried to run inference with your code. My acoustic model was trained by myself on our own dataset; the AM architecture is a typical Conformer. I downloaded the code in June.

First, decoding is far too slow. When decoding with torch, the RTF is around 2.32; with the converted onnx model it becomes around 20.

Second, the CER with the torch version is 7.8%, while with onnx it becomes 10.6%. Something is probably wrong.

Here are my configs:

export.py

import sys
sys.path.append('espnet-master')
sys.path.append('espnet-master/espnet_tts_frontend-master')
sys.path.append('espnet_onnx-master/espnet_onnx/export/asr')

from export_asr import ModelExport
from espnet2.bin.asr_inference import Speech2Text

# usage: python export.py <asr_train_config> <asr_model_file> <lm_train_config> <lm_file> <cache_dir>
if __name__ == '__main__':
    m = ModelExport(cache_dir=sys.argv[5])

    # Build the torch Speech2Text object from the trained model, then export it.
    speech2text = Speech2Text(
        asr_train_config=sys.argv[1],
        asr_model_file=sys.argv[2],
        lm_train_config=sys.argv[3],
        lm_file=sys.argv[4],
    )

    m.export(model=speech2text, tag_name='speech2text', quantize=True)

This gives me an onnx directory structured like:

asr/onnx/speech2text/
      config.yaml
      feats_stats.npz
      full/
      quantize/
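
As a quick sanity check after export, the exported model can be loaded back by tag and run on a short utterance. A minimal sketch, assuming espnet_onnx is installed as a package (the tag name matches the export above):

import numpy as np
from espnet_onnx import Speech2Text

# Load the exported model by the tag used in m.export(...)
speech2text = Speech2Text(tag_name='speech2text')

# Run on one second of silence just to confirm the graph executes.
y = np.zeros(16000, dtype=np.float32)
nbest = speech2text(y)
print(nbest[0][0])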

The test set is a wav filelist, structured as:

bigfar_001_000001 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000001.wav
bigfar_001_000002 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000002.wav
bigfar_001_000003 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000003.wav
bigfar_001_000004 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000004.wav
bigfar_001_000005 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000005.wav
bigfar_001_000006 /home/dangfeng/exp_xiandao/for_xiandao/onnx_enh/output_0703/enh/bigfar_001_000006.wav
...

The decoding process is:

decode.py

import sys
sys.path.append('espnet_onnx-master/espnet_onnx/asr')

import os
import time

import librosa
from tqdm import tqdm

from asr_model import Speech2Text

# usage: python decode.py <filelist> <output_dir> <model_dir>
if __name__ == '__main__':
    # step 1: load the exported onnx model
    speech2text = Speech2Text(tag_name='speech2text', model_dir=sys.argv[3])

    # step 2: run ASR on every wav in the filelist
    with open(sys.argv[1]) as f:
        lines = f.readlines()

    for line in tqdm(lines):
        with open(os.path.join(sys.argv[2], 'hyp_flush_1process.trn'), 'a') as fout:
            wav_name = line.split(' ')[0].strip()
            processing_wav = line.split(' ')[1].strip()

            start = time.time()
            y, sr = librosa.load(processing_wav, sr=16000)
            nbest = speech2text(y)
            asr_result = nbest[0][0]
            end = time.time()

            # write "c h a r s<TAB>(utt_id-utt_id)" in sclite trn format
            fout.write(' '.join(asr_result))
            fout.write('\t({0}-{0})\n'.format(wav_name))

            print('processing:  ', processing_wav)
            print('Result:      ', asr_result)
            print('Time:        ', end - start, 's')
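
As a side note, to turn the per-utterance times above into an RTF comparable to the numbers reported in this issue, the loop could accumulate totals; a minimal sketch (the helper name is mine, not part of espnet_onnx):

def report_rtf(proc_times, audio_samples, sr=16000):
    # Real-time factor: total processing time / total audio duration.
    return sum(proc_times) / (sum(audio_samples) / sr)

# Inside the loop above one would collect (end - start) into proc_times
# and len(y) into audio_samples, then call report_rtf at the end.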

Furthermore, I noticed you mentioned in a recent issue that there may be some problems with Conformer AMs for ASR; has that been fixed?

Looking forward to your reply!

@Masao-Someki
Collaborator

Hi @yangyi0818, thank you for reporting the issue!
About the first point, I would like to know the following information:

  • What is your device? CPU or GPU?
  • Am I right that your model was constructed with a Conformer encoder and a Transformer decoder?
  • Did you use LM for the inference?
  • There are two Conformer blocks in ESPnet, the legacy and the latest versions. Which block did you use?
  • I see quantization is applied to your model. Did you execute your quantized model on GPU?

And about the second point, I would like to know the following information:

  • Did you check the weights for ctc and decoder? They are defined in asr/onnx/speech2text/config.yaml

The latest Conformer-related issue is not yet fixed, and I'm trying to solve it!

@yangyi0818
Author

Hi @Masao-Someki! Thank you for your kind reply!
Here are my answers.

About the first point:

What is your device? CPU or GPU?
CPU

Am I right that your model was constructed with Conformer encoder and Transformer decoder?
Yes.

Did you use LM for the inference?
Yes. It is a Transformer-structured LM.

There are two Conformer blocks in ESPnet, the legacy and the latest versions. Which block did you use?
Our AM was trained last year, so maybe it is the legacy one?

I see quantization is applied to your model. Did you execute your quantized model on GPU?
It is true that I set quantize=True in export.py, but I have only tried the unquantized model, on CPU.

About the second point:
Yes, I checked the weights and also tried different configurations, but it didn't seem to help much. Here are the results:
weights: {ctc: 0.3, decoder: 0.7, length_bonus: 0.0, lm: 0.3} # cer=10.8% (same configuration as the torch inference)
weights: {ctc: 0.3, decoder: 0.7, length_bonus: 0.0, lm: 1.0} # cer=10.8%
weights: {ctc: 0.3, decoder: 1.0, length_bonus: 0.0, lm: 0.1} # cer=11.6%
weights: {ctc: 0.5, decoder: 0.5, length_bonus: 0.0, lm: 1.0} # cer=10.7%

@Masao-Someki
Collaborator

Thank you!
About the RTF, it may be a problem with the frontend process.
If you are using the default frontend, which consists of stft and logmel, could you check the performance difference between the torch frontend and the onnx frontend?
I recently found a slight slowdown in espnet_onnx's frontend compared to the ESPnet version, and I am now considering converting the whole frontend process into onnx. If the frontend is causing this problem, I will have to do that quickly.
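
For anyone who wants to try this comparison, here is a minimal self-contained sketch; the STFT/log-mel parameters are illustrative placeholders rather than the exact ESPnet frontend configuration, and the numpy path simply stands in for the CPU-side onnx frontend:

import time
import numpy as np
import torch
import librosa

def logmel_torch(wav):
    # STFT + log-mel in torch, roughly mirroring the default frontend.
    spec = torch.stft(torch.from_numpy(wav), n_fft=512, hop_length=160,
                      return_complex=True).abs() ** 2
    mel = torch.from_numpy(
        librosa.filters.mel(sr=16000, n_fft=512, n_mels=80)).float()
    return torch.log(torch.clamp(mel @ spec, min=1e-10))

def logmel_numpy(wav):
    # The same pipeline in numpy, standing in for the onnx-side frontend.
    spec = np.abs(librosa.stft(wav, n_fft=512, hop_length=160)) ** 2
    mel = librosa.filters.mel(sr=16000, n_fft=512, n_mels=80)
    return np.log(np.maximum(mel @ spec, 1e-10))

wav = np.random.randn(16000 * 10).astype(np.float32)  # 10 s of dummy audio
for name, fn in [('torch', logmel_torch), ('numpy', logmel_numpy)]:
    fn(wav)  # warm-up
    start = time.time()
    for _ in range(20):
        fn(wav)
    print(name, 'frontend:', (time.time() - start) / 20, 's per call')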

@rookie0607

(quoting @yangyi0818's original issue in full)

What is your torch version?

@yangyi0818
Author

Hi @rookie0607,
my torch version is 1.7.1 and my onnx version is 1.7.0.

@joazoa

joazoa commented Aug 31, 2022

In relation to the slow speed, can you check how many cores are loaded when you run inference with onnx? I suspect it could be related.
@Masao-Someki I notice that all CPU cores are in use when I try to do CPU inference. Is there a way to avoid this other than setting taskset 1?
I tried export OMP_NUM_THREADS=1 but no luck.

@Masao-Someki
Collaborator

@joazoa
You can limit the number of threads with the following options:

  • inter_op_num_threads = 1
  • intra_op_num_threads = 1

Currently, there is no script to limit the number of threads in espnet_onnx, so you may need to modify the inference code like this:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.inter_op_num_threads = 1
sess_options.intra_op_num_threads = 1

self.encoder = ort.InferenceSession(
    self.config.quantized_model_path,
    providers=providers,
    sess_options=sess_options,
)
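
Note that espnet_onnx creates one InferenceSession per sub-model (encoder, decoder, LM, and so on), so the same sess_options would presumably need to be passed to each session for the thread limit to take full effect.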

@joazoa

joazoa commented Aug 31, 2022

@Masao-Someki thank you!
