[speechx] more doc for speechx (#2702)
* doc for ds2 websocket
zh794390558 authored Nov 30, 2022
1 parent 1ffe356 commit d71d127
Showing 9 changed files with 184 additions and 38 deletions.
2 changes: 1 addition & 1 deletion speechx/examples/custom_asr/README.md
@@ -1,4 +1,4 @@
# customized Auto Speech Recognition
# Customized ASR

## Introduction
These scripts are tutorials that show you how to build your own decoding graph.
1 change: 1 addition & 0 deletions speechx/examples/ds2_ol/README.md
@@ -4,3 +4,4 @@

* `websocket` - Streaming ASR with websocket for deepspeech2_aishell.
* `aishell` - Streaming decoding on the aishell dataset, for local WER testing.
* `onnx` - Example of converting deepspeech2 to onnx format.
70 changes: 61 additions & 9 deletions speechx/examples/ds2_ol/aishell/README.md
@@ -1,12 +1,57 @@
# Aishell - Deepspeech2 Streaming

## How to run
> We recommend using the U2/U2++ model instead of DS2; please see [here](../../u2pp_ol/wenetspeech/).

This is a C++ deployment example that uses the deepspeech2 model to recognize `wav` files and compute the `CER`. We use AISHELL-1 as the test data.

## Source path.sh

```bash
. path.sh
```

The SpeechX binaries are under `$SPEECHX_BUILD` (run `echo $SPEECHX_BUILD` to print it); for more information, please see `path.sh`.
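
A quick sanity check after sourcing `path.sh` (assuming the build has completed and `path.sh` puts the binaries on `PATH`):

```bash
. path.sh
echo $SPEECHX_BUILD        # build directory that holds the SpeechX binaries
which recognizer_main      # should resolve once path.sh has been sourced
```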

## Recognize with linear feature

```bash
bash run.sh
```

## Results
`run.sh` has multiple stages; for details, please see `run.sh`. A single stage can also be run on its own, as sketched after the list below:

1. download the dataset, model, and LM
2. convert the cmvn format and compute features
3. decode from features w/o LM
4. decode from features w/ an n-gram LM
5. decode from features w/ the TLG graph
6. recognize from wav input w/ the TLG graph
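
A minimal sketch for running just one stage, assuming `run.sh` parses `--stage`/`--stop_stage` in the usual Kaldi style (otherwise, edit the `stage`/`stop_stage` variables at the top of the script):

```bash
# run only stage 3 (decode from features w/o LM); adjust the numbers as needed
bash run.sh --stage 3 --stop_stage 3
```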

### Recognize wav input from an `.scp` file

This script uses `recognizer_main` to recognize wav files.

The input is an `scp` file that looks like this:
```text
# head data/split1/1/aishell_test.scp
BAC009S0764W0121 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0121.wav
BAC009S0764W0122 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0122.wav
...
BAC009S0764W0125 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0125.wav
```

If you want to recognize a single wav, you can make an `scp` file like this:
```text
key path/to/wav/file
```
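
For example, a one-entry `scp` file can be written like this (the key and path are placeholders to replace with your own):

```bash
echo "my_utt /path/to/my.wav" > one_wav.scp
```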

Then pass the file via the `--wav_rspecifier=` flag of the `recognizer_main` binary. For the meaning of the other flags, please see the help:
```bash
recognizer_main --help
```

For an example of using `recognizer_main`, please see `run.sh`.
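
A minimal invocation sketch using the same flags as the wav-input stage of `run.sh` (the cmvn, model, and graph paths below are illustrative; point them at the files `run.sh` downloads):

```bash
export GLOG_logtostderr=1
recognizer_main \
    --wav_rspecifier=scp:one_wav.scp \
    --cmvn_file=data/cmvn.ark \
    --model_path=exp/model/avg_1.jit.pdmodel \
    --param_path=exp/model/avg_1.jit.pdiparams \
    --model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
    --word_symbol_table=data/wfst/words.txt \
    --graph_path=data/wfst/TLG.fst --max_active=7500 \
    --nnet_decoder_chunk=8 \
    --acoustic_scale=1.2 \
    --result_wspecifier=ark,t:result.txt
```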


### CTC Prefix Beam Search w/o LM

@@ -25,7 +70,7 @@ Mandarin -> 7.86 % N=104768 C=96865 S=7573 D=330 I=327
Other -> 0.00 % N=0 C=0 S=0 D=0 I=0
```

### CTC WFST
### CTC TLG WFST

LM: [aishell train](http://paddlespeech.bj.bcebos.com/speechx/examples/ds2_ol/aishell/aishell_graph.zip)
--acoustic_scale=1.2
@@ -43,8 +88,11 @@ Mandarin -> 10.93 % N=104762 C=93410 S=9779 D=1573 I=95
Other -> 100.00 % N=3 C=0 S=1 D=2 I=0
```

## fbank
```
## Recognize with fbank feature

This script is the same as `run.sh`, but it uses the fbank feature instead of the linear feature.

```bash
bash run_fbank.sh
```

@@ -66,7 +114,7 @@ Mandarin -> 5.82 % N=104762 C=99386 S=4941 D=435 I=720
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
```

### CTC WFST
### CTC TLG WFST

LM: [aishell train](https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph2.zip)
```
@@ -75,7 +123,11 @@ Mandarin -> 9.57 % N=104762 C=94817 S=4325 D=5620 I=84
Other -> 100.00 % N=3 C=0 S=1 D=2 I=0
```

## build TLG graph
```
bash run_build_tlg.sh
## Build TLG WFST graph

This script builds the TLG WFST graph. It depends on `srilm`, so please make sure srilm is installed.
For more information, please see the script below.

```bash
bash ./local/run_build_tlg.sh
```
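
If `srilm` is not yet installed, here is a sketch of installing it through the repo's tools Makefile, mirroring the check the script itself performs (see the diff below; `$MAIN_ROOT` is assumed to be exported by `path.sh`):

```bash
. path.sh
if ! which ngram-count; then
    # ngram-count missing: build srilm via the tools Makefile, as run_build_tlg.sh does
    pushd $MAIN_ROOT/tools
    make srilm.done
    popd
fi
```
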
speechx/examples/ds2_ol/aishell/local/run_build_tlg.sh
@@ -22,13 +22,15 @@ mkdir -p $data

if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
if [ ! -f $data/speech.ngram.zh.tar.gz ];then
# download ngram
pushd $data
wget -c http://paddlespeech.bj.bcebos.com/speechx/examples/ngram/zh/speech.ngram.zh.tar.gz
tar xvzf speech.ngram.zh.tar.gz
popd
fi

if [ ! -f $ckpt_dir/data/mean_std.json ]; then
# download model
mkdir -p $ckpt_dir
pushd $ckpt_dir
wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/WIP1_asr0_deepspeech2_online_wenetspeech_ckpt_1.0.0a.model.tar.gz
@@ -43,6 +45,7 @@ if [ ! -f $unit ]; then
fi

if ! which ngram-count; then
# srilm needs to be installed
pushd $MAIN_ROOT/tools
make srilm.done
popd
@@ -71,7 +74,7 @@ lm=data/local/lm
mkdir -p $lm

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# Train lm
# Train ngram lm
cp $text $lm/text
local/aishell_train_lms.sh
echo "build LM done."
@@ -94,8 +97,8 @@ cmvn=$data/cmvn_fbank.ark
wfst=$data/lang_test

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then

if [ ! -d $data/test ]; then
# download test dataset
pushd $data
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip
unzip aishell_test.zip
@@ -107,7 +110,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
fi

./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj


# convert cmvn format
cmvn-json2kaldi --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn
fi

@@ -116,7 +120,7 @@ label_file=aishell_result
export GLOG_logtostderr=1

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# TLG decoder
# recognize w/ TLG graph
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/check_tlg.log \
recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \
24 changes: 14 additions & 10 deletions speechx/examples/ds2_ol/aishell/run.sh
@@ -32,6 +32,7 @@ exp=$PWD/exp
aishell_wav_scp=aishell_test.scp
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
if [ ! -d $data/test ]; then
# download dataset
pushd $data
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip
unzip aishell_test.zip
@@ -43,6 +44,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
fi

if [ ! -f $ckpt_dir/data/mean_std.json ]; then
# download model
mkdir -p $ckpt_dir
pushd $ckpt_dir
wget -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
@@ -52,6 +54,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then

lm=$data/zh_giga.no_cna_cmn.prune01244.klm
if [ ! -f $lm ]; then
# download kenlm bin
pushd $data
wget -c https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm
popd
@@ -68,7 +71,7 @@ export GLOG_logtostderr=1

cmvn=$data/cmvn.ark
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# 3. gen linear feat
# 3. convert cmvn format and compute linear feat
cmvn_json2kaldi_main --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn

./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj
@@ -82,14 +85,14 @@ fi
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# recognizer
# decode w/o lm
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wolm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
--model_path=$model_dir/avg_1.jit.pdmodel \
--param_path=$model_dir/avg_1.jit.pdiparams \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--dict_file=$vocb_dir/vocab.txt \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result

@@ -101,14 +104,14 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# decode with lm
# decode w/ ngram lm with feature input
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.lm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
--model_path=$model_dir/avg_1.jit.pdmodel \
--param_path=$model_dir/avg_1.jit.pdiparams \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--dict_file=$vocb_dir/vocab.txt \
--lm_path=$lm \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result_lm
@@ -124,6 +127,7 @@ wfst=$data/wfst/
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
mkdir -p $wfst
if [ ! -f $wfst/aishell_graph.zip ]; then
# download TLG graph
pushd $wfst
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip
unzip aishell_graph.zip
@@ -133,7 +137,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# TLG decoder
# decode w/ TLG graph with feature input
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wfst.log \
ctc_tlg_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
@@ -142,7 +146,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
--word_symbol_table=$wfst/words.txt \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--graph_path=$wfst/TLG.fst --max_active=7500 \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--acoustic_scale=1.2 \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result_tlg

@@ -154,15 +158,15 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# TLG decoder
# recognize from wav file w/ TLG graph
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recognizer.log \
recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \
--cmvn_file=$cmvn \
--model_path=$model_dir/avg_1.jit.pdmodel \
--param_path=$model_dir/avg_1.jit.pdiparams \
--word_symbol_table=$wfst/words.txt \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--graph_path=$wfst/TLG.fst --max_active=7500 \
--acoustic_scale=1.2 \
@@ -173,4 +177,4 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
echo "recognizer test have finished!!!"
echo "please checkout in ${exp}/${wer}.recognizer"
tail -n 7 $exp/${wer}.recognizer
fi
fi
21 changes: 11 additions & 10 deletions speechx/examples/ds2_ol/aishell/run_fbank.sh
@@ -68,7 +68,7 @@ export GLOG_logtostderr=1

cmvn=$data/cmvn_fbank.ark
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# 3. gen linear feat
# 3. convert cmvn format and compute fbank feat
cmvn_json2kaldi_main --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn --binary=false

./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj
@@ -82,15 +82,15 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# recognizer
# decode w/o lm by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.wolm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
--model_path=$model_dir/avg_5.jit.pdmodel \
--param_path=$model_dir/avg_5.jit.pdiparams \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--dict_file=$vocb_dir/vocab.txt \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result_fbank

@@ -100,15 +100,15 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# decode with lm
# decode with ngram lm by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.lm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
--model_path=$model_dir/avg_5.jit.pdmodel \
--param_path=$model_dir/avg_5.jit.pdiparams \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--dict_file=$vocb_dir/vocab.txt \
--lm_path=$lm \
--result_wspecifier=ark,t:$data/split${nj}/JOB/fbank_result_lm
@@ -131,16 +131,16 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# TLG decoder
# decode w/ TLG graph by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.wfst.log \
ctc_tlg_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
--model_path=$model_dir/avg_5.jit.pdmodel \
--param_path=$model_dir/avg_5.jit.pdiparams \
--word_symbol_table=$wfst/words.txt \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--graph_path=$wfst/TLG.fst --max_active=7500 \
--acoustic_scale=1.2 \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result_tlg
@@ -153,6 +153,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# recognize w/ TLG graph by wav
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/fbank_recognizer.log \
recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \
@@ -163,7 +164,7 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
--word_symbol_table=$wfst/words.txt \
--model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
--model_cache_shapes="5-1-2048,5-1-2048" \
--nnet_decoder_chunk=8 \
--nnet_decoder_chunk=8 \
--graph_path=$wfst/TLG.fst --max_active=7500 \
--acoustic_scale=1.2 \
--result_wspecifier=ark,t:$data/split${nj}/JOB/result_fbank_recognizer