[ASR] support wav2vec2 command line and demo (#2658)

* wav2vec2_cli
* wav2vec2 demo update: support different optimizer and lr_scheduler, align model, update input type, test=asr
* Update RESULTS.md
* Update base_commands.py
PaddlePaddle · Nov 21, 2022 · 94a487b
1 parent 5c1867a
Showing 32 changed files with 1,424 additions and 137 deletions.
diff --git a/README.md b/README.md
@@ -157,12 +157,12 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
   - 🧩  *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
 
 ### Recent Update
+- 🔥 2022.11.18: Add [Wav2vec2 CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), Support ASR and Feature Extraction.
 - 🎉 2022.11.17: Add [male voice for TTS](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660).
-- 🔥 2022.11.07: Add [U2/U2++ C++ High Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
-- 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
+- 🔥 2022.11.07: Add [U2/U2++ C++ High Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).- 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
 - 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
 - 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for TTS Chinese Text Frontend.
-- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
+- 👑 2022.10.11: Add [Wav2vec2ASR-en](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
 - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) in [PaddleSpeech Web Demo](./demos/speech_web).
 - ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
 - ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.

diff --git a/README_cn.md b/README_cn.md
@@ -164,12 +164,13 @@
 
 
 ### 近期更新
+- 🔥 2022.11.18: 新增 [Wav2vec2 CLI 和 Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), 支持 ASR 和 特征提取.
 - 🎉 2022.11.17: TTS 新增[高质量男性音色](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660)。
 - 🔥 2022.11.07: 新增 [U2/U2++ 高性能流式 ASR C++ 部署](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech)。
 - 👑 2022.11.01: [中英文混合 TTS](./examples/zh_en_tts/tts3) 新增 [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) 模块。
 - 🔥 2022.10.26: TTS 新增[韵律预测](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy)功能。
 - 🎉 2022.10.21: TTS 中文文本前端新增 [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) 功能。
-- 👑 2022.10.11: 新增 [Wav2vec2ASR](./examples/librispeech/asr3), 在 LibriSpeech 上针对 ASR 任务对 wav2vec2.0 的 finetuning。
+- 👑 2022.10.11: 新增 [Wav2vec2ASR-en](./examples/librispeech/asr3), 在 LibriSpeech 上针对 ASR 任务对 wav2vec2.0 的 finetuning。
 - 🔥 2022.09.26: 新增 Voice Cloning, TTS finetune 和 [ERNIE-SAT](https://arxiv.org/abs/2211.03545) 到 [PaddleSpeech 网页应用](./demos/speech_web)。
 - ⚡ 2022.09.09: 新增基于 ECAPA-TDNN 声纹模型的 AISHELL-3 Voice Cloning [示例](./examples/aishell3/vc2)。
 - ⚡ 2022.08.25: 发布 TTS [finetune](./examples/other/tts_finetune/tts3) 示例。

diff --git a/demos/speech_ssl/README.md b/demos/speech_ssl/README.md
@@ -0,0 +1,102 @@
+([简体中文](./README_cn.md)|English)
+# Speech SSL (Self-Supervised Learning)
+
+## Introduction
+Speech SSL, or Self-Supervised Learning, refers to training a model on large-scale unlabeled speech datasets. A model trained this way produces a good acoustic representation and can be applied to downstream speech tasks by fine-tuning on labeled datasets.
+
+This demo uses speech SSL models to recognize the text in a given audio file or to extract its acoustic representation. Either can be done with a single command or a few lines of Python using `PaddleSpeech`.
+
+## Usage
+### 1. Installation
+See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
+
+You can choose one of the easy, medium, and hard ways to install paddlespeech.
+
+### 2. Prepare Input File
+The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
+
+A sample file for this demo can be downloaded:
+```bash
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+```
+
+### 3. Usage
+- Command Line (Recommended)
+  ```bash
+  # to recognize text 
+  paddlespeech ssl --task asr --lang en --input ./en.wav
+
+  # to get acoustic representation
+  paddlespeech ssl --task vector --lang en --input ./en.wav
+  ```
+
+  Usage:
+  ```bash
+  paddlespeech ssl --help
+  ```
+  Arguments:
+  - `input` (required): Audio file to recognize.
+  - `model`: Model type of the ASR task. Default: `wav2vec2ASR_librispeech`.
+  - `task`: Output type. Default: `asr`.
+  - `lang`: Model language. Default: `en`.
+  - `sample_rate`: Sample rate of the model. Default: `16000`.
+  - `config`: Config file of the ASR task. The pretrained model's config is used when it is None. Default: `None`.
+  - `ckpt_path`: Model checkpoint. The pretrained model is used when it is None. Default: `None`.
+  - `yes`: Takes no value. When set, the program's requests are accepted by default, including converting the audio sample rate. Default: `False`.
+  - `device`: Device used for model inference. Default: the default device of paddlepaddle in the current environment.
+  - `verbose`: Show the log information.
+
+
+- Python API
+  ```python
+  import paddle
+  from paddlespeech.cli.ssl import SSLExecutor
+
+  ssl_executor = SSLExecutor()
+
+  # to recognize text 
+  text = ssl_executor(
+      model='wav2vec2ASR_librispeech',
+      task='asr',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('ASR Result: \n{}'.format(text))
+
+  # to get acoustic representation
+  feature = ssl_executor(
+      model='wav2vec2',
+      task='vector',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('Representation: \n{}'.format(feature))
+  ```
+
+  Output:
+  ```bash
+  ASR Result:
+  我认为跑步最重要的就是给我带来了身体健康
+
+  Representation:
+  Tensor(shape=[1, 164, 1024], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[[ 0.02351918, -0.12980647,  0.17868176, ...,  0.10118122,
+          -0.04614586,  0.17853957],
+         [ 0.02361383, -0.12978461,  0.17870593, ...,  0.10103855,
+          -0.04638699,  0.17855372],
+         [ 0.02345137, -0.12982975,  0.17883906, ...,  0.10104341,
+          -0.04643029,  0.17856732],
+         ...,
+         [ 0.02313030, -0.12918393,  0.17845058, ...,  0.10073373,
+          -0.04701405,  0.17862988],
+         [ 0.02176583, -0.12929161,  0.17797582, ...,  0.10097728,
+          -0.04687393,  0.17864393],
+         [ 0.05269200,  0.01297141, -0.23336855, ..., -0.11257174,
+          -0.17227529,  0.20338398]]])
+  ```
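
Note on the input requirement in `demos/speech_ssl/README.md`: the README asks for a WAV file whose sample rate matches the model (16000 by default). Below is a minimal sketch of a pre-check that uses only Python's standard `wave` module. The helper names `matches_sample_rate` and `write_silence` are hypothetical and are not part of PaddleSpeech, and the check only covers PCM WAV files that `wave` can open.

```python
import wave


def matches_sample_rate(path: str, expected_hz: int = 16000) -> bool:
    """Return True if the PCM WAV file at `path` is sampled at `expected_hz`."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == expected_hz


def write_silence(path: str, sample_rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV file (for illustration only)."""
    n_frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
```

Calling `matches_sample_rate('./en.wav')` before running the demo would show whether a resampling prompt (the one `yes` auto-accepts) is likely to appear.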
diff --git a/demos/speech_ssl/README_cn.md b/demos/speech_ssl/README_cn.md
@@ -0,0 +1,103 @@
+(简体中文|[English](./README.md))
+
+# 语音自监督学习
+## 介绍
+语音自监督学习，指的是在大规模无标记的语音数据集上的训练方法。用这种方法训练出来的模型可以产生很好的声学表征。并且可以通过在有标签的数据集上进行微调，应用于其他下游的语音任务。
+
+这个 demo 是通过语音自监督模型将一个特定的音频文件识别成文本或产生声学表征，它可以通过使用 `PaddleSpeech` 的单个命令或 python 中的几行代码来实现。
+
+## 使用方法
+### 1. 安装
+请看[安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md)。
+
+你可以从 easy，medium，hard 三种方式中选择一种方式安装。
+
+### 2. 准备输入
+这个 demo 的输入应该是一个 WAV 文件（`.wav`），并且采样率必须与模型的采样率相同。
+
+可以下载此 demo 的示例音频：
+```bash
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+```
+### 3. 使用方法
+- 命令行 (推荐使用)
+  ```bash
+
+  # 识别文本
+  paddlespeech ssl --task asr --lang en --input ./en.wav
+
+  # 产生声学表征
+  paddlespeech ssl --task vector --lang en --input ./en.wav
+  ```
+
+  使用方法：
+  ```bash
+  paddlespeech ssl --help
+  ```
+  参数：
+  - `input`(必须输入)：用于识别的音频文件。
+  - `model`：ASR 任务的模型，默认值：`wav2vec2ASR_librispeech`。
+  - `task`：输出类别，默认值：`asr`。
+  - `lang`：模型语言，默认值：`en`。
+  - `sample_rate`：音频采样率，默认值：`16000`。
+  - `config`：ASR 任务的参数文件，若不设置则使用预训练模型中的默认配置，默认值：`None`。
+  - `ckpt_path`：模型参数文件，若不设置则下载预训练模型使用，默认值：`None`。
+  - `yes`：不需要设置额外的参数，一旦设置了该参数，说明你默认同意程序的所有请求，其中包括自动转换输入音频的采样率。默认值：`False`。
+  - `device`：执行预测的设备，默认值：当前系统下 paddlepaddle 的默认 device。
+  - `verbose`: 如果使用，显示 logger 信息。
+
+
+- Python API
+  ```python
+  import paddle
+  from paddlespeech.cli.ssl import SSLExecutor
+
+  ssl_executor = SSLExecutor()
+
+  # 识别文本
+  text = ssl_executor(
+      model='wav2vec2ASR_librispeech',
+      task='asr',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('ASR Result: \n{}'.format(text))
+
+  # 得到声学表征
+  feature = ssl_executor(
+      model='wav2vec2',
+      task='vector',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('Representation: \n{}'.format(feature))
+  ```
+
+
+  输出：
+  ```bash
+  ASR Result:
+  我认为跑步最重要的就是给我带来了身体健康
+
+  Representation:
+  Tensor(shape=[1, 164, 1024], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[[ 0.02351918, -0.12980647,  0.17868176, ...,  0.10118122,
+          -0.04614586,  0.17853957],
+         [ 0.02361383, -0.12978461,  0.17870593, ...,  0.10103855,
+          -0.04638699,  0.17855372],
+         [ 0.02345137, -0.12982975,  0.17883906, ...,  0.10104341,
+          -0.04643029,  0.17856732],
+         ...,
+         [ 0.02313030, -0.12918393,  0.17845058, ...,  0.10073373,
+          -0.04701405,  0.17862988],
+         [ 0.02176583, -0.12929161,  0.17797582, ...,  0.10097728,
+          -0.04687393,  0.17864393],
+         [ 0.05269200,  0.01297141, -0.23336855, ..., -0.11257174,
+          -0.17227529,  0.20338398]]])
+  ```
diff --git a/demos/speech_ssl/run.sh b/demos/speech_ssl/run.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+# audio download
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+
+# to recognize text 
+paddlespeech ssl --task asr --lang en --input ./en.wav
+
+# to get acoustic representation
+paddlespeech ssl --task vector --lang en --input ./en.wav
diff --git a/examples/librispeech/asr3/RESULTS.md b/examples/librispeech/asr3/RESULTS.md
@@ -1,8 +1,8 @@
 # LibriSpeech
 
 ## Wav2VecASR
-train: Epoch 1, 1*V100-32G, batchsize:10
+train: Epoch 1, 1*V100-32G, batchsize: 6
 
 | Model | Params | Config | Augmentation| Test set | Decode method | WER |  
 | --- | --- | --- | --- | --- | --- | --- |
-| wav2vec2ASR | 302.86 M | conf/wav2vec2ASR.yaml | spec_aug | test-clean | greedy search | 0.018887 |  
+| wav2vec2ASR | 302.86 M | conf/wav2vec2ASR.yaml | spec_aug | test-clean | greedy search | 0.018906 |  
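
Note on the WER column: the values are fractions, so 0.018906 is roughly 1.89%. As a reminder of how word error rate is commonly defined (word-level edit distance divided by the number of reference words), here is a minimal sketch. It is not the scoring code used to produce the table above, and the function names are illustrative; real scoring pipelines may normalize text differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

For example, `wer("a b c d", "a x c d")` counts one substitution over four reference words.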
diff --git a/examples/librispeech/asr3/conf/preprocess.yaml b/examples/librispeech/asr3/conf/preprocess.yaml
@@ -1,4 +1,3 @@
 process:
     # use raw audio
   - type: wav_process
-    dither: 0.0
diff --git a/examples/librispeech/asr3/conf/wav2vec2ASR.yaml b/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
@@ -4,16 +4,21 @@
 freeze_wav2vec2: True
 normalize_wav: True
 output_norm: True
-dnn_blocks: 2
-dnn_neurons: 1024
-blank_id: 0
-ctc_dropout_rate: 0.0
+init_type: 'kaiming_uniform' # Warning: needed for convergence
+enc:
+  input_shape: 1024
+  dnn_blocks: 2
+  dnn_neurons: 1024
+  activation: True
+ctc:
+  enc_n_units: 1024
+  blank_id: 0
+  dropout_rate: 0.0
 wav2vec2_params_path: "exp/wav2vec2/wav2vec2-large-960h-lv60-self.pdparams"
 
 ############################################
 #               Wav2Vec2.0                 #
 ############################################
-vocab_size: 32
 hidden_size: 1024
 num_hidden_layers: 24
 num_attention_heads: 16
@@ -54,9 +59,6 @@ diversity_loss_weight: 0.1
 ctc_loss_reduction: "sum"
 ctc_zero_infinity: False
 use_weighted_layer_sum: False
-pad_token_id: 0
-bos_token_id: 1
-eos_token_id: 2
 add_adapter: False
 adapter_kernel_size: 3
 adapter_stride: 2
@@ -78,7 +80,7 @@ unit_type: 'char'
 mean_std_filepath: ""
 preprocess_config: conf/preprocess.yaml
 sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs 0: disabled other: enabled for 'other' epochs 
-batch_size: 10  # Different batch_size may cause large differences in results
+batch_size: 6  # Different batch_size may cause large differences in results
 maxlen_in: 51200000000  # if input length  > maxlen-in batchsize is automatically reduced
 maxlen_out: 1500000  # if output length > maxlen-out batchsize is automatically reduced
 minibatches: 0 # for debug
@@ -106,17 +108,26 @@ audio_augment:  # for raw audio
 ###########################################
 n_epoch: 1
 accum_grad: 1
-global_grad_clip: 3.0
+global_grad_clip: 5.0
 model_optim: adadelta
 model_optim_conf:
   lr: 0.9
   epsilon: 1.0e-6
   rho: 0.95
-scheduler: constantlr    
-scheduler_conf:
+model_scheduler: constantlr    
+model_scheduler_conf:
+  warmup_steps: 25000
+  lr_decay: 1.0
+wav2vec2_optim: adadelta
+wav2vec2_optim_conf:
+  lr: 0.9
+  epsilon: 1.0e-6
+  rho: 0.95
+wav2vec2_scheduler: constantlr    
+wav2vec2_scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
 log_interval: 1
 checkpoint:
   kbest_n: 50
-  latest_n: 5
+  latest_n: 5
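
Note on the `wav2vec2ASR.yaml` changes: the diff moves several flat keys into nested `enc:` and `ctc:` sections, renames `scheduler`/`scheduler_conf` to `model_scheduler`/`model_scheduler_conf`, and drops a few token-related keys. The sketch below shows one way an older config, already loaded as a Python dict, might be adapted to that layout. `migrate_config` is a hypothetical helper and is not part of PaddleSpeech; the key mappings are read directly from the diff above.

```python
from copy import deepcopy

# Old flat key -> (new section, new key), as shown in the diff.
MOVED_KEYS = {
    "dnn_blocks": ("enc", "dnn_blocks"),
    "dnn_neurons": ("enc", "dnn_neurons"),
    "blank_id": ("ctc", "blank_id"),
    "ctc_dropout_rate": ("ctc", "dropout_rate"),
}
# Top-level keys that were only renamed.
RENAMED_KEYS = {
    "scheduler": "model_scheduler",
    "scheduler_conf": "model_scheduler_conf",
}
# Keys the diff deletes from the config.
REMOVED_KEYS = ("vocab_size", "pad_token_id", "bos_token_id", "eos_token_id")


def migrate_config(old: dict) -> dict:
    """Return a copy of `old` rearranged into the nested layout from the diff."""
    new = deepcopy(old)
    for old_key, (section, new_key) in MOVED_KEYS.items():
        if old_key in new:
            new.setdefault(section, {})[new_key] = new.pop(old_key)
    for old_key, new_key in RENAMED_KEYS.items():
        if old_key in new:
            new[new_key] = new.pop(old_key)
    for key in REMOVED_KEYS:
        new.pop(key, None)
    return new
```

The helper does not add the keys that are new in this commit (`init_type`, `enc.input_shape`, `enc.activation`, `ctc.enc_n_units`, and the `wav2vec2_optim*` / `wav2vec2_scheduler*` entries); those would still need to be filled in by hand from the updated YAML.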
diff --git a/examples/librispeech/asr3/local/train.sh b/examples/librispeech/asr3/local/train.sh
@@ -10,7 +10,8 @@ echo "using $ngpu gpus..."
 
 config_path=$1
 ckpt_name=$2
-ips=$3
+resume=$3
+ips=$4
 
 if [ ! $ips ];then
   ips_config=
@@ -21,7 +22,7 @@ fi
 mkdir -p exp
 
 # seed may break model convergence
-seed=1998
+seed=1988
 if [ ${seed} != 0 ]; then
     export FLAGS_cudnn_deterministic=True
 fi
@@ -34,13 +35,15 @@ python3 -u ${BIN_DIR}/train.py \
 --ngpu ${ngpu} \
 --config ${config_path} \
 --output exp/${ckpt_name} \
---seed ${seed} 
+--seed ${seed} \
+--resume ${resume}
 else
 python3 -m paddle.distributed.launch --gpus=${CUDA_VISIBLE_DEVICES} ${ips_config} ${BIN_DIR}/train.py \
 --ngpu ${ngpu} \
 --config ${config_path} \
 --output exp/${ckpt_name} \
---seed ${seed}
+--seed ${seed} \
+--resume ${resume}
 fi
 
 if [ ${seed} != 0 ]; then