Streaming Speech Synthesis Service

Introduction

This demo shows how to start the streaming speech synthesis service and how to access it. Each step takes a single command with paddlespeech_server and paddlespeech_client, or a few lines of Python.

For service interface definition, please check:

PaddleSpeech Server RESTful API
PaddleSpeech Streaming Server WebSocket API

Usage

1. Installation

See installation.

It is recommended to use paddlepaddle 2.3.1 or above.

You can install paddlespeech in one of three ways: easy, medium, or hard.

If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.

2. Prepare config File

The configuration file can be found in conf/tts_online_application.yaml.

- protocol indicates the network protocol used by the streaming TTS service. Currently, both http and websocket are supported.
- engine_list lists the speech engines to include in the service to be started, each in the format <speech task>_<engine type>.
  - This demo covers the streaming speech synthesis service, so the speech task should be set to tts.
  - The engine type takes one of two values: online and online-onnx. online is an engine that uses Python for dynamic graph inference; online-onnx is an engine that uses onnxruntime for inference. online-onnx is the faster of the two.
- Streaming TTS engine model support: the AM (acoustic model) can be fastspeech2 or fastspeech2_cnndecoder; the voc (vocoder) model can be hifigan or mb_melgan.
- In streaming AM inference, one chunk of data is inferred at a time to achieve a streaming effect. am_block is the number of valid frames in a chunk, and am_pad is the number of frames added before and after am_block in a chunk. am_pad exists to eliminate the errors introduced by chunked inference, so that streaming does not degrade the quality of the synthesized audio.
  - fastspeech2 does not support streaming AM inference, so am_pad and am_block have no effect on it.
  - fastspeech2_cnndecoder supports streaming inference. When am_pad=12, the audio synthesized by streaming inference is consistent with the non-streaming audio.
- In streaming voc inference, one chunk of data is inferred at a time to achieve a streaming effect. voc_block is the number of valid frames in a chunk, and voc_pad is the number of frames added before and after voc_block in a chunk. voc_pad exists to eliminate the errors introduced by chunked inference, so that streaming does not degrade the quality of the synthesized audio.
  - Both hifigan and mb_melgan support streaming voc inference.
  - With mb_melgan, voc_pad=14 makes the streaming audio consistent with the non-streaming audio. voc_pad can be lowered to a minimum of 7 with no audible artifacts; below 7, the synthesized audio sounds abnormal.
  - With hifigan, voc_pad=19 makes the streaming audio consistent with the non-streaming audio; with voc_pad=14, the synthesized audio has no audible artifacts.
  - Pad calculation method of streaming vocoder in PaddleSpeech: AIStudio tutorial
- Inference speed: mb_melgan > hifigan; audio quality: mb_melgan < hifigan.
- Note: If the service starts normally in a container but clients cannot reach its IP address, try replacing the host address in the configuration file with the machine's local IP address.
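The block and pad settings described above can be illustrated with a small sketch. It is not the PaddleSpeech implementation: the function names chunk_bounds, smooth and smooth_streaming are made up for this example, the block size of 42 is arbitrary, and a toy moving average stands in for a model with a limited receptive field. It shows why enough padding makes chunked output match the non-streaming output, and why too little padding changes the frames near chunk borders.

```python
import math


def chunk_bounds(n_frames, block, pad):
    """Return (start, end, keep_from, keep_to) for every chunk.

    start/end delimit the padded chunk inside the full sequence;
    keep_from/keep_to are offsets inside that chunk selecting the
    valid (non-pad) frames.
    """
    bounds = []
    for i in range(math.ceil(n_frames / block)):
        valid_start = i * block
        valid_end = min(valid_start + block, n_frames)
        start = max(0, valid_start - pad)      # pad frames before the block
        end = min(n_frames, valid_end + pad)   # pad frames after the block
        bounds.append((start, end, valid_start - start, valid_end - start))
    return bounds


def smooth(frames, half_width):
    """Toy model: moving average over +/- half_width frames (hypothetical)."""
    out = []
    for i in range(len(frames)):
        lo = max(0, i - half_width)
        hi = min(len(frames), i + half_width + 1)
        window = frames[lo:hi]
        out.append(sum(window) / len(window))
    return out


def smooth_streaming(frames, block, pad, half_width):
    """Run the toy model chunk by chunk and keep only the valid frames."""
    out = []
    for start, end, keep_from, keep_to in chunk_bounds(len(frames), block, pad):
        chunk_out = smooth(frames[start:end], half_width)
        out.extend(chunk_out[keep_from:keep_to])
    return out


frames = [float((i * 7) % 13) for i in range(100)]
full = smooth(frames, 3)

print(chunk_bounds(100, 42, 12))
# [(0, 54, 0, 42), (30, 96, 12, 54), (72, 100, 12, 28)]
print(smooth_streaming(frames, 42, 3, 3) == full)  # True: pad covers the context
print(smooth_streaming(frames, 42, 0, 3) == full)  # False: chunk borders differ
```

In this toy case the pad only has to cover the averaging window. For the real models, the pad values quoted above play the same role.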

3. Streaming speech synthesis server and client using http protocol

3.1 Server Usage

Command Line (Recommended)

Start the service (the configuration file uses http by default):

paddlespeech_server start --config_file ./conf/tts_online_application.yaml

Usage:

paddlespeech_server start --help

Arguments:

config_file: yaml file of the app. Default: ./conf/tts_online_application.yaml
log_file: log file. Default: ./log/paddlespeech.log

Output:

[2022-04-24 20:05:27,887] [    INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
[2022-04-24 20:05:28,038] [    INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
[2022-04-24 20:05:28,191] [    INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
[2022-04-24 20:05:28,192] [    INFO] - **********************************************************************
INFO:     Started server process [14638]
[2022-04-24 20:05:28] [INFO] [server.py:75] Started server process [14638]
INFO:     Waiting for application startup.
[2022-04-24 20:05:28] [INFO] [on.py:45] Waiting for application startup.
INFO:     Application startup complete.
[2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)

Python API

from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

server_executor = ServerExecutor()
server_executor(
    config_file="./conf/tts_online_application.yaml", 
    log_file="./log/paddlespeech.log")

Output:

[2022-04-24 21:00:16,934] [    INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [    INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
[2022-04-24 21:00:17,151] [    INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
[2022-04-24 21:00:17,151] [    INFO] - **********************************************************************
INFO:     Started server process [320]
[2022-04-24 21:00:17] [INFO] [server.py:75] Started server process [320]
INFO:     Waiting for application startup.
[2022-04-24 21:00:17] [INFO] [on.py:45] Waiting for application startup.
INFO:     Application startup complete.
[2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)

3.2 Streaming TTS client Usage

Command Line (Recommended)

Access http streaming TTS service:

If 127.0.0.1 is not accessible, you need to use the actual service IP address.
```
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好，欢迎使用百度飞桨语音合成服务。" --output output.wav
```
Usage:
```
paddlespeech_client tts_online --help
```
Arguments:
- server_ip: server ip. Default: 127.0.0.1
- port: server port. Default: 8092
- protocol: Service protocol, choices: [http, websocket]. Default: http
- input (required): Input text to synthesize.
- spk_id: Speaker id for multi-speaker text to speech. Default: 0
- output: Client output wave filepath. Default: None, which means the audio is not saved locally.
- play: Whether to play the audio while it is being synthesized. Default: False, which means no playback. Playback requires the pyaudio library.
- Currently, the code supports only the single-speaker model, so spk_id has no effect. Streaming TTS does not support changing the sample rate, speed, or volume.
Output:
```
[2022-04-24 21:08:18,559] [    INFO] - tts http client start
[2022-04-24 21:08:21,702] [    INFO] - 句子：您好，欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:08:21,703] [    INFO] - 首包响应：0.18863153457641602 s
[2022-04-24 21:08:21,704] [    INFO] - 尾包响应：3.1427218914031982 s
[2022-04-24 21:08:21,704] [    INFO] - 音频时长：3.825 s
[2022-04-24 21:08:21,704] [    INFO] - RTF: 0.8216266382753459
[2022-04-24 21:08:21,739] [    INFO] - 音频保存至：output.wav
```
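In the client log above, 句子 is the input sentence, 首包响应 is the first-packet response time, 尾包响应 is the last-packet response time, 音频时长 is the audio duration, and 音频保存至 is the path the audio was saved to. The logged RTF (real-time factor) matches the last-packet response time divided by the audio duration. That reading is inferred from the logged numbers, not taken from the client source, and the function name real_time_factor below is made up for this sketch.

```python
def real_time_factor(last_packet_s, audio_duration_s):
    """Synthesis time divided by audio length; below 1.0 is faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return last_packet_s / audio_duration_s


# Values copied from the log above.
rtf = real_time_factor(3.1427218914031982, 3.825)
print(f"RTF = {rtf:.4f}")  # RTF = 0.8216
```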

Python API

from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

executor = TTSOnlineClientExecutor()
executor(
    input="您好，欢迎使用百度飞桨语音合成服务。",
    server_ip="127.0.0.1",
    port=8092,
    protocol="http",
    spk_id=0,
    output="./output.wav",
    play=False)

Output:

[2022-04-24 21:11:13,798] [    INFO] - tts http client start
[2022-04-24 21:11:16,800] [    INFO] - 句子：您好，欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:11:16,801] [    INFO] - 首包响应：0.18234872817993164 s
[2022-04-24 21:11:16,801] [    INFO] - 尾包响应：3.0013909339904785 s
[2022-04-24 21:11:16,802] [    INFO] - 音频时长：3.825 s
[2022-04-24 21:11:16,802] [    INFO] - RTF: 0.7846773683635238
[2022-04-24 21:11:16,837] [    INFO] - 音频保存至：./output.wav
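The client writes the synthesized audio to the path given by output. As a rough illustration of what saving a streamed result involves, the sketch below appends raw PCM chunks to a WAV file with Python's standard wave module. The function name write_pcm_chunks, the 24000 Hz sample rate, and the 16-bit mono format are assumptions made for the example, not values taken from PaddleSpeech.

```python
import io
import wave


def write_pcm_chunks(chunks, wav_file, sample_rate=24000, sample_width=2, channels=1):
    """Write raw PCM byte chunks to a WAV file; return the duration in seconds.

    wav_file may be a path or a seekable binary file object.
    The default audio format is an assumption of this sketch.
    """
    n_bytes = 0
    with wave.open(wav_file, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)  # append as each chunk arrives
            n_bytes += len(chunk)
    return n_bytes / (sample_width * channels * sample_rate)


# Three chunks of 0.1 s of silence stand in for streamed audio packets.
silence = b"\x00\x00" * 2400
buffer = io.BytesIO()
duration = write_pcm_chunks([silence] * 3, buffer)
print(f"wrote {duration:.1f} s of audio")  # wrote 0.3 s of audio
```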

4. Streaming speech synthesis server and client using websocket protocol

4.1 Server Usage

Command Line (Recommended)

First modify the configuration file conf/tts_online_application.yaml and set protocol to websocket. Then start the service:

paddlespeech_server start --config_file ./conf/tts_online_application.yaml
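The protocol change described above can also be scripted before the server is started. The following is a minimal sketch that edits the YAML text directly, assuming the file has a single top-level line of the form protocol: 'http'. The function name set_protocol and the sample configuration text are made up for illustration; check the real file's layout before relying on it.

```python
import re


def set_protocol(yaml_text, protocol):
    """Return yaml_text with the value of the top-level 'protocol' key replaced."""
    if protocol not in ("http", "websocket"):
        raise ValueError("protocol must be 'http' or 'websocket'")
    # Match `protocol: value`, keeping optional quotes and any trailing comment.
    pattern = re.compile(r"^(protocol:[ \t]*)(['\"]?)[\w-]+\2(.*)$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{protocol}{m.group(2)}{m.group(3)}",
        yaml_text,
        count=1,
    )
    if count == 0:
        raise KeyError("no top-level 'protocol' key found")
    return new_text


# Hypothetical configuration fragment, not the real file contents.
sample = "host: 0.0.0.0\nport: 8092\nprotocol: 'http'  # http or websocket\n"
print(set_protocol(sample, "websocket"))
```

To apply it to the real file, read conf/tts_online_application.yaml, pass its text through the function, and write the result back.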

Usage:

paddlespeech_server start --help

Arguments:

config_file: yaml file of the app. Default: ./conf/tts_online_application.yaml
log_file: log file. Default: ./log/paddlespeech.log

Output:

[2022-04-27 10:18:09,107] [    INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [    INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [    INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
[2022-04-27 10:18:09,325] [    INFO] - **********************************************************************
INFO:     Started server process [17600]
[2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
INFO:     Waiting for application startup.
[2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
INFO:     Application startup complete.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)

Python API

from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

server_executor = ServerExecutor()
server_executor(
    config_file="./conf/tts_online_application.yaml", 
    log_file="./log/paddlespeech.log")

Output:

[2022-04-27 10:20:16,660] [    INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [    INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [    INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
[2022-04-27 10:20:16,878] [    INFO] - **********************************************************************
INFO:     Started server process [23466]
[2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
INFO:     Waiting for application startup.
[2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
INFO:     Application startup complete.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)

4.2 Streaming TTS client Usage

Command Line (Recommended)

Access websocket streaming TTS service:

If 127.0.0.1 is not accessible, you need to use the actual service IP address.
```
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好，欢迎使用百度飞桨语音合成服务。" --output output.wav
```
Usage:
```
paddlespeech_client tts_online --help
```
Arguments:
- server_ip: server ip. Default: 127.0.0.1
- port: server port. Default: 8092
- protocol: Service protocol, choices: [http, websocket]. Default: http
- input (required): Input text to synthesize.
- spk_id: Speaker id for multi-speaker text to speech. Default: 0
- output: Client output wave filepath. Default: None, which means the audio is not saved locally.
- play: Whether to play the audio while it is being synthesized. Default: False, which means no playback. Playback requires the pyaudio library.
- Currently, the code supports only the single-speaker model, so spk_id has no effect. Streaming TTS does not support changing the sample rate, speed, or volume.
Output:
```
[2022-04-27 10:21:04,262] [    INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [    INFO] - 句子：您好，欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [    INFO] - 首包响应：0.2124948501586914 s
[2022-04-27 10:21:07,483] [    INFO] - 尾包响应：3.199106454849243 s
[2022-04-27 10:21:07,484] [    INFO] - 音频时长：3.825 s
[2022-04-27 10:21:07,484] [    INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [    INFO] - 音频保存至：output.wav
```

Python API

from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

executor = TTSOnlineClientExecutor()
executor(
    input="您好，欢迎使用百度飞桨语音合成服务。",
    server_ip="127.0.0.1",
    port=8092,
    protocol="websocket",
    spk_id=0,
    output="./output.wav",
    play=False)

Output:

[2022-04-27 10:22:48,852] [    INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [    INFO] - 句子：您好，欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [    INFO] - 首包响应：0.21017956733703613 s
[2022-04-27 10:22:52,100] [    INFO] - 尾包响应：3.2304444313049316 s
[2022-04-27 10:22:52,101] [    INFO] - 音频时长：3.825 s
[2022-04-27 10:22:52,101] [    INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [    INFO] - 音频保存至：./output.wav
