Merge pull request #1609 from qingen/database-search

[vec][search] update to paddlespeech model
PaddlePaddle · Mar 28, 2022 · 500080a
2 parents d60856b + 612ba54

Showing 7 changed files with 187 additions and 83 deletions.
diff --git a/demos/audio_searching/README.md b/demos/audio_searching/README.md
diff --git a/demos/audio_searching/README_cn.md b/demos/audio_searching/README_cn.md
@@ -4,21 +4,26 @@
 # Audio Similarity Search
 ## Introduction
 
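-As the Internet continues to develop, unstructured data such as emails, social media photos, live-streamed video, and customer service voice recordings has become increasingly common. To process such data with a computer, it must first be converted into vectors using embedding techniques, and then stored, indexed, and queried
+As the Internet continues to develop, unstructured data such as emails, social media photos, live-streamed video, and customer service voice recordings has become increasingly common. To process such data with a computer, it must first be converted into vectors using embedding techniques, and then stored, indexed, and queried.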
 
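-However, when the data volume is large, for example a similarity search over hundreds of millions of audio clips, this becomes difficult. Exhaustive search is feasible but very time-consuming. For this scenario, this demo shows how to build an audio similarity retrieval system with the open-source vector database Milvus
+However, when the data volume is large, for example a similarity search over hundreds of millions of audio clips, this becomes difficult. Exhaustive search is feasible but very time-consuming. For this scenario, this demo shows how to build an audio similarity retrieval system with the open-source vector database Milvus.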
 
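-Audio retrieval (e.g. retrieval of speech, music, or speakers) queries massive audio data to find segments with similar sounds (or the same speaker). An audio similarity retrieval system can be used to identify similar sound effects and minimize intellectual property infringement; it can also quickly search a voiceprint library and help enterprises control fraud and identity theft. Audio retrieval also plays an important role in the classification and statistical analysis of audio data
+Audio retrieval (e.g. retrieval of speech, music, or speakers) queries massive audio data to find segments with similar sounds (or the same speaker). An audio similarity retrieval system can be used to identify similar sound effects and minimize intellectual property infringement; it can also quickly search a voiceprint library and help enterprises control fraud and identity theft. Audio retrieval also plays an important role in the classification and statistical analysis of audio data.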
 
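-In this demo, you will learn how to build an audio retrieval system that searches for similar sound clips. Uploaded audio clips are converted into vector data using pretrained PaddleSpeech models (audio classification models, speaker recognition models, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, and the ID is then stored in MySQL together with the corresponding audio information (audio ID, speaker ID of the audio, etc.), which completes the library-building step. At search time, the user uploads a test audio clip, which is converted into a vector; a vector similarity search is then run in Milvus, which returns vector IDs as results, and the corresponding audio information is looked up in MySQL by ID
+In this demo, you will learn how to build an audio retrieval system that searches for similar sound clips. Uploaded audio clips are converted into vector data using pretrained PaddleSpeech models (audio classification models, speaker recognition models, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, and the ID is then stored in MySQL together with the corresponding audio information (audio ID, speaker ID of the audio, etc.), which completes the library-building step. At search time, the user uploads a test audio clip, which is converted into a vector; a vector similarity search is then run in Milvus, which returns vector IDs as results, and the corresponding audio information is looked up in MySQL by ID.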
 
 ![Audio retrieval flow chart](./img/audio_searching.png)
 
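-Note: this demo uses the [CN-Celeb](http://openslr.org/82/) dataset, which contains at least 650000 audio clips from 3000 speakers, to build the audio vector library (audio features, or speaker features); audio (or speaker) retrieval is then performed with a preset distance metric. Other datasets can be used instead as needed, such as Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc
+Note: this demo uses the [CN-Celeb](http://openslr.org/82/) dataset, which contains at least 650000 audio clips from 3000 speakers, to build the audio vector library (audio features, or speaker features); audio (or speaker) retrieval is then performed with a preset distance metric. Other datasets can be used instead as needed, such as Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc.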
 
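 ## Usage
-### 1. MySQL and Milvus Installation
-The audio similarity search system requires the Milvus and MySQL services. We can start these containers with one click using [docker-compose.yaml](./docker-compose.yaml), so please make sure [Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) are installed before running. That is
+### 1. PaddleSpeech Installation
+Extracting audio vectors requires a model trained with PaddleSpeech, so please make sure PaddleSpeech is installed before running. For detailed installation steps, see the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
+
+You can choose one of the three installation methods: easy, medium, or hard.
+
+### 2. MySQL and Milvus Installation
+Audio similarity retrieval requires the Milvus and MySQL services. We can start these containers with one click using [docker-compose.yaml](./docker-compose.yaml), so please make sure [Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) are installed before running. That is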
 
 ```bash
 docker-compose -f docker-compose.yaml up -d
@@ -47,8 +52,8 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
 
 ```
 
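-### 2. Configure and Start the API Service
-Start the system service program, which provides an Http-based backend service
+### 3. Configure and Start the API Service
+Start the system service program, which provides an HTTP-based backend service.
 
 - Install the Python base packages that the service depends on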
 
@@ -77,24 +82,24 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
   Start the service built with FastAPI
 
   ```bash
-  export PYTHONPATH=$PYTHONPATH:./src
+  export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
   python src/main.py
   ```
 
   You will then see the application start:
 
   ```bash
-  INFO:     Started server process [3949]
-  2022-03-07 17:39:14,864 ｜ INFO ｜ server.py ｜ serve ｜ 75 ｜ Started server process [3949]
+  INFO:     Started server process [13352]
+  2022-03-26 22:45:30,838 ｜ INFO ｜ server.py ｜ serve ｜ 75 ｜ Started server process [13352]
   INFO:     Waiting for application startup.
-  2022-03-07 17:39:14,865 ｜ INFO ｜ on.py ｜ startup ｜ 45 ｜ Waiting for application startup.
+  2022-03-26 22:45:30,839 ｜ INFO ｜ on.py ｜ startup ｜ 45 ｜ Waiting for application startup.
   INFO:     Application startup complete.
-  2022-03-07 17:39:14,866 ｜ INFO ｜ on.py ｜ startup ｜ 59 ｜ Application startup complete.
+  2022-03-26 22:45:30,839 ｜ INFO ｜ on.py ｜ startup ｜ 59 ｜ Application startup complete.
   INFO:     Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
-  2022-03-07 17:39:14,867 ｜ INFO ｜ server.py ｜ _log_started_message ｜ 206 ｜ Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
+  2022-03-26 22:45:30,840 ｜ INFO ｜ server.py ｜ _log_started_message ｜ 206 ｜ Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
   ```
 
-### 3. Testing
+### 4. Testing
 - Prepare the data
   ```bash
   wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz 
@@ -110,40 +115,88 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
 
     Output:
     ```bash
-    Checkpoint path: %your model path%
+    Downloading https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz ...
+    ...
+    Unpacking ./example_audio.tar.gz ...
+    [2022-03-26 22:50:54,987] [    INFO] - checking the aduio file format......
+    [2022-03-26 22:50:54,987] [    INFO] - The sample rate is 16000
+    [2022-03-26 22:50:54,987] [    INFO] - The audio file format is right
+    [2022-03-26 22:50:54,988] [    INFO] - device type: cpu
+    [2022-03-26 22:50:54,988] [    INFO] - load the pretrained model: ecapatdnn_voxceleb12-16k
+    [2022-03-26 22:50:54,990] [    INFO] - Downloading sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz from https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz
+    ...
+    [2022-03-26 22:51:17,285] [    INFO] - start to dynamic import the model class
+    [2022-03-26 22:51:17,285] [    INFO] - model name ecapatdnn
+    [2022-03-26 22:51:23,864] [    INFO] - start to set the model parameters to model
+    [2022-03-26 22:54:08,115] [    INFO] - create the model instance success
+    [2022-03-26 22:54:08,116] [    INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_
+    searching/example_audio/knife_hit_iron3.wav
+    [2022-03-26 22:54:08,116] [    INFO] - load the audio sample points, shape is: (11012,)
+    [2022-03-26 22:54:08,150] [    INFO] - extract the audio feat, shape is: (80, 69)
+    [2022-03-26 22:54:08,152] [    INFO] - feats shape: [1, 80, 69]
+    [2022-03-26 22:54:08,154] [    INFO] - audio extract the feat success
+    [2022-03-26 22:54:08,155] [    INFO] - start to do backbone network model forward
+    [2022-03-26 22:54:08,155] [    INFO] - feats shape:[1, 80, 69], lengths shape: [1]
+    [2022-03-26 22:54:08,433] [    INFO] - embedding size: (192,)
     Extracting feature from audio No. 1 , 20 audios in total
+    [2022-03-26 22:54:08,435] [    INFO] - checking the aduio file format......
+    [2022-03-26 22:54:08,435] [    INFO] - The sample rate is 16000
+    [2022-03-26 22:54:08,436] [    INFO] - The audio file format is right
+    [2022-03-26 22:54:08,436] [    INFO] - device type: cpu
+    [2022-03-26 22:54:08,436] [    INFO] - Model has been initialized
+    [2022-03-26 22:54:08,436] [    INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/sword_wielding.wav
+    [2022-03-26 22:54:08,436] [    INFO] - load the audio sample points, shape is: (6391,)
+    [2022-03-26 22:54:08,452] [    INFO] - extract the audio feat, shape is: (80, 40)
+    [2022-03-26 22:54:08,454] [    INFO] - feats shape: [1, 80, 40]
+    [2022-03-26 22:54:08,454] [    INFO] - audio extract the feat success
+    [2022-03-26 22:54:08,454] [    INFO] - start to do backbone network model forward
+    [2022-03-26 22:54:08,455] [    INFO] - feats shape:[1, 80, 40], lengths shape: [1]
+    [2022-03-26 22:54:08,633] [    INFO] - embedding size: (192,)
     Extracting feature from audio No. 2 , 20 audios in total
     ...
-    2022-03-09 17:22:13,870 ｜ INFO ｜ main.py ｜ load_audios ｜ 85 ｜ Successfully loaded data, total count: 20
-    2022-03-09 17:22:13,898 ｜ INFO ｜ main.py ｜ count_audio ｜ 147 ｜ Successfully count the number of data!
-    2022-03-09 17:22:13,918 ｜ INFO ｜ main.py ｜ audio_path ｜ 57 ｜ Successfully load audio: ./example_audio/test.wav
+    2022-03-26 22:54:15,892 ｜ INFO ｜ main.py ｜ load_audios ｜ 85 ｜ Successfully loaded data, total count: 20
+    2022-03-26 22:54:15,908 ｜ INFO ｜ main.py ｜ count_audio ｜ 148 ｜ Successfully count the number of data!
+    [2022-03-26 22:54:15,916] [    INFO] - checking the aduio file format......
+    [2022-03-26 22:54:15,916] [    INFO] - The sample rate is 16000
+    [2022-03-26 22:54:15,916] [    INFO] - The audio file format is right
+    [2022-03-26 22:54:15,916] [    INFO] - device type: cpu
+    [2022-03-26 22:54:15,916] [    INFO] - Model has been initialized
+    [2022-03-26 22:54:15,916] [    INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/test.wav
+    [2022-03-26 22:54:15,917] [    INFO] - load the audio sample points, shape is: (8456,)
+    [2022-03-26 22:54:15,923] [    INFO] - extract the audio feat, shape is: (80, 53)
+    [2022-03-26 22:54:15,924] [    INFO] - feats shape: [1, 80, 53]
+    [2022-03-26 22:54:15,924] [    INFO] - audio extract the feat success
+    [2022-03-26 22:54:15,924] [    INFO] - start to do backbone network model forward
+    [2022-03-26 22:54:15,924] [    INFO] - feats shape:[1, 80, 53], lengths shape: [1]
+    [2022-03-26 22:54:16,051] [    INFO] - embedding size: (192,)
     ...
-    2022-03-09 17:22:32,580 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 131 ｜ search result http://testserver/data?audio_path=./example_audio/test.wav, distance 0.0
-    2022-03-09 17:22:32,580 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 131 ｜ search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, distance 0.021805256605148315
-    2022-03-09 17:22:32,580 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 131 ｜ search result http://testserver/data?audio_path=./example_audio/knife_cut_into_flesh.wav, distance 0.052762262523174286
+    2022-03-26 22:54:16,086 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 132 ｜ search result http://testserver/data?audio_path=./example_audio/test.wav, score 100.0
+    2022-03-26 22:54:16,087 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 132 ｜ search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, score 29.182177782058716
+    2022-03-26 22:54:16,087 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 132 ｜ search result http://testserver/data?audio_path=./example_audio/knife_cut_into_body.wav, score 22.73637056350708
     ...
-    2022-03-09 17:22:32,582 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 135 ｜ Successfully searched similar audio!
-    2022-03-09 17:22:33,658 ｜ INFO ｜ main.py ｜ drop_tables ｜ 159 ｜ Successfully drop tables in Milvus and MySQL!
+    2022-03-26 22:54:16,088 ｜ INFO ｜ main.py ｜ search_local_audio ｜ 136 ｜ Successfully searched similar audio!
+    2022-03-26 22:54:17,164 ｜ INFO ｜ main.py ｜ drop_tables ｜ 160 ｜ Successfully drop tables in Milvus and MySQL!
     ```
+
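   - Front-end test (optional)
 
     Enter 127.0.0.1:8068 in the browser to access the front-end page
 
-    Note: if the browser and the service are not on the same machine, change the IP to that of the machine hosting the service, update the corresponding API_URL in docker-compose.yaml as well, and restart the service
+    Note: if the browser and the service are not on the same machine, change the IP to that of the machine hosting the service, update the corresponding API_URL in docker-compose.yaml as well, and then rerun the docker-compose.yaml file for the change to take effect.
 
     - Upload audio
 
-      Download the data and extract it to a folder, say /home/speech/data, then enter /home/speech/data in the address bar of the upload page to upload the data
+      On the server, download the data and extract it to a folder, say /home/speech/data/, then enter /home/speech/data/ in the address bar of the upload page to upload the data
 
       ![](./img/insert.png)
 
     - Search for similar audio
 
-      Select the magnifying glass in the upper-left corner, click the “Default Target Audio File” button, and upload a test audio clip; you will then see the search results
+      Select the magnifying glass in the upper-left corner, click the “Default Target Audio File” button, and upload a test audio clip from the client; you will then see the search results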
 
       ![](./img/search.png)
 
-### 4. Results
+### 5. Results
 
 Machine configuration:
 - OS: CentOS release 7.6
@@ -158,9 +211,9 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
 
   ![](./img/result.png)
 
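-With recall at 90%, the Milvus-based retrieval framework takes about 2.9 ms per search; adding feature extraction (embedding), which takes about 500ms (for test audio about 5s long), the total time per test audio clip is about 503 ms, which is sufficient for most application scenarios
+With recall at 90%, the Milvus-based retrieval framework takes about 2.9 ms per search; adding feature extraction (embedding), which takes about 500 ms (for test audio about 5 s long), the total time per test audio clip is about 503 ms, which is sufficient for most application scenarios.
 
-### 5. Pretrained Models
+### 6. Pretrained Models
 
 The following pretrained models are provided by PaddleSpeech: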
 

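Reviewer note on the README flow: the introduction describes a two-store design, in which the vector database assigns an ID to each embedding and a relational table maps that ID back to the audio information. The sketch below illustrates that flow with in-memory stand-ins only. `ToyVectorStore`, `build_library`, and `search_similar` are hypothetical names invented for illustration; they are not the Milvus or MySQL APIs, and the exhaustive L2 scan stands in for whatever index a real deployment would use.

```python
import math
from typing import Dict, List, Tuple


class ToyVectorStore:
    """In-memory stand-in for the vector database (hypothetical, not Milvus)."""

    def __init__(self) -> None:
        self._vectors: Dict[int, List[float]] = {}
        self._next_id = 0

    def insert(self, vector: List[float]) -> int:
        # The store, not the caller, assigns the unique ID.
        vec_id = self._next_id
        self._next_id += 1
        self._vectors[vec_id] = vector
        return vec_id

    def search(self, query: List[float], top_k: int) -> List[Tuple[int, float]]:
        # Exhaustive L2 scan; a real vector database would use an index.
        scored = [(vec_id, math.dist(query, vec))
                  for vec_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1])
        return scored[:top_k]


def build_library(store: ToyVectorStore,
                  metadata: Dict[int, str],
                  clips: List[Tuple[str, List[float]]]) -> None:
    # Library building: store the vector, then record ID -> audio info.
    for path, vector in clips:
        metadata[store.insert(vector)] = path


def search_similar(store: ToyVectorStore,
                   metadata: Dict[int, str],
                   query: List[float],
                   top_k: int = 2) -> List[Tuple[str, float]]:
    # Search: the vector store returns IDs, the metadata table resolves them.
    return [(metadata[vec_id], dist)
            for vec_id, dist in store.search(query, top_k)]


store = ToyVectorStore()
metadata: Dict[int, str] = {}  # stands in for the relational table
build_library(store, metadata, [
    ("a.wav", [1.0, 0.0]),
    ("b.wav", [0.0, 1.0]),
    ("c.wav", [0.9, 0.1]),
])
results = search_similar(store, metadata, [1.0, 0.0])
```

Keeping the ID-to-metadata mapping outside the vector store is what lets the search step return only IDs and distances, as the README describes.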
diff --git a/demos/audio_searching/img/search.png b/demos/audio_searching/img/search.png
diff --git a/demos/audio_searching/requirements.txt b/demos/audio_searching/requirements.txt
@@ -1,7 +1,8 @@
 diskcache==5.2.1
+dtaidistance==2.3.1
 fastapi
 librosa==0.8.0
-numpy
+numpy==1.21.0
 pydantic
 pymilvus==2.0.1
 pymysql

diff --git a/demos/audio_searching/src/config.py b/demos/audio_searching/src/config.py
@@ -16,7 +16,7 @@
 ############### Milvus Configuration ###############
 MILVUS_HOST = os.getenv("MILVUS_HOST", "127.0.0.1")
 MILVUS_PORT = int(os.getenv("MILVUS_PORT", "19530"))
-VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "2048"))
+VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "192"))
 INDEX_FILE_SIZE = int(os.getenv("INDEX_FILE_SIZE", "1024"))
 METRIC_TYPE = os.getenv("METRIC_TYPE", "L2")
 DEFAULT_TABLE = os.getenv("DEFAULT_TABLE", "audio_table")
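
Reviewer note: the new default of 192 matches the `embedding size: (192,)` lines in the README log, but an environment that still exports `VECTOR_DIMENSION=2048` would override it. A minimal sketch of a guard that could catch such a mismatch early is shown below; `read_int_setting` and `check_dimension` are hypothetical helpers, not part of this change.

```python
import os
from typing import Sequence


def read_int_setting(name: str, default: str) -> int:
    # Same pattern as config.py: the environment variable wins over the default.
    return int(os.getenv(name, default))


def check_dimension(embedding: Sequence[float], expected_dim: int) -> None:
    # Hypothetical guard: fail fast before inserting into the collection.
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} values, expected {expected_dim}")


# With the variable unset, the default applies.
os.environ.pop("VECTOR_DIMENSION", None)
default_dim = read_int_setting("VECTOR_DIMENSION", "192")

# A stale value in the environment silently overrides the new default.
os.environ["VECTOR_DIMENSION"] = "2048"
overridden_dim = read_int_setting("VECTOR_DIMENSION", "192")
os.environ.pop("VECTOR_DIMENSION", None)

check_dimension([0.0] * 192, default_dim)  # passes without raising
```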

diff --git a/demos/audio_searching/src/encode.py b/demos/audio_searching/src/encode.py
@@ -15,24 +15,22 @@
 
 import librosa
 import numpy as np
+from config import DEFAULT_TABLE
+
 from logs import LOGGER
+from paddlespeech.cli import VectorExecutor
+
+vector_executor = VectorExecutor()
 
 
 def get_audio_embedding(path):
     """
     Use vpr_inference to generate embedding of audio
     """
     try:
-        RESAMPLE_RATE = 16000
-        audio, _ = librosa.load(path, sr=RESAMPLE_RATE, mono=True)
-
-        # TODO add infer/python interface to get embedding, now fake it by rand
-        # vpr = ECAPATDNN(checkpoint_path=None, device='cuda')
-        # embedding = vpr.inference(audio)
-        np.random.seed(hash(os.path.basename(path)) % 1000000)
-        embedding = np.random.rand(1, 2048)
+        embedding = vector_executor(audio_file=path)
         embedding = embedding / np.linalg.norm(embedding)
-        embedding = embedding.tolist()[0]
+        embedding = embedding.tolist()
         return embedding
     except Exception as e:
         LOGGER.error(f"Error with embedding:{e}")
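
Reviewer note: the embedding is divided by its norm before insertion, and `config.py` defaults `METRIC_TYPE` to `L2`. For unit-length vectors the squared L2 distance equals `2 - 2 * cosine`, so ranking by L2 distance should match ranking by cosine similarity. The sketch below checks that identity in plain Python. It is an illustration under that assumption, not the demo's code, and `l2_normalize` is a hypothetical helper; it also rejects a zero vector, a case where the NumPy division in the diff would presumably yield NaN values.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    # Scale the vector to unit length; refuse a zero vector explicitly.
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


a = l2_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # approximately [0.8, 0.6]

# For unit vectors, the dot product is the cosine similarity.
cosine = sum(x * y for x, y in zip(a, b))
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
```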

diff --git a/demos/audio_searching/src/test_main.py b/demos/audio_searching/src/test_main.py
@@ -11,24 +11,24 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import zipfile
-
-import gdown
 from fastapi.testclient import TestClient
 from main import app
 
+from utils.utility import download
+from utils.utility import unpack
+
 client = TestClient(app)
 
 
 def download_audio_data():
     """
     download audio data
     """
-    url = 'https://drive.google.com/uc?id=1bKu21JWBfcZBuEuzFEvPoAX6PmRrgnUp'
-    gdown.download(url)
-
-    with zipfile.ZipFile('example_audio.zip', 'r') as zip_ref:
-        zip_ref.extractall('./example_audio')
+    url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
+    md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
+    target_dir = "./"
+    filepath = download(url, md5sum, target_dir)
+    unpack(filepath, target_dir, True)
 
 
 def test_drop():