[WIP] Add text normalization for multi_cn recipes: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. #760

shanguanma · 2022-06-28T03:54:02Z

This PR removes some punctuation and converts full-width English characters into half-width English characters in these recipes.
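As a rough illustration of the full-width to half-width conversion described above, here is a minimal sketch that uses `str.translate` instead of one `str.replace` call per letter. The names `FULLWIDTH_LETTER_TABLE` and `to_half_width` are hypothetical and do not come from the PR. The sketch covers only full-width Latin letters (U+FF21 to U+FF3A and U+FF41 to U+FF5A), which sit at a fixed offset of 0xFEE0 from their ASCII counterparts.

```python
# Hypothetical helper, not taken from the PR.
# Full-width Latin letters are offset by 0xFEE0 from their ASCII counterparts.
FULLWIDTH_LETTERS = list(range(0xFF21, 0xFF3B)) + list(range(0xFF41, 0xFF5B))
FULLWIDTH_LETTER_TABLE = {cp: cp - 0xFEE0 for cp in FULLWIDTH_LETTERS}


def to_half_width(text: str) -> str:
    """Replace full-width Latin letters with their ASCII equivalents."""
    return text.translate(FULLWIDTH_LETTER_TABLE)


# The full-width "Ｎ" becomes an ASCII "N"; Chinese characters are left untouched.
print(to_half_width("上面是一般现在对然后然后下面呢 HE IS ALWAYS FINISHIＮG"))
```

Whether full-width digits and punctuation should be mapped as well is a design choice that the description above does not settle, so the table is limited to letters.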


pzelasko · 2022-06-28T11:27:07Z

lhotse/recipes/tal_csasr.py

+    sed "s/[ ][ ]*$//g" | sed "s/\[//g" | sed 's/、//g'
+    210_40223_210_6228_1_1533298404_4812267_555 上面是一般现在对然后然后下面呢 HE IS ALWAYS FINISHIＮG
+    """
+    line = line.replace("Ａ", "A")


If this code becomes slow to run due to the number of replaces, I recommend using the "re" module and writing a single regular expression that covers all symbols to remove. Something like: re.compile(r"(=|<|>|…)")

Ok, I will change it.
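The suggestion above can be sketched as follows. This is only an illustration under stated assumptions: the symbol set is a small made-up sample (the full list used in the recipe is not shown in this thread), and `SYMBOLS_TO_REMOVE`, `TRAILING_SPACES`, and `remove_symbols` are hypothetical names. A character class is used here in place of the alternation group from the comment; for single characters the two are equivalent.

```python
import re

# Illustrative symbol set only; the recipe's actual list is not shown in this thread.
SYMBOLS_TO_REMOVE = re.compile(r"[=<>\[\]、]")
# Mirrors the sed expression that strips trailing spaces.
TRAILING_SPACES = re.compile(r" +$")


def remove_symbols(line: str) -> str:
    """Remove all listed symbols in one pass, then strip trailing spaces."""
    line = SYMBOLS_TO_REMOVE.sub("", line)
    return TRAILING_SPACES.sub("", line)


print(remove_symbols("上面[是]一般现在、对  "))
```

Compiling the patterns once at module level means each line needs two substitutions instead of one `str.replace` call per symbol.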


pzelasko · 2022-07-01T21:52:37Z

LGTM. Can you first fix the formatting issues? Don't worry about the unit test; it was a randomness-related error.

shanguanma · 2022-08-18T01:55:25Z

@pzelasko, sorry for the late reply. I ran your test command, pytest test, but it is very slow. The log so far is as follows:

============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /mntnfs/lee_data1/maduo/k2-fsa/lhotse
plugins: hypothesis-5.41.2, anyio-3.6.1
collected 1434 items / 3 skipped

test/test_audio_reads.py ....FFFFF......FF......FF......FF......FFFFFxxF [  3%]
FxxFFxxFFFF..                                                            [  4%]
test/test_feature_set.py ...........FFFFssss..........                   [  6%]
test/test_kaldi_dirs.py x.FF................                             [  7%]
test/test_lazy.py ..............................x..x                     [  9%]
test/test_manipulation.py .............................................. [ 13%]
............                                                             [ 14%]
test/test_multipexing_iterables.py .......                               [ 14%]
test/test_parallel.py ....                                               [ 14%]
test/test_qa.py ....                                                     [ 15%]
test/test_recording_set.py ....F...........FFFFF........................ [ 18%]
..........ssss...                                                        [ 19%]
test/test_resample_randomized.py .                                       [ 19%]
test/test_serialization.py ............................................. [ 22%]
.........................................................                [ 26%]
test/test_supervision_set.py ...........................                 [ 28%]
test/test_utils.py ....................................................  [ 32%]
test/augmentation/test_torchaudio.py .................................   [ 34%]
test/cut/test_custom_attrs.py ...................                        [ 35%]
test/cut/test_custom_attrs_randomized.py .                               [ 35%]
test/cut/test_cut.py ............................                        [ 37%]
test/cut/test_cut_augmentation.py ...................................... [ 40%]
..                                                                       [ 40%]
test/cut/test_cut_drop_attributes.py ............                        [ 41%]
test/cut/test_cut_extend_by.py .................                         [ 42%]
test/cut/test_cut_fill_supervision.py ..............                     [ 43%]
test/cut/test_cut_merge_supervisions.py .........                        [ 44%]
test/cut/test_cut_mixing.py ...............                              [ 45%]
test/cut/test_cut_ops_preserve_id.py ................................... [ 47%]
.....                                                                    [ 47%]
test/cut/test_cut_set.py ............s.........s.......                  [ 50%]
test/cut/test_cut_set_mix.py .........                                   [ 50%]
test/cut/test_cut_trim_to_supervisions.py .....                          [ 51%]
test/cut/test_cut_truncate.py ........................................   [ 53%]
test/cut/test_cut_with_in_memory_data.py ...........                     [ 54%]
test/cut/test_feature_extraction.py ............ssss....sss........sss.  [ 57%]
test/cut/test_invariants_randomized.py ..                                [ 57%]
test/cut/test_masks.py ................                                  [ 58%]
test/cut/test_padding_cut.py ........................................... [ 61%]
......                                                                   [ 61%]
test/dataset/test_batch_io.py ............

shanguanma · 2022-08-18T02:13:20Z

It has been running for more than 12 hours and still has not finished. I don't know what to do.

pzelasko · 2022-08-18T12:49:08Z

As far as I remember, the unit tests failed on some test that used random numbers and can crash very rarely; I will fix that separately, some time. Can you resolve the conflicts and then run black lhotse test on your code? It should be good enough.


pzelasko · 2022-08-22T12:42:03Z

Thanks, can you also merge master and resolve the conflicts?


csukuangfj · 2022-08-23T02:25:48Z

lhotse/recipes/aishell2.py

+        if char == "'" and "\u4e00" <= line[i - 1] <= "\u9fff":
+            char = char.replace("'", "")
+        new_line.append(char)
+    line = "".join(new_line)


Does it remove spaces between English words?

@csukuangfj, no, it doesn't remove spaces between English words.
I selected some lines from trans.txt of the aishell2 dataset:

IC0001W0061 听流年
IC0001W0062 听beat it
IC0001W0063 听独角戏
IC0001W0064 听心雨
IC0001W0065 听Yesterday Once More
IC0001W0066 听广岛之恋
IC0001W0067 听一生有你
IC0010W0228 Here's
IC0012W0161 I'm
IC0013W0018 It's
IC0017W0126 Nothing'sGChange
IC0020W0392 She's
IC0022W0444 That's
IC0073W0058 搬不走的要及时'关停并转'
IC0085W0187 帮我放一首歌Let's
IC0392W0410 对低收入群体的帮助也更大'
IC0975W0451 明年二月底'小成'
ID0114W0368 我感觉就是在不断'拉抽屉'
ID0115W0198 我公司员工不存在持有'和泰创投'股份的情况

After running the code, the output is as follows:

IC0001W0061 听流年
IC0001W0062 听BEAT IT
IC0001W0063 听独角戏
IC0001W0064 听心雨
IC0001W0065 听YESTERDAY ONCE MORE
IC0001W0066 听广岛之恋
IC0001W0067 听一生有你
IC0010W0228 HERE'S
IC0012W0161 I'M
IC0013W0018 IT'S
IC0017W0126 NOTHING'SGCHANGE
IC0020W0392 SHE'S
IC0022W0444 THAT'S
IC0073W0058 搬不走的要及时关停并转
IC0085W0187 帮我放一首歌LET'S
IC0392W0410 对低收入群体的帮助也更大
IC0975W0451 明年二月底小成
ID0114W0368 我感觉就是在不断拉抽屉
ID0115W0198 我公司员工不存在持有和泰创投股份的情况

Thanks, I see.
It splits words into characters, with spaces being kept.

@csukuangfj, the test code is as follows:

#!/usr/bin/env python3
import sys


def normalization(line: str):
    new_line = []
    line = list(line)
    # print(line)
    for i, char in enumerate(line):
        if char == "'" and "\u4e00" <= line[i - 1] <= "\u9fff":
            char = char.replace("'", "")
        new_line.append(char)
    # print(new_line)
    line = "".join(new_line)
    line = line.upper()
    return line


if __name__ == "__main__":
    file = sys.argv[1]
    with open(file, "r") as f:
        for line in f:
            line = line.strip().split()
            content = " ".join(line[1:])
            content = normalization(content)
            print(f"{line[0]} {content}")

Thanks, I see. It splits words into characters, with spaces being kept.

Yes.
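To make the behavior discussed above easy to check without the dataset, here is a small self-contained sketch of the same rule: drop an apostrophe that directly follows a character in the range U+4E00 to U+9FFF, then upper-case the line. The function name `strip_quotes_after_cjk` is hypothetical. The `i > 0` guard is an addition in this sketch: in the test code above, `line[i - 1]` at `i == 0` refers to the last character of the line, so a leading apostrophe would be judged by the final character.

```python
def strip_quotes_after_cjk(line: str) -> str:
    """Drop an apostrophe that directly follows a CJK ideograph, then upper-case."""
    kept = []
    for i, char in enumerate(line):
        # The i > 0 guard avoids the line[-1] wrap-around on the first character.
        follows_cjk = i > 0 and "\u4e00" <= line[i - 1] <= "\u9fff"
        if char == "'" and follows_cjk:
            continue
        kept.append(char)
    return "".join(kept).upper()


# Inputs copied from the trans.txt samples quoted in this thread.
for text in ["听beat it", "帮我放一首歌Let's", "明年二月底'小成'"]:
    print(strip_quotes_after_cjk(text))
```

Spaces between English words and apostrophes inside English contractions pass through unchanged, which matches the before and after samples in the thread.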

shanguanma · 2022-08-23T02:31:05Z

@pzelasko, if there are no conflicts or other problems, please merge it. I will open another pull request to add normalization for the aishell2 recipe.

pzelasko · 2022-08-23T12:33:53Z

Thanks, merging!

shanguanma added 5 commits June 24, 2022 07:00

add aishell2 aishell tal_asr tal_csasr migicdata aidatatang_200zh recipe text normalization

03da28e

Signed-off-by: shanguanma <[email protected]>

add text normalization of stcmds recipe

81bd4be

Signed-off-by: shanguanma <[email protected]>

add normalize specify bad case

7b2687a

Signed-off-by: shanguanma <[email protected]>

add str.upper into thchs_30 recipe

2f42b14

Signed-off-by: shanguanma <[email protected]>

add fixed

5a55e07

Signed-off-by: shanguanma <[email protected]>

pzelasko reviewed Jun 28, 2022


change str.replace into re.sub

1e56dee

Signed-off-by: shanguanma <[email protected]>

yuekaizhang mentioned this pull request Jul 5, 2022

add aishell2 dev test #766

Merged

black

50e4069

Signed-off-by: maduo <[email protected]>

shanguanma added 2 commits August 23, 2022 10:14

fixed recipes/aishell2

d599dc4

Signed-off-by: shanguanma <[email protected]>

fixed it

f9f8344

Signed-off-by: shanguanma <[email protected]>

csukuangfj reviewed Aug 23, 2022


Merge branch 'master' into master

f2c8b1a

pzelasko merged commit a67d1ed into lhotse-speech:master Aug 23, 2022

pzelasko added this to the v1.6 milestone Aug 23, 2022

shanguanma mentioned this pull request Aug 25, 2022

add normalization for aishell2 recipe #790

Merged
