[WIP] Add text normalization for the multi_cn recipe: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. #760

Merged: 10 commits, Aug 23, 2022


shanguanma
Contributor

In this recipe I will remove some punctuation and convert full-width English characters into half-width English characters.
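
A minimal sketch of the full-width to half-width conversion described above. It relies on the fact that the full-width forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021..U+007E. The helper name `to_half_width` is hypothetical and is not the code added in this PR.

```python
# Hypothetical helper for illustration; not the implementation from this PR.
FULLWIDTH_OFFSET = 0xFEE0  # distance from U+FF01..U+FF5E down to U+0021..U+007E

# Translation table: full-width code point -> half-width code point.
_FULL_TO_HALF = {code: code - FULLWIDTH_OFFSET for code in range(0xFF01, 0xFF5F)}
_FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants to half-width; leave other characters alone."""
    return text.translate(_FULL_TO_HALF)


if __name__ == "__main__":
    # "\uff21\uff22\uff23" is the full-width spelling of "ABC".
    print(to_half_width("\uff21\uff22\uff23\u3000\uff11\uff12"))  # prints "ABC 12"
```

A single `str.translate` call handles every mapped character in one pass, so no chain of per-letter `replace` calls is needed.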

sed "s/[ ][ ]*$//g" | sed "s/\[//g" | sed 's/、//g'
210_40223_210_6228_1_1533298404_4812267_555 上面是一般现在对然后然后下面呢 HE IS ALWAYS FINISHING
"""
line = line.replace("A", "A")
Collaborator

If this code becomes slow because of the number of replace calls, I recommend using the "re" module and writing a single regular expression that covers all the symbols to remove, something like: re.compile(r"(=|<|>|…)")
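
A rough sketch of that suggestion, assuming a small illustrative symbol set. The symbols in `_SYMBOLS_TO_REMOVE` and the function name `remove_symbols` are assumptions, not the list or names used in the PR.

```python
import re

# Assumed symbol set, for illustration only.
_SYMBOLS_TO_REMOVE = "=<>[]、"

# One character class covering every symbol; re.escape keeps "[" and "]"
# from being interpreted as class delimiters.
_REMOVE_PATTERN = re.compile("[" + re.escape(_SYMBOLS_TO_REMOVE) + "]")


def remove_symbols(line: str) -> str:
    """Delete every listed symbol in a single pass over the string."""
    return _REMOVE_PATTERN.sub("", line)


if __name__ == "__main__":
    print(remove_symbols("a=b<c>[d]、e"))  # prints "abcde"
```

Compiling the pattern once at module level means the per-line cost is one `sub` call, however many symbols are in the set.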

Contributor Author

Ok, I will change it.

@pzelasko
Collaborator

pzelasko commented Jul 1, 2022

LGTM. Can you first fix the formatting issues? Don't worry about unit test, it was a randomness-related error.

@yuekaizhang yuekaizhang mentioned this pull request Jul 5, 2022
@shanguanma
Contributor Author

@pzelasko, sorry for the late reply. I ran your test command, pytest test, but it is very slow. The log so far is as follows:

============================= test session starts ==============================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /mntnfs/lee_data1/maduo/k2-fsa/lhotse
plugins: hypothesis-5.41.2, anyio-3.6.1
collected 1434 items / 3 skipped

test/test_audio_reads.py ....FFFFF......FF......FF......FF......FFFFFxxF [  3%]
FxxFFxxFFFF..                                                            [  4%]
test/test_feature_set.py ...........FFFFssss..........                   [  6%]
test/test_kaldi_dirs.py x.FF................                             [  7%]
test/test_lazy.py ..............................x..x                     [  9%]
test/test_manipulation.py .............................................. [ 13%]
............                                                             [ 14%]
test/test_multipexing_iterables.py .......                               [ 14%]
test/test_parallel.py ....                                               [ 14%]
test/test_qa.py ....                                                     [ 15%]
test/test_recording_set.py ....F...........FFFFF........................ [ 18%]
..........ssss...                                                        [ 19%]
test/test_resample_randomized.py .                                       [ 19%]
test/test_serialization.py ............................................. [ 22%]
.........................................................                [ 26%]
test/test_supervision_set.py ...........................                 [ 28%]
test/test_utils.py ....................................................  [ 32%]
test/augmentation/test_torchaudio.py .................................   [ 34%]
test/cut/test_custom_attrs.py ...................                        [ 35%]
test/cut/test_custom_attrs_randomized.py .                               [ 35%]
test/cut/test_cut.py ............................                        [ 37%]
test/cut/test_cut_augmentation.py ...................................... [ 40%]
..                                                                       [ 40%]
test/cut/test_cut_drop_attributes.py ............                        [ 41%]
test/cut/test_cut_extend_by.py .................                         [ 42%]
test/cut/test_cut_fill_supervision.py ..............                     [ 43%]
test/cut/test_cut_merge_supervisions.py .........                        [ 44%]
test/cut/test_cut_mixing.py ...............                              [ 45%]
test/cut/test_cut_ops_preserve_id.py ................................... [ 47%]
.....                                                                    [ 47%]
test/cut/test_cut_set.py ............s.........s.......                  [ 50%]
test/cut/test_cut_set_mix.py .........                                   [ 50%]
test/cut/test_cut_trim_to_supervisions.py .....                          [ 51%]
test/cut/test_cut_truncate.py ........................................   [ 53%]
test/cut/test_cut_with_in_memory_data.py ...........                     [ 54%]
test/cut/test_feature_extraction.py ............ssss....sss........sss.  [ 57%]
test/cut/test_invariants_randomized.py ..                                [ 57%]
test/cut/test_masks.py ................                                  [ 58%]
test/cut/test_padding_cut.py ........................................... [ 61%]
......                                                                   [ 61%]
test/dataset/test_batch_io.py ............

@shanguanma
Contributor Author

It has been running for more than 12 hours and still has not finished. I don't know how to proceed.

@pzelasko
Collaborator

As far as I remember, the unit tests failed on some test that used random numbers and can crash very rarely; I will fix that separately, some time. Can you resolve the conflicts and then run black lhotse test on your code? It should be good enough.

Signed-off-by: maduo <[email protected]>
@pzelasko
Collaborator

Thanks, can you also merge master and resolve the conflicts?

shanguanma added 2 commits August 23, 2022 10:14
Signed-off-by: shanguanma <[email protected]>
Signed-off-by: shanguanma <[email protected]>
        if char == "'" and "\u4e00" <= line[i - 1] <= "\u9fff":
            char = char.replace("'", "")
        new_line.append(char)
    line = "".join(new_line)
Contributor

Does it remove spaces between English words?

Contributor Author

@csukuangfj, no, it doesn't remove spaces between English words.
Here are some lines I selected from trans.txt of the aishell2 dataset:

IC0001W0061     听流年
IC0001W0062     听beat it
IC0001W0063     听独角戏
IC0001W0064     听心雨
IC0001W0065     听Yesterday Once More
IC0001W0066     听广岛之恋
IC0001W0067     听一生有你
IC0010W0228     Here's
IC0012W0161     I'm
IC0013W0018     It's
IC0017W0126     Nothing'sGChange
IC0020W0392     She's
IC0022W0444     That's
IC0073W0058     搬不走的要及时'关停并转'
IC0085W0187     帮我放一首歌Let's
IC0392W0410     对低收入群体的帮助也更大'
IC0975W0451     明年二月底'小成'
ID0114W0368     我感觉就是在不断'拉抽屉'
ID0115W0198     我公司员工不存在持有'和泰创投'股份的情况                                                                  

After running the code, the output is as follows:

IC0001W0061 听流年
IC0001W0062 听BEAT IT
IC0001W0063 听独角戏
IC0001W0064 听心雨
IC0001W0065 听YESTERDAY ONCE MORE
IC0001W0066 听广岛之恋
IC0001W0067 听一生有你
IC0010W0228 HERE'S
IC0012W0161 I'M
IC0013W0018 IT'S
IC0017W0126 NOTHING'SGCHANGE
IC0020W0392 SHE'S
IC0022W0444 THAT'S
IC0073W0058 搬不走的要及时关停并转
IC0085W0187 帮我放一首歌LET'S
IC0392W0410 对低收入群体的帮助也更大
IC0975W0451 明年二月底小成
ID0114W0368 我感觉就是在不断拉抽屉
ID0115W0198 我公司员工不存在持有和泰创投股份的情况

Contributor

Thanks, I see.
It splits words into characters, with spaces being kept.

Contributor Author

@csukuangfj , the test code is as follows:

#!/usr/bin/env python3

import sys


def normalization(line: str) -> str:
    """Drop apostrophes that directly follow a Chinese character, then uppercase."""
    new_line = []
    for i, char in enumerate(line):
        # Require i > 0 so that line[i - 1] does not wrap around to the last character.
        if char == "'" and i > 0 and "\u4e00" <= line[i - 1] <= "\u9fff":
            continue
        new_line.append(char)
    return "".join(new_line).upper()


if __name__ == "__main__":
    file = sys.argv[1]
    with open(file, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.strip().split()
            if not fields:
                continue  # skip blank lines
            # fields[0] is the utterance ID; the rest is the transcript.
            content = normalization(" ".join(fields[1:]))
            print(f"{fields[0]} {content}")
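
For comparison, the same rule can be sketched with a single regular expression that uses a lookbehind. This is an alternative formulation under the same assumption as the loop above (only U+4E00..U+9FFF counts as a Chinese character). The name `normalize_text` is hypothetical.

```python
import re

# An apostrophe whose preceding character lies in U+4E00..U+9FFF,
# the same range that the loop-based version checks.
_QUOTE_AFTER_CJK = re.compile(r"(?<=[\u4e00-\u9fff])'")


def normalize_text(line: str) -> str:
    """Remove apostrophes that follow a Chinese character, then uppercase."""
    return _QUOTE_AFTER_CJK.sub("", line).upper()


if __name__ == "__main__":
    print(normalize_text("搬不走的要及时'关停并转'"))  # prints "搬不走的要及时关停并转"
    print(normalize_text("帮我放一首歌Let's"))  # prints "帮我放一首歌LET'S"
```

A lookbehind cannot match at the start of the string, so a leading apostrophe is always kept. Like the loop, this version keeps an apostrophe that follows a space or a Latin letter, even when Chinese text comes right after it.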

Contributor Author

Thanks, I see. It splits words into characters, with spaces being kept.

Yes.

@shanguanma
Contributor Author

@pzelasko, if there are no conflicts or other problems, please merge it. I will open another pull request to add normalization for the aishell2 recipe.

@pzelasko
Collaborator

Thanks, merging!

@pzelasko pzelasko merged commit a67d1ed into lhotse-speech:master Aug 23, 2022
@pzelasko pzelasko added this to the v1.6 milestone Aug 23, 2022
3 participants