[WIP] Add text normalization for multi_cn recipes: thchs_30, tal_csasr, tal_asr, aishell, aishell2, etc. #760
…ipe text normalization Signed-off-by: shanguanma <[email protected]>
sed "s/[ ][ ]*$//g" | sed "s/\[//g" | sed 's/、//g'
210_40223_210_6228_1_1533298404_4812267_555 上面是一般现在对然后然后下面呢 HE IS ALWAYS FINISHING
"""
line = line.replace("A", "A")
If this code becomes slow to run due to the number of replaces, I recommend using the "re" module and writing a single regular expression that covers all symbols to remove. Something like: re.compile(r"(=|<|>|…)")
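The suggestion above could look roughly like the following. This is only a sketch: the symbol list and the names `SYMBOLS_TO_REMOVE`, `REMOVE_PATTERN`, and `remove_symbols` are assumptions made for illustration and are not taken from the recipe code.

```python
import re

# Hypothetical list of symbols to strip; the real recipe may use a different set.
SYMBOLS_TO_REMOVE = ["=", "<", ">", "[", "]", "、"]

# Build one alternation pattern; re.escape makes every symbol match literally,
# so characters such as "[" are not interpreted as regex syntax.
REMOVE_PATTERN = re.compile("|".join(re.escape(s) for s in SYMBOLS_TO_REMOVE))


def remove_symbols(line: str) -> str:
    """Delete every listed symbol in a single pass over the line."""
    return REMOVE_PATTERN.sub("", line)
```

Compiling the pattern once at module level means each transcript line is scanned a single time, instead of once per chained `str.replace` call.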
Ok, I will change it.
Signed-off-by: shanguanma <[email protected]>
LGTM. Can you first fix the formatting issues? Don't worry about the unit test; it was a randomness-related error.
@pzelasko, sorry for the late reply. I have run your test command.
It has been running for more than 12 hours and still has not finished. I don't know what to do.
As far as I remember, the unit tests failed on a test that uses random numbers and crashes only very rarely; I will fix that separately at some point. Can you resolve the conflicts and then run
Signed-off-by: maduo <[email protected]>
Thanks, can you also merge master and resolve the conflicts?
Signed-off-by: shanguanma <[email protected]>
lhotse/recipes/aishell2.py
Outdated
        if char == "'" and "\u4e00" <= line[i - 1] <= "\u9fff":
            char = char.replace("'", "")
        new_line.append(char)
    line = "".join(new_line)
Does it remove spaces between English words?
@csukuangfj, no, it doesn't remove spaces between English words.
I selected some lines from trans.txt of the aishell2 dataset:
IC0001W0061 听流年
IC0001W0062 听beat it
IC0001W0063 听独角戏
IC0001W0064 听心雨
IC0001W0065 听Yesterday Once More
IC0001W0066 听广岛之恋
IC0001W0067 听一生有你
IC0010W0228 Here's
IC0012W0161 I'm
IC0013W0018 It's
IC0017W0126 Nothing'sGChange
IC0020W0392 She's
IC0022W0444 That's
IC0073W0058 搬不走的要及时'关停并转'
IC0085W0187 帮我放一首歌Let's
IC0392W0410 对低收入群体的帮助也更大'
IC0975W0451 明年二月底'小成'
ID0114W0368 我感觉就是在不断'拉抽屉'
ID0115W0198 我公司员工不存在持有'和泰创投'股份的情况
After running the code, the output is as follows:
IC0001W0061 听流年
IC0001W0062 听BEAT IT
IC0001W0063 听独角戏
IC0001W0064 听心雨
IC0001W0065 听YESTERDAY ONCE MORE
IC0001W0066 听广岛之恋
IC0001W0067 听一生有你
IC0010W0228 HERE'S
IC0012W0161 I'M
IC0013W0018 IT'S
IC0017W0126 NOTHING'SGCHANGE
IC0020W0392 SHE'S
IC0022W0444 THAT'S
IC0073W0058 搬不走的要及时关停并转
IC0085W0187 帮我放一首歌LET'S
IC0392W0410 对低收入群体的帮助也更大
IC0975W0451 明年二月底小成
ID0114W0368 我感觉就是在不断拉抽屉
ID0115W0198 我公司员工不存在持有和泰创投股份的情况
Thanks, I see.
It splits words into characters, with spaces being kept.
@csukuangfj, the test code is as follows:
#!/usr/bin/env python3
import sys


def normalization(line: str) -> str:
    """Drop apostrophes that directly follow a Chinese character, then upper-case."""
    new_line = []
    for i, char in enumerate(line):
        # "\u4e00".."\u9fff" is the CJK Unified Ideographs block.
        # Note: for i == 0, line[i - 1] is the last character (negative indexing).
        if char == "'" and "\u4e00" <= line[i - 1] <= "\u9fff":
            char = char.replace("'", "")
        new_line.append(char)
    line = "".join(new_line)
    line = line.upper()
    return line


if __name__ == "__main__":
    file = sys.argv[1]
    with open(file, "r", encoding="utf-8") as f:
        for line in f:
            # The first field is the utterance ID; the rest is the transcript.
            line = line.strip().split()
            content = " ".join(line[1:])
            content = normalization(content)
            print(f"{line[0]} {content}")
Thanks, I see. It splits words into characters, with spaces being kept.
Yes.
@pzelasko, if there are no conflicts or other problems, please merge it. I will open another pull request to add normalization for the aishell2 recipe.
Thanks, merging!
I will clean up some punctuation and convert full-width English characters into half-width English characters in this recipe.
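One possible way to do the full-width to half-width conversion is a translation table, sketched below. The names `FULL_TO_HALF` and `to_half_width` are hypothetical and not from the recipe; the sketch relies on the fact that the full-width ASCII variants occupy U+FF01 to U+FF5E, each at a fixed offset of 0xFEE0 from its half-width counterpart.

```python
# Map each full-width ASCII variant (U+FF01..U+FF5E) to its half-width
# counterpart (U+0021..U+007E); the two ranges differ by a constant 0xFEE0.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
# The ideographic space (U+3000) is handled separately and becomes a plain space.
FULL_TO_HALF[0x3000] = 0x20


def to_half_width(text: str) -> str:
    """Replace full-width letters, digits, and symbols with half-width ones."""
    return text.translate(FULL_TO_HALF)
```

Note that this table also converts full-width punctuation such as "," into ",". If only letters should be converted, the range could be restricted to the full-width letter code points instead.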