What is the special pinyin "xx5" used for? #4

Open
JohnHerry opened this issue Aug 21, 2020 · 2 comments

@JohnHerry

Hi all,
Thanks for the good work. I found a special pinyin, "xx5", in class2idx, but no sample in the corpus is labeled with it. What is this pinyin class used for? Is there anything special about it?

@seanie12
Contributor

Hi, class2idx is a dictionary that maps each pinyin to its own id,
so each id corresponds to an index of the softmax layer.
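A minimal sketch of how such a mapping ties a pinyin to a softmax index. The three entries in `class2idx` and the logits are made up for illustration and are not taken from the repository; the real dictionary is much larger.

```python
import math

# Hypothetical toy vocabulary (assumption, not the project's real class2idx).
class2idx = {"le5": 0, "liao3": 1, "xx5": 2}
# Inverse mapping, used to turn a predicted index back into a pinyin.
idx2class = {idx: pinyin for pinyin, idx in class2idx.items()}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores from a classifier with one output unit per pinyin class.
logits = [2.0, 0.5, -3.0]
probs = softmax(logits)

# The position of the largest probability is the predicted class id.
predicted_idx = max(range(len(probs)), key=probs.__getitem__)
predicted_pinyin = idx2class[predicted_idx]
```

Under this reading, a class such as "xx5" still occupies an output unit even if no training sample is ever labeled with it.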
There are two reasons why there is no label of "xx5".

  1. There is no polyphonic character whose pinyin is xx5.
  2. Our dataset does not cover all possible Chinese polyphonic characters. We collected Chinese sentences from Wikipedia and labeled them, so some polyphonic characters are missing from our data.
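One way to check which classes never occur as a label is to count the labels in the corpus and compare against the keys of the dictionary. The sketch below uses a toy `class2idx` and a toy label list, both hypothetical; `unused_classes` is an illustrative helper, not a function from the repository.

```python
from collections import Counter

# Hypothetical toy data (assumption, not the project's real files).
class2idx = {"le5": 0, "liao3": 1, "xx5": 2}
corpus_labels = ["le5", "liao3", "le5", "le5"]

def unused_classes(class2idx, labels):
    """Return the pinyin classes that never appear among the labels."""
    counts = Counter(labels)
    return sorted(pinyin for pinyin in class2idx if counts[pinyin] == 0)

missing = unused_classes(class2idx, corpus_labels)
```

Any class reported here has an entry in the dictionary but no labeled example in the data.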

@JohnHerry
Author

JohnHerry commented Aug 25, 2020

Then that may be an error from human labeling. There is no Chinese character mapped to the pinyin xx5.

Another question: why does the paper balance the training data by polyphone instead of by polychar? I think the latter is also important. There have been many bad cases of mispredicted pinyins, and we found that the polychar samples in CPP are not balanced across their pinyins. That may be natural for Wikipedia data if the paper did not filter out some samples: those longer than 50, those shorter than 5, those labeled differently by the two annotators, etc.
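A rough sketch of how the per-character imbalance could be measured. It assumes each sample is a (sentence, polyphonic character, pinyin) triple and that the length bounds of 5 and 50 are counted in characters; both are assumptions for illustration, as are the sample sentences and the helper names `length_ok` and `pinyin_distribution`.

```python
from collections import Counter, defaultdict

# Hypothetical samples: (sentence, polyphonic character, labeled pinyin).
samples = [
    ("他已经吃了晚饭", "了", "le5"),
    ("我们都知道了这件事", "了", "le5"),
    ("你要了解这个问题", "了", "liao3"),
    ("好了", "了", "le5"),  # only 2 characters long
]

def length_ok(sentence, min_len=5, max_len=50):
    """Keep sentences whose length in characters is within the bounds."""
    return min_len <= len(sentence) <= max_len

def pinyin_distribution(samples):
    """Count how often each pinyin is labeled for each polyphonic character."""
    dist = defaultdict(Counter)
    for _sentence, char, pinyin in samples:
        dist[char][pinyin] += 1
    return dist

kept = [s for s in samples if length_ok(s[0])]
dist = pinyin_distribution(kept)

# Print the label counts per character to spot skewed distributions.
for char, counts in dist.items():
    print(char, dict(counts))
```

A character whose counts are heavily skewed toward one pinyin is a candidate for the kind of misprediction described above.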
