Delete some duplicate codes #832
T-baby
commented
Aug 20, 2020
- Delete some duplicate code
- Fix the failure to process out-of-vocabulary (OOV) words
if v:
    return v
else:
    return 0
To detect OOV tokens in the vocabulary lookup, it would be better to return None. The number 0 is also a valid index in the vocabulary.
But in the tokenizer, a token whose lookup returns None does not get an id, so the term may be lost. Is there a better way?
- We have already run experiments and found that dropping OOV terms and replacing them with the [UNK] token make no difference to model performance.
- Replacing an OOV term with vocab[0] doesn't make sense.
- Replacing OOV terms with the [UNK] token is not always possible, because some modules' vocabularies don't contain the [UNK] token.
As mentioned above, to detect OOV tokens in the vocabulary lookup, it would be better to return None.
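A minimal sketch of the point being made, not the PaddleHub implementation: the helper names `lookup_id` and `tokens_to_ids` and the toy vocabulary are hypothetical. It shows why returning None for an OOV token keeps index 0 unambiguous, and why the caller should test `is not None` rather than truthiness.

```python
from typing import Dict, List, Optional


def lookup_id(vocab: Dict[str, int], token: str) -> Optional[int]:
    """Return the token's index, or None if the token is out of vocabulary."""
    # dict.get returns None for a missing key, so a real index of 0
    # can never be confused with "not found".
    return vocab.get(token)


def tokens_to_ids(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Convert tokens to ids, dropping OOV tokens (hypothetical helper)."""
    ids = []
    for token in tokens:
        idx = lookup_id(vocab, token)
        # `if idx:` would wrongly drop the token stored at index 0.
        if idx is not None:
            ids.append(idx)
    return ids


# Toy vocabulary, assumed for illustration only.
vocab = {"[PAD]": 0, "hello": 1, "world": 2}
ids = tokens_to_ids(vocab, ["[PAD]", "hello", "unseen", "world"])
```

With this shape, the token at index 0 survives the conversion while the unseen token is dropped, which matches the experiment result quoted above that dropping OOV terms does not hurt model performance.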
elif isinstance(text,
                (list, tuple)) and len(text) > 0 and isinstance(
                    text[0], str):
The code style does not meet PaddleHub's requirements. Please commit your code with Python 3.6 and install the yapf and pre-commit tools, which check the code style automatically. For more information, please refer to https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/contribution/contri_pr.md
Thanks for fixing bugs. Please open the pull request against the release/v1.8 branch.