Delete some duplicate codes #832

Merged: 1 commit, Aug 25, 2020

@T-baby T-baby commented Aug 20, 2020

  • Delete some duplicate code
  • Fix the failure to process unlogged (out-of-vocabulary, OOV) words
CLAassistant commented Aug 20, 2020

CLA assistant check
All committers have signed the CLA.

if v:
    return v
else:
    return 0
Contributor

To detect OOV words in the vocabulary, it would be better to return None, because 0 is also a valid index in the vocabulary.
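A minimal sketch of the ambiguity described above, assuming a hypothetical dict-based vocabulary (the `vocab` contents and both function names are illustrative, not PaddleHub's actual code):

```python
# Hypothetical vocabulary mapping token -> index (an assumption for illustration).
vocab = {"the": 0, "cat": 1, "sat": 2}


def lookup_with_zero(token):
    """Mirror of the reviewed pattern: fall back to 0 for unknown tokens."""
    v = vocab.get(token)
    if v:
        return v
    else:
        return 0


def lookup_with_none(token):
    """Return the index, or None when the token is out of vocabulary."""
    return vocab.get(token)


# With the 0 fallback, an OOV token looks exactly like "the" (index 0).
assert lookup_with_zero("the") == lookup_with_zero("dog") == 0

# With None, the caller can tell the two cases apart.
assert lookup_with_none("the") == 0
assert lookup_with_none("dog") is None
```

Callers of the None-returning variant should test with `is None` rather than truthiness, since index 0 is falsy in Python.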

Contributor Author

But in the tokenizer, if the result is None the term will not get an id, so the term may be lost. Is there a better way?

@Steffy-zxf Steffy-zxf Aug 20, 2020

  1. We have already run experiments and found that dropping an OOV term versus replacing it with the [UNK] token makes no difference to model performance.
  2. Replacing an OOV term with vocab[0] doesn't make sense.
  3. Replacing OOV terms with the [UNK] token is not always possible, because some modules' vocabularies don't contain the [UNK] token.

As mentioned above, to detect OOV words in the vocabulary, it would be better to return None.
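The three points above can be sketched as follows. This is only an illustration under stated assumptions: `tokens_to_ids`, the `UNK` constant, and both example vocabularies are hypothetical and do not reproduce PaddleHub's tokenizer. The lookup returns None for OOV terms, which are mapped to [UNK] when the vocabulary has that token and dropped otherwise.

```python
UNK = "[UNK]"  # assumed spelling of the unknown-token entry


def tokens_to_ids(tokens, vocab):
    """Convert tokens to ids; OOV tokens map to [UNK] if present, else are dropped."""
    unk_id = vocab.get(UNK)  # None when this vocabulary has no [UNK] entry
    ids = []
    for token in tokens:
        idx = vocab.get(token)  # None signals OOV; 0 stays a valid index
        if idx is None:
            idx = unk_id
        if idx is not None:
            ids.append(idx)
    return ids


# Hypothetical vocabularies, one without and one with an [UNK] entry.
vocab_without_unk = {"the": 0, "cat": 1}
vocab_with_unk = {"the": 0, "cat": 1, UNK: 2}

print(tokens_to_ids(["the", "dog", "cat"], vocab_without_unk))  # [0, 1]
print(tokens_to_ids(["the", "dog", "cat"], vocab_with_unk))     # [0, 2, 1]
```

In neither case is the OOV term silently mapped to index 0, so a genuine index-0 token is never confused with an unknown one.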

elif isinstance(text,
                (list, tuple)) and len(text) > 0 and isinstance(
                    text[0], str):
Contributor

The code style does not meet PaddleHub's requirements. Please commit your code with Python 3.6 and install the yapf and pre-commit tools, which check the code style automatically. For more information, please refer to https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/contribution/contri_pr.md

Steffy-zxf commented Aug 20, 2020

Thanks for fixing the bugs. Please open the pull request against the release/v1.8 branch.

@Steffy-zxf Steffy-zxf self-assigned this Aug 20, 2020
@Steffy-zxf Steffy-zxf linked an issue Aug 25, 2020 that may be closed by this pull request
@Steffy-zxf Steffy-zxf merged commit a2d3359 into PaddlePaddle:develop Aug 25, 2020
Successfully merging this pull request may close these issues.

word2vec_skipgram