Delete some duplicate codes #832

T-baby · 2020-08-20T07:32:07Z

Delete some duplicate code
Fix the failure to process out-of-vocabulary (OOV) words

CLAassistant · 2020-08-20T07:32:12Z

All committers have signed the CLA.

Steffy-zxf · 2020-08-20T08:04:38Z

paddlehub/tokenizer/tokenizer.py

+        if v:
+            return v
+        else:
+            return 0


To signal an OOV token, the vocabulary lookup should return None. 0 is itself a valid index in the vocabulary, so returning 0 makes an OOV token indistinguishable from the token at index 0.

T-baby · 2020-08-20

But if the lookup returns None, the tokenizer will not get an id, so the term may be lost. Is there a better way?

Steffy-zxf · 2020-08-20

We have already run experiments and found that dropping the OOV term and replacing it with the [UNK] token make no difference to model performance.

Replacing an OOV term with vocab[0] doesn't make sense.

Replacing OOV terms with the [UNK] token is not always possible either, because some modules' vocabularies don't contain an [UNK] token.

As mentioned above, it is better to return None when a token is not found in the vocabulary.
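To make the point concrete, here is a minimal sketch of the suggested behavior. It is not the actual PaddleHub code: `lookup_id`, `tokens_to_ids`, and the toy `vocab` are hypothetical names used only for illustration. The lookup returns None for an OOV token, and the caller drops such tokens by checking `is not None`, so the token at index 0 is still kept.

```python
from typing import Dict, List, Optional


def lookup_id(vocab: Dict[str, int], token: str) -> Optional[int]:
    """Return the token's index, or None when the token is out of vocabulary."""
    # dict.get returns None for a missing key, so index 0 stays a valid result.
    return vocab.get(token)


def tokens_to_ids(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Map tokens to ids, dropping OOV tokens instead of mapping them to index 0."""
    ids = []
    for token in tokens:
        token_id = lookup_id(vocab, token)
        # Compare against None explicitly: `if token_id:` would also skip index 0.
        if token_id is not None:
            ids.append(token_id)
    return ids


# Toy vocabulary (hypothetical); "paddle" is deliberately missing.
vocab = {"hello": 0, "world": 1}
print(lookup_id(vocab, "hello"))   # 0
print(lookup_id(vocab, "paddle"))  # None
print(tokens_to_ids(vocab, ["hello", "paddle", "world"]))  # [0, 1]
```

Note that a truthiness check such as `if v: return v` treats a valid index 0 the same as a missing token, which is why the sketch compares against None instead.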

Steffy-zxf · 2020-08-20T08:11:05Z

paddlehub/tokenizer/tokenizer.py

            elif isinstance(text,
                            (list, tuple)) and len(text) > 0 and isinstance(
-                                text[0], str):
+                text[0], str):


The code style does not meet PaddleHub's requirements. Please commit your code with Python 3.6 and install the yapf and pre-commit tools, which check the code style automatically. For more information, see https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/contribution/contri_pr.md

Steffy-zxf · 2020-08-20T08:23:33Z

Thanks for fixing these bugs. Please open the pull request against the release/v1.8 branch.

Delete some duplicate codes

8c9725b

Steffy-zxf requested changes Aug 20, 2020

Steffy-zxf self-assigned this Aug 20, 2020

Steffy-zxf linked an issue Aug 25, 2020 that may be closed by this pull request

word2vec_skipgram #818

Open

Steffy-zxf merged commit a2d3359 into PaddlePaddle:develop Aug 25, 2020
