Delete some duplicate codes #832

Merged 1 commit on Aug 25, 2020
22 changes: 13 additions & 9 deletions paddlehub/tokenizer/tokenizer.py
@@ -78,7 +78,11 @@ def get_vocab(self):

     def _convert_token_to_id(self, token):
         """ Converts a token (str) in an id using the vocab. """
-        return self.vocab.get(token, None)
+        v = self.vocab.get(token, None)
+        if v:
+            return v
+        else:
+            return 0
Contributor:

To find OOV tokens in the vocabulary, it would be better to return None. The number 0 is also a valid index in the vocabulary, so returning 0 conflates an OOV token with the token whose id is 0.

Contributor (author):

But in the tokenizer, if the returned id is None, the token never gets an id, so the term may be lost. Is there a better way?

Contributor @Steffy-zxf, Aug 20, 2020:

1. We have already run experiments and found that dropping an OOV term, versus replacing it with the [UNK] token, makes no difference to model performance.
2. Replacing an OOV term with vocab[0] doesn't make sense.
3. Replacing OOV terms with the [UNK] token also fails, because some modules' vocabularies don't contain an [UNK] token at all.

As noted above, to find OOV terms in the vocabulary, it is better to return None.


     def _convert_id_to_token(self, index):
         """Converts an index (integer) in a token (str) using the vocab."""
@@ -123,8 +127,8 @@ def convert_tokens_to_ids(self, tokens):
         ids = []
         for token in tokens:
             wid = self._convert_token_to_id(token)
-            if wid:
-                ids.append(self._convert_token_to_id(token))
+            if wid is not None:
+                ids.append(wid)
         return ids

     def tokenize(self, text):
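
This hunk fixes two things at once: it drops the duplicate _convert_token_to_id call, and it replaces the truthiness test with an explicit None check, so a token whose id happens to be 0 is no longer silently discarded. A minimal sketch of the behavioral difference, again with a hypothetical toy vocabulary:

    vocab = {"[PAD]": 0, "hello": 1}

    def ids_truthy(tokens):
        # old behavior: `if wid:` drops any token whose id is 0
        return [vocab.get(t) for t in tokens if vocab.get(t)]

    def ids_not_none(tokens):
        # new behavior: only genuinely missing (OOV) tokens are skipped
        return [vocab.get(t) for t in tokens if vocab.get(t) is not None]

    print(ids_truthy(["[PAD]", "hello"]))     # [1]    -- "[PAD]" lost
    print(ids_not_none(["[PAD]", "hello"]))   # [0, 1] -- "[PAD]" kept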
@@ -204,14 +208,14 @@ def get_input_ids(text):
         if isinstance(text, str):
             tokens = self.tokenize(text)
             ids = self.convert_tokens_to_ids(tokens)
-            return self.convert_tokens_to_ids(tokens)
+            return ids
         elif isinstance(text,
                         (list, tuple)) and len(text) > 0 and isinstance(
-                text[0], str):
+                        text[0], str):
Contributor:

The code style is not the one required by PaddleHub. Please commit your code under Python 3.6 and install the yapf and pre-commit tools, which will check the code style automatically. For more information, please refer to https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/contribution/contri_pr.md

             return self.convert_tokens_to_ids(text)
         elif isinstance(text,
                         (list, tuple)) and len(text) > 0 and isinstance(
-                text[0], int):
+                        text[0], int):
             return text
         else:
             raise ValueError(
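
Since the hunk above is split by the inline review comment, here is a consolidated, hypothetical standalone sketch of the dispatch logic in get_input_ids (tokenization and id lookup are stubbed out as parameters, and the elided error message is invented for illustration):

    def get_input_ids(text, tokenize, convert_tokens_to_ids):
        if isinstance(text, str):
            # raw string: tokenize first, then map tokens to ids
            ids = convert_tokens_to_ids(tokenize(text))
            return ids
        elif isinstance(text, (list, tuple)) and len(text) > 0 \
                and isinstance(text[0], str):
            # already tokenized: just map tokens to ids
            return convert_tokens_to_ids(text)
        elif isinstance(text, (list, tuple)) and len(text) > 0 \
                and isinstance(text[0], int):
            # already ids: pass through unchanged
            return text
        else:
            raise ValueError(
                "text must be a str, a non-empty list/tuple of str, "
                "or a non-empty list/tuple of int")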
@@ -350,7 +354,7 @@ def clean_up_tokenization(self, out_string: str) -> str:
         """
         out_string = (out_string.replace(" .", ".").replace(" ?", "?").replace(
             " !", "!").replace(" ,", ",").replace(" ' ", "'").replace(
-            " n't",
-            "n't").replace(" 'm", "'m").replace(" 's", "'s").replace(
-            " 've", "'ve").replace(" 're", "'re"))
+                " n't",
+                "n't").replace(" 'm", "'m").replace(" 's", "'s").replace(
+                " 've", "'ve").replace(" 're", "'re"))
         return out_string
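
The hunk above only re-indents the replace chain, but for readers tracing it, here is a sketch of what the cleanup does to whitespace-tokenized English (a standalone copy of the same chain, with the output traced by hand):

    def clean_up_tokenization(out_string: str) -> str:
        # Re-attach punctuation and contractions that tokenization split off.
        out_string = (out_string
                      .replace(" .", ".").replace(" ?", "?")
                      .replace(" !", "!").replace(" ,", ",")
                      .replace(" ' ", "'").replace(" n't", "n't")
                      .replace(" 'm", "'m").replace(" 's", "'s")
                      .replace(" 've", "'ve").replace(" 're", "'re"))
        return out_string

    print(clean_up_tokenization("I do n't think it 's bad , really ."))
    # -> "I don't think it's bad, really."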