Delete some duplicate codes #832
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,7 +78,11 @@ def get_vocab(self): | |
|
||
def _convert_token_to_id(self, token): | ||
""" Converts a token (str) in an id using the vocab. """ | ||
return self.vocab.get(token, None) | ||
v = self.vocab.get(token, None) | ||
if v: | ||
return v | ||
else: | ||
return 0 | ||
|
||
def _convert_id_to_token(self, index): | ||
"""Converts an index (integer) in a token (str) using the vocab.""" | ||
|
@@ -123,8 +127,8 @@ def convert_tokens_to_ids(self, tokens): | |
ids = [] | ||
for token in tokens: | ||
wid = self._convert_token_to_id(token) | ||
if wid: | ||
ids.append(self._convert_token_to_id(token)) | ||
if wid is not None: | ||
ids.append(wid) | ||
return ids | ||
|
||
def tokenize(self, text): | ||
|
@@ -204,14 +208,14 @@ def get_input_ids(text): | |
if isinstance(text, str): | ||
tokens = self.tokenize(text) | ||
ids = self.convert_tokens_to_ids(tokens) | ||
return self.convert_tokens_to_ids(tokens) | ||
return ids | ||
elif isinstance(text, | ||
(list, tuple)) and len(text) > 0 and isinstance( | ||
text[0], str): | ||
text[0], str): | ||
The code style does not meet PaddleHub's requirements. Please commit your code with Python 3.6 and install the yapf and pre-commit tools, which check the code style automatically. For more information, please refer to https://github.com/PaddlePaddle/PaddleHub/blob/release/v1.8/docs/contribution/contri_pr.md |
||
return self.convert_tokens_to_ids(text) | ||
elif isinstance(text, | ||
(list, tuple)) and len(text) > 0 and isinstance( | ||
text[0], int): | ||
text[0], int): | ||
return text | ||
else: | ||
raise ValueError( | ||
|
@@ -350,7 +354,7 @@ def clean_up_tokenization(self, out_string: str) -> str: | |
""" | ||
out_string = (out_string.replace(" .", ".").replace(" ?", "?").replace( | ||
" !", "!").replace(" ,", ",").replace(" ' ", "'").replace( | ||
" n't", | ||
"n't").replace(" 'm", "'m").replace(" 's", "'s").replace( | ||
" 've", "'ve").replace(" 're", "'re")) | ||
" n't", | ||
"n't").replace(" 'm", "'m").replace(" 's", "'s").replace( | ||
" 've", "'ve").replace(" 're", "'re")) | ||
return out_string |
To detect out-of-vocabulary (OOV) tokens, it would be better to return None. The number 0 is also a valid index in the vocabulary.
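To illustrate the point about index 0, here is a minimal sketch using a hypothetical three-entry vocabulary. The `vocab` contents and the two function names are assumptions for illustration, not PaddleHub code. A truthiness check cannot tell a real index 0 apart from a missing token, while a None return can.

```python
# Hypothetical toy vocabulary; "[PAD]" sits at index 0.
vocab = {"[PAD]": 0, "hello": 1, "world": 2}


def lookup_truthy(token):
    """Mirrors the `if v:` pattern: falls back to 0 when the lookup is falsy."""
    v = vocab.get(token, None)
    return v if v else 0


def lookup_none(token):
    """Returns None for out-of-vocabulary tokens, leaving 0 as a real index."""
    return vocab.get(token, None)


# With the truthy check, an in-vocabulary token at index 0 and an OOV token
# produce the same result, so the caller cannot distinguish them.
same = lookup_truthy("[PAD]") == lookup_truthy("missing")

# With the None convention, the two cases differ.
pad_id = lookup_none("[PAD]")
oov_id = lookup_none("missing")
```

Here `same` is true, which is the ambiguity the comment describes, while `pad_id` and `oov_id` stay distinct.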
But in the tokenizer, if the result is None the token gets no id, so the term may be lost. Is there a better way?
As mentioned, to detect OOV tokens in the vocabulary, it would be better to return None.
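One possible way to reconcile both concerns, sketched under assumptions: keep `_convert_token_to_id` returning None for OOV tokens, and let the caller substitute the id of an unknown-token entry so that no term is dropped. The `ToyTokenizer` class, its `unk_token` argument, and the `[UNK]` vocabulary entry are hypothetical and not taken from the PaddleHub source.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the tokenizer discussed in this review."""

    def __init__(self, vocab, unk_token="[UNK]"):
        self.vocab = vocab
        self.unk_token = unk_token  # assumed to be present in vocab

    def _convert_token_to_id(self, token):
        # None signals an OOV token; 0 remains a legitimate index.
        return self.vocab.get(token, None)

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.vocab[self.unk_token]
        ids = []
        for token in tokens:
            wid = self._convert_token_to_id(token)
            # Substitute the unknown-token id instead of dropping the term.
            ids.append(unk_id if wid is None else wid)
        return ids


tokenizer = ToyTokenizer({"[PAD]": 0, "[UNK]": 1, "hello": 2})
ids = tokenizer.convert_tokens_to_ids(["hello", "unseen", "[PAD]"])
```

With this arrangement the output list keeps the same length as the input, and a token stored at index 0 is preserved rather than treated as missing.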