fix whitespace LM tokenize issue #7407

Merged
merged 25 commits into from
Dec 14, 2020
Changes from 6 commits
Commits (25)
290a56a
fix whitespace LM tokenize issue
howl-anderson Nov 30, 2020
3866874
Merge branch 'master' of https://github.com/RasaHQ/rasa into bugfix/w…
howl-anderson Dec 3, 2020
5feda62
Add testcase
howl-anderson Dec 7, 2020
f72132c
format testcase with black
howl-anderson Dec 7, 2020
82e4ff6
Add changelog entry
howl-anderson Dec 7, 2020
c9131a4
Merge branch 'master' of https://github.com/RasaHQ/rasa into bugfix/w…
howl-anderson Dec 7, 2020
b876612
Update changelog/7407.bugfix.md
tmbo Dec 10, 2020
de831f6
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
tmbo Dec 10, 2020
adefd47
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
0f07431
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
4fef01d
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
5ff8883
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
f09a9a6
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
ed3a905
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
8e21ed9
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
62625d5
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
2594e96
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
c2c9868
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 10, 2020
01647ff
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 11, 2020
7c1366b
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 11, 2020
1b7f4a0
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 14, 2020
75c9a58
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 14, 2020
baa6f18
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 14, 2020
70b1664
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 14, 2020
da4723e
Merge branch 'master' into bugfix/whitespace_lm_tokenize_issue
rasabot Dec 14, 2020
1 change: 1 addition & 0 deletions changelog/7407.bugfix.md
@@ -0,0 +1 @@
Remove a token when its text (for example, whitespace) can't be tokenized by the language model tokenizer used by `LanguageModelFeaturizer`.
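For context, the failure this entry describes comes from whitespace-only token text producing no sub-tokens at all from the language model's tokenizer. A minimal, hypothetical sketch of that behaviour (not part of this PR), assuming the `transformers` library and the `bert-base-chinese` weights used in the test below:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# A whitespace-only string yields no sub-tokens and therefore no token ids.
sub_tokens = tokenizer.tokenize(" ")
sub_token_ids = tokenizer.convert_tokens_to_ids(sub_tokens)

print(sub_tokens, sub_token_ids)  # [] [] -- nothing left for downstream cleanup
```

With empty sub-token lists, the model-specific cleanup step in `LanguageModelFeaturizer` has nothing to strip, which is the exception the code change below guards against.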
6 changes: 6 additions & 0 deletions rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py
@@ -347,6 +347,12 @@ def _tokenize_example(
# use lm specific tokenizer to further tokenize the text
split_token_ids, split_token_strings = self._lm_tokenize(token.text)

if not split_token_ids:
    # Skip this token: if `token.text` contains only whitespace or other special
    # characters, `split_token_ids` and `split_token_strings` come back empty,
    # which would make `self._lm_specific_token_cleanup()` raise an exception.
    continue

(split_token_ids, split_token_strings) = self._lm_specific_token_cleanup(
split_token_ids, split_token_strings
)
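To make the control flow easier to follow in isolation, here is a self-contained sketch of the guard's effect on a list of token texts; `lm_tokenize` below is a stand-in for Rasa's internal `_lm_tokenize`, not the real implementation:

```python
from typing import Callable, List, Tuple

def keep_tokenizable_tokens(
    token_texts: List[str],
    lm_tokenize: Callable[[str], Tuple[List[int], List[str]]],
) -> List[str]:
    """Drop token texts the LM tokenizer cannot split into any sub-tokens."""
    kept = []
    for text in token_texts:
        split_token_ids, _split_token_strings = lm_tokenize(text)
        if not split_token_ids:
            # Whitespace or other special characters produce no sub-tokens;
            # skipping them here avoids the cleanup step raising an exception.
            continue
        kept.append(text)
    return kept

# Toy tokenizer that mimics the whitespace behaviour of the real LM tokenizer:
toy = lambda t: ([], []) if t.strip() == "" else ([1], [t])
assert keep_tokenizable_tokens(["购买", " ", "iPhone"], toy) == ["购买", "iPhone"]
```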
30 changes: 30 additions & 0 deletions tests/nlu/featurizers/test_lm_featurizer.py
@@ -763,3 +763,33 @@ def test_preserve_sentence_and_sequence_features_old_config():
assert not (message.features[1].features == lm_docs[SENTENCE_FEATURES]).any()
assert (message.features[0].features == hf_docs[SEQUENCE_FEATURES]).all()
assert (message.features[1].features == hf_docs[SENTENCE_FEATURES]).all()


@pytest.mark.parametrize(
    "text, tokens, expected_feature_tokens",
    [
        (
            "购买 iPhone 12",  # whitespace ' ' is expected to be removed
            [("购买", 0), (" ", 2), ("iPhone", 3), (" ", 9), ("12", 10)],
            [("购买", 0), ("iPhone", 3), ("12", 10)],
        )
    ],
)
def test_lm_featurizer_correctly_handle_whitespace_token(
    text, tokens, expected_feature_tokens
):
    from rasa.nlu.tokenizers.tokenizer import Token

    config = {
        "model_name": "bert",
        "model_weights": "bert-base-chinese",
    }

    lm_featurizer = LanguageModelFeaturizer(config)

    message = Message.build(text=text)
    message.set(TOKENS_NAMES[TEXT], [Token(word, start) for (word, start) in tokens])

    result, _ = lm_featurizer._tokenize_example(message, TEXT)

    assert [(token.text, token.start) for token in result] == expected_feature_tokens
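The parametrized tokens above match what a word-segmentation tokenizer can emit for mixed Chinese/Latin text, where spaces survive as standalone tokens. A small illustrative sketch, assuming the `jieba` package (not used in this test, shown only for context):

```python
import jieba

# jieba.tokenize yields (word, start, end) triples and keeps whitespace as
# standalone tokens, which is how entries like (" ", 2) can reach the
# LanguageModelFeaturizer in the first place.
print([(word, start) for word, start, _end in jieba.tokenize("购买 iPhone 12")])
# Roughly: [('购买', 0), (' ', 2), ('iPhone', 3), (' ', 9), ('12', 10)]
```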