[FastTokenizer] Add unittest for fast_tokenizer #4339

Merged
merged 15 commits into from
Jan 5, 2023

Conversation

Contributor

@joey12300 joey12300 commented Jan 4, 2023

PR types

Bug fixes

PR changes

Others

Description

Add unittest for fast_tokenizer

@paddle-bot

paddle-bot bot commented Jan 4, 2023

Thanks for your contribution!

@codecov

codecov bot commented Jan 4, 2023

Codecov Report

Merging #4339 (bbc999a) into develop (bd2be0f) will increase coverage by 2.38%.
The diff coverage is 96.15%.

@@             Coverage Diff             @@
##           develop    #4339      +/-   ##
===========================================
+ Coverage    36.30%   38.69%   +2.38%     
===========================================
  Files          419      422       +3     
  Lines        59254    59661     +407     
===========================================
+ Hits         21514    23087    +1573     
+ Misses       37740    36574    -1166     
| Impacted Files | Coverage Δ |
|---|---|
| paddlenlp/transformers/tokenizer_utils_fast.py | 77.77% <50.00%> (ø) |
| paddlenlp/transformers/auto/tokenizer.py | 81.74% <100.00%> (+13.49%) ⬆️ |
| paddlenlp/transformers/bert/fast_tokenizer.py | 100.00% <100.00%> (+100.00%) ⬆️ |
| paddlenlp/transformers/convert_slow_tokenizer.py | 89.43% <100.00%> (+89.43%) ⬆️ |
| paddlenlp/transformers/ernie/fast_tokenizer.py | 100.00% <100.00%> (+100.00%) ⬆️ |
| paddlenlp/transformers/ernie_m/fast_tokenizer.py | 90.90% <100.00%> (+90.90%) ⬆️ |
| paddlenlp/transformers/tinybert/fast_tokenizer.py | 100.00% <100.00%> (+100.00%) ⬆️ |
| paddlenlp/transformers/unimo/modeling.py | 81.66% <0.00%> (-0.92%) ⬇️ |
| paddlenlp/transformers/roberta/modeling.py | 89.85% <0.00%> (-0.37%) ⬇️ |
| ...s/fast_transformer/transformer/fast_transformer.py | 12.08% <0.00%> (-0.34%) ⬇️ |
| ... and 35 more | |
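
As a sanity check on the summary numbers, here is a minimal sketch that recomputes the coverage percentages from the Hits and Lines rows of the report. It assumes percentages are truncated, not rounded, to two decimals. That is an inference from the figures, not documented Codecov behaviour. The helper name `truncated_percent` is made up for illustration.

```python
import math


def truncated_percent(value):
    """Truncate (not round) a percentage to two decimal places."""
    return math.floor(value * 100) / 100


# Figures copied from the Hits and Lines rows of the Codecov report.
base_hits, base_lines = 21514, 59254  # develop
head_hits, head_lines = 23087, 59661  # this PR

base_raw = 100 * base_hits / base_lines
head_raw = 100 * head_hits / head_lines

base_pct = truncated_percent(base_raw)
head_pct = truncated_percent(head_raw)
# The delta is taken from the raw values and truncated afterwards.
delta_pct = truncated_percent(head_raw - base_raw)

print(base_pct, head_pct, delta_pct)  # prints: 36.3 38.69 2.38
```

Under that assumption, the reported +2.38% matches the truncated difference of the raw percentages. Subtracting the two displayed values (38.69 - 36.30) gives 2.39 instead.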


@joey12300 joey12300 changed the title from "[FastTokenizer] with_added_tokens->with_added_vocabulary" to "[FastTokenizer] Add unittest for fast_tokenizer" Jan 4, 2023
Collaborator

@zjjlivein zjjlivein left a comment

Unit test CI passed.

Collaborator

@sijunhe sijunhe left a comment

High-quality PR, nice work! A few minor issues.

@@ -68,10 +73,38 @@ def get_input_output_texts(self, tokenizer):

def test_full_tokenizer(self):
Collaborator

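For all the unit tests, I suggest separating tokenizer and fast_tokenizer, i.e. splitting them into two methods, test_tokenizer and test_fast_tokenizer. If something fails, it will be easier to debug.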

Contributor Author

Done. Added a test_fast_and_python_full_tokenizer unit test that checks whether the fast and Python tokenizers produce the same output.
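
To illustrate the layout discussed in this thread, here is a minimal sketch with one test per tokenizer plus a third test that compares the two. It is not the code from this PR. `python_tokenize` and `fast_tokenize` are hypothetical stand-ins for the real PaddleNLP tokenizers, so the example runs without any dependencies. Only the three test method names come from the comments above.

```python
import re
import unittest


def python_tokenize(text):
    """Hypothetical stand-in for the pure-Python tokenizer."""
    return text.lower().split()


def fast_tokenize(text):
    """Hypothetical stand-in for the fast tokenizer."""
    return re.findall(r"\S+", text.lower())


class TokenizerTest(unittest.TestCase):
    text = "Hello  fast tokenizer"
    expected = ["hello", "fast", "tokenizer"]

    def test_tokenizer(self):
        # A failure here points at the Python implementation only.
        self.assertEqual(python_tokenize(self.text), self.expected)

    def test_fast_tokenizer(self):
        # A failure here points at the fast implementation only.
        self.assertEqual(fast_tokenize(self.text), self.expected)

    def test_fast_and_python_full_tokenizer(self):
        # Parity check: both implementations must agree on the same input.
        self.assertEqual(fast_tokenize(self.text), python_tokenize(self.text))


def run_suite():
    """Run the three tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TokenizerTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this split, a regression in one implementation fails its own test and the parity test. The other implementation's test still passes, which narrows down where to look.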

tests/transformers/auto/test_tokenizer.py (resolved)
sijunhe previously approved these changes Jan 5, 2023
Collaborator

@sijunhe sijunhe left a comment

This PR looks OK to me. @wj-Mcat, please take a look as well.

Contributor

@wj-Mcat wj-Mcat left a comment

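Everything else looks OK to me; there is just a small issue with the package name.

Also, would it be worth adding automated packaging and publishing of fast-tokenizer to a GitHub Actions workflow later? GitHub Actions supports an OS matrix, which allows automated builds and packaging on multiple platforms. In the end it is up to you.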

pytest-xdist
fast_tokenizer_python
Contributor

I checked, and the package name should be fast-tokenizer-python. See https://pypi.org/project/fast-tokenizer-python/

When I run pip install fast_tokenizer_python locally, it fails immediately.
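
For reference, the hyphenated spelling is the canonical form under the name normalization rule in PEP 503. Runs of `-`, `_`, and `.` collapse to a single hyphen and the name is lowercased. A minimal sketch of that rule follows. The helper name `normalize_project_name` is made up for illustration. This does not by itself explain the local install error mentioned above, which may depend on the pip version or platform.

```python
import re


def normalize_project_name(name):
    """Return the PEP 503 normalized form of a project name."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_project_name("fast_tokenizer_python"))
```

Writing the canonical form in the requirements file avoids relying on each tool to normalize the name itself.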

@joey12300 joey12300 merged commit 6960a46 into PaddlePaddle:develop Jan 5, 2023