Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[text] refine tokenizer #2165

Merged
merged 20 commits into from
Nov 28, 2023
Merged

[text] refine tokenizer #2165

merged 20 commits into from
Nov 28, 2023

Conversation

Mddct
Copy link
Collaborator

@Mddct Mddct commented Nov 25, 2023

issues: #2160

in this pr:

  • tokenzier interface
  • whisper tokenizer
  • wenet tokenizer
  • split WenetTokenizer (WenetTokenizer 作为测试对齐)
    • CharTokenizer
    • BpeTokenizer
  • check if can work with multiprocess env (pickle issues)
    • bpe
    • whisper
  • unit test covers all member functions of tokenizer (可能还有没有看到的 之后有问题再看)
  • check related file
  • check training and recognize.py work

next pr

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 27, 2023

这里补充下whisper tokenizer:
原始官方实现中 tokenizer中对multi task的sot language等在post_init 中实现
image

在wenet中我们选择了,外边添加一个函数:
https://github.com/wenet-e2e/wenet/blob/main/wenet/utils/common.py#L150-L213

所以这里的whisper tokenizer需要fix下他填充的东西
- TODO-
[ ] fix whisper prepend-

@xingchensong
Copy link
Member

post_init 不用管他,不灵活,弃用。举个例子 ,多任务学习时,可能有 transcribe 任务,也可能有 translate 任务,如果执意用 self.sot_sequence, 那么要改成员属性

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 27, 2023

post_init 不用管他,不灵活,弃用。举个例子 ,多任务学习时,可能有 transcribe 任务,也可能有 translate 任务,如果执意用 self.sot_sequence, 那么要改成员属性

确认了下,pos_init 不影响其他的函数, 看着目前的实现中应该也没有问题了

@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch 2 times, most recently from 23459fa to af58893 Compare November 27, 2023 05:55
@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch from af58893 to c2ecc7c Compare November 27, 2023 08:41
@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch from 3acdd59 to 55da48a Compare November 27, 2023 09:20
wenet/utils/common.py Outdated Show resolved Hide resolved
@xingchensong
Copy link
Member

关于 add_special_tokens (如 sos eos) 应该在tokenizer的类函数中实现还是一个独立函数中实现,讨论结果如下:

  1. add_xxx_tokens() 函数实际只用到了special token对应的id,不涉及 tokenize or detokenize, 相对来说较为独立
  2. add_xxx_tokens() 如果放到tokenizer的类函数中,那么所有model都需要own一个tokenizer,但是model实际上需要own的是special token id,所以 model own special ids + 独立 add_xxx_tokens() 可能更合适,也方便推理时c++通过jit-traced.zip 读取相应的id。

另:如果jit可以直接把toeknizer的类函数trace出来,那上述结论就需要推翻

@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch 2 times, most recently from 6d21b0f to 75b1e78 Compare November 27, 2023 13:43
@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch from 75b1e78 to 49994bf Compare November 27, 2023 14:00
@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch from aab5fde to a13dedf Compare November 27, 2023 16:29
@Mddct
Copy link
Collaborator Author

Mddct commented Nov 27, 2023

@Mddct Mddct marked this pull request as ready for review November 27, 2023 16:35
@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

whisper tiktokenize 在多进程环境先会有问题, 所有也改成了lazy build的形式
issues:huggingface/datasets#5769

@Mddct Mddct force-pushed the Mddct-refine-tokenzier branch from c70192b to 51a10fa Compare November 28, 2023 04:15
@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

recognize.py works!
Screenshot 2023-11-28 at 19 41 57

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

training works
Screenshot 2023-11-28 at 21 02 19

@Mddct Mddct requested a review from xingchensong November 28, 2023 13:06
@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

@robin1001 @xingchensong all work, it's time to merge

@robin1001
Copy link
Collaborator

Really a lot of work, and all look great! Just one question, seems WenetTokenizer is not initialized in init_tokenzer.py.

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

Really a lot of work, and all look great! Just one question, seems WenetTokenizer is not initialized in init_tokenzer.py.

Split WenetTokenizer into char and bpe, WenetTokenizer is just a tool to aligm the result for char and bpe tokenizer

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 28, 2023

Really a lot of work, and all look great! Just one question, seems WenetTokenizer is not initialized in init_tokenzer.py.

Split WenetTokenizer into char and bpe, WenetTokenizer is just a tool to aligm the result for char and bpe tokenizer

Will delete in future pr

@robin1001 robin1001 merged commit 3ab6718 into main Nov 28, 2023
6 checks passed
@robin1001 robin1001 deleted the Mddct-refine-tokenzier branch November 28, 2023 13:46
@Mddct Mddct mentioned this pull request Nov 28, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants