Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[text] fix whisper tokens and others #2179

Merged
merged 2 commits into from
Nov 29, 2023
Merged

Conversation

Mddct
Copy link
Collaborator

@Mddct Mddct commented Nov 28, 2023

after #2165 :

  • 修复whisper tokenizer, 先调用vocab_size 等方法 导致picke失败
  • 其他文件 需要tokenizer
    • wenet/bin/alignment.py
    • wenet/bin/recognize_onnx_gpu.py
    • examples/vkw2021/s0/local/vkw_kws_results.py
    • tools/onnx2horizonbin.py
      NOTE: 这几个文件只是兼容了下新的tokenizer和Dataset的接口,具体有些char dict到字的生成逻辑保持原有,之后有时间再修正(用tokenizer.detokenize or id2tokens)

@Mddct Mddct requested a review from xingchensong November 29, 2023 03:28
@xingchensong
Copy link
Member

getstate 是在哪调用的

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 29, 2023

getstate 是在哪调用的

Screenshot 2023-11-29 at 11 31 05

multiprocess 的时候pickle 需要序列化, 这里删掉self.tokenizer,就不需要对它序列化, 然后其他进程依旧lazy init

@xingchensong
Copy link
Member

这个multiprocess带来的lazy build,是否可以理解为:

  1. “CoreBPE没法序列化” =>
  2. "tokenizer无法当参数传给multiprocess,因为multiprocess会对入参进行序列化" =>
  3. "解决方案是先把这个bpe相关属性从tokenizer中删了以使得multiprocess可以序列化tokenizer,当需要用到bpe的时候,tokenizer会被反序列化,这时lazy build使得每个process可以独立build自己的BPE"

@Mddct
Copy link
Collaborator Author

Mddct commented Nov 29, 2023

这个multiprocess带来的lazy build,是否可以理解为:

  1. “CoreBPE没法序列化” =>
  2. "tokenizer无法当参数传给multiprocess,因为multiprocess会对入参进行序列化" =>
  3. "解决方案是先把这个bpe相关属性从tokenizer中删了以使得multiprocess可以序列化tokenizer,当需要用到bpe的时候,tokenizer会被反序列化,这时lazy build使得每个process可以独立build自己的BPE"

是的 , 这里也加个个unit test,
以multiprocessing 的dataset 都会有这个问题 huggingface/datasets#5769

Screenshot 2023-11-29 at 11 46 59

@xingchensong
Copy link
Member

get

@xingchensong xingchensong merged commit a3fbf41 into main Nov 29, 2023
6 checks passed
@xingchensong xingchensong deleted the Mddct-tokenizer-related branch November 29, 2023 03:54
@Mddct Mddct mentioned this pull request Nov 30, 2023
15 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants