
tokenizer bug #4

Open

qweszxc7410 opened this issue Jul 3, 2024 · 2 comments

@qweszxc7410 commented Jul 3, 2024
I changed llm_model_path to 'yentinglin/Llama-3-Taiwan-8B-Instruct', and then the error below occurred. It seems that the Llama-3-Taiwan-8B-Instruct tokenizer.json does not contain the byte token "<0xE8>", and GFD operates on bytes. Is it possible to fix this, or is the missing byte token not the main cause? Thanks!

```
========================================
[0] asr_score=-0.80078125, llm_score=-6.636518955230713,fuse_score=-1.9679287910461427,
各位

Traceback (most recent call last):
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 40, in <module>
    main()
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 33, in main
    result = model.get_transcription(args.audio_file_path)
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 117, in get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 245, in _get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/tokenizer.py", line 75, in tokenize_from_byte
KeyError: b'\xe8'
```
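For reference, the missing token can be confirmed directly from the tokenizer vocabulary. The snippet below is a minimal check using the standard Hugging Face transformers API; the expected output is inferred from the traceback above rather than re-verified here.

```python
from transformers import AutoTokenizer

# SentencePiece tokenizers with byte fallback (e.g. Mistral/Breeze) expose one
# token per byte, spelled "<0x00>" .. "<0xFF>"; Llama-3-style byte-level BPE
# vocabularies have no such entries.
tok = AutoTokenizer.from_pretrained("yentinglin/Llama-3-Taiwan-8B-Instruct")
print("<0xE8>" in tok.get_vocab())  # expected: False, hence the KeyError above
```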

@Splend1d (Contributor) commented Jul 3, 2024

Hi,

Thank you for your interest in this project! Currently, only the Breeze and Mistral models are supported (please refer to the "Warning" section). The reason is that the algorithm needs a "byte tokenization" method, and different tokenizers represent their tokens in different ways. We have not found a way to systematically patch this feature for all models, so we chose to support only Mistral and Breeze.

In short, this is not a bug, and we probably will not patch it soon, as the list of models is endless. But we encourage you to do so! It is not that complicated: all you have to do is create a custom tokenizer that supports the byte functions we have implemented.
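To illustrate the idea, here is a hypothetical sketch of the kind of byte-to-token lookup such a custom tokenizer would need; the function name and signature are illustrative and are not GFD's actual tokenizer.py API. It only works for vocabularies that spell out byte-fallback tokens.

```python
from transformers import AutoTokenizer

def bytes_to_token_ids(data: bytes, tokenizer) -> list[int]:
    """Map raw bytes to token ids via SentencePiece-style byte-fallback tokens.

    Hypothetical helper, not GFD's actual implementation.
    """
    vocab = tokenizer.get_vocab()
    ids = []
    for b in data:
        token = f"<0x{b:02X}>"  # byte-fallback spelling, e.g. "<0xE8>"
        if token not in vocab:
            # Llama-3-style byte-level BPE vocabularies fail here,
            # which is the KeyError in the traceback above.
            raise KeyError(f"no byte-fallback token for byte {b:#04x}")
        ids.append(vocab[token])
    return ids
```

A model with byte fallback (e.g. Mistral or Breeze) resolves every byte this way; a byte-level BPE model like Llama-3 does not, which is why support has to be added per tokenizer family.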

Best,
Jeff

@qweszxc7410 (Author) commented

Thank you for your explanation.
