
tokenizer bug #4

Open

qweszxc7410 opened this issue Jul 3, 2024 · 2 comments

@qweszxc7410 commented Jul 3, 2024
I changed llm_model_path to 'yentinglin/Llama-3-Taiwan-8B-Instruct', and then the error below occurred. It seems that the Llama-3-Taiwan-8B-Instruct tokenizer.json does not contain the byte token "<0xE8>", and GFD operates on bytes. Is it possible to fix this, or is the missing byte token not the main cause? Thanks!

```
========================================
[0] asr_score=-0.80078125, llm_score=-6.636518955230713,fuse_score=-1.9679287910461427,
各位

Traceback (most recent call last):
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 40, in <module>
    main()
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 33, in main
    result = model.get_transcription(args.audio_file_path)
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 117, in get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 245, in _get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/tokenizer.py", line 75, in tokenize_from_byte
KeyError: b'\xe8'
```
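For reference, the missing token can be confirmed directly from the tokenizer vocabulary. The snippet below is a minimal check using the standard Hugging Face transformers API; the expected output is inferred from the traceback above rather than re-verified here.

```python
from transformers import AutoTokenizer

# SentencePiece tokenizers with byte fallback (e.g. Mistral/Breeze) expose one
# token per byte, spelled "<0x00>" .. "<0xFF>"; Llama-3-style byte-level BPE
# vocabularies have no such entries.
tok = AutoTokenizer.from_pretrained("yentinglin/Llama-3-Taiwan-8B-Instruct")
print("<0xE8>" in tok.get_vocab())  # expected: False, hence the KeyError above
```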

@Splend1d (Contributor) commented Jul 3, 2024

Hi,

Thank you for your interest in this project! Currently, only the Breeze and Mistral models are supported (please refer to the "Warning" section). The reason is that the algorithm needs a "byte tokenization" method, and different tokenizers represent their tokens in different ways. We have not found a way to systematically patch this feature for all models, so we chose to support only Mistral and Breeze.

In short, this is not a bug, and we probably will not patch it soon, as the list of models is endless. But we encourage you to do so! It is not that complicated: all you have to do is create a custom tokenizer that supports the byte functions we have implemented.
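To illustrate the idea, here is a hypothetical sketch of the kind of byte-to-token lookup such a custom tokenizer would need; the function name and signature are illustrative and are not GFD's actual tokenizer.py API. It only works for vocabularies that spell out byte-fallback tokens.

```python
from transformers import AutoTokenizer

def bytes_to_token_ids(data: bytes, tokenizer) -> list[int]:
    """Map raw bytes to token ids via SentencePiece-style byte-fallback tokens.

    Hypothetical helper, not GFD's actual implementation.
    """
    vocab = tokenizer.get_vocab()
    ids = []
    for b in data:
        token = f"<0x{b:02X}>"  # byte-fallback spelling, e.g. "<0xE8>"
        if token not in vocab:
            # Llama-3-style byte-level BPE vocabularies fail here,
            # which is the KeyError in the traceback above.
            raise KeyError(f"no byte-fallback token for byte {b:#04x}")
        ids.append(vocab[token])
    return ids
```

A model with byte fallback (e.g. Mistral or Breeze) resolves every byte this way; a byte-level BPE model like Llama-3 does not, which is why support has to be added per tokenizer family.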

Best,
Jeff

@qweszxc7410 (Author) commented

Thank you for your explanation.
