I changed the `llm_model_path` to 'yentinglin/Llama-3-Taiwan-8B-Instruct', and then this bug happened. It seems that the Llama-3-Taiwan-8B-Instruct tokenizer.json does not contain `<0xE8>`. GFD is based on bytes. Is it possible to fix this, or is that not the main reason? Thanks!
```
Traceback (most recent call last):
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 40, in <module>
    main()
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 33, in main
    result = model.get_transcription(args.audio_file_path)
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 117, in get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 245, in _get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/tokenizer.py", line 75, in tokenize_from_byte
KeyError: b'\xe8'
```
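One way to confirm the missing token is to inspect the vocabulary directly. Below is a minimal sketch assuming the Hugging Face `transformers` API; the Breeze model id is included only for comparison and is an assumption, not something taken from this thread:

```python
# Check whether a tokenizer ships SentencePiece-style byte-fallback
# tokens such as <0xE8> (a sketch; model ids are assumptions).
from transformers import AutoTokenizer

for name in ["yentinglin/Llama-3-Taiwan-8B-Instruct",
             "MediaTek-Research/Breeze-7B-Instruct-v1_0"]:
    vocab = AutoTokenizer.from_pretrained(name).get_vocab()
    print(name, "has <0xE8>:", "<0xE8>" in vocab)
```

Llama 3 tokenizers use byte-level BPE, so single bytes are spelled differently in the vocabulary than SentencePiece's `<0xE8>`-style tokens; that would be consistent with the `KeyError: b'\xe8'` above.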
Thank you for your interest in this project! Currently, only the Breeze and Mistral models are supported (please refer to the "Warning" section). The reason is that the algorithm needs a "byte tokenization" method. Different tokenizers represent tokens in different ways, and we have not found a way to systematically patch this feature for all models, so we chose to support only Mistral and Breeze.

In short, this is not a bug, and we probably will not patch it soon, as the list of models is endless. But we encourage you to do so! It is not that complicated: all you have to do is create a custom tokenizer that supports the byte functions we have implemented.
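For anyone attempting this, here is a minimal sketch of the kind of byte-to-token-id table involved, assuming the Hugging Face `transformers` API. It is not GFD's actual implementation: `build_byte_to_id` is a hypothetical helper, and the fallback to GPT-2's `bytes_to_unicode` mapping is one way a byte-level-BPE vocabulary such as Llama 3's could be bridged to the `<0xE8>`-style lookup that `tokenize_from_byte` appears to expect.

```python
# Hypothetical sketch (not GFD's actual API): map every single byte to a
# vocab id, covering both tokenizer conventions.
from transformers import AutoTokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

def build_byte_to_id(tokenizer):
    """Map each of the 256 single bytes to a token id, if one exists."""
    vocab = tokenizer.get_vocab()
    byte_char = bytes_to_unicode()  # int (0..255) -> printable unicode char
    table = {}
    for b in range(256):
        sp_token = f"<0x{b:02X}>"   # SentencePiece byte-fallback spelling
        bpe_token = byte_char[b]    # byte-level BPE spelling (GPT-2 scheme)
        if sp_token in vocab:
            table[bytes([b])] = vocab[sp_token]
        elif bpe_token in vocab:
            table[bytes([b])] = vocab[bpe_token]
    return table

# The byte from the KeyError above should then resolve for Llama 3 as well:
tok = AutoTokenizer.from_pretrained("yentinglin/Llama-3-Taiwan-8B-Instruct")
print(build_byte_to_id(tok)[b"\xe8"])
```

A table like this only covers single-byte lookups; a full port would still need to wire it into the byte functions in `gfd/tokenizer.py` (the `tokenize_from_byte` call in the traceback) so that partial UTF-8 sequences, such as the lone `b'\xe8'` lead byte above, resolve to per-byte ids.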
> I changed the `llm_model_path` to 'yentinglin/Llama-3-Taiwan-8B-Instruct', and then this bug happened. It seems that the Llama-3-Taiwan-8B-Instruct tokenizer.json does not contain `<0xE8>`. GFD is based on bytes. Is it possible to fix this, or is that not the main reason? Thanks!

```
========================================
[0] asr_score=-0.80078125, llm_score=-6.636518955230713,fuse_score=-1.9679287910461427,
各位
Traceback (most recent call last):
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 40, in <module>
    main()
  File "/home/ubuntu/A10California/generative-fusion-decoding/benchmarks/run_single_file.py", line 33, in main
    result = model.get_transcription(args.audio_file_path)
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 117, in get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/gfd.py", line 245, in _get_transcription
  File "/home/ubuntu/A10California/generative-fusion-decoding/venv/lib/python3.10/site-packages/gfd-0.0.1-py3.10.egg/gfd/tokenizer.py", line 75, in tokenize_from_byte
KeyError: b'\xe8'
```