llama : fix bpe tokenize from byte #2889
Conversation
This helps, but it seems like there are still pretty serious issues with BPE handling. For BPE, I tried changing lines 6155 to 6157 in ad9ddcf to:

```cpp
if (llama_is_normal_token(model->vocab, token)
    || llama_is_user_defined_token(model->vocab, token)) {
```

which helps, but since everything is marked as user-defined, things like byte tokens don't get decoded. I'd guess what
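To make the failure mode concrete, here is a hypothetical, self-contained sketch (none of these types or helpers are the real llama.cpp ones; they only stand in for the decode decision being discussed) of why widening the check is not enough when byte tokens themselves end up flagged as user-defined:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins, not llama.cpp types: they only illustrate the decode decision.
enum class TokenType { NORMAL, USER_DEFINED, BYTE };

struct TokenData {
    std::string text; // e.g. "hello" or "<0x0A>"
    TokenType   type;
};

// Decode one token into output text.
static std::string token_to_piece(const TokenData & td) {
    switch (td.type) {
        case TokenType::NORMAL:
        case TokenType::USER_DEFINED:
            return td.text; // copy the stored piece verbatim
        case TokenType::BYTE: {
            // "<0x0A>" should become the raw byte 0x0A, not the literal 6-char string;
            // if byte tokens are mis-labeled as USER_DEFINED, this branch never runs.
            const uint8_t byte = static_cast<uint8_t>(std::stoul(td.text.substr(3, 2), nullptr, 16));
            return std::string(1, static_cast<char>(byte));
        }
    }
    return "";
}
```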
This wasn't needed in the reference implementation that is supposed to work 100% correctly. I think this is a workaround for the missing Unicode support in llama.cpp's BPE tokenizer.
Is it better to add this sort of patch, or should we try to implement a 100% correct BPE tokenizer? For the latter to happen, we would need help from someone with enough knowledge to implement it in a compact way.
From what you said over here: #2938 (review), it seems like an argument against the workaround. Basically, at least for the BPE models I tested, it got them a little closer to doing something other than just crashing, but they were still pretty far from being able to produce usable output. I might be wrong, but my impression is that without actually supporting the Unicode stuff it won't be possible to really use them, with or without the workaround. @opparco Seems like the model you were testing with is
The byte-fallback feature is an option in the training phase.

In google/sentencepiece:

In huggingface/tokenizers:

```python
class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm
    ...
    byte_fallback (:obj:`bool`, `optional`):
        Whether to use spm byte-fallback trick (defaults to False)
    """
```

Implementation: sentencepiece bpe compatible tokenizer #252

```cpp
for (int i = 0; i != -1; i = symbols_[i].next) {
    auto & symbol = symbols_[i];
    auto token = vocab_.token_to_id.find(std::string(symbol.text));

    if (token == vocab_.token_to_id.end()) {
        // output any symbols that did not form tokens as bytes.
        for (int j = 0; j < symbol.text.size(); ++j) {
            gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
            output.push_back(token_id);
        }
    } else {
        output.push_back((*token).second);
    }
}
```

This code maps each byte to an id correctly.
libfalcon.cpp does not have the byte-fallback feature, so it is broken.
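As a side note (an assumption about the vocab layout, not something stated above): the `+ 3` offset matches the usual sentencepiece arrangement where ids 0, 1, 2 are reserved for `<unk>`, `<s>`, `</s>` and the 256 byte pieces `<0x00>`..`<0xFF>` follow immediately after. A minimal sketch of that mapping:

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of the byte-fallback id mapping used in the snippet above.
// Assumes the sentencepiece-style layout: id 0 = <unk>, 1 = <s>, 2 = </s>,
// then <0x00>..<0xFF> at ids 3..258. Other vocabularies may lay this out differently.
static int byte_to_fallback_id(uint8_t byte) {
    return static_cast<int>(byte) + 3;
}

int main() {
    // 'A' is 0x41, so it falls back to token id 0x41 + 3 = 68
    printf("byte 0x41 -> id %d\n", byte_to_fallback_id(0x41));
    return 0;
}
```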
Thank you for the explanation. So it seems that if byte-fallback was enabled during training, we do need this patch.
I guess it is OK to merge then.
One more question: if we use the byte-fallback feature with a model that did not have that feature enabled during training, would the tokenization end up wrong? Or would we never even hit this byte-fallback branch?
My guess is the latter, but I just want to confirm.
Even in the current code, there are invalid input prompts that reach the ERROR branch:

```cpp
const auto token = vocab.token_to_id.find(str);

if (token == vocab.token_to_id.end()) {
    for (auto j = str.begin(); j != str.end(); ++j) {
        std::string byte_str(1, *j);
        auto token_multibyte = vocab.token_to_id.find(byte_str);
        if (token_multibyte == vocab.token_to_id.end()) {
            fprintf(stderr, "ERROR: byte not found in vocab: '%s'\n", byte_str.c_str());
        }
        output.push_back((*token_multibyte).second);
    }
} else {
    output.push_back((*token).second);
}
```

Even if it reaches the ERROR branch with a model trained with byte-fallback = false, byte-fallback is not involved in that case.
In llm_tokenizer_bpe::tokenize, if the lookup in vocab.token_to_id fails, llama_byte_to_token(vocab, ch) should be tried.
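A minimal sketch of what that fallback could look like, assuming the surrounding llm_tokenizer_bpe::tokenize context (`vocab` in scope, `output` being the vector of token ids, `str` the piece that failed the lookup); this is an illustration, not the exact patch:

```cpp
// Sketch only: variable names assume the surrounding llm_tokenizer_bpe::tokenize code.
const auto token = vocab.token_to_id.find(str);
if (token != vocab.token_to_id.end()) {
    output.push_back(token->second);
} else {
    // byte fallback: emit one byte token per raw byte of the piece that has no vocab entry
    for (unsigned char ch : str) {
        output.push_back(llama_byte_to_token(vocab, ch));
    }
}
```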