llama : fix bpe tokenize from byte #2889
Conversation
This helps, but it seems like there are still pretty serious issues with BPE handling. For BPE, I tried changing lines 6155 to 6157 in ad9ddcf to:

```cpp
if (llama_is_normal_token(model->vocab, token)
    || llama_is_user_defined_token(model->vocab, token)) {
```

which helps, but since everything is marked as user-defined, things like byte tokens don't get decoded. I'd guess what
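To make the failure mode concrete, here is a hypothetical, self-contained sketch (none of these types or helpers are the real llama.cpp ones; they only stand in for the decode decision being discussed) of why widening the check is not enough when byte tokens themselves end up flagged as user-defined:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins, not llama.cpp types: they only illustrate the decode decision.
enum class TokenType { NORMAL, USER_DEFINED, BYTE };

struct TokenData {
    std::string text; // e.g. "hello" or "<0x0A>"
    TokenType   type;
};

// Decode one token into output text.
static std::string token_to_piece(const TokenData & td) {
    switch (td.type) {
        case TokenType::NORMAL:
        case TokenType::USER_DEFINED:
            return td.text; // copy the stored piece verbatim
        case TokenType::BYTE: {
            // "<0x0A>" should become the raw byte 0x0A, not the literal 6-char string;
            // if byte tokens are mis-labeled as USER_DEFINED, this branch never runs.
            const uint8_t byte = static_cast<uint8_t>(std::stoul(td.text.substr(3, 2), nullptr, 16));
            return std::string(1, static_cast<char>(byte));
        }
    }
    return "";
}
```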
This wasn't needed in the reference implementation that is supposed to work 100% correctly. I think this is a workaround for the missing Unicode support in llama.cpp's BPE tokenizer.
Is it better to add this sort of patch, or should we try to implement a 100% correct BPE tokenizer? For the latter to happen, we would need help from someone with enough knowledge to implement it in a compact way.
From what you said over here: #2938 (review), it seems like an argument against the workaround. Basically, at least for the BPE models I tested, it got them a little closer to doing something other than just crashing, but they were still pretty far from being able to produce usable output. I might be wrong, but my impression is that without actually supporting the Unicode stuff it won't be possible to really use them, with or without the workaround. @opparco Seems like the model you were testing with is
The byte-fallback feature is an option in the training phase.

In google/sentencepiece:

In huggingface/tokenizers:

```python
class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm
    ...
    byte_fallback (:obj:`bool`, `optional`):
        Whether to use spm byte-fallback trick (defaults to False)
    """
```

Implementation: sentencepiece bpe compatible tokenizer #252

```cpp
for (int i = 0; i != -1; i = symbols_[i].next) {
    auto & symbol = symbols_[i];
    auto token = vocab_.token_to_id.find(std::string(symbol.text));

    if (token == vocab_.token_to_id.end()) {
        // output any symbols that did not form tokens as bytes.
        for (int j = 0; j < symbol.text.size(); ++j) {
            gpt_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
            output.push_back(token_id);
        }
    } else {
        output.push_back((*token).second);
    }
}
```

This code maps each byte to an id correctly.
libfalcon.cpp does not have the byte-fallback feature, so it is broken.
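As a side note (an assumption about the vocab layout, not something stated above): the `+ 3` offset matches the usual sentencepiece arrangement where ids 0, 1, 2 are reserved for `<unk>`, `<s>`, `</s>` and the 256 byte pieces `<0x00>`..`<0xFF>` follow immediately after. A minimal sketch of that mapping:

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of the byte-fallback id mapping used in the snippet above.
// Assumes the sentencepiece-style layout: id 0 = <unk>, 1 = <s>, 2 = </s>,
// then <0x00>..<0xFF> at ids 3..258. Other vocabularies may lay this out differently.
static int byte_to_fallback_id(uint8_t byte) {
    return static_cast<int>(byte) + 3;
}

int main() {
    // 'A' is 0x41, so it falls back to token id 0x41 + 3 = 68
    printf("byte 0x41 -> id %d\n", byte_to_fallback_id(0x41));
    return 0;
}
```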
Thank you for the explanation. So it seems that if byte-fallback was enabled during training, we do need this patch.
I guess it is OK to merge then.
One more question: if we use the byte-fallback feature with a model that did not have that feature enabled during training, would the tokenization end up wrong? Or would we never even hit this byte-fallback branch?
My guess is the latter, but I just want to confirm.
Even in the current code, there are invalid input prompts that reach the ERROR branch:

```cpp
const auto token = vocab.token_to_id.find(str);

if (token == vocab.token_to_id.end()) {
    for (auto j = str.begin(); j != str.end(); ++j) {
        std::string byte_str(1, *j);
        auto token_multibyte = vocab.token_to_id.find(byte_str);
        if (token_multibyte == vocab.token_to_id.end()) {
            fprintf(stderr, "ERROR: byte not found in vocab: '%s'\n", byte_str.c_str());
        }
        output.push_back((*token_multibyte).second);
    }
} else {
    output.push_back((*token).second);
}
```

Even if it reaches the ERROR branch with a model trained with byte-fallback = false, byte-fallback is not involved in that case.
In llm_tokenizer_bpe::tokenize, if the lookup in vocab.token_to_id fails, llama_byte_to_token(vocab, ch) should be tried.
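A minimal sketch of what that fallback could look like, assuming the surrounding llm_tokenizer_bpe::tokenize context (`vocab` in scope, `output` being the vector of token ids, `str` the piece that failed the lookup); this is an illustration, not the exact patch:

```cpp
// Sketch only: variable names assume the surrounding llm_tokenizer_bpe::tokenize code.
const auto token = vocab.token_to_id.find(str);
if (token != vocab.token_to_id.end()) {
    output.push_back(token->second);
} else {
    // byte fallback: emit one byte token per raw byte of the piece that has no vocab entry
    for (unsigned char ch : str) {
        output.push_back(llama_byte_to_token(vocab, ch));
    }
}
```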