convert.py : handle special tokens #2820
I can look at this after #2753 (hopefully) gets merged. I was planning on doing some cleanup work and fixing the type annotations; this seems like the kind of thing that would be reasonable to throw into that kind of pull as well. |
I think we need to use a model that utilizes special tokens to test this with. I see people mentioning "OpenChat V2 x OpenOrca" when they need to handle special tokens - maybe we can try to make those work |
Using a model with special tokens to test handling special tokens is an idea just crazy enough to work! |
For BPE to work with llama models (Aquila?) |
In progress over here: #2842 |
The next step is using the special tokens in llama.cpp. My guess is we need to just update the code at lines 947 to 950 in dc07dc4.
|
I'm not sure where discussion about this should be.
I've been doing some testing with https://huggingface.co/BAAI/Aquila-7B and https://huggingface.co/kfkas/Llama-2-ko-7b-Chat trying to get the BPE stuff to work. First, it seems like all these BPE models just die in llama.cpp without #2889. I'm a little surprised that pull has gotten no attention so far. It also seems like the stuff in |
Is this working yet? |
We now have to use these special tokens in llama.cpp. Can somebody confirm that the following is correct:
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
|
I don't know what behavior is considered correct, but it seems like in that particular case it means you can't talk about HTML strikethrough tags anymore. I.e. a prompt like "Dear LLaMA model, please make a list according to such and such rules. Surround elements that meet a certain criterion with strikethrough like <s>this</s>." It's less of an issue when the special tokens are like |
Normally the special token strings should not be recognized as special tokens in the user prompt. It would be better to have a CLI parameter for users who need to use them. Instead of using the model vocab, these tokens should be user-configurable. Something like |
What if it was something like |
The token ids for the special tokens may differ from model to model like the default mapping strings. So any external use of the special tokens should not depend on knowing the token ids. |
In addition, a CLI parameter for enabling or disabling printing the special tokens in the output from the model would be good. |
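For illustration, such a flag would only need to filter special-token ids at print time. A minimal Python sketch with made-up names (this is not an existing llama.cpp option or API):

# Sketch of what such a flag could control; the flag name and helpers are
# hypothetical, not existing llama.cpp options or functions.
from typing import Dict, Iterable, Set

def render_output(ids: Iterable[int], id_to_piece: Dict[int, str],
                  special_ids: Set[int], print_special: bool) -> str:
    pieces = []
    for tok in ids:
        if tok in special_ids and not print_special:
            continue                      # hide control/special tokens
        pieces.append(id_to_piece.get(tok, ""))
    return "".join(pieces)

# e.g. with a hypothetical --print-special-tokens flag turned off, "</s>" is dropped:
print(render_output([15043, 2], {15043: "Hello", 2: "</s>"}, {1, 2}, False))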
Decent point. What I was talking about could still work with a small adaptation: you could use stuff like |
Yes, that could also work. Here is a snippet of the tinystories dataset. To correctly tokenize this dataset independently of the model, a parameter for setting the EOS token is needed. |
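A sketch of what such a parameter would drive when preparing a dataset like that; the names are placeholders, not existing llama.cpp code:

# Illustrative sketch only: map a user-supplied end-of-text marker in a raw
# dataset to the model's actual EOS token id, instead of hard-coding "</s>"
# or "<|endoftext|>".
from typing import Callable, List

def tokenize_dataset(
    text: str,
    tokenize: Callable[[str], List[int]],  # stand-in for the model's plain-text tokenizer
    eos_marker: str,                        # e.g. "<|endoftext|>", supplied by the user
    eos_id: int,                            # the model's actual EOS token id
) -> List[int]:
    parts = text.split(eos_marker)
    ids: List[int] = []
    for i, chunk in enumerate(parts):
        ids.extend(tokenize(chunk))         # ordinary text goes through the normal tokenizer
        if i < len(parts) - 1:              # an eos_marker followed this chunk in the raw text
            ids.append(eos_id)              # insert the real EOS id between documents
    return ids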
Hi, is handling special tokens working in the latest master branch? I tested with https://huggingface.co/openchat/openchat_v3.2_super and it doesn't seem to work. I added a print and exit in convert.py, logging |
I need to understand how special tokens work. If they are not parsed during prompt processing, then I don't understand what their purpose is at all. |
From my understanding: Special tokens are used in finetunes to provide better structure in the LLM's output.
Special tokens are defined by the creators of each dataset for their respective finetunes, so what they are used for varies. So I think it's a good feature for the llama.cpp convert script to support arbitrary special tokens by reading "added_tokens.json" and adding them to the GGUF. Users of those finetunes will know how to use the special tokens at their end, as long as those tokens are emitted by the LLM. A drawback of special tokens is that, yes, when defined thoughtlessly they will conflict with the output, as in the case of |
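To make the conversion-side idea above concrete, here is a rough Python sketch of collecting special tokens from a Hugging Face model directory; the function name is made up and this is not the actual convert.py code, just an assumed reading of the standard added_tokens.json / special_tokens_map.json layout.

# Rough sketch (not the actual convert.py code): gather special-token data from
# a Hugging Face model directory so it can later be written into GGUF metadata.
import json
from pathlib import Path

def load_special_tokens(model_dir: str) -> tuple[dict[str, int], dict[str, str]]:
    path = Path(model_dir)

    # added_tokens.json maps extra token text -> token id, e.g. {"<|end_of_turn|>": 32000}
    added: dict[str, int] = {}
    added_path = path / "added_tokens.json"
    if added_path.exists():
        added = json.loads(added_path.read_text(encoding="utf-8"))

    # special_tokens_map.json names the roles (bos/eos/unk/...) and their text,
    # as in the JSON snippets quoted in this thread.
    roles: dict[str, str] = {}
    roles_path = path / "special_tokens_map.json"
    if roles_path.exists():
        for role, entry in json.loads(roles_path.read_text(encoding="utf-8")).items():
            roles[role] = entry["content"] if isinstance(entry, dict) else entry

    return added, roles

The conversion script would then look each role's text up in the model vocabulary to obtain its id and record the BOS/EOS/UNK ids (and any added tokens) in the GGUF metadata; the actual gguf writer calls are omitted here.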
@l3utterfly Thank you - I think this description gives an answer to my question earlier. So based on this, I think the only part that is currently missing is to replace the special token pieces (i.e. the text such as <s>). @KerfuffleV2 Are you interested in looking into this? To test this: after updating convert.py, with the following special_tokens_map.json:
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false
}
}
main should tokenize the string |
Sure, I can look at it. Not sure I'm 100% clear on what needs to happen from the conversion side, so I may need to ask some follow-up questions. I'll see what I can figure out on my own first. edit: I think it's definitely going to be a lot more complicated than just changing |
@ggerganov Actually, I'm confused. We already write the text content for special tokens like BOS, and llama.cpp seems to already know what the content is for those tokens. For example, when starting up:
So I think the issue is that the C++ tokenizer side is not using the token content of tokens like BOS when tokenizing, rather than this being something that could be fixed on the model conversion side. Or am I misunderstanding something? edit: Not sure if it's significant for this, but BOS and EOS get added with token type control ( |
Ah, I guess the Python classes somehow already took care of that. Then I think we are done - special tokens should already work. Can you check how the |
It tokenizes like:
I've been messing around with the C++ side and I don't really understand what's going on. I thought maybe it was because of the ... Dumping the text in
I also added some debug prints to just before
Patch:

--- a/llama.cpp
+++ b/llama.cpp
@@ -3578,6 +3585,7 @@ struct llm_tokenizer_spm {
             llm_symbol sym;
             size_t len = utf8_len(text[offs]);
             sym.text = text.c_str() + offs;
+            printf("\n-- %s\n", text.c_str() + offs);
             sym.n = std::min(len, text.size() - offs);
             offs += sym.n;
             sym.prev = index - 1;
@@ -3624,6 +3632,7 @@ struct llm_tokenizer_spm {
         for (int i = 0; i != -1; i = symbols[i].next) {
             auto & symbol = symbols[i];
+            printf("%d: %zu -- %s\n", i, symbol.n, symbol.text);
             resegment(symbol, output);
         }
     }
@@ -3635,9 +3644,11 @@ private:
         // Do we need to support is_unused?
         if (token != vocab.token_to_id.end()) {
+            printf("** %d -- [%s]\n", token->second, text.c_str());
             output.push_back((*token).second);
             return;
         }
+        printf("!! [%s]\n", text.c_str());
         const auto p = rev_merge.find(text);

I also tried adding some debug output to
It doesn't look like it tried a combination with |
I took a look at the tokenising code in C++:
Debugging with the prompt: From my understanding, this bi-gram-focused tokenisation may skip over long tokens (tokens spanning multiple characters) because they may not merge correctly, perhaps because the shorter tokens identified within the long token happen not to be divisible by two. My thought is to use a greedy search on the tokens (n-grams), attempting to match tokens starting from the longest possible length. For the token vocab, we can use a retrieval tree for prefix matching to speed up the search.
The tokenize function would then be (see the sketch after this comment):
I tested with the prompt
Using a model that supports the EOS token, it correctly passes the conversation to the "Assistant" after the EOS token. A few things to note in my implementation of the tokeniser:
I did a few short tests with my models, and the coherence of the LLM output seems normal to me. I am wondering if this is the right direction to head in? @ggerganov |
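As a rough illustration of the retrieval-tree (trie) idea in the comment above, restricted here to the special tokens (ordinary text would still go through the existing tokenizer): a Python sketch with made-up names, not the actual C++ change.

# Python sketch of a retrieval tree (trie) used for greedy longest-prefix
# matching of special tokens; illustrative only, not the llama.cpp implementation.
from typing import Dict, Optional, Tuple

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.token_id: Optional[int] = None   # set on the node that ends a special token

def build_trie(special_tokens: Dict[str, int]) -> TrieNode:
    root = TrieNode()
    for text, tok_id in special_tokens.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = tok_id
    return root

def match_longest(root: TrieNode, text: str, pos: int) -> Optional[Tuple[int, int]]:
    """Return (token_id, match_length) for the longest special token starting at pos."""
    node, best = root, None
    for i in range(pos, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.token_id is not None:
            best = (node.token_id, i - pos + 1)
    return best

A tokenizer loop would call match_longest at each position, emit the returned id on a hit, and otherwise fall back to the normal per-character/byte handling.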
Looks like the right direction - although I'm not 100% sure, as I don't have a deep understanding of how the tokenizer works.
Tagging @goerch in case they might have some insight. |
Looks like that creates a special token to id map and special cases checking it: https://github.com/ggerganov/llama.cpp/pull/1931/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR2098-R2136 I guess it would be possible to take that approach without fully rewriting the tokenizer. (Also, wasn't the tokenizer initially the greedy type a long time ago and then got changed or am I remembering incorrectly?) |
I took a look. @ggerganov, @klosax: are we talking about ...? If we are talking about special tokens for |
I looked into some of the open issues at #3252. @KerfuffleV2: which models are we testing here? |
I'm staring at the following code in
Are there known cases where |
Does the way it works make sense? Who knows! I think I was the one who added some fallback logic there, but I mostly left it the way it was originally when I was messing with that part. |
I certainly don't. But here is my current understanding (partly based on #3252 (comment)): We try to support two classes of tokenizers:
Both tokenizers use some kind of byte pair encoding to tokenize. Regarding the source of complete models, we have
We invented something like ... On the implementation side, it seems we have tokenizer handling split across a couple of conversion scripts. Here are my most urgent questions:
|
It still has problems supporting special tokens in StarCoder, like |
The lack of handling for special tokens in llm_tokenizer_spm also affects Mistral Orca. In SentencePiece's original implementation there is something called PrefixMatcher, which is initialized with user_defined_symbols (as the special tokens are called there). This PrefixMatcher is then used to split the input into "character sequence" in the BPE tokenizer. I suppose it skips right over the atomic/unsplittable special tokens before the main BPE algorithm begins. The llama.cpp implementation (which is apparently a port of/inspired by the bpe_model.cc from SentencePiece linked above) instead "splits the input into utf8 chars", but without the matcher part, i.e. disregarding the atomic special tokens. |
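In other words, the fix is to keep the special tokens atomic when building the initial symbol sequence. A small Python sketch of that idea (a plain linear scan here; the trie sketched earlier in the thread would do the same matching faster), not the SentencePiece or llama.cpp implementation:

# Sketch: split the input into the initial symbol sequence for BPE, but treat
# the user-defined/special tokens as atomic, unsplittable symbols instead of
# breaking them into individual characters. Illustrative Python only.
from typing import List, Set

def split_into_symbols(text: str, special_tokens: Set[str]) -> List[str]:
    symbols: List[str] = []
    i = 0
    while i < len(text):
        # Longest special token starting at position i, if any.
        match = max(
            (tok for tok in special_tokens if text.startswith(tok, i)),
            key=len,
            default=None,
        )
        if match is not None:
            symbols.append(match)    # kept whole; BPE merges never cross it
            i += len(match)
        else:
            symbols.append(text[i])  # ordinary character (one code point here)
            i += 1
    return symbols

# Example: "<s>" stays a single symbol, everything else is per-character.
print(split_into_symbols("Hi <s>", {"<s>", "</s>"}))
# -> ['H', 'i', ' ', '<s>']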
Thank you, I was thinking the same thing yesterday and couldn't find any confirmation. I already found a way to extract unsplittable tokens directly in the tokenizer without any model/convert.py changes; I'm going to play with this some more. I have a general idea of how to solve this with minor changes to the tokenizer function. I also found a separate approach for tokenizing in O(log N) that solves this problem in the process, by building a tree structure of tokens/"subtokens" and matching downwards instead of upwards (matching full long tokens first). I have to try this to see how consistent it would be with the current tokenizer. |
I assume you are familiar with the trie data structure? I think this is what PrefixMatcher uses, although it may be overkill for finding all occurrences of a couple of substrings in a short body of text. Apart from that, regular expressions come to mind. (I don't know how important it is for the implementation to stay similar to SentencePiece's for comparability.) |
Oh, it has a name :) That's the exact thing I had in mind. I've been using it since uni for text sorting and searching; I just didn't know what it's called in English :) |
Here's a proof of concept I wrote a while ago that uses a trie: #2820 (comment) Hope it helps. It tokenises special characters correctly, but I haven't had the time to add support for UTF-8 chars and edge cases yet |
What do you think about reviving #1931 as suggested in #2820 (comment)? |
I'm working on an experimental solution to this problem because I keep running into it, and I'm not the only one; there are plenty of other issues related to this. I'm confident there's a way to do this without creating dependencies. We technically do not need to rely on huggingface, and I can actually see reliance on it becoming an issue of its own. I'm in the middle of creating some utilities to dump the necessary data to mapped data structures for reuse; think of it like a programmatic ... I already created one for safetensors. My next goal is to handle it for torch models, then for huggingface models. If my intuition is correct, then we shouldn't really need huggingface at all, which would actually be a really good thing. It would also be flexible enough to build on top of and extend as needed. It would create a gateway towards unifying and streamlining all model conversions as well, which is my end goal. This comment is a copy-paste from PR #3633. I'd like to know if this is a path worth pursuing. Let me know. |
Is this fixed by #3538? |
I hope so. Looking for more reports if this works as expected. |
Here we need to start handling special tokens in convert.py:
llama.cpp/convert.py, lines 790 to 800 in e4324cb
An example is shown in convert-llama-7b-pth-to-gguf.py:
llama.cpp/convert-llama-7b-pth-to-gguf.py, lines 186 to 239 in e4324cb