## Receptance-Weighted Key-Value tokenizer

### llama.cpp RWKV tokenizer

To try this out we need an RWKV model in GGUF format, which can be generated with the following commands:

```console
$ cd fundamentals/llama.cpp
$ make checkout-rwkv-model
$ make convert-rwkv-model
```

The tokenize example can be run using:

```console
$ make run-rwkv-tokenize
```

And we can debug on Linux using:

```console
$ gdb --args ./tokenize models/v6-Finch-1B6-HF.gguf
(gdb) br llama-vocab.cpp:1511 if raw_text.compare("ÅWhat is LoRA?") == 0
```

This breaks in the RWKV case of the tokenizer dispatch:
```c++
        case LLAMA_VOCAB_TYPE_RWKV:
            {
                for (const auto & fragment : fragment_buffer) {
                    if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
                        auto raw_text = fragment.raw_text.substr(fragment.offset, fragment.length);

                        llm_tokenizer_rwkv tokenizer(vocab);
                        tokenizer.tokenize(raw_text, output);
                    }
                }
            }
```

The constructor of `llm_tokenizer_rwkv` takes a `llama_vocab` and builds a trie from the tokens in the vocabulary. The vocabulary size is:

```console
(gdb) p vocab.id_to_token.size()
$3 = 65536
```

```c++
    llm_tokenizer_rwkv(const llama_vocab & vocab): vocab(vocab) {
        // RWKV supports arbitrary byte tokens, but the vocab struct only supports string tokens.
        // For now, we decode the vocab here into the lookup we'll use for tokenization.

        // build trie
        for (unsigned int id = 0; id < vocab.id_to_token.size(); ++id) {
            const auto & token = vocab.id_to_token[id];
            const auto data = llama_unescape_rwkv_token(token.text);
            token_matcher.insert((const char *) data.data(), data.size(), id);
        }
    }
```

This unescapes the tokens in the vocabulary and inserts them into the trie.
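
The escaped form is visible in the gdb output further down, where vocabulary entries look like `"\xc3\x85"`. As a rough illustration, here is a minimal sketch of such a helper (my own stand-in, not llama.cpp's actual `llama_unescape_rwkv_token`, and assuming the only escape form is `\xNN`):
```c++
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_unescape_rwkv_token: decode "\xNN" hex
// escapes into raw bytes and copy all other characters through unchanged.
static std::vector<uint8_t> unescape_rwkv_token(const std::string & text) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 3 < text.size() && text[i + 1] == 'x') {
            // parse the two hex digits that follow "\x"
            out.push_back((uint8_t) std::stoul(text.substr(i + 2, 2), nullptr, 16));
            i += 3;
        } else {
            out.push_back((uint8_t) text[i]);
        }
    }
    return out;
}

int main() {
    // "\\xc3\\x85" is the escaped vocabulary entry for token 2467 ("Å") below.
    for (const auto b : unescape_rwkv_token("\\xc3\\x85")) {
        printf("0x%02x ", b);  // prints: 0xc3 0x85
    }
    printf("\n");
}
```
The point is that the trie is keyed on raw bytes, not on the escaped strings stored in the vocab struct.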

Once the trie is built, `tokenize` is called with the raw text and an empty `std::vector<llama_vocab::id>` as the output parameter. It tries to match the longest possible token at each position:

```c++
    void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
        uint32_t position = 0;

        while (position < text.size()) {
            const struct naive_trie * node = token_matcher.traverse(text[position]);
            if (node == NULL) {
                // no matching token found, add unknown token
                output.push_back(vocab.special_unk_id);
                position += 1;
                continue;
            }

            // traverse the trie to find the longest matching token
            uint32_t token_id = 0;
            uint32_t token_length = 0;
            while (node != NULL) {
                if (node->has_value) {
                    token_id = node->value;
                    token_length = position + 1;
                }
                node = node->traverse(text[++position]);
            }

            // add the longest matching token
            output.push_back(token_id);
            position = token_length;
        }
    }
```
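
To make the greedy longest-match behaviour concrete, here is a small self-contained sketch using a toy vocabulary and a simple trie of my own (a stand-in for llama.cpp's `naive_trie`, with explicit bounds checking added):
```c++
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A minimal byte-level trie, loosely modeled on the naive_trie used above.
struct toy_trie {
    bool has_value = false;
    uint32_t value = 0;
    std::vector<std::pair<char, toy_trie>> children;

    // Return the child for byte c, creating it if it does not exist yet.
    toy_trie & child(char c) {
        for (auto & kv : children) {
            if (kv.first == c) return kv.second;
        }
        children.emplace_back(c, toy_trie{});
        return children.back().second;
    }

    void insert(const std::string & token, uint32_t id) {
        toy_trie * node = this;
        for (const char c : token) {
            node = &node->child(c);
        }
        node->has_value = true;
        node->value = id;
    }

    const toy_trie * traverse(char c) const {
        for (const auto & kv : children) {
            if (kv.first == c) return &kv.second;
        }
        return nullptr;
    }
};

int main() {
    // Toy vocabulary: "a", "ab" and "abc" share a prefix, so greedy matching
    // has to keep walking the trie to prefer the longest token.
    toy_trie trie;
    trie.insert("a",   1);
    trie.insert("ab",  2);
    trie.insert("abc", 3);
    trie.insert("d",   4);

    const std::string text = "abcd";
    std::vector<uint32_t> output;

    uint32_t position = 0;
    while (position < text.size()) {
        const toy_trie * node = trie.traverse(text[position]);
        if (node == nullptr) {
            position += 1;  // a real tokenizer would emit the unknown token here
            continue;
        }
        // Walk as deep as possible, remembering the last node that completed
        // a token (the longest match so far).
        uint32_t token_id = 0;
        uint32_t token_length = 0;
        while (node != nullptr) {
            if (node->has_value) {
                token_id = node->value;
                token_length = position + 1;
            }
            ++position;
            node = position < text.size() ? node->traverse(text[position]) : nullptr;
        }
        output.push_back(token_id);
        position = token_length;  // resume right after the longest match
    }

    for (const auto id : output) {
        printf("%u ", id);  // prints: 3 4
    }
    printf("\n");
}
```
With this vocabulary the input `abcd` tokenizes to `{3, 4}`: the inner loop walks past the shorter matches `a` and `ab` because `abc` is also in the trie, which is exactly how `Å` will win over the single byte `0xc3` below.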

Recall that our input text is:

```console
(gdb) p text
$14 = "ÅWhat is LoRA?"
```

And the first byte will be searched for in the trie:

```console
(gdb) p/x text[position]
$13 = (const __gnu_cxx::__alloc_traits<std::allocator<char>, char>::value_type &) @0x7fffffffd240: 0xc3
```

This will be found in the trie and hence is in the vocabulary:

```console
(gdb) p node->value
$16 = 196

(gdb) p vocab.id_to_token[196]
$15 = {text = "\\xc3", score = 0, attr = LLAMA_TOKEN_ATTR_NORMAL}
```
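
The reason a single byte like `0xc3` is in the vocabulary at all is that RWKV supports arbitrary byte tokens, as the comment in the constructor noted. `0xc3` is the first byte of the two-byte UTF-8 encoding of `Å` (U+00C5), which a quick standalone check confirms (nothing llama.cpp specific, and assuming a UTF-8 source and execution character set):
```c++
#include <cstdio>
#include <string>

int main() {
    const std::string s = "Å";  // U+00C5, assuming this file is saved as UTF-8
    for (const char c : s) {
        printf("0x%02x ", (unsigned char) c);  // prints: 0xc3 0x85
    }
    printf("\n");
}
```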

And notice that it will continue searching the children of the current node to see if a longer prefix can be matched. In this case there is a longer token:

```console
(gdb) p vocab.id_to_token[2467]
$18 = {text = "\\xc3\\x85", score = 0, attr = LLAMA_TOKEN_ATTR_NORMAL}
```

This is the token `Å`, which is the longest prefix that can be matched, so this token id will be added to the output vector. The process then continues, and `What` is also a token in the vocabulary and will be added to the output vector:

```console
(gdb) p vocab.id_to_token[24326]
$25 = {text = "What", score = 0, attr = LLAMA_TOKEN_ATTR_NORMAL}
```

The final output vector will look like this:

```console
(gdb) finish
(gdb) p output
$27 = std::vector of length 6, capacity 8 = {2467, 24326, 4600, 3991, 1393, 64}
```

So this is a fairly simple tokenization process if we compare it to the other tokenizers.