
@ggerganov

Add tokenizer tests + suggest init with UNK tokens

```sh
# get tokenizers
python3 convert-hf-to-gguf-update.py <hf_token>

# generate ggml vocab and tests
python3 convert-hf-to-gguf.py models/tokenizers/t5/ --outfile models/ggml-vocab-t5.gguf --vocab-only

# run the tests
make -j tests
./tests/test-tokenizer-0 models/ggml-vocab-t5.gguf
```

Currently, a few tests are failing:

```
src: '!!!!!!'
res: ' !!!!!!'
tok: 3 17065 55
main : failed test:    '!!!!!!'
main : detokenized to: ' !!!!!!' instead of ' !!!!!!'
main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!',
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

src: ' '
res: '▅'
tok: 2
main : failed test:    ' '
main : detokenized to: '▅' instead of ''
main : expected tokens:
main : got tokens:           2 '▅',
```

@fairydreaming (Owner)

@ggerganov These tokenization test failures are caused by differences between the transformers T5 "slow" tokenizer (T5Tokenizer) and "fast" tokenizer (T5TokenizerFast). My Unigram tokenizer implementation is compatible with the "slow" tokenizer, but the default tokenizer returned by AutoTokenizer.from_pretrained(...) in convert-hf-to-gguf-update.py is T5TokenizerFast. I regenerated the test inputs with T5Tokenizer by passing use_fast=False to AutoTokenizer.from_pretrained(...), and all tests passed (a quick way to verify the difference is sketched after the diff below). Try something like this:

```diff
@@ -141,7 +147,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
         continue  # Skip to the next model if the tokenizer can't be loaded
@@ -299,7 +309,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Failed to load tokenizer for model {name}. Error: {e}")
         continue  # Skip this model and continue with the next one in the loop
```
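
For a quick sanity check of the slow/fast difference outside the conversion script, something along these lines can be used (a minimal sketch; the models/tokenizers/t5 path follows the layout created by convert-hf-to-gguf-update.py, and the slow tokenizer additionally requires the sentencepiece package):

```python
# Minimal sketch: compare the "slow" and "fast" transformers T5 tokenizers
# on one of the failing inputs. Assumes the tokenizer files have already been
# downloaded to models/tokenizers/t5 by convert-hf-to-gguf-update.py.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("models/tokenizers/t5", use_fast=False)  # T5Tokenizer
fast = AutoTokenizer.from_pretrained("models/tokenizers/t5")                  # T5TokenizerFast

text = "!!!!!!"
print("slow:", slow.encode(text))  # ordering matches the llama.cpp Unigram implementation
print("fast:", fast.encode(text))  # may split the run of '!' differently
```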

@ggerganov (Author)

Thank you - fixed

fairydreaming merged commit 7c610fa into fairydreaming:t5-clean-3 on Jul 2, 2024.

fairydreaming pushed a commit that referenced this pull request on Aug 4, 2024:
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum number of
sequences to 1; otherwise it keeps the original behaviour and uses the
result of `max_element()`.
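
A minimal, self-contained sketch of that guard (illustrative only, not the exact patch; `ctx_params_t` is a stand-in for the `n_seq_max` field of `llama_context_params`, and `n_pl` mirrors the vector visible in the backtrace below):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for llama_context_params; only the field relevant here is shown.
struct ctx_params_t { uint32_t n_seq_max; };

int main() {
    std::vector<int> n_pl;   // stays empty when -npl is not passed, as in the crash
    ctx_params_t ctx_params{};

    if (n_pl.empty()) {
        ctx_params.n_seq_max = 1;  // fall back to a single sequence
    } else {
        // original behaviour: size for the largest requested parallel prompt count
        ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
    }

    printf("n_seq_max = %u\n", ctx_params.n_seq_max);  // prints 1 for the empty case
    return 0;
}
```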

This fixes the following crash, observed when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>