
@ggerganov

Add tokenizer tests + suggest init with UNK tokens

```sh
# get tokenizers
python3 convert-hf-to-gguf-update.py <hf_token>

# generate ggml vocab and tests
python3 convert-hf-to-gguf.py models/tokenizers/t5/ --outfile models/ggml-vocab-t5.gguf --vocab-only

# run the tests
make -j tests
./tests/test-tokenizer-0 models/ggml-vocab-t5.gguf
```

Currently, a few tests are failing:

```
src: '!!!!!!'
res: ' !!!!!!'
tok: 3 17065 55
main : failed test:    '!!!!!!'
main : detokenized to: ' !!!!!!' instead of ' !!!!!!'
main : expected tokens:      3 ' ',     55 '!',  17065 '!!!!!',
main : got tokens:           3 ' ',  17065 '!!!!!',     55 '!',

src: ' '
res: '▅'
tok: 2
main : failed test:    ' '
main : detokenized to: '▅' instead of ''
main : expected tokens:
main : got tokens:           2 '▅',
```

@fairydreaming (Owner)

@ggerganov These tokenization test failures are caused by differences between the transformers T5 "slow" tokenizer (T5Tokenizer) and "fast" tokenizer (T5TokenizerFast). My Unigram tokenizer implementation is compatible with the "slow" tokenizer, but the default tokenizer returned by AutoTokenizer.from_pretrained(...) in convert-hf-to-gguf-update.py is T5TokenizerFast. I regenerated the test inputs with T5Tokenizer by passing use_fast=False to AutoTokenizer.from_pretrained(...), and all tests passed (a quick way to verify the difference is sketched after the diff below). Try something like this:

```diff
@@ -141,7 +147,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
         continue  # Skip to the next model if the tokenizer can't be loaded
@@ -299,7 +309,10 @@ for model in models:
 
     # create the tokenizer
     try:
-        tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        if name == "t5":
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+        else:
+            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
     except OSError as e:
         logger.error(f"Failed to load tokenizer for model {name}. Error: {e}")
         continue  # Skip this model and continue with the next one in the loop
```
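
For a quick sanity check of the slow/fast difference outside the conversion script, something along these lines can be used (a minimal sketch; the models/tokenizers/t5 path follows the layout created by convert-hf-to-gguf-update.py, and the slow tokenizer additionally requires the sentencepiece package):

```python
# Minimal sketch: compare the "slow" and "fast" transformers T5 tokenizers
# on one of the failing inputs. Assumes the tokenizer files have already been
# downloaded to models/tokenizers/t5 by convert-hf-to-gguf-update.py.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("models/tokenizers/t5", use_fast=False)  # T5Tokenizer
fast = AutoTokenizer.from_pretrained("models/tokenizers/t5")                  # T5TokenizerFast

text = "!!!!!!"
print("slow:", slow.encode(text))  # ordering matches the llama.cpp Unigram implementation
print("fast:", fast.encode(text))  # may split the run of '!' differently
```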

@ggerganov (Author)

Thank you - fixed

fairydreaming merged commit 7c610fa into fairydreaming:t5-clean-3 on Jul 2, 2024.

fairydreaming pushed a commit that referenced this pull request on Aug 4, 2024:
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum number of
sequences to 1; otherwise it keeps the original behaviour and uses the
result of `max_element()`.
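
A minimal, self-contained sketch of that guard (illustrative only, not the exact patch; `ctx_params_t` is a stand-in for the `n_seq_max` field of `llama_context_params`, and `n_pl` mirrors the vector visible in the backtrace below):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for llama_context_params; only the field relevant here is shown.
struct ctx_params_t { uint32_t n_seq_max; };

int main() {
    std::vector<int> n_pl;   // stays empty when -npl is not passed, as in the crash
    ctx_params_t ctx_params{};

    if (n_pl.empty()) {
        ctx_params.n_seq_max = 1;  // fall back to a single sequence
    } else {
        // original behaviour: size for the largest requested parallel prompt count
        ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
    }

    printf("n_seq_max = %u\n", ctx_params.n_seq_max);  // prints 1 for the empty case
    return 0;
}
```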

This fixes the following crash, observed when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: compilade <[email protected]>