
Tokenizer works inconsistently but better for bge zh series model #1

Open
xyzhang626 opened this issue Dec 21, 2023 · 4 comments
xyzhang626 (Owner) commented Dec 21, 2023

Summary

This repo's tokenizer is consistent with the Hugging Face tokenizer in most cases, and for the bge zh series models it is inconsistent but arguably better.

Details

The bge-small-zh-v1.5 tokenizer handles 1) words with capital letters and 2) accented letters poorly. This is likely caused by its normalization settings.

For example, for the input 大家好我是GPT, the HF tokenizer (left column) cannot recognize the uppercase "GPT", but the tokenizer in this repo (right column) can.
[screenshot: HF tokenizer (left) vs. this repo's tokenizer (right) for 大家好我是GPT]

The accented-letter case is similar.

[screenshot: HF tokenizer (left) vs. this repo's tokenizer (right) for an accented example]
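
For reference, here is a minimal sketch of that normalization step (using the tokenizers package; the exact BertNormalizer settings below are assumptions for illustration, not values read from the bge-small-zh-v1.5 config). It shows how lowercasing and accent stripping change the text before WordPiece runs, which is the kind of setting that can explain the mismatch above.

# Hedged sketch: lowercase/strip-accents normalization in isolation.
# The settings are illustrative assumptions, not bge-small-zh-v1.5's actual config.
from tokenizers.normalizers import BertNormalizer

norm = BertNormalizer(lowercase=True, strip_accents=True, handle_chinese_chars=True)
print(norm.normalize_str("大家好我是GPT"))  # "GPT" is lowercased; spaces are inserted around CJK characters
print(norm.normalize_str("café"))  # the accent is stripped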

If you find any more differences between the tokenizer in this repo and the Hugging Face one, please let me know and I will try to fix it.

snowyu commented Dec 22, 2023

Thanks for your kindness.

The problem is that the all-MiniLM model cannot tokenize all Chinese text, e.g. '你好,世界':

100 <--> [UNK]
100 <--> [UNK]
1989 <--> ,
1745 <--> 世
100 <--> [UNK]

Also, the text segmentation of the bge zh series and MiniLM models splits by individual characters, not by words.
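
To reproduce this kind of dump, here is a small sketch (assuming the transformers package is installed; sentence-transformers/all-MiniLM-L6-v2 is used only as a stand-in for "the all-MiniLM model"):

# Hedged sketch: print id <--> token pairs for a Chinese string.
# The model name is an assumed stand-in for "the all-MiniLM model" mentioned above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
for i in tok.encode("你好,世界", add_special_tokens=False):
    print(i, "<-->", tok.convert_ids_to_tokens(i))  # characters missing from the vocab come back as [UNK]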

The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 tokenizer splits by words, but it does not work with embeddings.cpp:

# the conversion and quantization are ok
pushd models
git clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
cd paraphrase-multilingual-MiniLM-L12-v2
wget -O vocab.txt https://huggingface.co/michaelfeil/ct2fast-paraphrase-multilingual-MiniLM-L12-v2/resolve/main/vocabulary.txt?download=true
./run_conversions.sh paraphrase-multilingual-MiniLM-L12-v2
popd

python examples/test_hf_tokenizer.py paraphrase-multilingual-MiniLM-L12-v2
build/bin/test_tokenizer -m models/paraphrase-multilingual-MiniLM-L12-v2/ggml-model-q4_0.bin 
tokenizer test failed: '你好,世界!'
[101, 994, 1322, 100, 6717, 11888, 100, 102, ]
0 -> <s> : 101 -> ▁_
6 ->: 994 -> 你
124084 -> 你好 : 1322 -> 好
4 -> , : 100 -> ▁for
3221 -> 世界 : 6717 -> 世
38 -> ! : 11888 -> 界
2 -> </s> : 100 -> ▁for
2 -> </s> : 102 -> ta

Maybe the Pre-Tokenization is missing?

xyzhang626 (Owner) commented

Hey @snowyu, sorry for the late reply and thanks for letting me know about this.

Pre-tokenization is not missing in this repo, but the strategy seems different from paraphrase-multilingual-MiniLM-L12-v2's. I actually handle Chinese characters the same way as the Hugging Face Rust version, where whitespace is inserted between Chinese characters; see the HF Rust implementation and this repo's implementation. That is exactly why Chinese text is split and tokenized character by character.
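
To illustrate, here is a minimal Python sketch of that whitespace-insertion rule (not the repo's actual C++ code, and it only covers the basic CJK block):

# Hedged sketch of the BERT-style rule described above: pad every CJK ideograph with spaces
# so that each character is later tokenized on its own. Only the basic CJK Unified Ideographs
# block is handled here; real implementations cover more Unicode ranges.
def pad_chinese_chars(text: str) -> str:
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)

print(pad_chinese_chars("你好,世界"))  # each ideograph ends up surrounded by spaces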

I think the differences are caused by different tokenization algorithms. all-MiniLM and bge-small-zh-v1.5 use an algorithm called WordPiece, an important feature of which is the use of subwords when tokenizing words (a Chinese character is treated as a single word). For example, "tokenization" will be tokenized into "token" and "##ization", where the special symbol "##" marks a subword that starts in the middle of a word.
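
A quick way to see that "##" marker in practice (bert-base-uncased is used here just as a convenient WordPiece vocabulary, not one of the models above):

# Hedged sketch: show WordPiece subword splitting with the "##" continuation marker.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # any WordPiece vocab would do
print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']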

However, in the tokenized results from paraphrase-multilingual-MiniLM-L12-v2, I did not find anything similar. I suspect it uses a different tokenization algorithm (not reported in their paper or model card). Since paraphrase-multilingual-MiniLM-L12-v2 is not in a leading position on the MTEB benchmark, tokenizing Chinese by words rather than by single characters might not be necessary, at least from a performance point of view. Anyway, this is an interesting point and I will try to figure it out when I have time.


snowyu commented Dec 25, 2023


snowyu commented Dec 27, 2023

Pre-tokenizers come in many types, and WordPiece is just one part of the pipeline. See the tokenizer.json file in paraphrase-multilingual-MiniLM:

{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      { "type": "WhitespaceSplit" },
      { "type": "Metaspace", "replacement": "▁", ... }
    ]
  },
  ...
}

More details are in the HF documentation: Pre-tokenizers.
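
A small sketch of how to inspect and exercise that pre-tokenizer with the tokenizers package (the local path is an assumption based on the clone commands above):

# Hedged sketch: load tokenizer.json and show what the Sequence(WhitespaceSplit, Metaspace)
# pre-tokenizer produces. The path assumes the model was cloned under models/ as above.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("models/paraphrase-multilingual-MiniLM-L12-v2/tokenizer.json")
print(tok.pre_tokenizer.pre_tokenize_str("你好,世界!"))  # list of (piece, (start, end)) pairs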

The JS code may be clearer, and it is all in one file: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js

snowyu pushed a commit to snowyu/embeddings.cpp that referenced this issue Feb 4, 2024