
Tokenizer works inconsistently but better for bge zh series model #1

Open
xyzhang626 opened this issue Dec 21, 2023 · 4 comments
xyzhang626 (Owner) commented Dec 21, 2023

Summary

This repo's tokenizer is consistent with the Hugging Face tokenizer in most cases, and for the bge zh series models it is inconsistent but arguably better.

Details

The bge-small-zh-v1.5 tokenizer handles 1) words with capital letters and 2) accented letters poorly. This is likely caused by its normalization settings.

For example, for the input 大家好我是GPT, the HF tokenizer (left column) cannot recognize the uppercase "GPT", but the tokenizer in this repo (right column) can.
[screenshot: HF tokenizer (left) vs. this repo's tokenizer (right) for 大家好我是GPT]

The accented-letter case is similar.

[screenshot: HF tokenizer (left) vs. this repo's tokenizer (right) for an accented example]
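
For reference, here is a minimal sketch of that normalization step (using the tokenizers package; the exact BertNormalizer settings below are assumptions for illustration, not values read from the bge-small-zh-v1.5 config). It shows how lowercasing and accent stripping change the text before WordPiece runs, which is the kind of setting that can explain the mismatch above.

# Hedged sketch: lowercase/strip-accents normalization in isolation.
# The settings are illustrative assumptions, not bge-small-zh-v1.5's actual config.
from tokenizers.normalizers import BertNormalizer

norm = BertNormalizer(lowercase=True, strip_accents=True, handle_chinese_chars=True)
print(norm.normalize_str("大家好我是GPT"))  # "GPT" is lowercased; spaces are inserted around CJK characters
print(norm.normalize_str("café"))  # the accent is stripped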

If you find any more differences between the tokenizer in this repo and the Hugging Face one, please let me know and I will try to fix it.

snowyu commented Dec 22, 2023

Thanks for your kindness.

The problem is that the all-MiniLM model cannot tokenize all Chinese text, e.g. '你好,世界':

100 <--> [UNK]
100 <--> [UNK]
1989 <--> ,
1745 <--> 世
100 <--> [UNK]

Also, the text segmentation of the bge zh series and MiniLM models splits by individual characters, not by words.
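
To reproduce this kind of dump, here is a small sketch (assuming the transformers package is installed; sentence-transformers/all-MiniLM-L6-v2 is used only as a stand-in for "the all-MiniLM model"):

# Hedged sketch: print id <--> token pairs for a Chinese string.
# The model name is an assumed stand-in for "the all-MiniLM model" mentioned above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
for i in tok.encode("你好,世界", add_special_tokens=False):
    print(i, "<-->", tok.convert_ids_to_tokens(i))  # characters missing from the vocab come back as [UNK]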

The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 tokenizer splits by words, but it does not work with embeddings.cpp:

# the conversion and quantization are ok
pushd models
git clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
cd paraphrase-multilingual-MiniLM-L12-v2
wget -O vocab.txt https://huggingface.co/michaelfeil/ct2fast-paraphrase-multilingual-MiniLM-L12-v2/resolve/main/vocabulary.txt?download=true
./run_conversions.sh paraphrase-multilingual-MiniLM-L12-v2
popd

python examples/test_hf_tokenizer.py paraphrase-multilingual-MiniLM-L12-v2
build/bin/test_tokenizer -m models/paraphrase-multilingual-MiniLM-L12-v2/ggml-model-q4_0.bin 
tokenizer test failed: '你好,世界!'
[101, 994, 1322, 100, 6717, 11888, 100, 102, ]
0 -> <s> : 101 -> ▁_
6 ->: 994 -> 你
124084 -> 你好 : 1322 -> 好
4 -> , : 100 -> ▁for
3221 -> 世界 : 6717 -> 世
38 -> ! : 11888 -> 界
2 -> </s> : 100 -> ▁for
2 -> </s> : 102 -> ta

Maybe the Pre-Tokenization is missing?

xyzhang626 (Owner) commented

Hey @snowyu, sorry for the late reply and thanks for letting me know about this.

Pre-tokenization is not missing in this repo, but the strategy seems different from paraphrase-multilingual-MiniLM-L12-v2's. I actually handle Chinese characters the same way as the Hugging Face Rust version, where whitespace is inserted between Chinese characters; see the HF Rust implementation and this repo's implementation. That is exactly why Chinese text is split and tokenized character by character.
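
To illustrate, here is a minimal Python sketch of that whitespace-insertion rule (not the repo's actual C++ code, and it only covers the basic CJK block):

# Hedged sketch of the BERT-style rule described above: pad every CJK ideograph with spaces
# so that each character is later tokenized on its own. Only the basic CJK Unified Ideographs
# block is handled here; real implementations cover more Unicode ranges.
def pad_chinese_chars(text: str) -> str:
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)

print(pad_chinese_chars("你好,世界"))  # each ideograph ends up surrounded by spaces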

I think the differences are caused by different tokenization algorithms. all-MiniLM and bge-small-zh-v1.5 use an algorithm called WordPiece, an important feature of which is the use of subwords when tokenizing words (a Chinese character is treated as a single word). For example, "tokenization" will be tokenized into "token" and "##ization", where the special symbol "##" marks a subword that starts in the middle of a word.
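
A quick way to see that "##" marker in practice (bert-base-uncased is used here just as a convenient WordPiece vocabulary, not one of the models above):

# Hedged sketch: show WordPiece subword splitting with the "##" continuation marker.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # any WordPiece vocab would do
print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']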

However, in the tokenized results from paraphrase-multilingual-MiniLM-L12-v2, I did not find anything similar. I suspect it uses a different tokenization algorithm (not reported in their paper or model card). Since paraphrase-multilingual-MiniLM-L12-v2 is not in a leading position on the MTEB benchmark, tokenizing Chinese by words rather than by single characters might not be necessary, at least from a performance point of view. Anyway, this is an interesting point and I will try to figure it out when I have time.


snowyu commented Dec 25, 2023


snowyu commented Dec 27, 2023

Pre-tokenizers come in many types, and WordPiece is just one part of the pipeline. See the tokenizer.json file in paraphrase-multilingual-MiniLM:

{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      { "type": "WhitespaceSplit" },
      { "type": "Metaspace", "replacement": "▁", ... }
    ]
  },
  ...
}

More details are in the HF documentation: Pre-tokenizers.
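
A small sketch of how to inspect and exercise that pre-tokenizer with the tokenizers package (the local path is an assumption based on the clone commands above):

# Hedged sketch: load tokenizer.json and show what the Sequence(WhitespaceSplit, Metaspace)
# pre-tokenizer produces. The path assumes the model was cloned under models/ as above.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("models/paraphrase-multilingual-MiniLM-L12-v2/tokenizer.json")
print(tok.pre_tokenizer.pre_tokenize_str("你好,世界!"))  # list of (piece, (start, end)) pairs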

The JS code may be clearer, and it is all in one file: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js

snowyu pushed a commit to snowyu/embeddings.cpp that referenced this issue Feb 4, 2024