Tokenizer works inconsistently but better for bge zh series models #1
Comments
Thanks for your kindness. The problem is that the text segmentation for the bge zh series and MiniLM models splits by individual characters, not by words. The quantization step itself works fine:
```sh
pushd models
git clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
cd paraphrase-multilingual-MiniLM-L12-v2
wget -O vocab.txt https://huggingface.co/michaelfeil/ct2fast-paraphrase-multilingual-MiniLM-L12-v2/resolve/main/vocabulary.txt?download=true
cd ..  # back to models/, where run_conversions.sh is expected to live
./run_conversions.sh paraphrase-multilingual-MiniLM-L12-v2
popd
python examples/test_hf_tokenizer.py paraphrase-multilingual-MiniLM-L12-v2
build/bin/test_tokenizer -m models/paraphrase-multilingual-MiniLM-L12-v2/ggml-model-q4_0.bin
```
But the tokenizer test fails on '你好,世界!' ("Hello, world!"):

```
tokenizer test failed: '你好,世界!'
[101, 994, 1322, 100, 6717, 11888, 100, 102, ]
     0 -> <s>   : 101 -> ▁_
     6 -> ▁     : 994 -> 你
124084 -> 你好  : 1322 -> 好
     4 -> ,     : 100 -> ▁for
  3221 -> 世界  : 6717 -> 世
    38 -> !     : 11888 -> 界
     2 -> </s>  : 100 -> ▁for
     2 -> </s>  : 102 -> ta
```

Maybe the pre-tokenization is missing?
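For reference, the huggingface side of this comparison can be reproduced directly. A minimal sketch, assuming the `transformers` package is installed (`examples/test_hf_tokenizer.py` likely does something similar):

```python
# Minimal sketch: encode the failing string with the reference HF tokenizer
# to see which ids/pieces the ggml tokenizer is being compared against.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
text = "你好,世界!"  # "Hello, world!"
ids = tok.encode(text)
print(ids)
print(tok.convert_ids_to_tokens(ids))
```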
hey @snowyu sorry for the late reply, and thanks for letting me know. Pre-tokenization is not missing in this repo, but the strategy seems different. However, in the tokenized results from paraphrase-multilingual-MiniLM-L12-v2 I did not find anything similar, so I suspect it uses a different tokenization algorithm (not reported in their paper or model card).
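For what it's worth, the algorithm is visible on the fast tokenizer object; a sketch (assuming `transformers` with a fast tokenizer backend) to inspect it:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
bt = tok.backend_tokenizer               # underlying `tokenizers.Tokenizer`
print(type(bt.model).__name__)           # tokenization algorithm, e.g. Unigram vs WordPiece
print(type(bt.pre_tokenizer).__name__)   # configured pre-tokenizer, if any
print(type(bt.normalizer).__name__)      # configured normalizer, if any
```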
There are many types of pre-tokenizers; WordPiece is just one part of the pipeline. See the pre_tokenizer section of the tokenizer configuration:

```json
{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace", "replacement": "▁", ...}
    ]
  }
}
```

More details in the HF documentation: Pre-tokenizers. The JS code may be clearer, all in one file: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
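That Sequence can be tried directly with the huggingface `tokenizers` package; a small sketch mirroring the JSON above:

```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),           # split on whitespace only
    pre_tokenizers.Metaspace(replacement="▁"),  # mark word starts with ▁
])
# Pieces and offsets produced before the model (WordPiece, Unigram, ...) runs.
print(pre.pre_tokenize_str("你好,世界! Hello world"))
```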
fix: segfault
Summary
This repo's tokenizer works consistently with huggingface's tokenizer in most cases, and works inconsistently but possibly better for the bge zh series models.
Details
bge-small-zh-v1.5
The huggingface tokenizer is bad at 1) words with capital letters and 2) accented letters, which may be caused by its normalization settings. For example, on the input 大家好我是GPT ("Hello everyone, I am GPT"), the hf tokenizer (left column) cannot recognize the uppercase GPT, but the tokenizer in this repo (right column) can. The accented case is similar.
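A plausible cause is a BERT-style normalizer that lowercases and strips accents before tokenization; the chain below is an assumed example, not bge-small-zh-v1.5's actual configuration:

```python
from tokenizers import normalizers

# Assumed BERT-like normalization chain (illustrative only).
norm = normalizers.Sequence([
    normalizers.NFD(),           # decompose accented characters
    normalizers.Lowercase(),     # "GPT" -> "gpt"
    normalizers.StripAccents(),  # remove combining accents: "é" -> "e"
])
print(norm.normalize_str("大家好我是GPT"))  # 大家好我是gpt
print(norm.normalize_str("café"))          # cafe
```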
If you find any more differences between the tokenizer in this repo and the huggingface one, please let me know and I will try to fix them.