In some cases, blingfire models created with the new vocab.txt produce different results. #181

Open
springkim opened this issue Nov 9, 2024 · 0 comments


A blingfire model built to match the settings of Hugging Face's BertTokenizer produces the wrong ids in certain cases.

Run on the same vocab.txt, (HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer each agree with (HF) BertTokenizer on more than 99% of inputs, but blingfire agrees on only 93%.

(vocab.txt has about 30,000 entries.)
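The issue does not say how the percentages were measured. A minimal sketch of one way to compute such a figure, assuming "correct" means the id sequence is identical to the reference tokenizer's output for the same text. `agreement_rate` and the tokenizer callables are hypothetical names, not part of any of the libraries mentioned here.

```python
from typing import Callable, Iterable, List

# Hypothetical type: any function that maps a text to a list of token ids.
Tokenize = Callable[[str], List[int]]


def agreement_rate(reference: Tokenize, candidate: Tokenize,
                   texts: Iterable[str]) -> float:
    """Fraction of texts for which both tokenizers return identical ids."""
    total = 0
    matches = 0
    for text in texts:
        total += 1
        if reference(text) == candidate(text):
            matches += 1
    # An empty corpus yields 0.0 rather than a division error.
    return matches / total if total else 0.0
```

Each real tokenizer would need a thin wrapper that returns a plain list of ids, with special tokens handled the same way on both sides, before being passed in.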

In the example below, the actual vocab.txt has ##ㅋ but no ㅋ, as shown below.

--vocab.txt--
##ㅋ

Because ##ㅋ is a continuation piece, it can only be attached to a preceding character, so all four tokenizers agree on 아ㅋ, as shown below.

| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |

On the other hand, if there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result: it still emits ##ㅋ (29981), while the others map the standalone ㅋ to [UNK] (31997).

| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (HF) BertTokenizerFast | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (BF) bert_custom.bin | 아 ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
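The result the other three tokenizers give matches plain greedy longest-match-first WordPiece applied to each whitespace-separated word. The sketch below is a simplified illustration, not the code of any of these libraries: it skips BERT's basic tokenization (lowercasing, Unicode normalization, punctuation splitting) and the [CLS]/[SEP] wrapping. `wordpiece_ids` and `TOY_VOCAB` are hypothetical names; the ids are the ones reported in the tables above.

```python
from typing import Dict, List


def wordpiece_ids(text: str, vocab: Dict[str, int],
                  unk: str = "[UNK]") -> List[int]:
    """Greedy longest-match-first WordPiece over whitespace-separated words."""
    ids: List[int] = []
    for word in text.split():
        start = 0
        pieces: List[int] = []
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                piece = word[start:end]
                if start > 0:
                    # Only non-initial pieces get the "##" prefix.
                    piece = "##" + piece
                if piece in vocab:
                    match = vocab[piece]
                    break
                end -= 1
            if match is None:
                # No piece fits: the whole word becomes a single [UNK].
                pieces = [vocab[unk]]
                break
            pieces.append(match)
            start = end
        ids.extend(pieces)
    return ids


# Toy vocab using the ids from the issue; "ㅋ" on its own is absent.
TOY_VOCAB = {"아": 21, "##ㅋ": 29981, "[UNK]": 31997}

print(wordpiece_ids("아ㅋ", TOY_VOCAB))   # [21, 29981]
print(wordpiece_ids("아 ㅋ", TOY_VOCAB))  # [21, 31997]
```

In the second call ㅋ starts a new word, so only the unprefixed form is looked up; since vocab.txt has no such entry, the word falls back to [UNK].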

The blingfire settings are shown below.

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2

options.small

OUTPUT = bert_custom.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

wbd.lex.utf8

_include common/bert.common.def.txt

_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]

< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord

#
# BERT specific
#

< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord

_function FnTokWord
_include bert_custom/vocab.falex
_end

Other than that, vocab.falex, wdb.target.txt, ldb.conf.i2w, and options.small were set up exactly as described in the guide.

How can I find out which part is causing the problem?
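One possible way to narrow this down, offered only as a sketch: list every vocab entry that exists in ## form but not as a standalone token, then feed each one to the tokenizers as a separate word. This does not point at a specific configuration file; it only generates more inputs of the kind shown above. `continuation_only_pieces` and `spaced_probes` are hypothetical helper names.

```python
from typing import Iterable, List


def continuation_only_pieces(vocab_lines: Iterable[str]) -> List[str]:
    """Return pieces that appear as '##x' in the vocab but never as plain 'x'."""
    entries = {line.strip() for line in vocab_lines if line.strip()}
    return sorted(
        entry[2:]
        for entry in entries
        if entry.startswith("##") and len(entry) > 2 and entry[2:] not in entries
    )


def spaced_probes(pieces: Iterable[str], prefix: str) -> List[str]:
    """Build test inputs that place each piece after a space, e.g. '아 ㅋ'."""
    return [f"{prefix} {piece}" for piece in pieces]
```

For example, `continuation_only_pieces(open("vocab.txt", encoding="utf-8"))` would yield the candidate pieces, and `spaced_probes(pieces, "아")` the corresponding inputs to compare across tokenizers.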
