In some cases, blingfire models created with the new vocab.txt produce different results. #181

springkim · 2024-11-09T10:57:30Z

When I build a blingfire model that follows the settings of Hugging Face's BertTokenizer, it produces the wrong output in certain cases.

(HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer both give the correct answer for more than 99% of inputs when run on the same vocab.txt, but blingfire is correct for only 93%.

(vocab.txt has about 30000 entries)
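For anyone trying to reproduce the percentages, here is a minimal sketch of how such an agreement rate could be measured. The function name `agreement_rate` and the two tokenizer callables are hypothetical; each callable is assumed to take a string and return a list of ids, and you would wrap the real tokenizers yourself.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[int]]


def agreement_rate(
    reference: Tokenizer,
    candidate: Tokenizer,
    texts: Iterable[str],
) -> Tuple[float, List[Tuple[str, List[int], List[int]]]]:
    """Compare two tokenizers on the same texts.

    Returns the fraction of texts with identical output, plus the
    mismatching cases as (text, reference_ids, candidate_ids).
    """
    total = 0
    mismatches = []
    for text in texts:
        total += 1
        expected = reference(text)
        actual = candidate(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    # An empty corpus is treated as full agreement (an arbitrary choice).
    rate = 1.0 if total == 0 else (total - len(mismatches)) / total
    return rate, mismatches
```

The returned mismatch list is the useful part: sorting it by text length tends to surface short inputs like the one below first.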

In the example below, the actual vocab.txt contains ##ㅋ but not ㅋ, as shown below.

--vocab.txt--
##ㅋ
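Entries like this one, which exist only in `##` form, can be listed with a few lines of Python. This is a sketch that assumes vocab.txt holds one token per line; the helper name `continuation_only_pieces` is made up for illustration.

```python
from typing import Iterable, List


def continuation_only_pieces(vocab: Iterable[str]) -> List[str]:
    """Return '##x' entries whose bare form 'x' is missing from the vocab."""
    entries = set(vocab)
    return sorted(
        piece
        for piece in entries
        if piece.startswith("##")
        and len(piece) > 2
        and piece[2:] not in entries
    )


# Assumed usage with a one-token-per-line vocab file:
# with open("vocab.txt", encoding="utf-8") as f:
#     print(continuation_only_pieces(line.rstrip("\n") for line in f))
```

Feeding each of these pieces as a standalone word after a space is a quick way to check whether other tokens show the same discrepancy.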

In this case, ##ㅋ can only be attached to a preceding character, so all four tokenizers agree on 아ㅋ, as shown below.

Tokenizer Framework	text	ids	decode
(HF) BertTokenizer	아ㅋ	[31998, 21, 29981, 31999]	[CLS] 아ㅋ [SEP]
(HF) BertTokenizerFast	아ㅋ	[31998, 21, 29981, 31999]	[CLS] 아ㅋ [SEP]
(BF) bert_custom.bin	아ㅋ	[31998, 21, 29981, 31999]	[CLS] 아ㅋ [SEP]
(TF) FastBertTokenizer	아ㅋ	[31998, 21, 29981, 31999]	[CLS] 아ㅋ [SEP]

On the other hand, when there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result.

Tokenizer Framework	text	ids	decode
(HF) BertTokenizer	아 ㅋ	[31998, 21, 31997, 31999]	[CLS] 아 [UNK] [SEP]
(HF) BertTokenizerFast	아 ㅋ	[31998, 21, 31997, 31999]	[CLS] 아 [UNK] [SEP]
(BF) bert_custom.bin	아 ㅋ	[31998, 21, 29981, 31999]	[CLS] 아ㅋ [SEP]
(TF) FastBertTokenizer	아 ㅋ	[31998, 21, 31997, 31999]	[CLS] 아 [UNK] [SEP]
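To make the expected behavior concrete, here is a simplified sketch of greedy longest-match WordPiece. It assumes a toy vocabulary of just `[UNK]`, `아` and `##ㅋ`, and it splits on whitespace only, so it is not the full BertTokenizer pipeline (no lowercasing, punctuation splitting, or special tokens) and it does not model blingfire's internals.

```python
from typing import List, Set

# Toy vocabulary (assumption): ##ㅋ is present, bare ㅋ is not.
VOCAB = {"[UNK]", "아", "##ㅋ"}


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one whitespace-delimited word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Only non-initial pieces get the continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split on whitespace, then apply WordPiece to each word."""
    out = []
    for word in text.split():
        out.extend(wordpiece(word, vocab))
    return out


print(tokenize("아ㅋ", VOCAB))   # ['아', '##ㅋ']
print(tokenize("아 ㅋ", VOCAB))  # ['아', '[UNK]']
```

In this sketch ㅋ after a space starts a new word, so `##ㅋ` is never tried for it and the word falls back to `[UNK]`, which is what the three non-blingfire rows show. The blingfire row is the output one would get if ㅋ were matched as a continuation of 아 despite the space.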

The blingfire settings are shown below.

ldb.conf.small

[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2

options.small

OUTPUT = bert_custom.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

wdb.lex.utf8

_include common/bert.common.def.txt

_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]

< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord

#
# BERT specific
#

< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord

_function FnTokWord
_include bert_custom/vocab.falex
_end

Other than that, we specified vocab.falex, wdb.target.txt, ldb.conf.i2w, and options.small exactly as guided.

How can I find out which part is causing the problem?
