When you create a blingfire model based on the settings of Hugging Face's BertTokenizer, it outputs the wrong answer in certain cases. (HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer both produce more than 99% correct answers when run on the same vocab.txt (about 30,000 entries), but blingfire produces only 93% correct answers.
In the example below, the actual vocab.txt contains `##ㅋ` but not `ㅋ`:
```text
--vocab.txt--
##ㅋ
```
In this case, `##ㅋ` must be concatenated with the preceding character, so all tokenizers agree on the input 아ㅋ, as shown below.
| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
On the other hand, if there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result.
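For reference, this is the behavior we expect from HuggingFace-style WordPiece: after whitespace pre-tokenization, a word-initial character can only match a vocab entry *without* the `##` prefix, so a vocab containing `##ㅋ` but not `ㅋ` maps a standalone ㅋ to `[UNK]`. The sketch below is a simplified greedy longest-match WordPiece with a two-entry toy vocab (the real ids such as 31998/29981 come from the actual vocab.txt), not blingfire's implementation:

```python
# Simplified greedy longest-match WordPiece (toy vocab, for illustration only).
def wordpiece(text, vocab, unk="[UNK]"):
    tokens = []
    for word in text.split():            # whitespace pre-tokenization
        start, pieces = 0, []
        while start < len(word):
            end, match = len(word), None
            while start < end:           # try the longest substring first
                piece = word[start:end]
                if start > 0:            # non-initial pieces need the ## prefix
                    piece = "##" + piece
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:            # no piece matched: whole word -> UNK
                pieces = [unk]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens

vocab = {"아", "##ㅋ"}                   # ##ㅋ is present, bare ㅋ is not
print(wordpiece("아ㅋ", vocab))          # ['아', '##ㅋ']: ㅋ continues the word
print(wordpiece("아 ㅋ", vocab))         # ['아', '[UNK]']: word-initial ㅋ has no match
```

If blingfire's generated model drops or mishandles the word-boundary distinction encoded by the `##` prefix, the two inputs above would tokenize identically, which would match the discrepancy reported here.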
The blingfire settings are shown below:
- `ldb.conf.small`
- `options.small`
- `wdb.lex.utf8`
Other than that, we specified `vocab.falex`, `wdb.target.txt`, `ldb.conf.i2w`, and `options.small` exactly as guided. How can I tell which part is causing the problem?