Fixed a bug for whole word masking #200

Closed · wants to merge 1 commit

Conversation


TianhaoFu commented Aug 26, 2021

Hi, nice work.
I found a bug while using your repository.

For the original wwm policy, the implementation first iterates over each word inside a phrase and then, based on a probability drawn per word, replaces the original token with the mask token for some words and with a random token for others. This results in inconsistent replacement rules across the words inside a phrase. The correct approach would be to determine the probability first and only then iterate over each word within the phrase, which ensures that the substitution rule is consistent for every word within a phrase.

:)
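To make the difference concrete, here is a minimal sketch contrasting the two strategies applied to the characters of one whole word. This is not the repository's actual code; the function names (`mask_per_char`, `mask_per_word`) and the toy vocabulary are invented for illustration:

```python
import random

# Toy vocabulary for the 10% "replace with a random token" branch.
VOCAB = ["天", "气", "模", "型", "你", "好"]

def mask_per_char(chars, rng=random):
    # Current behaviour described above: the 80/10/10 draw happens
    # independently for every character, so characters of the same
    # word can end up with different replacement rules.
    out = []
    for ch in chars:
        p = rng.random()
        if p < 0.8:
            out.append("[MASK]")
        elif p < 0.9:
            out.append(ch)                     # keep the original
        else:
            out.append(rng.choice(VOCAB))      # random replacement
    return out

def mask_per_word(chars, rng=random):
    # Proposed behaviour: draw the probability once per word, then
    # apply the same rule to every character inside it.
    p = rng.random()
    if p < 0.8:
        return ["[MASK]"] * len(chars)
    if p < 0.9:
        return list(chars)                     # keep all originals
    return [rng.choice(VOCAB) for _ in chars]  # replace all randomly
```

With `mask_per_char`, a two-character word can come out as, say, `["[MASK]", "气"]`; `mask_per_word` never mixes rules within one word.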



hhou435 (Collaborator) commented Aug 26, 2021

Hello, thank you for reading the code so carefully. The wwm strategy here follows Google's code.
When wwm masks a phrase, it does not apply the same masking method to every word in the phrase; instead, it draws a probability for each word separately and chooses the masking method independently.
For details, see Google's original code:
https://github.com/google-research/bert/blob/master/create_pretraining_data.py#L342
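The relevant loop in that file, paraphrased from memory (consult the link above for the authoritative version; the toy `tokens`, `vocab_words`, and `index_set` values here are made up so the snippet runs on its own), makes the 80/10/10 decision inside the loop over the sub-token positions of one whole word:

```python
import random

rng = random.Random(12345)

tokens = ["[CLS]", "天", "##气", "模", "##型", "[SEP]"]  # toy input
vocab_words = ["天", "气", "模", "型", "你", "好"]
index_set = [3, 4]            # sub-token positions of one whole word
covered_indexes = set()
output_tokens = list(tokens)

# Paraphrased from the masking loop in create_pretraining_data.py.
for index in index_set:
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
        masked_token = "[MASK]"
    else:
        # 10% of the time, keep the original token
        if rng.random() < 0.5:
            masked_token = tokens[index]
        # 10% of the time, replace with a random word
        else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token
```

Because `rng.random()` is drawn separately for each `index`, each sub-token of a whole word receives its own 80/10/10 decision, which is the behaviour described above.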

Embedding (Collaborator) commented

Maybe it is more reasonable to use a consistent substitution rule for each word within a phrase.
Could you explain where this strategy has been used?


TianhaoFu (Author) commented Aug 30, 2021

> Maybe it is more reasonable to use a consistent substitution rule for each word within a phrase.
> Could you explain where this strategy has been used?

Hi, I came to that conclusion after reading the MacBERT paper (where wwm appears for Chinese models): the authors' wwm masks phrase by phrase. I did not confirm this against the source code (MacBERT's source has not been released). I then ran experiments on my own model and found that randomizing word by word is about as effective as masking phrase by phrase.
:)



(attached screenshot: wwm in MacBERT)


@Embedding @hhou435


hhou435 (Collaborator) commented Aug 31, 2021

Hello, I think [M] in the paper indicates that the mask strategy is applied to that word, rather than that the word is replaced with the [MASK] token.
In addition, when implementing the wwm strategy of Chinese-BERT-wwm, the author of MacBERT also kept the method consistent with Google's implementation.
For details, see ymcui/Chinese-BERT-wwm#4
@TianhaoFu

Embedding (Collaborator) commented

Thank you for your suggestion. If it is convenient for you, could you give us your email? We would like to discuss further and seek more code contributions from you. @TianhaoFu

TianhaoFu (Author) commented

> Thank you for your suggestion. If it is convenient for you, could you give us your email? We would like to discuss further and seek more code contributions from you. @TianhaoFu

OK~

TianhaoFu closed this Sep 30, 2021