Fixed a bug for whole word masking #200

Closed · wants to merge 1 commit

Conversation


TianhaoFu commented Aug 26, 2021

Hi, nice work.
I found a bug while using your repository.

For the original wwm policy, the implementation first iterates over each word inside a phrase and then, based on a probability drawn per word, replaces the original token with the mask token for some words and with a random token for others. This results in inconsistent replacement rules across the words inside a phrase. The correct approach would be to determine the probability first and only then iterate over each word within the phrase, which ensures that the substitution rule is consistent for every word within a phrase.

:)
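To make the difference concrete, here is a minimal sketch contrasting the two strategies applied to the characters of one whole word. This is not the repository's actual code; the function names (`mask_per_char`, `mask_per_word`) and the toy vocabulary are invented for illustration:

```python
import random

# Toy vocabulary for the 10% "replace with a random token" branch.
VOCAB = ["天", "气", "模", "型", "你", "好"]

def mask_per_char(chars, rng=random):
    # Current behaviour described above: the 80/10/10 draw happens
    # independently for every character, so characters of the same
    # word can end up with different replacement rules.
    out = []
    for ch in chars:
        p = rng.random()
        if p < 0.8:
            out.append("[MASK]")
        elif p < 0.9:
            out.append(ch)                     # keep the original
        else:
            out.append(rng.choice(VOCAB))      # random replacement
    return out

def mask_per_word(chars, rng=random):
    # Proposed behaviour: draw the probability once per word, then
    # apply the same rule to every character inside it.
    p = rng.random()
    if p < 0.8:
        return ["[MASK]"] * len(chars)
    if p < 0.9:
        return list(chars)                     # keep all originals
    return [rng.choice(VOCAB) for _ in chars]  # replace all randomly
```

With `mask_per_char`, a two-character word can come out as, say, `["[MASK]", "气"]`; `mask_per_word` never mixes rules within one word.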



hhou435 (Collaborator) commented Aug 26, 2021

Hello, thank you for reading the code so carefully. The wwm strategy here follows Google's code.
When wwm masks a phrase, it does not apply the same masking method to every word in the phrase; instead, it draws a probability for each word separately and chooses the masking method independently.
For details, see Google's original code:
https://github.com/google-research/bert/blob/master/create_pretraining_data.py#L342
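The relevant loop in that file, paraphrased from memory (consult the link above for the authoritative version; the toy `tokens`, `vocab_words`, and `index_set` values here are made up so the snippet runs on its own), makes the 80/10/10 decision inside the loop over the sub-token positions of one whole word:

```python
import random

rng = random.Random(12345)

tokens = ["[CLS]", "天", "##气", "模", "##型", "[SEP]"]  # toy input
vocab_words = ["天", "气", "模", "型", "你", "好"]
index_set = [3, 4]            # sub-token positions of one whole word
covered_indexes = set()
output_tokens = list(tokens)

# Paraphrased from the masking loop in create_pretraining_data.py.
for index in index_set:
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
        masked_token = "[MASK]"
    else:
        # 10% of the time, keep the original token
        if rng.random() < 0.5:
            masked_token = tokens[index]
        # 10% of the time, replace with a random word
        else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token
```

Because `rng.random()` is drawn separately for each `index`, each sub-token of a whole word receives its own 80/10/10 decision, which is the behaviour described above.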

Embedding (Collaborator) commented

Maybe it is more reasonable to use a consistent substitution rule for each word within a phrase.
Could you explain where this strategy has been used?


TianhaoFu (Author) commented Aug 30, 2021

> Maybe it is more reasonable to use a consistent substitution rule for each word within a phrase.
> Could you explain where this strategy has been used?

Hi, I came to that conclusion after reading the MacBERT paper (where wwm appears for Chinese models): the authors' wwm masks phrase by phrase. I did not confirm this against the source code (MacBERT's source has not been released). I then ran experiments on my own model and found that randomizing word by word is about as effective as masking phrase by phrase.
:)



(attached screenshot: wwm in MacBERT)


@Embedding @hhou435


hhou435 (Collaborator) commented Aug 31, 2021

Hello, I think [M] in the paper indicates that the mask strategy is applied to that word, rather than that the word is replaced with the [MASK] token.
In addition, when implementing the wwm strategy of Chinese-BERT-wwm, the author of MacBERT also kept the method consistent with Google's implementation.
For details, see ymcui/Chinese-BERT-wwm#4
@TianhaoFu

Embedding (Collaborator) commented

Thank you for your suggestion. If it is convenient for you, could you give us your email? We would like to discuss further and seek more code contributions from you. @TianhaoFu

TianhaoFu (Author) commented

> Thank you for your suggestion. If it is convenient for you, could you give us your email? We would like to discuss further and seek more code contributions from you. @TianhaoFu

OK~

TianhaoFu closed this Sep 30, 2021