Fixed a bug for whole word masking #200
Closed
Hi, nice work.
The only issue is a small bug I found while using your repository.
In the current whole word masking (wwm) implementation, the code iterates over each character inside a word and draws a fresh probability for every character, so some characters are replaced with the mask token and others with a random token. As a result, the replacement rule is inconsistent across the characters of a single word. The correct implementation is to draw the probability once per word and then iterate over the characters inside it, so that every character within the word follows the same replacement rule.
:)
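To illustrate the fix described above, here is a minimal sketch, not the repository's actual code. The function name `mask_whole_word`, the `word_span` argument, and the 80/10/10 split between mask, random, and unchanged tokens are assumptions made for the example. The point it shows is that the probability is drawn once per word, before the loop over its tokens.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask token string


def mask_whole_word(tokens, word_span, vocab, rng=random):
    """Apply a single replacement rule to every token of one word.

    tokens    -- list of token strings for the whole sequence
    word_span -- (start, end) indices of the word, end exclusive (hypothetical)
    vocab     -- list of candidate tokens for random replacement
    """
    start, end = word_span
    masked = list(tokens)

    # Draw the probability ONCE per word, before iterating over its tokens.
    # Drawing it inside the loop is what makes the rule inconsistent.
    p = rng.random()

    for i in range(start, end):
        if p < 0.8:            # assumed split: 80% mask token
            masked[i] = MASK_TOKEN
        elif p < 0.9:          # assumed split: 10% random token
            masked[i] = rng.choice(vocab)
        # otherwise (assumed 10%): keep the original token
    return masked
```

With this structure, the tokens of a word are either all replaced by the mask token, all replaced by random tokens, or all left unchanged; a mix of rules inside one word cannot occur.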