Improve handling of special tokens in Dictionary #1309
Comments
Cross-referencing related bugs in HuggingFace Transformers: huggingface/transformers#2065
I think we should go a step further and remove all implicit special tokens, and only use explicit special tokens. One nice way to handle backward compatibility is to add a header line to new dictionary files. What do you think @louismartin?
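Purely as a sketch of that idea (the header string below is hypothetical, not an existing fairseq format), a new-style dictionary file could mark itself explicitly, so that header-less files keep the current implicit behaviour:

```
# fairseq:explicit-specials   <- hypothetical header line; files without it are treated as legacy
<s> 0
<pad> 0
</s> 0
<unk> 0
hello 42
```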
Yes that would definitely solve it.
So sorry to interrupt. At the BPE step of pretraining, is it right that fairseq does not do any special preprocessing for special tokens like "" or ""? For example, the token "" in "A  has done to B" would be split by BPE into separate pieces like "".
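A hedged illustration of this concern, assuming a generic sentencepiece model; the `spm.model` path and the `<sep>` token below are placeholders, not anything taken from fairseq:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.model")  # placeholder path to some trained sentencepiece model

# Unless '<sep>' was registered as a user-defined/control symbol when the model
# was trained, it gets tokenized like ordinary text rather than kept whole:
print(sp.encode_as_pieces("A <sep> has done to B"))
# likely something like ['▁A', '▁<', 'se', 'p', '>', '▁has', '▁done', '▁to', '▁B']
```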
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
https://github.com/pytorch/fairseq/blob/eb68afca0208a040d4e91eceae86f5f22ca24b04/fairseq/data/dictionary.py#L178-L190
When loading a `dict.txt` that already contains special tokens such as `<s>` or `<pad>` (which are added by default in sentencepiece), these tokens appear twice in the fairseq dictionary. They are added once in `Dictionary.__init__()` and a second time from the `dict.txt` file in `Dictionary.add_from_file()`. This causes weird behaviours, e.g. when using the model in https://github.com/huggingface/transformers.

Ideally `Dictionary` would not add the special tokens manually when loading an external `dict.txt` that already contains them (such as in https://github.com/huggingface/transformers). But I am afraid that this can break backward compatibility for people who already trained models with this "duplicated special tokens bug".
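A minimal reproduction sketch of the duplication, assuming a fairseq version affected by this (such as the commit linked above); the `dict.txt` contents below are made up for illustration:

```python
import tempfile

from fairseq.data.dictionary import Dictionary

# A dict.txt that already lists the special tokens, as a sentencepiece export would.
dict_txt = "<unk> 0\n<s> 0\n</s> 0\nhello 42\nworld 17\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(dict_txt)
    path = f.name

d = Dictionary()       # __init__() already registered <s>, <pad>, </s>, <unk>
d.add_from_file(path)  # the same strings are appended again from the file

print(d.symbols)                # '<s>', '</s>', '<unk>' each show up twice
print(d.bos(), d.index("<s>"))  # implicit bos index vs. the duplicate taken from dict.txt
```

After `add_from_file()`, the string `<s>` resolves to the index read from the file, while `d.bos()` and `d.eos()` still return the indices assigned in `__init__()`.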
For instance, in the `fill_mask()` method for roberta, this is what happens: the first token is `5`, the `<s>` that was added as a string and matched to the token from `dict.txt`, and the last token is `2`, corresponding to `dictionary.eos()`.
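Continuing the toy dictionary from the sketch above, a hedged illustration of that mismatch (the exact IDs depend on the `dict.txt`, so a real RoBERTa vocabulary will differ):

```python
ids = d.encode_line("<s> hello world", append_eos=True, add_if_not_exist=False)
print(ids)  # e.g. tensor([5, 7, 8, 2], dtype=torch.int32)

# 5 -> the '<s>' string now resolves to the duplicate entry loaded from dict.txt
# 2 -> append_eos uses d.eos(), i.e. the implicit index assigned in __init__()
print(d.index("<s>"), d.bos())  # e.g. 5 vs. 0
print(d.eos())                  # 2
```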