Improve handling of special tokens in Dictionary #1309

Open
louismartin opened this issue Oct 26, 2019 · 5 comments · May be fixed by #5329

Comments

@louismartin
Contributor

louismartin commented Oct 26, 2019

https://github.com/pytorch/fairseq/blob/eb68afca0208a040d4e91eceae86f5f22ca24b04/fairseq/data/dictionary.py#L178-L190

When loading a dict.txt that already contains special tokens such as <s> or <pad> (which are added by default by sentencepiece), these tokens end up appearing twice in the fairseq dictionary.
They are added once in Dictionary.__init__() and a second time from the dict.txt file in Dictionary.add_from_file().
This causes weird behaviour, e.g. when using the model in https://github.com/huggingface/transformers.

Ideally, Dictionary would not add the special tokens itself when loading an external dict.txt that already contains them (such as in https://github.com/huggingface/transformers).
But I am afraid that this could break backward compatibility for people who have already trained models with this "duplicated special tokens" bug.

For instance:

>> print([fairseq_model.task.dictionary[i] for i in range(15)])
['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the', ...]

In the fill_mask() method for roberta, this is what happens:

>> tokens = self.task.source_dictionary.encode_line(
       '<s> ' + text_spans_bpe,
       append_eos=True,
       add_if_not_exist=False,
   )
   print(tokens)
tensor([[    5,  1285, 32004,     2]])

Here the first token (5) is the <s> that was prepended as a string and matched against the duplicate entry coming from dict.txt, while the last token (2) corresponds to dictionary.eos().
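
For context, the duplication can be reproduced with the Dictionary API alone. Below is a minimal sketch, assuming a sentencepiece-style dict.txt whose contents are made up for the example; it reflects the fairseq version linked above, and newer versions may raise a "Duplicate word found" error instead of silently duplicating:

# Minimal reproduction sketch (assumes fairseq is installed; dict.txt contents are illustrative).
# Dictionary.__init__() already registers <s>, <pad>, </s> and <unk>, so a dict.txt that lists
# them again ends up with duplicate entries (on the fairseq version referenced in this issue).
from fairseq.data import Dictionary

with open("dict.txt", "w", encoding="utf-8") as f:
    # sentencepiece-style vocab that repeats the special tokens
    for sym in ["<unk>", "<s>", "</s>", ",", "▁the"]:
        f.write(f"{sym} 1\n")

d = Dictionary()              # adds <s>, <pad>, </s>, <unk> at indices 0-3
d.add_from_file("dict.txt")   # appends the same strings again as regular symbols
print([d[i] for i in range(len(d))])
# per this issue: ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the']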

@louismartin
Contributor Author

Cross-referencing related bugs in HuggingFace Transformers: huggingface/transformers#2065

@myleott changed the title from "Duplicate special tokens when loading dictionary from file" to "Improve handling of special tokens in Dictionary" on Dec 16, 2019
@myleott
Contributor

myleott commented Dec 16, 2019

I think we should go a step further and remove all implicit special tokens, and only use explicit special tokens.

One nice way to handle backward compatibility is to add a header line to new dict.txt files indicating the version. Under the new version all special tokens are explicit, but if there is no header then we fall back to the old logic (where extra special tokens are added implicitly).
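
For illustration, here is a rough sketch of that fallback; the header marker "#fairseq:version=2" is hypothetical, not an existing format:

# Rough sketch only; "#fairseq:version=2" is a made-up header marker.
def load_symbols(path):
    symbols = []
    with open(path, encoding="utf-8") as f:
        first = f.readline().rstrip("\n")
        if first.startswith("#fairseq:version="):
            # New-style file: every special token is listed explicitly below.
            pass
        else:
            # Old-style file: keep the implicit specials and treat the first
            # line as a regular "<word> <count>" entry.
            symbols.extend(["<s>", "<pad>", "</s>", "<unk>"])
            if first:
                symbols.append(first.rsplit(" ", 1)[0])
        for line in f:
            symbols.append(line.rstrip("\n").rsplit(" ", 1)[0])
    return symbols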

What do you think @louismartin?

@louismartin
Contributor Author

Yes, that would definitely solve it.
Not sure there's a better way to handle backward compatibility.
Thanks!

@Junpliu

Junpliu commented Mar 24, 2021

Sorry to interrupt. During the BPE step of pretraining, is it right that fairseq does not do any special preprocessing for special tokens like "" or ""? For example, the token "" in "A has done to B" would just be split by BPE into separate "" pieces.
I am not sure whether I understand this correctly. Thank you!

@stale

stale bot commented Jun 28, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jun 28, 2021
@lydianish linked a pull request on Sep 21, 2023 that will close this issue