Improve handling of special tokens in Dictionary #1309

Open
louismartin opened this issue Oct 26, 2019 · 5 comments · May be fixed by #5329

Comments

@louismartin
Contributor

louismartin commented Oct 26, 2019

https://github.com/pytorch/fairseq/blob/eb68afca0208a040d4e91eceae86f5f22ca24b04/fairseq/data/dictionary.py#L178-L190

When loading a dict.txt that already contains special tokens such as <s> or <pad> (which are added by default by sentencepiece), these tokens end up appearing twice in the fairseq dictionary.
They are added once in Dictionary.__init__() and a second time from the dict.txt file in Dictionary.add_from_file().
This causes weird behaviour, e.g. when using the model in https://github.com/huggingface/transformers.

Ideally, Dictionary would not add the special tokens itself when loading an external dict.txt that already contains them (such as in https://github.com/huggingface/transformers).
But I am afraid that this could break backward compatibility for people who have already trained models with this "duplicated special tokens" bug.

For instance:

>> print([fairseq_model.task.dictionary[i] for i in range(15)])
['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the', ...]

In the fill_mask() method for roberta, this is what happens:

>> tokens = self.task.source_dictionary.encode_line(
       '<s> ' + text_spans_bpe,
       append_eos=True,
       add_if_not_exist=False,
   )
   print(tokens)
tensor([[    5,  1285, 32004,     2]])

Here the first token (5) is the <s> that was prepended as a string and matched against the duplicate entry coming from dict.txt, while the last token (2) corresponds to dictionary.eos().
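
For context, the duplication can be reproduced with the Dictionary API alone. Below is a minimal sketch, assuming a sentencepiece-style dict.txt whose contents are made up for the example; it reflects the fairseq version linked above, and newer versions may raise a "Duplicate word found" error instead of silently duplicating:

# Minimal reproduction sketch (assumes fairseq is installed; dict.txt contents are illustrative).
# Dictionary.__init__() already registers <s>, <pad>, </s> and <unk>, so a dict.txt that lists
# them again ends up with duplicate entries (on the fairseq version referenced in this issue).
from fairseq.data import Dictionary

with open("dict.txt", "w", encoding="utf-8") as f:
    # sentencepiece-style vocab that repeats the special tokens
    for sym in ["<unk>", "<s>", "</s>", ",", "▁the"]:
        f.write(f"{sym} 1\n")

d = Dictionary()              # adds <s>, <pad>, </s>, <unk> at indices 0-3
d.add_from_file("dict.txt")   # appends the same strings again as regular symbols
print([d[i] for i in range(len(d))])
# per this issue: ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the']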

@louismartin
Contributor Author

Cross-referencing related bugs in HuggingFace Transformers: huggingface/transformers#2065

@myleott changed the title from "Duplicate special tokens when loading dictionary from file" to "Improve handling of special tokens in Dictionary" on Dec 16, 2019
@myleott
Contributor

myleott commented Dec 16, 2019

I think we should go a step further and remove all implicit special tokens, and only use explicit special tokens.

One nice way to handle backward compatibility is to add a header line to new dict.txt files indicating the version. Under the new version all special tokens are explicit, but if there is no header then we fall back to the old logic (where extra special tokens are added implicitly).
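
For illustration, here is a rough sketch of that fallback; the header marker "#fairseq:version=2" is hypothetical, not an existing format:

# Rough sketch only; "#fairseq:version=2" is a made-up header marker.
def load_symbols(path):
    symbols = []
    with open(path, encoding="utf-8") as f:
        first = f.readline().rstrip("\n")
        if first.startswith("#fairseq:version="):
            # New-style file: every special token is listed explicitly below.
            pass
        else:
            # Old-style file: keep the implicit specials and treat the first
            # line as a regular "<word> <count>" entry.
            symbols.extend(["<s>", "<pad>", "</s>", "<unk>"])
            if first:
                symbols.append(first.rsplit(" ", 1)[0])
        for line in f:
            symbols.append(line.rstrip("\n").rsplit(" ", 1)[0])
    return symbols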

What do you think @louismartin?

@louismartin
Contributor Author

Yes, that would definitely solve it.
Not sure there's a better way to handle backward compatibility.
Thanks!

@Junpliu

Junpliu commented Mar 24, 2021

Sorry to interrupt. During the BPE step of pretraining, is it right that fairseq does not do any special preprocessing for special tokens like "" or ""? For example, the token "" in "A has done to B" would just be split by BPE into separate "" pieces.
I am not sure whether I understand this correctly. Thank you!

@stale

stale bot commented Jun 28, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jun 28, 2021
@lydianish linked a pull request on Sep 21, 2023 that will close this issue