Fixing camembert tokenization #2065

thomwolf · 2019-12-05T12:30:27Z

The original fairseq implementation of Camembert has several duplicate tokens in its dictionary. In particular, there are two <unk> tokens, but only the index of the first <unk> should be used:

import torch
# Load the original fairseq checkpoint and list the first ten dictionary entries
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
[camembert.task.source_dictionary[i] for i in range(10)]
# ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁de', '.']

This PR updates the Camembert tokenizer to fix this behavior and, as a consequence, fixes #2019 and #2020.
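To make the duplication concrete, here is a minimal sketch in plain Python that works only on the ten entries printed above. It does not reproduce the tokenizer change in this PR; the helper names `find_duplicates` and `first_index_map` are hypothetical and exist only for illustration. It shows which tokens repeat and how a lookup table can keep the first index of each one.

```python
from collections import defaultdict

# First ten dictionary entries, copied from the output shown above.
symbols = ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁de', '.']


def find_duplicates(symbols):
    """Map each repeated token to every index at which it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(symbols):
        positions[token].append(index)
    return {token: idxs for token, idxs in positions.items() if len(idxs) > 1}


def first_index_map(symbols):
    """Build a token -> id table that keeps only the first index of each token."""
    token_to_id = {}
    for index, token in enumerate(symbols):
        # setdefault leaves an existing entry untouched, so later duplicates are ignored.
        token_to_id.setdefault(token, index)
    return token_to_id


duplicates = find_duplicates(symbols)
token_to_id = first_index_map(symbols)

print(duplicates)
print(token_to_id['<unk>'])
```

With a first-index table like this, the second <unk> at index 4 is never produced by a lookup, which is the behavior described above.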

thomwolf · 2019-12-05T12:45:41Z

Merging now to fix the xlnet test issue on master at the same time.

julien-c · 2019-12-05T16:20:24Z

Also cc'ing @louismartin on this.

louismartin · 2019-12-06T18:03:42Z

Thanks for fixing that.
This comes from a problem in fairseq where special tokens are added twice when using SentencePiece.
Cross-referencing the fairseq issue: facebookresearch/fairseq#1309
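As a rough illustration of that failure mode, the sketch below uses a toy vocabulary class. `ToyVocab` is hypothetical and is not fairseq's `Dictionary`; the assumption that the SentencePiece vocabulary lists its own <unk>, <s>, </s> entries first is also mine. It only shows how appending special tokens without checking for existing entries yields duplicates, while an add that returns the existing index does not.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocabulary; not fairseq's Dictionary."""

    def __init__(self, dedupe):
        self.dedupe = dedupe
        self.symbols = []   # index -> token
        self.indices = {}   # token -> first index

    def add_symbol(self, token):
        # With dedupe enabled, a token that is already known keeps its index.
        if self.dedupe and token in self.indices:
            return self.indices[token]
        index = len(self.symbols)
        self.symbols.append(token)
        # Lookups always resolve to the first occurrence.
        self.indices.setdefault(token, index)
        return index


# Special tokens registered up front (assumed order, matching the output above).
special_tokens = ['<s>', '<pad>', '</s>', '<unk>']
# Illustrative SentencePiece-style vocabulary that carries its own special tokens.
spm_vocab = ['<unk>', '<s>', '</s>', ',', '▁de', '.']

naive = ToyVocab(dedupe=False)
safe = ToyVocab(dedupe=True)
for token in special_tokens + spm_vocab:
    naive.add_symbol(token)
    safe.add_symbol(token)

print(naive.symbols)
print(safe.symbols)
```

Under these assumptions the naive variant contains each of <s>, </s>, and <unk> twice, while the deduplicating variant keeps one entry per token.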

Fixing camembert tokenization

6c5297a

thomwolf mentioned this pull request Dec 5, 2019

Camenbert length Tokenizer not equal config vocab_size #2020

Closed

fix xlnet test

3268ebd

thomwolf mentioned this pull request Dec 5, 2019

[CamemBert] Tokenizer function add_tokens doesn't work #2019

Closed

thomwolf merged commit af077b1 into master Dec 5, 2019

julien-c deleted the fixing-camembert branch December 5, 2019 16:19

louismartin mentioned this pull request Dec 6, 2019

Improve handling of special tokens in Dictionary facebookresearch/fairseq#1309

Open
