Camembert Vocab Issue #1490

Closed
simonefrancia opened this issue Dec 12, 2019 · 12 comments

@simonefrancia
Contributor

Hi @louismartin,
I have a question about the CamemBERT model that you provide there (http://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz).
Why do all the tokens in vocab.txt have a 999 next to them? Is vocab.txt the vocab produced by the SentencePiece training on the large corpus?
Thank you

@louismartin
Contributor

Hi @simonefrancia ,
The 999 is a dummy placeholder to make the SentencePiece vocab work with the fairseq vocab format.
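
For reference, the conversion is essentially "write each piece with a dummy count"; a minimal sketch of it (the output name is just an example, and you may need to handle SentencePiece's special tokens such as <unk> separately):

# piece<TAB>log-prob  ->  piece<SPACE>dummy-count
awk -F'\t' '{ print $1 " 999" }' sentencepiece.bpe.vocab > vocab.txt
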
Thanks

@simonefrancia
Contributor Author

Thanks, so you don't use the temporary vocab created by fairseq preprocessing at LM inference time? Is it only needed for LM training?

@simonefrancia
Contributor Author

OK, I think I understand. The SentencePiece vocab is an input to fairseq-preprocess, so you had to convert the SentencePiece notation to the fairseq notation. Is that right? Thanks

@louismartin
Contributor

Yes exactly!

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

Hi @louismartin ,
I've passed the vocab trained with SentencePiece to fairseq preprocessing, but something strange happens.
This is the command:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers 60

The output of this command:

Vocab Dimension: 32000
Dictionary: 32003 types
train.bpe: 7000000 sents, 167070745 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
Dictionary: 32003 types
valid.bpe: 1500000 sents, 35775360 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
test.bpe: 1500000 sents, 35748293 tokens, 91.6% replaced by <unk>
| Wrote preprocessed data to data-bin

I trained the SentencePiece tokenizer on the entire corpus (sampling a subset of sentences), then converted the SP vocab to the fairseq vocab format, and then split the initial corpus into train, valid and test.
The strange thing is that for train.bpe, valid.bpe and test.bpe, 91.6% of the tokens are replaced by <unk>.
Do you have any suggestions about this?

@myleott
Contributor

myleott commented Dec 17, 2019

Possibly the format of the vocab is off somehow. You can try running fairseq-preprocess again without the --srcdict option; that way it will generate a large, fresh dictionary. You can then adapt the sentencepiece vocab to match the format of that dictionary.

@simonefrancia
Contributor Author

Thanks for your reply @myleott.
But if I don't pass the SentencePiece vocab to fairseq preprocessing, fairseq creates its own vocab and uses it to create the preprocessed data, am I wrong? (Maybe I am missing something.)
Thanks

@myleott
Contributor

myleott commented Dec 17, 2019

fairseq creates its own vocab and uses it to create the preprocessed data

Yes, exactly :) Since you have a 91.6% unknown rate, I suspect your sentencepiece.bpe.vocab is formatted incorrectly somehow. So to figure out the correct format you should let fairseq generate its own vocab (dict.txt). Then compare the generated dict.txt to the sentencepiece.bpe.vocab you created and confirm that the format is correct.
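
A quick way to compare the two by eye (cat -A prints tabs as ^I, so separator differences are easy to spot; dict.txt is written to whatever --destdir points to, data-bin in your case):

head -n 5 data-bin/dict.txt | cat -A
head -n 5 sentencepiece.bpe.vocab | cat -A
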

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

OK, I think the difference is that in sentencepiece.bpe.vocab the separator is \t, while in vocab.txt the separator is a space.
I had already noticed this difference and wrote a pandas script that converts one to the other:

import pandas as pd

# SentencePiece vocab: one "piece<TAB>log-probability" pair per line.
df = pd.read_csv("sentencepiece.bpe.vocab", sep="\t", header=None, index_col=False, quotechar=None, quoting=3, encoding="utf-8")

# Replace the log-probabilities with a dummy count (fairseq ignores the value).
df[1] = 12345
print("Vocab Dimension: " + str(df.shape[0]))

# Write out space-separated "token count" pairs, i.e. the fairseq dict format.
df.to_csv("sentencepiece.bpe.fairseq.vocab", sep=" ", header=False, index=False, encoding="utf-8")

But I also see that fairseq produces a dict like this:

10 100517
1 99999
2 99999
21 68345
28 56706
.....

while sentencepiece is like this:

<unk>	0
▁,	-3.11256
▁.	-3.49813
▁'	-3.68534
▁di	-3.70167
......

So I don't understand what the numbers in the first column of the fairseq vocab are (maybe the BPE codes, but we don't have the corresponding vocab).

@myleott
Contributor

myleott commented Dec 17, 2019

In the fairseq dictionary the first column is the token and the second column is the frequency of that token in the training set, but the actual value doesn't matter; you can just use 12345.
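
So a correctly converted srcdict should look something like this (pieces from your example, dummy counts):

▁, 12345
▁. 12345
▁' 12345
▁di 12345
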

What's interesting is that the fairseq dictionary seems to be based on IDs instead of Pieces. Did you use the scripts/spm_encode.py script? If so, did you use --output_format=piece or --output_format=id?

What does your training file look like? Can you share the first few lines of train.bpe?

$ head train.bpe

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

Output of $ head train.bpe:

1 1256 700 2
1 1880 919 2388 5028 2
1 47 28 2071 20 15473 56 22 3459 28 345 130 2071 6640 2406 597 1313 3993 31 17291 10 25 1261 3 28 2060 102 1313 6 18320 8637 70 3341 17640 21 2
1 83 31 128 31 2
1 18967 20 953 590 10 3 28 3430 293 11147 10 6611 138 227 1048 20 31770 3193 196 28 2
1 1999 8779 52 5670 71 2190 21 5 28 95 4104 10442 85 754 1493 119 130 1930 446 28 3242 102 1243 462 41 2
1 234 969 7688 20 2979 3113 2
1 16960 6862 6 257 30746 5229 20 257 30746 2
1 83 31 309 2
1 83 31 128 31 28 28 2

My spm_train command is this:

spm_train \
    --input=$model.raw \
    --max_sentence_length=4192 \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --pad_id=-1

And my spm_encode command is this:

spm_encode \
    --model=sentencepiece.bpe.model \
    --extra_options=bos:eos \
    --output_format=id \
    < $model.raw \
    > $model.bpe

OK, I think I understand: my error is that I am using --output_format=id, so I encode my data as token IDs, but I feed fairseq preprocessing the sentencepiece.bpe vocab (which has tokens, not IDs, in its first column, so the dictionary look-up fails in 91.6% of the cases).
So I think that switching to --output_format=piece and adding --srcdict sentencepiece.bpe.vocab should fix the problem. Am I wrong?
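I.e. something like this (same paths as above; just a sketch, with --srcdict pointing at the converted, space-separated vocab, e.g. the sentencepiece.bpe.fairseq.vocab produced by the pandas script):

spm_encode \
    --model=sentencepiece.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < $model.raw \
    > $model.bpe

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.fairseq.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers 60
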
Thank you very much

@myleott
Contributor

myleott commented Dec 17, 2019

Exactly. If you use --output_format=piece when encoding then it should output pieces. Later you can use --srcdict sentencepiece.bpe.vocab and it should be able to look up the tokens and you should have a (much) lower unknown rate.

Please reopen if this is still an issue after doing the above, thanks!

@myleott myleott closed this as completed Dec 17, 2019