Camembert Vocab Issue #1490

Closed
simonefrancia opened this issue Dec 12, 2019 · 12 comments

@simonefrancia
Contributor

Hi @louismartin,
I have a question about the CamemBERT model that you provide there (http://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz).
Why do all the tokens in vocab.txt have a 999 next to them? Is vocab.txt the vocab produced by the SentencePiece training on the large corpus?
Thank you

@louismartin
Contributor

Hi @simonefrancia ,
The 999 is a dummy placeholder to make the SentencePiece vocab work with the fairseq vocab format.
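
For reference, the conversion is essentially "write each piece with a dummy count"; a minimal sketch of it (the output name is just an example, and you may need to handle SentencePiece's special tokens such as <unk> separately):

# piece<TAB>log-prob  ->  piece<SPACE>dummy-count
awk -F'\t' '{ print $1 " 999" }' sentencepiece.bpe.vocab > vocab.txt
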
Thanks

@simonefrancia
Contributor Author

Thanks, so you don't use the temporary vocab created by fairseq preprocessing at LM inference time? Is it only needed for LM training?

@simonefrancia
Contributor Author

OK, I think I understand. The SentencePiece vocab is an input to fairseq-preprocess, so you had to convert the SentencePiece notation to the fairseq notation. Is that right? Thanks

@louismartin
Contributor

Yes exactly!

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

Hi @louismartin ,
I've passed the vocab trained with SentencePiece to fairseq preprocessing, but something strange happens.
This is the command:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers 60

The output of this command:

Vocab Dimension: 32000
Dictionary: 32003 types
train.bpe: 7000000 sents, 167070745 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
Dictionary: 32003 types
valid.bpe: 1500000 sents, 35775360 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
test.bpe: 1500000 sents, 35748293 tokens, 91.6% replaced by <unk>
| Wrote preprocessed data to data-bin

I trained the SentencePiece tokenizer on the entire corpus (sampling a subset of sentences), then converted the SP vocab to the fairseq vocab format, and then split the initial corpus into train, valid and test.
The strange thing is that for train.bpe, valid.bpe and test.bpe, 91.6% of the tokens are replaced by <unk>.
Do you have any suggestions about this?

@myleott
Contributor

myleott commented Dec 17, 2019

Possibly the format of the vocab is off somehow. You can try running fairseq-preprocess again without the --srcdict option; that way it will generate a large, fresh dictionary. You can then adapt the sentencepiece vocab to match the format of that dictionary.

@simonefrancia
Contributor Author

Thanks for your reply @myleott.
But if I don't pass the SentencePiece vocab to fairseq preprocessing, fairseq creates its own vocab and uses it to create the preprocessed data, am I wrong? (Maybe I am missing something.)
Thanks

@myleott
Contributor

myleott commented Dec 17, 2019

fairseq creates its own vocab and uses it to create the preprocessed data

Yes, exactly :) Since you have a 91.6% unknown rate, I suspect your sentencepiece.bpe.vocab is formatted incorrectly somehow. So to figure out the correct format you should let fairseq generate its own vocab (dict.txt). Then compare the generated dict.txt to the sentencepiece.bpe.vocab you created and confirm that the format is correct.
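
A quick way to compare the two by eye (cat -A prints tabs as ^I, so separator differences are easy to spot; dict.txt is written to whatever --destdir points to, data-bin in your case):

head -n 5 data-bin/dict.txt | cat -A
head -n 5 sentencepiece.bpe.vocab | cat -A
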

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

OK, I think the difference is that in sentencepiece.bpe.vocab the separator is \t, while in vocab.txt the separator is a space.
I had already noticed this difference and wrote a pandas script that converts one to the other:

import pandas as pd

# SentencePiece vocab: one "piece<TAB>log-probability" pair per line.
df = pd.read_csv("sentencepiece.bpe.vocab", sep="\t", header=None, index_col=False, quotechar=None, quoting=3, encoding="utf-8")

# Replace the log-probabilities with a dummy count (fairseq ignores the value).
df[1] = 12345
print("Vocab Dimension: " + str(df.shape[0]))

# Write out space-separated "token count" pairs, i.e. the fairseq dict format.
df.to_csv("sentencepiece.bpe.fairseq.vocab", sep=" ", header=False, index=False, encoding="utf-8")

But I also see that fairseq produces a dict like this:

10 100517
1 99999
2 99999
21 68345
28 56706
.....

while sentencepiece is like this:

<unk>	0
▁,	-3.11256
▁.	-3.49813
▁'	-3.68534
▁di	-3.70167
......

So I don't understand what the numbers in the first column of the fairseq vocab are (maybe the BPE codes, but we don't have the corresponding vocab).

@myleott
Contributor

myleott commented Dec 17, 2019

In the fairseq dictionary the first column is the token and the second column is the frequency of that token in the training set, but the actual value doesn't matter; you can just use 12345.
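
So a correctly converted srcdict should look something like this (pieces from your example, dummy counts):

▁, 12345
▁. 12345
▁' 12345
▁di 12345
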

What's interesting is that the fairseq dictionary seems to be based on IDs instead of Pieces. Did you use the scripts/spm_encode.py script? If so, did you use --output_format=piece or --output_format=id?

What does your training file look like? Can you share the first few lines of train.bpe?

$ head train.bpe

@simonefrancia
Contributor Author

simonefrancia commented Dec 17, 2019

Output of $ head train.bpe:

1 1256 700 2
1 1880 919 2388 5028 2
1 47 28 2071 20 15473 56 22 3459 28 345 130 2071 6640 2406 597 1313 3993 31 17291 10 25 1261 3 28 2060 102 1313 6 18320 8637 70 3341 17640 21 2
1 83 31 128 31 2
1 18967 20 953 590 10 3 28 3430 293 11147 10 6611 138 227 1048 20 31770 3193 196 28 2
1 1999 8779 52 5670 71 2190 21 5 28 95 4104 10442 85 754 1493 119 130 1930 446 28 3242 102 1243 462 41 2
1 234 969 7688 20 2979 3113 2
1 16960 6862 6 257 30746 5229 20 257 30746 2
1 83 31 309 2
1 83 31 128 31 28 28 2

My spm_train command is this:

spm_train \
    --input=$model.raw \
    --max_sentence_length=4192 \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --pad_id=-1

And my spm_encode command is this:

spm_encode \
    --model=sentencepiece.bpe.model \
    --extra_options=bos:eos \
    --output_format=id \
    < $model.raw \
    > $model.bpe

OK, I think I understand: my error is that I am using --output_format=id, so I encode my data as token IDs, but I feed fairseq preprocessing the sentencepiece.bpe vocab (which has tokens, not IDs, in its first column, so the dictionary look-up fails in 91.6% of the cases).
So I think that switching to --output_format=piece and adding --srcdict sentencepiece.bpe.vocab should fix the problem. Am I wrong?
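I.e. something like this (same paths as above; just a sketch, with --srcdict pointing at the converted, space-separated vocab, e.g. the sentencepiece.bpe.fairseq.vocab produced by the pandas script):

spm_encode \
    --model=sentencepiece.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < $model.raw \
    > $model.bpe

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.fairseq.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers 60
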
Thank you very much

@myleott
Contributor

myleott commented Dec 17, 2019

Exactly. If you use --output_format=piece when encoding then it should output pieces. Later you can use --srcdict sentencepiece.bpe.vocab and it should be able to look up the tokens and you should have a (much) lower unknown rate.

Please reopen if this is still an issue after doing the above, thanks!

@myleott myleott closed this as completed Dec 17, 2019