Camembert Vocab Issue #1490
Hi @louismartin,
I have a question about the CamemBERT model that you provide here: http://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz
Why do all the tokens in `vocab.txt` have a `999` next to them? Is `vocab.txt` the vocab that comes out of the SentencePiece training on the large corpora? Thank you

Comments
Hi @simonefrancia, […]
Thanks, so you don't use the temporary vocab created by fairseq preprocessing at LM inference time? It is only needed for LM training?
Ok, I think I understand. The SentencePiece vocab is an input to fairseq-preprocess, so you had to convert the SentencePiece notation to the fairseq notation. Is that right? Thanks
Yes, exactly!
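A minimal sketch of the conversion being discussed, assuming the standard tab-separated SentencePiece `.vocab` layout; the dummy `12345` frequency mirrors the advice given later in this thread:

```python
# Minimal sketch: SentencePiece .vocab -> fairseq dictionary format.
# A SentencePiece .vocab line is "piece<TAB>log_prob"; a fairseq dict line is
# "token<SPACE>count". fairseq only needs a number in the count column.
with open("sentencepiece.bpe.vocab", encoding="utf-8") as fin, \
     open("sentencepiece.bpe.fairseq.vocab", "w", encoding="utf-8") as fout:
    for line in fin:
        piece = line.rstrip("\n").split("\t")[0]
        fout.write(piece + " 12345\n")  # dummy frequency
```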
Hi @louismartin, this is the command I ran:

```
fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers 60
```

The output of this command:

```
Vocab Dimension: 32000
Dictionary: 32003 types
train.bpe: 7000000 sents, 167070745 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
Dictionary: 32003 types
valid.bpe: 1500000 sents, 35775360 tokens, 91.6% replaced by <unk>
Dictionary: 32003 types
test.bpe: 1500000 sents, 35748293 tokens, 91.6% replaced by <unk>
| Wrote preprocessed data to data-bin
```

I trained a SentencePiece tokenizer on the entire corpus (sampling a subset of sentences), then converted the SentencePiece vocab to the fairseq vocab format, and then split the entire corpus into train, valid, and test sets.
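One way to confirm the mismatch behind the 91.6% `<unk>` rate is to measure how many tokens in `train.bpe` actually occur in the converted vocab; a minimal sketch using the file names from this thread:

```python
# Measure dictionary coverage of the encoded training data.
with open("sentencepiece.bpe.fairseq.vocab", encoding="utf-8") as f:
    vocab = {line.split(" ")[0] for line in f if line.strip()}

total = known = 0
with open("train.bpe", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():
            total += 1
            known += tok in vocab  # True counts as 1

print(f"dictionary coverage: {known / max(total, 1):.1%}")
```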
Possibly the format of the vocab is off somehow. You can try running fairseq-preprocess again without the `--srcdict` option and compare the dictionary it builds to your converted one.
Thanks for your reply @myleott. […]
Yes, exactly :) Since you have a 91.6% unknown rate, I suspect your vocab doesn't match the tokens that actually appear in your data.
Ok, I think the difference is in how I converted the vocab. This is my conversion script:

```python
import pandas as pd

# Read the SentencePiece vocab (tab-separated: piece <TAB> log probability).
# quoting=3 is csv.QUOTE_NONE, so quote characters inside pieces are kept as-is.
df = pd.read_csv("sentencepiece.bpe.vocab", sep="\t", header=None,
                 index_col=False, quotechar=None, quoting=3, encoding="utf-8")
df[1] = 12345  # replace the log probability with a dummy frequency
print("Vocab Dimension: " + str(df.shape[0]))
# Write in the fairseq dictionary format (space-separated: token <SPACE> count).
df.to_csv("sentencepiece.bpe.fairseq.vocab", sep=" ", header=None,
          index=False, encoding="utf-8")
```

But I also see that the first column of the CamemBERT fairseq dictionary contains numbers, while my converted vocab contains pieces. So I don't understand what the numbers in the first column of the fairseq vocab are (maybe the BPE codes, but we don't have the corresponding vocab).
In the fairseq dictionary the first column is the token and the second column is the frequency of the word in the training set, but the actual value doesn't matter, you can just use 12345. What's interesting is that the fairseq dictionary seems to be based on IDs instead of Pieces. Did you use the scripts/spm_encode.py script? If so, did you use `--output_format=piece` or `--output_format=id`? What does your training file look like? Can you share the first few lines of train.bpe?

```
$ head train.bpe
```
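To illustrate why `--output_format` matters here, a small sketch assuming the `sentencepiece` Python package (the model file name is a placeholder):

```python
import sentencepiece as spm

# The same sentence encoded as pieces vs. ids; only pieces will match a
# dictionary whose first column contains subword strings.
sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
text = "Le camembert est un fromage."
print(sp.encode(text, out_type=str))  # pieces, e.g. ['▁Le', '▁camembert', ...]
print(sp.encode(text, out_type=int))  # ids, e.g. [154, 9381, ...]
```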
Output of `head train.bpe`: […]

My […]

And my […]
Ok, I think I understand: my error is that I am using `--output_format=id` in spm_encode, while my converted vocab contains pieces.
Exactly. If you use `--output_format=piece`, the tokens in train.bpe will match the entries in your converted dictionary. Please reopen if this is still an issue after doing the above, thanks!
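For completeness, a sketch of the piece-based encoding step, again assuming the `sentencepiece` Python package; `train.txt` is a hypothetical raw-text input, and fairseq's `scripts/spm_encode.py` with `--output_format=piece` does the equivalent from the command line:

```python
import sentencepiece as spm

# Re-encode the raw training text as pieces so that the tokens in train.bpe
# match the entries of the converted fairseq dictionary.
sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
with open("train.txt", encoding="utf-8") as fin, \
     open("train.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.rstrip("\n"), out_type=str)) + "\n")
```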