How to build encoder.json and dict.txt #1186
Comments
What BPE are you using (sentencepiece, fastBPE, something else)?
I am using sentencepiece BPE, @lematt1991. So can I copy-paste encoder.json directly?
You shouldn't need encoder.json at all. Follow these instructions, skip the "Next encode it with the GPT-2 BPE" section, and encode with your sentencepiece BPE instead. The rest should be the same.
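For reference, a minimal sketch of training the sentencepiece model that the spm_encode step below consumes (the input path, model prefix, and vocab size are placeholders, not values from this thread):

```bash
# Minimal sketch of training a sentencepiece BPE model; all paths and sizes are placeholders.
spm_train \
  --input=corpus.raw.txt \
  --model_prefix=sentencepiece.bpe \
  --vocab_size=32000 \
  --model_type=bpe \
  --character_coverage=1.0
# Writes sentencepiece.bpe.model (pass it to spm_encode via --model=...)
# and sentencepiece.bpe.vocab.
```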
But in the preprocessing step there is a reference to *.bpe files. Where do those come from?
The *.bpe files are the names of the BPE-encoded files. You would do something like:

```bash
for SPLIT in train valid test; do \
  cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done
```

And then:

```bash
fairseq-preprocess \
  --only-source \
  --trainpref wikitext-103-raw/wiki.train.bpe \
  --validpref wikitext-103-raw/wiki.valid.bpe \
  --testpref wikitext-103-raw/wiki.test.bpe \
  --destdir data-bin/wikitext-103 \
  --workers 60
```

By not specifying --srcdict, fairseq-preprocess will build the dictionary (dict.txt) for you from the training data.
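For anyone wondering what the generated dictionary looks like: dict.txt is a plain-text file with one "token count" pair per line. A small sketch of inspecting it (the tokens and counts shown here are made up):

```bash
$ head -n 3 data-bin/wikitext-103/dict.txt
▁the 1104937
▁of 598231
▁and 512004
```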
Does this solve your problem? If so, do you mind closing this issue? Thanks!
Closing due to inactivity.
What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a general dict.txt that is valid for every data-bin, instead of creating a new one for each of them, which would also cause problems?
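One possible approach (a sketch only, not confirmed in this thread; the corpus names, destdir names, and worker counts are placeholders, and the splits are assumed to be already BPE-encoded): build the dictionary once over the union of all training data, then pass it to every fairseq-preprocess run via --srcdict so all data-bins share the same dict.txt:

```bash
# 1. Build one dictionary over the union of all training corpora (names are hypothetical).
cat corpus1.train.bpe corpus2.train.bpe corpus3.train.bpe > all.train.bpe
fairseq-preprocess --only-source \
  --trainpref all.train.bpe \
  --destdir data-bin-shared-dict \
  --workers 16

# 2. Reuse that dictionary for every data-bin so they all get an identical dict.txt.
for i in 1 2 3; do
  fairseq-preprocess --only-source \
    --srcdict data-bin-shared-dict/dict.txt \
    --trainpref corpus${i}.train.bpe \
    --validpref corpus${i}.valid.bpe \
    --destdir data-bin${i} \
    --workers 16
done
```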
Hi @lematt1991, if you want to use a specific dictionary, you have to create dict.txt by hand, right? As far as I know, fairseq-preprocess generates the dictionary by looking at the unique tokens appearing in the training data, but if some tokens in your original dictionary don't appear in the train set (e.g., placeholder tokens you want to reserve, or special tokens you may use in future fine-tuning but not during pre-training), they will not be added to dict.txt, and there is no argument to add them. In that case, would one have to manually append them to dict.txt with a frequency of 0?
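In case it helps, a sketch of that manual workaround (the token names and the data-bin path are placeholders; dict.txt uses one "token count" pair per line, and a count of 0 is accepted):

```bash
# Append reserved/special tokens that never occur in the training data, with a count of 0.
# Token names and the dict.txt path below are placeholders.
for tok in "[RESERVED0]" "[RESERVED1]"; do
  echo "${tok} 0" >> data-bin/wikitext-103/dict.txt
done
```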
I am training RoBERTa on a different language. I found how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.