
About Sentence Encoders #3

Closed
Fethbita opened this issue Jul 11, 2018 · 5 comments

Comments

@Fethbita commented Jul 11, 2018

The install_models.sh script downloads three files: blstm.ep7.9langs-v1.bpej20k.model.py, ep7.9langs-v1.bpej20k.bin.9xx, and ep7.9langs-v1.bpej20k.codes.9xx. mlenc.py describes bpe_codes as "File with BPE codes (created by learn_bpe.py)", and the research paper refers to it as a "20k joint vocabulary for all the nine languages". I created that file with learn_bpe.py on my own data, as described, but I don't quite understand how to create the other two: hash_table ("File with hash table for binarization.") and model ("File with trained model used for encoding.").
Any idea how I can create hash_table and model? I couldn't find any documentation about them or sample code to train them. Thanks in advance.

@hoschwenk (Contributor)

Hello,
Thanks for your interest in this work.
In the current version of the code, there is no support for training new sentence embeddings on your own data or for other languages. Normally, this should not be necessary, since our experiments have shown that the embeddings seem to be very generic and perform well on several tasks.
We plan to provide code to train new encoders in the future.

To calculate sentence embeddings for arbitrary texts, just use the existing pipeline, e.g. as it is used in tasks/similarity/sim.sh. It should be straightforward: you only need to call the bash functions "Tokenize" and "Embed" on your data (see the sketch below). There is no need to calculate new BPE or binarization vocabularies.
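A minimal sketch of that usage (the argument order for Tokenize and Embed shown here is an assumption on my part; check how tasks/similarity/sim.sh actually calls them):

```bash
#!/bin/bash
# Minimal sketch: embed an arbitrary text file with the pretrained encoder.
# Assumption: Tokenize and Embed take (input file, language, output file);
# verify the real signatures in tasks/similarity/sim.sh before relying on this.

lang="en"
input="my_corpus.txt"              # plain text, one sentence per line

Tokenize "${input}" "${lang}" "${input}.tok"      # tokenize and apply the shipped BPE codes
Embed "${input}.tok" "${lang}" "${input}.embed"   # encode with blstm.ep7.9langs-v1.bpej20k.model.py
```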

Don't hesitate to contact me again if you need further assistance.

@SbstnErhrdt

> We plan to provide code to train new encoders in the future.

Is there anything new regarding this?

Greetings Seb

@olga-gorun

I know that this issue is closed, but I'd like to continue the discussion. This year we had at least two big events that are expected to affect vocabulary and context: BLM and, especially, everything related to COVID-19. Since embeddings essentially encode a word through the contexts it appears in, and the real-world context has changed drastically, retraining on texts available today would be expected not only to add new words but also to change the encodings of existing words. Do you plan to retrain the embeddings, or to give users the ability to do it themselves?

@Fethbita (Author)

I'll reopen the issue. If necessary, it can be closed and locked.

@Fethbita reopened this Oct 15, 2020
@avidale (Contributor) commented Jun 8, 2023

Hi @Fethbita!
Last year, the embeddings were retrained for 200 languages (so-called LASER-2 and LASER-3 embeddings, see https://github.com/facebookresearch/LASER/tree/main/nllb).

Also, there is code for training new LASER models from scratch (https://github.com/facebookresearch/fairseq/tree/nllb/examples/laser) and for distilling them for new languages (https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/laser_distillation).

I hope this satisfies the request both for new embeddings and for the tools to update them yourself.
