No Language Left Unlocked 🔓

In 2022 Meta released NLLB, a set of multilingual machine translation models with impressive performance. But the model weights were released under a restrictive non-commercial license, making them unusable for most open-source projects. The models also suffer from a limited dictionary, which causes many translations to return unknown tokens.

This repository contains the software behind NLLU, an effort to run NLLB inference at scale and generate a corpus of bitext data that can be used to train new, permissively licensed language models.

Running NLLB inference on millions of sentences is computationally intensive and would take years on a single machine. We designed a simple server architecture that distributes batches of sentences to be translated asynchronously across machines, which can be rented cheaply from providers such as vast.ai or runpod.io.
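
To illustrate the idea, here is a minimal Python sketch of a worker loop against such a batch server. The endpoint names (/checkout, /commit), the payload fields, and the translate() placeholder are assumptions for illustration only, not the actual NLLU API:

import time
import requests

SERVER = "http://localhost:5555"  # hypothetical server address

def translate(sentences, target_lang):
    # Placeholder: a real worker would run NLLB inference here.
    raise NotImplementedError

def worker_loop(dataset, target_lang):
    while True:
        # Ask the server to check out the next pending batch of sentences.
        # NOTE: endpoint and fields are illustrative, not the real NLLU API.
        r = requests.get(f"{SERVER}/checkout", params={"dataset": dataset})
        if r.status_code == 404:
            break  # nothing left to translate
        batch = r.json()
        results = translate(batch["sentences"], target_lang)
        # Commit the results; the server marks the batch as done.
        requests.post(f"{SERVER}/commit", json={
            "dataset": dataset,
            "batch_id": batch["id"],
            "translations": results,
        })
        time.sleep(0.1)

In a scheme like this, a batch that is checked out but never committed (a crashed or stalled worker) can simply be handed out again after a timeout, which is what the --checkout-timeout option described below controls.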

Datasets

Available at: nllu.libretranslate.com

We welcome requests/contributions for adding more datasets and languages! Get in touch.

Usage

Server

We use Node.js for the server.

git clone https://github.com/LibreTranslate/nllu
cd nllu/server
npm i
  • Create a new directory at nllu/server/data/<dataset>
  • Place a monolingual English corpus in nllu/server/data/<dataset>/source.txt, one sentence per line (see the sketch below)
  • Run:
cd nllu/server
node main.js -p 5555 --batch-size 100
Listening on port 5555

The server has persistence built in, so you can restart it without losing state (just don't change --batch-size between restarts).
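
The server expects source.txt to contain exactly one sentence per line. Here is a minimal sketch of preparing such a file from a raw English corpus; the regex split and file paths are naive assumptions for illustration, and a real pipeline would use a proper sentence segmenter:

# Sketch: produce a one-sentence-per-line source.txt from a raw corpus.
# The regex split below is a naive illustration, not a recommended segmenter.
import re

with open("corpus_raw.txt", encoding="utf-8") as f:
    text = f.read()

# Split on sentence-ending punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

with open("data/mydataset/source.txt", "w", encoding="utf-8") as f:
    for s in sentences:
        s = " ".join(s.split())  # collapse newlines and repeated spaces
        if s:
            f.write(s + "\n")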

Client

docker run -ti --rm --gpus=all libretranslate/nllu --server http://<ip>:5555 --dataset <dataset> --target-lang <langcode> --batch-size 4 --split

We recommend tweaking --batch-size to tune translation speed, although in our experience setting this value to 1 is actually fastest. --split reduces memory usage on the GPU by loading only batch-size sentences at a time during translation.

You should tweak the --checkout-timeout option (expressed in seconds) if you expect a client to take longer than 1 hour (the default) to process a batch.
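
For context on what these flags control: the client translates batch-size sentences at a time on the GPU. Below is a minimal sketch of this kind of chunked NLLB inference using the Hugging Face transformers library; the checkpoint name and chunking details are illustrative assumptions, not the actual client code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint; the actual client may load a different NLLB model.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

def translate_chunked(sentences, target_lang="fra_Latn", batch_size=4):
    # Feeding batch_size sentences at a time keeps GPU memory bounded,
    # which is the idea behind the client's --split flag.
    results = []
    for i in range(0, len(sentences), batch_size):
        chunk = sentences[i:i + batch_size]
        inputs = tokenizer(chunk, return_tensors="pt", padding=True).to("cuda")
        output = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
            max_length=256,
        )
        results.extend(tokenizer.batch_decode(output, skip_special_tokens=True))
    return results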

Rebuild Docker Image

docker build -t youruser/nllu .

Download Results

Once the entire dataset is translated, you can download it by visiting:

http://<ip>:5555/download?dataset=<dataset>&lang=<lang>

Or by issuing:

cd nllu/server/data/<dataset>/<lang>
cat *.txt > ../merged.txt
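
Because the corpus is bitext, merged.txt must remain line-aligned with source.txt (the n-th translated line corresponds to the n-th source sentence). A small sanity check in Python, assuming the paths used in the steps above:

# Sanity check: merged translations must be line-aligned with the source.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

src = count_lines("data/mydataset/source.txt")
dst = count_lines("data/mydataset/merged.txt")
assert src == dst, f"line count mismatch: {src} source vs {dst} translated"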

Filter Results

We provide a script to filter the backtranslated data, following the recommendations of the NLLB paper:

python filter.py source.txt merged.txt <lang>
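
As a rough idea of what such filtering involves, here is a hedged sketch of two heuristics commonly applied to backtranslated bitext: an empty-line check and a source/target length-ratio bound. This illustrates the general technique only and is not the actual filter.py logic:

# Illustrative bitext filters (NOT the actual filter.py logic): drop pairs
# with an empty side or a degenerate source/target length ratio.
def keep_pair(src, tgt, max_ratio=3.0):
    src_words, tgt_words = src.split(), tgt.split()
    if not src_words or not tgt_words:
        return False
    ratio = len(src_words) / len(tgt_words)
    return 1.0 / max_ratio <= ratio <= max_ratio

with open("source.txt", encoding="utf-8") as fs, \
     open("merged.txt", encoding="utf-8") as ft, \
     open("filtered.tsv", "w", encoding="utf-8") as fo:
    for src, tgt in zip(fs, ft):
        src, tgt = src.strip(), tgt.strip()
        if keep_pair(src, tgt):
            fo.write(src + "\t" + tgt + "\n")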

License

AGPLv3