The dependencies for this project are managed by Poetry. To install them, run
poetry install
The main requirements are:
- Python 3.10
- PyTorch 2.1
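To sanity-check the environment, a quick version check can be run (illustrative only; the expected values in the comments come from the list above):

```python
# Quick sanity check of the environment: Python and PyTorch versions, plus
# whether a CUDA GPU is visible. Expected values are taken from the list above.
import sys
import torch

print(sys.version)        # expected: 3.10.x
print(torch.__version__)  # expected: 2.1.x
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU visible")
```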
A Dockerfile is provided to run the code in a container. To build the image, run
./build_docker_image.sh
The image name is $HOSTNAME/llm-transformer. To run the container, run
./docker.sh python -m llmt.main --help
This code was developed and tested on an NVIDIA RTX 4090 GPU with 24 GB of memory.
To authenticate with the Hugging Face Hub, run
huggingface-cli login
cp ~/.cache/huggingface/token ./data/
The token is expected to be in the ./data directory.
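As a rough sketch of how the copied token could then be used, for example inside the container (an assumption about the setup, not necessarily how llmt does it):

```python
# Hypothetical sketch: read the copied token from ./data and authenticate with
# the Hugging Face Hub. The path and this flow are assumptions about the setup.
from pathlib import Path
from huggingface_hub import login

login(token=Path("./data/token").read_text().strip())
```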
To download the dataset, run
./docker.sh python -m llmt.main dataset download
and it will be stored under ./data.
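Under the hood this presumably fetches a Hugging Face dataset into ./data; a minimal illustration with the datasets library (the dataset id below is just an example, not necessarily the one this project uses):

```python
# Illustration only: download a Hugging Face dataset into ./data.
# "openai_humaneval" is an example id, not necessarily the project's dataset.
from datasets import load_dataset

ds = load_dataset("openai_humaneval", cache_dir="./data")
print(ds)
```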
To train a model, run
./docker.sh python -m llmt.main train
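For readers unfamiliar with the objective, training a causal language model boils down to next-token prediction; the toy sketch below shows one optimization step (the "model", sizes, and hyperparameters are placeholders, not the project's training loop):

```python
# Toy illustration of one next-token-prediction training step. The "model"
# here is just an embedding plus a linear head, not the project's transformer;
# all sizes and hyperparameters are placeholders.
import torch
from torch import nn

vocab_size, d_model = 32768, 256
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, 128))  # a fake batch of token ids
logits = model(tokens[:, :-1])                   # predict each next token
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())
```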
We use the tokenizer from https://huggingface.co/replit/replit-code-v1-3b
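It can be loaded with the transformers library (a minimal sketch; trust_remote_code is needed because the repository ships a custom tokenizer implementation):

```python
# Load the Replit code tokenizer; trust_remote_code is required because the
# repository provides a custom tokenizer class.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
ids = tokenizer("def add(a, b):\n    return a + b")["input_ids"]
print(len(ids), tokenizer.decode(ids))
```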
Evaluation is still TODO. Relevant benchmark harnesses (a sketch of the evaluation idea follows the list):
- https://github.com/openai/human-eval-infilling
- https://github.com/nuprl/MultiPL-E
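Both harnesses judge functional correctness by executing the generated code against unit tests; a rough sketch of that idea (illustrative only, not this project's evaluation code):

```python
# Illustrative only: run a generated completion together with the benchmark's
# unit tests and report whether they pass. Real harnesses sandbox this step,
# since the generated code is untrusted.
import subprocess
import sys
import tempfile

def passes_tests(prompt: str, completion: str, test_code: str, timeout: float = 10.0) -> bool:
    program = prompt + completion + "\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```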
TODO
- Implement the test function to evaluate the generated code (see the sketch above)
- Use syntax trees for the target languages to strip whitespace that carries no information and may slow down learning
- Use syntax trees to change the variable names (a sketch follows below)
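As a rough illustration of the syntax-tree idea for Python only, the standard-library ast module can rename user variables; other target languages would need their own parsers (e.g. tree-sitter), which is an assumption here, not something the project ships:

```python
# Sketch: rename user-defined variables through the syntax tree rather than by
# text substitution. Python-only; other languages would need their own parsers
# (e.g. tree-sitter), which is an assumption, not part of this project.
import ast
import builtins

class RenameVariables(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if node.id in vars(builtins):
            return node  # leave builtins such as print, len, range untouched
        node.id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        return node

source = "total = price * count\nprint(total)"
print(ast.unparse(RenameVariables().visit(ast.parse(source))))
# -> v0 = v1 * v2
#    print(v0)
```

Re-emitting code with ast.unparse also normalizes formatting, which is one way to address the whitespace item above.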