A probabilistic generative transformer neural network model for molecular design
This repository contains the datasets and a link to the code for our paper:
GENERATIVE TRANSFORMER LANGUAGE MODELS FOR GENERATIVE AND TINKERING DESIGN OF MOLECULES
Lai Wei, Nihang Fu, Yuqi Song, Qian Wang, and Jianjun Hu
by Machine Learning and Evolution Laboratory, University of South Carolina.
Benchmark datasets from Molecular Sets (MOSES): MOSES
SELFIES tokenizers from: Selfies
The GMTransformer datasets include:
SMILES-atom training dataset (1,584,664 samples)
SMILES-atom validation dataset (176,075 samples)
SELFIES-atom training dataset (1,584,664 samples)
SELFIES-atom validation dataset (176,075 samples)
They can be downloaded here:
The BLM language model code we used is from here, which is based on the PyTorch Lightning framework. It has been tested with PyTorch 1.6.0 and PyTorch Lightning 1.0.7.
Install PyTorch from the PyTorch website according to your Python and CUDA versions:
conda create -n blm
conda activate blm
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
conda install -c conda-forge pytorch-lightning=1.0.7
Or, for an NVIDIA RTX 3090:
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install pytorch-lightning==1.0.7
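After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `report` are hypothetical and not part of the BLM code. Note that the RTX 3090 route installs torch 1.8.1, so a "differs" line for torch is expected in that case.

```python
from importlib import metadata

# Versions this README reports as tested.
TESTED = {"torch": "1.6.0", "pytorch-lightning": "1.0.7"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(tested=TESTED):
    """Build one status line per package comparing installed vs tested."""
    lines = []
    for package, wanted in tested.items():
        found = installed_version(package)
        if found is None:
            status = "not installed"
        elif found.split("+")[0] == wanted:
            # Strip a local suffix such as "+cu111" before comparing.
            status = "matches tested version"
        else:
            status = "differs from tested version " + wanted
        lines.append(f"{package}: {found} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```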
Go to the blank language model (BLM) repository at https://github.com/Varal7/blank_language_model/
Download the repository and unzip it, or clone it as shown below. Install the necessary Python libraries as instructed there.
Then create a GMTransformer folder inside it.
git clone https://github.com/Varal7/blank_language_model.git
cd blank_language_model
mkdir GMTransformer
cd GMTransformer
Download the datasets from the links above into the GMTransformer folder, then unzip SMILE_data.zip and SELFIES_data.zip there.
wget "https://github.com/usccolumbia/GMTransformer/blob/main/SELFIES_data.zip?raw=true" -O SELFIES_data.zip
wget "https://github.com/usccolumbia/GMTransformer/blob/main/SMILE_data.zip?raw=true" -O SMILE_data.zip
unzip SELFIES_data.zip
unzip SMILE_data.zip
After the above, the directory should be:
blank_language_model
└── GMTransformer
    ├── SMILE_data
    │   ├── SMILES_atom_train.txt
    │   └── SMILES_atom_valid.txt
    ├── SELFIES_data
    │   ├── SELFIES_atom_train.txt
    │   └── SELFIES_atom_valid.txt
    └── README.md
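Before training, the layout and sample counts can be sanity-checked. The sketch below assumes each dataset file holds one tokenized molecule per line, which this README does not state explicitly. The expected counts come from the dataset list earlier in this README, and the helper names are hypothetical.

```python
from pathlib import Path

# Expected sample counts, taken from the dataset list in this README.
EXPECTED = {
    "SMILE_data/SMILES_atom_train.txt": 1_584_664,
    "SMILE_data/SMILES_atom_valid.txt": 176_075,
    "SELFIES_data/SELFIES_atom_train.txt": 1_584_664,
    "SELFIES_data/SELFIES_atom_valid.txt": 176_075,
}


def count_samples(path):
    """Count non-empty lines, assuming one tokenized molecule per line."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_datasets(root, expected=EXPECTED):
    """Return {relative path: problem description} for files that look wrong."""
    problems = {}
    for relative, wanted in expected.items():
        path = Path(root) / relative
        if not path.is_file():
            problems[relative] = "missing"
            continue
        found = count_samples(path)
        if found != wanted:
            problems[relative] = f"{found} samples, expected {wanted}"
    return problems


if __name__ == "__main__":
    # Run from the blank_language_model directory.
    print(check_datasets("GMTransformer") or "all dataset files look as expected")
```

An empty result means every file exists and has the expected number of non-empty lines. A count mismatch may only mean the one-molecule-per-line assumption does not hold.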
For example, to train a GMTransformer model on the SMILES_atom dataset:
cd blank_language_model
python train.py --train GMTransformer/SMILE_data/SMILES_atom_train.txt --valid GMTransformer/SMILE_data/SMILES_atom_valid.txt --root_dir checkpoints/SMILES/atom/ \
--vocab_size 100 --max_len 200 --model_type blm --share_emb_prj_weight
Training on the other datasets follows the same pattern as for the SMILES_atom dataset.
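As an illustration of that pattern, the sketch below assembles the `train.py` command line for either representation. It uses only the flags shown in the SMILES example. Reusing the same `--vocab_size` and `--max_len` values for SELFIES is an assumption, and `train_command` is a hypothetical helper, not part of the BLM repository.

```python
import shlex


def train_command(representation, data_dir):
    """Build the train.py argument list for one atom-level dataset.

    `representation` is "SMILES" or "SELFIES"; `data_dir` is the unzipped
    folder name ("SMILE_data" or "SELFIES_data").
    """
    prefix = f"GMTransformer/{data_dir}/{representation}_atom"
    return [
        "python", "train.py",
        "--train", f"{prefix}_train.txt",
        "--valid", f"{prefix}_valid.txt",
        "--root_dir", f"checkpoints/{representation}/atom/",
        # Assumption: the SMILES example's hyperparameters are reused as-is.
        "--vocab_size", "100",
        "--max_len", "200",
        "--model_type", "blm",
        "--share_emb_prj_weight",
    ]


if __name__ == "__main__":
    # Print a shell-ready command for the SELFIES dataset.
    print(shlex.join(train_command("SELFIES", "SELFIES_data")))
```

The printed command can be pasted into a shell opened in the blank_language_model directory.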
For all of the following commands, replace epoch\=???.ckpt with the checkpoint saved during training.
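To see which checkpoint files exist, a small helper can list them. This is a sketch that assumes the `lightning_logs/version_*/checkpoints/` layout implied by the paths in this README; `find_checkpoints` is a hypothetical name.

```python
from pathlib import Path


def find_checkpoints(root_dir):
    """List .ckpt files under root_dir/lightning_logs/version_*/checkpoints,
    oldest first by modification time."""
    pattern = "lightning_logs/version_*/checkpoints/*.ckpt"
    return sorted(Path(root_dir).glob(pattern), key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    found = find_checkpoints("checkpoints/SMILES/atom")
    for path in found:
        print(path)
    if not found:
        print("no checkpoints found under checkpoints/SMILES/atom")
```

The most recently written checkpoint is not necessarily the best one, so pick the file by its validation loss where the filename records it.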
Generate molecules using the trained SMILES_atom model.
python test.py --checkpoint checkpoints/SMILES/atom/lightning_logs/version_0/checkpoints/epoch\=???.ckpt \
--sample 1000 --decode sample --output sample.txt
The output file is located at
checkpoints/SMILES/atom/lightning_logs/version_1/outputs/sample.txt
You can then convert the generated token lists into a SMILES file:
python convert2smiles.py --input checkpoints/SMILES/atom/lightning_logs/version_1/outputs/sample.txt --output output_smiles.txt
For the SELFIES model:
python selfiestoken2smiles.py --input checkpoints/SELFIES/atom/lightning_logs/version_1/outputs/sample.txt --output output2_smiles.txt
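The idea behind the conversion step can be sketched as follows. This is not the repository's `convert2smiles.py`. It assumes each line of `sample.txt` is one molecule written as whitespace-separated tokens, so that joining the tokens recovers the string. The function names are hypothetical.

```python
def tokens_to_string(line):
    """Join whitespace-separated tokens into one string with no separators."""
    return "".join(line.split())


def convert_lines(lines):
    """Convert each non-empty token line; blank lines are skipped."""
    return [tokens_to_string(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Illustrative token lines, not real model output.
    sample = ["C C ( = O ) O", "", "c 1 c c c c c 1"]
    for smiles in convert_lines(sample):
        print(smiles)
```

For SELFIES output, joining the tokens would give a SELFIES string rather than SMILES. A further decoding step is then needed, for example with the `decoder` function of the `selfies` package.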
Download the zipped model file from figshare, put it into the GMTransformer folder, and unzip it.
Then run the following to generate molecules using the GMTransformer-SMILES or GMTransformer-SELFIES model.
python test.py --checkpoint GMTransformer/models/SELFIES-model/checkpoint/blanklm-epoch=2835-val_loss=0.78.ckpt \
--sample 1000 --decode sample --output sample.txt
python test.py --checkpoint GMTransformer/models/SMILES-model/checkpoint/blanklm-epoch=2716-val_loss=0.71.ckpt \
--sample 1000 --decode sample --output sample.txt
After generation, apply the same conversion step as above to convert the sequences into SMILES format.
If you use our work, please cite:
@article{wei2023probabilistic,
title={Probabilistic generative transformer language models for generative design of molecules},
author={Wei, Lai and Fu, Nihang and Song, Yuqi and Wang, Qian and Hu, Jianjun},
journal={Journal of Cheminformatics},
volume={15},
number={1},
pages={88},
year={2023},
publisher={Springer}
}