High level goals

This repository is dedicated to developing (large) language models and language/structure models in the bio-chem space.

Further details can be found here

Bio-LM PubChem Selfies

We are training an ELECTRA-style model on the PubChem dataset, with molecules represented as SELFIES. SELFIES is a string representation of molecules that builds on SMILES but is more robust: every syntactically valid SELFIES string decodes to a valid molecule. More info about SELFIES can be found here.
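To illustrate what a SELFIES string looks like to a language model, here is a minimal sketch that splits one into its bracketed symbols and builds a toy vocabulary. It uses only the standard library; the function names, the special tokens, and the example string (benzene) are illustrative assumptions, not this repository's tokenizer.

```python
import re

# Example SELFIES string for benzene, used only for illustration.
BENZENE = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"

# A SELFIES symbol is a bracketed token; "." separates disconnected fragments.
_SYMBOL = re.compile(r"\[[^\[\]]+\]|\.")


def split_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into symbols, rejecting leftover characters."""
    symbols = _SYMBOL.findall(selfies)
    if "".join(symbols) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return symbols


def build_vocab(strings: list[str]) -> dict[str, int]:
    """Map each distinct symbol to an integer id (hypothetical special tokens first)."""
    vocab = {"<pad>": 0, "<mask>": 1}
    for s in strings:
        for symbol in split_symbols(s):
            vocab.setdefault(symbol, len(vocab))
    return vocab


if __name__ == "__main__":
    print(split_symbols(BENZENE))
    print(build_vocab([BENZENE]))
```

Because each symbol is a self-delimiting bracketed token, symbol-level tokenization needs no learned subword vocabulary, which is one reason SELFIES is convenient for masked or replaced-token pretraining.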

We have released the dataset on Hugging Face Datasets; it contains ~110M compounds in total.

We will perform a hyperparameter search using Maximal Update Parameterization (muP): hyperparameters are tuned on a small model and then transferred to a larger one. To launch a sweep on the cluster, run the following, replacing N with the number of array tasks to launch:

sbatch --array=1-N mup_train.sh
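As a rough sketch of how one array task could pick its hyperparameters, the snippet below maps the task index to one point of a small grid. Slurm sets `SLURM_ARRAY_TASK_ID` for each task of an array job; everything else here (the grid values, the parameter names, and `config_for_task`) is a hypothetical illustration and not the contents of `mup_train.sh`.

```python
import itertools
import os

# Hypothetical search grid; the real sweep may use different parameters and values.
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3]
INIT_STDS = [0.01, 0.02, 0.04]
GRID = list(itertools.product(LEARNING_RATES, INIT_STDS))


def config_for_task(task_id: int) -> dict:
    """Return the grid point for a 1-based array task index."""
    if not 1 <= task_id <= len(GRID):
        raise ValueError(f"task_id must be in 1..{len(GRID)}, got {task_id}")
    learning_rate, init_std = GRID[task_id - 1]
    return {"learning_rate": learning_rate, "init_std": init_std}


if __name__ == "__main__":
    # Fall back to task 1 when run outside of a Slurm array job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

With this assumed grid of 12 points, the matching launch would use `--array=1-12`, so that each array task trains one configuration.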