High level goals

This repository is dedicated to developing (large) language models and language/structure models for the bio-chem domain.

Further details can be found here

Bio-LM PubChem Selfies

We are training an ELECTRA-style model on the PubChem dataset using SELFIES representations. SELFIES is a chemical string language derived from SMILES that is more robust: every SELFIES string decodes to a valid molecule. More info about SELFIES can be found here.
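Because a SELFIES string is a sequence of bracketed symbols, it can be tokenized for a language model by splitting on the brackets. The sketch below is a minimal illustration using only the Python standard library. The function names, the special tokens, and the example strings are assumptions made for illustration; they are not taken from this repository's tokenizer.

```python
import re

# One SELFIES symbol is a bracketed unit such as "[C]" or "[=O]";
# a "." separates disconnected fragments.
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]+\]|\.")

# Hypothetical special tokens; the repository's actual vocabulary may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<mask>"]


def split_symbols(selfies):
    """Split a SELFIES string into its bracketed symbols."""
    return SYMBOL_PATTERN.findall(selfies)


def build_vocab(corpus):
    """Assign an integer id to every special token and every symbol in the corpus."""
    symbols = sorted({s for text in corpus for s in split_symbols(text)})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + symbols)}


def encode(selfies, vocab):
    """Convert a SELFIES string to token ids, mapping unseen symbols to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(s, unk) for s in split_symbols(selfies)]


# Illustrative strings: "[C][C][O]" is ethanol (SMILES "CCO").
corpus = ["[C][C][O]", "[C][=O]"]
vocab = build_vocab(corpus)
ids = encode("[C][C][O]", vocab)
```

Symbol-level tokens keep the vocabulary small and mean that any sequence of tokens the model emits is still made of well-formed SELFIES symbols.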

We have released the dataset on Hugging Face Datasets; it contains ~110M compounds in total.

We will run a hyperparameter search using Maximal Update Parameterization (muP) to find a good set of hyperparameters that transfers to a larger model. To launch a sweep on the cluster, run:

sbatch --array=1-N mup_train.sh
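Here N is the number of sweep configurations. With a job array, SLURM sets the environment variable SLURM_ARRAY_TASK_ID to a different index in each task, so a training script can use it to pick one configuration. The sketch below shows one way this mapping could work. The search space, parameter names, and function name are hypothetical, and this README does not state what mup_train.sh actually does with the index.

```python
import itertools
import os

# Hypothetical search space; the values actually swept are not given in this README.
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "init_std": [0.01, 0.02, 0.04],
}


def config_for_task(task_id, grid=GRID):
    """Map a 1-based array index onto one point of the hyperparameter grid."""
    keys = list(grid)
    combos = list(itertools.product(*(grid[k] for k in keys)))
    if not 1 <= task_id <= len(combos):
        raise ValueError(f"task_id must be in 1..{len(combos)}, got {task_id}")
    return dict(zip(keys, combos[task_id - 1]))


if __name__ == "__main__":
    # Falls back to the first configuration when run outside a SLURM array job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

Under these assumptions the grid has 4 x 3 = 12 points, so the sweep would be launched with --array=1-12.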
