There are many different ways to contribute to ChemNLP! You can get in touch via the GitHub task board and issues and our Discord.


Please make a GitHub account prior to implementing a dataset; you can follow instructions to install git here.

  1. Fork the ChemNLP repository
  2. Clone your fork
  3. Make a new branch
  4. Please try using conventional commits for formatting your commit messages

If you wish to work on one of the submodules for the project, please see the git workflow docs.

Create a development environment (For code/dataset contributions)

For code and data contributions, we recommend you create a conda environment. If you do not have conda installed on your system, we recommend installing Miniconda.

To create your developer environment please follow the guidelines in the Installation and set-up of

Work package leads

If you are contributing to an existing task that carries a work package: <name> label, please refer to the list below to find the main point of contact for that piece of work. If you have questions or wish to contribute additional issues, feel free to reach out to these work package leads from the core team on the OpenBioML Discord, or message them directly in GitHub issues.

Name (Discord & GitHub) | Main Work Packages
Michael Pieler (MicPie#9427 & MicPie) | 💾 Structured Data, Knowledge Graph, Tokenisers, Data Sampling
Kevin Jablonka (Kevin Jablonka#1694 & kjappelbaum) | 💾 Structured Data, Knowledge Graph, Tokenisers, Data Sampling
Bethany Connolly (bethconnolly#3951 & bethanyconnolly) | 📊 Model Evaluation
Jack Butler (Jack Butler#8114 & jackapbutler) | ⚙️ Model Training
Mark Worrall (Mark Worrall#3307 & maw501) | 🦑 Model Adaptations

Implementing a dataset

Contributing a dataset

One of the most important ways to contribute to the ChemNLP efforts is to implement a dataset. By "implementing" we mean the following:

  • Take a dataset from our awesome list (if it is not there, please add it there first, so we keep track)
  • Open an issue in this repository stating that you want to add this dataset (we will label this issue and assign it to you)
  • Make a PR that adds a new folder in data containing:
    • meta.yaml describing the dataset in the form that your transformation script produces. We will use this later to construct the prompts.

      If your dataset has multiple natural splits (i.e. train, test, validation) you can create a _meta.yaml for each.

    • Python code that transforms the original dataset (linked in meta.yaml) into a form that can be consumed by the loader. For tabular datasets this will mostly involve removing or merging duplicated entries, renaming columns, and dropping unused columns. Try to keep the output of your script as lean as possible (i.e. no columns that will not be used). If you expect that extra columns might be useful (e.g. to indicate some grouping), please add them. Even though some examples create the meta.yaml in the transformation script, there is no need to do so. You can also write it by hand.

      In the script, please try to download the data from an official resource. If the raw data license permits it, we encourage you to upload the raw data to HuggingFace Hub, Foundry or some other repository and then retrieve the data from there with your script.

    • If you need additional dependencies, add them to dev-requirements.txt (those are needed for linting/testing/validation) or requirements.txt (those are the ones for running your transformation script).
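The tabular clean-up steps listed here (removing duplicated entries, renaming columns, dropping unused columns) could look roughly like the sketch below. It uses only the Python standard library and treats rows as dictionaries. The column names, the `clean_rows` and `to_csv` helpers, and the sample rows are made up for illustration and are not part of ChemNLP; a real script would read the raw data from its official source and would probably use pandas.

```python
import csv
import io

# Hypothetical mapping from raw column names to the lean names we keep.
# Columns not listed here are dropped.
RENAME = {"SMILES": "smiles", "Solubility": "solubility"}


def clean_rows(rows, rename=RENAME, key="smiles"):
    """Rename columns, drop unused ones, and keep the first row per `key`."""
    seen = set()
    cleaned = []
    for row in rows:
        lean = {new: row[old] for old, new in rename.items()}
        if lean[key] in seen:
            continue  # duplicated entry: keep only the first occurrence
        seen.add(lean[key])
        cleaned.append(lean)
    return cleaned


def to_csv(rows):
    """Serialise the cleaned rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up raw data standing in for a downloaded file.
raw = [
    {"SMILES": "CCO", "Solubility": "1.10", "Source": "A"},
    {"SMILES": "CCO", "Solubility": "1.10", "Source": "B"},
    {"SMILES": "c1ccccc1", "Solubility": "-1.64", "Source": "A"},
]
cleaned = clean_rows(raw)
csv_text = to_csv(cleaned)
```

Keeping the rename mapping as the single list of retained columns makes it easy to compare the script's output against the column ids declared in meta.yaml.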

The meta.yaml has the following structure:

name: aquasoldb # unique identifier, we will also use this for directory names
description: | # short description what this dataset is about
  Curation of nine open source datasets on aqueous solubility.
  The authors also assigned reliability groups.
targets:
  - id: Solubility # name of the column in a tabular dataset
    description: Experimental aqueous solubility value (LogS) # description of what this column means
    units: log(mol/L) # units of the values in this column (leave empty if unitless)
    type: continuous # can be "categorical", "ordinal", "continuous", "boolean"
    names: # names for the property (to sample from for building the prompts)
      - noun: aqueous solubility
      - noun: solubility in water
  - id: SD
    description: Standard deviation of the experimental aqueous solubility value for multiple occurrences
    units: log(mol/L)
    type: continuous
    names:
      - noun: standard deviation of the aqueous solubility
      - noun: standard deviation of the solubility in water
benchmarks: # lists all benchmarks this dataset has been part of. split_column is a column in this dataframe with the value "train", "valid", "test" - indicating to which fold a specific entry belongs to
    - name: TDC
      split_column: split
identifiers:
  - id: InChI # column name
    type: InChI # can be "SMILES", "SELFIES", "IUPAC", "Other", "InChI", "InChiKey", "RXNSMILES", "RXNSMILESWAdd" see IdentifierEnum
    description: International Chemical Identifier # description (optional, except for "OTHER")
license: CC0 1.0 # license under which the original dataset was published
num_points: 10000 # number of datapoints in this dataset
links: # list of relevant links (original dataset, other uses, etc.)
  - name: dataset
    description: Original dataset
bibtex: # citation(s) for this dataset in BibTeX format
  - |
    doi = {10.1038/s41597-019-0151-1},
    url = {},
    year = 2019,
    month = {aug},
    publisher = {Springer Science and Business Media {LLC}},
    volume = {6},
    number = {1},
    author = {Murat Cihan Sorkun and Abhishek Khetan and Süleyman Er},
    title = {{AqSolDB}, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds},
    journal = {Sci Data}

Please do not simply copy/paste generic descriptions but try to give a concise and specific description for the dataset you are adding.

For the typical material-property datasets, we will later use the identifier and property columns to create and fill prompt templates. In case your dataset isn't a simple tabular dataset with chemical compounds and properties, please also add the following additional fields for the templates:

templates:
  - prompt: "Please answer the following chemistry question.\nDerive for the molecule with the <molecule#text> <molecule#value> the <exp_value#text>."
    completion: "<exp_value#value>"
  - prompt: "Please answer the following question.\nPredict the <exp_value#text> for <molecule#value>."
    completion: "<exp_value#value>"
fields:
  exp_value:
    values:
      - name: lab_value
        column: lab_value
        text: adsorption energy
  molecule:
    values:
      - name: smiles
        column: smiles
      - name: inchi
        column: inchi
        text: InChI

With this approach you can specify different fields, where each field maps to one of many columns in a dataframe. In the templates, write <field#value> to fill in the value of a particular entry, or <field#text> to fill in the text that you specify for that field in the YAML.

If there are multiple values for one field, we will sample combinations. If you want to suggest sampling from different prompt prefixes, you can do so by specifying a template field with different text entries.
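To make the placeholder syntax concrete, here is a minimal sketch of how a template could be filled for a single row. It is not the ChemNLP implementation: the `fill_template` helper, the layout of the `FIELDS` dictionary, the "SMILES" text, and the sample row are assumptions for illustration, and the sketch simply picks one option per field at random.

```python
import random
import re

# Assumed layout: each field offers one or more (column, text) options.
FIELDS = {
    "exp_value": [{"column": "lab_value", "text": "adsorption energy"}],
    "molecule": [
        {"column": "smiles", "text": "SMILES"},
        {"column": "inchi", "text": "InChI"},
    ],
}

PLACEHOLDER = re.compile(r"<(\w+)#(text|value)>")


def fill_template(template, row, fields=FIELDS, rng=random):
    """Replace <field#text> and <field#value> using one sampled option per field."""
    # Sample once per field so #text and #value refer to the same column.
    choice = {name: rng.choice(options) for name, options in fields.items()}

    def substitute(match):
        option = choice[match.group(1)]
        if match.group(2) == "text":
            return option["text"]
        return str(row[option["column"]])

    return PLACEHOLDER.sub(substitute, template)


# Made-up row; the numeric value is not real data.
row = {"lab_value": -0.42, "smiles": "CO", "inchi": "InChI=1S/CH4O/c1-2/h2H,1H3"}
prompt = fill_template(
    "Predict the <exp_value#text> for the molecule with the <molecule#text> <molecule#value>.",
    row,
    rng=random.Random(0),
)
completion = fill_template("<exp_value#value>", row)
```

Sampling one option per field before substituting keeps the text and the value of a field in sync, so a prompt never pairs the word "InChI" with a SMILES string.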

In case you run into issues (or think you don't have enough compute or storage), please let us know. Also, in some cases csv might not be the best format. If you think that csv is not suitable for your dataset, let us know.

For now, you do not need to upload the transformed datasets anywhere. We will collect the URLs of the raw data in meta.yaml and the code to produce the curated data in your transformation script, and then run this code on dedicated infrastructure.

How will the datasets be used?

If your dataset is in tabular form, we will construct prompts using, for example, the LIFT framework. In this case, we will sample from the identifier and target columns. If you specify prompt templates, we will also sample from those. Therefore, it is very important that the column names in the meta.yaml match the ones in the file that your transformation script produces. One example of a prompt we might construct is "What is the <target_name> of <identifier>", where we sample target_name from the names of the targets listed in meta.yaml and identifier from the identifiers provided in meta.yaml.
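As a rough illustration of this sampling, the sketch below builds a "What is the <target_name> of <identifier>?" prompt from a hand-written dictionary standing in for a parsed meta.yaml. The dictionary layout, the `missing_columns` and `sample_prompt` helpers, and the example row are assumptions made for this example, not the project's actual prompt builder. The check at the start shows why mismatched column names matter: a target or identifier id that is absent from the data cannot be sampled.

```python
import random

# Trimmed-down, hand-written stand-in for a parsed meta.yaml.
META = {
    "targets": [
        {
            "id": "Solubility",
            "names": [{"noun": "aqueous solubility"}, {"noun": "solubility in water"}],
        },
    ],
    "identifiers": [{"id": "InChI", "type": "InChI"}],
}


def missing_columns(meta, columns):
    """Return the column ids named in meta that are absent from the data."""
    wanted = [entry["id"] for entry in meta["targets"] + meta["identifiers"]]
    return [col for col in wanted if col not in columns]


def sample_prompt(meta, row, rng=random):
    """Build 'What is the <target_name> of <identifier>?' for one row."""
    missing = missing_columns(meta, row.keys())
    if missing:
        raise KeyError(f"columns missing from data: {missing}")
    target = rng.choice(meta["targets"])
    target_name = rng.choice(target["names"])["noun"]
    identifier = rng.choice(meta["identifiers"])
    question = f"What is the {target_name} of {row[identifier['id']]}?"
    return question, str(row[target["id"]])


# Made-up row; the numeric value is not real data.
row = {"InChI": "InChI=1S/CH4O/c1-2/h2H,1H3", "Solubility": 1.57}
question, answer = sample_prompt(META, row, rng=random.Random(0))
```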


If your dataset is part of a benchmark, please indicate which fold each entry belongs to using an additional split column with the values train, valid, or test. Please name this column in the meta.yaml under the field split_column of the corresponding benchmark, as in the example above.
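A quick sanity check on the split values can catch typos such as "validation" before the data is used. The following is a minimal sketch under stated assumptions: the `check_split_column` helper, its `split_column` parameter, the default column name `split`, and the sample rows are all hypothetical.

```python
from collections import Counter

ALLOWED_SPLITS = {"train", "valid", "test"}


def check_split_column(rows, split_column="split"):
    """Count rows per fold and raise on missing or unexpected split values."""
    counts = Counter()
    for index, row in enumerate(rows):
        value = row.get(split_column)
        if value not in ALLOWED_SPLITS:
            raise ValueError(f"row {index}: unexpected split value {value!r}")
        counts[value] += 1
    return dict(counts)


# Made-up rows for illustration.
rows = [
    {"smiles": "CCO", "split": "train"},
    {"smiles": "CO", "split": "train"},
    {"smiles": "CCC", "split": "test"},
]
fold_sizes = check_split_column(rows)
```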


We ask you to add uris and pubchem_aids in case you find suitable references. We distinguish certain types of identifiers, for which you have to specify the correct strings. The currently allowed types are in the IdentifierEnum in src/chemnlp/data_val/

  • SMILES: Use the canonical form (RDKit)
  • SELFIES: Self-referencing embedded strings
  • IUPAC: IUPAC name; do not use it for non-standard, common names
  • InChI
  • InChIKey: The key derived from the InChI
  • RXNSMILES: The reaction SMILES containing only educt and product
  • RXNSMILESWAdd: The reaction SMILES also containing solvent and additives
  • Other: For all other identifiers
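The type strings have to match exactly, including capitalisation. The sketch below shows one way to check an identifier entry against the list of types. `IdentifierType` and `check_identifier` are illustrative stand-ins written for this example, with the strings copied from the list; the authoritative values are those in the project's IdentifierEnum.

```python
from enum import Enum


class IdentifierType(Enum):
    """Illustrative stand-in; the real list lives in ChemNLP's IdentifierEnum."""

    SMILES = "SMILES"
    SELFIES = "SELFIES"
    IUPAC = "IUPAC"
    INCHI = "InChI"
    INCHIKEY = "InChIKey"
    RXNSMILES = "RXNSMILES"
    RXNSMILESWADD = "RXNSMILESWAdd"
    OTHER = "Other"


def check_identifier(entry):
    """Validate one identifier entry; type 'Other' also needs a description."""
    kind = IdentifierType(entry["type"])  # raises ValueError for unknown strings
    if kind is IdentifierType.OTHER and not entry.get("description"):
        raise ValueError(f"identifier {entry['id']!r} of type Other needs a description")
    return kind


kind = check_identifier({"id": "InChI", "type": "InChI"})
```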

Uniform Resource Identifiers (URIs)

If you have a uniform resource identifier (URI) that links to a suitable name of a property, please list it in the uris list for a given target. Please ensure that the link is specific. If you have a boolean target that measures inhibition of a protein, link to "inhibitor of XY" and not to the protein. If such a link does not exist, leave the field empty.

You might find suitable links using the following resources:

PubChem Assay IDs

For some targets, the activity was measured using assays. In this case, please list the assays by their numeric PubChem assay ID in the field pubchem_aids. Please ensure that the first entry in this list is a primary scan that corresponds to the target property (and not to its inverse or a control). Keep in mind that we plan to look up the name and the description of the assay to build prompts. That is, the assay name of the first entry in this list should also work in a prompt such as "Is <identifier> active in <assay name>?"

Prompt examples

For datasets that are not in tabular form, we are still discussing the best process. We envision that we might perform named-entity recognition so that some of the text datasets can also be used in a framework such as LIFT. Otherwise, we will simply use them in the typical GPT pretraining task.

Implementing structured data sampler


Implementing tokenizers


Implementing model adaptations

Our first experiments will be based on the Pythia model suite from EleutherAI, which is based on GPT-NeoX.

If you are not familiar with LLM training, have a look at this very good guide: Large-scale language modeling tutorials with PyTorch from TUNiB

For details, please see the corresponding section in our proposal.

Hugging Face Hub

We prefer using the Hugging Face Hub and processing datasets through the datasets package when storing larger datasets on the OpenBioML hub, as it offers useful features such as:

  • Easy multiprocessing parallelism for data cleaning
  • Version controlling of the datasets as well as our code
  • Easy interface into tokenisation & other aspects for model training
  • Reuse of utility functions once we have a consistent data structure.

However, don't feel pressured to use this if you're more comfortable contributing an external dataset in another format. We are primarily thinking of using this functionality for processed, combined datasets which are ready for training.

Feel free to reach out to one of the team and read this guide for more information.