Preprocessing update (#1)
* update to reflect open sourcing of preprocessing

* update to reflect open sourced preprocessing

* check mypy on disconnection aware

[bump patch]

Co-authored-by: Amol Thakkar <[email protected]>
A-Thakkar and Amol Thakkar authored Nov 15, 2022
1 parent 7157f5e commit ac45726
Showing 38 changed files with 576 additions and 828 deletions.
39 changes: 19 additions & 20 deletions .github/workflows/tests.yml
@@ -1,4 +1,3 @@

name: "Running tests: style, mypy, pytest"

on: [push, pull_request]
@@ -8,22 +7,22 @@ jobs:
    runs-on: ubuntu-latest
    name: Style, mypy, pytest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.7
        uses: actions/setup-python@v3
        with:
          python-version: 3.7
      - name: Install Dependencies
        run: pip install -e .[dev,rdkit]
      - name: Check black
        run: python -m black --check --diff --color .
      - name: Check isort
        run: python -m isort --check --diff .
      - name: Check flake8
        run: python -m flake8 .
      - name: Check mypy (on the tests)
        run: python -m mypy tests
      - name: Check mypy (on package)
        run: python -m mypy src/dar
      - name: Run pytests
        run: python -m pytest
5 changes: 0 additions & 5 deletions AUTHORS.rst

This file was deleted.

169 changes: 65 additions & 104 deletions README.rst → README.md
@@ -1,115 +1,80 @@
# Interactive Retrosynthetic Language Models

This repository provides functionality for preprocessing data for, and training, interactive retrosynthesis language models,
so-called disconnection-aware retrosynthesis.

## Abstract

Data-driven approaches to retrosynthesis have thus far been limited in user interaction, in the diversity of their predictions,
and in the recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in
natural language processing to the task of chemical language modelling. We show that by using a prompt describing the disconnection
site in a molecule, we are able to steer the model to propose a wider set of precursors, overcoming training data biases in
retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a
disconnection prompt empowers chemists by giving them back greater control over the disconnection predictions, resulting in more
diverse and creative recommendations. In addition, in lieu of a human-in-the-loop strategy, we propose a schema for the automatic
identification of disconnection sites, followed by prediction of reactant sets, achieving a 100% improvement in class diversity
compared to the baseline. The approach is effective in mitigating prediction biases deriving from the training data, providing a
larger variety of usable building blocks, which in turn improves the end-user digital experience. We demonstrate its application
to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is key.

![Overview](images/overview_figure.jpeg "Overview")

## Dataset

The data was derived from the [US Patent Office extracts (USPTO) by Lowe](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873)
and was processed using the following [workflow](./notebooks_and_scripts/rxn_preprocessing_workflow.ipynb), filtering for the following conditions:

- min_reactants: 2
- max_reactants: 10
- max_reactants_tokens: 300
- min_agents: 0
- max_agents: 0
- max_agents_tokens: 0
- min_products: 1
- max_products: 1
- max_products_tokens: 200
- max_absolute_formal_charge: 2

The reactions were atom-mapped with [RXNMapper](https://github.com/rxn4chemistry/rxnmapper).

Refer to the data folder for an explanation of how to process an example dataset.
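
As an illustration, the molecule-count conditions can be checked on a raw reaction SMILES as in the minimal sketch below (not the repository's preprocessing code; the token-count and formal-charge conditions are omitted for brevity):

```python
def passes_filters(rxn_smiles: str) -> bool:
    """Check the molecule-count conditions on a 'reactants>agents>products' string."""
    reactants, agents, products = rxn_smiles.split(">")
    n_reactants = len(reactants.split(".")) if reactants else 0
    n_agents = len(agents.split(".")) if agents else 0
    n_products = len(products.split(".")) if products else 0
    return 2 <= n_reactants <= 10 and n_agents == 0 and n_products == 1


print(passes_filters("CCO.CC(=O)O>>CCOC(C)=O"))  # True: 2 reactants, 0 agents, 1 product
```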

Note: For enzyme data, the model can be trained following the procedure of [Probst et al.](https://github.com/rxn4chemistry/biocatalysis-model) after tagging disconnection sites as shown below.

## Installation and Usage

```
git clone https://github.com/rxn4chemistry/disconnection_aware_retrosynthesis.git
cd disconnection_aware_retrosynthesis
conda create -n disconnect python=3.7 -y
conda activate disconnect
```

If rdkit is already installed from conda:

```
pip install -e .
```

If you want to install rdkit:

```
pip install -e .[rdkit]
```

For development:

```
pip install -e .[rdkit,dev]
```

Some utility functions are provided for preprocessing the data and tagging the disconnection site.

```python
from dar.tagging import get_tagged_products

rxn = 'CC(C)(C)O[Cl:18].CCO.ClCCl.[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[cH:9][cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][cH:17]1.[ClH:10]>>[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[c:9]([Cl:10])[cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][c:17]1[Cl:18]'
precursor, product = rxn.split('>>')
tagged_product = get_tagged_products(precursor, product)
print(tagged_product)
>>> 'CCOC(=O)CNc1[c:1]([Cl:1])cc(CCO)c[c:1]1[Cl:1]'
```
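
For OpenNMT, SMILES strings (tagged or untagged) are tokenised into space-separated tokens. A minimal sketch using the commonly used SMILES tokenisation regex (the referenced notebooks define the exact procedure; this is an approximation):

```python
import re

# Commonly used SMILES tokenisation pattern; bracketed atoms such as
# [c:1] or [Cl:1] are kept as single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenise(smiles: str) -> str:
    """Insert spaces between SMILES tokens, as expected by OpenNMT."""
    return " ".join(SMILES_REGEX.findall(smiles))

print(tokenise("CCOC(=O)CNc1[c:1]([Cl:1])cc(CCO)c[c:1]1[Cl:1]"))
# C C O C ( = O ) C N c 1 [c:1] ( [Cl:1] ) c c ( C C O ) c [c:1] 1 [Cl:1]
```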

## Training the Disconnection Aware Retrosynthesis Model

Model training was conducted with OpenNMT-py.

The first step is to run `onmt_preprocess`:

```
DATA=data/
DATASET=FullUSPTO
@@ -121,11 +86,11 @@
-save_data ${DATA}/${DATASET} \
-src_seq_length 1000 -tgt_seq_length 1000 \
-src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab
```

Once the OpenNMT pre-preprocessing has finished, the actual training can be started:

```
DATA=data/
SAVE_MODEL=disconnection_aware
@@ -165,15 +130,14 @@
-global_attention_function softmax -self_attn_type scaled-dot \
-heads 8 -transformer_ff 2048 \
--tensorboard --tensorboard_log_dir ${DATA}/logs
```

Note: The above procedure can be followed to preprocess and train a model between any two sequences.
For instance, the AutoTag model can be trained using the same approach.

## Translation

```
DATA=data/
MODEL=$(ls data/disconnection_aware*.pt -t | head -1)
DATASET=FullUSPTO
@@ -185,19 +149,16 @@
-output ${DATA}/retro_predictions_${MODEL}_top_${N_BEST}.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```

## Automatic Tagging of Disconnection Sites (AutoTag)

A model can be trained to automatically identify disconnection sites in a given molecule using the data provided and the training workflow shown above.
The data must first be pre-processed such that the following apply:

- Source data: Tokenised product SMILES (no atom-mapping)
- Target data: Tokenised tagged product SMILES

A notebook is given to outline the general workflow used to preprocess the given data.
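
A possible sketch of that preprocessing, combining the `get_tagged_products` utility shown above with RDKit to strip atom maps (the notebook is the authoritative reference):

```python
from rdkit import Chem

from dar.tagging import get_tagged_products

def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers to obtain the untagged source product."""
    mol = Chem.MolFromSmiles(smiles)
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)
    return Chem.MolToSmiles(mol)

# Atom-mapped example reaction from above
rxn = 'CC(C)(C)O[Cl:18].CCO.ClCCl.[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[cH:9][cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][cH:17]1.[ClH:10]>>[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[c:9]([Cl:10])[cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][c:17]1[Cl:18]'
precursor, product = rxn.split('>>')

src = strip_atom_maps(product)                 # source: product without atom maps
tgt = get_tagged_products(precursor, product)  # target: tagged product
```

Both sides are then tokenised before being written to the OpenNMT source and target files.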

### Improving Class Diversity at Model Inference

Class diversity of single-step retrosynthesis can be improved by calling the AutoTag model first to identify potential disconnection sites.
The number of disconnection sites identified can be tuned with the `-n_best` parameter; we recommend setting it to 10.
@@ -206,8 +167,7 @@
For each prediction, the Disconnection Aware model can be used to predict one set of precursors.

The following calls to translate are an example:

```
DATA=data/
AUTOTAG_MODEL=$(ls data/autotag*.pt -t | head -1)
DATASET=FullUSPTO
@@ -219,11 +179,11 @@
-output ${DATA}/autotagged_output.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```

We suggest canonicalising the output from the AutoTag model prior to subsequent translation for optimal performance.
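
For example, with RDKit (a sketch; OpenNMT output lines must first be de-tokenised, and RDKit preserves the atom-map numbers that encode the tags):

```python
from rdkit import Chem

def canonicalise_prediction(tokenised_smiles: str) -> str:
    """De-tokenise an AutoTag output line and canonicalise it."""
    smiles = "".join(tokenised_smiles.split())  # OpenNMT outputs space-separated tokens
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else ""  # drop invalid predictions

print(canonicalise_prediction("C C O C ( = O ) C N c 1 [c:1] ( [Cl:1] ) c c ( C C O ) c [c:1] 1 [Cl:1]"))
```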

```
DATA=data/
DISCONNECTION_MODEL=$(ls data/disconnection_aware*.pt -t | head -1)
DATASET=FullUSPTO
@@ -235,3 +195,4 @@
-output ${DATA}/diverse_output.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```
19 changes: 14 additions & 5 deletions data/README.MD
@@ -7,13 +7,22 @@ The data used to train the models in this study can be found at:
## Usage of pre-processed data on Zenodo

- Download the data from Zenodo.
- You should have a file called 'complete_disconnection_labelled.csv' which you can use with our [notebook](../notebooks_and_scripts/basic_tag_and_tokenise.ipynb) to reprocess the data.
- Alternatively, you can retrain the models using the tokenised datasets.
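
Before reprocessing, the download can be sanity-checked with pandas (a minimal sketch; the column layout comes from the Zenodo record and is not assumed here):

```python
import pandas as pd

df = pd.read_csv("complete_disconnection_labelled.csv")
print(len(df))
print(df.columns.tolist())  # inspect the available columns before reprocessing
```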

## Usage from Lowe patent data

The disconnection aware retrosynthesis model may still be trained using the procedure outlined in this repository, starting from the US Patent Office extracts by Lowe:

- Download the [USPTO data](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from Figshare.
- Unzip '1976_Sep2016_USPTOgrants_smiles.7z', which we will use for demo purposes.

### Preprocessing using [rxn-reaction-preprocessing](https://github.com/rxn4chemistry/rxn-reaction-preprocessing) (recommended)

- Follow the [workflow notebook](./notebooks_and_scripts/rxn_preprocessing_workflow.ipynb).

For further information about reaction preprocessing, refer to [rxn-reaction-preprocessing](https://github.com/rxn4chemistry/rxn-reaction-preprocessing).

### Basic Data Processing

- You should have a file called '1976_Sep2016_USPTOgrants_smiles.rsmi' which you can use with our [notebook](../notebooks_and_scripts/basic_preprocessing_example.ipynb) for basic data pre-processing.
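
As a quick sanity check on the extract (a sketch, assuming the usual layout of the Lowe extracts: tab-separated with a 'ReactionSmiles' column; verify against the actual header):

```python
import pandas as pd

df = pd.read_csv("1976_Sep2016_USPTOgrants_smiles.rsmi", sep="\t")
print(df.columns.tolist())
print(df["ReactionSmiles"].iloc[0])  # raw reaction SMILES, with atom maps
```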
16 changes: 0 additions & 16 deletions disconnection_aware_retrosynthesis/__init__.py

This file was deleted.

29 changes: 0 additions & 29 deletions docs/Makefile

This file was deleted.

1 change: 0 additions & 1 deletion docs/_static/.gitignore

This file was deleted.

