Preprocessing update (#1)
* update to reflect open sourcing of preprocessing

* update to reflect open sourced preprocessing

* check mypy on disconnection aware

[bump patch]

Co-authored-by: Amol Thakkar <[email protected]>
A-Thakkar and Amol Thakkar authored Nov 15, 2022
1 parent 7157f5e commit ac45726
Showing 38 changed files with 576 additions and 828 deletions.
39 changes: 19 additions & 20 deletions .github/workflows/tests.yml
@@ -1,4 +1,3 @@

name: "Running tests: style, mypy, pytest"

on: [push, pull_request]
@@ -8,22 +7,22 @@ jobs:
    runs-on: ubuntu-latest
    name: Style, mypy, pytest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python 3.7
        uses: actions/setup-python@v3
        with:
          python-version: 3.7
      - name: Install Dependencies
        run: pip install -e .[dev,rdkit]
      - name: Check black
        run: python -m black --check --diff --color .
      - name: Check isort
        run: python -m isort --check --diff .
      - name: Check flake8
        run: python -m flake8 .
      - name: Check mypy (on the tests)
        run: python -m mypy tests
      - name: Check mypy (on package)
        run: python -m mypy src/dar
      - name: Run pytests
        run: python -m pytest
5 changes: 0 additions & 5 deletions AUTHORS.rst

This file was deleted.

169 changes: 65 additions & 104 deletions README.rst → README.md
@@ -1,115 +1,80 @@
# Interactive Retrosynthetic Language Models

This repository provides functionality for preprocessing data for, and training, interactive retrosynthesis language models,
so-called disconnection-aware retrosynthesis.

## Abstract

Data-driven approaches to retrosynthesis have thus far been limited in user interaction, in the diversity of their predictions,
and in the recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in
natural language processing to the task of chemical language modelling. We show that by using a prompt describing the disconnection
site in a molecule, we are able to steer the model to propose a wider set of precursors, overcoming training data biases in
retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a
disconnection prompt empowers chemists by giving them back greater control over the disconnection predictions, resulting in more
diverse and creative recommendations. In addition, in lieu of a human-in-the-loop strategy, we propose a schema for the automatic
identification of disconnection sites, followed by prediction of reactant sets, achieving a 100% improvement in class diversity
compared to the baseline. The approach is effective in mitigating prediction biases deriving from the training data, providing a
larger variety of usable building blocks, which in turn improves the end-user digital experience. We demonstrate its application
to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is key.

![Overview](images/overview_figure.jpeg "Overview")

## Dataset

The data was derived from the [US Patent Office extracts (USPTO) by Lowe](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873)
and was processed using the following [workflow](./notebooks_and_scripts/rxn_preprocessing_workflow.ipynb), filtering for the following conditions:

- min_reactants: 2
- max_reactants: 10
- max_reactants_tokens: 300
- min_agents: 0
- max_agents: 0
- max_agents_tokens: 0
- min_products: 1
- max_products: 1
- max_products_tokens: 200
- max_absolute_formal_charge: 2

The reactions were atom-mapped with [RXNMapper](https://github.com/rxn4chemistry/rxnmapper).

Refer to the data folder for an explanation of how to process an example dataset.
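
As an illustration, the molecule-count conditions can be checked on a raw reaction SMILES as in the minimal sketch below (not the repository's preprocessing code; the token-count and formal-charge conditions are omitted for brevity):

```python
def passes_filters(rxn_smiles: str) -> bool:
    """Check the molecule-count conditions on a 'reactants>agents>products' string."""
    reactants, agents, products = rxn_smiles.split(">")
    n_reactants = len(reactants.split(".")) if reactants else 0
    n_agents = len(agents.split(".")) if agents else 0
    n_products = len(products.split(".")) if products else 0
    return 2 <= n_reactants <= 10 and n_agents == 0 and n_products == 1


print(passes_filters("CCO.CC(=O)O>>CCOC(C)=O"))  # True: 2 reactants, 0 agents, 1 product
```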

Note: For enzyme data, the model can be trained following the procedure of [Probst et al.](https://github.com/rxn4chemistry/biocatalysis-model) after tagging disconnection sites as shown below.

## Installation and Usage

```
git clone https://github.com/rxn4chemistry/disconnection_aware_retrosynthesis.git
cd disconnection_aware_retrosynthesis
conda create -n disconnect python=3.7 -y
conda activate disconnect
```

If rdkit is already installed from conda:

```
pip install -e .
```

If you want to install rdkit:

```
pip install -e .[rdkit]
```

For development:

```
pip install -e .[rdkit,dev]
```

Some utility functions are provided for preprocessing the data and tagging the disconnection site.

```python
from dar.tagging import get_tagged_products

rxn = 'CC(C)(C)O[Cl:18].CCO.ClCCl.[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[cH:9][cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][cH:17]1.[ClH:10]>>[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[c:9]([Cl:10])[cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][c:17]1[Cl:18]'
precursor, product = rxn.split('>>')
tagged_product = get_tagged_products(precursor, product)
print(tagged_product)
>>> 'CCOC(=O)CNc1[c:1]([Cl:1])cc(CCO)c[c:1]1[Cl:1]'
```
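
For OpenNMT, SMILES strings (tagged or untagged) are tokenised into space-separated tokens. A minimal sketch using the commonly used SMILES tokenisation regex (the referenced notebooks define the exact procedure; this is an approximation):

```python
import re

# Commonly used SMILES tokenisation pattern; bracketed atoms such as
# [c:1] or [Cl:1] are kept as single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenise(smiles: str) -> str:
    """Insert spaces between SMILES tokens, as expected by OpenNMT."""
    return " ".join(SMILES_REGEX.findall(smiles))

print(tokenise("CCOC(=O)CNc1[c:1]([Cl:1])cc(CCO)c[c:1]1[Cl:1]"))
# C C O C ( = O ) C N c 1 [c:1] ( [Cl:1] ) c c ( C C O ) c [c:1] 1 [Cl:1]
```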

## Training the Disconnection Aware Retrosynthesis Model

Model training was conducted with OpenNMT-py.

The first step is to run `onmt_preprocess`:

```
DATA=data/
DATASET=FullUSPTO
@@ -121,11 +86,11 @@
-save_data ${DATA}/${DATASET} \
-src_seq_length 1000 -tgt_seq_length 1000 \
-src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab
```

Once the OpenNMT pre-preprocessing has finished, the actual training can be started:

```
DATA=data/
SAVE_MODEL=disconnection_aware
@@ -165,15 +130,14 @@
-global_attention_function softmax -self_attn_type scaled-dot \
-heads 8 -transformer_ff 2048 \
--tensorboard --tensorboard_log_dir ${DATA}/logs
```

Note: The above procedure can be followed to preprocess and train a model between any two sequences.
For instance, the AutoTag model can be trained using the same approach.

## Translation

```
DATA=data/
MODEL=$(ls data/disconnection_aware*.pt -t | head -1)
DATASET=FullUSPTO
@@ -185,19 +149,16 @@
-output ${DATA}/retro_predictions_${MODEL}_top_${N_BEST}.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```

## Automatic Tagging of Disconnection Sites (AutoTag)

A model can be trained to automatically identify disconnection sites in a given molecule using the data provided and the training workflow shown above.
The data must first be pre-processed such that the following apply:

- Source data: Tokenised product SMILES (no atom-mapping)
- Target data: Tokenised tagged product SMILES

A notebook is given to outline the general workflow used to preprocess the given data.
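
A possible sketch of that preprocessing, combining the `get_tagged_products` utility shown above with RDKit to strip atom maps (the notebook is the authoritative reference):

```python
from rdkit import Chem

from dar.tagging import get_tagged_products

def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers to obtain the untagged source product."""
    mol = Chem.MolFromSmiles(smiles)
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)
    return Chem.MolToSmiles(mol)

# Atom-mapped example reaction from above
rxn = 'CC(C)(C)O[Cl:18].CCO.ClCCl.[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[cH:9][cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][cH:17]1.[ClH:10]>>[CH3:1][CH2:2][O:3][C:4](=[O:5])[CH2:6][NH:7][c:8]1[c:9]([Cl:10])[cH:11][c:12]([CH2:13][CH2:14][OH:15])[cH:16][c:17]1[Cl:18]'
precursor, product = rxn.split('>>')

src = strip_atom_maps(product)                 # source: product without atom maps
tgt = get_tagged_products(precursor, product)  # target: tagged product
```

Both sides are then tokenised before being written to the OpenNMT source and target files.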

### Improving Class Diversity at Model Inference

Class diversity of single-step retrosynthesis can be improved by calling the AutoTag model first to identify potential disconnection sites.
The number of disconnection sites identified can be tuned with the `-n_best` parameter; we recommend setting it to 10.
@@ -206,8 +167,7 @@
For each prediction, the Disconnection Aware model can be used to predict one set of precursors.

The following calls to translate are an example:

```
DATA=data/
AUTOTAG_MODEL=$(ls data/autotag*.pt -t | head -1)
DATASET=FullUSPTO
@@ -219,11 +179,11 @@
-output ${DATA}/autotagged_output.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```

We suggest canonicalising the output from the AutoTag model prior to subsequent translation for optimal performance.
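
For example, with RDKit (a sketch; OpenNMT output lines must first be de-tokenised, and RDKit preserves the atom-map numbers that encode the tags):

```python
from rdkit import Chem

def canonicalise_prediction(tokenised_smiles: str) -> str:
    """De-tokenise an AutoTag output line and canonicalise it."""
    smiles = "".join(tokenised_smiles.split())  # OpenNMT outputs space-separated tokens
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else ""  # drop invalid predictions

print(canonicalise_prediction("C C O C ( = O ) C N c 1 [c:1] ( [Cl:1] ) c c ( C C O ) c [c:1] 1 [Cl:1]"))
```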

```
DATA=data/
DISCONNECTION_MODEL=$(ls data/disconnection_aware*.pt -t | head -1)
DATASET=FullUSPTO
@@ -235,3 +195,4 @@
-output ${DATA}/diverse_output.txt \
-batch_size 64 -replace_unk -max_length 200 \
-gpu 0 -n_best ${N_BEST} -beam_size 10
```
19 changes: 14 additions & 5 deletions data/README.MD
@@ -7,13 +7,22 @@ The data used to train the models in this study can be found at:
## Usage of pre-processed data on Zenodo

- Download the data from Zenodo.
- You should have a file called 'complete_disconnection_labelled.csv' which you can use with our [notebook](../notebooks_and_scripts/basic_tag_and_tokenise.ipynb) to reprocess the data.
- Alternatively, you can retrain the models using the tokenised datasets.
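
Before reprocessing, the download can be sanity-checked with pandas (a minimal sketch; the column layout comes from the Zenodo record and is not assumed here):

```python
import pandas as pd

df = pd.read_csv("complete_disconnection_labelled.csv")
print(len(df))
print(df.columns.tolist())  # inspect the available columns before reprocessing
```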

## Usage from Lowe patent data

The disconnection aware retrosynthesis model may still be trained using the procedure outlined in this repository, starting from the US Patent Office extracts by Lowe:

- Download the [USPTO data](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873) from Figshare.
- Unzip '1976_Sep2016_USPTOgrants_smiles.7z', which we will use for demo purposes.

### Preprocessing using [rxn-reaction-preprocessing](https://github.com/rxn4chemistry/rxn-reaction-preprocessing) (recommended)

- Follow the [workflow notebook](./notebooks_and_scripts/rxn_preprocessing_workflow.ipynb).

For further information about reaction preprocessing, refer to [rxn-reaction-preprocessing](https://github.com/rxn4chemistry/rxn-reaction-preprocessing).

### Basic Data Processing

- You should have a file called '1976_Sep2016_USPTOgrants_smiles.rsmi' which you can use with our [notebook](../notebooks_and_scripts/basic_preprocessing_example.ipynb) for basic data pre-processing.
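
As a quick sanity check on the extract (a sketch, assuming the usual layout of the Lowe extracts: tab-separated with a 'ReactionSmiles' column; verify against the actual header):

```python
import pandas as pd

df = pd.read_csv("1976_Sep2016_USPTOgrants_smiles.rsmi", sep="\t")
print(df.columns.tolist())
print(df["ReactionSmiles"].iloc[0])  # raw reaction SMILES, with atom maps
```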
16 changes: 0 additions & 16 deletions disconnection_aware_retrosynthesis/__init__.py

This file was deleted.

29 changes: 0 additions & 29 deletions docs/Makefile

This file was deleted.

1 change: 0 additions & 1 deletion docs/_static/.gitignore

This file was deleted.

