A set of scripts with tools and functions for benchmarking and analysis of generative reinforcement learning models.
Overview:
- analysis_utils.py: module containing tools for analysis.
- analysis_utils1.py: newer module containing updated tools for analysis. Use this module in place of analysis_utils.py for new work.
- egfr_demo.py: training routine for reinforcement learning to generate molecules with predicted activity against EGFR.
- gridsearch.py: script to run hyperparameter gridsearches for RL training routines.
- iter_utils.py: module containing accessory functions for gridsearch module.
General usage: The gridsearch.py module provides a general framework to perform hyperparameter gridsearch.
Usage: gridsearch.py [-h] script config_path log_path
Positional arguments:
  script       Filename of the Python script to run for the gridsearch. The script must define a 'main' function that accepts any number of keyword arguments. If the script can write a generated library to a file, 'main' must also accept a 'save_path' keyword argument.
  config_path  Filename containing the parameters to pass to the script. The parameters should be stored as a dictionary, with values of type list for parameters to be optimized.
  log_path     Filename for logging training information. All output to sys.stdout is redirected to log_path.
Optional arguments:
  -h, --help   Show this help message and exit.
Example: $ python3 -m gridsearch my_script my_config_path my_log_path
The config file should be a .txt file containing a dictionary, with values of type list for parameters to be optimized. Two special features are supported: nested parameters and coupled parameters. If the executable script takes a dict as a parameter, an entry within that dict can be set by joining the keyword and the nested key with a '__' separator. Coupled parameters, which are varied together, are specified by joining the parameter names with ','. The corresponding value must be a list of tuples, with one tuple per combination.
Example of a valid config:
{
'spam_params__foo': ['nested', 'parameter'],
'foo,bar': [('coupled', 'parameters'), ('in', 'tuple')],
'eggs': 'this parameter is not iterated'
}
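As an illustration of the nested and coupled conventions, the sketch below parses the example config and expands it into one set of keyword arguments per hyperparameter combination. It is not the code in gridsearch.py or iter_utils.py: the helper name expand_config, the use of ast.literal_eval to read the dictionary, and the ordering of the combinations are all assumptions made for this example.

```python
import ast
import itertools

# The example config from above, as it might appear in the .txt file.
CONFIG_TEXT = """
{
    'spam_params__foo': ['nested', 'parameter'],
    'foo,bar': [('coupled', 'parameters'), ('in', 'tuple')],
    'eggs': 'this parameter is not iterated'
}
"""


def expand_config(config):
    """Yield one kwargs dict per combination of the list-valued entries.

    Hypothetical helper: the real gridsearch.py may expand configs differently.
    """
    keys = list(config)
    # Only list values are iterated; any other value is treated as fixed.
    choices = [v if isinstance(v, list) else [v] for v in config.values()]
    for combo in itertools.product(*choices):
        kwargs = {}
        for key, value in zip(keys, combo):
            if "," in key:
                # Coupled parameters: each tuple supplies one value per name.
                names = [name.strip() for name in key.split(",")]
                pairs = list(zip(names, value))
            else:
                pairs = [(key, value)]
            for name, val in pairs:
                if "__" in name:
                    # Nested parameter: 'outer__inner' becomes outer={'inner': val}.
                    outer, inner = name.split("__", 1)
                    kwargs.setdefault(outer, {})[inner] = val
                else:
                    kwargs[name] = val
        yield kwargs


config = ast.literal_eval(CONFIG_TEXT.strip())
grid = list(expand_config(config))
for kwargs in grid:
    print(kwargs)
```

Under these assumptions the example config yields four combinations (two nested values times two coupled tuples), each of which could be passed to the script as main(**kwargs).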
The log_path is a filename for logging training information. The gridsearch.py program appends a '.log' suffix to it. The gridsearch script provides an option to save a molecular library generated by the model trained on each set of hyperparameters. For this functionality to work, the 'main' executable routine must take a 'save_path' keyword argument, to which the generated library is saved.
The information to be logged, as well as the nature of the saved library, are left to the implementation of the executable training script.
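A script that meets the requirements above could look like the following minimal sketch. The body of main, the placeholder SMILES strings, the learning_rate keyword, and the run_logged helper are all hypothetical. In particular, using contextlib.redirect_stdout is only one way to send sys.stdout to the log file and is not necessarily how gridsearch.py does it.

```python
import contextlib
import os
import tempfile


def main(save_path=None, **kwargs):
    """Hypothetical training entry point that accepts arbitrary keyword arguments."""
    # Anything printed here lands in the log once stdout is redirected.
    print(f"training with {kwargs}")
    # Placeholder SMILES standing in for a generated molecular library.
    library = ["CCO", "c1ccccc1"]
    if save_path is not None:
        with open(save_path, "w") as handle:
            handle.write("\n".join(library) + "\n")
    return library


def run_logged(func, log_path, **kwargs):
    """Call func(**kwargs) with stdout redirected to log_path + '.log'."""
    log_file = log_path + ".log"
    with open(log_file, "a") as log, contextlib.redirect_stdout(log):
        func(**kwargs)
    return log_file


workdir = tempfile.mkdtemp()
library_file = os.path.join(workdir, "library.smi")
log_file = run_logged(
    main,
    os.path.join(workdir, "my_log_path"),
    save_path=library_file,
    learning_rate=0.001,
)
```

Because main decides what to print and what to write to save_path, the contents of both the log and the saved library are determined by the training script, not by the runner.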
Notes on egfr_demo.py: This is a training routine for reinforcement learning to generate molecules with predicted activity against EGFR. It requires several files to be present before it is run. The lines of egfr_demo.py that set their paths are listed below.
- Line 97: Default path for replay data ('./data/replay_data.smi')
- Line 98: Default path to load pre-trained RL model ('./checkpoints/generator/checkpoint_batch_training')
- Line 105: Path for data used for training ('./data/chembl_22_clean_1576904_sorted_std_final.smi')
- Line 145: Path to load pre-trained classifier model for EGFR activity prediction ('../project/checkpoints/predictor/egfr_rfc')
- Line 295: Path to load experimental EGFR data for comparisons ('../project/datasets/egfr_with_pubchem1.csv')
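Before running egfr_demo.py, it can help to confirm that these paths resolve from the working directory. The sketch below is a hypothetical pre-flight check and is not part of the repository: the labels and the find_missing helper are made up for illustration, and it assumes each default path names a file or directory on disk. If a checkpoint path is really a filename prefix, the check will report it as missing even when the checkpoint is present.

```python
from pathlib import Path

# Default paths copied from the list above; the labels are informal.
REQUIRED_PATHS = {
    "replay data": "./data/replay_data.smi",
    "pre-trained RL model": "./checkpoints/generator/checkpoint_batch_training",
    "training data": "./data/chembl_22_clean_1576904_sorted_std_final.smi",
    "EGFR classifier": "../project/checkpoints/predictor/egfr_rfc",
    "experimental EGFR data": "../project/datasets/egfr_with_pubchem1.csv",
}


def find_missing(paths, base_dir="."):
    """Return the subset of paths that do not exist relative to base_dir."""
    base = Path(base_dir)
    return {label: p for label, p in paths.items() if not (base / p).exists()}


missing = find_missing(REQUIRED_PATHS)
for label, path in missing.items():
    print(f"missing {label}: {path}")
```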