Code cleanup (#132)
* initial cleanup

* remove reporting

* remove metaclass code and add some docs

* update docs a bit

* prefer compute_properties

* remove cli docs

* more docs

* update module docstrings

* more docs and cleanup

* rename test file

* add to config docs more

* update changelog
lilyminium authored Sep 9, 2024
1 parent 00cca08 commit b0d5db5
Showing 26 changed files with 431 additions and 784 deletions.
9 changes: 9 additions & 0 deletions docs/CHANGELOG.md
@@ -14,6 +14,15 @@ The rules for this file:
* accompany each entry with github issue/PR number (Issue #xyz)
-->

## ??

### Authors
- [@lilyminium]

### Changed
- Removed unused, undocumented code paths, and updated docs (PR #132)


## v0.4.0 -- 2024-07-18

This version adds support for lookup tables.
7 changes: 0 additions & 7 deletions docs/cli.md

This file was deleted.

32 changes: 32 additions & 0 deletions docs/config.md
@@ -0,0 +1,32 @@
# Training a GNN using config files

The configuration classes defined in `openff.nagl.config` are expected
to be the main way a user will train a GNN. They are split into
three sections, which are all combined in a [`TrainingConfig`]. This document gives a broad outline of the config classes; please see the examples for how to train a GNN.

## Model config

A model is defined using the classes in `openff.nagl.config.model`.
A model is expected to consist of a single [`ConvolutionModule`],
and any number of [`ReadoutModule`]s. The readout modules are
registered by name so multiple properties can be predicted from a single
GNN.

Moreover, the atom and bond features used to featurize a molecule
are defined in the model config.


## Data config

The datasets used for training, validation, and testing are defined here.
As this class is only used for training or testing a model, a [`DatasetConfig`]
must also define training targets and loss metrics.

## Optimizer config

Here is where the optimizer is configured for training the GNN.

## Training config

The model, data, and optimizer configs are combined in a [`TrainingConfig`], which is then used to create a [`TrainingGNNModel`] and a [`DGLMoleculeDataModule`]
that can be passed to a PyTorch Lightning trainer.
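
To make the workflow concrete, here is a minimal sketch of how these pieces might be wired together. The import paths, the `from_yaml` method on `TrainingConfig`, and the constructor signatures of `TrainingGNNModel` and `DGLMoleculeDataModule` are assumptions for illustration; consult the examples and API reference for the exact spelling.

```python
# A minimal sketch, not a verbatim API listing: the import paths and the
# constructor / ``from_yaml`` signatures below are assumed for illustration.
import pytorch_lightning as pl

from openff.nagl.config import TrainingConfig                              # assumed export
from openff.nagl.training import TrainingGNNModel, DGLMoleculeDataModule  # assumed path

# A training YAML file would contain the model, data, and optimizer sections
training_config = TrainingConfig.from_yaml("training.yaml")  # assumed method

training_model = TrainingGNNModel(training_config)    # wraps the GNN for training
data_module = DGLMoleculeDataModule(training_config)  # builds the train/val/test datasets

trainer = pl.Trainer(max_epochs=200)
trainer.fit(training_model, datamodule=data_module)
```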
154 changes: 106 additions & 48 deletions docs/designing.md
@@ -1,6 +1,12 @@
(designing_a_model)=

# Designing a GCN

Designing a GCN with NAGL primarily involves creating an instance of the [`GNNModel`] class. This can be done straightforwardly in Python:
Designing a GCN with NAGL primarily involves creating an instance of the [`ModelConfig`] class.

## In Python

A `ModelConfig` can be created straightforwardly in Python.

```python
from openff.nagl.features import atoms, bonds
@@ -9,6 +15,17 @@ from openff.nagl.nn import gcn
from openff.nagl.nn.activation import ActivationFunction
from openff.nagl.nn import postprocess

from openff.nagl.config.model import (
    ModelConfig,
    ConvolutionModule, ReadoutModule,
    ConvolutionLayer, ForwardLayer,
)
```

First we can specify our desired features.
These should be instances of the feature classes.

```python
atom_features = (
    atoms.AtomicElement(["C", "H", "O", "N", "P", "S"]),
    atoms.AtomConnectivity(),
@@ -19,67 +36,112 @@ bond_features = (
    bonds.BondOrder(),
    ...
)
```

Next, we can design convolution and readout modules. For example:

```python
convolution_module = ConvolutionModule(
    architecture="SAGEConv",
    # construct 2 layers with dropout 0 (default),
    # hidden feature size 512, and ReLU activation function;
    # these layers can also be individually specified,
    # but here we just duplicate the same layer twice for identical layers
    layers=[
        ConvolutionLayer(
            hidden_feature_size=512,
            activation_function="ReLU",
            aggregator_type="mean"
        )
    ] * 2,
)

# define our readout module/s
# multiple are allowed but let's focus on charges
readout_modules = {
    # key is the name of the output property; any naming is allowed
    "charges": ReadoutModule(
        pooling="atoms",
        postprocess="compute_partial_charges",
        # 1 layer
        layers=[
            ForwardLayer(
                hidden_feature_size=512,
                activation_function="ReLU",
            )
        ],
    )
}
```

model = GNNModel(
We can now bring it all together as a `ModelConfig` and create a `GNNModel`.

```python
model_config = ModelConfig(
version="0.1",
atom_features=atom_features,
bond_features=bond_features,
convolution_architecture=gcn.SAGEConvStack,
n_convolution_hidden_features=128,
n_convolution_layers=3,
n_readout_hidden_features=128,
n_readout_layers=4,
activation_function=ActivationFunction.ReLU,
postprocess_layer=postprocess.ComputePartialCharges,
readout_name=f"am1bcc-charges",
learning_rate=0.001,
convolution=convolution_module,
readouts=readout_modules,
)

model = GNNModel(model_config)
```

## From YAML

Or if you prefer, the same model architecture can be specified as a YAML file:

```yaml
convolution_architecture: SAGEConv
postprocess_layer: compute_partial_charges

activation_function: ReLU
learning_rate: 0.001
n_convolution_hidden_features: 128
n_convolution_layers: 3
n_readout_hidden_features: 128
n_readout_layers: 4

version: '0.1'
convolution:
  architecture: SAGEConv
  layers:
    - hidden_feature_size: 512
      activation_function: ReLU
      dropout: 0
      aggregator_type: mean
    - hidden_feature_size: 512
      activation_function: ReLU
      dropout: 0
      aggregator_type: mean
readouts:
  charges:
    pooling: atoms
    postprocess: compute_partial_charges
    layers:
      - hidden_feature_size: 128
        activation_function: Sigmoid
        dropout: 0
atom_features:
- AtomicElement:
    categories: ["C", "H", "O", "N", "P", "S"]
- AtomConnectivity
- AtomAverageFormalCharge
- AtomHybridization
- AtomInRingOfSize: 3
- AtomInRingOfSize: 4
- AtomInRingOfSize: 5
- AtomInRingOfSize: 6
- name: atomic_element
  categories: ["C", "H", "O", "N", "P", "S"]
- name: atom_connectivity
  categories: [1, 2, 3, 4, 5, 6]
- name: atom_hybridization
- name: atom_in_ring_of_size
  ring_size: 3
- name: atom_in_ring_of_size
  ring_size: 4
- name: atom_in_ring_of_size
  ring_size: 5
- name: atom_in_ring_of_size
  ring_size: 6
bond_features:
- BondOrder
- BondInRingOfSize: 3
- BondInRingOfSize: 4
- BondInRingOfSize: 5
- BondInRingOfSize: 6

- name: bond_is_in_ring
```
And then loaded with the [`GNNModel.from_yaml()`] method:
And then loaded into a config using the [`ModelConfig.from_yaml()`] method:

```python
from openff.nagl import GNNModel
from openff.nagl.config import ModelConfig
model = GNNModel.from_yaml("model.yml")
model = GNNModel(ModelConfig.from_yaml("model.yaml"))
```

Here we'll go through each option, what it means, and where to find the available choices.

[`GNNModel`]: openff.nagl.GNNModel
[`GNNModel.from_yaml_file()`]: openff.nagl.GNNModel.from_yaml_file

(model_features)=
## `atom_features` and `bond_features`

@@ -93,18 +155,14 @@ These arguments specify the featurization scheme for the model (see [](featuriza

## `convolution_architecture`

The `convolution_architecture` argument specifies the structure of the convolution module. Available options are provided in the [`openff.nagl.nn.gcn`] module.
The `convolution_architecture` argument specifies the structure of the convolution module. Available options are provided in the [`openff.nagl.config.model`] module.

[`openff.nagl.nn.gcn`]: openff.nagl.nn.gcn

## Number of Features and Layers

The size and shape of the neural networks in the convolution and readout modules are specified by four arguments:

- `n_convolution_hidden_features`
- `n_convolution_layers`
- `n_readout_hidden_features`
- `n_readout_layers`
Each module comprises a number of layers that must be individually specified.
For example, a [`ConvolutionModule`] consists of specified [`ConvolutionLayer`]s. A [`ReadoutModule`] consists of specified [`ForwardLayer`]s.

The "convolution" arguments define the update network in the convolution module, and the "readout" the network in the readout module (see [](convolution_theory) and [](readout_theory)). Read the `GNNModel` docstring carefully to determine which layers are considered hidden.

4 changes: 2 additions & 2 deletions docs/index.md
@@ -39,7 +39,7 @@ from openff.nagl import GNNModel
model = GNNModel.load("trained_model.pt")
```

Then, the properties the model is trained to predict can be computed with the [`GNNModel.compute_property()`] method, which takes an OpenFF [`Molecule`] object:
Then, the properties the model is trained to predict can be computed with the [`GNNModel.compute_properties()`] method, which takes an OpenFF [`Molecule`] object:

```python
from openff.toolkit import Molecule
@@ -51,7 +51,7 @@ model.compute_property(ethanol)

[`openff.nagl.GNNModel`]: openff.nagl.GNNModel
[`GNNModel.load()`]: openff.nagl.GNNModel.load
[`GNNModel.compute_property()`]: openff.nagl.GNNModel.compute_property
[`GNNModel.compute_properties()`]: openff.nagl.GNNModel.compute_properties
[`Molecule`]: openff.toolkit.topology.Molecule

:::{toctree}
37 changes: 30 additions & 7 deletions docs/theory.md
@@ -88,17 +88,16 @@ Neutral and zwitterionic alanine. These might be considered the same molecular s
:::

:::{admonition} Molecular Graphs in NAGL
Molecular graphs are provided to NAGL via the [`Molecule`] class of the [OpenFF Toolkit]. These objects can be created in a [variety of ways], but the most common is probably via a [SMILES string] or [.SDF file]. NAGL ingests molecular graphs for inference through the [`GNNModel.compute_property()`] method, and for dataset construction through the [`MoleculeRecord.from_openff()`] and [`MoleculeRecord.from_precomputed_openff()`] methods.
Molecular graphs are provided to NAGL via the [`Molecule`] class of the [OpenFF Toolkit]. These objects can be created in a [variety of ways], but the most common is probably via a [SMILES string] or [.SDF file]. NAGL ingests molecular graphs for inference through the [`GNNModel.compute_properties()`] method.
:::

[`Molecule`]: openff.toolkit.topology.Molecule
[SMILES string]: openff.toolkit.topology.Molecule.from_smiles
[.SDF file]: openff.toolkit.topology.Molecule.from_file
[OpenFF Toolkit]: inv:openff.toolkit#index
[variety of ways]: inv:openff.toolkit#users/molecule_cookbook
[`GNNModel.compute_property()`]: openff.nagl.GNNModel.compute_property
[`MoleculeRecord.from_openff()`]: openff.nagl.storage.record.MoleculeRecord.from_openff
[`MoleculeRecord.from_precomputed_openff()`]: openff.nagl.storage.record.MoleculeRecord.from_precomputed_openff
[`GNNModel.compute_properties()`]: openff.nagl.GNNModel.compute_properties


(featurization_theory)=
## Featurization
@@ -117,7 +116,7 @@ NAGL therefore uses **one-hot encoding** for most featurization, in which each f
This prevents the model from assuming that adjacent elements are similar. Note that these values are represented internally as floating point numbers, not single bits. The model's internal representation is therefore free to mix and scale them as needed, which allows the model to represent a carbon atom (#6) with some oxygen character (#8) without the result appearing like nitrogen (#7).
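
As a small illustration of the idea (not NAGL's internal featurization code), one-hot encoding an element over a fixed list of categories looks like this:

```python
# Illustrative one-hot encoding over a fixed category list;
# this is not NAGL's internal implementation.
CATEGORIES = ["C", "H", "O", "N", "P", "S"]

def one_hot(element: str) -> list[float]:
    """Return a vector with 1.0 in the slot matching ``element`` and 0.0 elsewhere."""
    return [1.0 if element == category else 0.0 for category in CATEGORIES]

print(one_hot("C"))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot("O"))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```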

:::{admonition} Featurization in NAGL
Feature templates for [bonds] and [atoms] are available in the [`features`] module. A featurization scheme is composed of a tuple of instances of these classes, which are passed to the `bond_features` and `atom_features` arguments of the [`GNNModel`] constructor. Most of these features simply apply a one-hot encoding directly to the molecular graph, but a few are built on the [`ResonanceEnumerator`] class to support features that take into account multiple resonance forms.
Feature templates for [bonds] and [atoms] are available in the [`features`] module. A featurization scheme is composed of a tuple of instances of these classes, which are passed to the `bond_features` and `atom_features` arguments of the [`ModelConfig`] constructor. Most of these features simply apply a one-hot encoding directly to the molecular graph, but a few are built on the [`ResonanceEnumerator`] class to support features that take into account multiple resonance forms.
:::

[bonds]: openff.nagl.features.bonds
@@ -151,7 +150,7 @@ NAGL's goal is to produce machine-learned models that can compute partial charge

NAGL produces atom embeddings with a message-passing graph convolutional network (GCN). A GCN takes each node's feature vector and iteratively mixes it with those of progressively more distant neighbors to produce an embedding for the node. To start with, NAGL uses an atom's feature vector as its embedding, and its neighbors are the atoms directly bonded to it. On each iteration, the GCN first **aggregates** the feature vectors of an atom's neighbors and the associated bonds, and then **updates** the embedding with the aggregated features of that neighborhood. On the next iteration, the new embedding will be updated again, and the atoms an additional step away will form its new neighborhood.

NAGL's convolution module supports the [GraphSAGE] architecture to train a single neural network to produce an embedding for any atom in any molecule. GraphSAGE uses any of a number of simple mathematical functions for its aggregation step, and trains a neural network to perform the update function. NAGL currently uses simple element-wise averaging of features for its aggregation step, and trains the update network automatically as part of training a model. The update function is thus custom built to produce an embedding for the property we're trying to predict!
NAGL's convolution module supports multiple architectures. The most well-tested is [GraphSAGE], which trains a single neural network to produce an embedding for any atom in any molecule. GraphSAGE uses any of a number of simple mathematical functions for its aggregation step, and trains a neural network to perform the update function. NAGL currently uses simple element-wise averaging of features for its aggregation step, and trains the update network automatically as part of training a model. The update function is thus custom built to produce an embedding for the property we're trying to predict!
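
As a schematic sketch of a single aggregate-then-update iteration (not NAGL's actual implementation), the mean-aggregation step described above might look like:

```python
# Schematic sketch of one GraphSAGE-style convolution step;
# not NAGL's actual implementation.
import numpy as np

def sage_step(embeddings, neighbours, update_network):
    """One aggregate-then-update pass over every atom.

    embeddings:     (n_atoms, n_features) array of current atom embeddings
    neighbours:     dict mapping each atom index to its bonded atom indices
    update_network: callable mixing an atom's embedding with its aggregated neighbourhood
    """
    new_embeddings = np.empty_like(embeddings)
    for atom, bonded in neighbours.items():
        # Aggregate: element-wise mean over the neighbours' embeddings
        aggregated = embeddings[bonded].mean(axis=0)
        # Update: a learned network combines the atom's own embedding with the aggregate
        new_embeddings[atom] = update_network(embeddings[atom], aggregated)
    return new_embeddings
```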

:::{admonition} Note
GraphSAGE has nothing to do with the Sage force field!
@@ -164,7 +163,21 @@ GraphSAGE has nothing to do with the Sage force field!

A trained convolution module takes a molecular graph and produces a representation for each atom that is custom-made for prediction of the desired property. These embeddings are then passed directly to the **Readout module** to predict the properties themselves; in fact, both modules are trained together to optimize their performance.

The readout module consists of a neural network and a **post-processing** function. The post-processing function takes the outputs of the neural network and applies some traditional computation to them before producing the model's final result. NAGL therefore sandwiches its machine learning core between conventional, symbolic computation where chemistry knowledge can be injected.
The readout module consists of a **pooling layer**, a feed-forward neural network and an optional **post-processing** function. The post-processing function takes the outputs of the neural network and applies some traditional computation to them before producing the model's final result. NAGL therefore sandwiches its machine learning core between conventional, symbolic computation where chemistry knowledge can be injected.


### Pooling

The function of the pooling layer is to "pool" the feature vectors
produced by the convolution module into a single representation.

### Postprocessing

NAGL supports a number of post-processing layers.
These should be specified in the [`ModelConfig`] by name,
rather than as an instance of the class.

#### compute_partial_charges

NAGL uses an interesting application of the post-processing layer when calculating charges: an application of the **charge equilibration method** [inspired by electronegativity equalization]. The readout neural network does not directly infer partial charges; instead, it predicts two variables that are interpreted as **electronegativity** $e$ and **hardness** $s$. These are the first two derivatives in charge of the molecule's potential energy, and respectively quantify the atom's proclivity to hold negative charge and the atom's resistance to changing charge. An atom's partial charge $q$ is then computed from these properties and the molecule's net charge $Q$:

@@ -181,3 +194,13 @@ NAGL's machine learning models can be supplied a post-processing function with t
:::

[inspired by electronegativity equalization]: https://arxiv.org/pdf/1909.07903.pdf
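
As a rough sketch of the charge-equilibration idea behind `compute_partial_charges`, assuming the standard quadratic energy form $\sum_i (e_i q_i + s_i q_i^2)$ minimized subject to $\sum_i q_i = Q$ (NAGL's exact functional form and conventions may differ), the analytic solution can be written as follows. The regularized variant described below would additionally use a predicted reference charge $q_0$, and is not sketched here.

```python
# Illustrative charge equilibration: minimise sum_i (e_i*q_i + s_i*q_i**2)
# subject to sum_i q_i = Q. This is a sketch of the idea, not NAGL's code;
# the exact energy expression NAGL uses may differ.
import numpy as np

def equilibrate_charges(electronegativity, hardness, total_charge):
    """Solve for partial charges given per-atom e_i and s_i and the net charge Q."""
    inverse = 1.0 / (2.0 * hardness)  # 1 / (2 s_i)
    # Lagrange multiplier enforcing the net-charge constraint
    lam = (total_charge + np.sum(electronegativity * inverse)) / np.sum(inverse)
    return (lam - electronegativity) * inverse  # q_i = (lambda - e_i) / (2 s_i)

# A fictitious three-atom example with net charge 0
e = np.array([0.5, 0.3, 0.4])
s = np.array([1.0, 2.0, 1.5])
q = equilibrate_charges(e, s, total_charge=0.0)
print(q, q.sum())  # the charges sum to the requested net charge
```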

#### regularized_compute_partial_charges

This is a modification of `compute_partial_charges` that
also receives an initial charge, $q_0$, and uses the electronegativity
and hardness to compute a "charge correction" instead.

Any neural network using this as a postprocessing layer
therefore predicts three quantities: the initial charge $q_0$,
the electronegativity $e$, and the hardness $s$.