Code cleanup (#132)
* initial cleanup

* remove reporting

* remove metaclass code and add some docs

* update docs a bit

* prefer compute_properties

* remove cli docs

* more docs

* update module docstrings

* more docs and cleanup

* rename test file

* add to config docs more

* update changelog
lilyminium authored Sep 9, 2024
1 parent 00cca08 commit b0d5db5
Showing 26 changed files with 431 additions and 784 deletions.
9 changes: 9 additions & 0 deletions docs/CHANGELOG.md
@@ -14,6 +14,15 @@ The rules for this file:
* accompany each entry with github issue/PR number (Issue #xyz)
-->

## ??

### Authors
- [@lilyminium]

### Changed
- Removed unused, undocumented code paths, and updated docs (PR #132)


## v0.4.0 -- 2024-07-18

This version adds support for lookup tables.
7 changes: 0 additions & 7 deletions docs/cli.md

This file was deleted.

32 changes: 32 additions & 0 deletions docs/config.md
@@ -0,0 +1,32 @@
# Training a GNN using config files

The configuration classes defined in `openff.nagl.config` are expected
to be the main way a user will train a GNN. They are split into
three sections, which are all combined in a [`TrainingConfig`]. This document gives a broad outline of the config classes; please see the examples for how to train a GNN.

## Model config

A model is defined using the classes in `openff.nagl.config.model`.
A model is expected to consist of a single [`ConvolutionModule`],
and any number of [`ReadoutModule`]s. The readout modules are
registered by name so multiple properties can be predicted from a single
GNN.

Moreover, the atom and bond features used to featurize a molecule
are defined in the model config.


## Data config

The datasets used for training, validation, and testing are defined here.
As this class is only used for training or testing a model, a [`DatasetConfig`]
must also define training targets and loss metrics.

## Optimizer config

Here is where the optimizer is configured for training the GNN.

## Training config

The model, data, and optimizer configs are combined in a [`TrainingConfig`], which is then used to create a [`TrainingGNNModel`] and a [`DGLMoleculeDataModule`]
that can be passed to a PyTorch Lightning trainer.
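
To make the workflow concrete, here is a minimal sketch of how these pieces might be wired together. The import paths, the `from_yaml` method on `TrainingConfig`, and the constructor signatures of `TrainingGNNModel` and `DGLMoleculeDataModule` are assumptions for illustration; consult the examples and API reference for the exact spelling.

```python
# A minimal sketch, not a verbatim API listing: the import paths and the
# constructor / ``from_yaml`` signatures below are assumed for illustration.
import pytorch_lightning as pl

from openff.nagl.config import TrainingConfig                              # assumed export
from openff.nagl.training import TrainingGNNModel, DGLMoleculeDataModule  # assumed path

# A training YAML file would contain the model, data, and optimizer sections
training_config = TrainingConfig.from_yaml("training.yaml")  # assumed method

training_model = TrainingGNNModel(training_config)    # wraps the GNN for training
data_module = DGLMoleculeDataModule(training_config)  # builds the train/val/test datasets

trainer = pl.Trainer(max_epochs=200)
trainer.fit(training_model, datamodule=data_module)
```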
154 changes: 106 additions & 48 deletions docs/designing.md
@@ -1,6 +1,12 @@
(designing_a_model)=

# Designing a GCN

Designing a GCN with NAGL primarily involves creating an instance of the [`GNNModel`] class. This can be done straightforwardly in Python:
Designing a GCN with NAGL primarily involves creating an instance of the [`ModelConfig`] class.

## In Python

A `ModelConfig` can be created straightforwardly in Python.

```python
from openff.nagl.features import atoms, bonds
@@ -9,6 +15,17 @@ from openff.nagl.nn import gcn
from openff.nagl.nn.activation import ActivationFunction
from openff.nagl.nn import postprocess

from openff.nagl.config.model import (
    ModelConfig,
    ConvolutionModule, ReadoutModule,
    ConvolutionLayer, ForwardLayer,
)
```

First we can specify our desired features.
These should be instances of the feature classes.

```python
atom_features = (
    atoms.AtomicElement(["C", "H", "O", "N", "P", "S"]),
    atoms.AtomConnectivity(),
@@ -19,67 +36,112 @@ bond_features = (
    bonds.BondOrder(),
    ...
)
```

Next, we can design convolution and readout modules. For example:

```python
convolution_module = ConvolutionModule(
    architecture="SAGEConv",
    # construct 2 layers with dropout 0 (default),
    # hidden feature size 512, and ReLU activation function;
    # these layers can also be individually specified,
    # but here we just duplicate the same layer twice for identical layers
    layers=[
        ConvolutionLayer(
            hidden_feature_size=512,
            activation_function="ReLU",
            aggregator_type="mean"
        )
    ] * 2,
)

# define our readout module/s
# multiple are allowed but let's focus on charges
readout_modules = {
    # key is the name of the output property; any naming is allowed
    "charges": ReadoutModule(
        pooling="atoms",
        postprocess="compute_partial_charges",
        # 1 layer
        layers=[
            ForwardLayer(
                hidden_feature_size=512,
                activation_function="ReLU",
            )
        ],
    )
}
```

model = GNNModel(
We can now bring it all together as a `ModelConfig` and create a `GNNModel`.

```python
model_config = ModelConfig(
version="0.1",
atom_features=atom_features,
bond_features=bond_features,
convolution_architecture=gcn.SAGEConvStack,
n_convolution_hidden_features=128,
n_convolution_layers=3,
n_readout_hidden_features=128,
n_readout_layers=4,
activation_function=ActivationFunction.ReLU,
postprocess_layer=postprocess.ComputePartialCharges,
readout_name=f"am1bcc-charges",
learning_rate=0.001,
convolution=convolution_module,
readouts=readout_modules,
)

model = GNNModel(model_config)
```

## From YAML

Or if you prefer, the same model architecture can be specified as a YAML file:

```yaml
convolution_architecture: SAGEConv
postprocess_layer: compute_partial_charges

activation_function: ReLU
learning_rate: 0.001
n_convolution_hidden_features: 128
n_convolution_layers: 3
n_readout_hidden_features: 128
n_readout_layers: 4

version: '0.1'
convolution:
  architecture: SAGEConv
  layers:
    - hidden_feature_size: 512
      activation_function: ReLU
      dropout: 0
      aggregator_type: mean
    - hidden_feature_size: 512
      activation_function: ReLU
      dropout: 0
      aggregator_type: mean
readouts:
  charges:
    pooling: atoms
    postprocess: compute_partial_charges
    layers:
      - hidden_feature_size: 128
        activation_function: Sigmoid
        dropout: 0
atom_features:
- AtomicElement:
    categories: ["C", "H", "O", "N", "P", "S"]
- AtomConnectivity
- AtomAverageFormalCharge
- AtomHybridization
- AtomInRingOfSize: 3
- AtomInRingOfSize: 4
- AtomInRingOfSize: 5
- AtomInRingOfSize: 6
- name: atomic_element
  categories: ["C", "H", "O", "N", "P", "S"]
- name: atom_connectivity
  categories: [1, 2, 3, 4, 5, 6]
- name: atom_hybridization
- name: atom_in_ring_of_size
  ring_size: 3
- name: atom_in_ring_of_size
  ring_size: 4
- name: atom_in_ring_of_size
  ring_size: 5
- name: atom_in_ring_of_size
  ring_size: 6
bond_features:
- BondOrder
- BondInRingOfSize: 3
- BondInRingOfSize: 4
- BondInRingOfSize: 5
- BondInRingOfSize: 6

- name: bond_is_in_ring
```
And then loaded with the [`GNNModel.from_yaml()`] method:
And then loaded into a config using the [`ModelConfig.from_yaml()`] method:

```python
from openff.nagl import GNNModel
from openff.nagl.config import ModelConfig
model = GNNModel.from_yaml("model.yml")
model = GNNModel(ModelConfig.from_yaml("model.yaml"))
```

Here we'll go through each option, what it means, and where to find the available choices.

[`GNNModel`]: openff.nagl.GNNModel
[`GNNModel.from_yaml_file()`]: openff.nagl.GNNModel.from_yaml_file

(model_features)=
## `atom_features` and `bond_features`

@@ -93,18 +155,14 @@ These arguments specify the featurization scheme for the model (see [](featuriza

## `convolution_architecture`

The `convolution_architecture` argument specifies the structure of the convolution module. Available options are provided in the [`openff.nagl.nn.gcn`] module.
The `convolution_architecture` argument specifies the structure of the convolution module. Available options are provided in the [`openff.nagl.config.model`] module.

[`openff.nagl.nn.gcn`]: openff.nagl.nn.gcn

## Number of Features and Layers

The size and shape of the neural networks in the convolution and readout modules are specified by four arguments:

- `n_convolution_hidden_features`
- `n_convolution_layers`
- `n_readout_hidden_features`
- `n_readout_layers`
Each module comprises a number of layers that must be individually specified.
For example, a [`ConvolutionModule`] consists of specified [`ConvolutionLayer`]s. A [`ReadoutModule`] consists of specified [`ForwardLayer`]s.

The "convolution" arguments define the update network in the convolution module, and the "readout" the network in the readout module (see [](convolution_theory) and [](readout_theory)). Read the `GNNModel` docstring carefully to determine which layers are considered hidden.

4 changes: 2 additions & 2 deletions docs/index.md
@@ -39,7 +39,7 @@ from openff.nagl import GNNModel
model = GNNModel.load("trained_model.pt")
```

Then, the properties the model is trained to predict can be computed with the [`GNNModel.compute_property()`] method, which takes an OpenFF [`Molecule`] object:
Then, the properties the model is trained to predict can be computed with the [`GNNModel.compute_properties()`] method, which takes an OpenFF [`Molecule`] object:

```python
from openff.toolkit import Molecule
@@ -51,7 +51,7 @@ model.compute_property(ethanol)

[`openff.nagl.GNNModel`]: openff.nagl.GNNModel
[`GNNModel.load()`]: openff.nagl.GNNModel.load
[`GNNModel.compute_property()`]: openff.nagl.GNNModel.compute_property
[`GNNModel.compute_properties()`]: openff.nagl.GNNModel.compute_properties
[`Molecule`]: openff.toolkit.topology.Molecule

:::{toctree}
37 changes: 30 additions & 7 deletions docs/theory.md
@@ -88,17 +88,16 @@ Neutral and zwitterionic alanine. These might be considered the same molecular s
:::

:::{admonition} Molecular Graphs in NAGL
Molecular graphs are provided to NAGL via the [`Molecule`] class of the [OpenFF Toolkit]. These objects can be created in a [variety of ways], but the most common is probably via a [SMILES string] or [.SDF file]. NAGL ingests molecular graphs for inference through the [`GNNModel.compute_property()`] method, and for dataset construction through the [`MoleculeRecord.from_openff()`] and [`MoleculeRecord.from_precomputed_openff()`] methods.
Molecular graphs are provided to NAGL via the [`Molecule`] class of the [OpenFF Toolkit]. These objects can be created in a [variety of ways], but the most common is probably via a [SMILES string] or [.SDF file]. NAGL ingests molecular graphs for inference through the [`GNNModel.compute_properties()`] method.
:::

[`Molecule`]: openff.toolkit.topology.Molecule
[SMILES string]: openff.toolkit.topology.Molecule.from_smiles
[.SDF file]: openff.toolkit.topology.Molecule.from_file
[OpenFF Toolkit]: inv:openff.toolkit#index
[variety of ways]: inv:openff.toolkit#users/molecule_cookbook
[`GNNModel.compute_property()`]: openff.nagl.GNNModel.compute_property
[`MoleculeRecord.from_openff()`]: openff.nagl.storage.record.MoleculeRecord.from_openff
[`MoleculeRecord.from_precomputed_openff()`]: openff.nagl.storage.record.MoleculeRecord.from_precomputed_openff
[`GNNModel.compute_properties()`]: openff.nagl.GNNModel.compute_properties


(featurization_theory)=
## Featurization
@@ -117,7 +116,7 @@ NAGL therefore uses **one-hot encoding** for most featurization, in which each f
This prevents the model from assuming that adjacent elements are similar. Note that these values are represented internally as floating point numbers, not single bits. The model's internal representation is therefore free to mix and scale them as needed, which allows the model to represent a carbon atom (#6) with some oxygen character (#8) without the result appearing like nitrogen (#7).
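
As a small illustration of the idea (not NAGL's internal featurization code), one-hot encoding an element over a fixed list of categories looks like this:

```python
# Illustrative one-hot encoding over a fixed category list;
# this is not NAGL's internal implementation.
CATEGORIES = ["C", "H", "O", "N", "P", "S"]

def one_hot(element: str) -> list[float]:
    """Return a vector with 1.0 in the slot matching ``element`` and 0.0 elsewhere."""
    return [1.0 if element == category else 0.0 for category in CATEGORIES]

print(one_hot("C"))  # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(one_hot("O"))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```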

:::{admonition} Featurization in NAGL
Feature templates for [bonds] and [atoms] are available in the [`features`] module. A featurization scheme is composed of a tuple of instances of these classes, which are passed to the `bond_features` and `atom_features` arguments of the [`GNNModel`] constructor. Most of these features simply apply a one-hot encoding directly to the molecular graph, but a few are built on the [`ResonanceEnumerator`] class to support features that take into account multiple resonance forms.
Feature templates for [bonds] and [atoms] are available in the [`features`] module. A featurization scheme is composed of a tuple of instances of these classes, which are passed to the `bond_features` and `atom_features` arguments of the [`ModelConfig`] constructor. Most of these features simply apply a one-hot encoding directly to the molecular graph, but a few are built on the [`ResonanceEnumerator`] class to support features that take into account multiple resonance forms.
:::

[bonds]: openff.nagl.features.bonds
@@ -151,7 +150,7 @@ NAGL's goal is to produce machine-learned models that can compute partial charge

NAGL produces atom embeddings with a message-passing graph convolutional network (GCN). A GCN takes each node's feature vector and iteratively mixes it with those of progressively more distant neighbors to produce an embedding for the node. To start with, NAGL uses an atom's feature vector as its embedding, and its neighbors are the atoms directly bonded to it. On each iteration, the GCN first **aggregates** the feature vectors of an atom's neighbors and the associated bonds, and then **updates** the embedding with the aggregated features of that neighborhood. On the next iteration, the new embedding will be updated again, and the atoms an additional step away will form its new neighborhood.

NAGL's convolution module supports the [GraphSAGE] architecture to train a single neural network to produce an embedding for any atom in any molecule. GraphSAGE uses any of a number of simple mathematical functions for its aggregation step, and trains a neural network to perform the update function. NAGL currently uses simple element-wise averaging of features for its aggregation step, and trains the update network automatically as part of training a model. The update function is thus custom built to produce an embedding for the property we're trying to predict!
NAGL's convolution module supports multiple architectures. The most well-tested is [GraphSAGE], which trains a single neural network to produce an embedding for any atom in any molecule. GraphSAGE uses any of a number of simple mathematical functions for its aggregation step, and trains a neural network to perform the update function. NAGL currently uses simple element-wise averaging of features for its aggregation step, and trains the update network automatically as part of training a model. The update function is thus custom built to produce an embedding for the property we're trying to predict!
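
As a schematic sketch of a single aggregate-then-update iteration (not NAGL's actual implementation), the mean-aggregation step described above might look like:

```python
# Schematic sketch of one GraphSAGE-style convolution step;
# not NAGL's actual implementation.
import numpy as np

def sage_step(embeddings, neighbours, update_network):
    """One aggregate-then-update pass over every atom.

    embeddings:     (n_atoms, n_features) array of current atom embeddings
    neighbours:     dict mapping each atom index to its bonded atom indices
    update_network: callable mixing an atom's embedding with its aggregated neighbourhood
    """
    new_embeddings = np.empty_like(embeddings)
    for atom, bonded in neighbours.items():
        # Aggregate: element-wise mean over the neighbours' embeddings
        aggregated = embeddings[bonded].mean(axis=0)
        # Update: a learned network combines the atom's own embedding with the aggregate
        new_embeddings[atom] = update_network(embeddings[atom], aggregated)
    return new_embeddings
```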

:::{admonition} Note
GraphSAGE has nothing to do with the Sage force field!
@@ -164,7 +163,21 @@ GraphSAGE has nothing to do with the Sage force field!

A trained convolution module takes a molecular graph and produces a representation for each atom that is custom-made for prediction of the desired property. These embeddings are then passed directly to the **Readout module** to predict the properties themselves; in fact, both modules are trained together to optimize their performance.

The readout module consists of a neural network and a **post-processing** function. The post-processing function takes the outputs of the neural network and applies some traditional computation to them before producing the model's final result. NAGL therefore sandwiches its machine learning core between conventional, symbolic computation where chemistry knowledge can be injected.
The readout module consists of a **pooling layer**, a feed-forward neural network and an optional **post-processing** function. The post-processing function takes the outputs of the neural network and applies some traditional computation to them before producing the model's final result. NAGL therefore sandwiches its machine learning core between conventional, symbolic computation where chemistry knowledge can be injected.


### Pooling

The function of the pooling layer is to "pool" the feature vectors
produced by the convolution module into a single representation.

### Postprocessing

NAGL supports a number of post-processing layers.
These should be specified in the [`ModelConfig`] by name,
rather than as an instance of the class.

#### compute_partial_charges

NAGL uses an interesting application of the post-processing layer when calculating charges: an application of the **charge equilibration method** [inspired by electronegativity equalization]. The readout neural network does not directly infer partial charges; instead, it predicts two variables that are interpreted as **electronegativity** $e$ and **hardness** $s$. These are the first two derivatives in charge of the molecule's potential energy, and respectively quantify the atom's proclivity to hold negative charge and the atom's resistance to changing charge. An atom's partial charge $q$ is then computed from these properties and the molecule's net charge $Q$:

@@ -181,3 +194,13 @@ NAGL's machine learning models can be supplied a post-processing function with t
:::

[inspired by electronegativity equalization]: https://arxiv.org/pdf/1909.07903.pdf
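
As a rough sketch of the charge-equilibration idea behind `compute_partial_charges`, assuming the standard quadratic energy form $\sum_i (e_i q_i + s_i q_i^2)$ minimized subject to $\sum_i q_i = Q$ (NAGL's exact functional form and conventions may differ), the analytic solution can be written as follows. The regularized variant described below would additionally use a predicted reference charge $q_0$, and is not sketched here.

```python
# Illustrative charge equilibration: minimise sum_i (e_i*q_i + s_i*q_i**2)
# subject to sum_i q_i = Q. This is a sketch of the idea, not NAGL's code;
# the exact energy expression NAGL uses may differ.
import numpy as np

def equilibrate_charges(electronegativity, hardness, total_charge):
    """Solve for partial charges given per-atom e_i and s_i and the net charge Q."""
    inverse = 1.0 / (2.0 * hardness)  # 1 / (2 s_i)
    # Lagrange multiplier enforcing the net-charge constraint
    lam = (total_charge + np.sum(electronegativity * inverse)) / np.sum(inverse)
    return (lam - electronegativity) * inverse  # q_i = (lambda - e_i) / (2 s_i)

# A fictitious three-atom example with net charge 0
e = np.array([0.5, 0.3, 0.4])
s = np.array([1.0, 2.0, 1.5])
q = equilibrate_charges(e, s, total_charge=0.0)
print(q, q.sum())  # the charges sum to the requested net charge
```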

#### regularized_compute_partial_charges

This is a modification of `compute_partial_charges` that
also receives an initial charge, $q_0$, and uses the electronegativity
and hardness to compute a "charge correction" instead.

Any neural network using this as a postprocessing layer
therefore predicts three quantities: the initial charge $q_0$,
the electronegativity $e$, and the hardness $s$.