ML DataLoaders #115

Merged Feb 3, 2022 · 125 commits

Commits
d797589
First implementation of Keras loader and generator
al-rigazzi Nov 5, 2021
13ba12f
Improve naming conventions
al-rigazzi Nov 5, 2021
5fe08eb
Enhance TF generator flexibility
al-rigazzi Nov 15, 2021
252c981
Adjustments to TF API
al-rigazzi Nov 22, 2021
3c83441
Add Torch dataset
al-rigazzi Nov 24, 2021
e2f96d8
Add Pytorch Dataset
al-rigazzi Nov 29, 2021
78dddb7
Add torch iterable and multiprocess load
al-rigazzi Nov 30, 2021
ff805b3
Add Horovod to Torch loader example
al-rigazzi Dec 2, 2021
6d4cdc2
Unify data loader API
al-rigazzi Dec 2, 2021
1ed2bb5
Add superclass to ml loaders
al-rigazzi Dec 2, 2021
b683bb2
Added docstrings to base downloader class
al-rigazzi Dec 6, 2021
ef00694
Update ML docs and TF tutorial
al-rigazzi Dec 9, 2021
b74796e
Apply style
al-rigazzi Dec 9, 2021
4094333
Update Torch ML training tutorial
al-rigazzi Dec 10, 2021
daade7c
Add documentation of ML training
al-rigazzi Dec 12, 2021
19a58b9
Add ML API to docs
al-rigazzi Dec 12, 2021
74cd1a9
Fix typo in docs
al-rigazzi Dec 13, 2021
62fc65b
Add parameterized command line arguments
al-rigazzi Nov 1, 2021
f212e25
Fix generator to accept multiple tags on same line
al-rigazzi Nov 1, 2021
fb1fd36
Apply style
al-rigazzi Nov 1, 2021
92b07e6
Update comments of ensemble generation
al-rigazzi Nov 1, 2021
ee8827e
Include reviewer's suggestions
al-rigazzi Nov 4, 2021
e41202c
implement create_run_settings
Nov 8, 2021
375bced
Split out create_run_settings
Nov 8, 2021
6c3d415
Migrate tests to use create_run_settings
Nov 9, 2021
fb5f3c2
Account for null case in create_run_settings
Nov 19, 2021
05403f7
Format and make style
Nov 21, 2021
31af7ef
Remove pandas from experiment.py
MattToast Nov 30, 2021
9c1547f
Change table fmt to match pd.df, run black
MattToast Dec 3, 2021
47dd489
ake style
MattToast Dec 3, 2021
616e2b0
refactor test to work with str from summary
MattToast Dec 7, 2021
04602ae
add summary test to test_testexperiment.py, refactor test_simple_enti…
MattToast Dec 7, 2021
cd6977f
Experiment.create_batch_settings
Nov 21, 2021
64699f4
Add generic batch test with create_batch_settings
Dec 3, 2021
59d97ff
Add LSF test to test_create_batch
Dec 3, 2021
94e9a55
Add format to summary as an optional parameter
MattToast Dec 9, 2021
9f35503
add tabulate format link to docstring
MattToast Dec 9, 2021
66e6727
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
52e44e8
Fix incorrect node setting for BsubBatchSettings
al-rigazzi Dec 7, 2021
b5e5621
Integrate new batch commands
al-rigazzi Dec 7, 2021
34fb974
Patched broken tests for LSF
al-rigazzi Dec 7, 2021
474d81c
Unify batch settings creation
al-rigazzi Dec 8, 2021
f56c546
Fix LSF problem with kwargs and project
al-rigazzi Dec 8, 2021
d4194bc
Fix multiple account for LSF batch settings
al-rigazzi Dec 8, 2021
a94cb8f
Remove print statements
al-rigazzi Dec 8, 2021
06e9e64
Add check of qstat output length in pbsParser
al-rigazzi Dec 8, 2021
b0220f2
Apply style
al-rigazzi Nov 1, 2021
8689a69
Migrate tests to use create_run_settings
Nov 9, 2021
92bdbec
Format and make style
Nov 21, 2021
7033dfe
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
a10e94b
Fix incorrect node setting for BsubBatchSettings
al-rigazzi Dec 7, 2021
bbf13b4
Add Generator Support for Directories (#88)
MattToast Jan 7, 2022
70e20f7
Unify data loader API
al-rigazzi Dec 2, 2021
b44e879
Apply style
al-rigazzi Dec 9, 2021
f99f210
Apply style
al-rigazzi Nov 1, 2021
2bd460e
Migrate tests to use create_run_settings
Nov 9, 2021
6739b7b
Format and make style
Nov 21, 2021
432c8ff
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
282c44c
Rebased and updated ML loader
al-rigazzi Jan 11, 2022
b11208c
Apply style
al-rigazzi Nov 1, 2021
03fd007
Migrate tests to use create_run_settings
Nov 9, 2021
c71d091
Format and make style
Nov 21, 2021
579632e
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
8913ccb
Fix incorrect node setting for BsubBatchSettings
al-rigazzi Dec 7, 2021
57696db
Add Generator Support for Directories (#88)
MattToast Jan 7, 2022
9cf3e21
Unify data loader API
al-rigazzi Dec 2, 2021
4c86d86
Apply style
al-rigazzi Dec 9, 2021
e6ecf4e
Apply style
al-rigazzi Nov 1, 2021
8c7e0f1
Format and make style
Nov 21, 2021
4d84ac4
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
4d348f6
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
ee90aab
Add Generator Support for Directories (#88)
MattToast Jan 7, 2022
0d7bf8d
Add parameterized command line arguments
al-rigazzi Nov 1, 2021
4034163
Fix generator to accept multiple tags on same line
al-rigazzi Nov 1, 2021
ac8a437
Apply style
al-rigazzi Nov 1, 2021
6f72ba8
Update comments of ensemble generation
al-rigazzi Nov 1, 2021
1349bec
Include reviewer's suggestions
al-rigazzi Nov 4, 2021
c627ac4
implement create_run_settings
Nov 8, 2021
f344f0b
Split out create_run_settings
Nov 8, 2021
2015b68
Migrate tests to use create_run_settings
Nov 9, 2021
20d4fc5
Account for null case in create_run_settings
Nov 19, 2021
23ae580
Format and make style
Nov 21, 2021
77927a1
Remove pandas from experiment.py
MattToast Nov 30, 2021
1a90947
Change table fmt to match pd.df, run black
MattToast Dec 3, 2021
9034f47
ake style
MattToast Dec 3, 2021
72638ef
refactor test to work with str from summary
MattToast Dec 7, 2021
8c3088a
add summary test to test_testexperiment.py, refactor test_simple_enti…
MattToast Dec 7, 2021
268763d
Experiment.create_batch_settings
Nov 21, 2021
5fcf288
Add generic batch test with create_batch_settings
Dec 3, 2021
3ea5e66
Add LSF test to test_create_batch
Dec 3, 2021
d56dfe0
Add format to summary as an optional parameter
MattToast Dec 9, 2021
d59f7ab
add tabulate format link to docstring
MattToast Dec 9, 2021
0a2e127
Add missing functions to LSFSettings API
al-rigazzi Nov 22, 2021
f5b9845
Fix incorrect node setting for BsubBatchSettings
al-rigazzi Dec 7, 2021
e79751f
Integrate new batch commands
al-rigazzi Dec 7, 2021
3056241
Patched broken tests for LSF
al-rigazzi Dec 7, 2021
ead3673
Unify batch settings creation
al-rigazzi Dec 8, 2021
7a932e0
Fix LSF problem with kwargs and project
al-rigazzi Dec 8, 2021
31a543f
Fix multiple account for LSF batch settings
al-rigazzi Dec 8, 2021
5236867
Remove print statements
al-rigazzi Dec 8, 2021
d266b2d
Add Generator Support for Directories (#88)
MattToast Jan 7, 2022
3517ae7
Remove redundant code from ml loaders
al-rigazzi Jan 17, 2022
8d6bfb4
Remove old code from ml.torch
al-rigazzi Jan 17, 2022
29ab033
Rebase on develop
al-rigazzi Jan 17, 2022
2072328
Merge branch 'develop' into trainer
al-rigazzi Jan 17, 2022
a5bb69d
Address reviewers' comment on code
al-rigazzi Jan 31, 2022
cba90b9
Add workaround for old SmartRedis versions
al-rigazzi Jan 31, 2022
bdb687a
Fix docstring
al-rigazzi Jan 31, 2022
b25fc23
Add Build Matrix for CI (#130)
EricGustin Jan 28, 2022
a2d1f89
Remove numpy as a dependency (#132)
EricGustin Jan 28, 2022
8284a22
Update Tutorials for Release (#132)
al-rigazzi Jan 31, 2022
2676c91
Docker Documentation Builds (#133)
Jan 28, 2022
c88ce3c
Update build_docs.yml
Jan 28, 2022
34ca199
Remove Duplicate Tutorials
MattToast Jan 28, 2022
d3d1eb0
Update Tutorials for Release (#132)
MattToast Jan 28, 2022
853e8c6
Docker Documentation Builds (#133)
Jan 28, 2022
b85619b
Update build_docs.yml
Jan 28, 2022
353269f
Merge branch 'develop' into trainer
al-rigazzi Jan 31, 2022
fa440ef
Reduce torch nn size in dataloader test
al-rigazzi Jan 31, 2022
a3f4a09
Remove logging from test
al-rigazzi Jan 31, 2022
dba0cac
Address reviewers' comments on documentation
al-rigazzi Jan 31, 2022
3ad1f1a
Modify docstrings and adapt class names
al-rigazzi Jan 31, 2022
d96b7bb
Update documentation
al-rigazzi Jan 31, 2022
28c8933
Use SmartSim logger
al-rigazzi Jan 31, 2022
2fd37d1
Add deprecation warning and make import not fatal
al-rigazzi Jan 31, 2022
47 changes: 43 additions & 4 deletions doc/api/smartsim_api.rst
@@ -454,23 +454,62 @@ Model
:inherited-members:


Machine Learning
================

.. _ml_api:

SmartSim includes built-in utilities for supporting TensorFlow, Keras, and Pytorch.

TensorFlow
==========
----------

.. _smartsim_tf_api:

SmartSim includes built-in utilities for supporting TensorFlow and Keras in SmartSim.
SmartSim includes built-in utilities for supporting TensorFlow and Keras in training and inference.

.. currentmodule:: smartsim.tf.utils
.. currentmodule:: smartsim.ml.tf.utils

.. autosummary::

freeze_model

.. automodule:: smartsim.tf.utils
.. automodule:: smartsim.ml.tf.utils
:members:


.. currentmodule:: smartsim.ml.tf.data

.. autoclass:: StaticDataGenerator
:members:
:show-inheritance:
:inherited-members:

.. autoclass:: DataGenerator
:members:
:show-inheritance:
:inherited-members:


PyTorch
----------

.. _smartsim_torch_api:

SmartSim includes built-in utilities for supporting PyTorch in training and inference.

.. currentmodule:: smartsim.ml.torch.data

.. autoclass:: StaticDataGenerator
:members:
:show-inheritance:
:inherited-members:

.. autoclass:: DataGenerator
:members:
:show-inheritance:
:inherited-members:

Slurm
=====

2 changes: 1 addition & 1 deletion doc/index.rst
@@ -18,7 +18,7 @@
tutorials/05_starting_ray/05_starting_ray
tutorials/using_clients
tutorials/lattice_boltz_analysis
tutorials/inference
tutorials/machine_learning


.. toctree::
195 changes: 176 additions & 19 deletions doc/tutorials/inference.rst → doc/tutorials/machine_learning.rst
@@ -66,8 +66,8 @@ by SmartSim.
The following code examples do not include code to train the models shown.


4.2 PyTorch
===========
4.1.1 PyTorch
-------------

.. _TorchScript: https://pytorch.org/docs/stable/jit.html
.. _PyTorch: https://pytorch.org/
@@ -172,8 +172,8 @@ If running on CPU, be sure to change the argument in the ``set_model`` call
above to ``CPU``.


4.2 TensorFlow and Keras
========================
4.1.2 TensorFlow and Keras
--------------------------

.. _TensorFlow: https://www.tensorflow.org/
.. _Keras: https://keras.io/
@@ -253,8 +253,8 @@ returns these values for convenience as shown below.
print(pred)


4.3 ONNX
========
4.1.3 ONNX
----------

.. _Scikit-learn: https://scikit-learn.org
.. _XGBoost: https://xgboost.readthedocs.io
@@ -271,21 +271,21 @@ Libraries are supported by ONNX and can be readily used with SmartSim.

Some popular ones are:

- `Scikit-learn`_
- `XGBoost`_
- `CatBoost`_
- `TensorFlow`_
- `Keras`_
- `PyTorch`_
- `LightGBM`_
- `libsvm`_
- `Scikit-learn`_
- `XGBoost`_
- `CatBoost`_
- `TensorFlow`_
- `Keras`_
- `PyTorch`_
- `LightGBM`_
- `libsvm`_

As well as some that are not listed. There are also many tools to help convert
models to ONNX.

- `onnxmltools`_
- `skl2onnx`_
- `tensorflow-onnx`_
- `onnxmltools`_
- `skl2onnx`_
- `tensorflow-onnx`_

And PyTorch has its own converter.

@@ -298,7 +298,7 @@ These scripts can be used with the :ref:`SmartSim code <infrastructure_code>`
above to launch an inference session with any of the supported ONNX libraries.

KMeans
------
++++++

.. _skl2onnx.to_onnx: http://onnx.ai/sklearn-onnx/auto_examples/plot_convert_syntax.html

@@ -328,7 +328,7 @@ with two ``outputs``.


Random Forest
-------------
+++++++++++++

The Random Forest example uses the Iris dataset from Scikit Learn to train a
RandomForestRegressor. As with the other examples, the skl2onnx function
@@ -351,3 +351,160 @@ RandomForestRegressor. As with the other examples, the skl2onnx function
client.set_model("rf_regressor", model, "ONNX", device="CPU")
client.run_model("rf_regressor", inputs="input", outputs="output")
print(client.get_tensor("output"))


4.2 Online training with SmartSim
=================================

A SmartSim ``Orchestrator`` can be used to store and retrieve samples and targets used to
train a ML model. A typical example is one in which one simulation produces samples at
each time step and another application needs to download the samples as they are produced
to train a Deep Neural Network (e.g. a surrogate model).

In this section, we will use components implemented in ``smartsim.ml.tf.data`` to train a
Neural Network implemented in TensorFlow and Keras. In particular, we will be using
two classes:

- ``smartsim.ml.data.TrainingDataUploader``, which streamlines the uploading of samples and corresponding targets to the DB
- ``smartsim.ml.tf.data.DataGenerator``, which is a Keras ``Sequence`` that can be used to train a DNN
  and will download the samples from the DB, updating the training set at the end of each epoch.

The SmartSim ``Experiment`` will consist of one mock simulation (the ``producer``) uploading samples,
and one application (the ``training_service``) downloading the samples to train a DNN.

A richer example, entirely implemented in Python, is available as a Jupyter Notebook in the
``tutorials`` section of the SmartSim repository.
Review comment (Member): Add a link for reader convenience?

An equivalent example using PyTorch instead of TensorFlow is available in the same directory.


4.2.1 Producing and uploading the samples
-----------------------------------------

.. _ml_training_producer_code:

The first application in the workflow, the ``producer``, will upload batches of samples at regular intervals,
mimicking the behavior of an iterative simulation.

Since the ``training_service`` will use a ``smartsim.ml.tf.DataGenerator`` to download the samples, their
keys need to follow a pre-defined format. Assuming that only one process in the simulation
uploads the data, this format is ``<sample_prefix>_<iteration>``. For targets
(which can also be integer labels), the key format is ``<target_prefix>_<iteration>``. Both ``<sample_prefix>``
and ``<target_prefix>`` are user-defined, and will need to be used to initialize the
``smartsim.ml.tf.DataGenerator`` object.

Assuming the simulation is written in Python, then the code would look like

.. code-block:: python

from smartredis import Client
# simulation initialization code
client = Client(cluster=False, address=None)
Review comment (Contributor): note here that address is needed if not launched through SmartSim


for iteration in range(num_iterations):
# simulation code producing two tensors, data_points
# and data_values
client.put_tensor(f"points_{iteration}", data_points)
client.put_tensor(f"values_{iteration}", data_values)


For simple simulations, this is sufficient. But if the simulation
uses MPI, then each rank could upload a portion of the data set. In that case,
the format for sample and target keys will be ``<sample_prefix>_<sub_index>_<iteration>``
and ``<target_prefix>_<sub_index>_<iteration>``, where ``<sub_index>`` can be, e.g.,
the MPI rank id.
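To make the two key layouts concrete, here is a small sketch; the ``sample_key`` helper below is hypothetical (not part of the SmartSim API) and only illustrates the formats described above. Target keys follow the same pattern with ``<target_prefix>``.

```python
# Hypothetical helper (not part of SmartSim) illustrating the key formats
# described above for samples uploaded to the database.
def sample_key(prefix, iteration, sub_index=None):
    if sub_index is None:
        # single uploading process: <sample_prefix>_<iteration>
        return f"{prefix}_{iteration}"
    # MPI simulation: <sample_prefix>_<sub_index>_<iteration>,
    # where sub_index can be e.g. the MPI rank id
    return f"{prefix}_{sub_index}_{iteration}"

print(sample_key("points", 12))               # points_12
print(sample_key("points", 12, sub_index=3))  # points_3_12
```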


4.2.2 Downloading the samples and training the model
----------------------------------------------------

The second part of the workflow is the ``training_service``, an application that
downloads the data uploaded by the ``producer`` and uses it to train an ML model.
Most importantly, the ``training_service`` needs to keep looking for new samples
and download them as they become available. The training data set thus needs to grow at
each ``producer`` iteration.

In Keras, a ``Sequence`` represents a data set and can be passed to ``model.fit()``.
The class ``smartsim.ml.tf.DataGenerator`` is a Keras ``Sequence``, which updates
its data set at the end of each training epoch, looking for newly produced batches of samples.
A current limitation of the TensorFlow training algorithm is that it does not take
into account changes in data set size once training has started, i.e. it is always
assumed that the training (and validation) data does not change during training. To
overcome this limitation, we need to train one epoch at a time. Thus,
following what we defined in the :ref:`producer section <ml_training_producer_code>`,
the ``training_service`` would look like

.. code-block:: python

from smartsim.ml.tf.data import DataGenerator
generator = DataGenerator(
sample_prefix="points",
target_prefix="values",
batch_size=32,
smartredis_cluster=False)

model = ...  # some ML model
# model initialization

for epoch in range(100):
model.fit(generator,
steps_per_epoch=None,
epochs=epoch+1,
initial_epoch=epoch,
batch_size=generator.batch_size,
verbose=2)


Again, this is enough for simple simulations. If the simulation uses MPI,
then the ``DataGenerator`` needs to know about the possible sub-indices. For example,
if the simulation runs 8 MPI ranks, the ``DataGenerator`` initialization will
need to be adapted as follows

.. code-block:: python

generator = DataGenerator(
sample_prefix="points",
target_prefix="values",
batch_size=32,
smartredis_cluster=False,
sub_indices=8)


4.2.3 Launching the experiment
------------------------------

To launch the ``producer`` and the ``training_service`` as models
within a SmartSim ``Experiment``, we can use the following code:

.. code-block:: python

from smartsim import Experiment
from smartsim.database import Orchestrator
from smartsim.settings import RunSettings

db = Orchestrator(port=6780)
exp = ("online-training", launcher="local")
Review comment (Member): ``exp = Experiment("online-training", launcher="local")``
Review reply (Author): Good catch


# producer
producer_script = "producer.py"
settings = RunSettings("python", exe_args=producer_script)
Review comment (Contributor): create_run_settings
Review reply (Author): Fixed

uploader_model = exp.create_model("producer", settings)
Review comment (Contributor): can enable key prefixing here instead and save a line
Review reply (Author): I had no clue we could do that.

uploader_model.attach_generator_files(to_copy=script)
Review comment (Member): to_copy=producer_script
Review reply (Author): Fixed

uploader_model.enable_key_prefixing()

# training_service
training_script = "training_service.py"
settings = RunSettings("python", exe_args=training_script)
Review comment (Member): create_run_settings
Review reply (Author): Done

trainer_model = exp.create_model("training_service", settings)
trainer_model.register_incoming_entity(uploader_model)

exp.start(db)
exp.start(uploader_model, block=False, summary=False)
Review comment (Contributor): launch in same start call?
Review reply (Author): Done

exp.start(trainer_model, block=True, summary=False)


Two lines require attention, as they are needed by the ``DataGenerator`` to work:

- ``uploader_model.enable_key_prefixing()`` ensures that the ``producer`` prefixes
  all tensor keys with its name
- ``trainer_model.register_incoming_entity(uploader_model)`` lets the ``DataGenerator``
  in the ``training_service`` know that it needs to download samples produced by the ``producer``


1 change: 1 addition & 0 deletions smartsim/ml/__init__.py
@@ -0,0 +1 @@
from .data import TrainingDataUploader, form_name
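The ``form_name`` helper exported here is presumably the function that joins key components with underscores, producing keys like the tutorial's ``<prefix>_<iteration>`` format. A plausible sketch of its behavior follows; this is an assumption for illustration, not SmartSim's actual implementation.

```python
# Assumed behavior of a form_name-style key helper: join the non-None
# parts with underscores. Illustrative sketch, not SmartSim's real code.
def form_name(*args):
    return "_".join(str(arg) for arg in args if arg is not None)

print(form_name("points", 12))        # points_12
print(form_name("points", None, 12))  # points_12
```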