ML DataLoaders #115
Changes from 105 commits
The following code examples do not include code to train the models shown.

4.1.1 PyTorch
-------------

.. _TorchScript: https://pytorch.org/docs/stable/jit.html
.. _PyTorch: https://pytorch.org/
If running on CPU, be sure to change the argument in the ``set_model`` call
above to ``CPU``.

4.1.2 TensorFlow and Keras
--------------------------

.. _TensorFlow: https://www.tensorflow.org/
.. _Keras: https://keras.io/
returns these values for convenience, as shown below.

    print(pred)

4.1.3 ONNX
----------

.. _Scikit-learn: https://scikit-learn.org
.. _XGBoost: https://xgboost.readthedocs.io
Libraries are supported by ONNX and can be readily used with SmartSim.

Some popular ones are:

- `Scikit-learn`_
- `XGBoost`_
- `CatBoost`_
- `TensorFlow`_
- `Keras`_
- `PyTorch`_
- `LightGBM`_
- `libsvm`_

Other libraries not listed here are supported as well. There are also many
tools to help convert models to ONNX:

- `onnxmltools`_
- `skl2onnx`_
- `tensorflow-onnx`_

And PyTorch has its own converter.
These scripts can be used with the :ref:`SmartSim code <infrastructure_code>`
above to launch an inference session with any of the supported ONNX libraries.

KMeans
++++++

.. _skl2onnx.to_onnx: http://onnx.ai/sklearn-onnx/auto_examples/plot_convert_syntax.html
with two ``outputs``.

Random Forest
+++++++++++++

The Random Forest example uses the Iris dataset from Scikit-learn to train a
``RandomForestRegressor``. As with the other examples, the skl2onnx function
    client.set_model("rf_regressor", model, "ONNX", device="CPU")
    client.run_model("rf_regressor", inputs="input", outputs="output")
    print(client.get_tensor("output"))


4.2 Online training with SmartSim
=================================
A SmartSim ``Orchestrator`` can be used to store and retrieve the samples and targets
used to train a ML model. A typical example is one in which a simulation produces
samples at each time step, and another application downloads the samples as they are
produced to train a Deep Neural Network (e.g. a surrogate model).

In this section, we will use components implemented in ``smartsim.ml.tf.data`` to train
a Neural Network implemented in TensorFlow and Keras. In particular, we will be using
two classes:

- ``smartsim.ml.data.TrainingDataUploader``, which streamlines the uploading of samples
  and corresponding targets to the DB
- ``smartsim.ml.tf.data.DataGenerator``, a Keras ``Sequence`` which can be used to train
  a DNN, and which downloads the samples from the DB, updating the training set at the
  end of each epoch

The SmartSim ``Experiment`` will consist of one mock simulation (the ``producer``)
uploading samples, and one application (the ``training_service``) downloading the
samples to train a DNN.

A richer example, entirely implemented in Python, is available as a Jupyter Notebook
in the ``tutorials`` section of the SmartSim repository.
An equivalent example using PyTorch instead of TensorFlow is available in the same
directory.


4.2.1 Producing and uploading the samples
-----------------------------------------
.. _ml_training_producer_code:

The first application in the workflow, the ``producer``, will upload batches of samples
at regular intervals, mimicking the behavior of an iterative simulation.

Since the ``training_service`` will use a ``smartsim.ml.tf.DataGenerator`` to download
the samples, their keys need to follow a pre-defined format. Assuming that only one
process in the simulation uploads the data, this format is ``<sample_prefix>_<iteration>``.
For targets (which can also be integer labels), the key format is
``<target_prefix>_<iteration>``. Both ``<sample_prefix>`` and ``<target_prefix>`` are
user-defined, and will need to be used to initialize the
``smartsim.ml.tf.DataGenerator`` object.

Assuming the simulation is written in Python, the code would look like

.. code-block:: python

    from smartredis import Client

    # simulation initialization code.
    # Note: the address argument is needed if the application
    # is not launched through SmartSim.
    client = Client(cluster=False, address=None)

    for iteration in range(num_iterations):
        # simulation code producing two tensors, data_points
        # and data_values
        client.put_tensor(f"points_{iteration}", data_points)
        client.put_tensor(f"values_{iteration}", data_values)

For simple simulations, this is sufficient. But if the simulation uses MPI, then each
rank could upload a portion of the data set. In that case, the format for sample and
target keys will be ``<sample_prefix>_<sub_index>_<iteration>`` and
``<target_prefix>_<sub_index>_<iteration>``, where ``<sub_index>`` can be, e.g.,
the MPI rank id.
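
The key layout described above can be sketched as a small helper function. This helper
is purely illustrative and is not part of SmartSim; it only formats keys the way the
``DataGenerator`` expects to find them:

.. code-block:: python

    def make_key(prefix, iteration, sub_index=None):
        """Format a dataset key: <prefix>_<iteration> for a serial
        producer, <prefix>_<sub_index>_<iteration> for an MPI producer."""
        if sub_index is None:
            return f"{prefix}_{iteration}"
        return f"{prefix}_{sub_index}_{iteration}"

    # serial producer, iteration 3
    print(make_key("points", 3))               # points_3
    # MPI producer, rank 2, iteration 3
    print(make_key("points", 3, sub_index=2))  # points_2_3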


4.2.2 Downloading the samples and training the model
----------------------------------------------------

The second part of the workflow is the ``training_service``, an application that
downloads the data uploaded by the ``producer`` and uses it to train a ML model.
Most importantly, the ``training_service`` needs to keep looking for new samples
and download them as they become available. The training data set thus grows at
each ``producer`` iteration.
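
To make this growth pattern concrete, the following toy sketch scans a plain
dictionary (standing in for the database) for newly produced batches each time it is
updated. It only illustrates the update logic; it is not the actual ``DataGenerator``
implementation:

.. code-block:: python

    class GrowingDataset:
        """Toy stand-in for a data set that grows between epochs.
        Keys in `store` follow the <prefix>_<iteration> format."""

        def __init__(self, store, prefix="points"):
            self.store = store
            self.prefix = prefix
            self.batches = []
            self.next_iteration = 0

        def update(self):
            # called at the end of each epoch: pull newly produced batches
            while f"{self.prefix}_{self.next_iteration}" in self.store:
                key = f"{self.prefix}_{self.next_iteration}"
                self.batches.append(self.store[key])
                self.next_iteration += 1

        def __len__(self):
            return len(self.batches)

    store = {"points_0": [1, 2], "points_1": [3, 4]}
    ds = GrowingDataset(store)
    ds.update()
    assert len(ds) == 2

    store["points_2"] = [5, 6]   # producer uploads a new batch
    ds.update()
    assert len(ds) == 3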

In Keras, a ``Sequence`` represents a data set and can be passed to ``model.fit()``.
The class ``smartsim.ml.tf.DataGenerator`` is a Keras ``Sequence`` which updates its
data set at the end of each training epoch, looking for newly produced batches of
samples. A current limitation of the TensorFlow training algorithm is that it does not
take into account changes in the size of the data set once training has started, i.e.
it assumes that the training (and validation) data does not change during training. To
overcome this limitation, we need to train one epoch at a time. Thus, following what
we defined in the :ref:`producer section <ml_training_producer_code>`, the
``training_service`` would look like

.. code-block:: python

    from smartsim.ml.tf.data import DataGenerator

    generator = DataGenerator(
        sample_prefix="points",
        target_prefix="values",
        batch_size=32,
        smartredis_cluster=False)

    model = ...  # some ML model, initialized and compiled here

    for epoch in range(100):
        model.fit(generator,
                  steps_per_epoch=None,
                  epochs=epoch+1,
                  initial_epoch=epoch,
                  batch_size=generator.batch_size,
                  verbose=2)

Again, this is enough for simple simulations. If the simulation uses MPI, then the
``DataGenerator`` needs to know about the possible sub-indices. For example, if the
simulation runs 8 MPI ranks, the ``DataGenerator`` initialization will need to be
adapted as follows:

.. code-block:: python

    generator = DataGenerator(
        sample_prefix="points",
        target_prefix="values",
        batch_size=32,
        smartredis_cluster=False,
        sub_indices=8)

4.2.3 Launching the experiment
------------------------------

To launch the ``producer`` and the ``training_service`` as models within a SmartSim
``Experiment``, we can use the following code:

.. code-block:: python

    from smartsim import Experiment
    from smartsim.database import Orchestrator
    from smartsim.settings import RunSettings

    db = Orchestrator(port=6780)
    exp = Experiment("online-training", launcher="local")

    # producer
    producer_script = "producer.py"
    settings = RunSettings("python", exe_args=producer_script)
    uploader_model = exp.create_model("producer", settings)
    uploader_model.attach_generator_files(to_copy=producer_script)
    uploader_model.enable_key_prefixing()

    # training_service
    training_script = "training_service.py"
    settings = RunSettings("python", exe_args=training_script)
    trainer_model = exp.create_model("training_service", settings)
    trainer_model.register_incoming_entity(uploader_model)

    exp.start(db)
    exp.start(uploader_model, block=False, summary=False)
    exp.start(trainer_model, block=True, summary=False)

Two lines require attention, as they are needed for the ``DataGenerator`` to work:

- ``uploader_model.enable_key_prefixing()`` ensures that the ``producer`` prefixes
  all tensor keys with its name
- ``trainer_model.register_incoming_entity(uploader_model)`` lets the ``DataGenerator``
  in the ``training_service`` know that it needs to download the samples produced by
  the ``producer``
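
To illustrate what key prefixing means for the key space seen by the
``training_service``, the sketch below shows the assumed layout. The exact separator
used by SmartRedis is an implementation detail; a period is assumed here purely for
illustration:

.. code-block:: python

    # illustrative only: assumes the producing entity's name and the
    # tensor key are joined with a period
    def prefixed_key(entity_name, key):
        return f"{entity_name}.{key}"

    print(prefixed_key("producer", "points_0"))  # producer.points_0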

In a separate file, the pull request adds the following export:

.. code-block:: python

    from .data import TrainingDataUploader, form_name