ML DataLoaders #115

al-rigazzi · 2021-12-02T23:35:11Z

This PR adds ML Data Loaders for Keras/TF and PyTorch.

They follow the download-and-store approach rather than the streaming one (though streaming could be added later). Samples are downloaded from the DB and kept as a local copy of the data set, which means they are not discarded after one epoch.

Specializations required some fine tuning, but the behavior is the same for TF and PyTorch. There are super-classes for all downloaders.
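As a rough illustration of the download-and-store approach described above, the sketch below polls a key-value store for new batches and appends them to a local copy that is reused across epochs. `FakeStore`, `StoringDownloader`, and the `samples_<index>` key format are hypothetical stand-ins, not the classes added in this PR; a real loader would talk to the database through a SmartRedis client instead.

```python
import random


class FakeStore:
    """In-memory stand-in for the database (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def exists(self, key):
        return key in self._data

    def get(self, key):
        return self._data[key]


class StoringDownloader:
    """Downloads each batch once and keeps it for later epochs."""

    def __init__(self, store, prefix="samples"):
        self.store = store
        self.prefix = prefix
        self.next_idx = 0  # index of the next batch to look for
        self.samples = []  # local copy of everything downloaded so far

    def update(self):
        """Fetch every batch that has appeared since the last call."""
        while self.store.exists(f"{self.prefix}_{self.next_idx}"):
            self.samples.extend(self.store.get(f"{self.prefix}_{self.next_idx}"))
            self.next_idx += 1

    def epoch(self, shuffle=True, seed=None):
        """Return one pass over the stored samples, refreshing them first."""
        self.update()
        order = list(self.samples)
        if shuffle:
            random.Random(seed).shuffle(order)
        return order


store = FakeStore()
store.put("samples_0", [1, 2, 3])
loader = StoringDownloader(store)
first_epoch = loader.epoch(shuffle=False)

# A producer uploads another batch between epochs.
store.put("samples_1", [4, 5])
second_epoch = loader.epoch(shuffle=False)
```

The second epoch still contains the samples from the first one, which is the behavior the description refers to; a streaming variant would instead drop each batch after it is consumed.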

Spartee

Will you run make style on this before I put in review comments?

Spartee · 2021-12-15T00:13:17Z

Went to review and looks like there are still a couple conflicts @al-rigazzi

Experiment.create_run_settings has been implemented; it is a factory method for creating run settings types. The run_command default is "auto", which attempts to automatically detect the run_command to use on the system.

This functionality may be reused in other places within the library, so the create_run_settings function was placed inside smartsim/settings and is now called from the experiment. Local tests have been adapted.
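To make the "auto" behavior concrete, here is a minimal sketch of a settings factory that looks for a launch command on the PATH. It is not SmartSim's implementation: `SimpleRunSettings`, `CANDIDATES`, `detect_run_command`, and the `which` parameter are all assumptions made for illustration, and the real function may detect commands differently.

```python
import shutil
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleRunSettings:
    """Hypothetical, simplified settings container."""

    exe: str
    exe_args: List[str] = field(default_factory=list)
    run_command: Optional[str] = None  # None: run the executable directly


# Candidate launch commands in order of preference (illustrative only).
CANDIDATES = ("srun", "aprun", "mpirun")


def detect_run_command(which=shutil.which):
    """Return the first candidate found on the PATH, or None."""
    for cmd in CANDIDATES:
        if which(cmd):
            return cmd
    return None


def create_run_settings(exe, exe_args=None, run_command="auto", which=shutil.which):
    """Build settings, resolving run_command="auto" by detection."""
    if run_command == "auto":
        run_command = detect_run_command(which)
    if isinstance(exe_args, str):
        exe_args = exe_args.split()
    return SimpleRunSettings(exe, list(exe_args or []), run_command)


# Pretend only mpirun is installed so the example is deterministic.
def fake_which(cmd):
    return "/usr/bin/mpirun" if cmd == "mpirun" else None


settings = create_run_settings("python", exe_args="producer.py", which=fake_which)
```

Injecting `which` keeps the detection testable on machines that have none of the candidate commands installed.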

Spartee

A couple of things:

Make sure to throw the deprecation warning for smartsim.tf, and let's let the user import that for a release. Basically, an empty module with the deprecation warning and an import works here. Use the warnings library and throw DeprecationWarning.
Some renaming for brevity.
Let's add tests for all of these in the tests/backends folder.
Use the logger.
Lots of copy-paste; let's keep the comments to what users need to see and direct them to the places they need to go for more information if necessary. These should be easy to grok in a VSCode-like editor.
Instead of smartsim.ml.tf.data, let's have those imports resolve to smartsim.ml.tf for brevity.
The workers comment may belong in a different ticket, but let's expose that if the functionality works and be sure to document the decision there.
tests tests tests tests
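For the deprecation request in the list above, a minimal sketch with the standard `warnings` module could look like this. The helper name `warn_deprecated_module` and the message wording are assumptions; the commented lines show how an otherwise empty `smartsim/tf/__init__.py` might use it.

```python
import warnings


def warn_deprecated_module(old_name, new_name):
    """Emit a DeprecationWarning pointing users at the new import path."""
    warnings.warn(
        f"{old_name} is deprecated and will be removed in a future release; "
        f"import from {new_name} instead.",
        DeprecationWarning,
        stacklevel=2,
    )


# In the old package's __init__.py one would call the helper and then
# re-export the public names from the new location, for example:
#     warn_deprecated_module("smartsim.tf", "smartsim.ml.tf")
#     from smartsim.ml.tf import *   # assumed new location

# Demonstration: capture the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated_module("smartsim.tf", "smartsim.ml.tf")

message = str(caught[0].message)
```

Note that Python's default filters hide `DeprecationWarning` unless it is triggered from `__main__`, so a test should enable it explicitly, as the `simplefilter("always")` call does here.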

Spartee · 2022-01-17T20:49:20Z

smartsim/ml/data.py

+        batch_key = form_name(self.sample_prefix, self.batch_idx, sub_index)
+        self.client.put_tensor(batch_key, samples)
+        if self.verbose:
+            print(f"Put batch {batch_key}")


logger instead here?
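A sketch of what the suggestion could look like with the standard `logging` module is below. The logger name, the `put_batch` helper, and the dict used as a store are hypothetical; SmartSim has its own logger, which would be the natural choice in the actual code.

```python
import io
import logging

logger = logging.getLogger("uploader_sketch")  # hypothetical logger name


def put_batch(store, batch_key, samples):
    """Store a batch and report it through the logger instead of print."""
    store[batch_key] = samples
    logger.debug("Put batch %s", batch_key)


# Route log records to a string buffer so the output can be inspected.
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

store = {}
logger.setLevel(logging.DEBUG)  # verbose: debug messages are emitted
put_batch(store, "samples_0_1", [0.1, 0.2])
verbose_output = stream.getvalue()

logger.setLevel(logging.WARNING)  # quiet: the same call logs nothing
put_batch(store, "samples_1_1", [0.3, 0.4])
quiet_output = stream.getvalue()[len(verbose_output):]
```

With a logger, verbosity becomes a matter of configuring the log level rather than carrying a `verbose` flag through the class.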

Spartee · 2022-01-17T20:50:43Z

doc/tutorials/machine_learning.rst

+
+    from SmartRedis import Client
+    # simulation initialization code
+    client = Client(cluster=False, address=None)


note here that address is needed if not launched through SmartSim
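To illustrate the note: as far as I know, SmartSim exports the database address to the applications it launches through the `SSDB` environment variable, so `address=None` only works in that case. The helper below is a hypothetical sketch of the fallback logic (it is not part of SmartRedis or SmartSim) and treats the address as a single `<ip_address>:<port>` string.

```python
import os


def resolve_db_address(address=None, env=None):
    """Pick the database address for a client.

    An explicit address wins; otherwise fall back to the SSDB
    environment variable; otherwise fail with a clear message.
    """
    env = os.environ if env is None else env
    if address:
        return address
    if env.get("SSDB"):
        return env["SSDB"]
    raise ValueError(
        "No database address: pass address='<ip_address>:<port>' "
        "or launch the application through SmartSim."
    )


# Launched through SmartSim: the environment carries the address.
launched_by_smartsim = resolve_db_address(env={"SSDB": "127.0.0.1:6780"})

# Launched by hand: the caller has to provide it.
standalone = resolve_db_address("10.0.0.5:6379", env={})
```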

Spartee · 2022-01-17T20:52:48Z

doc/tutorials/machine_learning.rst

+
+    # producer
+    producer_script = "producer.py"
+    settings = RunSettings("python", exe_args=producer_script)


create_run_settings

Spartee · 2022-01-17T20:53:25Z

doc/tutorials/machine_learning.rst

+    # producer
+    producer_script = "producer.py"
+    settings = RunSettings("python", exe_args=producer_script)
+    uploader_model = exp.create_model("producer", settings)


can enable key prefixing here instead and save a line

I had no clue we could do that.

Spartee · 2022-01-17T20:55:13Z

smartsim/ml/data.py

+            self.init_sources()
+            self.init_samples()
+
+    def log(self, message):


OK, added logger now!

Spartee · 2022-01-17T20:56:16Z

smartsim/ml/data.py

+    `auto`.
+
+     - When specifying `auto`, the user must also specify
+      `uploader_name`. BatchDownloader will get all needed information


looks like you might have done some copy paste here.

Yep, now I fixed all names and only kept docstrings of base classes

Spartee · 2022-01-17T20:56:57Z

smartsim/ml/data.py

+    :type smartredis_cluster: bool
+    :param smartredis_address: Address of Redis client as <ip_address>:<port>
+    :type smartredis_address: str
+    :param replica_rank: When BatchDownloader is used in a distributed setting, indicates


We have these and sub-indices? This doesn't make sense to me, can you explain?

Yeah, this was confusing. The sub-indices are those of the uploader, I renamed them as uploader_ranks, which is just an int.

Spartee · 2022-01-17T20:58:53Z

smartsim/ml/torch/data.py

+        )
+
+    @staticmethod
+    def worker_init_fn(worker_id):


How is the user specifying number of workers?

Does this spawn threads or processes?

Both are handled by the super-class torch.DataLoader and corresponding settings.
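For context, `torch.utils.data.DataLoader` takes a `num_workers` argument and, when it is greater than zero, loads data in worker subprocesses; inside a worker, `torch.utils.data.get_worker_info()` exposes `id` and `num_workers`. A `worker_init_fn` can use those two values to give each worker its own share of the sources. The sketch below shows only that splitting step in plain Python; `shard_sources` and the source tuples are hypothetical and not the code in this PR.

```python
def shard_sources(sources, worker_id, num_workers):
    """Give worker `worker_id` every `num_workers`-th source."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return sources[worker_id::num_workers]


# Sources as (uploader name, uploader rank) pairs; values are illustrative.
all_sources = [("producer", rank) for rank in range(5)]
shards = [shard_sources(all_sources, w, 3) for w in range(3)]
```

Striding by `num_workers` keeps the shards disjoint and covers every source exactly once, even when the number of sources is not a multiple of the number of workers.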

Spartee · 2022-01-17T20:59:20Z

tutorials/04_ml_training/tf/data_uploader.py

+from os import environ
+from time import sleep
+import numpy as np
+from mpi4py import MPI


Let's make sure we have notes in here that specify you need MPI4Py.

I put it in the Jupyter notebook, should be enough, but I'm open to suggestions.

MattToast

Super cool looking stuff!! Here are some quick notes for consideration:

I'm gonna second shortening up the docstrings to essential information. As someone coming from the outside, it took me a while to figure out how the sub-classes differed while searching through the boilerplate.
In docstrings, multi-line bullet points should end in \ when moving to the next line.
Found a couple of minor errors in the docs.
Marked some places where I thought the code style seemed a bit off. Address as you see fit.

MattToast · 2022-01-21T19:34:10Z

doc/tutorials/machine_learning.rst

+and one application (the ``training_service``) downloading the samples to train a DNN.
+
+A richer example, entirely implemented in Python, is available as a Jupyter Notebook in the
+``tutorials`` section of the SmartSim repository.


Add a link for reader convenience?

MattToast · 2022-01-21T20:01:11Z

doc/tutorials/machine_learning.rst

+    from smartsim.settings import RunSettings
+
+    db = Orchestrator(port=6780)
+    exp = ("online-training", launcher="local")


exp = Experiment("online-training", launcher="local")

MattToast · 2022-01-21T20:02:59Z

doc/tutorials/machine_learning.rst

+    producer_script = "producer.py"
+    settings = RunSettings("python", exe_args=producer_script)
+    uploader_model = exp.create_model("producer", settings)
+    uploader_model.attach_generator_files(to_copy=script)


to_copy=producer_script

MattToast · 2022-01-21T22:08:08Z

smartsim/ml/torch/data.py

+            else:
+                sources = []
+
+        print(f"{worker_id}: {sources}")


replace with logger?

Well, this was too verbose anyhow. I removed it.

MattToast · 2022-01-21T22:15:47Z

tutorials/04_ml_training/tf/training_service_hvd.py

@@ -0,0 +1,40 @@
+import numpy as np


np never accessed

MattToast · 2022-01-21T22:20:14Z

tutorials/04_ml_training/torch/training_service.py

@@ -0,0 +1,55 @@
+import numpy as np


np not accessed

tutorials/04_ml_training/torch/training_service_hvd.py

MattToast · 2022-01-21T22:37:34Z

smartsim/ml/torch/data.py

+        dataset.init_sources()
+        overall_sources = dataset.sources
+
+        worker_id = worker_info.id


is the worker_id param used before being overwritten here?

I defer to the Torch docs I took this from. My best guess is that it is not.

@EricGustin

Adds various dimensions to the CI build matrix for SmartSim. The build matrix now uses MacOS & Ubuntu, GNU8, RedisAI 1.2.3 & 1.2.5, and Python 3.7-3.9. The build matrix excludes building with RedisAI 1.2.5 when on MacOS as RedisAI temporarily removed support for MacOS in 1.2.4 and 1.2.5 [ committed by @EricGustin and @Spartee ] [ reviewed by @Spartee ]

@EricGustin

* Remove np from step.py and requirements: create a helper function called get_base_36_repr so that we can remove numpy from step.py; remove numpy from requirements.txt; pin requirements.dev to a specific version of numpy
* Remove numpy from requirements
* Remove numpy from setup

[ committed by @EricGustin ] [ reviewed by @Spartee ]
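The commit above only names the get_base_36_repr helper, so the following is a guess at what a numpy-free version might look like, not the code that was merged. It assumes a non-negative integer input and uppercase digits.

```python
import string

# The 36 symbols used as digits: 0-9 followed by A-Z.
DIGITS = string.digits + string.ascii_uppercase


def get_base_36_repr(positive_int):
    """Return the base-36 string representation of a non-negative int."""
    if positive_int < 0:
        raise ValueError("expected a non-negative integer")
    if positive_int == 0:
        return "0"
    digits = []
    while positive_int:
        positive_int, remainder = divmod(positive_int, 36)
        digits.append(DIGITS[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(digits))


encoded = get_base_36_repr(123456)
```

Python's built-in `int(text, 36)` performs the inverse conversion, which makes the helper easy to check with a round trip.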

@MattToast

Edits to the Tutorials section of the documentation to highlight refined api [ committed by @MattToast ] [ reviewed by @Spartee ]

@Spartee

This PR adds the ability to build the documentation inside a docker build so that users don't have to worry about the specific build steps. The build exports the docs folder into the current clone so that the dev doesn't have to do it themselves. Build with ``make docks``. The manual documentation build is still available with make docs. There is also a new make tutorials target that builds the tutorials into a developer docker with the current clone of SmartSim. This will eventually have a prod docker build too, which downloads the pinned version of smartsim for the latest release. [ committed by @Spartee ] [ reviewed by @al-rigazzi ]

@MattToast

Remove some tutorial files that appear to have been duplicated. [ committed by @MattToast ] [ reviewed by @Spartee ]

Spartee

LGTM! Great work! Excited to see this propagate into the community.

al-rigazzi requested a review from Spartee December 3, 2021 00:06

al-rigazzi marked this pull request as ready for review December 6, 2021 16:28

Spartee reviewed Dec 9, 2021

Spartee added the area: ML and type: feature labels Jan 13, 2022

al-rigazzi and others added 24 commits January 17, 2022 10:10

First implementation of Keras loader and generator

d797589

Improve naming conventions

13ba12f

Enhance TF generator flexibility

5fe08eb

Adjustments to TF API

252c981

Add Torch dataset

3c83441

Add Pytorch Dataset

e2f96d8

Add torch iterable and multiprocess load

78dddb7

Add Horovod to Torch loader example

ff805b3

Unify data loader API

6d4cdc2

Add superclass to ml loaders

1ed2bb5

Added docstrings to base downloader class

b683bb2

Update ML docs and TF tutorial

ef00694

Apply style

b74796e

Update Torch ML training tutorial

4094333

Add documentation of ML training

daade7c

Add ML API to docs

19a58b9

Fix typo in docs

74cd1a9

Add parameterized command line arguments

62fc65b

Fix generator to accept multiple tags on same line

f212e25

Apply style

fb1fd36

Update comments of ensemble generation

92b07e6

Include reviewer's suggestions

ee8827e

implement create_run_settings

e41202c

Experiment.create_run_settings has been implemented; it is a factory method for creating run settings types. The run_command default is "auto", which attempts to automatically detect the run_command to use on the system.

Split out create_run_settings

375bced

This functionality may be reused in other places within the library, so the create_run_settings function was placed inside smartsim/settings and is now called from the experiment. Local tests have been adapted.

al-rigazzi added 2 commits January 17, 2022 10:33

Rebase on develop

29ab033

Merge branch 'develop' into trainer

2072328

Spartee self-requested a review January 17, 2022 20:45

Spartee suggested changes Jan 17, 2022

MattToast self-requested a review January 20, 2022 17:09

MattToast reviewed Jan 21, 2022

MattToast mentioned this pull request Jan 25, 2022

Tutorial Update #129

Merged

al-rigazzi and others added 20 commits January 31, 2022 05:22

Address reviewers' comment on code

a5bb69d

Add workaround for old SmartRedis versions

cba90b9

Fix docstring

bdb687a

Update Tutorials for Release (CrayLabs#132)

8284a22

Edits to the Tutorials section of the documentation to highlight refined api [ committed by @MattToast ] [ reviewed by @Spartee ]

Update build_docs.yml

c88ce3c

Remove Duplicate Tutorials

34ca199

Remove some tutorial files that appear to have been duplicated. [ committed by @MattToast ] [ reviewed by @Spartee ]

Update Tutorials for Release (CrayLabs#132)

d3d1eb0

Edits to the Tutorials section of the documentation to highlight refined api [ committed by @MattToast ] [ reviewed by @Spartee ]

Update build_docs.yml

b85619b

Merge branch 'develop' into trainer

353269f

Reduce torch nn size in dataloader test

fa440ef

Remove logging from test

a3f4a09

Address reviewers' comments on documentation

dba0cac

Modify docstrings and adapt class names

3ad1f1a

Update documentation

d96b7bb

Use SmartSim logger

28c8933

Add deprecation warning and make import not fatal

2fd37d1

Spartee approved these changes Feb 3, 2022

al-rigazzi merged commit e9210d4 into CrayLabs:develop Feb 3, 2022

al-rigazzi deleted the trainer branch February 10, 2022 14:45
