[MXNet-1349][Fit API] Add validation support and unit tests for fit() API #14442
Conversation
__all__ = ['Estimator']


class Estimator(object):
Wasn't this class included in this PR: #14346?
Yes, I have made changes in it for validation support.
I'd recommend either adding the changes for validation metrics as a commit to the existing PR #14346, or waiting until that PR gets merged and adding this as a single commit so that the diff is incremental.
@piyushghai I asked @abhinavs95 to open this PR so I can review it early. We can do a rebase and resolve the conflict once parent PR is merged.
Thanks for your contribution! Generally looks good; I added a few comments.
Remember to rebase and resolve the conflict once the parent PR is merged.
        label = gluon.utils.split_and_load(label, ctx_list=ctx, batch_axis=0)
        return data, label

    def _test(self, val_data, batch_fn=None):
Maybe rename to _evaluate(self, eval_data), as this can be used for both the validation (at epoch end) and test (at train end) datasets.
Done!
self.train_loss_metrics.append(Loss(l.name))
self.test_loss_metrics.append(Loss(l.name))
self.train_stats['train_' + l.name] = []
self.train_stats['test_' + l.name] = []
use val_
Done!
    # only record the latest metric numbers after each batch
    self.train_stats['batch_' + metric.name] = 0.
for metric in self.test_metrics:
    self.train_stats['test_' + metric.name] = []
use val_
Done!
msg = '\n[Epoch %d] finished in %.3fs: ' % (epoch, epoch_time)
for key in self._estimator.train_stats.keys():
    if do_validation:
        if key.startswith('train_') or key.startswith('test_'):
You can remove the if/else, and there's no need to pass do_validation. The logic can simply be: if the key starts with train_ or val_, log it, even if no validation is done.
Done! Also moved the val stats update loop to fit() to avoid a key error. If no val set is passed, 'nan' will be logged for the val stats.
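A minimal sketch of that logging flow (variable names assumed from the surrounding snippets, not the merged code): any stat whose key starts with 'train_' or 'val_' gets logged, and the 'nan' placeholder shows up when no validation set was passed.

msg = '\n[Epoch %d] finished in %.3fs: ' % (epoch, epoch_time)
for key, history in estimator.train_stats.items():
    # history is a per-epoch list; val_ entries hold float('nan')
    # when fit() was called without a validation set
    if key.startswith('train_') or key.startswith('val_'):
        msg += '%s: %.4f, ' % (key, history[-1])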
@mxnet-label-bot add [Gluon, Test]
LGTM, please fix the CI failure
@@ -26,6 +26,7 @@
from ...context import Context, cpu, gpu, num_gpus
from ...io import DataIter
from ...metric import EvalMetric, Loss
import copy
This failed the sanity check: standard-library imports like copy must come first.
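For reference, a sketch of the ordering the sanity check (pylint) expects, with standard-library imports before package-relative ones:

# standard library imports first
import copy

# then package-relative imports
from ...context import Context, cpu, gpu, num_gpus
from ...io import DataIter
from ...metric import EvalMetric, Loss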
    self.train_stats['train_' + metric.name].append(metric.get()[1])
for metric in self.test_metrics + self.test_loss_metrics:
Can we rename test_metrics to val_metrics for consistency, since we are referring to them as validation stuff throughout?
We can use test_metrics for both the validation (at epoch end) and test (at train end) datasets.
                loss=loss,
                trainers=trainer,
                context=ctx)
est.fit(train_data=train_data,
What are we asserting against here?
This checks that the estimator works with no metric specified and doesn't throw any error/warning when it's successful.
So there's a particular happy path that you're trying to test here.
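A hypothetical sketch of such a happy-path test (the network, data shapes, and epochs argument are illustrative, not taken from this PR): fit() with no metrics argument should simply run to completion.

import mxnet as mx
from mxnet import gluon

def test_fit_without_metric():
    net = gluon.nn.Dense(4)
    net.initialize(mx.init.Xavier(), ctx=mx.cpu())
    loss = gluon.loss.L2Loss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    # Estimator is the class under review; no metrics are passed
    est = Estimator(net=net, loss=loss, trainers=trainer, context=mx.cpu())
    X = mx.nd.random.uniform(shape=(8, 3))
    y = mx.nd.random.uniform(shape=(8, 4))
    train_data = gluon.data.DataLoader(gluon.data.ArrayDataset(X, y), batch_size=4)
    # the test passes if no exception or warning is raised
    est.fit(train_data=train_data, epochs=1)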
@@ -64,17 +65,21 @@ def __init__(self, net,
            self.loss = [loss]
        else:
            self.loss = loss or []
        if not self.loss:
I think this if block is not correct; it should be something like:

if isinstance(loss, gluon.loss.Loss):
    self.loss = [loss]
elif isinstance(loss, list) and all([isinstance(l, gluon.loss.Loss) for l in loss]):
    self.loss = loss
else:
    raise ValueError("loss must be a Loss or a list of Loss, refer to gluon.loss.Loss:{}".format(loss))
Done!
if not isinstance(metric, EvalMetric):
    raise ValueError("metrics must be a Metric or a list of Metric, refer to mxnet.metric.EvalMetric")
# Use same metrics for validation
self.test_metrics = copy.deepcopy(self.train_metrics)
rename self.test_metrics -> self.val_metrics
Done!
for l in self.loss:
-    if not isinstance(loss, gluon.loss.Loss):
+    if not isinstance(l, gluon.loss.Loss):
        raise ValueError("loss must be a Loss or a list of Loss, refer to gluon.loss.Loss")

if isinstance(metrics, EvalMetric):
use logic similar to above
Done!
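For reference, "logic similar to above" applied to metrics would look roughly like this (a sketch; the error message matches the diff shown later in this conversation):

if isinstance(metrics, EvalMetric):
    self.train_metrics = [metrics]
elif isinstance(metrics, list) and all([isinstance(m, EvalMetric) for m in metrics]):
    self.train_metrics = metrics
else:
    raise ValueError("metrics must be a Metric or a list of Metric, "
                     "refer to mxnet.metric.EvalMetric:{}".format(metrics))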
@@ -156,7 +168,33 @@ def _batch_fn(self, batch, ctx, is_iterator=False):
        label = gluon.utils.split_and_load(label, ctx_list=ctx, batch_axis=0)
        return data, label

    def _evaluate(self, val_data, batch_fn=None):
What if I just want to validate on a single item? Can I not pass X, y?
This can be done by wrapping X, y in a DataLoader/DataIter and passing it as val_data.
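A minimal sketch of that wrapping, assuming a single sample X with label y:

import mxnet as mx
from mxnet import gluon

X = mx.nd.random.uniform(shape=(1, 28 * 28))  # one sample
y = mx.nd.array([3])                          # its label

# wrap the single (X, y) pair so it can be passed as val_data
single_item = gluon.data.ArrayDataset(X, y)
val_data = gluon.data.DataLoader(single_item, batch_size=1)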
if isinstance(val_data, gluon.data.DataLoader):
    data, label = self._batch_fn(batch, self.context)
elif isinstance(val_data, DataIter):
    data, label = self._batch_fn(batch, self.context, is_iterator=True)
Same as above.
for _, batch in enumerate(val_data):
    if not batch_fn:
        if isinstance(val_data, gluon.data.DataLoader):
move this if/else into self._batch_fn
This check needs to happen before calling batch_fn, as val_data is not available to it.
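A sketch of the ordering being discussed, reconstructed from the snippets in this thread (the custom batch_fn branch at the end is assumed): the isinstance check runs on val_data before dispatching, because _batch_fn only ever receives the batch.

for _, batch in enumerate(val_data):
    if not batch_fn:
        # this check must live here: _batch_fn receives only the batch,
        # so it cannot inspect val_data itself
        if isinstance(val_data, gluon.data.DataLoader):
            data, label = self._batch_fn(batch, self.context)
        elif isinstance(val_data, DataIter):
            data, label = self._batch_fn(batch, self.context, is_iterator=True)
        else:
            raise ValueError("You are using a custom iteration, please also provide "
                             "batch_fn to extract data and label")
    else:
        data, label = batch_fn(batch, self.context)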
@@ -156,7 +168,33 @@ def _batch_fn(self, batch, ctx, is_iterator=False):
        label = gluon.utils.split_and_load(label, ctx_list=ctx, batch_axis=0)
        return data, label

    def _evaluate(self, val_data, batch_fn=None):
can we expose this method?
Done!
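One hypothetical way to expose it, sketched as a thin public wrapper (the actual change may simply have renamed the method):

def evaluate(self, val_data, batch_fn=None):
    """Public entry point for evaluating on a validation or test dataset."""
    return self._evaluate(val_data, batch_fn=batch_fn)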
@@ -192,6 +230,9 @@ def fit(self, train_data,
                not any(isinstance(handler, LoggingHandler) for handler in event_handlers):
            event_handlers.append(LoggingHandler(self))

        # Check for validation data
        do_validation = True if val_data else False
remove this.
Done!
def fit(self, train_data,
        val_data=None,
this shouldn't be optional
Users might want to train without a validation set. Although this is rare, keeping it optional provides a bit of flexibility.
I don't see a reason why users would not want a validation dataset; it's required to know that the model is not overfitting.
@roywei @nswamy @piyushghai Thanks for the review. I have addressed all the comments.
if isinstance(metrics, EvalMetric):
-    self.metrics = [metrics]
+    self.train_metrics = [metrics]
Can you infer it from the loss function? Use 'Accuracy' as the default when it's not passed?
'Accuracy' will only work for classification cases; for other cases it will give inaccurate results or even fail. Also, I'm not sure how we can infer metrics from the loss function, as there isn't a direct correlation between them. Do you have any suggestions?
I think we should still infer metrics from known Loss functions (at least from the examples you know)
Added the Accuracy metric as the default for SoftmaxCrossEntropy loss for now. Will add more in a follow-up PR.
    # record a history of metrics over each epoch
    self.train_stats['train_' + metric.name] = []
    # only record the latest metric numbers after each batch
    self.train_stats['batch_' + metric.name] = 0.
self.loss_metrics = []
for metric in self.val_metrics:
Can we have one for loop for self.train_metrics and self.val_metrics, since the value of both parameters is the same according to line 82? Something like for train_m, val_m in zip(self.train_metrics, self.val_metrics). Though the zip() operator stops after exhausting the shorter array, both arrays here are the same length, so we can use zip().
We want to keep the train and val metrics separate here. Currently we are using the same metrics for val and train, but future updates may involve separate user-specified val metrics, in which case combining this update loop won't work.
elif isinstance(val_data, DataIter):
    data, label = self._batch_fn(batch, self.context, is_iterator=True)
else:
    raise ValueError("You are using a custom iteration, please also provide "
It would be helpful to the end user if we could provide a more detailed exception. You can append something like this at the end: "or you can provide the data as a gluon.data.DataLoader or mx.io.DataIter". Please also change this statement in the fit() method if you are changing it here.
Thanks for pointing it out, updated!
Can you fix the context validation? Currently, if it's a list of contexts, it will fail.
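A hypothetical sketch of context validation that also accepts a list of contexts (the default fallback here is an assumption, not the merged behavior):

from mxnet.context import Context, cpu, gpu, num_gpus

def _check_context(context):
    # default to all available GPUs, else CPU (assumed fallback)
    if context is None:
        return [gpu(i) for i in range(num_gpus())] if num_gpus() > 0 else [cpu()]
    if isinstance(context, Context):
        return [context]
    if isinstance(context, list) and all(isinstance(c, Context) for c in context):
        return context
    raise ValueError("context must be a Context or a list of Context, "
                     "got: {}".format(context))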
@@ -77,6 +77,10 @@ def __init__(self, net,
            raise ValueError("metrics must be a Metric or a list of Metric, "
                             "refer to mxnet.metric.EvalMetric:{}".format(metrics))

        # Use default mx.metric.Accuracy() for gluon.loss.SoftmaxCrossEntropyLoss()
        if not self.train_metrics and any([isinstance(l, gluon.loss.SoftmaxCrossEntropyLoss) for l in self.loss]):
Let's get this from a map of Loss -> [default metrics] in the next version.
Yes, tracking it using JIRA issue: https://issues.apache.org/jira/browse/MXNET-1364
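A hypothetical sketch of such a map; only the SoftmaxCrossEntropyLoss -> Accuracy entry is grounded in this PR, the other entries are illustrative guesses:

import mxnet as mx
from mxnet import gluon

# map each known Gluon loss class to sensible default metric classes
DEFAULT_METRICS = {
    gluon.loss.SoftmaxCrossEntropyLoss: [mx.metric.Accuracy],
    gluon.loss.L2Loss: [mx.metric.MSE],
    gluon.loss.L1Loss: [mx.metric.MAE],
}

def infer_default_metrics(losses):
    """Instantiate default metrics for a list of loss objects."""
    metrics = []
    for l in losses:
        for loss_cls, metric_classes in DEFAULT_METRICS.items():
            if isinstance(l, loss_cls):
                metrics.extend(m() for m in metric_classes)
    return metrics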
…API (apache#14442)

* added estimator unittests
* add more tests for estimator
* added validation logic
* added error handlers, unittests
* improve val stats
* fix pylint
* fix pylint
* update unit test
* fix tests
* fix tests
* updated metrics, val logic
* trigger ci
* trigger ci
* update metric, batch_fn error handler
* update context logic, add default metric
Description
Adding validation support and unit tests for fit() API.
This PR depends on the parent PR for fit() API #14346
JIRA epic: https://issues.apache.org/jira/projects/MXNET/issues/MXNET-1333
Design: https://cwiki.apache.org/confluence/display/MXNET/Gluon+Fit+API+-+Tech+Design