[MXNET-698] Correct train-metric log to reflect epoch metric #12182
Conversation
As far as I know, various benchmarking scripts in the repo use this printed metric. If we want to remove this metric, scripts will need to be changed too.
This PR is on hold - waiting for MXNet 1.3 release
This PR is ready for review again. Fixed parse_log to accommodate this change.
@sandeep-krishnamurthy Please change it back to pr-awaiting-review |
@vandanavk, thanks for the contribution!
Thanks for the input, @lupesko.
Can you not just print the batch range [10-20] along with the existing metric? There might be user scripts that depend on these values; I don't think we should drop this log.
Thanks for the input @nswamy. Editing the log in the following way will ensure minimum impact on scripts that parse this log.
The intention of the PR was to avoid giving the impression that the metric printed in this log is for the entire epoch.
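To illustrate why appending a batch range need not break downstream parsers, here is a hypothetical sketch of the kind of regex a benchmark script might use (the log lines and pattern below are illustrative assumptions, not taken from any actual script in the repo):

```python
import re

# A parser that anchors on "Train-<metric>=<value>" rather than on the
# whole line still matches after extra text (e.g. a batch range) is added.
pattern = re.compile(r"Train-(\w+)=([\d.]+)")

old_line = "Epoch[0] Train-accuracy=0.4830"                # before the change
new_line = "Epoch[0] Batch [10-20] Train-accuracy=0.4830"  # hypothetical new format

print(pattern.search(old_line).groups())  # ('accuracy', '0.4830')
print(pattern.search(new_line).groups())  # ('accuracy', '0.4830')
```

Parsers that match the full line verbatim would still need updating, which is why keeping the existing fields in place matters.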
@nswamy The following is the current behavior of the code. We need to reflect this in the "train-accuracy" log in base_module.py. With Speedometer's
With
Please also add training-accuracy (averaged for the entire epoch) at the end of every epoch.
2840a58 to 8f20f7f
@nswamy As discussed offline, I have submitted a new commit with the following changes: a separate metric is maintained for the entire epoch vs the batch-wise metric, and the text "(averaged for the entire epoch)" and the batch range have been added to Speedometer prints too.
Thanks @vandanavk for your contributions.
This is clearer. One dependency I know of that does not use Epoch details as a regex to parse metrics: https://github.com/awslabs/deeplearning-benchmark/blob/master/task_config_template.cfg
However, let us keep an eye on issues, in case users depend on this log for parsing epoch details.
Description
The log
self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)
gives the user the impression that the metric printed is for the entire epoch, which is misleading. When auto_reset=False,
Train-<metric> is the average for the entire epoch, and when auto_reset=True, Train-<metric> reflects the metric value for the remaining batches in the epoch.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Before
After
With auto_reset=True
With auto_reset=False
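To make the auto_reset difference concrete, here is a minimal pure-Python sketch of the periodic-reset behavior described above. This is an illustrative toy, not MXNet's actual Speedometer or EvalMetric code; the names `RunningMetric` and `run_epoch` are hypothetical:

```python
class RunningMetric:
    """Toy accuracy-style metric mimicking EvalMetric's reset semantics."""
    def __init__(self):
        self.sum = 0.0
        self.num = 0

    def update(self, value):
        self.sum += value
        self.num += 1

    def reset(self):
        self.sum = 0.0
        self.num = 0

    def get(self):
        return self.sum / max(self.num, 1)


def run_epoch(batch_values, frequent, auto_reset):
    """Return the value that would be logged as Train-<metric> at epoch end."""
    metric = RunningMetric()
    for i, v in enumerate(batch_values, start=1):
        metric.update(v)
        if auto_reset and i % frequent == 0:
            metric.reset()  # Speedometer-style periodic reset
    return metric.get()


vals = [0.2, 0.4, 0.6, 0.8]  # per-batch accuracies for a 4-batch epoch
# auto_reset=False: average over all 4 batches (~0.5)
print(run_epoch(vals, frequent=3, auto_reset=False))
# auto_reset=True: only the batch after the last reset counts (~0.8)
print(run_epoch(vals, frequent=3, auto_reset=True))
```

With auto_reset=True, the epoch-end value covers only the batches since the last reset, which is exactly the misleading behavior the PR addresses by keeping a separate epoch-level metric.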
Using deepcopy:
Deepcopy is used because epoch_eval_metric needs to track the same metric as eval_metric, but separately. Since the EvalMetric objects used in fit() are typically small (and the copy is made only once per fit() call), this does not add significant overhead.
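A small sketch of why a deep copy is needed here (this is a toy class, not the PR's actual code; a plain assignment would alias the same metric state, so resetting the batch-wise metric would also wipe the epoch metric):

```python
import copy


class Accuracy:
    """Toy stand-in for an EvalMetric with mutable internal state."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct = 0
        self.total = 0

    def get(self):
        return self.correct / max(self.total, 1)


eval_metric = Accuracy()
epoch_eval_metric = copy.deepcopy(eval_metric)  # independent state, same config

eval_metric.update(5, 10)
epoch_eval_metric.update(5, 10)
eval_metric.reset()  # batch-wise metric may be reset mid-epoch

print(eval_metric.get())        # 0.0 after the reset
print(epoch_eval_metric.get())  # 0.5, unaffected by the reset
```

Had `epoch_eval_metric = eval_metric` been used instead, both names would point at one object and the epoch-level average would be lost on every reset.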
Comments
Testing done:
@indhub