This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-698] Correct train-metric log to reflect epoch metric #12182

Merged
merged 1 commit on Sep 19, 2018

Conversation

vandanavk
Contributor

@vandanavk vandanavk commented Aug 15, 2018

Description

The log self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val) gives the user the impression that the printed metric covers the entire epoch, which is misleading. Only when auto_reset=False is Train-<metric> the average over the entire epoch; when auto_reset=True, Train-<metric> reflects the metric value for just the remaining batches in the epoch (those after the last Speedometer reset).
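The difference between the two modes can be sketched with a toy metric. This is a minimal sketch only: SimpleAccuracy, run_epoch, and the batch data are hypothetical stand-ins for mxnet.metric.Accuracy and the Speedometer callback, not MXNet's actual API.

```python
class SimpleAccuracy:
    """Running accuracy over all updates since the last reset."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, labels, preds):
        self.correct += sum(1 for l, p in zip(labels, preds) if l == p)
        self.total += len(labels)

    def get(self):
        return self.correct / self.total if self.total else float("nan")

    def reset(self):
        self.correct = 0
        self.total = 0


def run_epoch(batches, auto_reset, log_interval=2):
    """Simulate Speedometer-style logging over one epoch."""
    metric = SimpleAccuracy()
    logged = []
    for i, (labels, preds) in enumerate(batches, 1):
        metric.update(labels, preds)
        if i % log_interval == 0:
            logged.append(round(metric.get(), 3))
            if auto_reset:
                metric.reset()  # next logged value covers only later batches
    # The final "Train-accuracy" line prints whatever is left in the metric:
    # the whole epoch if auto_reset=False, only the tail batches otherwise.
    return logged, round(metric.get(), 3)


# 5 toy batches of 2 samples each (correct predictions: 1, 2, 0, 2, 1)
batches = [
    ([1, 1], [1, 0]),
    ([1, 1], [1, 1]),
    ([1, 1], [0, 0]),
    ([1, 1], [1, 1]),
    ([1, 1], [1, 0]),
]
```

With auto_reset=False the final value is the true epoch average (6/10 = 0.6); with auto_reset=True it is the accuracy of only the last, unlogged batch (1/2 = 0.5), which is exactly the ambiguity this PR addresses.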

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Track epoch metric value separately
  • Print batch numbers for which Speedometer prints metric value
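The batch-range labels shown in the "After" logs below can be sketched as follows. speedometer_ranges is a hypothetical helper illustrating only the labeling logic, not MXNet's Speedometer implementation.

```python
def speedometer_ranges(num_batches, log_interval, auto_reset):
    """Return the '[start-end]' labels printed at each log interval.

    With auto_reset=True each range starts where the previous one ended
    (the metric was reset); with auto_reset=False every range starts at 0
    (the metric keeps accumulating from the start of the epoch).
    """
    ranges = []
    last_reset = 0
    for batch in range(log_interval, num_batches + 1, log_interval):
        start = last_reset if auto_reset else 0
        ranges.append("[%d-%d]" % (start, batch))
        last_reset = batch
    return ranges
```

For example, over 300 batches with a log interval of 100 this yields [0-100], [100-200], [200-300] when auto_reset=True, and [0-100], [0-200], [0-300] when auto_reset=False, matching the log excerpts below.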

Before

INFO:root:Epoch[0] Batch [100]  Speed: 45690.13 samples/sec     accuracy=0.772123
INFO:root:Epoch[0] Batch [200]  Speed: 50611.24 samples/sec     accuracy=0.898594
INFO:root:Epoch[0] Batch [300]  Speed: 54257.01 samples/sec     accuracy=0.921562
INFO:root:Epoch[0] Batch [400]  Speed: 57961.52 samples/sec     accuracy=0.934063
INFO:root:Epoch[0] Batch [500]  Speed: 44537.19 samples/sec     accuracy=0.931719
INFO:root:Epoch[0] Batch [600]  Speed: 48699.48 samples/sec     accuracy=0.935469
INFO:root:Epoch[0] Batch [700]  Speed: 41613.32 samples/sec     accuracy=0.943750
INFO:root:Epoch[0] Batch [800]  Speed: 42915.27 samples/sec     accuracy=0.941719
INFO:root:Epoch[0] Batch [900]  Speed: 52047.39 samples/sec     accuracy=0.950625
INFO:root:Epoch[0] Train-accuracy=0.944679
INFO:root:Epoch[0] Time cost=1.250
INFO:root:Epoch[0] Validation-accuracy=0.953125
INFO:root:Epoch[1] Batch [100]  Speed: 58464.58 samples/sec     accuracy=0.956219
INFO:root:Epoch[1] Batch [200]  Speed: 52709.13 samples/sec     accuracy=0.954844
INFO:root:Epoch[1] Batch [300]  Speed: 52257.74 samples/sec     accuracy=0.963125
INFO:root:Epoch[1] Batch [400]  Speed: 60261.37 samples/sec     accuracy=0.964219
INFO:root:Epoch[1] Batch [500]  Speed: 61615.53 samples/sec     accuracy=0.967969
INFO:root:Epoch[1] Batch [600]  Speed: 57549.34 samples/sec     accuracy=0.968906
INFO:root:Epoch[1] Batch [700]  Speed: 50833.12 samples/sec     accuracy=0.972031
INFO:root:Epoch[1] Batch [800]  Speed: 58602.68 samples/sec     accuracy=0.965469
INFO:root:Epoch[1] Batch [900]  Speed: 58724.73 samples/sec     accuracy=0.970156

After

With auto_reset=True

INFO:root:Epoch[0] Batch [0-100]	Speed: 46616.33 samples/sec	accuracy=0.764542
INFO:root:Epoch[0] Batch [100-200]	Speed: 40784.88 samples/sec	accuracy=0.902500
INFO:root:Epoch[0] Batch [200-300]	Speed: 45424.86 samples/sec	accuracy=0.926406
INFO:root:Epoch[0] Batch [300-400]	Speed: 51504.93 samples/sec	accuracy=0.936406
INFO:root:Epoch[0] Batch [400-500]	Speed: 45343.44 samples/sec	accuracy=0.939531
INFO:root:Epoch[0] Batch [500-600]	Speed: 42812.13 samples/sec	accuracy=0.940469
INFO:root:Epoch[0] Batch [600-700]	Speed: 43986.20 samples/sec	accuracy=0.948438
INFO:root:Epoch[0] Batch [700-800]	Speed: 31437.88 samples/sec	accuracy=0.943438
INFO:root:Epoch[0] Batch [800-900]	Speed: 20493.63 samples/sec	accuracy=0.945469
INFO:root:Epoch[0] Train-accuracy (averaged over entire epoch)=0.917594
INFO:root:Epoch[0] Time cost=1.562
INFO:root:Epoch[0] Validation-accuracy=0.952627
INFO:root:Epoch[1] Batch [0-100]	Speed: 47150.69 samples/sec	accuracy=0.952042
INFO:root:Epoch[1] Batch [100-200]	Speed: 50114.25 samples/sec	accuracy=0.960000
INFO:root:Epoch[1] Batch [200-300]	Speed: 32850.97 samples/sec	accuracy=0.960000
INFO:root:Epoch[1] Batch [300-400]	Speed: 42086.20 samples/sec	accuracy=0.962344
INFO:root:Epoch[1] Batch [400-500]	Speed: 43728.10 samples/sec	accuracy=0.963906
INFO:root:Epoch[1] Batch [500-600]	Speed: 50114.34 samples/sec	accuracy=0.964063
INFO:root:Epoch[1] Batch [600-700]	Speed: 41799.55 samples/sec	accuracy=0.969375
INFO:root:Epoch[1] Batch [700-800]	Speed: 44728.66 samples/sec	accuracy=0.961875
INFO:root:Epoch[1] Batch [800-900]	Speed: 45162.34 samples/sec	accuracy=0.968594
INFO:root:Epoch[1] Train-accuracy (averaged over entire epoch)=0.962603
INFO:root:Epoch[1] Time cost=1.375
INFO:root:Epoch[1] Validation-accuracy=0.970044

With auto_reset=False

INFO:root:Epoch[0] Batch [0-100]	Speed: 52076.47 samples/sec	accuracy=0.780631
INFO:root:Epoch[0] Batch [0-200]	Speed: 55338.05 samples/sec	accuracy=0.843595
INFO:root:Epoch[0] Batch [0-300]	Speed: 52772.13 samples/sec	accuracy=0.870847
INFO:root:Epoch[0] Batch [0-400]	Speed: 51781.13 samples/sec	accuracy=0.887936
INFO:root:Epoch[0] Batch [0-500]	Speed: 42743.89 samples/sec	accuracy=0.898204
INFO:root:Epoch[0] Batch [0-600]	Speed: 39555.73 samples/sec	accuracy=0.905236
INFO:root:Epoch[0] Batch [0-700]	Speed: 43493.87 samples/sec	accuracy=0.910463
INFO:root:Epoch[0] Batch [0-800]	Speed: 54414.84 samples/sec	accuracy=0.914677
INFO:root:Epoch[0] Batch [0-900]	Speed: 49354.10 samples/sec	accuracy=0.919170
INFO:root:Epoch[0] Train-accuracy (averaged over entire epoch)=0.920392
INFO:root:Epoch[0] Time cost=1.240
INFO:root:Epoch[0] Validation-accuracy=0.958201
INFO:root:Epoch[1] Batch [0-100]	Speed: 52614.29 samples/sec	accuracy=0.959313
INFO:root:Epoch[1] Batch [0-200]	Speed: 44468.21 samples/sec	accuracy=0.960976
INFO:root:Epoch[1] Batch [0-300]	Speed: 49660.88 samples/sec	accuracy=0.963767
INFO:root:Epoch[1] Batch [0-400]	Speed: 55187.98 samples/sec	accuracy=0.964347
INFO:root:Epoch[1] Batch [0-500]	Speed: 46373.76 samples/sec	accuracy=0.965007
INFO:root:Epoch[1] Batch [0-600]	Speed: 55060.86 samples/sec	accuracy=0.965214
INFO:root:Epoch[1] Batch [0-700]	Speed: 44390.48 samples/sec	accuracy=0.966588
INFO:root:Epoch[1] Batch [0-800]	Speed: 51865.47 samples/sec	accuracy=0.966604
INFO:root:Epoch[1] Batch [0-900]	Speed: 48503.70 samples/sec	accuracy=0.966860
INFO:root:Epoch[1] Train-accuracy (averaged over entire epoch)=0.966968
INFO:root:Epoch[1] Time cost=1.209
INFO:root:Epoch[1] Validation-accuracy=0.971039

Using deepcopy:
Deepcopy is used because epoch_eval_metric needs to track the same metric as eval_metric, but separately. Since the EvalMetric objects used in fit() are typically small (and the copy is made only once per fit() call), this does not add significant overhead.
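The deepcopy approach can be sketched with a minimal stand-in class (CountMetric is hypothetical; the real code copies an mxnet.metric.EvalMetric):

```python
import copy

class CountMetric:
    """Toy metric that just counts updates, standing in for EvalMetric."""
    def __init__(self):
        self.num = 0
    def update(self):
        self.num += 1
    def reset(self):
        self.num = 0

eval_metric = CountMetric()
# One deep copy per fit() call: an independent object tracking the same metric.
epoch_eval_metric = copy.deepcopy(eval_metric)

for _ in range(3):            # three training batches
    eval_metric.update()
    epoch_eval_metric.update()

eval_metric.reset()           # e.g. reset by Speedometer(auto_reset=True)
# epoch_eval_metric is unaffected and keeps the whole-epoch count.
```

Because the copy is deep, resetting one object never touches the other's internal state, which is what lets the epoch-level metric survive Speedometer's per-interval resets.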

Comments

Testing done:

  • Tested image classification example (train_mnist.py, train_imagenet.py) with these changes
  • Tested with Speedometer's auto_reset=True and auto_reset=False
  • Tested with default metric 'accuracy', composite metric ['accuracy','rmse'], and EvalMetric object
  • Tested tools/parse_log.py

@indhub

@Roshrini
Member

As far as I know, various benchmarking scripts in the repo use this printed metric. If we want to remove this metric, scripts will need to be changed too.

@sandeep-krishnamurthy sandeep-krishnamurthy added Python pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Aug 17, 2018
@vandanavk
Contributor Author

This PR is on hold - waiting for MXNet 1.3 release

@vandanavk
Contributor Author

This PR is ready for review again. Fixed parse_log to accommodate this change.

@vandanavk
Contributor Author

@sandeep-krishnamurthy Please change it back to pr-awaiting-review

@sandeep-krishnamurthy sandeep-krishnamurthy added pr-awaiting-review PR is waiting for code review and removed pr-awaiting-response PR is reviewed and waiting for contributor to respond labels Aug 23, 2018
@lupesko
Contributor

lupesko commented Aug 24, 2018

@vandanavk, thanks for the contribution!
A few comments/asks:

  • Can you please file a JIRA (per the community's decision to have JIRA for non trivial changes)
  • Can you paste "before" and "after" log output, to clarify how the change impacts the user experience?
  • "Risk of breaking scripts that look for this text" - how are we mitigating this risk? are you referring to MXNet scripts, or customer-external scripts?
  • "Changes" section - please update boilerplate or remove.

@vandanavk vandanavk changed the title Remove Epoch training metric log [MXNET-698] Remove Epoch training metric log Aug 24, 2018
@vandanavk
Contributor Author

Thanks for the inputs @lupesko.
AFAIK parse_log.py and a file in one of the examples are the only ones in MXNet that parse this log. We are not sure about external scripts. I'll send a note on dev list notifying about this change, so that authors can either change their script or let us know their thoughts on this change.

@nswamy
Member

nswamy commented Aug 27, 2018

Can you not just print the batch range [10-20] along with the existing metric? There might be user scripts that depend on these values; I don't think we should drop this log.

@vandanavk
Contributor Author

Thanks for the input @nswamy

Editing the log in the following way will ensure minimal impact on scripts that parse this log.

INFO:root:Epoch[0] Batch [900]	Speed: 48105.51 samples/sec	accuracy=0.946875
INFO:root:Epoch[0] After batch [938], Train-accuracy=0.951436
INFO:root:Epoch[0] Time cost=2.271
INFO:root:Epoch[0] Validation-accuracy=0.952130

The intention of the PR was to avoid giving the impression that the metric printed in this log is for the entire epoch.

Does this look like a better idea @nswamy @indhub ?

@nswamy
Member

nswamy commented Aug 27, 2018

Does "INFO:root:Epoch[0] After batch [938], Train-accuracy=0.951436" mean that for Batch [0-938] the accuracy is 0.951436?

@vandanavk
Contributor Author

vandanavk commented Aug 28, 2018

@nswamy The following is the current behavior of the code. We need to reflect this in the "train-accuracy" log in base_module.py

With Speedometer's auto_reset=True

INFO:root:Epoch[0] Batch [1-100]  Speed: 45690.13 samples/sec     accuracy=0.772123
INFO:root:Epoch[0] Batch [101-200]  Speed: 50611.24 samples/sec     accuracy=0.898594
.
.
.
INFO:root:Epoch[0] Batch [801-900]  Speed: 52047.39 samples/sec     accuracy=0.950625
INFO:root:Epoch[0] Batch [901-938] Train-accuracy=0.944679
INFO:root:Epoch[0] Time cost=1.250
INFO:root:Epoch[0] Validation-accuracy=0.953125

With auto_reset=False

INFO:root:Epoch[0] Batch [1-100]	Speed: 16628.82 samples/sec	accuracy=0.759746
INFO:root:Epoch[0] Batch [1-200]	Speed: 40806.39 samples/sec	accuracy=0.828980
.
.
.
INFO:root:Epoch[0] Batch [1-900]	Speed: 42613.81 samples/sec	accuracy=0.911470
INFO:root:Epoch[0] Batch [1-938] Train-accuracy=0.912830
INFO:root:Epoch[0] Time cost=1.811
INFO:root:Epoch[0] Validation-accuracy=0.956509

@nswamy
Member

nswamy commented Aug 31, 2018

Please also add training-accuracy (averaged for the entire epoch) at the end of every epoch.

@vandanavk vandanavk force-pushed the metric branch 2 times, most recently from 2840a58 to 8f20f7f Compare August 31, 2018 23:12
@vandanavk
Contributor Author

vandanavk commented Aug 31, 2018

@nswamy As discussed offline, I have submitted a new commit with the following changes: a separate metric is now maintained for the entire epoch (in addition to the batch-wise metric), the text "(averaged for the entire epoch)" has been added, and the batch range is included in Speedometer prints too.

@vandanavk vandanavk changed the title [MXNET-698] Remove Epoch training metric log [MXNET-698] Correct train-metric log to reflect epoch metric Aug 31, 2018
@vandanavk vandanavk changed the title [MXNET-698] Correct train-metric log to reflect epoch metric [MXNET-698] [WIP] Correct train-metric log to reflect epoch metric Sep 5, 2018
@vandanavk vandanavk changed the title [MXNET-698] [WIP] Correct train-metric log to reflect epoch metric [MXNET-698] Correct train-metric log to reflect epoch metric Sep 10, 2018
Contributor

@sandeep-krishnamurthy sandeep-krishnamurthy left a comment


Thanks @vandanavk for your contributions.

This is clearer. One dependency I know of does not use the epoch details as a regex to parse metrics - https://github.com/awslabs/deeplearning-benchmark/blob/master/task_config_template.cfg

However, let us keep an eye on the issues, in case users depend on this log for parsing epoch details.

Labels
pr-awaiting-review PR is waiting for code review Python
5 participants