[MXNET-698] Correct train-metric log to reflect epoch metric #12182
Conversation
As far as I know, various benchmarking scripts in the repo use this printed metric. If we want to remove this metric, scripts will need to be changed too.
This PR is on hold - waiting for MXNet 1.3 release
This PR is ready for review again. Fixed parse_log to accommodate this change.
@sandeep-krishnamurthy Please change it back to pr-awaiting-review |
@vandanavk, thanks for the contribution!
Thanks for the input, @lupesko.
Can you not just print the batch range [10-20] along with the existing metric? There might be user scripts that depend on these values; I don't think we should drop this log.
Thanks for the input @nswamy. Editing the log in the following way will ensure minimum impact on scripts that parse this log.
The intention of the PR was to avoid giving the impression that the metric printed in this log is for the entire epoch.
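To illustrate why appending a batch range need not break downstream parsers, here is a hypothetical sketch of the kind of regex a benchmark script might use (the log lines and pattern below are illustrative assumptions, not taken from any actual script in the repo):

```python
import re

# A parser that anchors on "Train-<metric>=<value>" rather than on the
# whole line still matches after extra text (e.g. a batch range) is added.
pattern = re.compile(r"Train-(\w+)=([\d.]+)")

old_line = "Epoch[0] Train-accuracy=0.4830"                # before the change
new_line = "Epoch[0] Batch [10-20] Train-accuracy=0.4830"  # hypothetical new format

print(pattern.search(old_line).groups())  # ('accuracy', '0.4830')
print(pattern.search(new_line).groups())  # ('accuracy', '0.4830')
```

Parsers that match the full line verbatim would still need updating, which is why keeping the existing fields in place matters.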
@nswamy The following is the current behavior of the code. We need to reflect this in the "train-accuracy" log in base_module.py. With Speedometer's
With
Please also add training-accuracy (averaged for the entire epoch) at the end of every epoch.
2840a58 to 8f20f7f
@nswamy As discussed offline, I have submitted a new commit with the following changes: a separate metric is maintained for the entire epoch vs the batch-wise metric, and the text "(averaged for the entire epoch)" and the batch range have been added to Speedometer prints too.
Thanks @vandanavk for your contributions.
This is clearer. One dependency I know of that does not use Epoch details as a regex to parse metrics: https://github.com/awslabs/deeplearning-benchmark/blob/master/task_config_template.cfg
However, let us keep an eye on issues, in case users depend on this log for parsing epoch details.
Description
The log
self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)
gives the user the impression that the metric printed is for the entire epoch, which is misleading. When auto_reset=False,
Train-<metric> is the average for the entire epoch, and when auto_reset=True, Train-<metric> reflects the metric value for the remaining batches in the epoch.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Before
After
With auto_reset=True
With auto_reset=False
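To make the auto_reset difference concrete, here is a minimal pure-Python sketch of the periodic-reset behavior described above. This is an illustrative toy, not MXNet's actual Speedometer or EvalMetric code; the names `RunningMetric` and `run_epoch` are hypothetical:

```python
class RunningMetric:
    """Toy accuracy-style metric mimicking EvalMetric's reset semantics."""
    def __init__(self):
        self.sum = 0.0
        self.num = 0

    def update(self, value):
        self.sum += value
        self.num += 1

    def reset(self):
        self.sum = 0.0
        self.num = 0

    def get(self):
        return self.sum / max(self.num, 1)


def run_epoch(batch_values, frequent, auto_reset):
    """Return the value that would be logged as Train-<metric> at epoch end."""
    metric = RunningMetric()
    for i, v in enumerate(batch_values, start=1):
        metric.update(v)
        if auto_reset and i % frequent == 0:
            metric.reset()  # Speedometer-style periodic reset
    return metric.get()


vals = [0.2, 0.4, 0.6, 0.8]  # per-batch accuracies for a 4-batch epoch
# auto_reset=False: average over all 4 batches (~0.5)
print(run_epoch(vals, frequent=3, auto_reset=False))
# auto_reset=True: only the batch after the last reset counts (~0.8)
print(run_epoch(vals, frequent=3, auto_reset=True))
```

With auto_reset=True, the epoch-end value covers only the batches since the last reset, which is exactly the misleading behavior the PR addresses by keeping a separate epoch-level metric.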
Using deepcopy:
Deepcopy is used because epoch_eval_metric needs to track the same metric as eval_metric, but separately. Since the EvalMetric objects used in fit() are typically small (and the copy is made only once per fit() call), this does not add significant overhead.
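A small sketch of why a deep copy is needed here (this is a toy class, not the PR's actual code; a plain assignment would alias the same metric state, so resetting the batch-wise metric would also wipe the epoch metric):

```python
import copy


class Accuracy:
    """Toy stand-in for an EvalMetric with mutable internal state."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, correct, total):
        self.correct += correct
        self.total += total

    def reset(self):
        self.correct = 0
        self.total = 0

    def get(self):
        return self.correct / max(self.total, 1)


eval_metric = Accuracy()
epoch_eval_metric = copy.deepcopy(eval_metric)  # independent state, same config

eval_metric.update(5, 10)
epoch_eval_metric.update(5, 10)
eval_metric.reset()  # batch-wise metric may be reset mid-epoch

print(eval_metric.get())        # 0.0 after the reset
print(epoch_eval_metric.get())  # 0.5, unaffected by the reset
```

Had `epoch_eval_metric = eval_metric` been used instead, both names would point at one object and the epoch-level average would be lost on every reset.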
Comments
Testing done:
@indhub