This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

use nd for accuracy calculation #9583

Merged
merged 2 commits into apache:master from szha:metric_nd on Jan 27, 2018

Conversation

@szha (Member) commented Jan 27, 2018

Description

use ndarray for accuracy calculation.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Comments

  • Backward compatible.

"""
check_label_shapes(labels, preds)

for label, pred_label in zip(labels, preds):
if pred_label.shape != label.shape:
pred_label = ndarray.argmax(pred_label, axis=self.axis)
pred_label = pred_label.asnumpy().astype('int32')
label = label.asnumpy().astype('int32')
pred_label = pred_label.astype('int32')
Member:

int64 is better?

@szha (Member, Author) replied Jan 27, 2018:

This requires more space, which can matter when the number of prediction classes is large (such as in NLP applications). Should I make it an option?

Member:

should be good in most cases.

@zhreshold zhreshold merged commit f5f1b91 into apache:master Jan 27, 2018
@szha szha deleted the metric_nd branch January 27, 2018 13:10
@piiswrong (Contributor) commented:

This is not as simple as changing numpy array to ndarray. See #7995.
There are some cases where numpy arrays are faster. Please check performance against all cases.

Also, flatten reshapes to 2 dimensions. This could cause problems when the output is 1-dimensional. Use reshape((-1,)) instead.
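
(A minimal sketch of the shape issue described above, assuming a standard MXNet install; the example arrays and printed shapes are illustrative, not from the PR.)

    import mxnet.ndarray as nd

    pred_label = nd.array([0, 1, 1])   # a 1-dimensional output, shape (3,)

    # NDArray.flatten collapses everything after the first axis, so it always
    # returns a 2-D array; on a 1-D input the result comes back 2-D (e.g. (3, 1)),
    # which no longer lines up with a (3,) label for an elementwise comparison.
    print(pred_label.flatten().shape)

    # reshape((-1,)) always produces a 1-D array of the same elements.
    print(pred_label.reshape((-1,)).shape)   # (3,)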

@szha (Member, Author) commented Jan 29, 2018

Thanks for the comment. I will switch to reshape.

Regarding performance, that change was made before the CPU kernel optimization work, so we should have better performance now. Given that, I'm not sure it's still worth investing time and code complexity here. Also, I'm not entirely convinced that the frontend code should take over the backend's job of selecting an implementation just for the sake of performance. What do you think?

That said, if there is an immediate performance hit from switching to ND, I'm open to switching back to numpy. Since it would be infeasible for me to check all cases, is there any specific observation or reason on your side that requires attention to its performance?

@piiswrong (Contributor) commented Jan 29, 2018

Since this is a performance improvement, please verify that it indeed improves performance, at least for common cases.

Please verify this change does not bring back the negative performance impact reported in #7995.
Please also verify that this change does improve performance for the case @zackchase reported.

@ptrendx (Member) commented Feb 6, 2018

@szha Did you do the performance experiments that @piiswrong asked about (and what were the results)? With this commit we see a 20% perf regression on 8 Voltas in ResNet.

@szha (Member, Author) commented Feb 6, 2018

We intend to start tracking the performance of metrics going forward; see #9705 for some numbers. The accuracy implementation has been updated since this commit. Does the regression still exist on master head?

@DickJC123 (Contributor) commented:

We saw the perf regression on a build that includes this PR and also ed823b2 "proper flatten in acc (#9619)."

szha added a commit to szha/mxnet that referenced this pull request Feb 7, 2018
eric-haibin-lin pushed a commit that referenced this pull request Feb 7, 2018
* Revert "avoid per-batch blocking in metric (#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (#9583)"

This reverts commit f5f1b91.

* keep doc change
szha added a commit that referenced this pull request Feb 7, 2018
* Revert "avoid per-batch blocking in metric (#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (#9583)"

This reverts commit f5f1b91.

* keep doc change
if pred_label.context != label.context:
    pred_label = pred_label.as_in_context(label.context)

self.sum_metric += (pred_label.flatten() == label.flatten()).sum().asscalar()
Contributor:

asscalar is equivalent to asnumpy()[0].

This PR did not solve the problem of the metric being computed in the numpy world.

Member Author:

The computation happens before asnumpy() is called, so nothing happens in the numpy world other than passing out a scalar value.

May I ask what your interest is in this PR? Do you have a use case that benefits from using ndarray for the metric?

Contributor:

Maybe my understanding is flawed, but in order to bring this value back to the CPU into self.sum_metric, we need to do a wait_on_read on pred_label, right?
If we have a loop like:

for batch in data:
    ...
    metric.update()

we will wait for pred_label to be computed before we load the next batch onto the GPU.

Whereas if we had

self.sum_metric = nd.zeros(1, ctx=ctx)
self.sum_metric += (pred_label.flatten() == label.flatten()).sum()

then we wouldn't block on this operation before loading the next batch of data; it would just enqueue more operations on the backend.

I have run several experiments where computing the loss with .asscalar() on every loop slows training by 2-25%.

see:
ilkarman/DeepLearningFrameworks#55
zackchase/mxnet-the-straight-dope#455
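
(A minimal sketch of the two update styles being contrasted above; ctx and the toy batches are illustrative placeholders, not code from the PR.)

    import mxnet as mx
    from mxnet import nd

    ctx = mx.cpu()   # use mx.gpu(0) when available; the overlap benefit shows up there
    # Toy stand-ins for (label, prediction) batches already on ctx.
    batches = [(nd.array([0, 1, 0], ctx=ctx), nd.array([0, 1, 1], ctx=ctx))] * 4

    sum_blocking = 0.0                 # plain Python float, updated synchronously
    sum_async = nd.zeros(1, ctx=ctx)   # running sum kept on the device

    for label, pred_label in batches:
        correct = (pred_label.reshape((-1,)) == label.reshape((-1,))).sum()
        # Blocking style: asscalar() waits for the result on every batch, so the
        # next batch cannot be queued until this comparison finishes.
        sum_blocking += correct.asscalar()
        # Non-blocking style: the addition is just enqueued on the engine; nothing
        # waits here, so data loading and compute can overlap.
        sum_async += correct

    # One synchronization at reporting time is enough for the async version.
    print(sum_blocking, sum_async.asscalar())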

Contributor:

@szha to clarify what I mean: I don't think using numpy is slower in itself; it's that having a blocking operation in the loop limits the parallelization happening in the backend and leads to GPU starvation.

Member:

However, completely using the non-blocking logic will cause other problems. To be more specific, the allocated NDArrays cannot be reused and will eventually cause an OOM. We should use asscalar() to avoid this.

@ThomasDelteil (Contributor) Mar 19, 2018:

Thanks for the clarification @sxjscience, but why is that? Once processed in the computation graph, aren't they garbage collected?

Member:

I think it's because ops are pushed at a much faster rate than the real computation runs. The graph keeps expanding, and the allocation ops are executed to allocate new space (even before the actual computation is performed). We have to call a blocking operation at some point to make sure the current calculation in the graph has completed. CC @piiswrong for this.

Contributor:

Thanks, yes, that's my understanding. However, I think it should be left to the user to decide when to block, since it depends highly on their GPU and model size (e.g. every 100 batches or every epoch). Also, is there a reason why the accuracy is stored on the CPU rather than on a specific context? My measurements showed great improvements when storing the accuracy on GPU. If you don't mind, maybe we can continue the discussion in #9571.
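
(A sketch, not part of the PR, of the compromise being discussed: accumulate asynchronously on the metric's context and only block every N batches so the engine's queue stays bounded. sync_every, ctx, and the toy batches are illustrative names.)

    import mxnet as mx
    from mxnet import nd

    ctx = mx.cpu()        # use mx.gpu(0) when available
    sync_every = 100      # user-chosen blocking interval (e.g. every 100 batches)
    # Toy stand-ins for (label, prediction) batches already on ctx.
    batches = [(nd.array([0, 1, 0], ctx=ctx), nd.array([0, 1, 1], ctx=ctx))] * 250

    sum_metric = nd.zeros(1, ctx=ctx)   # accuracy numerator kept on the device
    num_inst = 0

    for i, (label, pred_label) in enumerate(batches):
        sum_metric += (pred_label.reshape((-1,)) == label.reshape((-1,))).sum()
        num_inst += label.size
        if (i + 1) % sync_every == 0:
            # Periodic blocking read: drains the queued ops so temporaries can be
            # reclaimed, bounding memory growth without blocking on every batch.
            sum_metric.wait_to_read()

    print(sum_metric.asscalar() / num_inst)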

@sxjscience (Member) commented Mar 19, 2018 via email

eric-haibin-lin added a commit that referenced this pull request Mar 27, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* use nd for accuracy calculation

* check for context
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Revert "avoid per-batch blocking in metric (apache#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (apache#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (apache#9583)"

This reverts commit f5f1b91.

* keep doc change
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Revert "avoid per-batch blocking in metric (apache#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (apache#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (apache#9583)"

This reverts commit f5f1b91.

* keep doc change
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018

Successfully merging this pull request may close these issues.

  • ACCURACY IS USING NUMPY, URGENT FIX
  • Misleading calculation of mxnet.metric.Accuracy
7 participants