Conversation
""" | ||
check_label_shapes(labels, preds) | ||
|
||
for label, pred_label in zip(labels, preds): | ||
if pred_label.shape != label.shape: | ||
pred_label = ndarray.argmax(pred_label, axis=self.axis) | ||
pred_label = pred_label.asnumpy().astype('int32') | ||
label = label.asnumpy().astype('int32') | ||
pred_label = pred_label.astype('int32') |
`int64` is better?
`int64` requires more space, which can matter when the number of prediction classes is large (such as in NLP applications). Should I make it an option?
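For a sense of the trade-off, here is a rough numpy sketch (numpy as a stand-in for `mx.nd`; the array size is an arbitrary assumption) of the extra memory `int64` labels cost over `int32`:

```python
import numpy as np

# Hypothetical label buffer for a large-output task (e.g. NLP).
n = 1_000_000
as_int32 = np.zeros(n, dtype='int32')
as_int64 = np.zeros(n, dtype='int64')

print(as_int32.nbytes)  # 4000000 bytes
print(as_int64.nbytes)  # 8000000 bytes: double the footprint
```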
Should be good in most cases.
This is not as simple as changing a numpy array to an ndarray. See #7995. Also, `flatten` reshapes to 2 dimensions, which could cause problems when the output is 1-dimensional. Use `reshape((-1,))` instead.
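To illustrate the pitfall with a numpy sketch (numpy stands in for `mx.nd`; `pred.reshape(pred.shape[0], -1)` mimics a flatten-style collapse to 2-D), a 2-D/1-D shape mismatch broadcasts into a wrong match count, while `reshape((-1,))` stays element-wise:

```python
import numpy as np

label = np.array([0, 1, 1, 0, 1])   # 1-D labels, shape (5,)
pred  = np.array([0, 1, 0, 0, 1])   # 1-D predicted classes, shape (5,)

# A flatten-style collapse to 2-D (keep the first axis) turns a
# 1-D array into shape (5, 1).
pred_2d = pred.reshape(pred.shape[0], -1)

# Comparing (5, 1) against (5,) broadcasts to a (5, 5) matrix,
# so the match count is wrong.
wrong = (pred_2d == label).sum()

# reshape((-1,)) keeps both sides 1-D; the comparison stays element-wise.
right = (pred.reshape((-1,)) == label.reshape((-1,))).sum()

print(wrong, right)  # 12 4
```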
Thanks for the comment. I will switch to `reshape`. Regarding performance, that change was made before the CPU kernel optimization work, so we should have better performance now. Given that, I'm not sure it's still worth investing time and code complexity in this. I'm also not entirely convinced that the frontend code should do the backend's job, such as selecting an implementation, just for the sake of performance. What do you think? That said, if there's an immediate performance hit from switching to ND, I'm open to switching back to numpy. Since it would be infeasible for me to check all cases, is there any specific observation or reason on your side that requires attention to performance?
Since this is a performance improvement, please verify that it indeed improves performance, at least for common cases, and that it does not bring back the negative performance impact reported in #7995.
@szha Did you do the performance experiments that @piiswrong asked you about (and what were the results)? With this commit we see a 20% perf regression on 8 Voltas in ResNet.
We intend to start tracking the performance of metrics going forward; see #9705 for some numbers. The accuracy implementation has been updated since this commit. Does the regression still exist on master head?
This reverts commit f5f1b91.
```diff
+        if pred_label.context != label.context:
+            pred_label = pred_label.as_in_context(label.context)
+
+        self.sum_metric += (pred_label.flatten() == label.flatten()).sum().asscalar()
```
`asscalar` is equivalent to `asnumpy()[0]`. This PR did not solve the problem of the metric being computed in the numpy world.
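A tiny numpy sketch of that equivalence (`.item()` and `[0]` play the roles of `asscalar()` and `asnumpy()[0]` on a size-1 result):

```python
import numpy as np

# A size-1 result, as produced by summing a boolean comparison.
total = (np.array([1, 2, 3]) == np.array([1, 0, 3])).sum().reshape((1,))

via_index = total[0]      # plays the role of asnumpy()[0]
via_item = total.item()   # plays the role of asscalar()
print(via_index == via_item)  # True
```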
Computation happens before `asnumpy()` is called, so nothing happens in the numpy world other than passing out a scalar value.
May I ask what your interest is in this PR? Do you have a use case that benefits from using ndarray for metric?
Maybe my understanding is flawed, but in order to bring this value back to the CPU into `self.sum_metric`, don't we need to do a `wait_to_read` on `pred_label`?
If we have a loop like:

```python
for batch in data:
    ...
    metric.update()
```

we will wait for `pred_label` to be computed before we load the next batch on the GPU.
Whereas if we had:

```python
self.sum_metric = nd.zeros(1, ctx=ctx)
self.sum_metric += (pred_label.flatten() == label.flatten()).sum()
```

then we wouldn't block on this operation before loading the next batch of data; it would just enqueue more operations on the backend.
I have run several experiments where calling `.asscalar()` on the loss in every loop iteration slows training by 2-25%.
see:
ilkarman/DeepLearningFrameworks#55
zackchase/mxnet-the-straight-dope#455
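The two accumulation patterns under discussion can be sketched with numpy as a stand-in for `mx.nd` (the data is synthetic; numpy cannot show the asynchrony itself, only that both patterns yield the same number while moving the scalar conversion out of the per-batch loop):

```python
import numpy as np

rng = np.random.default_rng(0)
batches = [(rng.integers(0, 2, 64), rng.integers(0, 2, 64)) for _ in range(10)]

# Pattern A: convert to a Python scalar every batch (the asscalar() style);
# in MXNet this forces a synchronization point on each iteration.
sum_a = 0.0
for pred, label in batches:
    sum_a += float((pred == label).sum())

# Pattern B: accumulate into an array and convert only once at the end;
# in MXNet the per-batch work would stay enqueued on the device.
sum_b = np.zeros(1)
for pred, label in batches:
    sum_b += (pred == label).sum()

print(sum_a == sum_b.item())  # True: same result, different sync points
```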
@szha to clarify what I mean: I don't think using numpy is slower in itself; it's that having a blocking operation in the loop limits the parallelization happening in the backend and leads to GPU starvation.
However, completely using the non-blocking logic will cause other problems. Specifically, the allocated NDArrays cannot be reused and will eventually cause an OOM. We should use `asscalar()` to avoid this.
Thanks for the clarification @sxjscience, but why is that? Once processed in the computation graph, aren't they garbage collected?
I think it's because the ops are pushed at a much faster speed than the real computation. The graph will keep expanding, and the allocation ops will be executed to allocate new space (even before the actual computation is performed). We have to call a blocking operator at some point to make sure that the current calculation in the graph has completed. CC @piiswrong for this.
Thanks, yes, that's my understanding. However, I think it should be left to the user to decide when to block, since it depends highly on their GPU and model size (e.g. every 100 batches or every epoch). Also, is there a reason why the accuracy is stored on the CPU rather than on a specific context? My measurements showed great improvements when storing the accuracy on the GPU. If you don't mind, we can continue the discussion in #9571.
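That user-controlled blocking policy can be sketched with numpy as a stand-in and a hypothetical `SYNC_EVERY` knob (not an MXNet API): the running sum stays an array, and only every N batches is it folded into a Python scalar, bounding the amount of queued work:

```python
import numpy as np

rng = np.random.default_rng(1)
SYNC_EVERY = 100          # hypothetical user-chosen knob, not an MXNet API

running = np.zeros(1)     # stays an array (device-side in the MXNet analogue)
total = 0.0
count = 0

for i in range(250):
    pred = rng.integers(0, 2, 32)
    label = rng.integers(0, 2, 32)
    running += (pred == label).sum()
    count += 32
    if (i + 1) % SYNC_EVERY == 0:
        # Periodic blocking point: fold the accumulated sum into a Python
        # scalar so the amount of queued work stays bounded.
        total += running.item()
        running[:] = 0.0

total += running.item()   # drain whatever is left at the end
print(total / count)      # overall running accuracy in [0, 1]
```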
Sure. Let’s move the discussion there.
* Bump 1.1 (#192)
* bump
* also update base.h
* revert website changes
* Update index.html
* update news.md (#191)
* Update NEWS.md
* Update README.md
* refactor regression ops to nnvm interface (#9540)
* refactor regression ops
* fix err for instantiation of minus_sign
* remove useless header file init_op.h
* replace with macro and address other comments
* update
* minor revise docs
* add mae test
* Update KEYS
* Update NEWS.md
* fixed links that were missng ndarray folder path (#9618)
* Fixed 4 broken links (#9698)
* fixed pylint for long line disable and 1 broken link
* Update NEWS.md
* Update NOTICE (#9706)
* revert acc changes (#9731)
* Revert "avoid per-batch blocking in metric (#9636)" This reverts commit 3fe694e.
* Revert "proper flatten in acc (#9619)" This reverts commit ed823b2.
* Revert "use nd for accuracy calculation (#9583)" This reverts commit f5f1b91.
* keep doc change
* PGP keys add liuyizhi AT apache.org (#9728)
* Add my key (#9736)
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)
* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)" This reverts commit 8930d96.
* Some more LICENSE fixes
* Adding some more packages to the LICENSE file
* Adding dependencies of dependencies
* update navbar model zoo link (#9749)
* update
* initial commit
* clean up
* refactor
* fix test
* use nd for accuracy calculation
* check for context
* Revert "avoid per-batch blocking in metric (apache#9636)" This reverts commit 3fe694e. * Revert "proper flatten in acc (apache#9619)" This reverts commit ed823b2. * Revert "use nd for accuracy calculation (apache#9583)" This reverts commit f5f1b91. * keep doc change
* Bump 1.1 (apache#192) * bump * also update base.h * revert website changes * Update index.html * update news.md (apache#191) * Update NEWS.md * Update README.md * refactor regression ops to nnvm interface (apache#9540) * refactor regression ops * fix err for instantiation of minus_sign * remove useless header file init_op.h * replace with macro and address other comments * update * minor revise docs * add mae test * Update KEYS * Update NEWS.md * fixed links that were missng ndarray folder path (apache#9618) * Fixed 4 broken links (apache#9698) * Fixed 4 broken links * fixed pylint for long line disable and 1 broken link * Update NEWS.md * Update NOTICE (apache#9706) * revert acc changes (apache#9731) * Revert "avoid per-batch blocking in metric (apache#9636)" This reverts commit 3fe694e. * Revert "proper flatten in acc (apache#9619)" This reverts commit ed823b2. * Revert "use nd for accuracy calculation (apache#9583)" This reverts commit f5f1b91. * keep doc change * PGP keys add liuyizhi AT apache.org (apache#9728) * Add my key (apache#9736) * [REVIEW REQUIRED] Revert PR apache#9484 & add additional dependency licenses to LICENSE file (apache#9701) * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (apache#9484)" This reverts commit 8930d96. * Some more LICENSE fixes * Adding some more packages to the LICENSE file * Adding dependencies of dependencies * update navbar model zoo link (apache#9749) * update navbar model zoo link * update * initial commit * clean up * refactor * fix test
* use nd for accuracy calculation * check for context
* Revert "avoid per-batch blocking in metric (apache#9636)" This reverts commit 3fe694e. * Revert "proper flatten in acc (apache#9619)" This reverts commit ed823b2. * Revert "use nd for accuracy calculation (apache#9583)" This reverts commit f5f1b91. * keep doc change
* Bump 1.1 (apache#192) * bump * also update base.h * revert website changes * Update index.html * update news.md (apache#191) * Update NEWS.md * Update README.md * refactor regression ops to nnvm interface (apache#9540) * refactor regression ops * fix err for instantiation of minus_sign * remove useless header file init_op.h * replace with macro and address other comments * update * minor revise docs * add mae test * Update KEYS * Update NEWS.md * fixed links that were missng ndarray folder path (apache#9618) * Fixed 4 broken links (apache#9698) * Fixed 4 broken links * fixed pylint for long line disable and 1 broken link * Update NEWS.md * Update NOTICE (apache#9706) * revert acc changes (apache#9731) * Revert "avoid per-batch blocking in metric (apache#9636)" This reverts commit 3fe694e. * Revert "proper flatten in acc (apache#9619)" This reverts commit ed823b2. * Revert "use nd for accuracy calculation (apache#9583)" This reverts commit f5f1b91. * keep doc change * PGP keys add liuyizhi AT apache.org (apache#9728) * Add my key (apache#9736) * [REVIEW REQUIRED] Revert PR apache#9484 & add additional dependency licenses to LICENSE file (apache#9701) * Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (apache#9484)" This reverts commit 8930d96. * Some more LICENSE fixes * Adding some more packages to the LICENSE file * Adding dependencies of dependencies * update navbar model zoo link (apache#9749) * update navbar model zoo link * update * initial commit * clean up * refactor * fix test
Description

Use ndarray for accuracy calculation.

Checklist

Essentials

* Passed `make lint`

Changes

* `metric.Accuracy`, which fixes "ACCURACY IS USING NUMPY, URGENT FIX" (#9571)

Comments
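For reference, a minimal numpy sketch of the accuracy update discussed in this PR (numpy stands in for `mx.nd`; the class below is an illustration, not the actual `metric.Accuracy` code):

```python
import numpy as np

class Accuracy:
    """Illustrative numpy version of the accuracy update in this PR."""

    def __init__(self, axis=1):
        self.axis = axis
        self.sum_metric = 0.0
        self.num_inst = 0

    def update(self, labels, preds):
        for label, pred_label in zip(labels, preds):
            if pred_label.shape != label.shape:
                # Class scores -> predicted class indices.
                pred_label = pred_label.argmax(axis=self.axis)
            # reshape((-1,)) instead of flatten: keeps 1-D outputs 1-D.
            pred_label = pred_label.astype('int32').reshape((-1,))
            label = label.astype('int32').reshape((-1,))
            self.sum_metric += (pred_label == label).sum()
            self.num_inst += len(label)

    def get(self):
        return self.sum_metric / self.num_inst

acc = Accuracy()
acc.update([np.array([0, 1, 1])],
           [np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])])
print(acc.get())  # 2 of 3 predictions match: 0.666...
```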