This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

use nd for accuracy calculation #9583

Merged
merged 2 commits into apache:master from szha:metric_nd on Jan 27, 2018

Conversation

@szha (Member) commented Jan 27, 2018

Description

use ndarray for accuracy calculation.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Comments

  • Backward compatible.

"""
check_label_shapes(labels, preds)

for label, pred_label in zip(labels, preds):
if pred_label.shape != label.shape:
pred_label = ndarray.argmax(pred_label, axis=self.axis)
pred_label = pred_label.asnumpy().astype('int32')
label = label.asnumpy().astype('int32')
pred_label = pred_label.astype('int32')
Member:

int64 is better?

@szha (Member, Author) replied Jan 27, 2018:

This requires more space, which can matter when the number of prediction classes is large (such as in NLP applications). Should I make it an option?

Member:

should be good in most cases.

@zhreshold zhreshold merged commit f5f1b91 into apache:master Jan 27, 2018
@szha szha deleted the metric_nd branch January 27, 2018 13:10
@piiswrong (Contributor) commented:

This is not as simple as changing numpy array to ndarray. See #7995.
There are some cases where numpy arrays are faster. Please check performance against all cases.

Also, flatten reshapes to 2 dimensions. This could cause problems when the output is 1-dimensional. Use reshape((-1,)) instead.
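
(A minimal sketch of the shape issue described above, assuming a standard MXNet install; the example arrays and printed shapes are illustrative, not from the PR.)

    import mxnet.ndarray as nd

    pred_label = nd.array([0, 1, 1])   # a 1-dimensional output, shape (3,)

    # NDArray.flatten collapses everything after the first axis, so it always
    # returns a 2-D array; on a 1-D input the result comes back 2-D (e.g. (3, 1)),
    # which no longer lines up with a (3,) label for an elementwise comparison.
    print(pred_label.flatten().shape)

    # reshape((-1,)) always produces a 1-D array of the same elements.
    print(pred_label.reshape((-1,)).shape)   # (3,)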

@szha (Member, Author) commented Jan 29, 2018

Thanks for the comment. I will switch to reshape.

Regarding performance, that change was made before the CPU kernel optimization work, so we should have better performance now. Given that, I'm not sure it's still worth investing time and code complexity here. Also, I'm not entirely convinced that the frontend code should take over the backend's job of selecting an implementation just for the sake of performance. What do you think?

That said, if there is an immediate performance hit from switching to ND, I'm open to switching back to numpy. Since it would be infeasible for me to check all cases, is there any specific observation or reason on your side that requires attention to its performance?

@piiswrong (Contributor) commented Jan 29, 2018

Since this is a performance improvement, please verify that it indeed improves performance, at least for common cases.

Please verify this change does not bring back the negative performance impact reported in #7995.
Please also verify that this change does improve performance for the case @zackchase reported.

@ptrendx (Member) commented Feb 6, 2018

@szha Did you do the performance experiments that @piiswrong asked about (and what were the results)? With this commit we see a 20% perf regression on 8 Voltas in ResNet.

@szha (Member, Author) commented Feb 6, 2018

We intend to start tracking the performance of metrics going forward; see #9705 for some numbers. The accuracy implementation has been updated since this commit. Does the regression still exist on master head?

@DickJC123 (Contributor) commented:

We saw the perf regression on a build that includes this PR and also ed823b2 "proper flatten in acc (#9619)."

szha added a commit to szha/mxnet that referenced this pull request Feb 7, 2018
eric-haibin-lin pushed a commit that referenced this pull request Feb 7, 2018
* Revert "avoid per-batch blocking in metric (#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (#9583)"

This reverts commit f5f1b91.

* keep doc change
szha added a commit that referenced this pull request Feb 7, 2018
* Revert "avoid per-batch blocking in metric (#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (#9583)"

This reverts commit f5f1b91.

* keep doc change
if pred_label.context != label.context:
    pred_label = pred_label.as_in_context(label.context)

self.sum_metric += (pred_label.flatten() == label.flatten()).sum().asscalar()
Contributor:

asscalar is equivalent to asnumpy()[0].

This PR did not solve the problem of the metric being computed in the numpy world.

Member Author:

The computation happens before asnumpy() is called, so nothing happens in the numpy world other than passing out a scalar value.

May I ask what your interest is in this PR? Do you have a use case that benefits from using ndarray for the metric?

Contributor:

Maybe my understanding is flawed, but in order to bring this value back to the CPU into self.sum_metric, we need to do a wait_on_read on pred_label, right?
If we have a loop like:

for batch in data:
    ...
    metric.update()

we will wait for pred_label to be computed before we load the next batch onto the GPU.

Whereas if we had

self.sum_metric = nd.zeros(1, ctx=ctx)
self.sum_metric += (pred_label.flatten() == label.flatten()).sum()

then we wouldn't block on this operation before loading the next batch of data; it would just enqueue more operations on the backend.

I have run several experiments where computing the loss with .asscalar() on every loop slows training by 2-25%.

see:
ilkarman/DeepLearningFrameworks#55
zackchase/mxnet-the-straight-dope#455
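
(A minimal sketch of the two update styles being contrasted above; ctx and the toy batches are illustrative placeholders, not code from the PR.)

    import mxnet as mx
    from mxnet import nd

    ctx = mx.cpu()   # use mx.gpu(0) when available; the overlap benefit shows up there
    # Toy stand-ins for (label, prediction) batches already on ctx.
    batches = [(nd.array([0, 1, 0], ctx=ctx), nd.array([0, 1, 1], ctx=ctx))] * 4

    sum_blocking = 0.0                 # plain Python float, updated synchronously
    sum_async = nd.zeros(1, ctx=ctx)   # running sum kept on the device

    for label, pred_label in batches:
        correct = (pred_label.reshape((-1,)) == label.reshape((-1,))).sum()
        # Blocking style: asscalar() waits for the result on every batch, so the
        # next batch cannot be queued until this comparison finishes.
        sum_blocking += correct.asscalar()
        # Non-blocking style: the addition is just enqueued on the engine; nothing
        # waits here, so data loading and compute can overlap.
        sum_async += correct

    # One synchronization at reporting time is enough for the async version.
    print(sum_blocking, sum_async.asscalar())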

Contributor:

@szha to clarify what I mean: I don't think using numpy is slower in itself; it's that having a blocking operation in the loop limits the parallelization happening in the backend and leads to GPU starvation.

Member:

However, completely using the non-blocking logic will cause other problems. To be more specific, the allocated NDArrays cannot be reused and will eventually cause an OOM. We should use asscalar() to avoid this.

@ThomasDelteil (Contributor) Mar 19, 2018:

Thanks for the clarification @sxjscience, but why is that? Once processed in the computation graph, aren't they garbage collected?

Member:

I think it's because ops are pushed at a much faster rate than the real computation runs. The graph keeps expanding, and the allocation ops are executed to allocate new space (even before the actual computation is performed). We have to call a blocking operation at some point to make sure the current calculation in the graph has completed. CC @piiswrong for this.

Contributor:

Thanks, yes, that's my understanding. However, I think it should be left to the user to decide when to block, since it depends highly on their GPU and model size (e.g. every 100 batches or every epoch). Also, is there a reason why the accuracy is stored on the CPU rather than on a specific context? My measurements showed great improvements when storing the accuracy on GPU. If you don't mind, maybe we can continue the discussion in #9571.
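
(A sketch, not part of the PR, of the compromise being discussed: accumulate asynchronously on the metric's context and only block every N batches so the engine's queue stays bounded. sync_every, ctx, and the toy batches are illustrative names.)

    import mxnet as mx
    from mxnet import nd

    ctx = mx.cpu()        # use mx.gpu(0) when available
    sync_every = 100      # user-chosen blocking interval (e.g. every 100 batches)
    # Toy stand-ins for (label, prediction) batches already on ctx.
    batches = [(nd.array([0, 1, 0], ctx=ctx), nd.array([0, 1, 1], ctx=ctx))] * 250

    sum_metric = nd.zeros(1, ctx=ctx)   # accuracy numerator kept on the device
    num_inst = 0

    for i, (label, pred_label) in enumerate(batches):
        sum_metric += (pred_label.reshape((-1,)) == label.reshape((-1,))).sum()
        num_inst += label.size
        if (i + 1) % sync_every == 0:
            # Periodic blocking read: drains the queued ops so temporaries can be
            # reclaimed, bounding memory growth without blocking on every batch.
            sum_metric.wait_to_read()

    print(sum_metric.asscalar() / num_inst)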

@sxjscience (Member) commented Mar 19, 2018 via email

eric-haibin-lin added a commit that referenced this pull request Mar 27, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* use nd for accuracy calculation

* check for context
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Revert "avoid per-batch blocking in metric (apache#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (apache#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (apache#9583)"

This reverts commit f5f1b91.

* keep doc change
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Revert "avoid per-batch blocking in metric (apache#9636)"

This reverts commit 3fe694e.

* Revert "proper flatten in acc (apache#9619)"

This reverts commit ed823b2.

* Revert "use nd for accuracy calculation (apache#9583)"

This reverts commit f5f1b91.

* keep doc change
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018

Successfully merging this pull request may close these issues.

  • ACCURACY IS USING NUMPY, URGENT FIX
  • Misleading calculation of mxnet.metric.Accuracy
7 participants