This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add fp16 support for topk #15560

Merged
merged 5 commits into apache:master from anirudhacharya:fp16 on Aug 23, 2019

Conversation

anirudhacharya
Member

@anirudhacharya anirudhacharya commented Jul 16, 2019

Description

Add fp16 support for the topk operator. This is required for certain machine translation tasks (and other NLP-related tasks).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

For review - @anirudh2290 @apeforest @sxjscience
Same as this PR - #14912

@karan6181
Contributor

@mxnet-label-bot add [FP16, Operator, pr-awaiting-review]

@marcoabreu marcoabreu added the FP16, Operator, and pr-awaiting-review (PR is waiting for code review) labels on Jul 16, 2019
@haojin2
Contributor

haojin2 commented Jul 17, 2019

@anirudhacharya Seems like this change is breaking some tests, please fix those.

@anirudhacharya
Member Author

@anirudhacharya Seems like this change is breaking some tests, please fix those.

The tests that are failing are the tests for topk with int32 dtype in test_ndarray:test_order.

@piyushghai
Contributor

@anirudhacharya What's the path forward for this PR?

@drivanov
Contributor

@anirudhacharya: I see a couple of bugs in sort_op-inl.cuh which show up when fp16 for topK is implemented. First of all, lines 52 and 63 are identical (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/sort_op-inl.cuh):
return static_cast<mshadow::half::half_t>(lhs) < static_cast<mshadow::half::half_t>(rhs);
therefore:

cuda::less_half<half>()
cuda::greater_half<half>()

are identical too.
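
For illustration only, here is a minimal sketch of what the two comparators would presumably look like once fixed; the functor boilerplate below is an assumption built around the quoted line, and only the comparison operator differs between the two:

    namespace cuda {
    // Orders fp16 keys ascending.
    template <typename DType>
    struct less_half {
      __host__ __device__ bool operator()(const DType &lhs, const DType &rhs) const {
        return static_cast<mshadow::half::half_t>(lhs) < static_cast<mshadow::half::half_t>(rhs);
      }
    };
    // Orders fp16 keys descending; the current code repeats the < comparison here,
    // making it identical to less_half. The fix is to compare with > instead.
    template <typename DType>
    struct greater_half {
      __host__ __device__ bool operator()(const DType &lhs, const DType &rhs) const {
        return static_cast<mshadow::half::half_t>(lhs) > static_cast<mshadow::half::half_t>(rhs);
      }
    };
    }  // namespace cuda
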
Second, instead of

    thrust::stable_sort_by_key(
      thrust::cuda::par.on(stream),
          key_iter, key_iter + (keys.size(0)), value_iter, cuda::greater_half<half>());

(here and in a couple of other places) we should use

    thrust::stable_sort_by_key(
      thrust::cuda::par.on(stream),
          key_iter, key_iter + (keys.size(0)), value_iter.get(), cuda::greater_half<half>());
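
Assuming value_iter is a thrust::device_ptr obtained via thrust::device_pointer_cast, the only difference between the two calls is that .get() hands thrust the underlying raw device pointer rather than the wrapper, roughly:

    // Hypothetical illustration of the distinction, not the actual code in sort_op-inl.cuh:
    thrust::device_ptr<VDType> value_iter = thrust::device_pointer_cast(values.dptr_);
    VDType *raw_values = value_iter.get();  // raw device pointer passed to stable_sort_by_key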

@anirudhacharya
Member Author

anirudhacharya commented Aug 21, 2019

@drivanov yes, cuda::less_half<half>() and cuda::greater_half<half>() are defined similarly and should be looked into.
But the failure, as far as fp16 for topk and this PR goes, is here - https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/sort_op-inl.cuh#L75. There is a problem when we try to perform DeviceRadixSort with half precision values. The documentation - https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html - says that DeviceRadixSort handles all numeric primitive types, including half precision. So either there is an issue with the way it is being called here, or there is an issue with the routine itself; I am not sure which.
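
For reference, a minimal sketch of how cub::DeviceRadixSort::SortPairs is typically driven, reinterpreting the fp16 key buffers as __half (a type cub does instantiate for); the buffer names below are placeholders:

    // First call with a NULL workspace only queries the required temporary storage size.
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(NULL, temp_bytes,
        reinterpret_cast<__half*>(keys_in), reinterpret_cast<__half*>(keys_out),
        values_in, values_out, num_items);
    // ... allocate temp_bytes of device memory as d_temp_storage, then sort ...
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_bytes,
        reinterpret_cast<__half*>(keys_in), reinterpret_cast<__half*>(keys_out),
        values_in, values_out, num_items);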

@drivanov
Contributor

drivanov commented Aug 21, 2019

@anirudhacharya: cuda::less_half<half>() and cuda::greater_half<half>() are not only defined similarly, they are identical, which is definitely a bug.
As for the problem with DeviceRadixSort, yes, it's not compiled for mshadow::half::half_t, but it could be compiled for __half and I think we could use that. The solution here is to use separate SortPairsWorkspaseSize templates. Here is how I did it:

// Shorthand for mshadow's fp16 type.
#define m_half mshadow::half::half_t

#ifndef SORT_WITH_THRUST
// Bytes needed for padded copies of the keys and the values
// (PadBytes is assumed to be the padding helper already present in sort_op-inl.cuh).
template <typename KDType, typename VDType>
inline void WorkspaceSize4KeysAndValues(
  const size_t num_keys, size_t *pKeys_bytes, size_t *pValues_bytes) {
  const size_t alignment = std::max(sizeof(KDType), sizeof(VDType));
  *pKeys_bytes = PadBytes(num_keys * sizeof(KDType), alignment);
  *pValues_bytes = PadBytes(num_keys * sizeof(VDType), alignment);
}

// Generic case: ask cub directly how much temporary storage SortPairs needs.
template <typename KDType, typename VDType>
inline typename std::enable_if<!std::is_same<KDType, m_half>::value, size_t>::type
SortPairsWorkspaseSize(const size_t num_keys) {
  size_t sortpairs_bytes = 0;
  cub::DeviceRadixSort::SortPairs<KDType, VDType>(NULL, sortpairs_bytes,
    NULL, NULL, NULL, NULL, num_keys);
  return sortpairs_bytes;
}

// fp16 case: query cub with __half, for which SortPairs is compiled,
// instead of mshadow::half::half_t, for which it is not.
template <typename KDType, typename VDType>
inline typename std::enable_if<std::is_same<KDType, m_half>::value, size_t>::type
SortPairsWorkspaseSize(const size_t num_keys) {
  size_t sortpairs_bytes = 0;
  cub::DeviceRadixSort::SortPairs<__half, VDType>(NULL, sortpairs_bytes,
    NULL, NULL, NULL, NULL, num_keys);
  return sortpairs_bytes;
}
#endif

template <typename KDType, typename VDType, typename xpu>
inline typename std::enable_if<std::is_same<xpu, gpu>::value, size_t>::type
SortByKeyWorkspaceSize(const size_t num_keys) {
#ifdef SORT_WITH_THRUST
  return 0;  // the thrust path does not use this external workspace
#else
  size_t keys_bytes, values_bytes;
  WorkspaceSize4KeysAndValues<KDType, VDType>(num_keys, &keys_bytes, &values_bytes);
  return keys_bytes + values_bytes + SortPairsWorkspaseSize<KDType, VDType>(num_keys);
#endif
}

As you can see, in my implementation SortPairsWorkspaseSize<KDType, VDType> is defined differently for KDType equal and NOT equal to m_half.
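
A hypothetical call site, just to show how the dispatch above would be exercised for fp16 keys; the value type and num_keys are placeholders:

    // Resolves to the __half-based SortPairsWorkspaseSize overload via enable_if.
    size_t workspace_bytes =
        SortByKeyWorkspaceSize<mshadow::half::half_t, int32_t, gpu>(num_keys);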

@anirudhacharya
Member Author

@anirudhacharya: cuda::less_half<half>() and cuda::greater_half<half>() are not only defined similarly, they are identical, which is definitely a bug.

Yes, sorry for sounding ambiguous. The intent of my previous comment was to acknowledge that it is a bug.

As for the problem with DeviceRadixSort, yes, it's not compiled for mshadow::half::half_t, but it could be compiled for __half and I think we could use that. The solution here is to use separate SortPairsWorkspaseSize templates...

Thanks for this suggestion, I will try this and see if it works. On the other hand, if you already have a solution for this and would like to make a PR for it, let me know; I can close this and we can merge your code.

@drivanov
Contributor

drivanov commented Aug 22, 2019

@anirudhacharya: Yes, I do have a solution, but it is not quite completed yet.

  1. I added the np.float16 testing to test_ndarray.py:test_order and all these tests are OK. But these tests are using the changes I've made for PR Mxnet allclose #14443, which is still open.
  2. There is another test (test_operator.py:test_order) which was also designed to test mx.sym.topk and other related functions. I am now working on extending these tests to np.float16 and plan to finish this task by tomorrow (08/23/2019).

@anirudhacharya
Member Author

Yes, I do have a solution, but it is not quite completed yet.

  1. I added the np.float16 testing to test_ndarray.py:test_order and all these tests are OK. But these tests are using the changes I've made for PR Mxnet allclose #14443, which is still open.

Then we can probably merge this, since I think this PR is pretty much merge-ready.

@sxjscience @haojin2 could you review/merge this?

@sxjscience sxjscience merged commit cba7c4e into apache:master Aug 23, 2019
@anirudhacharya anirudhacharya deleted the fp16 branch August 23, 2019 19:44
Labels
FP16, Operator, pr-awaiting-review (PR is waiting for code review)
Development

Successfully merging this pull request may close these issues.

FP16 support for topK
7 participants