This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add fp16 support for topk #15560

Merged
merged 5 commits into apache:master from anirudhacharya:fp16 on Aug 23, 2019

Conversation

anirudhacharya
Member

@anirudhacharya anirudhacharya commented Jul 16, 2019

Description

Add fp16 support for the topk operator. This is required for certain machine translation tasks (and other NLP-related tasks).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

For review - @anirudh2290 @apeforest @sxjscience
Same as this PR - #14912

@karan6181
Contributor

@mxnet-label-bot add [FP16, Operator, pr-awaiting-review]

@marcoabreu marcoabreu added the FP16, Operator, and pr-awaiting-review (PR is waiting for code review) labels on Jul 16, 2019
@haojin2
Contributor

haojin2 commented Jul 17, 2019

@anirudhacharya Seems like this change is breaking some tests, please fix those.

@anirudhacharya
Member Author

@anirudhacharya Seems like this change is breaking some tests, please fix those.

The tests that are failing are the tests for topk with int32 dtype in test_ndarray:test_order.

@piyushghai
Contributor

@anirudhacharya What's the path forward for this PR?

@drivanov
Contributor

@anirudhacharya: I see a couple of bugs in sort_op-inl.cuh which show up when fp16 for topK is implemented. First of all, lines 52 and 63 are identical (see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/sort_op-inl.cuh):
return static_cast<mshadow::half::half_t>(lhs) < static_cast<mshadow::half::half_t>(rhs);
therefore:

cuda::less_half<half>()
cuda::greater_half<half>()

are identical too.
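
For illustration only, here is a minimal sketch of what the two comparators would presumably look like once fixed; the functor boilerplate below is an assumption built around the quoted line, and only the comparison operator differs between the two:

    namespace cuda {
    // Orders fp16 keys ascending.
    template <typename DType>
    struct less_half {
      __host__ __device__ bool operator()(const DType &lhs, const DType &rhs) const {
        return static_cast<mshadow::half::half_t>(lhs) < static_cast<mshadow::half::half_t>(rhs);
      }
    };
    // Orders fp16 keys descending; the current code repeats the < comparison here,
    // making it identical to less_half. The fix is to compare with > instead.
    template <typename DType>
    struct greater_half {
      __host__ __device__ bool operator()(const DType &lhs, const DType &rhs) const {
        return static_cast<mshadow::half::half_t>(lhs) > static_cast<mshadow::half::half_t>(rhs);
      }
    };
    }  // namespace cuda
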
Second, instead of

    thrust::stable_sort_by_key(
      thrust::cuda::par.on(stream),
          key_iter, key_iter + (keys.size(0)), value_iter, cuda::greater_half<half>());

(here and in a couple of other places) we should use

    thrust::stable_sort_by_key(
      thrust::cuda::par.on(stream),
          key_iter, key_iter + (keys.size(0)), value_iter.get(), cuda::greater_half<half>());
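
Assuming value_iter is a thrust::device_ptr obtained via thrust::device_pointer_cast, the only difference between the two calls is that .get() hands thrust the underlying raw device pointer rather than the wrapper, roughly:

    // Hypothetical illustration of the distinction, not the actual code in sort_op-inl.cuh:
    thrust::device_ptr<VDType> value_iter = thrust::device_pointer_cast(values.dptr_);
    VDType *raw_values = value_iter.get();  // raw device pointer passed to stable_sort_by_key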

@anirudhacharya
Member Author

anirudhacharya commented Aug 21, 2019

@drivanov yes, cuda::less_half<half>() and cuda::greater_half<half>() are defined similarly and should be looked into.
But the failure, as far as fp16 for topk and this PR goes, is here - https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/sort_op-inl.cuh#L75. There is a problem when we try to perform DeviceRadixSort with half precision values. The documentation - https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html - says that DeviceRadixSort handles all numeric primitive types, including half precision. So either there is an issue with the way it is being called here, or there is an issue with the routine itself; I am not sure which.
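
For reference, a minimal sketch of how cub::DeviceRadixSort::SortPairs is typically driven, reinterpreting the fp16 key buffers as __half (a type cub does instantiate for); the buffer names below are placeholders:

    // First call with a NULL workspace only queries the required temporary storage size.
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(NULL, temp_bytes,
        reinterpret_cast<__half*>(keys_in), reinterpret_cast<__half*>(keys_out),
        values_in, values_out, num_items);
    // ... allocate temp_bytes of device memory as d_temp_storage, then sort ...
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_bytes,
        reinterpret_cast<__half*>(keys_in), reinterpret_cast<__half*>(keys_out),
        values_in, values_out, num_items);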

@drivanov
Contributor

drivanov commented Aug 21, 2019

@anirudhacharya: cuda::less_half<half>() and cuda::greater_half<half>() are not only defined similarly, they are identical, which is definitely a bug.
As for the problem with DeviceRadixSort, yes, it's not compiled for mshadow::half::half_t, but it could be compiled for __half and I think we could use that. The solution here is to use separate SortPairsWorkspaseSize templates. Here is how I did it:

// Shorthand for mshadow's fp16 type.
#define m_half mshadow::half::half_t

#ifndef SORT_WITH_THRUST
// Bytes needed for padded copies of the keys and the values
// (PadBytes is assumed to be the padding helper already present in sort_op-inl.cuh).
template <typename KDType, typename VDType>
inline void WorkspaceSize4KeysAndValues(
  const size_t num_keys, size_t *pKeys_bytes, size_t *pValues_bytes) {
  const size_t alignment = std::max(sizeof(KDType), sizeof(VDType));
  *pKeys_bytes = PadBytes(num_keys * sizeof(KDType), alignment);
  *pValues_bytes = PadBytes(num_keys * sizeof(VDType), alignment);
}

// Generic case: ask cub directly how much temporary storage SortPairs needs.
template <typename KDType, typename VDType>
inline typename std::enable_if<!std::is_same<KDType, m_half>::value, size_t>::type
SortPairsWorkspaseSize(const size_t num_keys) {
  size_t sortpairs_bytes = 0;
  cub::DeviceRadixSort::SortPairs<KDType, VDType>(NULL, sortpairs_bytes,
    NULL, NULL, NULL, NULL, num_keys);
  return sortpairs_bytes;
}

// fp16 case: query cub with __half, for which SortPairs is compiled,
// instead of mshadow::half::half_t, for which it is not.
template <typename KDType, typename VDType>
inline typename std::enable_if<std::is_same<KDType, m_half>::value, size_t>::type
SortPairsWorkspaseSize(const size_t num_keys) {
  size_t sortpairs_bytes = 0;
  cub::DeviceRadixSort::SortPairs<__half, VDType>(NULL, sortpairs_bytes,
    NULL, NULL, NULL, NULL, num_keys);
  return sortpairs_bytes;
}
#endif

template <typename KDType, typename VDType, typename xpu>
inline typename std::enable_if<std::is_same<xpu, gpu>::value, size_t>::type
SortByKeyWorkspaceSize(const size_t num_keys) {
#ifdef SORT_WITH_THRUST
  return 0;  // the thrust path does not use this external workspace
#else
  size_t keys_bytes, values_bytes;
  WorkspaceSize4KeysAndValues<KDType, VDType>(num_keys, &keys_bytes, &values_bytes);
  return keys_bytes + values_bytes + SortPairsWorkspaseSize<KDType, VDType>(num_keys);
#endif
}

As you can see, in my implementation SortPairsWorkspaseSize<KDType, VDType> is defined differently for KDType equal and NOT equal to m_half.
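
A hypothetical call site, just to show how the dispatch above would be exercised for fp16 keys; the value type and num_keys are placeholders:

    // Resolves to the __half-based SortPairsWorkspaseSize overload via enable_if.
    size_t workspace_bytes =
        SortByKeyWorkspaceSize<mshadow::half::half_t, int32_t, gpu>(num_keys);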

@anirudhacharya
Member Author

@anirudhacharya: cuda::less_half<half>() and cuda::greater_half<half>() are not only defined similarly, they are identical, which is definitely a bug.

Yes, sorry for sounding ambiguous. The intent of my previous comment was to acknowledge that it is a bug.

As for the problem with DeviceRadixSort, yes, it's not compiled for mshadow::half::half_t, but it could be compiled for __half and I think we could use that. The solution here is to use separate SortPairsWorkspaseSize templates...

Thanks for this suggestion, I will try this and see if it works. On the other hand, if you already have a solution for this and would like to make a PR for it, let me know; I can close this and we can merge your code.

@drivanov
Contributor

drivanov commented Aug 22, 2019

@anirudhacharya: Yes, I do have a solution, but it is not quite completed yet.

  1. I added the np.float16 testing to test_ndarray.py:test_order and all these tests are OK. But these tests are using the changes I've made for PR Mxnet allclose #14443, which is still open.
  2. There is another test (test_operator.py:test_order) which was also designed to test mx.sym.topk and other related functions. I am now working on extending these tests to np.float16 and plan to finish this task by tomorrow (08/23/2019).

@anirudhacharya
Member Author

Yes, I do have a solution, but it is not quite completed yet.

  1. I added the np.float16 testing to test_ndarray.py:test_order and all these tests are OK. But these tests are using the changes I've made for PR Mxnet allclose #14443, which is still open.

Then we can probably merge this, since I think this PR is pretty much merge-ready.

@sxjscience @haojin2 could you review/merge this?

@sxjscience sxjscience merged commit cba7c4e into apache:master Aug 23, 2019
@anirudhacharya anirudhacharya deleted the fp16 branch August 23, 2019 19:44
Labels
FP16, Operator, pr-awaiting-review (PR is waiting for code review)
Development

Successfully merging this pull request may close these issues.

FP16 support for topK
7 participants