Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix begin_bit == end_bit == 0 for device-wide and segmented sort #481

Merged
merged 5 commits into from
Aug 9, 2022

Conversation

canonizer
Copy link
Contributor

Fix begin_bit == end_bit == 0 for device-wide and segmented sort.

@canonizer
Copy link
Contributor Author

This pull request should address #353.

Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@canonizer thank you for addressing this! I'm a bit concerned about the approach, though. I wonder if we could short-circuit in the begin_bit == end_bit case. For instance, if is_overwrite_okay == true we wouldn't do anything, since the double buffer would contain proper data already. Otherwise, we might just copy the data. I've written a simple benchmark below that might help understand the impact of this approach. For double buffer case, we have noop, which is definitely faster. Otherwise, memcpy is about 40% faster than actually sorting anything.

#include <cub/cub.cuh>

#include <thrust/device_vector.h>

#include <iostream>

void sort(
    std::uint8_t *d_temp_storage, std::size_t &temp_storage_bytes,
    int *d_keys_in, int *d_keys_out, 
    int num_items, 
    bool use_buffer, bool short_circuit) 
{
  const int begin_bit = 0;
  const int end_bit = begin_bit;
  cub::DoubleBuffer<int> d_keys(d_keys_in, d_keys_out);

  if (use_buffer) {
    if (short_circuit) {
      temp_storage_bytes = 1; // noop
    } else {
      cub::DeviceRadixSort::SortKeys(
          d_temp_storage, temp_storage_bytes, 
          d_keys, num_items, begin_bit, end_bit);
    }
  } else {
    if (short_circuit) {
      if (d_temp_storage == nullptr) {
        temp_storage_bytes = 1; 
      } else {
        cudaMemcpy(d_keys_out, d_keys_in, sizeof(int) * num_items, cudaMemcpyDeviceToDevice);
      }
    } else {
      cub::DeviceRadixSort::SortKeys(
          d_temp_storage, temp_storage_bytes, 
          d_keys_in, d_keys_out, num_items, begin_bit, end_bit);
    }
  }
}

int main()
{
  const int num_items = 128 * 1024 * 1024;
  thrust::device_vector<int> in(num_items);
  thrust::device_vector<int> out(num_items);

  int *d_keys_in = thrust::raw_pointer_cast(in.data());
  int *d_keys_out = thrust::raw_pointer_cast(out.data());

  std::uint8_t *d_temp_storage{};
  std::size_t temp_storage_bytes = 0;

  const bool use_buffer = false;
  const bool short_circuit = true;

  sort(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items,
       use_buffer, short_circuit);

  thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
  d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

  cudaEvent_t begin, end;
  cudaEventCreate(&begin);
  cudaEventCreate(&end);

  cudaEventRecord(begin);
  sort(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items,
       use_buffer, short_circuit);
  cudaEventRecord(end);
  cudaEventSynchronize(end);

  float ms{};
  cudaEventElapsedTime(&ms, begin, end);

  std::cout << ms << "ms" << std::endl;


  cudaEventDestroy(end);
  cudaEventDestroy(begin);
}

I don't think this approach is applicable to segmented version. But I'd like to know your opinion on this for non-segmented API. Are there any downsides I'm missing?

@alliepiper alliepiper added this to the 2.0.0 milestone May 13, 2022
@alliepiper alliepiper added type: bug: functional Does not work as intended. P2: nice to have Desired, but not necessary. labels May 13, 2022
@canonizer
Copy link
Contributor Author

@senior-zero @allisonvacanti I've added short-circuiting when begin_bit == end_bit and double buffers are passed as arguments. Could you take another look?

Copy link
Collaborator

@alliepiper alliepiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In @senior-zero's earlier comment, he suggested making the is_overwrite_okay == false case to just do a copy and skip the sorting altogether. Can we add that optimization?

cub/device/dispatch/dispatch_radix_sort.cuh Outdated Show resolved Hide resolved
cub/device/dispatch/dispatch_radix_sort.cuh Outdated Show resolved Hide resolved
cub/device/dispatch/dispatch_radix_sort.cuh Outdated Show resolved Hide resolved
cub/device/dispatch/dispatch_radix_sort.cuh Outdated Show resolved Hide resolved
@canonizer
Copy link
Contributor Author

Thanks for your comments!

@allisonvacanti I've addressed your comments.

@senior-zero I've added the copy shortcut if begin_bit == end_bit, and overwrite is not allowed, for the device-wide sort only. For the segmented sort, as you've mentioned, this approach won't work, and adding a separate kernel to perform segmented copy is definitely for a different pull request.

Could you take another look?

@alliepiper
Copy link
Collaborator

@canonizer Can you rebase this on main? It looks like there are some conflicts.

@alliepiper alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022
-   Copy if begin_bit == end_bit, but overwrite not allowed
-   Fix style
-   When begin_bit == end_bit and double-buffering, don't do any sorting work
-   Uncommented segmented sort test
-   begin_bit == end_bit == 0 for upsweep/downsweep and segmented sort
-   Fixed begin_bit == end_bit == 0 case
@canonizer
Copy link
Contributor Author

@allisonvacanti @senior-zero I've synced with the latest main branch and addressed all review comments. Could you take another look?

cub/device/dispatch/dispatch_radix_sort.cuh Outdated Show resolved Hide resolved
cub/util_device.cuh Outdated Show resolved Hide resolved
@canonizer
Copy link
Contributor Author

@senior-zero I've addressed your comments. Could you take another look?

Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this optimization! I'll start testing now.

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022
cub/util_device.cuh Outdated Show resolved Hide resolved
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022
@gevtushenko gevtushenko added the testing: gpuCI in progress Started gpuCI testing. label Aug 6, 2022
@alliepiper alliepiper modified the milestones: 2.1.0, 2.0.0 Aug 8, 2022
@gevtushenko gevtushenko added testing: gpuCI passed Passed gpuCI testing. and removed testing: gpuCI in progress Started gpuCI testing. labels Aug 9, 2022
@gevtushenko gevtushenko merged commit 832f5c8 into NVIDIA:main Aug 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P2: nice to have Desired, but not necessary. testing: gpuCI passed Passed gpuCI testing. type: bug: functional Does not work as intended.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs)
3 participants