Fix begin_bit == end_bit == 0 for device-wide and segmented sort #481

canonizer · 2022-05-12T20:31:08Z

Fix begin_bit == end_bit == 0 for device-wide and segmented sort.

canonizer · 2022-05-12T20:31:52Z

This pull request should address #353.

gevtushenko

@canonizer thank you for addressing this! I'm a bit concerned about the approach, though. I wonder if we could short-circuit in the begin_bit == end_bit case. For instance, if is_overwrite_okay == true, we wouldn't do anything, since the double buffer would already contain the proper data. Otherwise, we could just copy the data. I've written a simple benchmark below that might help to understand the impact of this approach. In the double-buffer case the short-circuit is a no-op, which is definitely faster. Otherwise, memcpy is about 40% faster than actually running the sort.

#include <cub/cub.cuh>

#include <thrust/device_vector.h>

#include <cstddef>
#include <cstdint>
#include <iostream>

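// Benchmark helper for the begin_bit == end_bit case. With short_circuit set,
// the sort is replaced by a no-op (DoubleBuffer) or a plain copy (in/out pointers).
// As with CUB, a null d_temp_storage only queries temp_storage_bytes.
void sort(
    std::uint8_t *d_temp_storage, std::size_t &temp_storage_bytes,
    int *d_keys_in, int *d_keys_out,
    int num_items,
    bool use_buffer, bool short_circuit)
{
  const int begin_bit = 0;
  const int end_bit = begin_bit; // empty bit range: no key bits take part in the sort
  cub::DoubleBuffer<int> d_keys(d_keys_in, d_keys_out);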

  if (use_buffer) {
    if (short_circuit) {
      temp_storage_bytes = 1; // noop
    } else {
      cub::DeviceRadixSort::SortKeys(
          d_temp_storage, temp_storage_bytes, 
          d_keys, num_items, begin_bit, end_bit);
    }
  } else {
    if (short_circuit) {
      if (d_temp_storage == nullptr) {
        temp_storage_bytes = 1; 
      } else {
        cudaMemcpy(d_keys_out, d_keys_in, sizeof(int) * num_items, cudaMemcpyDeviceToDevice);
      }
    } else {
      cub::DeviceRadixSort::SortKeys(
          d_temp_storage, temp_storage_bytes, 
          d_keys_in, d_keys_out, num_items, begin_bit, end_bit);
    }
  }
}

int main()
{
  const int num_items = 128 * 1024 * 1024;
  thrust::device_vector<int> in(num_items);
  thrust::device_vector<int> out(num_items);

  int *d_keys_in = thrust::raw_pointer_cast(in.data());
  int *d_keys_out = thrust::raw_pointer_cast(out.data());

  std::uint8_t *d_temp_storage{};
  std::size_t temp_storage_bytes = 0;

  const bool use_buffer = false;
  const bool short_circuit = true;

  sort(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items,
       use_buffer, short_circuit);

  thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);
  d_temp_storage = thrust::raw_pointer_cast(temp_storage.data());

  cudaEvent_t begin, end;
  cudaEventCreate(&begin);
  cudaEventCreate(&end);

  cudaEventRecord(begin);
  sort(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items,
       use_buffer, short_circuit);
  cudaEventRecord(end);
  cudaEventSynchronize(end);

  float ms{};
  cudaEventElapsedTime(&ms, begin, end);

  std::cout << ms << "ms" << std::endl;


  cudaEventDestroy(end);
  cudaEventDestroy(begin);
}

I don't think this approach is applicable to the segmented version. But I'd like to know your opinion on this for the non-segmented API. Are there any downsides I'm missing?
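To make the reasoning above concrete, here is a minimal host-side sketch in plain C++ (not CUB code). It assumes a stable LSD radix sort that handles one bit per pass over the range [begin_bit, end_bit); the name radix_sort_bits and the one-bit-per-pass structure are illustrative assumptions, since CUB processes several bits per pass on the GPU. The point it shows is that an empty bit range runs zero passes, so the result is just a copy of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a stable LSD radix sort restricted to
// the key bits in [begin_bit, end_bit). Returns the number of passes executed.
inline int radix_sort_bits(const std::vector<std::uint32_t> &in,
                           std::vector<std::uint32_t> &out,
                           int begin_bit, int end_bit)
{
  assert(0 <= begin_bit && begin_bit <= end_bit && end_bit <= 32);

  out = in; // with zero passes, this copy is the entire result
  std::vector<std::uint32_t> tmp(in.size());

  int passes = 0;
  for (int bit = begin_bit; bit < end_bit; ++bit, ++passes) {
    // Stable partition on a single bit: keys with a 0 bit first, then 1 bits.
    std::size_t pos = 0;
    for (std::uint32_t k : out) {
      if (((k >> bit) & 1u) == 0u) { tmp[pos++] = k; }
    }
    for (std::uint32_t k : out) {
      if (((k >> bit) & 1u) != 0u) { tmp[pos++] = k; }
    }
    out.swap(tmp);
  }
  return passes;
}
```

Calling radix_sort_bits(in, out, 0, 0) performs no passes and leaves out equal to in, which is why a no-op (double buffer) or a plain copy (separate output) is sufficient in that case.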

canonizer · 2022-05-20T22:08:22Z

@senior-zero @allisonvacanti I've added short-circuiting when begin_bit == end_bit and double buffers are passed as arguments. Could you take another look?

alliepiper

In @senior-zero's earlier comment, he suggested making the is_overwrite_okay == false case just do a copy and skip the sorting altogether. Can we add that optimization?

cub/device/dispatch/dispatch_radix_sort.cuh

canonizer · 2022-05-31T23:05:04Z

Thanks for your comments!

@allisonvacanti I've addressed your comments.

@senior-zero I've added the copy shortcut for the case where begin_bit == end_bit and overwrite is not allowed, for the device-wide sort only. For the segmented sort, as you've mentioned, this approach won't work, and adding a separate kernel to perform a segmented copy belongs in a different pull request.
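The device-wide behaviour discussed in this thread can be summarised as a small decision function. This is a host-only sketch, not the code in dispatch_radix_sort.cuh; the enum SortAction and the function choose_action are hypothetical names introduced for illustration, and the segmented sort is deliberately left out because the copy shortcut does not apply to it.

```cpp
#include <cassert>

// Hypothetical summary of what the device-wide sort does for a given request.
enum class SortAction { Noop, CopyInputToOutput, RunRadixSort };

// is_overwrite_okay mirrors the flag discussed above: true when the caller
// passed a DoubleBuffer, false when separate input and output arrays are used.
inline SortAction choose_action(int begin_bit, int end_bit, bool is_overwrite_okay)
{
  assert(begin_bit <= end_bit);

  if (begin_bit == end_bit) {
    // Empty bit range: the keys are already in their final order.
    return is_overwrite_okay ? SortAction::Noop               // data is already in the current buffer
                             : SortAction::CopyInputToOutput; // output must still be populated
  }
  return SortAction::RunRadixSort;
}
```

For example, choose_action(0, 0, true) yields SortAction::Noop, while choose_action(0, 0, false) yields SortAction::CopyInputToOutput.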

Could you take another look?

alliepiper · 2022-06-22T20:51:46Z

@canonizer Can you rebase this on main? It looks like there are some conflicts.

cub/device/dispatch/dispatch_radix_sort.cuh

- Copy if begin_bit == end_bit, but overwrite not allowed
- Fix style
- When begin_bit == end_bit and double-buffering, don't do any sorting work
- Uncommented segmented sort test
- begin_bit == end_bit == 0 for upsweep/downsweep and segmented sort
- Fixed begin_bit == end_bit == 0 case

canonizer · 2022-08-02T17:57:33Z

@allisonvacanti @senior-zero I've synced with the latest main branch and addressed all review comments. Could you take another look?

cub/device/dispatch/dispatch_radix_sort.cuh

cub/util_device.cuh

canonizer · 2022-08-05T22:13:01Z

@senior-zero I've addressed your comments. Could you take another look?

cub/device/dispatch/dispatch_radix_sort.cuh

gevtushenko

Thank you for this optimization! I'll start testing now.

cub/util_device.cuh

canonizer requested review from alliepiper and gevtushenko May 12, 2022 20:31

canonizer mentioned this pull request May 12, 2022

DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353

Closed

gevtushenko suggested changes May 13, 2022

alliepiper linked an issue May 13, 2022 that may be closed by this pull request

DeviceRadixSort fails when begin_bit = end_bit = 0 (for large inputs) #353

Closed

alliepiper added this to the 2.0.0 milestone May 13, 2022

alliepiper added the labels "type: bug: functional" (Does not work as intended.) and "P2: nice to have" (Desired, but not necessary.) May 13, 2022

alliepiper suggested changes May 24, 2022

alliepiper approved these changes Jun 22, 2022

gevtushenko suggested changes Jun 23, 2022

alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022

canonizer force-pushed the onesweep-begin-end-bit branch from d63448c to 9b50753 August 2, 2022 14:45

Addressed review comments.

d8ca8c7

Check for unified addressing before using cudaMemcpyDefault.

e9bcf7a

gevtushenko suggested changes Aug 4, 2022

cub/device/dispatch/dispatch_radix_sort.cuh (outdated)

cub/util_device.cuh (outdated)

Addressed review comments.

f094620

gevtushenko reviewed Aug 5, 2022

cub/device/dispatch/dispatch_radix_sort.cuh

gevtushenko approved these changes Aug 6, 2022

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022

Testing NVIDIA/cub#481

d0ceb92

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022

Testing NVIDIA/cub#481

e7c2738

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022

Testing NVIDIA/cub#481

ae7bff6

gevtushenko reviewed Aug 6, 2022

cub/util_device.cuh (outdated)

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Aug 6, 2022

Testing NVIDIA/cub#481

d205c8a

gevtushenko added the "testing: gpuCI in progress" (Started gpuCI testing.) label Aug 6, 2022

alliepiper modified the milestones: 2.1.0, 2.0.0 Aug 8, 2022

Add inline to HasUVA

5daac39

gevtushenko added the "testing: gpuCI passed" (Passed gpuCI testing.) label and removed the "testing: gpuCI in progress" (Started gpuCI testing.) label Aug 9, 2022

gevtushenko merged commit 832f5c8 into NVIDIA:main Aug 9, 2022
