This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/cub

CUB 1.8.0

19 May 08:59

Summary

CUB 1.8.0 introduces changes to the cub::Shuffle* interfaces.

Breaking Changes

  • The interfaces of cub::ShuffleIndex, cub::ShuffleUp, and cub::ShuffleDown have been changed to allow for better computation of the PTX SHFL control constant for logical warps smaller than 32 threads.
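For context, the PTX shfl instruction takes a control operand that packs a clamp value in bits 0-4 and a segment mask in bits 8-12. The host-side C++ sketch below illustrates that packing for power-of-two logical warp widths, following the scheme used by CUDA's width-based shuffle intrinsics. The function names are hypothetical and this is not CUB's implementation, which may compute the constant differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of CUB). `width` is the logical warp size
// and is assumed to be a power of two no larger than 32.
constexpr bool is_valid_width(unsigned width) {
    return width >= 1 && width <= 32 && (width & (width - 1)) == 0;
}

// Segment mask, stored in bits 8-12 of the control operand. It splits the
// 32-lane warp into independent segments of `width` lanes.
constexpr std::uint32_t segment_mask(unsigned width) {
    return (32u - width) << 8;
}

// Index and down shuffles clamp at the last lane of a segment (0x1f).
constexpr std::uint32_t shfl_control_idx_or_down(unsigned width) {
    return segment_mask(width) | 0x1fu;
}

// Up shuffles clamp at the first lane of a segment (0).
constexpr std::uint32_t shfl_control_up(unsigned width) {
    return segment_mask(width);
}

// Small usage example.
void check_shfl_control() {
    assert(is_valid_width(16));
    assert(shfl_control_idx_or_down(32) == 0x001fu);  // full warp
    assert(shfl_control_idx_or_down(16) == 0x101fu);  // two 16-lane segments
    assert(shfl_control_up(8) == 0x1800u);            // four 8-lane segments
}
```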

Bug Fixes

  • #112: Fix cub::WarpScan's broadcast of warp-wide aggregate for logical warps smaller than 32 threads.

CUB 1.7.5

19 May 08:56

Summary

CUB 1.7.5 adds support for radix sorting __half keys and improved sorting performance for 1 byte keys. It was incorporated into Thrust 1.9.2.

Enhancements

  • Radix sort support for __half keys.
  • Radix sort tuning policy updates to improve 1 byte key performance.
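One common way to radix sort floating-point keys is to remap each key's bit pattern so that unsigned integer order matches numeric order: flip every bit of negative keys and only the sign bit of non-negative keys. The host-side C++ sketch below applies that idea to raw 16-bit half-precision bit patterns. The function names are hypothetical, NaN handling is ignored, and this is an illustration rather than CUB's device code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a binary16 bit pattern to an unsigned key whose integer order matches
// the numeric order of the original value.
std::uint16_t twiddle_in(std::uint16_t bits) {
    const unsigned mask = (bits & 0x8000u) ? 0xFFFFu : 0x8000u;
    return static_cast<std::uint16_t>(bits ^ mask);
}

// Inverse of twiddle_in.
std::uint16_t twiddle_out(std::uint16_t bits) {
    const unsigned mask = (bits & 0x8000u) ? 0x8000u : 0xFFFFu;
    return static_cast<std::uint16_t>(bits ^ mask);
}

// Sort half-precision bit patterns with a two-pass, 8-bit LSD radix sort.
void sort_half_bits(std::vector<std::uint16_t>& keys) {
    for (auto& k : keys) k = twiddle_in(k);
    std::vector<std::uint16_t> tmp(keys.size());
    for (int shift = 0; shift < 16; shift += 8) {
        std::array<std::size_t, 257> offsets{};  // zero-initialized counts
        for (auto k : keys) ++offsets[((k >> shift) & 0xFFu) + 1];
        // Exclusive prefix sum: offsets[d] = number of keys with digit < d.
        for (std::size_t d = 0; d < 256; ++d) offsets[d + 1] += offsets[d];
        // Stable scatter by the current digit.
        for (auto k : keys) tmp[offsets[(k >> shift) & 0xFFu]++] = k;
        keys.swap(tmp);
    }
    for (auto& k : keys) k = twiddle_out(k);
}

// Usage example: bit patterns for 1.0, -2.0, 0.0, 2.0, -1.0.
void check_sort_half_bits() {
    std::vector<std::uint16_t> keys = {0x3C00, 0xC000, 0x0000, 0x4000, 0xBC00};
    const std::vector<std::uint16_t> expected = {0xC000, 0xBC00, 0x0000, 0x3C00, 0x4000};
    sort_half_bits(keys);
    assert(keys == expected);  // -2.0, -1.0, 0.0, 1.0, 2.0
}
```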

Bug Fixes

  • Syntax tweaks to mollify Clang.
  • #127: cub::DeviceRunLengthEncode::Encode returns incorrect results.
  • #128: 7-bit sorting passes fail for SM61 with large values.

CUB 1.7.4

19 May 08:56

Summary

CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

Bug Fixes

  • #114: Can't pair non-trivially-constructible values in radix sort.
  • #115: cub::WarpReduce segmented reduction is broken in CUDA 9 for logical warp sizes smaller than 32.

CUB 1.7.3

19 May 08:56

Summary

CUB 1.7.3 is a minor release.

Bug Fixes

  • #110: cub::DeviceHistogram null-pointer exception bug for iterator inputs.

CUB 1.7.2

19 May 08:56

Summary

CUB 1.7.2 is a minor release.

Bug Fixes

  • #104: Device-wide reduction is now "run-to-run" deterministic for pseudo-associative reduction operators (like floating point addition).
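Floating-point addition is only pseudo-associative: regrouping the operands can change the rounded result, so a parallel reduction whose combination order varies between runs can return different sums. The host-side C++ sketch below shows the effect and one way to avoid it, a pairwise reduction whose tree shape depends only on the input size. The function names are hypothetical and this is not CUB's device-wide implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Demonstrates that regrouping a float sum can change the result.
bool grouping_changes_result() {
    const float a = 1e20f, b = -1e20f, c = 1.0f;
    const float left = (a + b) + c;   // a and b cancel first, then c is added
    const float right = a + (b + c);  // c is absorbed by b before cancelling
    return left != right;
}

// Pairwise sum over [begin, end). The order of additions is fixed by the
// range length alone, so repeated calls on the same data agree bit for bit.
float pairwise_sum(const std::vector<float>& v, std::size_t begin, std::size_t end) {
    if (end == begin) return 0.0f;
    if (end - begin == 1) return v[begin];
    const std::size_t mid = begin + (end - begin) / 2;
    return pairwise_sum(v, begin, mid) + pairwise_sum(v, mid, end);
}

// Usage example.
void check_fixed_order_sum() {
    assert(grouping_changes_result());
    const std::vector<float> v = {1e20f, -1e20f, 1.0f, 1.0f};
    const float first = pairwise_sum(v, 0, v.size());
    const float second = pairwise_sum(v, 0, v.size());
    assert(first == second);
}
```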

CUB 1.7.1

19 May 08:55

Summary

CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
It is compatible with independent thread scheduling.

Breaking Changes

  • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.

Other Enhancements

  • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

Bug Fixes

  • #86: Incorrect results with reduce-by-key.

CUB 1.7.0

19 May 08:55

Summary

CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs. It is compatible with independent thread scheduling. It was incorporated into Thrust 1.9.2.

Breaking Changes

  • Remove cub::WarpAll and cub::WarpAny. These functions served to emulate __all and __any functionality for SM1x devices, which did not have those operations. However, SM1x devices are now deprecated in CUDA, and the interfaces of these two functions are now lacking the lane-mask needed for collectives to run on SM7x and newer GPUs which have independent thread scheduling.

Other Enhancements

  • Remove any assumptions of implicit warp synchronization to be compatible with SM7x's (Volta) independent thread scheduling.

Bug Fixes

  • #86: Incorrect results with reduce-by-key.

CUB 1.6.4

19 May 08:45

Summary

CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x (Pascal) GPUs.

Enhancements

  • Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) - 3.5B and 3.4B 32-bit keys/s on TitanX and GTX 1080, respectively.

Bug Fixes

  • Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
  • #65: cub::DeviceSegmentedRadixSort should allow inputs to have pointer-to-const type.
  • Mollify Clang device-side warnings.
  • Remove out-dated MSVC project files.

CUB 1.6.3

19 May 08:45

Summary

CUB 1.6.3 improves support for Windows, changes cub::BlockLoad/cub::BlockStore interface to take the local data type, and enhances radix sort performance for SM6x (Pascal) GPUs.

Breaking Changes

  • cub::BlockLoad and cub::BlockStore are now templated by the local data type, instead of the Iterator type. This allows for output iterators having void as their value_type (e.g. discard iterators).
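The motivation can be illustrated with a host-only C++ analogy. The types below are hypothetical stand-ins, not CUB's API: a helper that derives its element type from the iterator's value_type cannot be instantiated for an output iterator whose value_type is void, whereas a helper templated on the data type accepts any iterator that can be assigned through.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical output iterator that discards writes. As with many output
// iterators, its value_type is void.
struct DiscardIterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;
    struct Sink {
        template <typename U>
        void operator=(const U&) const {}  // accept and drop any value
    };
    Sink operator*() const { return {}; }
    DiscardIterator& operator++() { return *this; }
};

// Hypothetical store helper templated on the data type T. The iterator type
// is deduced per call, so its value_type is never inspected. A variant that
// declared `typename std::iterator_traits<It>::value_type items[N]` would
// fail to compile for DiscardIterator, because arrays of void are ill-formed.
template <typename T, std::size_t ItemsPerThread>
struct StoreByType {
    template <typename OutputIteratorT>
    static void Store(OutputIteratorT out, const T (&items)[ItemsPerThread]) {
        for (std::size_t i = 0; i < ItemsPerThread; ++i) {
            *out = items[i];
            ++out;
        }
    }
};

// Usage example with two iterators whose value_type is void.
void check_store_by_type() {
    const int items[3] = {1, 2, 3};
    std::vector<int> out;
    StoreByType<int, 3>::Store(std::back_inserter(out), items);
    assert(out.size() == 3 && out[0] == 1 && out[2] == 3);
    StoreByType<int, 3>::Store(DiscardIterator{}, items);  // writes are dropped
}
```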

Other Enhancements

  • Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte keys/s on GP100.
  • Improved support for Windows (warnings, alignment, etc).

Bug Fixes

  • #74: cub::WarpReduce executes reduction operator for out-of-bounds items.
  • #72: cub::InequalityWrapper::operator should be non-const.
  • #71: cub::KeyValuePair won't work if Key has non-trivial constructor.
  • #69: cub::BlockStore::Store doesn't compile if OutputIteratorT::value_type isn't T.
  • #68: cub::TilePrefixCallbackOp::WarpReduce doesn't permit PTX arch specialization.

CUB 1.6.2 (previously 1.5.5)

19 May 08:45

Summary

CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal) GPUs.

Enhancements

  • Radix sort tuning policies updated for SM6x (Pascal) GPUs.

Bug Fixes

  • Fix AArch64 compilation of cub::CachingDeviceAllocator.