Releases: NVIDIA/cub
CUB 2.1.0
Breaking Changes
- #553: Deprecate the
CUB_USE_COOPERATIVE_GROUPS
macro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.
New Features
- #359: Add new
DeviceBatchMemcpy
algorithm. - #565: Add
DeviceMergeSort::StableSortKeysCopy
API. Thanks to David Wendt (@davidwendt) for this contribution. - #585: Add SM90 tuning policy for
DeviceRadixSort
. Thanks to Andy Adinets (@canonizer) for this contribution. - #586: Introduce a new mechanism to opt-out of compiling CDP support in CUB algorithms by defining
CUB_DISABLE_CDP
. - #589: Support 64-bit indexing in
DeviceReduce
. - #607: Support 128-bit integers in radix sort.
Bug Fixes
- #547: Resolve several long-running issues resulting from using multiple versions of CUB within the same process. Adds an inline namespace that encodes CUB version and targeted PTX architectures.
- #562: Fix bug in
BlockShuffle
resulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution. - #564: Fix bug in
BlockRadixRank
when used with blocks that are not a multiple of 32 threads. - #579: Ensure that all threads in the logical warp participate in the index-shuffle for
BlockRadixRank
. Thanks to Andy Adinets (@canonizer) for this contribution. - #582: Fix reordering in CUB member initializer lists.
- #589: Fix
DeviceSegmentedSort
when used withbool
keys. - #590: Fix CUB’s CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this contribution.
- #592: Fix overflow in
DeviceReduce
. - #598: Fix
DeviceRunLengthEncode
when the first item is aNaN
. - #611: Fix
WarpScanExclusive
for vector types.
Other Enhancements
- #537: Add detailed and expanded version of a [CUB developer overview](https://github.com/NVIDIA/cub/blob/main/docs/developer_overview.md).
- #549: Fix
BlockReduceRaking
docs for non-commutative operations. Thanks to Tobias Ribizel (@upsj) for this contribution. - #606: Optimize CUB’s decoupled-lookback implementation.
CUB 2.0.1
Other Enhancements
- Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and not required for correctness, and a warning will be emitted if this happens.
CUB 1.17.2
Summary
CUB 1.17.2 is a minor bugfix release.
- #547: Introduce an annotated inline namespace to prevent issues with
collisions and mismatched kernel configurations across libraries. The new
namespace encodes the CUB version and target SM architectures.
CUB 2.0.0
Summary
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous
debugging flags.
This release also includes several new features. DeviceHistogram
now supports __half
and better handles various edge cases. WarpReduce
now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync
accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort
learned to handle the case where begin_bit == end_bit
.
Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
Breaking Changes
- #448 Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They can be re-enabled by defining
CUB_PROVIDE_LEGACY_ARCH_MACROS
. These will be completely removed in a future release.CUB_IS_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_IS_DEVICE_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_HOST_CODE
: Replace withNV_IF_TARGET
.CUB_INCLUDE_DEVICE_CODE
: Replace withNV_IF_TARGET
.
- #486: CUB’s CUDA Runtime support macros have been updated to support
NV_IF_TARGET
. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.CUB_RUNTIME_FUNCTION
: Execution space annotations for functions that invoke CUDA Runtime APIs.- Old behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled:
- NVCC host pass: Defined to
__host__ __device__
- NVCC device pass: Defined to
__host__
- NVCC host pass: Defined to
- RDC enabled: Defined to
- New behavior:
- RDC enabled: Defined to
__host__ __device__
- RDC not enabled: Defined to
__host__
- RDC enabled: Defined to
- Old behavior:
CUB_RUNTIME_ENABLED
: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:- RDC enabled: Macro is defined.
- RDC not enabled:
- NVCC host pass: Macro is defined.
- NVCC device pass: Macro is not defined.
CUB_RDC_ENABLED
: New macro, may be combined withNV_IF_TARGET
to replace most usages ofCUB_RUNTIME_ENABLED
. Behavior:- RDC enabled: Macro is defined.
- RDC not enabled: Macro is not defined.
- #509: A compile-time error is now emitted when a
__device__
-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).- Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
- When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
- Use a named function object with a
__device__
-only implementation ofoperator()
. - Use a
__host__ __device__
lambda. - Use
cuda::proclaim_return_type
(Added in libcu++ 1.9.0)
- Use a named function object with a
- #509: Use the result type of the binary reduction operator for accumulating intermediate results in the
DeviceReduce
algorithm, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch*Reduce
layer have changed:DispatchReduce
:- Now accepts accumulator type as last parameter.
- Now accepts initializer type instead of output iterator value type.
- Constructor now accepts
init
as initial type instead of output iterator value type.
DispatchSegmentedReduce
:- Accepts accumulator type as last parameter.
- Accepts initializer type instead of output iterator value type.
- Thread operators now accept parameters using different types:
Equality
,Inequality
,InequalityWrapper
,Sum
,Difference
,Division
,Max
,ArgMax
,Min
,ArgMin
. ThreadReduce
now accepts accumulator type and uses a different type forprefix
.
- #511: Use the result type of the binary operator for accumulating intermediate results in the
DeviceScan
,DeviceScanByKey
, andDeviceReduceByKey
algorithms, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch
layer have changed:DispatchScan
now accepts accumulator type as a template parameter.DispatchScanByKey
now accepts accumulator type as a template parameter.DispatchReduceByKey
now accepts accumulator type as the last template parameter.
- #527: Deprecate the
debug_synchronous
flags on device algorithms.- This flag no longer has any effect. Define
CUB_DEBUG_SYNC
during compilation to enable these checks. - Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
- This flag no longer has any effect. Define
New Features
- #514: Support
__half
inDeviceHistogram
. - #516: Add support for single-threaded invocations of
WarpReduce
. - #516: Use
__reduce_add_sync
hardware acceleration forWarpReduce
on supported architectures.
Bug Fixes
- #481: Fix the device-wide radix sort implementations to simply copy the input to the output when
begin_bit == end_bit
. - #487: Fix
DeviceHistogram::Even
for a variety of edge cases:- Bin ids are now correctly computed when mixing different types for
SampleT
andLevelT
. - Bin ids are now correctly computed when
LevelT
is an integral type and the number of levels does not evenly divide the level range.
- Bin ids are now correctly computed when mixing different types for
- #508: Ensure that
temp_storage_bytes
is properly set in theAdjacentDifferenceCopy
device algorithms. - #508: Remove excessive calls to the binary operator given to the
AdjacentDifferenceCopy
device algorithms. - #533: Fix debugging utilities when RDC is disabled.
Other Enhancements
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of
__CUDA_ARCH__
with<nv/target>
to handle host/device code divergence. - #448: Mark unused PTX arch parameters as legacy.
- #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
- #480: Add
CUB_DISABLE_BF16_SUPPORT
to avoid including thecuda_bf16.h
header or using the__nv_bfloat16
type. - #486: Add debug log messages for post-kernel debug synchronizations.
- #490: Clarify documentation for in-place usage of
DeviceScan
algorithms. - #494: Clarify documentation for in-place usage of
DeviceHistogram
algorithms. - #495: Clarify documentation for in-place usage of
DevicePartition
algorithms. - #499: Clarify documentation for in-place usage of
Device*Sort
algorithms. - #500: Clarify documentation for in-place usage of
DeviceReduce
algorithms. - #501: Clarify documentation for in-place usage of
DeviceRunLengthEncode
algorithms. - #503: Clarify documentation for in-place usage of
DeviceSelect
algorithms. - #518: Fix typo in
WarpMergeSort
documentation. - #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.
CUB 1.17.1
Summary
CUB 1.17.1 is a minor bugfix release.
CUB 1.17.0
CUB 1.17.0
Summary
CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.
Known Issues
“Run-to-run” Determinism Broken
Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey
and DeviceScan
algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.
Bug Fixes
- #444: Fixed
DeviceSelect
to work with discard iterators and mixed input/output types. - #452: Fixed install issue when
CMAKE_INSTALL_LIBDIR
contained nested directories. Thanks to @robertmaynard for this contribution. - #462: Fixed bug that produced incorrect results from
DeviceSegmentedSort
on sm_61 and sm_70. - #464: Fixed
DeviceSelect::Flagged
so that flags are normalized to 0 or 1. - #468: Fixed overflow issues in
DeviceRadixSort
givennum_items
close to 2^32. Thanks to @canonizer for this contribution. - #498: Fixed compiler regression in
BlockAdjacentDifference
. Thanks to @MKKnorr for this contribution.
Other Enhancements
- #445: Remove device-sync in
DeviceSegmentedSort
when launched via CDP. - #449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
- #450:
BlockDiscontinuity
: Replaced recursive-template loop unrolling with#pragma unroll
. Thanks to @kshitij12345 for this contribution. - #451: Replaced the deprecated
TexRefInputIterator
implementation with an alias toTexObjInputIterator
. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB. - #456:
BlockAdjacentDifference
: Replaced recursive-template loop unrolling with#pragma unroll
. Thanks to @kshitij12345 for this contribution. - #466:
cub::DeviceAdjacentDifference
API has been updated to use the newOffsetT
deduction approach described in #212. - #470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.
CUB 1.16.0
Summary
CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort
now supports large (64-bit indexed) input data. A new UniqueByKey
algorithm has been added to DeviceSelect
. DeviceAdjacentDifference
provides new SubtractLeft
and SubtractRight
functionality.
This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference
algorithms. Many bugfixes and documentation updates are also included.
64-bit Offsets in DeviceRadixSort
Public APIs
Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in #212.
The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort
family of algorithms. Thanks to @canonizer for contributing this functionality.
New DeviceSelect::UniqueByKey
Algorithm
cub::DeviceSelect
now provides a UniqueByKey
algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.
New DeviceAdjacentDifference
Algorithms
The new cub::DeviceAdjacentDifference
interface, also ported from Thrust, provides SubtractLeft
and SubtractRight
algorithms as CUB kernels.
Deprecation Notices
Synchronous CUDA Dynamic Parallelism Support
A future version of CUB will change the debug_synchronous
behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true
. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.
Deprecated Traits
CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:
Deprecated CUB Trait | Replacement STL Trait |
---|---|
cub::If | std::conditional |
cub::Equals | std::is_same |
cub::IsPointer | std::is_pointer |
cub::IsVolatile | std::is_volatile |
cub::RemoveQualifiers | std::remove_cv |
cub::EnableIf | std::enable_if |
CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.
Misnamed cub::BlockAdjacentDifference
APIs
The algorithms in cub::BlockAdjacentDifference
have been deprecated, as their names did not clearly describe their intent. The FlagHeads
method is now SubtractLeft
, and FlagTails
has been replaced by SubtractRight
.
Breaking Changes
- #331: Deprecate the misnamed
BlockAdjacentDifference::FlagHeads
andFlagTails
methods. Use the newSubtractLeft
andSubtractRight
methods instead. - #364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in
<type_traits>
as described above.
New Features
- #331: Port the
thrust::adjacent_difference
kernel and expose it ascub::DeviceAdjacentDifference
. - #405: Port the
thrust::unique_by_key
kernel and expose it ascub::DeviceSelect::UniqueByKey
. Thanks to @zasdfgbmn for this contribution.
Enhancements
- #340: Allow 64-bit offsets in
DeviceRadixSort
public APIs. Thanks to @canonizer for this contribution. - #400: Implement a significant reduction in
DeviceMergeSort
compilation time. - #415: Support user-defined
CMAKE_INSTALL_INCLUDEDIR
values in Thrust’s CMake install rules. Thanks for @robertmaynard for this contribution.
Bug Fixes
- #381: Fix shared memory alignment in
dyn_smem
example. - #393: Fix some collisions with the
min
/max
macros defined inwindows.h
. - #404: Fix bad cast in
util_device
. - #410: Fix CDP issues in
DeviceSegmentedSort
. - #411: Ensure that the
nv_exec_check_disable
pragma is only used on nvcc. - #418: Fix
-Wsizeof-array-div
warning on gcc 11. Thanks to @robertmaynard for this contribution. - #420: Fix new uninitialized variable warning in
DiscardIterator
on gcc 10. - #423: Fix some collisions with the
small
macro defined inwindows.h
. - #426: Fix some issues with version handling in CUB’s CMake packages.
- #430: Remove documentation for
DeviceSpmv
parameters that are absent from public APIs. - #432: Remove incorrect documentation for
DeviceScan
algorithms that guaranteed run-to-run deterministic results for floating-point addition.
CUB 1.15.0
Summary
CUB 1.15.0 includes a new cub::DeviceSegmentedSort
algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort
when sorting a large number of small segments. A new cub::FutureValue<T>
helper allows the cub::DeviceScan
algorithms to lazily load the initial_value
from a pointer. cub::DeviceScan
also added ScanByKey
functionality.
The new DeviceSegmentedSort
algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.
cub::FutureValue<T>
provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:
int *d_intermediate_result = ...;
intermediate_kernel<<<blocks, threads>>>(d_intermediate_result, // output
arg1, // input
arg2); // input
// Wrap the intermediate pointer in a FutureValue -- no need to explicitly
// sync when both kernels are stream-ordered. The pointer is read after
// the ExclusiveScan kernel starts executing.
cub::FutureValue<int> init_value(d_intermediate_result);
cub::DeviceScan::ExclusiveScan(d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
cub::Sum(),
init_value,
num_items);
Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.
Deprecation Notices
A future version of CUB will change the debug_synchronous
behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true
. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.
Breaking Changes
- #305: The template parameters of
cub::DispatchScan
have changed to support the newcub::FutureValue
helper. More details under "New Features". - #377: Remove broken
operator->()
fromcub::TransformInputIterator
, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
New Features
- #305: Add overloads to
cub::DeviceScan
algorithms that allow the output of a previous kernel to be used asinitial_value
without explicit synchronization. See the newcub::FutureValue
helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution. - #354: Add
cub::BlockRunLengthDecode
algorithm. Thanks to Elias Stehle (@elstehle) for this contribution. - #357: Add
cub::DeviceSegmentedSort
, an optimized version ofcub::DeviceSegmentedSort
with improved load balancing and small array performance. - #376: Add "by key" overloads to
cub::DeviceScan
. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
Bug Fixes
- #349: Doxygen and unused variable fixes.
- #363: Maintenance updates for the new
cub::DeviceMergeSort
algorithms. - #382: Fix several
-Wconversion
warnings. Thanks to Matt Stack (@matt-stack) for this contribution. - #388: Fix debug assertion on MSVC when using
cub::CachingDeviceAllocator
. - #395: Support building with
__CUDA_NO_HALF_CONVERSIONS__
. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
CUB 1.14.0 (NVIDIA HPC SDK 21.9)
CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.
This release provides the often-requested merge sort algorithm, ported from the thrust::sort
implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort
and cub::BlockMergeSort
algorithms.
The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX
/CUB_NS_POSTFIX
) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE
, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE
macro will embed both thrust::
and cub::
symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER
macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub
). See cub/util_namespace.cuh
for details.
Breaking Changes
- #350: When the
CUB_NS_[PRE|POST]FIX
macros are set,CUB_NS_QUALIFIER
must also be defined to the fully qualified CUB namespace (e.g.#define CUB_NS_QUALIFIER ::foo::cub
). Note that this is handled automatically when using the new[THRUST_]CUB_WRAPPED_NAMESPACE
mechanism.
New Features
- #322: Ported the merge sort algorithm from Thrust;
cub::BlockMergeSort
andcub::DeviceMergeSort
are now available. - #326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.
Bug Fixes
- #160, #163, #352: Fixed several bugs in
cub::DeviceSpmv
and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions. - #328: Fixed error handling bug and incorrect debugging output in
cub::CachingDeviceAllocator
. Thanks to Felix Kallenborn for this contribution. - #335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
- #351: Fixed some errors in the
cub::DeviceHistogram
documentation.
Enhancements
- #348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.
CUB 1.13.1 (CUDA Toolkit 11.5)
CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.
This release provides a new hook for embedding the cub::
namespace inside
a custom namespace. This is intended to work around various issues related to
linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX
and
CUB_NS_POSTFIX
macros already provided this capability; this update provides a
simpler mechanism that is extended to and integrated with Thrust. Simply define
THRUST_CUB_WRAPPED_NAMESPACE
to a namespace name, and both thrust::
and
cub::
will be placed inside the new namespace. Using different wrapped
namespaces for each shared library will prevent issues like those reported in
NVIDIA/thrust#1401.
New Features
- #326: Add
THRUST_CUB_WRAPPED_NAMESPACE
hooks.