Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups #741

sleeepyjack · 2025-07-05T01:23:21Z

Closes #725

PointKernel

First pass

The logic is straightforward, so there’s not much to review from a technical standpoint.

benchmarks/CMakeLists.txt

PointKernel · 2025-08-14T21:16:40Z

ci/build.sh

    --arch) CUDA_ARCHS="${args[1]}";    args=("${args[@]:2}");;
    --std)  CXX_STANDARD="${args[1]}";  args=("${args[@]:2}");;
    -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");;
+    --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;;


why do we need this change?

This allows passing arbitrary CMake build options through our ci/build.sh script. For instance, this PR introduces the -DCUCO_DOWNLOAD_ROARING_DATA option, which I can now pass as ci/build.sh -- -DCUCO_DOWNLOAD_ROARING_DATA=OFF.

So this can replace #704?

#704 allows for specifying the cmake binary while this change here allows for passing build arguments directly to cmake as a pass-through.

Very optional, but do we instead want to rename this to CMAKE_ARGS, which is what libcudf uses to pass extra arguments to the build?

I've outsourced the changes to the build script in #749

include/cuco/roaring_bitmap.cuh

PointKernel · 2025-08-14T21:43:40Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_storage.cuh

+ private:
+  metadata_type metadata_;
+  cuda::std::byte const* data_;
+  cuda::std::uint8_t const* run_container_bitmap_;


What does run_container_bitmap do?

There are three types of containers in the bitmap: (1) array containers, (2) bitset containers, and (3) "run containers", i.e., arrays that encode consecutive sequences of elements in RLE format. To distinguish the first two types, we can look at the cardinality (if it's >4096, the container is stored as a bitset; otherwise it is an array container). We cannot apply this trick to run containers, so we have to store a bitset with one bit per container indicating whether it's a run container.
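The rule described in this reply can be sketched in a few lines of host C++. This is only an illustration: `container_kind`, `max_array_container_card`, and `classify` are hypothetical names, not identifiers taken from the PR.

```cpp
#include <cstdint>

// Hypothetical names for illustration; not the identifiers used in cuco.
enum class container_kind { array, bitset, run };

// Threshold quoted in the discussion: more than 4096 values means bitset storage.
constexpr std::uint32_t max_array_container_card = 4096;

// `is_run` is the container's bit in the run-container bitset;
// `cardinality` is the number of values stored in the container.
constexpr container_kind classify(bool is_run, std::uint32_t cardinality)
{
  // Run containers cannot be identified by cardinality, so the flag is checked first.
  if (is_run) { return container_kind::run; }
  return cardinality <= max_array_container_card ? container_kind::array
                                                 : container_kind::bitset;
}
```

The run flag takes precedence because a run container can have any cardinality, which is why the extra per-container bit is needed.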

Can we document this somewhere for posterity?

This is taken directly from the RoaringBitmapFormatSpec, which I reference a couple of times in the docs. If you read through the cookie header definition, you'll find the following explanation:

Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth. In this scenario, the cookie header uses 32 bits followed by (size + 7) / 8 bytes.
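As a rough illustration of the quoted passage, the size of the run-container bitset and the lookup of a single flag could look as follows. The function names are hypothetical, and `run_bitmap` is assumed to point at the first byte after the 32-bit cookie.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes occupied by the run-container bitset for `size` containers.
constexpr std::size_t run_bitmap_bytes(std::size_t size) { return (size + 7) / 8; }

// Returns true if container `index` is flagged as a run container.
// The least significant bit of the first byte belongs to the first container.
inline bool is_run_container(std::uint8_t const* run_bitmap, std::size_t index)
{
  return (run_bitmap[index / 8] >> (index % 8)) & 1u;
}
```

For example, nine containers need two bytes, and the flag of the ninth container is the lowest bit of the second byte.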

PointKernel · 2025-08-14T21:44:27Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_ref.inl

+template <class U /* = T */,
+          class /* = cuda::std::enable_if_t<cuda::std::is_same_v<U, cuda::std::uint32_t>> */>


is this some cleanup leftover?

It's a hint mirroring what is there in the definition of the function to help understand why these tparams are there. In this case it means this function is only enabled for the 32-bit roaring bitmap format.
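A minimal sketch of the pattern, using `std::` instead of `cuda::std::` and a hypothetical `bitmap_ref` class with a hypothetical `container_key` member: default template arguments may only appear on the declaration, so the out-of-line definition repeats them as comments.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in class; only the template-parameter pattern matters here.
template <class T>
class bitmap_ref {
 public:
  // Declaration: the defaults live here. The second parameter removes the
  // function from overload resolution unless U is the 32-bit key type.
  template <class U = T, class = std::enable_if_t<std::is_same_v<U, std::uint32_t>>>
  std::uint16_t container_key(U key) const;
};

// Out-of-line definition: defaults must not be repeated, so they are echoed
// in comments to remind the reader why the parameters exist.
template <class T>
template <class U /* = T */,
          class /* = std::enable_if_t<std::is_same_v<U, std::uint32_t>> */>
std::uint16_t bitmap_ref<T>::container_key(U key) const
{
  // Upper 16 bits of a 32-bit key.
  return static_cast<std::uint16_t>(key >> 16);
}
```

Calling `container_key` with a 64-bit key fails to compile because the second template parameter has no valid default in that case.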

PointKernel · 2025-08-14T21:57:40Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh

+    if (this->empty()) {
+      thrust::fill(
+        nosync_exec_policy, contained, contained + cuda::std::distance(first, last), false);
+    } else {
+      thrust::transform(nosync_exec_policy,
+                        first,
+                        last,
+                        contained,
+                        cuda::proclaim_return_type<bool>(
+                          [*this] __device__(auto key) { return this->contains(key); }));
+    }


As discussed offline, we should use cub algorithms here, as thrust doesn't provide control over potential temporary data allocation/deallocation.

My last commit should have switched to cub but somehow the diff is not showing it. I'll verify this.

See commit diff: aa56fd6

PointKernel · 2025-08-14T21:58:27Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh

+        card = 1u + misaligned_load<cuda::std::uint16_t>(
+                      storage_ref_.key_cards() + (index * 2 + 1) * sizeof(cuda::std::uint16_t));
+      }
+      if (card <= 4096) {


Let's get rid of the magic number and give it a name.
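One possible shape of this change, written as plain host C++. The `misaligned_load` below is a hypothetical stand-in for the helper used in the PR (its real definition is not shown here), and `max_array_container_size` is a suggested name for the constant, not the one that was eventually chosen. The layout assumption is taken from the indexing in the snippet above: interleaved 16-bit pairs whose second element is the cardinality minus one.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the PR's misaligned_load: copies sizeof(T) bytes,
// so the read is valid at any alignment.
template <class T>
T misaligned_load(std::byte const* ptr)
{
  T value;
  std::memcpy(&value, ptr, sizeof(T));
  return value;
}

// Suggested name for the literal 4096 (largest cardinality of an array container).
constexpr std::uint32_t max_array_container_size = 4096;

// `key_cards` points at interleaved (key, cardinality - 1) pairs of 16-bit values.
inline std::uint32_t container_cardinality(std::byte const* key_cards, std::size_t index)
{
  return 1u + misaligned_load<std::uint16_t>(key_cards + (index * 2 + 1) * sizeof(std::uint16_t));
}

// True if a container that is not a run container uses array storage.
inline bool uses_array_storage(std::uint32_t card) { return card <= max_array_container_size; }
```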

PointKernel · 2025-08-14T22:01:10Z

include/cuco/detail/roaring_bitmap/util.cuh

+    cuda::std::uint32_t cookie;
+    cuda::std::memcpy(&cookie, buf, sizeof(cuda::std::uint32_t));
+    buf += sizeof(cuda::std::uint32_t);
+    if ((cookie & 0xFFFF) != serial_cookie && cookie != serial_cookie_no_runcontainer) {


maybe name 0xFFFF as mask

Yep, cookie_mask
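For context, a hedged sketch of the cookie check with the mask named. The cookie values (12346 and 12347) and the rule that the upper 16 bits hold the container count minus one follow the Roaring format spec as I understand it. `header_info` and `parse_cookie` are hypothetical names, and the sketch assumes a little-endian host and at least 8 readable bytes in `buf`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Cookie values as defined by the Roaring format spec.
constexpr std::uint32_t serial_cookie_no_runcontainer = 12346;
constexpr std::uint32_t serial_cookie                 = 12347;
// Selects the lower 16 bits of the cookie word.
constexpr std::uint32_t cookie_mask = 0xFFFF;

// Hypothetical result type for this sketch.
struct header_info {
  bool has_run_containers;
  std::uint32_t num_containers;
};

// Assumes a little-endian host and at least 8 readable bytes in `buf`.
inline std::optional<header_info> parse_cookie(std::byte const* buf)
{
  std::uint32_t cookie;
  std::memcpy(&cookie, buf, sizeof(cookie));
  buf += sizeof(cookie);

  if ((cookie & cookie_mask) == serial_cookie) {
    // The upper 16 bits store the number of containers minus one.
    return header_info{true, (cookie >> 16) + 1};
  }
  if (cookie == serial_cookie_no_runcontainer) {
    // The container count follows as a separate 32-bit value.
    std::uint32_t num_containers;
    std::memcpy(&num_containers, buf, sizeof(num_containers));
    return header_info{false, num_containers};
  }
  return std::nullopt;  // not a recognized cookie
}
```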

benchmarks/roaring_bitmap/contains_bench.cu

mhaseeb123 · 2025-08-15T22:37:36Z

benchmarks/roaring_bitmap/contains_bench.cu

+
+  thrust::device_vector<T> items(num_items);
+
+  key_generator gen{};


Just caffeinating the engine ;)

Suggested change

key_generator gen{};

key_generator gen{0xc0ffee};

Ultimately, I would like to use the same default seed for all cuco benchmarks and add an option to adjust it via nvbench. We currently use std::time(nullptr) as the default seed which is another problem in itself. So yeah, this will be a separate PR to fix this for the entire library.

mhaseeb123 · 2025-08-15T22:40:47Z

ci/build.sh

    --arch) CUDA_ARCHS="${args[1]}";    args=("${args[@]:2}");;
    --std)  CXX_STANDARD="${args[1]}";  args=("${args[@]:2}");;
    -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");;
+    --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;;


Very optional, but do we instead want to rename this to CMAKE_ARGS, which is what libcudf uses to pass extra arguments to the build?

mhaseeb123 · 2025-08-15T22:42:31Z

examples/roaring_bitmap/host_bulk_example.cu

+      // Create query keys for the portable_bitmap64.bin file:
+      // https://github.com/RoaringBitmap/RoaringFormatSpec/blob/5177ad9/testdata64/README.md#portable_bitmap64bin
+      std::vector<cuda::std::uint64_t> keys;
+      for (cuda::std::uint64_t k = 0x00000ull; k < 0x09000ull; ++k) {


std::iota at a few places here

I would lean towards keeping the original logic, as it is a 1:1 copy of how the bitmap was serialized (see link in line 70). It just makes it clearer for the reader where these numbers come from.

examples/roaring_bitmap/host_bulk_example.cu

mhaseeb123 · 2025-08-15T22:52:10Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh

+    return static_cast<cuda::std::uint8_t>(container[lower / 8]) &
+           (cuda::std::uint8_t(1) << (lower % 8));


Maybe add a comment about the logic and any reference.

Added some comments and also added a helper function check_bit() to reduce code duplication.
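The signature of the check_bit() helper is not shown in this thread, so the following is only a guess at its shape, written as host C++. It tests one bit of a bitset container, where `pos` is the position of the value within the container.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical shape of a check_bit() helper; the PR's actual signature may differ.
// Byte pos / 8 holds the bit, and bit pos % 8 (counted from the least
// significant bit) within that byte is the one to test.
inline bool check_bit(std::byte const* bitset, std::uint32_t pos)
{
  auto const byte = static_cast<std::uint8_t>(bitset[pos / 8]);
  return (byte & (std::uint8_t{1} << (pos % 8))) != 0;
}
```

Comparing against zero makes the result an explicit bool rather than relying on an implicit conversion of the masked byte.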

mhaseeb123 · 2025-08-15T22:54:20Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh

+namespace cuco::detail {
+
+// primary template
+template <class T>


Looks like we are re-implementing binary search quite a few times in this file. Perhaps we could add a couple of functions in anonymous namespace and reuse. Future PR to do that is also ok.

Yeah, that's definitely a candidate for refactoring. However, there are some details that are slightly different between the implementations, which I'd have to abstract from somehow. I would say this is something we can do in a future PR.
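For reference, a generic version of such a helper might look like the sketch below. It is not taken from the PR and ignores the per-call-site differences mentioned above. It searches a sorted sequence of 16-bit values stored in a byte buffer and reads each element with memcpy so that unaligned storage is tolerated. `binary_search_u16` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true if `target` occurs among the `size` sorted 16-bit values at `data`.
inline bool binary_search_u16(std::byte const* data, std::size_t size, std::uint16_t target)
{
  std::size_t left  = 0;
  std::size_t right = size;  // search window is the half-open range [left, right)
  while (left < right) {
    std::size_t const mid = left + (right - left) / 2;
    std::uint16_t value;
    std::memcpy(&value, data + mid * sizeof(std::uint16_t), sizeof(value));
    if (value == target) { return true; }
    if (value < target) {
      left = mid + 1;
    } else {
      right = mid;
    }
  }
  return false;
}
```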

mhaseeb123 · 2025-08-15T22:55:43Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_storage.cuh

+ private:
+  metadata_type metadata_;
+  cuda::std::byte const* data_;
+  cuda::std::uint8_t const* run_container_bitmap_;


Can we document this somewhere for posterity?

mhaseeb123 · 2025-08-15T22:58:43Z

include/cuco/detail/roaring_bitmap/util.cuh

+    else {
+      cuda::std::memcpy(&num_containers, buf, sizeof(cuda::std::uint32_t));
+      buf += sizeof(cuda::std::uint32_t);
+    }


Don't have a better suggestion but quite a few magic numbers in here 🪄

I cleaned up and simplified some of the logic and added some variable names for these constants. Let me know if I missed something.

mhaseeb123 · 2025-08-15T23:00:06Z

tests/roaring_bitmap/contains_test.cu

+
+#include <fstream>
+#include <string>
+#include <vector>


Same comments as in host_bulk_example.cu about the use of std::iota and std::filesystem::file_size.

sleeepyjack · 2025-08-18T23:39:38Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_impl.cuh

+      cub::DeviceTransform::Transform(
+        thrust::constant_iterator<bool>(false),
+        contained,
+        cuda::std::distance(first, last),
+        cuda::proclaim_return_type<bool>([] __device__(auto /* dummy */) { return false; }),


Can anyone think of a better way of doing a fill operation using cub that works with arbitrary input iterators? We want to avoid thrust::fill as it might introduce some unexpected host syncs.

PointKernel

Looks good to me. One last question needs some discussion: do we want to get this merged in cudf or cuco?

PointKernel · 2025-08-19T01:20:50Z

include/cuco/detail/roaring_bitmap/roaring_bitmap_storage.cuh

+
+ private:
+  allocator_type allocator_;
+  typename ref_type::metadata_type metadata_;


Are there cases where users need to access or manage the metadata when working with a roaring bitmap? If so, yes.

sleeepyjack · 2025-08-19T01:36:25Z

Looks good to me. One last question needs some discussion: do we want to get this merged in cudf or cuco?

@PointKernel The container as is does pass the cuco vibecheck I'd say. The current implementation is limited to lookups but could be extended to support mutable operations as well (although it would require some significant effort).

That said, I'm not at all opposed to moving it into cudf instead. I'm just saying that it could fit into cuco from my standpoint. If you have strong opinions against including it in cuco, I'm open to discussing them. If we move it into cudf, I would need someone to lead the effort who is more versed in libcudf details than I am.

CC @mhaseeb123 what do you think?

mhaseeb123 · 2025-08-19T18:03:34Z

The container as is does pass the cuco vibecheck I'd say

Agreed with this statement. My vote would be to keep this container in cuco especially considering we may want to expand the set of features in the future and then would need to move it back to cuco. That said, if y'all think it's better to move this to libcudf, I am happy to lead the effort.

PointKernel

I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs. Once we decide to support insertion, the existing lookups may also need to change. That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.

mhaseeb123 · 2025-08-19T20:38:07Z

I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs.

Agreed. I don't see any requirement from Spark or libcudf, in the near term at least, to add insertions, deletions etc.

Once we decide to support insertion, the existing lookups may also need to change.

Yup and we may need to move it back to cuco.

That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.

Agreed with this. We can always move it to libcudf if needed.

PointKernel

Looks good. @sleeepyjack Thanks!

sleeepyjack added 3 commits July 4, 2025 16:37

Simple lookup working

9f0c40e

Preliminary benchmark

7ff8399

Optimizations

77a4c1d

sleeepyjack self-assigned this Jul 5, 2025

sleeepyjack added helps: rapids Helps or needed by RAPIDS In Progress Currently a work in progress topic: roaring_bitmap labels Jul 5, 2025

sleeepyjack added 8 commits July 7, 2025 17:54

v2

142ac06

Get rid of span and scope

5977bea

Add empty()

18acbed

Add storage class

64bf0f3

Improve member order to reduce struct size

26e23da

64-bit roaring bitmap

90b6fc5

Merge remote-tracking branch 'upstream/dev' into roaring-bitmap

0e76a29

Update readme

a56e3a9

sleeepyjack force-pushed the roaring-bitmap branch from c8eb1a2 to a56e3a9 Compare August 13, 2025 18:00

sleeepyjack added 3 commits August 13, 2025 17:26

Download roaring bitmap .bin files at build time

078ea67

Allow build script to handle extra cmake args

6cd8413

Add unit test

42e5d01

sleeepyjack marked this pull request as ready for review August 14, 2025 00:47

sleeepyjack requested a review from PointKernel as a code owner August 14, 2025 00:47

sleeepyjack added Needs Review Awaiting reviews before merging and removed In Progress Currently a work in progress labels Aug 14, 2025

sleeepyjack added 3 commits August 14, 2025 05:36

Minor doc fix

20dc816

Compile benchmarks with -lineinfo

5d1b470

Use cub::DeviceTransform instead of thrust::transform

aa56fd6

PointKernel reviewed Aug 14, 2025

sleeepyjack added 2 commits August 14, 2025 16:24

Pass ParentT to CG type and pass CG objects by-value

144be85

Merge branch 'dev', remote-tracking branch 'upstream' into roaring-bi…

b6800a9

…tmap

mhaseeb123 reviewed Aug 15, 2025

sleeepyjack added 6 commits August 15, 2025 18:06

Use CMAKE_ARGS in build.sh and some more minor improvements

a1bc544

Use CMAKE_ARGS in build script and other minor improvements

830ca65

Merge branch 'build-script-improvements' into roaring-bitmap

52691d1

Address comments from code review

01169d8

Merge remote-tracking branch 'upstream' into roaring-bitmap

41f2939

Code simplifications

4e68e8a

sleeepyjack requested review from PointKernel and mhaseeb123 August 18, 2025 23:29

sleeepyjack commented Aug 18, 2025

Resolve merge conflict in build.sh

1dfc139

PointKernel reviewed Aug 19, 2025

PointKernel approved these changes Aug 19, 2025

mhaseeb123 approved these changes Aug 19, 2025

sleeepyjack added 2 commits August 19, 2025 15:26

Use std::fs::file_size in benchmark

245592e

Move to experimental namespace

67d19ec

sleeepyjack changed the title ~~Initial support for roaring bitmap lookups~~ Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups Aug 19, 2025

sleeepyjack requested a review from PointKernel August 19, 2025 22:49

PointKernel approved these changes Aug 19, 2025

sleeepyjack merged commit 4e642f8 into NVIDIA:dev Aug 19, 2025
21 checks passed

sleeepyjack deleted the roaring-bitmap branch August 19, 2025 23:09
