Skip to content

Conversation

@sleeepyjack
Copy link
Collaborator

@sleeepyjack sleeepyjack commented Jul 5, 2025

Closes #725

@sleeepyjack sleeepyjack self-assigned this Jul 5, 2025
@sleeepyjack sleeepyjack added helps: rapids Helps or needed by RAPIDS In Progress Currently a work in progress topic: roaring_bitmap labels Jul 5, 2025
@sleeepyjack sleeepyjack marked this pull request as ready for review August 14, 2025 00:47
@sleeepyjack sleeepyjack added Needs Review Awaiting reviews before merging and removed In Progress Currently a work in progress labels Aug 14, 2025
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass

The logic is straightforward, so there’s not much to review from a technical standpoint.

ci/build.sh Outdated
--arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");;
--std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");;
-v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");;
--) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows for passing arbitrary cmake build options through our ci/build.sh script. For instance, this PR introduces the -DCUCO_DOWNLOAD_ROARING_DATA option, Which I can now pass as ci/build.sh — -DCUCO_DOWNLOAD_ROARING_DATA=OFF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this can replace #704?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#704 allows for specifying the cmake binary while this change here allows for passing build arguments directly to cmake as a pass-through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've outsourced the changes to the build script in #749

private:
metadata_type metadata_;
cuda::std::byte const* data_;
cuda::std::uint8_t const* run_container_bitmap_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does run_container_bitmap do

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three types of containers in the bitmap. (1) array containers, (2) bitset containers, and (3) "run containers" aka array that encode consecutive sequences of elements in RLE format. To distinguish the first two types we can look at the cardinality (if it's >4096 then it is stored as a bitset, otherwise it is an array container). We cannot apply this trick for run containers, thus we have to store a bitset with one bit per container, indicating if it's a run container or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document this somewhere for posterity?

Copy link
Collaborator Author

@sleeepyjack sleeepyjack Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is taken directly from the RoaringBitmapFormatSpec which I reference a couple of times in the docs. I you read throught the cookie header definition you'll find the following explanation:

Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth. In this scenario, the cookie header uses 32 bits followed by (size + 7) / 8 bytes.

Comment on lines +34 to +35
template <class U /* = T */,
class /* = cuda::std::enable_if_t<cuda::std::is_same_v<U, cuda::std::uint32_t>> */>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this some cleanup leftover?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hint mirroring what is there in the definition of the function to help understand why these tparams are there. In this case it means this function is only enabled for the 32-bit roaring bitmap format.

Comment on lines 317 to 327
if (this->empty()) {
thrust::fill(
nosync_exec_policy, contained, contained + cuda::std::distance(first, last), false);
} else {
thrust::transform(nosync_exec_policy,
first,
last,
contained,
cuda::proclaim_return_type<bool>(
[*this] __device__(auto key) { return this->contains(key); }));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we should use cub algorithms here as thrust doesn't provide control for potential temporary data allocation/deallocation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My last commit should have switched to cub but somehow the diff is not showing it. I'll verify this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit diff: aa56fd6

card = 1u + misaligned_load<cuda::std::uint16_t>(
storage_ref_.key_cards() + (index * 2 + 1) * sizeof(cuda::std::uint16_t));
}
if (card <= 4096) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's get rid of magic number and give it a name

cuda::std::uint32_t cookie;
cuda::std::memcpy(&cookie, buf, sizeof(cuda::std::uint32_t));
buf += sizeof(cuda::std::uint32_t);
if ((cookie & 0xFFFF) != serial_cookie && cookie != serial_cookie_no_runcontainer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe name 0xFFFF as mask

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, cookie_mask


thrust::device_vector<T> items(num_items);

key_generator gen{};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just caffeinating the engine ;)

Suggested change
key_generator gen{};
key_generator gen{0xc0ffee};

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, I would like to use the same default seed for all cuco benchmarks and add an option to adjust it via nvbench. We currently use std::time(nullptr) as the default seed which is another problem in itself. So yeah, this will be a separate PR to fix this for the entire library.

ci/build.sh Outdated
--arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");;
--std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");;
-v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");;
--) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build

// Create query keys for the portable_bitmap64.bin file:
// https://github.com/RoaringBitmap/RoaringFormatSpec/blob/5177ad9/testdata64/README.md#portable_bitmap64bin
std::vector<cuda::std::uint64_t> keys;
for (cuda::std::uint64_t k = 0x00000ull; k < 0x09000ull; ++k) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::iota at a few places here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards keeping the original logic as it is a 1:1 copy of how the bitmap was serialize (see link in line 70). Just makes it clearer for the reader where these numbers come from.

Comment on lines 244 to 245
return static_cast<cuda::std::uint8_t>(container[lower / 8]) &
(cuda::std::uint8_t(1) << (lower % 8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May a comment about the logic and any ref.

Copy link
Collaborator Author

@sleeepyjack sleeepyjack Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments and also added a helper function check_bit() to reduce code duplication.

namespace cuco::detail {

// primary template
template <class T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are re-implementing binary search quite a few times in this file. Perhaps we could add a couple of functions in anonymous namespace and reuse. Future PR to do that is also ok.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's definitely a candidate for a refactoring. However, there a some details that are slightly different between the implementations which I'd have to abstract from somehow. I would say this is something we can do in a future PR.

private:
metadata_type metadata_;
cuda::std::byte const* data_;
cuda::std::uint8_t const* run_container_bitmap_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document this somewhere for posterity?

else {
cuda::std::memcpy(&num_containers, buf, sizeof(cuda::std::uint32_t));
buf += sizeof(cuda::std::uint32_t);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have a better suggestion but quite a few magic numbers in here 🪄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up and simplified some of the logic and added some variable names for these constants. Let me know if I missed something.


#include <fstream>
#include <string>
#include <vector>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments from host_bulk_example.cu about Use of std::iota and std::filesystem::file_size

Comment on lines +77 to +81
cub::DeviceTransform::Transform(
thrust::constant_iterator<bool>(false),
contained,
cuda::std::distance(first, last),
cuda::proclaim_return_type<bool>([] __device__(auto /* dummy */) { return false; }),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can anyone think of a better way of doing a fill operation using cub that works with arbitrary input iterators? We want to avoid thrust::fill as it might introduce some unexpected host syncs.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. One last question needs some discussions. Do we want to get this merged in cudf or cuco?


private:
allocator_type allocator_;
typename ref_type::metadata_type metadata_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where users need to access or manage the metadata when working with a roaring bitmap? If so, yes.

@sleeepyjack
Copy link
Collaborator Author

sleeepyjack commented Aug 19, 2025

Looks good to me. One last question needs some discussions. Do we want to get this merged in cudf or cuco?

@PointKernel The container as is does pass the cuco vibecheck I'd say. The current implementation is limited to lookups but could be extended to support mutable operations as well (although it would require some significant effort).

That said, I'm not at all opposed to moving it into cudf instead - just saying that it could fit into cuco from my standpoint. If you have strong oppinions against including it in cuco I'm open to discuss them. If we move it into cudf I would need someone to lead the effort who is more versed with libcudf details than me.

CC @mhaseeb123 what do you think?

@mhaseeb123
Copy link
Contributor

mhaseeb123 commented Aug 19, 2025

The container as is does pass the cuco vibecheck I'd say

Agreed with this statement. My vote would be to keep this container in cuco especially considering we may want to expand the set of features in the future and then would need to move it back to cuco. That said, if y'all think it's better to move this to libcudf, I am happy to lead the effort.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs. Once we decide to support insertion, the existing lookups may also need to change. That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.

@mhaseeb123
Copy link
Contributor

I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs.

Agreed. I don't see any requirement from Spark or libcudf, in the near term at least, to add insertions, deletions etc.

Once we decide to support insertion, the existing lookups may also need to change.

Yup and we may need to move it back to cuco.

That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.

Agreed with this. We can always move it to libcudf if needed.

@sleeepyjack sleeepyjack changed the title Initial support for roaring bitmap lookups Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups Aug 19, 2025
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. @sleeepyjack Thanks!

@sleeepyjack sleeepyjack merged commit 4e642f8 into NVIDIA:dev Aug 19, 2025
21 checks passed
@sleeepyjack sleeepyjack deleted the roaring-bitmap branch August 19, 2025 23:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

helps: rapids Helps or needed by RAPIDS Needs Review Awaiting reviews before merging topic: roaring_bitmap

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Roaring bitmaps discussion and scoping

3 participants