-
Notifications
You must be signed in to change notification settings - Fork 100
Support for 32- and 64-bit cuco::experimental::roaring_bitmap lookups #741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c8eb1a2 to
a56e3a9
Compare
PointKernel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
The logic is straightforward, so there’s not much to review from a technical standpoint.
ci/build.sh
Outdated
| --arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");; | ||
| --std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");; | ||
| -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");; | ||
| --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows for passing arbitrary cmake build options through our ci/build.sh script. For instance, this PR introduces the -DCUCO_DOWNLOAD_ROARING_DATA option, Which I can now pass as ci/build.sh — -DCUCO_DOWNLOAD_ROARING_DATA=OFF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this can replace #704?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#704 allows for specifying the cmake binary while this change here allows for passing build arguments directly to cmake as a pass-through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've outsourced the changes to the build script in #749
| private: | ||
| metadata_type metadata_; | ||
| cuda::std::byte const* data_; | ||
| cuda::std::uint8_t const* run_container_bitmap_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does run_container_bitmap do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three types of containers in the bitmap. (1) array containers, (2) bitset containers, and (3) "run containers" aka array that encode consecutive sequences of elements in RLE format. To distinguish the first two types we can look at the cardinality (if it's >4096 then it is stored as a bitset, otherwise it is an array container). We cannot apply this trick for run containers, thus we have to store a bitset with one bit per container, indicating if it's a run container or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this somewhere for posterity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taken directly from the RoaringBitmapFormatSpec which I reference a couple of times in the docs. I you read throught the cookie header definition you'll find the following explanation:
Let size be the number of containers. Then we store (size + 7) / 8 bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth. In this scenario, the cookie header uses 32 bits followed by (size + 7) / 8 bytes.
| template <class U /* = T */, | ||
| class /* = cuda::std::enable_if_t<cuda::std::is_same_v<U, cuda::std::uint32_t>> */> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this some cleanup leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a hint mirroring what is there in the definition of the function to help understand why these tparams are there. In this case it means this function is only enabled for the 32-bit roaring bitmap format.
| if (this->empty()) { | ||
| thrust::fill( | ||
| nosync_exec_policy, contained, contained + cuda::std::distance(first, last), false); | ||
| } else { | ||
| thrust::transform(nosync_exec_policy, | ||
| first, | ||
| last, | ||
| contained, | ||
| cuda::proclaim_return_type<bool>( | ||
| [*this] __device__(auto key) { return this->contains(key); })); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, we should use cub algorithms here as thrust doesn't provide control for potential temporary data allocation/deallocation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My last commit should have switched to cub but somehow the diff is not showing it. I'll verify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See commit diff: aa56fd6
| card = 1u + misaligned_load<cuda::std::uint16_t>( | ||
| storage_ref_.key_cards() + (index * 2 + 1) * sizeof(cuda::std::uint16_t)); | ||
| } | ||
| if (card <= 4096) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's get rid of magic number and give it a name
| cuda::std::uint32_t cookie; | ||
| cuda::std::memcpy(&cookie, buf, sizeof(cuda::std::uint32_t)); | ||
| buf += sizeof(cuda::std::uint32_t); | ||
| if ((cookie & 0xFFFF) != serial_cookie && cookie != serial_cookie_no_runcontainer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe name 0xFFFF as mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, cookie_mask
|
|
||
| thrust::device_vector<T> items(num_items); | ||
|
|
||
| key_generator gen{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just caffeinating the engine ;)
| key_generator gen{}; | |
| key_generator gen{0xc0ffee}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, I would like to use the same default seed for all cuco benchmarks and add an option to adjust it via nvbench. We currently use std::time(nullptr) as the default seed which is another problem in itself. So yeah, this will be a separate PR to fix this for the entire library.
ci/build.sh
Outdated
| --arch) CUDA_ARCHS="${args[1]}"; args=("${args[@]:2}");; | ||
| --std) CXX_STANDARD="${args[1]}"; args=("${args[@]:2}");; | ||
| -v | -verbose | --verbose) VERBOSE=1; args=("${args[@]:1}");; | ||
| --) EXTRA_CMAKE_OPTIONS+=("${args[@]:1}"); break;; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very optional but do we instead want to rename this as CMAKE_ARGS which is what libcudf uses to add extra CMAKE_ARGS to the build
| // Create query keys for the portable_bitmap64.bin file: | ||
| // https://github.com/RoaringBitmap/RoaringFormatSpec/blob/5177ad9/testdata64/README.md#portable_bitmap64bin | ||
| std::vector<cuda::std::uint64_t> keys; | ||
| for (cuda::std::uint64_t k = 0x00000ull; k < 0x09000ull; ++k) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::iota at a few places here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean towards keeping the original logic as it is a 1:1 copy of how the bitmap was serialize (see link in line 70). Just makes it clearer for the reader where these numbers come from.
| return static_cast<cuda::std::uint8_t>(container[lower / 8]) & | ||
| (cuda::std::uint8_t(1) << (lower % 8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May a comment about the logic and any ref.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments and also added a helper function check_bit() to reduce code duplication.
| namespace cuco::detail { | ||
|
|
||
| // primary template | ||
| template <class T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are re-implementing binary search quite a few times in this file. Perhaps we could add a couple of functions in anonymous namespace and reuse. Future PR to do that is also ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's definitely a candidate for a refactoring. However, there a some details that are slightly different between the implementations which I'd have to abstract from somehow. I would say this is something we can do in a future PR.
| private: | ||
| metadata_type metadata_; | ||
| cuda::std::byte const* data_; | ||
| cuda::std::uint8_t const* run_container_bitmap_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this somewhere for posterity?
| else { | ||
| cuda::std::memcpy(&num_containers, buf, sizeof(cuda::std::uint32_t)); | ||
| buf += sizeof(cuda::std::uint32_t); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't have a better suggestion but quite a few magic numbers in here 🪄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned up and simplified some of the logic and added some variable names for these constants. Let me know if I missed something.
|
|
||
| #include <fstream> | ||
| #include <string> | ||
| #include <vector> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments from host_bulk_example.cu about Use of std::iota and std::filesystem::file_size
| cub::DeviceTransform::Transform( | ||
| thrust::constant_iterator<bool>(false), | ||
| contained, | ||
| cuda::std::distance(first, last), | ||
| cuda::proclaim_return_type<bool>([] __device__(auto /* dummy */) { return false; }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can anyone think of a better way of doing a fill operation using cub that works with arbitrary input iterators? We want to avoid thrust::fill as it might introduce some unexpected host syncs.
PointKernel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One last question needs some discussions. Do we want to get this merged in cudf or cuco?
|
|
||
| private: | ||
| allocator_type allocator_; | ||
| typename ref_type::metadata_type metadata_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where users need to access or manage the metadata when working with a roaring bitmap? If so, yes.
@PointKernel The container as is does pass the cuco vibecheck I'd say. The current implementation is limited to lookups but could be extended to support mutable operations as well (although it would require some significant effort). That said, I'm not at all opposed to moving it into cudf instead - just saying that it could fit into cuco from my standpoint. If you have strong oppinions against including it in cuco I'm open to discuss them. If we move it into cudf I would need someone to lead the effort who is more versed with libcudf details than me. CC @mhaseeb123 what do you think? |
Agreed with this statement. My vote would be to keep this container in cuco especially considering we may want to expand the set of features in the future and then would need to move it back to cuco. That said, if y'all think it's better to move this to libcudf, I am happy to lead the effort. |
PointKernel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m concerned that this is mainly addressing Spark’s requirements rather than providing a general-purpose bitmap solution on GPUs. Once we decide to support insertion, the existing lookups may also need to change. That said, it’s fine for now—let’s merge the PR as is and revisit the full integration later once we have a clearer picture of how cudf/Spark will use this data structure.
Agreed. I don't see any requirement from Spark or libcudf, in the near term at least, to add insertions, deletions etc.
Yup and we may need to move it back to cuco.
Agreed with this. We can always move it to libcudf if needed. |
PointKernel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. @sleeepyjack Thanks!
Closes #725