-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add get_index and it will return the index of the element #470
Conversation
This is reasonable and potentially useful but while If it is a terminology you created, we will have to discuss it further. |
Apologies for any confusion around the term '1 index'. My intent is to refer simply to the index of a given value in the RoaringBitmap. As an example, if a Roaring Bitmap is {2, 4, 5, 7, 8, 9, 10}, I want get the index of value 7, it will return 4, which means that 7 is the 4th 1 in the set. If i want get the index of value 6, it will return -1, which means that 6 is not a element of the set. In a word, The proposed get_idx() interface aims to find the index (or -1 if not exsist) in a single traversal. |
@bryceustc I do understand the concept. I am concerned about the terminology. I don't have much to propose, but "get_idx" and "1 index" won't work. You know what it means, but the average programmer won't be able to tell. What about "get_index" and just say that we are return the index of the element? |
It's a good suggestion, I will modify the relevant code and descriptions. Thanks :) |
I have modified the relevant content according to your suggestions. Thank you in advance for reviewing again. |
cpp/roaring.hh
Outdated
* when x isn't in the set, but the rank funciton will return a | ||
* non-negative number. | ||
*/ | ||
int64_t get_index(uint32_t x) const noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryceustc nice work. but the name/comments may be a little confusing. index
start from 0 in general, while the rank
value start from 1 (if contains)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks for your comments, it should start from 0. I have modified the relevant codes and descriptions, please review again when convenient :)
tests/toplevel_unit.c
Outdated
size_t i = 0; | ||
for (; i < length; ++i) { | ||
if (arr[i] == x) { | ||
is_present = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can return i;
here for simplicity.
and the rests LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your meticulous work!
Maybe it would be worthwhile to take a page from binarySearch, and return
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About naming, not many of the roaring functions have get
in them, even when they're logically getters. I wonder if a better name might be roaring_bitmap_index_of
, or something referencing select
since that's the dual of this function, like roaring_bitmap_selection_index
?
include/roaring/roaring.h
Outdated
@@ -799,6 +799,15 @@ bool roaring_bitmap_select(const roaring_bitmap_t *r, uint32_t rank, | |||
*/ | |||
uint64_t roaring_bitmap_rank(const roaring_bitmap_t *r, uint32_t x); | |||
|
|||
/** | |||
* Returns the index of x in the given roaring bitmap. | |||
* If the roaring bitmap dosen't contain x , this function will return -1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dosen't -> doesn't
include/roaring/roaring.h
Outdated
* Returns the index of x in the given roaring bitmap. | ||
* If the roaring bitmap dosen't contain x , this function will return -1. | ||
* The difference with rank function is that this function will return -1 when x | ||
* is not the element of roaring bitmap, but the rank funciton will return a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
funciton -> function
cpp/roaring.hh
Outdated
* when x isn't in the set, but the rank funciton will return a | ||
* non-negative number. | ||
*/ | ||
int64_t get_index(uint32_t x) const noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other C++ methods are camelCase
, I think this should be getIndex
Do we want to add the same to roaring64map.hh
?
@Dr-Emann Thanks for your corrections. I have revised the relevant codes and comment. Please review again when it's convenient. :) In addition, there are a few points I would like to discuss further with you.
|
cpp/roaring64map.hh
Outdated
roaring_iter != roaring_destination; ++roaring_iter) { | ||
index += roaring_iter->second.cardinality(); | ||
} | ||
index += roaring_destination->second.getIndex(lowBytes(x)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryceustc it is a bug to add -1
to index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the type of index
should be int128_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review, the bug and comments have been fixed in the latest commit.
cpp/roaring64map.hh
Outdated
@@ -1092,6 +1093,23 @@ public: | |||
return result; | |||
} | |||
|
|||
/** | |||
* Returns the number of integers that are smaller or equal to x. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comments is misleading
cpp/roaring64map.hh
Outdated
* when x isn't in the set, but the rank function will return a | ||
* non-negative number. | ||
*/ | ||
int128_t getIndex(uint64_t x) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rank
uses uint64_t for its return type. I think it might be reasonable, if every range container took 1 byte, storing a roaring bitmap of all uint64s would take 256 TB of memory, and there's a lot more overhead than 1 byte per range (0..0xFFFF)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm afraid I didn't grasp what you meant...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to use int128_t, because if you ever had an item at an index greater than an int64_t can represent, you're already out of memory. The rank
method returns a uint64_t, and I argue this method should return an int64_t.
The most efficient way to get there is to have a range container for every slot in every bitmap. Each range container can cover 0x10000 items. So to get to 0x8000_0000_0000_0000 (the first index which can't be returned in an i64), you need 0x0000_8000_0000_000 range containers (140737488355328). If each range container (and the keys array, and the container type array, and malloc overhead, and...) magically took just 1 byte per container (spoiler, it will take a LOT more than that), you'd still need 128TiB of memory. There's simply no way to ever realistically have a RoaringBitmap64 with more than 2^63 items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My math is slightly more pessimistic. Each container can represent 65536 values. I get 2**64/(2**16)
is 281474976710656. Assuming one byte per container that is... 256 TiB.
I think that Windows cannot ever support more than 6 TB of memory... https://learn.microsoft.com/fr-ca/windows/win32/memory/memory-limits-for-windows-releases?redirectedfrom=MSDN
A reasonable solution here is to use a 64-bit value and explain why you do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed integer means 2^63
rather than 2^64
. I made that mistake initially, you can see I said 256 TB in my first message. Not that it matters, either way, even assuming an impossible 1 byte per container, it's not gonna fit in even virtual memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ok. Most systems have a theoretical limit of maybe 2^56 bytes per process. Of course, this may change in the future, but before that happens, int128_t will be part of the every standard programming language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thank you all for the professional and detailed explanations. I have changed int128_t
to int64_t
in the latest commit.
Running tests. |
@Dr-Emann @longqimin This PR looks good to me. Any objection to merging? |
LGTM too |
Merged. |
When using the CRoaring library, the normal way to get the 1 index of a value in a RoaringBitmap is:
However, this requires traversing the RoaringBitmap twice, which is inefficient. To improve this, a get_idx() interface was created. It works as follows: