Make BitSet store any Int #25029

rfourquet · 2017-12-11T20:27:51Z

BitSet was renamed to IntSet a couple months ago (#24282) because it couldn't store negative integers. This PR implements this feature, and renames BitSet back to IntSet.

It is implemented as a kind of offset array, so it's purpose is still to store a dense set of integers, but those don't have to be grouped around 0. For example, IntSet(typemin(Int):typemin(Int)+100) is perfectly fine.

One notable change: using a BitVector as the underlying implemention started to get in the way: its API was insufficient, and it was tedious to play with and maintain the code at 3 layers simultaneously (IntSet, BitSet, Vector{UInt64}). So the bits field of IntSet is now directly a Vector{UInt64}, with calls to low-level bitarray.jl functions.

IntSet being used in inference.jl, I knew I had to keep good performance, so there is a small series a commits which each implement some improvements, the end result looks very satisfying e.g.

julia> @btime BitSet(1:1000); # master
  2.990 μs (6 allocations: 400 bytes)

julia> @btime IntSet(1:1000); # PR
  2.370 μs (5 allocations: 384 bytes)

julia> A = BitSet(1:1000); @btime collect($A); # master
  6.165 μs (1 allocation: 7.94 KiB)

julia> A = IntSet(1:1000); @btime collect($A); # PR
  6.452 μs (1 allocation: 7.94 KiB)

julia> B = BitSet(500:2000); @btime union($A, $B); # master
  222.936 ns (6 allocations: 656 bytes)

julia> B = IntSet(500:2000); @btime union($A, $B); # PR
  185.000 ns (5 allocations: 640 bytes)

julia> @btime $A == $B; # master
  5.209 ns (0 allocations: 0 bytes)

julia> @btime $A == $B; # PR
  4.869 ns (0 allocations: 0 bytes)

~~The last example shows that there are some non-negligible regressions too,~~ I didn't investigate thoroughly the performance though (BTW, there is a BaseBenchmarks.jl PR to be soon merged which tracks some of the perfs), and more improvements can be obtained. Obviously, the old BitSet will always be more performant than this IntSet, if given the same set of optimizations, but I would say by no more than roughly 15% in general. I think it's good enough for what this gives us.

PS: Because GitHub doesn't seem to handle nicely file renames, I temporarily reverted the last commit which was deprecating BitSet -> IntSet, to make reviewing easier.

carlobaldassi · 2017-12-11T20:41:31Z

base/bitarray.jl

-    start > length(B) && return 0
-
-    Bc = B.chunks
+bitcount(B::BitArray) = bitcount(B.chunks)


count(B::BitArray)?

Or even sum although I think that's the same thing when the eltype is Bool.

rfourquet · 2017-12-11T20:52:47Z

base/bitset.jl

+const Bits = Vector{UInt64}
+const Chk0 = zero(UInt64)
+const NoOffset = -one(Int) << 60
+# + NoOffset must be big enough to stay < 0 when add with any offset.


small enough

carlobaldassi · 2017-12-11T20:55:08Z

base/bitset.jl

-    bits::BitVector
-    BitSet() = new(sizehint!(falses(0), 256))
+const Bits = Vector{UInt64}
+const Chk0 = zero(UInt64)


I don't think you need this at all, you can just use 0 everywhere. The conversion will be implicit and it should be performed at compilation time.

It was not really for performance, more for readability!

Ok. I would still bikeshed the name though, to me it looks like a type when capitalized...

Ok done, renamed to full caps, also for NO_OFFSET.

carlobaldassi · 2017-12-11T21:00:13Z

base/bitset.jl

+sizehint!(s::BitSet, n::Integer) = sizehint!(s.bits, n)
+
+# given an integer i, return the chunk which stores it
+chk_indice(i::Int) = i >> 6


This and the next are the same as _div64 and _mod64 from "bitarray.jl". I know those names are not great, but it seems slightly silly to have two sets of nearly identical definitions.

Ah yeah right! They are like that because they evolved: first they did what get_chunks_id do, then they were updated to work for all Int range by promoting to Int128, and ended up this way; but chk_indice must use the signed shift operator, so can _div64 can be changed to use >> ?

Yes, no part of the code which uses _div64 is ever supposed to get a negative integer, and it would give meaningless results anyway if it did, so >> should be fine.

Ok thanks, done

StefanKarpinski

This seems cool but I'm a little confused. Is the premise that this generalizes the functionality of BitSet so that it supports both positive and negative integers and therefore renames it to IntSet, but you've left off the commit changing the name?

StefanKarpinski · 2017-12-11T22:11:55Z

base/bitarray.jl

-    start > length(B) && return 0
-
-    Bc = B.chunks
+bitcount(B::BitArray) = bitcount(B.chunks)


Or even sum although I think that's the same thing when the eltype is Bool.

rfourquet · 2017-12-12T09:46:42Z

Is the premise that this generalizes the functionality of BitSet so that it supports both positive and negative integers and therefore renames it to IntSet

Yes: a set data-structures which can store any Int is better named IntSet I think. I have no one around to do a poll now, but I would bet IntSet is much more self-descriptive than BitSet for a programmer who starts using Julia. Plus, by reverting the deprecation to BitSet, users porting from 0.6 to 0.7 will not even be bothered by a deprecation :)

but you've left off the commit changing the name?

Yes, because with this commit, github shows a whole deleted file "bitset.jl" and a whole new file "intset.jl", so it's hard to track the changes between the two. I could as well put the renaming to another PR if you think it's better.

rfourquet · 2017-12-12T10:59:17Z

I removed the renaming part fow now as it's more work when rebasing. But would there be a consensus for this rename to IntSet, or should I label this for triage?

I addressed the comments and updated the == code so that it doesn't show the regression anymore.

rfourquet · 2017-12-12T12:41:16Z

There are errors on 32 bits builds, ~~I tried to revert the _div64 change in bitarray.jl to see if it came from there, but was not the problem. I will try to find the cause, but without a 32 bit myself, I fear it will be difficult.~~
EDIT: got it, it was just me assuming Int == Int64 😳

StefanKarpinski · 2017-12-12T15:34:23Z

Yes, with this change, IntSet is a fitting name. Nice change!

Sacha0 · 2017-12-12T17:45:55Z

Yes, with this change, IntSet is a fitting name. Nice change!

I tend to agree, though under the hood this data structure is more like a PackedIntSet or so :).

rfourquet · 2017-12-14T11:10:13Z

I rebased and just add some sort of fuzzy testing comparing the result of operations on Set and on BitSet to check for consistency (which allowed me to catch a couple bugs in this PR). It would be ready to go on my side, except that the deprecation back to IntSet is not yet included, I will do when/if I get approval for merging.

rfourquet · 2017-12-14T12:11:51Z

I also ran locally the not-merged-yet benchmark for collections, and it shows only performance improvements, except for something which compares apples and oranges (i.e. union(BitSet([2]), BitSet([2^18])), where the BitSet are precomputed: in this PR, BitSet([2^18]) allocates one limb around it's unique element, whereas before it allocated all the space needed to store positive integers till 2^18, so now the union forces allocation of new limbs). The other small performance regression is concerning repeated push! after calling sizehint!, ~~and I am not sure why~~ doh, I was forwarding sizehint! to the bits::Vector{UInt64} without dividing the argument by 64!

StefanKarpinski · 2017-12-14T15:58:59Z

Windows failure appears to be an LAPACK precision issue. Seems unrelated? Travis failures were due to network connection, so I restarted them.

Sacha0 · 2017-12-14T18:15:19Z

Pondering the name a bit more might be worthwhile: This data structure is not an efficient representation of arbitrary sets of Ints; rather, it is an efficient representation of sets of Ints densely packed over some range. (This distinction tripped me up some years ago when writing a graph algorithm.) Something like DenseIntSet might be more appropriate? Best!

rfourquet · 2017-12-14T18:42:53Z

DenseIntSet is more descriptive, but 3 words in the name starts to smell pedantic to me. As Set{Int} is already there for non-dense ones, IntSet sounds fine to me, with a docstring. For arrays, DenseArray is abstract, but it would be strange and unfriendly if this was the name of the concrete type.

Sacha0 · 2017-12-14T18:55:02Z

DenseIntSet is more descriptive, and that more descriptive name is helpfully clarifying to me.

ftfty :) As mentioned above, the name IntSet misled me in practice.

For arrays, DenseArray is abstract, but it would be strange and unfriendly if this was the name of the concrete type.

This seems orthogonal?

rfourquet · 2017-12-14T19:10:57Z

This seems orthogonal?

I was just saying that Array is dense, but we don't call it DenseArray (although it has a supertype with that name), similarly we don't necessarily have to call a non-sparse set with Dense in the name. Hard to find the best name, but IntSet at least has the objective advantage of allowing 0.6 users to not see a deprecation ;-)

Sacha0 · 2017-12-14T19:26:28Z

I was just saying that Array is dense, but we don't call it DenseArray (although it has a supertype with that name), similarly we don't necessarily have to call a non-sparse set with Dense in the name. Hard to find the best name, but IntSet at least has the objective advantage of allowing 0.6 users to not see a deprecation ;-)

I see :). Thanks for the clarification!

To clarify on my end: When I see the name Set{T}, I assume that I can store N arbitrary objects of type T in that collection using roughly O(N) storage. So when I see the name IntSet, I think: "Lovely! This is a fast/efficient Set specialized for Ints. I can efficiently store and work with arbitrary sets of Ints in this data structure." I then proceed to use IntSets for sets of integers containing, e.g., 1, 3728, and 10^7. At some point I observe that storage is atrocious, am thoroughly confused, and burn some time figuring out that IntSet([1, 3728, 10^7]) requires O(10^7) storage because IntSet is a representation of a densely packed set of integers. This was my experience in practice :). (Edit: And that experience is what motivated me to see the BitSet rename through.)

StefanKarpinski · 2017-12-14T22:19:01Z

+1 to IntSet.

StefanKarpinski · 2017-12-14T22:29:19Z

On the triage call, @Sacha0 has made a compelling argument for BitSet mostly reiterating the above – the key issue is that this is not an entirely general purpose data structure and the implementation does limit its use cases quite a bit. I would also be fine with leaving the name as BitSet and embracing the new functionality.

StefanKarpinski · 2017-12-14T22:32:20Z

Triage decision: yes to this functionality and change, but keep the name BitSet.

rfourquet · 2017-12-16T10:31:41Z

Would Nanosoldier be on his feet again?
@nanosoldier runbenchmarks("collection", vs=":master")

StefanKarpinski · 2017-12-16T17:52:12Z

Travis Linux runs passed
Windows 64-bit is Bunch-Kaufman failures
macOS failure is some spawn test, seems unrelated?

rfourquet · 2017-12-16T18:06:14Z

It looked unrelated to me too. I could also restart the build just to double check (BitSet being used in such a crucial part as inference makes me a little nervous!) I also just wished to have had Nanosoldier run, to prove the nice little speed-ups ;-)

StefanKarpinski · 2017-12-16T18:23:32Z

I restarted it, but it's probably overkill.

rfourquet · 2017-12-17T09:47:26Z

@nanosoldier runbenchmarks("collection", vs=":master")

nanosoldier · 2017-12-17T10:50:22Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

ararslan · 2017-12-17T20:39:52Z

Sorry, so many things have changed at once in 0.7 that I haven't had a chance to keep BaseBenchmarks working. I'll try to get it back up within the next couple of days, but any help updating BaseBenchmarks would be very welcome!

The BitVector had already started to show its limits for BitSet needs. Juggling at 3 layers with BitSet, BitVector, and Vector{UInt64} was becoming confusing at times. Updating this code to handle negative integers only increased the confusion. Some modest performance improvements (~20%) could be achieved in the process for some functions.

rfourquet · 2017-12-19T07:20:28Z

@nanosoldier runbenchmarks("collection", vs=":master")

nanosoldier · 2017-12-19T08:29:53Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Add an offset field to BitSet acting as an infimum of stored values, which can be adjusted on-the-fly. BitSet then becomes able to store negative integers, and doesn't have to have elements centered around 0.

The 1-based indexing was forcing to promote Int to Int128 in chk_indice & chk_offset, to allow correct computation with extreme values (e.g. BitSet([typemin(Int)])). 1-based indexing was a left-over from the former BitVector as underlying implementation. Switching to 0-based indexing allows both a more-direct mapping between the model and the implementation, and better performance.

When a BitSet is empty, its offset must be initialized. But checking for emptyness in each invocation of _setint! was costly, so we put the check in a branch which is called less often.

_setint! was delegating to the bitarray code, but then some work was being redone in get_chunks_id. So we split this functionality into lower level functions.

* isempty: using _check0 instead of all makes it 35% faster * ==: checking first non-overlapping parts is more likely to be faster, as the the lower and upper parts of the bits field are unlikely to be zero (at least for a freshly created BitSet)

rfourquet · 2017-12-19T08:52:23Z

For benchmarks results see my comment above for the expected "regression" for symdiff! and union!. For sizehint!, I again got it wrong by returning the underlying Vector{UInt64} instead of the bitset itself 😊 Corrected now, test added, and the perf doesn't show regression anymore, as should be expected, thanks nanosoldier for helping in finding the last bug!

Sacha0 · 2017-12-19T20:37:17Z

Thanks again for this beautiful work @rfourquet! :)

rfourquet · 2017-12-19T20:51:02Z

Although I had started this work months ago, it's because I didn't like the new name that I felt the urge to finish it before feature frease, to give an argument to change it back... here we are now with the same name, but I guess you are responsible for we having this feature now :)

JeffBezanson · 2017-12-20T13:52:30Z

Some of these commits should have been squashed.

rfourquet · 2017-12-20T16:35:14Z

Some of these commits should have been squashed.

These are all pretty self contained and "clean" (not of the "fix typo" or "wip" kind), so I though it would be OK, but sorry about that, will try to squash more next time.

rfourquet force-pushed the rf/intset-offset branch from 2097111 to 4ac7580 Compare December 11, 2017 20:28

rfourquet added collections Data structures holding multiple items, e.g. sets deprecation This change introduces or involves a deprecation labels Dec 11, 2017

rfourquet force-pushed the rf/intset-offset branch from 4ac7580 to 951bfce Compare December 11, 2017 20:33

carlobaldassi reviewed Dec 11, 2017

View reviewed changes

rfourquet commented Dec 11, 2017

View reviewed changes

carlobaldassi reviewed Dec 11, 2017

View reviewed changes

StefanKarpinski reviewed Dec 11, 2017

View reviewed changes

rfourquet force-pushed the rf/intset-offset branch from 951bfce to 1ff860c Compare December 12, 2017 10:51

rfourquet force-pushed the rf/intset-offset branch from 0620440 to 1a02226 Compare December 12, 2017 12:58

rfourquet added this to the 1.0 milestone Dec 13, 2017

rfourquet force-pushed the rf/intset-offset branch from 1a02226 to 73f247e Compare December 14, 2017 11:07

rfourquet force-pushed the rf/intset-offset branch from 73f247e to d136201 Compare December 14, 2017 12:14

rfourquet removed the deprecation This change introduces or involves a deprecation label Dec 16, 2017

rfourquet force-pushed the rf/intset-offset branch from 8cbd17e to f39efaf Compare December 19, 2017 07:19

rfourquet force-pushed the rf/intset-offset branch from f39efaf to 6072d23 Compare December 19, 2017 08:31

rfourquet added 8 commits December 19, 2017 09:45

morph BitSet into a real Int Set

cb49741

Add an offset field to BitSet acting as an infimum of stored values, which can be adjusted on-the-fly. BitSet then becomes able to store negative integers, and doesn't have to have elements centered around 0.

BitSet push! & pop! : don't check for emptyness every time

13cdaeb

When a BitSet is empty, its offset must be initialized. But checking for emptyness in each invocation of _setint! was costly, so we put the check in a branch which is called less often.

optimize updating a value in BitSet

7f44009

_setint! was delegating to the bitarray code, but then some work was being redone in get_chunks_id. So we split this functionality into lower level functions.

re-implement rand(::BitSet)

dd6a94a

BitSet: use _div64 & _mod64 from bitarray.jl

55968e7

add fuzzy testing for some Set and BitSet functions

55852ee

rfourquet force-pushed the rf/intset-offset branch from 6072d23 to 55852ee Compare December 19, 2017 08:46

rfourquet merged commit c16c0cb into master Dec 19, 2017

rfourquet deleted the rf/intset-offset branch December 19, 2017 11:51

fredrikekre mentioned this pull request Apr 19, 2018

please implement rand for IntSet #26824

Closed

Make BitSet store any Int #25029

Make BitSet store any Int #25029

Conversation

rfourquet commented Dec 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanKarpinski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rfourquet commented Dec 12, 2017

rfourquet commented Dec 12, 2017

rfourquet commented Dec 12, 2017 • edited Loading

StefanKarpinski commented Dec 12, 2017

Sacha0 commented Dec 12, 2017

rfourquet commented Dec 14, 2017

rfourquet commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

Sacha0 commented Dec 14, 2017

rfourquet commented Dec 14, 2017

Sacha0 commented Dec 14, 2017

rfourquet commented Dec 14, 2017

Sacha0 commented Dec 14, 2017 • edited Loading

StefanKarpinski commented Dec 14, 2017 • edited Loading

StefanKarpinski commented Dec 14, 2017

StefanKarpinski commented Dec 14, 2017

rfourquet commented Dec 16, 2017

StefanKarpinski commented Dec 16, 2017

rfourquet commented Dec 16, 2017

StefanKarpinski commented Dec 16, 2017

rfourquet commented Dec 17, 2017

nanosoldier commented Dec 17, 2017

ararslan commented Dec 17, 2017

rfourquet commented Dec 19, 2017

nanosoldier commented Dec 19, 2017

rfourquet commented Dec 19, 2017

Sacha0 commented Dec 19, 2017

rfourquet commented Dec 19, 2017

JeffBezanson commented Dec 20, 2017

rfourquet commented Dec 20, 2017

rfourquet commented Dec 11, 2017 •

edited

Loading

rfourquet commented Dec 12, 2017 •

edited

Loading

Sacha0 commented Dec 14, 2017 •

edited

Loading

StefanKarpinski commented Dec 14, 2017 •

edited

Loading