Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make BitSet store any Int #25029

Merged
merged 9 commits into from
Dec 19, 2017
Merged

Make BitSet store any Int #25029

merged 9 commits into from
Dec 19, 2017

Conversation

rfourquet
Copy link
Member

@rfourquet rfourquet commented Dec 11, 2017

BitSet was renamed to IntSet a couple months ago (#24282) because it couldn't store negative integers. This PR implements this feature, and renames BitSet back to IntSet.

It is implemented as a kind of offset array, so it's purpose is still to store a dense set of integers, but those don't have to be grouped around 0. For example, IntSet(typemin(Int):typemin(Int)+100) is perfectly fine.

One notable change: using a BitVector as the underlying implemention started to get in the way: its API was insufficient, and it was tedious to play with and maintain the code at 3 layers simultaneously (IntSet, BitSet, Vector{UInt64}). So the bits field of IntSet is now directly a Vector{UInt64}, with calls to low-level bitarray.jl functions.

IntSet being used in inference.jl, I knew I had to keep good performance, so there is a small series a commits which each implement some improvements, the end result looks very satisfying e.g.

julia> @btime BitSet(1:1000); # master
  2.990 μs (6 allocations: 400 bytes)

julia> @btime IntSet(1:1000); # PR
  2.370 μs (5 allocations: 384 bytes)

julia> A = BitSet(1:1000); @btime collect($A); # master
  6.165 μs (1 allocation: 7.94 KiB)

julia> A = IntSet(1:1000); @btime collect($A); # PR
  6.452 μs (1 allocation: 7.94 KiB)

julia> B = BitSet(500:2000); @btime union($A, $B); # master
  222.936 ns (6 allocations: 656 bytes)

julia> B = IntSet(500:2000); @btime union($A, $B); # PR
  185.000 ns (5 allocations: 640 bytes)

julia> @btime $A == $B; # master
  5.209 ns (0 allocations: 0 bytes)

julia> @btime $A == $B; # PR
  4.869 ns (0 allocations: 0 bytes)

The last example shows that there are some non-negligible regressions too, I didn't investigate thoroughly the performance though (BTW, there is a BaseBenchmarks.jl PR to be soon merged which tracks some of the perfs), and more improvements can be obtained. Obviously, the old BitSet will always be more performant than this IntSet, if given the same set of optimizations, but I would say by no more than roughly 15% in general. I think it's good enough for what this gives us.

PS: Because GitHub doesn't seem to handle nicely file renames, I temporarily reverted the last commit which was deprecating BitSet -> IntSet, to make reviewing easier.

@rfourquet rfourquet added collections Data structures holding multiple items, e.g. sets deprecation This change introduces or involves a deprecation labels Dec 11, 2017
base/bitarray.jl Outdated
start > length(B) && return 0

Bc = B.chunks
bitcount(B::BitArray) = bitcount(B.chunks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count(B::BitArray)?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even sum although I think that's the same thing when the eltype is Bool.

base/bitset.jl Outdated
const Bits = Vector{UInt64}
const Chk0 = zero(UInt64)
const NoOffset = -one(Int) << 60
# + NoOffset must be big enough to stay < 0 when add with any offset.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small enough

base/bitset.jl Outdated
bits::BitVector
BitSet() = new(sizehint!(falses(0), 256))
const Bits = Vector{UInt64}
const Chk0 = zero(UInt64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this at all, you can just use 0 everywhere. The conversion will be implicit and it should be performed at compilation time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not really for performance, more for readability!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I would still bikeshed the name though, to me it looks like a type when capitalized...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok done, renamed to full caps, also for NO_OFFSET.

base/bitset.jl Outdated
sizehint!(s::BitSet, n::Integer) = sizehint!(s.bits, n)

# given an integer i, return the chunk which stores it
chk_indice(i::Int) = i >> 6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the next are the same as _div64 and _mod64 from "bitarray.jl". I know those names are not great, but it seems slightly silly to have two sets of nearly identical definitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah right! They are like that because they evolved: first they did what get_chunks_id do, then they were updated to work for all Int range by promoting to Int128, and ended up this way; but chk_indice must use the signed shift operator, so can _div64 can be changed to use >> ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, no part of the code which uses _div64 is ever supposed to get a negative integer, and it would give meaningless results anyway if it did, so >> should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, done

Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems cool but I'm a little confused. Is the premise that this generalizes the functionality of BitSet so that it supports both positive and negative integers and therefore renames it to IntSet, but you've left off the commit changing the name?

base/bitarray.jl Outdated
start > length(B) && return 0

Bc = B.chunks
bitcount(B::BitArray) = bitcount(B.chunks)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even sum although I think that's the same thing when the eltype is Bool.

@rfourquet
Copy link
Member Author

Is the premise that this generalizes the functionality of BitSet so that it supports both positive and negative integers and therefore renames it to IntSet

Yes: a set data-structures which can store any Int is better named IntSet I think. I have no one around to do a poll now, but I would bet IntSet is much more self-descriptive than BitSet for a programmer who starts using Julia. Plus, by reverting the deprecation to BitSet, users porting from 0.6 to 0.7 will not even be bothered by a deprecation :)

but you've left off the commit changing the name?

Yes, because with this commit, github shows a whole deleted file "bitset.jl" and a whole new file "intset.jl", so it's hard to track the changes between the two. I could as well put the renaming to another PR if you think it's better.

@rfourquet
Copy link
Member Author

I removed the renaming part fow now as it's more work when rebasing. But would there be a consensus for this rename to IntSet, or should I label this for triage?

I addressed the comments and updated the == code so that it doesn't show the regression anymore.

@rfourquet
Copy link
Member Author

rfourquet commented Dec 12, 2017

There are errors on 32 bits builds, I tried to revert the _div64 change in bitarray.jl to see if it came from there, but was not the problem. I will try to find the cause, but without a 32 bit myself, I fear it will be difficult.
EDIT: got it, it was just me assuming Int == Int64 😳

@StefanKarpinski
Copy link
Sponsor Member

Yes, with this change, IntSet is a fitting name. Nice change!

@Sacha0
Copy link
Member

Sacha0 commented Dec 12, 2017

Yes, with this change, IntSet is a fitting name. Nice change!

I tend to agree, though under the hood this data structure is more like a PackedIntSet or so :).

@rfourquet rfourquet added this to the 1.0 milestone Dec 13, 2017
@rfourquet
Copy link
Member Author

I rebased and just add some sort of fuzzy testing comparing the result of operations on Set and on BitSet to check for consistency (which allowed me to catch a couple bugs in this PR). It would be ready to go on my side, except that the deprecation back to IntSet is not yet included, I will do when/if I get approval for merging.

@rfourquet
Copy link
Member Author

I also ran locally the not-merged-yet benchmark for collections, and it shows only performance improvements, except for something which compares apples and oranges (i.e. union(BitSet([2]), BitSet([2^18])), where the BitSet are precomputed: in this PR, BitSet([2^18]) allocates one limb around it's unique element, whereas before it allocated all the space needed to store positive integers till 2^18, so now the union forces allocation of new limbs). The other small performance regression is concerning repeated push! after calling sizehint!, and I am not sure why doh, I was forwarding sizehint! to the bits::Vector{UInt64} without dividing the argument by 64!

@StefanKarpinski
Copy link
Sponsor Member

Windows failure appears to be an LAPACK precision issue. Seems unrelated? Travis failures were due to network connection, so I restarted them.

@Sacha0
Copy link
Member

Sacha0 commented Dec 14, 2017

Pondering the name a bit more might be worthwhile: This data structure is not an efficient representation of arbitrary sets of Ints; rather, it is an efficient representation of sets of Ints densely packed over some range. (This distinction tripped me up some years ago when writing a graph algorithm.) Something like DenseIntSet might be more appropriate? Best!

@rfourquet
Copy link
Member Author

DenseIntSet is more descriptive, but 3 words in the name starts to smell pedantic to me. As Set{Int} is already there for non-dense ones, IntSet sounds fine to me, with a docstring. For arrays, DenseArray is abstract, but it would be strange and unfriendly if this was the name of the concrete type.

@Sacha0
Copy link
Member

Sacha0 commented Dec 14, 2017

DenseIntSet is more descriptive, and that more descriptive name is helpfully clarifying to me.

ftfty :) As mentioned above, the name IntSet misled me in practice.

For arrays, DenseArray is abstract, but it would be strange and unfriendly if this was the name of the concrete type.

This seems orthogonal?

@rfourquet
Copy link
Member Author

This seems orthogonal?

I was just saying that Array is dense, but we don't call it DenseArray (although it has a supertype with that name), similarly we don't necessarily have to call a non-sparse set with Dense in the name. Hard to find the best name, but IntSet at least has the objective advantage of allowing 0.6 users to not see a deprecation ;-)

@Sacha0
Copy link
Member

Sacha0 commented Dec 14, 2017

I was just saying that Array is dense, but we don't call it DenseArray (although it has a supertype with that name), similarly we don't necessarily have to call a non-sparse set with Dense in the name. Hard to find the best name, but IntSet at least has the objective advantage of allowing 0.6 users to not see a deprecation ;-)

I see :). Thanks for the clarification!

To clarify on my end: When I see the name Set{T}, I assume that I can store N arbitrary objects of type T in that collection using roughly O(N) storage. So when I see the name IntSet, I think: "Lovely! This is a fast/efficient Set specialized for Ints. I can efficiently store and work with arbitrary sets of Ints in this data structure." I then proceed to use IntSets for sets of integers containing, e.g., 1, 3728, and 10^7. At some point I observe that storage is atrocious, am thoroughly confused, and burn some time figuring out that IntSet([1, 3728, 10^7]) requires O(10^7) storage because IntSet is a representation of a densely packed set of integers. This was my experience in practice :). (Edit: And that experience is what motivated me to see the BitSet rename through.)

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Dec 14, 2017

+1 to IntSet.

@StefanKarpinski
Copy link
Sponsor Member

On the triage call, @Sacha0 has made a compelling argument for BitSet mostly reiterating the above – the key issue is that this is not an entirely general purpose data structure and the implementation does limit its use cases quite a bit. I would also be fine with leaving the name as BitSet and embracing the new functionality.

@StefanKarpinski
Copy link
Sponsor Member

Triage decision: yes to this functionality and change, but keep the name BitSet.

@rfourquet rfourquet removed the deprecation This change introduces or involves a deprecation label Dec 16, 2017
@rfourquet
Copy link
Member Author

Would Nanosoldier be on his feet again?
@nanosoldier runbenchmarks("collection", vs=":master")

@StefanKarpinski
Copy link
Sponsor Member

  • Travis Linux runs passed
  • Windows 64-bit is Bunch-Kaufman failures
  • macOS failure is some spawn test, seems unrelated?

@rfourquet
Copy link
Member Author

It looked unrelated to me too. I could also restart the build just to double check (BitSet being used in such a crucial part as inference makes me a little nervous!) I also just wished to have had Nanosoldier run, to prove the nice little speed-ups ;-)

@StefanKarpinski
Copy link
Sponsor Member

I restarted it, but it's probably overkill.

@rfourquet
Copy link
Member Author

@nanosoldier runbenchmarks("collection", vs=":master")

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@ararslan
Copy link
Member

Sorry, so many things have changed at once in 0.7 that I haven't had a chance to keep BaseBenchmarks working. I'll try to get it back up within the next couple of days, but any help updating BaseBenchmarks would be very welcome!

The BitVector had already started to show its limits for
BitSet needs. Juggling at 3 layers with BitSet, BitVector,
and Vector{UInt64} was becoming confusing at times.
Updating this code to handle negative integers only
increased the confusion.
Some modest performance improvements (~20%) could be achieved
in the process for some functions.
@rfourquet
Copy link
Member Author

@nanosoldier runbenchmarks("collection", vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Add an offset field to BitSet acting as an infimum of stored values,
which can be adjusted on-the-fly. BitSet then becomes able
to store negative integers, and doesn't have to have elements
centered around 0.
The 1-based indexing was forcing to promote Int to Int128
in chk_indice & chk_offset, to allow correct computation
with extreme values (e.g. BitSet([typemin(Int)])).
1-based indexing was a left-over from the former BitVector
as underlying implementation.
Switching to 0-based indexing allows both a more-direct mapping
between the model and the implementation, and better
performance.
When a BitSet is empty, its offset must be initialized.
But checking for emptyness in each invocation of _setint!
was costly, so we put the check in a branch which is called
less often.
_setint! was delegating to the bitarray code, but then
some work was being redone in get_chunks_id. So we
split this functionality into lower level functions.
* isempty: using _check0 instead of all makes it 35% faster
* ==: checking first non-overlapping parts is more likely
  to be faster, as the the lower and upper parts of the bits
  field are unlikely to be zero (at least for a freshly
  created BitSet)
@rfourquet
Copy link
Member Author

For benchmarks results see my comment above for the expected "regression" for symdiff! and union!. For sizehint!, I again got it wrong by returning the underlying Vector{UInt64} instead of the bitset itself 😊 Corrected now, test added, and the perf doesn't show regression anymore, as should be expected, thanks nanosoldier for helping in finding the last bug!

@rfourquet rfourquet merged commit c16c0cb into master Dec 19, 2017
@rfourquet rfourquet deleted the rf/intset-offset branch December 19, 2017 11:51
@Sacha0
Copy link
Member

Sacha0 commented Dec 19, 2017

Thanks again for this beautiful work @rfourquet! :)

@rfourquet
Copy link
Member Author

Although I had started this work months ago, it's because I didn't like the new name that I felt the urge to finish it before feature frease, to give an argument to change it back... here we are now with the same name, but I guess you are responsible for we having this feature now :)

@JeffBezanson
Copy link
Sponsor Member

Some of these commits should have been squashed.

@rfourquet
Copy link
Member Author

Some of these commits should have been squashed.

These are all pretty self contained and "clean" (not of the "fix typo" or "wip" kind), so I though it would be OK, but sorry about that, will try to squash more next time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants