-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor IntSets #20456
Refactor IntSets #20456
Conversation
* Complete deprecation of stored zeros; IntSets now only support integers in the range `1:typemax(Int)` * Complete deprecation of `complement`; removes all support for inverted IntSets * Refactor internals to rely on a BitVector, allowing the use of highly optimized `map` methods. `IntSet` is now immutable. This significantly improves performance across varying [densities](http://imgur.com/a/uqv8A) and [sizes](http://imgur.com/a/iEgcr). These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code [available here](https://github.com/mbauman/IntSets.jl/tree/b50a7c97abbe9786e33221f723e107e266f31fe4/test). * Add more tests and organize into testsets. * Improve hashing; `hash(IntSet([1]))` is now distinct from `hash(IntSet([65]))` This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to `IndexSet` if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping any `AbstractVector{Bool}` and base the supported `Int`s on those indices. Very few methods depend upon BitArray internals.
base/intset.jl
Outdated
IntSet() = new(zeros(UInt32,256>>>5), 256, false) | ||
immutable IntSet <: AbstractSet{Int} | ||
bits::BitVector | ||
IntSet() = new(fill!(BitVector(256), false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to just do new(falses(256))
, as falses
returns a BitVector
.
base/intset.jl
Outdated
# An internal function for setting the inclusion bit for a given integer n >= 0 | ||
@inline function _setint!(s::IntSet, idx::Integer, b::Bool) | ||
if idx > length(s.bits) | ||
!b && return s # setting a bit to zero outside the set's bits is a no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b || return s
avoids the negation, though I'm sure that's so insignificant that it isn't even worth mentioning
base/intset.jl
Outdated
|
||
# An internal function that takes a pure function `f` and maps across two BitArrays | ||
# allowing the lengths to be different and altering b1 with the result | ||
function _matched_map!{F}(f::F, b1::BitArray, b2::BitArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The F
parameter shouldn't be necessary I would think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want these functions to inline and specialize; the type parameter ensures that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does? I didn't think there would be a difference here. Is there anywhere I can read about this in the docs?
Sorry to butt in, this was just very surprising to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. There is some sort of difference here that I thought I had measured before… but in this case we're just punting to map
so it's totally overkill. I'll ditch the specialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I thought the type parameters in cases such as this used to make a difference but at some point something changed so that they didn't matter, but I don't have anything concrete to point to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iamed2 AFAIK the rule is that method specialization on function arguments happen only when 1) they are called from the body of the method, or 2) a type parameter is explicitly used. I'm not sure whether that's mentioned in the manual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan Thanks. It makes sense to leave certain compiler implementation details out of the manual if they're expected to change, but if this is a concern for performance it should maybe get a mention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, it might already be in the manual, so better check that first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ref. #19137 re. specialization on function arguments. Best!
|
||
""" | ||
symdiff!(s1, s2) | ||
length(s::IntSet) = sum(s.bits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
countnz
might be more efficient for BitVector
s than sum
, but I'm not entirely sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exactly the same, sum
is just defined as countnz
.
base/intset.jl
Outdated
end | ||
while i > 0 | ||
h = hash(bc[i], h) | ||
i -= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indented with a single tab but it should be a multiple of 4 spaces
data_in = (1,5,100) | ||
s = IntSet(data_in) | ||
data_out = collect(s) | ||
@test all(map(d->in(d,data_out), data_in)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all
takes a function, so you can do all(d->in(d, data_out), data_in)
base/inference.jl
Outdated
min_pc = next(W, Int64(pc) + 1)[1] | ||
if min_pc >= W.limit | ||
min_pc = next(W, Int64(pc))[2] | ||
if done(W, min_pc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change related to this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately yes. This code depends upon the internal iteration state.
base/intset.jl
Outdated
IntSet() = new(zeros(UInt32,256>>>5), 256, false) | ||
immutable IntSet <: AbstractSet{Int} | ||
bits::BitVector | ||
IntSet() = new(fill!(BitVector(256), false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
falses(256)
?
base/intset.jl
Outdated
end | ||
print(io, "])") | ||
copy(s1::IntSet) = copy!(IntSet(), s1) | ||
function copy!(to::IntSet, from::IntSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throughout Base, the arguments are called dest
and src
. It may be a good time to use them in IntSet too.
base/intset.jl
Outdated
end | ||
unsafe_setindex!(s.bits, b, idx) # Use @inbounds once available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inbounds
could be used now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much nicer!
base/intset.jl
Outdated
@inline function _resize0!(b::BitVector, newlen::Integer) | ||
len = length(b) | ||
resize!(b, newlen) | ||
len < newlen && unsafe_setindex!(b, false, len+1:newlen) # resize! gives dirty memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inbounds
could be used here too.
base/intset.jl
Outdated
n = Int64(i) | ||
@inline function in(n::Integer, s::IntSet) | ||
if 1 <= n <= length(s.bits) | ||
unsafe_getindex(s.bits, n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inbounds return s.bits[n]
?
base/intset.jl
Outdated
end | ||
while i > 0 | ||
h = hash(bc[i], h) | ||
i -= 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a tab instead of spaces here.
base/intset.jl
Outdated
end | ||
eltype(s::IntSet) = Int | ||
sizehint!(s::IntSet, n::Integer) = (_resize0!(s.bits, n); s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this going to resize the set no matter what? I think it needs to resize to isempty(s) ? n : max(n, last(s))
, we don't want to throw away elements larger than n
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch. Fixed and tested.
Thank you @carlobaldassi and @ararslan. I've pushed a change that addresses your comments and hopefully makes 32 bit CI a bit happier. Unfortunately, it does make this an obliquely breaking PR — the eltype is now |
else | ||
n = Int64(ccall(:bitvector_next, UInt64, (Ptr{UInt32}, UInt64, UInt64), s.bits, i, s.limit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most of the C functions used in here can be deleted from support/bitvector.* now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping so myself, but I think it's used by flisp in the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the functions are used by flisp, but most aren't. I think everything after bitvector_get in bitvector.c can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nice, thanks!
Following #20456 this is a better and more accurate name.
IntSet
is nowInt
. IntSets now only support integers in the range1:typemax(Int)
. This is technically a breaking change on 32-bit systems, where it was previously possible to store someInt64
s larger thantypemax(Int32)
. Note that all documentation currently says thatIntSet
storesInt
s, and 32 bit systems currently fail pretty spectacularly for integers larger than 2^36 (a 2GB data structure). So this breakage is pretty oblique.complement
; removes all support for inverted IntSetsmap
methods.IntSet
is now immutable. This significantly improves performance across varying densities and sizes. These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code available here.hash(IntSet([1]))
is now distinct fromhash(IntSet([65]))
This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to
IndexSet
if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping anyAbstractVector{Bool}
and base the supportedInt
s on those indices. Very few methods depend upon BitArray internals.I was surprised how easy it was to revive my IntSets package. The diffstat is deceiving: