Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor IntSets #20456

Merged
merged 4 commits into from
Feb 7, 2017
Merged

Refactor IntSets #20456

merged 4 commits into from
Feb 7, 2017

Conversation

mbauman
Copy link
Member

@mbauman mbauman commented Feb 4, 2017

  • The element type of IntSet is now Int. IntSets now only support integers in the range 1:typemax(Int). This is technically a breaking change on 32-bit systems, where it was previously possible to store some Int64s larger than typemax(Int32). Note that all documentation currently says that IntSet stores Ints, and 32 bit systems currently fail pretty spectacularly for integers larger than 2^36 (a 2GB data structure). So this breakage is pretty oblique.
  • Complete deprecation of stored zeros.
  • Complete deprecation of complement; removes all support for inverted IntSets
  • Refactor internals to rely on a BitVector, allowing the use of highly optimized map methods. IntSet is now immutable. This significantly improves performance across varying densities and sizes. These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code available here.
  • Add more tests and organize into testsets.
  • Improve hashing; hash(IntSet([1])) is now distinct from hash(IntSet([65]))

This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to IndexSet if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping any AbstractVector{Bool} and base the supported Ints on those indices. Very few methods depend upon BitArray internals.

I was surprised how easy it was to revive my IntSets package. The diffstat is deceiving:

$ git diff base | diffstat -s
 5 files changed, 183 insertions(+), 291 deletions(-)
$ git diff test | diffstat -s
 1 file changed, 268 insertions(+), 127 deletions(-)

* Complete deprecation of stored zeros; IntSets now only support integers in the range `1:typemax(Int)`
* Complete deprecation of `complement`; removes all support for inverted IntSets
* Refactor internals to rely on a BitVector, allowing the use of highly optimized `map` methods. `IntSet` is now immutable. This significantly improves performance across varying [densities](http://imgur.com/a/uqv8A) and [sizes](http://imgur.com/a/iEgcr). These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code [available here](https://github.com/mbauman/IntSets.jl/tree/b50a7c97abbe9786e33221f723e107e266f31fe4/test).
* Add more tests and organize into testsets.
* Improve hashing; `hash(IntSet([1]))` is now distinct from `hash(IntSet([65]))`

This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to `IndexSet` if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping any `AbstractVector{Bool}` and base the supported `Int`s on those indices. Very few methods depend upon BitArray internals.
base/intset.jl Outdated
IntSet() = new(zeros(UInt32,256>>>5), 256, false)
immutable IntSet <: AbstractSet{Int}
bits::BitVector
IntSet() = new(fill!(BitVector(256), false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to just do new(falses(256)), as falses returns a BitVector.

base/intset.jl Outdated
# An internal function for setting the inclusion bit for a given integer n >= 0
@inline function _setint!(s::IntSet, idx::Integer, b::Bool)
if idx > length(s.bits)
!b && return s # setting a bit to zero outside the set's bits is a no-op
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b || return s avoids the negation, though I'm sure that's so insignificant that it isn't even worth mentioning

base/intset.jl Outdated

# An internal function that takes a pure function `f` and maps across two BitArrays
# allowing the lengths to be different and altering b1 with the result
function _matched_map!{F}(f::F, b1::BitArray, b2::BitArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The F parameter shouldn't be necessary I would think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want these functions to inline and specialize; the type parameter ensures that happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does? I didn't think there would be a difference here. Is there anywhere I can read about this in the docs?

Sorry to butt in, this was just very surprising to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. There is some sort of difference here that I thought I had measured before… but in this case we're just punting to map so it's totally overkill. I'll ditch the specialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought the type parameters in cases such as this used to make a difference but at some point something changed so that they didn't matter, but I don't have anything concrete to point to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iamed2 AFAIK the rule is that method specialization on function arguments happen only when 1) they are called from the body of the method, or 2) a type parameter is explicitly used. I'm not sure whether that's mentioned in the manual.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan Thanks. It makes sense to leave certain compiler implementation details out of the manual if they're expected to change, but if this is a concern for performance it should maybe get a mention?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, it might already be in the manual, so better check that first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ref. #19137 re. specialization on function arguments. Best!


"""
symdiff!(s1, s2)
length(s::IntSet) = sum(s.bits)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

countnz might be more efficient for BitVectors than sum, but I'm not entirely sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exactly the same, sum is just defined as countnz.

base/intset.jl Outdated
end
while i > 0
h = hash(bc[i], h)
i -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indented with a single tab but it should be a multiple of 4 spaces

data_in = (1,5,100)
s = IntSet(data_in)
data_out = collect(s)
@test all(map(d->in(d,data_out), data_in))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all takes a function, so you can do all(d->in(d, data_out), data_in)

min_pc = next(W, Int64(pc) + 1)[1]
if min_pc >= W.limit
min_pc = next(W, Int64(pc))[2]
if done(W, min_pc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately yes. This code depends upon the internal iteration state.

base/intset.jl Outdated
IntSet() = new(zeros(UInt32,256>>>5), 256, false)
immutable IntSet <: AbstractSet{Int}
bits::BitVector
IntSet() = new(fill!(BitVector(256), false))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

falses(256)?

base/intset.jl Outdated
end
print(io, "])")
copy(s1::IntSet) = copy!(IntSet(), s1)
function copy!(to::IntSet, from::IntSet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout Base, the arguments are called dest and src. It may be a good time to use them in IntSet too.

base/intset.jl Outdated
end
unsafe_setindex!(s.bits, b, idx) # Use @inbounds once available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inbounds could be used now, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer!

base/intset.jl Outdated
@inline function _resize0!(b::BitVector, newlen::Integer)
len = length(b)
resize!(b, newlen)
len < newlen && unsafe_setindex!(b, false, len+1:newlen) # resize! gives dirty memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inbounds could be used here too.

base/intset.jl Outdated
n = Int64(i)
@inline function in(n::Integer, s::IntSet)
if 1 <= n <= length(s.bits)
unsafe_getindex(s.bits, n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inbounds return s.bits[n]?

base/intset.jl Outdated
end
while i > 0
h = hash(bc[i], h)
i -= 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a tab instead of spaces here.

base/intset.jl Outdated
end
eltype(s::IntSet) = Int
sizehint!(s::IntSet, n::Integer) = (_resize0!(s.bits, n); s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to resize the set no matter what? I think it needs to resize to isempty(s) ? n : max(n, last(s)), we don't want to throw away elements larger than n.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch. Fixed and tested.

@mbauman
Copy link
Member Author

mbauman commented Feb 5, 2017

Thank you @carlobaldassi and @ararslan. I've pushed a change that addresses your comments and hopefully makes 32 bit CI a bit happier. Unfortunately, it does make this an obliquely breaking PR — the eltype is now Int32 on 32 bit systems, meaning that you can no longer store values bigger than typemax(Int32).

@mbauman mbauman added breaking This change will break code collections Data structures holding multiple items, e.g. sets labels Feb 5, 2017
else
n = Int64(ccall(:bitvector_next, UInt64, (Ptr{UInt32}, UInt64, UInt64), s.bits, i, s.limit))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the C functions used in here can be deleted from support/bitvector.* now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping so myself, but I think it's used by flisp in the parser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the functions are used by flisp, but most aren't. I think everything after bitvector_get in bitvector.c can be deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice, thanks!

@JeffBezanson JeffBezanson added this to the 0.6.0 milestone Feb 5, 2017
@JeffBezanson JeffBezanson merged commit 40831ac into master Feb 7, 2017
@mbauman mbauman deleted the mb/intset-refactor branch February 7, 2017 16:34
mbauman added a commit that referenced this pull request Feb 8, 2017
Following #20456 this is a better and more accurate name.
@mbauman mbauman mentioned this pull request Mar 3, 2017
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants