
Speed up scalar BitArray indexing by ~25% #11403

Closed

Conversation

@mbauman (Member) commented May 22, 2015


This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, `unsafe_bit(get|set)index`, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on the whole BitArray itself, which enables an optimization to avoid that root (since the chunks array is only accessed as an argument to `arrayref`, which is special-cased).

The ~25% performance gain is for unsafe_getindex; the difference isn't quite as big for getindex (only ~10%) since there's still a GC root for the BoundsError. That can also be avoided, but I'd rather make that change more systematically (as `checkbounds`) with #10525 or a subset thereof.
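To make the shape of the change concrete, here is a minimal sketch; this is illustrative only, not the exact Base definitions, and the `_old`/`_new` suffixes are invented for the side-by-side:

```julia
# Before: the helper took the chunks array itself. Because BitArray is
# mutable, keeping `Bc` alive requires a GC root in the caller.
@inline function unsafe_bitgetindex_old(Bc::Vector{UInt64}, i::Int)
    i1, i2 = divrem(i - 1, 64)          # chunk offset and bit position
    return (Bc[i1 + 1] >> i2) & 1 == 1
end

# After: the helper takes the whole BitArray, and `B.chunks` only ever
# appears as an argument to the (special-cased) arrayref, so no root.
@inline function unsafe_bitgetindex_new(B::BitArray, i::Int)
    i1, i2 = divrem(i - 1, 64)
    return (B.chunks[i1 + 1] >> i2) & 1 == 1
end
```

The bit arithmetic is the same either way; what matters is the argument type, which changes how codegen has to root the chunks array.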
@carlobaldassi (Member)

Wow. When I first wrote those functions, I remember checking that extracting the chunks array before entering the loops was actually faster. This is good news.

@carlobaldassi (Member)

Does this fill the gap for the `in` function in your IntSet refactoring (#10065)?

@mbauman (Member, Author) commented May 22, 2015

I haven't tested it yet. I imagine it'll help, but I don't think it will get the whole way there because it still has the same issue that this PR is working around (#9974), just one level up. So there's still a GC frame in the way.

I might try making BitArrays or IntSets immutable with `Ref` fields…

(Thanks for the review, I'll make those changes. 👍).

@simonster (Member)

Maybe I'm not thinking this through properly, but shouldn't the overhead for the GC root be constant with respect to the number of elements in the array? Whereas if we don't extract chunks first, I'd think we need a pointer load for each element in the array if memory is modified inside the loop (#9755).

@mbauman (Member, Author) commented May 22, 2015

That's a very good point, but that doesn't seem to be the case. I'm not sure why it's different from the Sparse idiom, and I don't read LLVM well enough to see where the load is happening. See, e.g., `@code_llvm Base.unsafe_getindex(falses(10), [1,2,3])`.

I was measuring this with a pared-down version of the array indexing perf suite from #10525:
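(In these perf-suite dumps, the four numbers after each benchmark name are, presumably, the minimum, maximum, mean, and standard deviation of the measured times.)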

Before:

```
julia,sumeltIs BitArray,2.289962,4.516089,2.510938,0.475899
julia,sumeachIs BitArray,2.392486,4.950358,2.636985,0.518681
julia,sumlinearIs BitArray,2.392490,4.723534,2.651031,0.516554
julia,sumrangeIs BitArray,55.735734,68.236330,61.076764,3.231782
julia,sumlogicalIs BitArray,76.457638,89.633972,82.705620,3.218650
julia,sumvectorIs BitArray,52.035703,64.384697,57.298191,3.329376
julia,sumeltIb BitArray,20.508571,29.376177,23.304242,2.412258
julia,sumeachIb BitArray,20.551273,28.811882,24.374394,2.373606
julia,sumlinearIb BitArray,20.658381,28.198992,24.241191,1.996451
julia,sumrangeIb BitArray,7.435534,13.135764,8.971654,1.352756
julia,sumlogicalIb BitArray,44.855952,59.534208,51.114627,3.604198
julia,sumvectorIb BitArray,4.995062,9.224450,5.736706,0.989800
```

After:

```
julia,sumeltIs BitArray,2.119090,4.205546,2.326048,0.446215
julia,sumeachIs BitArray,1.798431,3.784359,1.988818,0.391689
julia,sumlinearIs BitArray,1.798433,3.726707,2.003796,0.412508
julia,sumrangeIs BitArray,54.409144,66.483328,59.816160,3.582779
julia,sumlogicalIs BitArray,73.263354,88.265986,81.087132,3.389352
julia,sumvectorIs BitArray,51.502689,61.310434,56.091181,2.488938
julia,sumeltIb BitArray,20.508759,26.162578,22.219579,1.974766
julia,sumeachIb BitArray,18.990501,25.145888,21.125213,2.032431
julia,sumlinearIb BitArray,18.989782,27.432469,21.050434,2.136499
julia,sumrangeIb BitArray,7.405507,13.296387,8.770047,1.395887
julia,sumlogicalIb BitArray,48.416468,58.620361,52.583693,2.959972
julia,sumvectorIb BitArray,4.891553,9.179898,5.628998,0.960776
```

@mbauman (Member, Author) commented May 22, 2015

~~Ah, with very large arrays (3000x5000) I do see a slight (~10%) regression in logical indexing. In the tests above the array sizes are 3x5 (Is) and 300x500 (Ib).~~ Edit: I think this was a fluke.

I know nothing about this, but could this be TBAA (type-based alias analysis) doing its job (I'm on LLVM 3.3)? That would explain why the performance here has changed since @carlobaldassi ran his tests.

@simonster (Member)

A simple sum doesn't have stores in the inner loop, so LLVM ought to hoist the loads there regardless. I wonder how much things change if you sum into an element of an array rather than a variable.

@mbauman (Member, Author) commented May 22, 2015

Sure, but what about the inner loops of the `getindex` function itself? For the non-scalar indexing cases, it's storing into a second BitArray.
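For concreteness, a rough sketch of that case (simplified, hypothetical code, not the actual Base method):

```julia
# Simplified shape of non-scalar BitArray getindex: each iteration both
# reads from the source and stores into the freshly allocated output, so
# the source's chunks pointer may be reloaded every iteration unless the
# compiler can prove the store doesn't alias it.
function getindex_sketch(B::BitArray, I::AbstractVector{Int})
    out = falses(length(I))
    for (k, i) in enumerate(I)
        out[k] = B[i]          # store into a second BitArray
    end
    return out
end
```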

@simonster (Member)

I benchmarked this:

```julia
function f(x::BitVector, y::Vector{Int})
    chunks = x.chunks
    for i = 1:length(x)
        # accumulate into the single array element (y has length 1)
        @inbounds y[1] += Base.unsafe_bitgetindex(chunks, i)
    end
end

a = falses(1000);
b = [1];
for i = 1:1; f(a, b); end  # warm-up call to compile f
@time for i = 1:10000; f(a, b); end
a = falses(10000);
b = [1];
@time for i = 1:10000; f(a, b); end
a = falses(100000);
b = [1];
@time for i = 1:10000; f(a, b); end
```

On eb5da26 I get:

```
   1.133 seconds
   1.142 seconds
   1.141 seconds
```

and with this PR (where I obviously don't extract chunks in `f`):

```
   1.414 seconds
   1.401 seconds
   1.452 seconds
```

With this PR, there is no GC root, but there are two extra loads in the inner loop, since the chunks field and array pointer need to be reloaded on each iteration.

If I accumulate into a variable instead of the array, then I don't actually see a substantial difference for this benchmark. The inner loop IR is the same. I'd expect a performance advantage for this PR for very small cases, but it doesn't seem to be larger than the variability in my benchmarking for 1000 elements.

@mbauman (Member, Author) commented May 23, 2015

Thanks so much for doing those tests, Simon. This clearly isn't the answer.

@jebej (Contributor) commented Aug 23, 2019

@mbauman I was wondering why BitArrays were mutable and found this PR. Did you end up experimenting with making them immutable?

It seems that only the `len` field would need to be a reference for everything to work.

@mbauman (Member, Author) commented Aug 23, 2019

I did a bit more experimentation in #11430, but I've not looked at it since. As I recall, the major performance hit was because using a `Ref` makes that field a separate object that is separately allocated and potentially far away in memory. This cost is paid multiple times: once at construction, and then by way of cache misses every time you access it.
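For illustration, a hypothetical sketch of the immutable-with-`Ref` layout under discussion (the type and helper names are invented; this is not how Base defines BitArray):

```julia
# Hypothetical immutable variant: the struct itself can't be mutated, so
# the resizable `len` lives behind a Ref. That Ref is a separate heap
# object, allocated at construction and chased through a pointer on every
# access, which is the cache-miss cost described above.
struct RefLenBitVector
    chunks::Vector{UInt64}
    len::Base.RefValue{Int}
end

RefLenBitVector(n::Integer) =
    RefLenBitVector(zeros(UInt64, cld(n, 64)), Ref(Int(n)))

Base.length(B::RefLenBitVector) = B.len[]               # extra pointer chase
setlength!(B::RefLenBitVector, n::Int) = (B.len[] = n)  # mutation via the Ref
```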
