Speed up scalar BitArray indexing by ~25% #11403
Conversation
This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, `unsafe_bit(get|set)index`, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on the whole BitArray itself, which enables an optimization to avoid that root (since the chunks array is only accessed as an argument to `arrayref`, which is special-cased). The ~25% performance gain is for `unsafe_getindex`; the difference isn't quite as big for `getindex` (only ~10%) since there's still a GC root for the `BoundsError`. That can also be avoided, but I'd rather make that change more systematically (as `checkbounds`) with #10525 or a subset thereof.
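For illustration, here is a rough sketch of the shape of that change (simplified and hypothetical; the real Base helpers handle multidimensional indexing and use internal utilities, so treat the names and signatures below as assumptions):

```julia
# Hypothetical, simplified sketch of the change; not the actual Base code.

# Before: the helper takes the chunks Vector{UInt64} directly. The caller has to
# pull `B.chunks` out of the mutable BitArray first, and that extracted reference
# needs a GC root.
@inline function unsafe_bitgetindex_old(Bc::Vector{UInt64}, i::Int)
    i1, i2 = divrem(i - 1, 64)
    @inbounds return (Bc[i1 + 1] >> i2) & UInt64(1) == UInt64(1)
end

# After: the helper takes the whole BitArray, and `B.chunks` only ever appears as
# the argument of the indexing call (arrayref), which is special-cased so no extra
# GC root has to be emitted.
@inline function unsafe_bitgetindex_new(B::BitArray, i::Int)
    i1, i2 = divrem(i - 1, 64)
    @inbounds return (B.chunks[i1 + 1] >> i2) & UInt64(1) == UInt64(1)
end
```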
Wow. When I first wrote those functions, I remember I checked that extracting the chunks array before entering the loops was actually better. This is good news.
Does this fill the gap for the …
I haven't tested it yet. I imagine it'll help, but I don't think it will get the whole way there because it still has the same issue that this PR is working around (#9974), just one level up. So there's still a GC frame in the way. I might try making BitArrays or IntSets immutable with … (Thanks for the review, I'll make those changes. 👍)
Maybe I'm not thinking this through properly, but shouldn't the overhead for the GC root be constant WRT the number of elements in the array? Whereas if we don't extract …
That's a very good point, but that doesn't seem to be the case. I'm not sure why it's different from the Sparse idiom, and I don't read LLVM well enough to see where the load is happening. See, e.g., … I was measuring this with a pared-down version of the array indexing perf suite from #10525.

Before:

…

After:

…
~~Ah, with very large arrays (3000x5000) I do see a slight (~10%) regression in logical indexing. In the tests above the array sizes are 3x5 (Is) and 300x500 (Ib).~~ Edit: I think this was a fluke. I know nothing about this, but could this be TBAA doing its job (I'm on LLVM 3.3)? That would explain why the performance here has changed since @carlobaldassi did his tests.
A simple …
Sure, but what about the inner loops of the …
I benchmarked this:

```julia
function f(x::BitVector, y::Vector{Int})
    chunks = x.chunks
    for i = 1:length(x)
        @inbounds y[1] += Base.unsafe_bitgetindex(chunks, i)  # accumulate into the 1-element array b
    end
end

a = falses(1000);
b = [1];
for i = 1:1; f(a, b); end   # warm-up call so compilation isn't timed
@time for i = 1:10000; f(a, b); end

a = falses(10000);
b = [1];
@time for i = 1:10000; f(a, b); end

a = falses(100000);
b = [1];
@time for i = 1:10000; f(a, b); end
```

On eb5da26 I get:

…
and with this PR (where I obviously don't extract `chunks`):

…
With this PR, there is no GC root, but there are two extra loads in the inner loop, since the `chunks` field and its data pointer have to be re-loaded on each iteration. If I accumulate into a variable instead of the array, then I don't actually see a substantial difference for this benchmark. The inner loop IR is the same. I'd expect a performance advantage for this PR for very small cases, but it doesn't seem to be larger than the variability in my benchmarking for 1000 elements.
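For concreteness, a sketch of the accumulate-into-a-variable variant described above (my reconstruction using the same `Base.unsafe_bitgetindex(chunks, i)` helper as the benchmark; not necessarily the exact code that was timed):

```julia
# Reconstruction of the accumulator variant discussed above (an assumption, not the
# exact benchmark code). The count goes into a local `s` instead of an array slot,
# so the inner loop has no store into a heap-allocated array.
function f_acc(x::BitVector)
    chunks = x.chunks
    s = 0
    for i = 1:length(x)
        @inbounds s += Base.unsafe_bitgetindex(chunks, i)
    end
    return s
end

a = falses(1000)
@time for i = 1:10000; f_acc(a); end
```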
Thanks so much for doing those tests, Simon. This clearly isn't the answer.
@mbauman I was wondering why BitArrays were mutable and found this PR. Did you end up experimenting with making them immutable? It seems that only the …
I did a bit more experimentation in #11430, but I've not looked at it since. As I recall, the major performance hit was because using a …
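For context on what that would involve, here is a purely hypothetical sketch of an immutable bit-vector wrapper (my illustration, not the design explored in #11430): because the wrapper's fields can't be reassigned, growing operations have to allocate and return a new wrapper, which is roughly where a resizing cost would come from.

```julia
# Purely hypothetical illustration; not the actual code from #11430.
struct ImmutableBitVector
    chunks::Vector{UInt64}  # bit storage, 64 bits per chunk
    len::Int                # number of valid bits; cannot be mutated in place
end

Base.length(v::ImmutableBitVector) = v.len

function Base.getindex(v::ImmutableBitVector, i::Int)
    1 <= i <= v.len || throw(BoundsError(v, i))
    i1, i2 = divrem(i - 1, 64)
    return (v.chunks[i1 + 1] >> i2) & UInt64(1) == UInt64(1)
end

# Growing has to build a new wrapper, since `len` can't be updated in place.
function push(v::ImmutableBitVector, b::Bool)
    chunks = v.len % 64 == 0 ? vcat(v.chunks, UInt64(0)) : copy(v.chunks)
    i1, i2 = divrem(v.len, 64)
    b && (chunks[i1 + 1] |= UInt64(1) << i2)
    return ImmutableBitVector(chunks, v.len + 1)
end
```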