propagate effect_free information out of functions #9974

Closed
mbauman opened this issue Jan 30, 2015 · 3 comments
Labels
performance Must go faster

Comments


mbauman commented Jan 30, 2015

Or, alternatively, "when @inline is slower than manually inlining." Here's an example:

type T
    bits::BitVector
end

# I want to simply dispatch to an inlined function, but it's slower than I expected
unsafe_in1(s,n) = Base.unsafe_getindex(s.bits, n+1)
# So I started manually un-inlining it...
unsafe_in2(s,n) = Base.unsafe_bitgetindex(s.bits.chunks, n+1)
function unsafe_in3(s,n) 
    i = n+1
    Bc = s.bits.chunks # This assignment causes a GC Frame
    i1, i2 = Base.get_chunks_id(i)
    u = uint64(1) << i2
    @inbounds r = (Bc[i1] & u) != 0
    return r
end
# Until I found a version that works as I expected: macro-style inlining
function unsafe_in4(s,n) 
    i = n+1
    i1, i2 = Base.get_chunks_id(i)
    u = uint64(1) << i2
    @inbounds r = (s.bits.chunks[i1] & u) != 0
    return r
end

Looking at the results of code_llvm(unsafe_inX, (T, Int)), we emit a GC frame in all of the above functions except unsafe_in4. This can cause slowdowns of up to 35% on very simple functions like the ones above:

function timeit()
    s = T(falses(10))
    # Call each function once to exclude compilation, then time the loop
    unsafe_in3(s, 1); @time for i=1:100000000 unsafe_in3(s,1) end
    unsafe_in4(s, 1); @time for i=1:100000000 unsafe_in4(s,1) end
end
timeit()

elapsed time: 0.433966378 seconds (0 bytes allocated)
elapsed time: 0.31291549 seconds (0 bytes allocated)

This is true both before and after the SSA patch. I've done spot-checks on today's 61d3ece and a month-old 48aae1f with the same results.
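For anyone reproducing this: the difference is visible directly in the printed IR, where (on these builds) the GC frame shows up as extra setup before the function body:

code_llvm(unsafe_in3, (T, Int))   # IR begins with GC frame setup
code_llvm(unsafe_in4, (T, Int))   # no GC frame setup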

mbauman changed the title from "Avoid" to "Avoid GC frames when inlining functions" on Jan 30, 2015
mbauman added the performance (Must go faster) label on Jan 30, 2015

vtjnash commented Jan 31, 2015

The problem is that we don't store/propagate the knowledge that get_chunks_id is pure (aka effect_free). This prevents inference from making the final transform that you made by hand in your code above.
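For context, later Julia versions grew a user-facing way to assert this property. A minimal sketch using Julia 0.5's Base.@pure (not available on the 0.4-dev builds in this thread; the function name is hypothetical):

# Asserting purity lets inference treat calls to this function as
# effect_free and, for example, constant-fold them when arguments are known.
Base.@pure half_chunk_id(i::Int) = ((i - 1) >> 6) + 1   # _div64-style arithmetic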

vtjnash changed the title from "Avoid GC frames when inlining functions" to "propagate effect_free information out of functions" on Jan 31, 2015
mbauman added a commit that referenced this issue May 22, 2015
This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, `unsafe_bit(get|set)index`, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on the whole BitArray itself, which enables an optimization that avoids that root (since the chunks array is then only accessed as an argument to `arrayref`, which is special-cased).

The ~25% performance gain is for unsafe_getindex; the difference isn't quite as big for getindex (only ~10%) since there's still a GC root for the BoundsError. That can also be avoided, but I'd rather make that change more systematically (as `checkbounds`) with #10525 or a subset thereof.
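A minimal sketch of the shape of that refactor (simplified; the actual Base definitions differ), in which the chunks array appears only as a direct argument to indexing:

function unsafe_bitgetindex_sketch(B::BitArray, i::Int)
    i1, i2 = Base.get_chunks_id(i)
    u = UInt64(1) << i2
    # B.chunks is used only inside the indexing call itself, so the
    # special-cased arrayref optimization can avoid rooting it.
    @inbounds r = (B.chunks[i1] & u) != 0
    return r
end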

mbauman commented May 22, 2015

I was looking into this a bit more last night, hoping it might be an easy fix. There are several intertwined optimizations here, but effect_free information does seem to propagate properly. Here's a minimal refactoring that performs the optimization I'd like for Base.unsafe_bitgetindex:

import Base: _div64, _mod64
inner(Bc, i) = (Bc[_div64(i-1)+1] & (UInt64(1) << _mod64(i-1))) != 0
outer(B, i) = inner(B.chunks, i)
code_typed(outer, (BitVector, Int)) # No temporary assignment
code_llvm(outer, (BitVector, Int)) # No GC frame
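To make such spot-checks less manual, the printed IR can be scanned programmatically. A hedged helper, assuming code_llvm accepts an IO first argument on this build and that the frame is spelled "gcframe" in the IR text (adjust the needle if your build names it differently):

# Returns true if the printed LLVM IR for f on these argument types
# mentions a GC frame.
has_gcframe(f, types) = contains(sprint(io -> code_llvm(io, f, types)), "gcframe")

has_gcframe(outer, (BitVector, Int))   # expected: false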

Now it's easy to play around and see what blocks the desired optimization. The goal is to eliminate the temporary assignment of the argument and allow the expression to be plugged directly into the argument of arrayref (where a special-cased optimization allows the GC roots of its arguments to be elided):

  • Any assignment (even one assigning the value of an effect_free expression) counts as effectful, so even the simple inner(Bc, i) = (x = Bc[i]; x) stymies this optimization. But that's not too terrible… for small hot functions like this it's easy to work around (as I did above).
  • @inbounds. This is the nail in the coffin. The simple inner(Bc, i) = @inbounds return Bc[i] is the minimal counter-example. The argument to inner might itself be of the form x[j], which absolutely must be evaluated outside the context of @inbounds, so this transformation isn't safe in general (see the sketch after this list).
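A minimal sketch of why that substitution is unsafe (the caller here is hypothetical):

inner(Bc, i) = @inbounds return Bc[i]
# A caller may compute the argument with its own indexing expression,
# which must keep its bounds check:
outer_caller(x, j, i) = inner(x[j], i)
# Naively substituting the argument into the @inbounds body would yield
#   @inbounds return x[j][i]
# silently dropping the bounds check on x[j].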

So I'm not sure there's anything directly actionable here, short of a much larger escape-analysis project to ensure that the mutable field an object was loaded from isn't reassigned (leaving the object unrooted) before that object is used… which seems terribly specific and like a lot of hard work.

(Note that this isn't an issue for an immutable type, which BitArrays could almost be were it not for mutable-length vectors.)
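A sketch of that point in 0.4-era syntax (the type here is hypothetical):

# If the chunks-holding type were immutable, its fields could never be
# reassigned after construction, so the hazard above couldn't occur.
immutable FixedBits
    chunks::Vector{UInt64}
end
getbit(B::FixedBits, i1::Int, u::UInt64) = @inbounds return (B.chunks[i1] & u) != 0
# BitArray itself can't be immutable: push!/resize! must update its
# length and occasionally reassign its chunks field.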


mbauman commented Apr 21, 2016

Looks like this got fixed somewhere along the way — I now see 0.29s for both examples in the original post.

mbauman closed this as completed on Apr 21, 2016