propagate effect_free information out of functions #9974

Closed
mbauman opened this issue Jan 30, 2015 · 3 comments
Labels
performance Must go faster

Comments


mbauman commented Jan 30, 2015

Or, alternatively, "when @inline is slower than manually inlining." Here's an example:

type T
    bits::BitVector
end

# I want to simply dispatch to an inlined function, but it's slower than I expected
unsafe_in1(s,n) = Base.unsafe_getindex(s.bits, n+1)
# So I started manually un-inlining it...
unsafe_in2(s,n) = Base.unsafe_bitgetindex(s.bits.chunks, n+1)
function unsafe_in3(s,n) 
    i = n+1
    Bc = s.bits.chunks # This assignment causes a GC Frame
    i1, i2 = Base.get_chunks_id(i)
    u = uint64(1) << i2
    @inbounds r = (Bc[i1] & u) != 0
    return r
end
# Until I found a version that works as I expected: macro-style inlining
function unsafe_in4(s,n) 
    i = n+1
    i1, i2 = Base.get_chunks_id(i)
    u = uint64(1) << i2
    @inbounds r = (s.bits.chunks[i1] & u) != 0
    return r
end

Looking at the results of code_llvm(unsafe_inX, (T, Int)), we emit a GC frame in all of the above functions except unsafe_in4. This can cause slowdowns of up to 35% on very simple functions like the ones above:

function timeit()
    s = T(falses(10))
    # Call each function once to exclude compilation, then time the loop
    unsafe_in3(s, 1); @time for i=1:100000000 unsafe_in3(s,1) end
    unsafe_in4(s, 1); @time for i=1:100000000 unsafe_in4(s,1) end
end
timeit()

elapsed time: 0.433966378 seconds (0 bytes allocated)
elapsed time: 0.31291549 seconds (0 bytes allocated)

This is true both before and after the SSA patch. I've done spot-checks on today's 61d3ece and a month-old 48aae1f with the same results.
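For anyone reproducing this: the difference is visible directly in the printed IR, where (on these builds) the GC frame shows up as extra setup before the function body:

code_llvm(unsafe_in3, (T, Int))   # IR begins with GC frame setup
code_llvm(unsafe_in4, (T, Int))   # no GC frame setup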

mbauman changed the title from "Avoid" to "Avoid GC frames when inlining functions" on Jan 30, 2015
mbauman added the performance (Must go faster) label on Jan 30, 2015

vtjnash commented Jan 31, 2015

The problem is that we don't store/propagate the knowledge that get_chunks_id is pure (aka effect_free). This prevents inference from making the final transform that you made by hand in your code above.
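For context, later Julia versions grew a user-facing way to assert this property. A minimal sketch using Julia 0.5's Base.@pure (not available on the 0.4-dev builds in this thread; the function name is hypothetical):

# Asserting purity lets inference treat calls to this function as
# effect_free and, for example, constant-fold them when arguments are known.
Base.@pure half_chunk_id(i::Int) = ((i - 1) >> 6) + 1   # _div64-style arithmetic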

vtjnash changed the title from "Avoid GC frames when inlining functions" to "propagate effect_free information out of functions" on Jan 31, 2015
mbauman added a commit that referenced this issue May 22, 2015
This works around issue #9974 for BitArray indexing. BitArrays use inlined helper functions, `unsafe_bit(get|set)index`, to do the dirty work of picking bits out of the chunks array. Previously, these helpers took the array of chunks as an argument, but that array needs a GC root since BitArrays are mutable. This changes those helper functions to work on the whole BitArray itself, which enables an optimization that avoids that root (since the chunks array is then only accessed as an argument to `arrayref`, which is special-cased).

The ~25% performance gain is for unsafe_getindex; the difference isn't quite as big for getindex (only ~10%) since there's still a GC root for the BoundsError. That can also be avoided, but I'd rather make that change more systematically (as `checkbounds`) with #10525 or a subset thereof.
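A minimal sketch of the shape of that refactor (simplified; the actual Base definitions differ), in which the chunks array appears only as a direct argument to indexing:

function unsafe_bitgetindex_sketch(B::BitArray, i::Int)
    i1, i2 = Base.get_chunks_id(i)
    u = UInt64(1) << i2
    # B.chunks is used only inside the indexing call itself, so the
    # special-cased arrayref optimization can avoid rooting it.
    @inbounds r = (B.chunks[i1] & u) != 0
    return r
end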

mbauman commented May 22, 2015

I was looking into this a bit more last night, hoping it might be an easy fix. There are several intertwined optimizations here, but effect_free information does seem to propagate properly. Here's a minimal refactoring that performs the optimization I'd like for Base.unsafe_bitgetindex:

import Base: _div64, _mod64
inner(Bc, i) = (Bc[_div64(i-1)+1] & (UInt64(1) << _mod64(i-1))) != 0
outer(B, i) = inner(B.chunks, i)
code_typed(outer, (BitVector, Int)) # No temporary assignment
code_llvm(outer, (BitVector, Int)) # No GC frame
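To make such spot-checks less manual, the printed IR can be scanned programmatically. A hedged helper, assuming code_llvm accepts an IO first argument on this build and that the frame is spelled "gcframe" in the IR text (adjust the needle if your build names it differently):

# Returns true if the printed LLVM IR for f on these argument types
# mentions a GC frame.
has_gcframe(f, types) = contains(sprint(io -> code_llvm(io, f, types)), "gcframe")

has_gcframe(outer, (BitVector, Int))   # expected: false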

Now it's easy to play around and see what blocks the desired optimization. The goal is to eliminate the temporary assignment of the argument and allow the expression to be plugged directly into the argument of arrayref (where a special-cased optimization allows the GC roots of its arguments to be elided):

  • Any assignment (even one assigning the value of an effect_free expression) counts as effectful, so even the simple inner(Bc, i) = (x = Bc[i]; x) stymies this optimization. But that's not too terrible… for small hot functions like this it's easy to work around (as I did above).
  • @inbounds. This is the nail in the coffin. The simple inner(Bc, i) = @inbounds return Bc[i] is the minimal counter-example. The argument to inner might itself be of the form x[j], which absolutely must be evaluated outside the context of @inbounds, so this transformation isn't safe in general (see the sketch after this list).
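A minimal sketch of why that substitution is unsafe (the caller here is hypothetical):

inner(Bc, i) = @inbounds return Bc[i]
# A caller may compute the argument with its own indexing expression,
# which must keep its bounds check:
outer_caller(x, j, i) = inner(x[j], i)
# Naively substituting the argument into the @inbounds body would yield
#   @inbounds return x[j][i]
# silently dropping the bounds check on x[j].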

So I'm not sure there's anything directly actionable here, short of a much larger escape-analysis project to ensure that the mutable field an object was loaded from isn't reassigned (leaving the object unrooted) before that object is used… which seems terribly specific and like a lot of hard work.

(Note that this isn't an issue for an immutable type, which BitArrays could almost be were it not for mutable-length vectors.)
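A sketch of that point in 0.4-era syntax (the type here is hypothetical):

# If the chunks-holding type were immutable, its fields could never be
# reassigned after construction, so the hazard above couldn't occur.
immutable FixedBits
    chunks::Vector{UInt64}
end
getbit(B::FixedBits, i1::Int, u::UInt64) = @inbounds return (B.chunks[i1] & u) != 0
# BitArray itself can't be immutable: push!/resize! must update its
# length and occasionally reassign its chunks field.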


mbauman commented Apr 21, 2016

Looks like this got fixed somewhere along the way — I now see 0.29s for both examples in the original post.

mbauman closed this as completed on Apr 21, 2016