
Vectorization Roadmap #16285

Closed · 5 tasks done
stevengj opened this issue May 9, 2016 · 86 comments
Labels: broadcast (Applying a function over a collection) · julep (Julia Enhancement Proposal)
Milestone: 0.5.0
@stevengj (Member) commented May 9, 2016

Now that #15032 is merged, here are the main remaining steps discussed in #8450, roughly in the order in which they should be implemented:

More speculative proposals probably for the 0.6 timeframe (suggested by @yuyichao):

  • Define nested f.(args...) calls as "fusing" broadcast operations at the syntax level. For example, sin.(x .+ cos.(x .^ sum(x.^2))) would turn (in julia-syntax.scm) into broadcast((x, _s_) -> sin(x + cos(x^_s_)), x, sum(broadcast(^, x, 2))). Notice that the sum function (or any non-dot call) would be a "fusion boundary." The caller would be responsible for not using f.(args...) in cases where fusion would screw up side effects. (See the sketch after this list, and fusion of nested f.(args) calls into a single broadcast call #17300.)
  • Define x .= ... as "fusing" calls to broadcast!, e.g. x .= sin.(x .+ y) would act in-place with a single loop. Again, this would occur at the syntax level, so the caller would be responsible for avoiding function calls with problematic side effects. (See make .+= et al. mutating #7052, but the lack of loop fusion at that time made .+= much less attractive.) treat .= as syntactic sugar for broadcast! #17510
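A minimal sketch of what these two transformations amount to, written as ordinary Julia (the lowered forms are illustrative, not the exact output of julia-syntax.scm):

x = rand(5); y = rand(5)

# Nested dot calls fuse into a single broadcast kernel...
a = sin.(x .+ cos.(x .^ sum(x.^2)))

# ...equivalent to hoisting the non-dot call (the fusion boundary) and
# fusing everything else into one anonymous function:
s = sum(broadcast(^, x, 2))
b = broadcast((u, v) -> sin(u + cos(u^v)), x, s)
a == b  # true (the same operations, just spelled out)

# Likewise, x .= sin.(x .+ y) would act in place, as if written:
broadcast!((u, v) -> sin(u + v), x, x, y)  # single fused loop, no temporaries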
@stevengj added the julep (Julia Enhancement Proposal) label May 9, 2016
@andyferris (Member) commented:

Regarding the third point, can we do .= as broadcast! without worrying overly about loop fusion?

As in, A .= B would become broadcast!(x -> x, A, B), an element-wise copy. Of course, more complex expressions would require more complex loop fusion, but this gives some moderate gain. .+= and so on are very low-hanging fruit for removing at least one layer of allocation very simply.

(Hmm... that last one makes me wonder whether .sin= would be good syntax, as in A .sin= B being like Julia 0.4's A[:] = sin(B[:]). I'm being more than a bit silly here, but it works similarly to the . prefix version .sin(x) rather than the suffix sin.(x). It even kinda nests: A .exp.sin= x. Quite ugly, though, and it messes with scoping/field syntax.)

@ViralBShah added this to the 0.5.0 milestone May 10, 2016
@stevengj (Member, Author) commented May 10, 2016

@andyferris, without loop fusion, an in-place .= is pretty useless. An expression like x .= 4y .+ 5y.^2 .- sin.(y) will still allocate lots of temporary arrays. If you just want element-wise copy, you can already do A[:] = B.
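To make the allocation point concrete, a sketch (assuming every operation in the expression is dotted, so the whole right-hand side can fuse):

y = rand(10); x = similar(y)

# Without fusion, each vectorized step allocates its own temporary:
t1 = 4 .* y
t2 = 5 .* y.^2
t3 = sin.(y)
x[:] = t1 .+ t2 .- t3    # and the final sum allocates yet again

# With fusion, the entire assignment becomes one in-place loop:
broadcast!(t -> 4t + 5t^2 - sin(t), x, y)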

@andyferris (Member) commented:

@stevengj I agree that it is (almost) useless, except as an alternative copy syntax. It's just "step zero": it's dirt simple, and it sets the precedent for "step one", doing the same for .+= and all the other op= operators with a . prefix. That starts to be useful, e.g. for making x .+= 1 non-allocating where x is an Array. In fact, the number of op= operators currently defined is relatively small... so it shouldn't be too hard to define them all.

"Step 2" would be the more complex loop fusion or similar, and perhaps generalizations to generic functions.

On the whole, I do have to support what Jeff said in #14544 (comment). Perhaps the cleaner approach is to have some neat syntax for map/broadcast (and hopefully where loop fusion is clear to the user or compiler). I dunno.
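A sketch of "step zero" and "step one" (the lowerings shown are the hoped-for behavior, not anything implemented yet):

A = zeros(3); B = rand(3); x = ones(3)

A .= B     # step zero: broadcast!(identity, A, B), an element-wise copy
x .+= 1    # step one: broadcast!(+, x, x, 1), in place, no new array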

@stevengj (Member, Author) commented May 11, 2016

@andyferris, that discussion is out of date. We now do have a neat syntax for broadcast, and the possibility of loop fusion is now exposed to the compiler at the syntax level. And since that syntax is ., generalized dot operators now make a lot more sense.

@andyferris (Member) commented:

@stevengj Sorry, I missed the last week or so of #8450 (it's really hard to keep up with all the threads!).

In any case, the progress seems very cool! The parser-based loop fusion (e.g. your #8450 (comment)) seems like a great option to me. Anything more complex can still be done explicitly with loops, comprehensions, map, and broadcast.

@diegozea (Contributor) commented:

Wouldn't a @fuse macro be safer than doing the loop fusion at the parser level?

@yuyichao (Contributor) commented:

What's the advantage? We already have that in Devectorize.jl, and I don't think it would be more flexible or give you more control.

@diegozea (Contributor) commented May 11, 2016

I'm only worried about making loop fusion happen everywhere. "The caller would be responsible for avoiding function calls with problematic side effects", but how can the user use the dot notation and avoid the fusion at the same time? Also, I think @fuse would be more explicit to write and to read in the code. @fuse could do the same proposed loop fusion, but it could also check the functions to be called (i.e. annotated pure functions). The latter isn't possible at the syntax level.

@yuyichao (Contributor) commented May 11, 2016

> The caller would be responsible for avoiding function calls with problematic side effects

I think this is not an issue for the majority of cases.

> How can the user use the point notation and avoid the fusion at the same time?

Split the line.

> but It can also check the functions to be called (i.e. annotated pure functions).

This is impossible, or at least a macro won't be better at doing that.

@stevengj (Member, Author) commented May 11, 2016

@diegozea, macros operate at the syntax level, without knowing types, so they can't check purity. The same goes for the proposed f.(args...) fusing. (In practice, though, vectorized operations are virtually never used in cases with side effects, much less side effects that would be affected by fusion. And if f.(args...) is defined as always fusing, then you can reliably know what to expect.)

Think of f.(args...) as a much more compact and discoverable syntax for @fuse. (Once we deprecate e.g. sin(x) in favor of sin.(x) etcetera, people doing Matlab-like computations will end up using fused loops automatically in many cases thanks to the deprecation warning, whereas they might never learn about @fuse.)

@stevengj (Member, Author) commented May 11, 2016

Another way of saying it is that fusing the loops is a good idea in the vast majority of cases. Not wanting to fuse is the rare exception. That makes it sensible to make fusing the default, and in the rare cases where you don't want to fuse you can either split the line or just call broadcast explicitly.
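Concretely, the two opt-out routes would look like this (a sketch with placeholder f, g, and x):

f(t) = 2t; g(t) = t + 1; x = rand(5)

y = f.(g.(x))                        # fused by default: one loop

t = g.(x)                            # opting out by splitting the line:
y = f.(t)                            # two loops, one temporary array

y = broadcast(f, broadcast(g, x))    # explicit calls are likewise not fused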

(Note that this whole discussion was not even possible before the . notation. If you write f(g(x)) in 0.4, there is no practical generic way to look "inside" f and g and discover that they are broadcast operations that ought to be fused, whereas f.(g.(x)) makes the user's intent clear at the syntax level, enabling a syntax-level fusing transformation.)

@diegozea (Contributor) commented:

Sorry, I meant the parse level, not the syntax level, in my last sentence. However, I believed that a macro could access a function's metadata.

@yuyichao (Contributor) commented May 11, 2016

> However I believed that a macro could access a function metadata.

No, it can't. For a start, it doesn't even have any idea about the binding: you are allowed to write sin = cos; sin(1) (or a much more reasonable form of this). @pure is also per-method, and you can't check that without knowing all the type info.

@stevengj (Member, Author) commented May 11, 2016

@diegozea, macros are called at the parse level, which is what we mean by the "syntax" level here; i.e., at the point the macro is called, all that is known is the abstract syntax tree (AST), and there is no information about types or bindings.
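A quick demonstration that a macro sees only the AST (the macro name here is hypothetical):

macro showast(ex)
    QuoteNode(ex)   # hand back the expression itself, unevaluated
end

@showast sin(x)   # returns :(sin(x)), no matter what sin or x are bound to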

@martinholters (Member) commented:

While fusion by default sounds like a good idea, I find it a bit unsettling that

y = g.(f.(x))

and

a = f.(x)
y = g.(a)

might give different results. Is that just me?

@nalimilan (Member) commented:

Not different results, just different ways of returning the same result. That's fine IMHO. In the long-term, the compiler might become smarter and detect more complex cases.

@martinholters (Member) commented:

@nalimilan You are assuming pure f and g.
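A sketch of the kind of impurity that makes the two forms observably different (contrived f and g sharing a counter):

c = Ref(0)
f(x) = (c[] += 1; x)
g(x) = x + c[]

x = [10, 20, 30]

c[] = 0
g.(f.(x))      # fused into broadcast(t -> g(f(t)), x): returns [11, 22, 33]

c[] = 0
a = f.(x)      # all of f runs first, leaving c[] == 3
g.(a)          # returns [13, 23, 33]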

@johnmyleswhite (Member) commented:

FWIW, I suspect most functions people are vectorizing are pure.

@quinnj (Member) commented May 12, 2016

Maybe we need to do some more exploration around function traits (purity, guaranteed return types, boolean, etc.) and how they could be incorporated into some of these murkier discussions (vectorization, return types, etc.). It seems like having some explicit declarations around function traits would avoid the need to rely on inference or the compiler.

@andyferris (Member) commented May 12, 2016

I'm worried that overly complex rules for defining how loop fusion works will just become too confusing for users. The suggestion that A .= B .+ C ... becomes syntactic sugar for map/broadcast means that, once we are all used to it, we will easily be able to reason about the code we see and will know how to write a simple fused loop expression for vectors of data.

If it is a simple parsing-level rule, then the differences in @martinholters's comment will be obvious. If it is compile-time magic, we will be spending a lot of time trying to figure out if the compiler is really doing what we want it to do.

But coming back to nested . functions/operators, would we be able to avoid allocation when matrix multiplication is in the middle, e.g. v1 .= v2 .+ m1*v3 (where m1 is a matrix, and v* are vectors)? Correct me if I'm wrong, but isn't this BLAS's daxpy? We would want that to be non-allocating, so hopefully whatever syntax we come up with would allow for this kind of thing (where the order of multiplications and additions for the matrix multiplication is cache-optimized as in BLAS, not direct as in map and broadcast).

@yuyichao (Contributor) commented:

@andyferris Yes, the hope is that the parser-level transformation can cover most use cases with well-defined semantics. It is in principle possible to add support for more complicated constructs (I very briefly talked about this with @andreasnoack), but I personally feel it is hard to come up with a syntax that can cover all the cases beyond what can currently be achieved with broadcast (or a mutating version of it).

Thinking out loud: maybe it could be achieved by having a lazy array type (so the computation is done on the fly within broadcast)? That would be an orthogonal change, though. It'd also be hard to recognize that and call BLAS functions.

@stevengj (Member, Author) commented May 13, 2016

@andyferris, the whole point is that the proposed loop fusion becomes a simple parsing-level guarantee, not compile-time at all. If you see f.(g.(x)), then you know that the loops are always fused into a single broadcast, regardless of the bindings of f, g, or x. This allows you to reason simply about the code consistently. It is indeed just syntactic sugar.

This is very different from a compile-time optimization that may or may not occur.

@andyferris (Member) commented:

@stevengj Yes, I understand and agree completely (my first two paragraphs were directed at @quinnj).

> the whole point is that the proposed loop fusion becomes a simple parsing-level guarantee,

To my taste, the keyword is "guarantee". I really do like it.

I was more thinking along the lines of what @yuyichao was thinking out loud. If Matrix-Vector multiplication were lazy (returning an iterable over i of sum(M[i,:] .* v) or similar), then it would "just work" with no temporaries, given a combination of your suggested changes to the parser and the introduction of the lazy multiplication type (and thus no "compile-time optimizations" are necessary beyond what already exists in Julia). It would be interesting to compare performance to daxpy.

(of course, when you say "compile-time" optimization I interpret that as changes to compilation after lowering, not changes to definitions in Base.LinAlg.)
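A toy version of that lazy idea (the type, its names, and the broadcast hookup are all hypothetical):

struct LazyMatVec{TM<:AbstractMatrix, TV<:AbstractVector}
    M::TM
    v::TV
end
Base.getindex(L::LazyMatVec, i::Integer) = sum(L.M[i, j] * L.v[j] for j in 1:size(L.M, 2))
Base.size(L::LazyMatVec) = (size(L.M, 1),)

# Given suitable broadcast integration, v1 .= v2 .+ LazyMatVec(m1, v3) could
# then compute each element on the fly, with no temporary for m1*v3.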

A similar thing for Matrix-Matrix multiplication is much, much harder. Although we could have arbitrarily clever iterators, they might or might not be the right tool for reimplementing a somewhat efficient dgemm in native Julia, or for calling out to it "automagically". But this is definitely something worth considering for the future, and for the development of syntax, because superfluous temporaries from multiplying matrices (along with adding them, etc., which hopefully will be fixed in this roadmap) are IMHO currently one of Julia's numerical performance bottlenecks when working with very large matrices/tensors/etc. (On that note, does an interface for dgemm! currently exist for the memory-constrained users who want to extract the last bit of efficiency out of Julia?)
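For reference, in-place BLAS wrappers do already exist in Base; a minimal sketch (module path as of Julia 0.5/0.6):

import Base.LinAlg.BLAS

A = rand(4, 4); B = rand(4, 4); C = zeros(4, 4)
x = rand(4); y = rand(4)

BLAS.axpy!(2.0, x, y)                    # daxpy: y = 2.0*x + y, in place
BLAS.gemv!('N', 1.0, A, x, 1.0, y)       # dgemv: y = A*x + y, in place
BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)  # dgemm: C = A*B, fully in place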

@StefanKarpinski (Member) commented:

@stevengj: are you planning on tackling this or should we figure out who else can tackle it?

@bramtayl (Contributor) commented Oct 13, 2016

OK, I've gotten ChainMap into a place where it's relatively stable, and I'd like to propose an alternative method of dot vectorization based on the work there. Here's the idea:

  • A type, LazyCall, which contains both a function and its arguments (positionals in a tuple and keywords in an OrderedDict)
  • An @~ macro to create a LazyCall from a woven expression (this is currently @unweave in ChainMap)
  • New methods for map, broadcast, etc. which take LazyCalls.

Here's how the syntax would change:

map( ~ sin(~x + y) ) goes to map( LazyCall(x -> sin(x + y), x) )
broadcast( ~ sin(~x + ~y) ) goes to broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )
mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( LazyCall( x -> sin(x + y), x), +)

Splats (needs more thought, I think)
~ f(~(x...) ) goes to LazyCall( (x...) -> f(x...), x...)
~ f(~(;x...) ) goes to LazyCall( (;x...) -> f(;x...); x...)
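A rough sketch of how the LazyCall piece could look (field layout, helper name, and method signatures are my assumptions; keywords omitted for brevity):

struct LazyCall{F}
    f::F
    args::Tuple
end

lazycall(f, args...) = LazyCall(f, args)   # hypothetical helper constructor

Base.map(lc::LazyCall) = map(lc.f, lc.args...)
Base.broadcast(lc::LazyCall) = broadcast(lc.f, lc.args...)
Base.mapreduce(lc::LazyCall, op) = mapreduce(lc.f, op, lc.args...)

xs = [1.0, 2.0]; y = 0.5
map(lazycall(t -> sin(t + y), xs))   # what map( ~ sin(~x + y) ) would become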

Advantages over the current system:

  1. No ambiguity over what counts as a scalar and what doesn't
  2. Reduction in the amount of dots necessary for a complicated expression
  3. Support for non-broadcast operations
  4. Support for passing multiple LazyCalls to one function
  5. Partial support for lazy evaluation
  6. No need for fusion

Disadvantages:

  1. ~ is already taken as bit negation. @~ is taken for formulas. I like the symbol because it reminds me of a woven thread. But maybe another ASCII character will do?
  2. Not backwards compatible; people will need to learn a new system.
  3. More verbose
  4. Not sure about performance impacts

All of this is implemented in ChainMap at the moment, aside from the new methods for map and friends. It can probably stay in a package for now if people aren't interested in adding it to Base.

@ChrisRackauckas (Member) commented:

> map( ~ sin(~x + y) ) goes to map( LazyCall(x -> sin(x + y), x) )
> broadcast( ~ sin(~x + ~y) ) goes to broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )
> mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( LazyCall( x -> sin(x + y), x), +)

I'm sorry, but I'm not seeing how this proposal would clean up code at all. What would ~x + ~y be? And

> 1. Reduction in the amount of dots necessary for a complicated expression

Would be true, but there wouldn't be fewer ~s than there were .s.

@andyferris (Member) commented Oct 13, 2016

While we are on the topic of new ideas, yesterday @c42f and I had some fun playing with the idea of "destructuring" an array.

Sometimes you have a vector of composite types ("structs"), and you might want to refer to all the values of a certain field. Consider this operation and the potential syntax we could implement here with the dot-call idea:

v::Vector{Complex{Float64}}

# get the real components using current syntax
broadcast(x -> x.re, v) 
broadcast(x -> getfield(x, :re), v)  # alternative syntax

# Kind-of works, but needs scalar-like size methods added to `Symbol`.
# It is also slow because the :re isn't properly inlined.
getfield.(v, :re) 

# Potential new convenience syntax (dot call on dot reference):
v..re

There is also a similar thing where you might have a vector of tuples or a vector of vectors, and we can currently do something like

v::Vector{Vector{Int}}

# get the first elements
getindex.(v, 1)

# potential new syntax
v.[1]

v_tuple::Vector{Tuple}

# Doesn't seem to be optimal speed (tuple indexing with Const is special)
getindex.(v_tuple, 1)

To conclude: as a performance fix we could (probably should) support faster getfield.() for vectors of structs and faster getindex.() of vectors of vectors/tuples.

As a new syntax proposal, we might consider supporting operators .. and .[] as broadcasted syntax of getfield and getindex respectively.
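On the performance-fix half, one way to make the field access inline today is to lift the field name into the type domain (a sketch; the helper name is hypothetical):

# Passing the field as a Val makes it a compile-time constant, so getfield
# inlines and the element type is inferred:
getfields(v::AbstractVector, ::Val{f}) where {f} = [getfield(x, f) for x in v]

v = [1.0 + 2.0im, 3.0 + 4.0im]
getfields(v, Val(:re))   # [1.0, 3.0]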

@bramtayl (Contributor) commented Oct 13, 2016

@ChrisRackauckas

Agreed: the dot syntax is less verbose, but this one is more flexible.

~x + ~y would just be ~x + ~y (it doesn't pass through a macro)
~ ~x + ~y would be LazyCall( (x, y) -> x + y, x, y)

In some cases, the number of dots is cut down compared to ~s. E.g.
sin.(cos.(tan.(x .+ 1))) vs broadcast( ~ sin(cos(tan( ~x + 1) ) ) )

@ChrisRackauckas (Member) commented Oct 13, 2016

But broadcast( ~ sin(cos(tan( ~x + 1) ) ) ) is still a much worse option, because broadcast should not be in the mathematical expression. The point of the syntax is to keep the "programming" out of the "math". I like to be able to glance at the LaTeX and then glance at the code and see the correctness, while getting performance.

@ChrisRackauckas (Member) commented:

Also, I don't see the reason for the LazyCall in this case. Making the type and the tuple container for the arguments adds to the amount of allocations, when what we really want is simple syntax for allocation-free code.

@bramtayl (Contributor) commented Oct 13, 2016

Agreed that the pure math aspect of things gets lost.

There are allocation-free ways of going about this:

map( ~ sin(~x + y) ) goes to map( x -> sin(x + y), x)
broadcast( ~ sin(~x + ~y) ) goes to broadcast( (x, y) -> sin(x + y), x, y)
mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( x -> sin(x + y), +, x)

And in fact, ChainMap has some convenience macros for doing this:

@map sin(_ + y) goes to map( _ -> sin(_ + y), _) (_ is used as part of the chaining)
@broadcast sin(~x + ~y) goes to broadcast( (x, y) -> sin(x + y), x, y)

But to do this you need to make assumptions about argument order (put the anonymous function first, the woven arguments last).

LazyCalls do have other advantages. They can be merged, pushed to, revised, etc. This is particularly useful for making plot calls.

@StefanKarpinski (Member) commented:

@bramtayl: if you've got a counter-proposal, you should open an issue or start a julia-dev thread.

@bramtayl (Contributor) commented:

Ok, see #18915

@wsshin (Contributor) commented Dec 16, 2016

I have a couple of questions about the third goal in this vectorization roadmap, which is to "parse operators like a .+ b as broadcast(+, a, b)".

  • Now that broadcast() works for tuples, will we have a .+ b for tuples a and b when the third goal is achieved?
  • Currently broadcast(+, a, b) is ~4 times slower than a .+ b for vectors a and b. (See the benchmark result below.) Is this going to be fixed before the third goal is achieved?
julia> using BenchmarkTools

julia> @benchmark [1,2,3] .+ 1
BenchmarkTools.Trial:
  memory estimate:  224.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     109.144 ns (0.00% GC)
  median time:      152.579 ns (0.00% GC)
  mean time:        203.520 ns (19.59% GC)
  maximum time:     5.679 μs (94.30% GC)
  --------------
  samples:          10000
  evals/sample:     916
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark broadcast(+, [1,2,3], 1)
BenchmarkTools.Trial:
  memory estimate:  320.00 bytes
  allocs estimate:  6
  --------------
  minimum time:     583.749 ns (0.00% GC)
  median time:      599.648 ns (0.00% GC)
  mean time:        784.025 ns (14.38% GC)
  maximum time:     74.077 μs (98.31% GC)
  --------------
  samples:          10000
  evals/sample:     179
  time tolerance:   5.00%
  memory tolerance: 1.00%

The above benchmark results were obtained with the following Julia version:

julia> versioninfo()
Julia Version 0.6.0-dev.1565
Commit 0408aa2* (2016-12-15 03:02 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.0.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     T9900  @ 3.06GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)

@stevengj (Member, Author) commented Dec 16, 2016

@wsshin, yes, .+ etcetera now (pending merge) works for tuples, but it probably needs some performance work in that case.

julia> (3,4,5) .+ 7
(10,11,12)

julia> (3,4,5) .+ (8,9,10)
(11,13,15)

As for your benchmark, there is currently an inlining failure slowing down broadcast (#19608) that should hopefully be fixed soon. As a result, the overhead is significant for very small vectors.

(Of course, for very small vectors like [1,2,3] you will always get better performance from StaticArrays.jl, since in StaticArrays the length of the vector is known at compile time too, and the entire loop can be unrolled and inlined, not to mention that the result does not need to be heap-allocated, unlike [1,2,3].)
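(A sketch of that, assuming the StaticArrays.jl package:)

using StaticArrays

v = SVector(1, 2, 3)   # the length is part of the type
v .+ 7                 # loop unrolled at compile time; no heap allocation needed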

@stevengj (Member, Author) commented:

Closing, since all items are now merged.

Further improvements are possible, such as a generic syntax like sum:(x .+ 5) for fusing any function, making x[x .> 5] fusing, and so on, and of course performance improvements (e.g. @Sacha0's planned work on improving sparsity preservation in broadcast), but those can have their own issues marked with the vectorization label.

@stevengj (Member, Author) commented:

(By the way, the performance problems noted by @wsshin were fixed by #19639.)
