improved ind2sub/sub2ind #10337
Conversation
I wonder if there are better names for these functions, ones that could be significantly better than the Matlab names.
BTW, what are the performance gains on some examples? I don't have code handy that uses these a lot.
Sparse indexing and various other operations.
I also don't have a set of benchmark examples, and the problem is that they get optimized away in a simple for loop that just tries to call these functions a number of times with fixed arguments. Ah, but maybe I can avoid that behavior by letting the arguments of the function call depend on the iteration variable. I'll report back soon.
With the following test code (only for the hard case, ind2sub):
function test(N)
d1=N
@time for i=1:N;s1=ind2sub((d1,),i);end
d2=ceil(Int,N^(1/2))
@time for i=1:N;s2=ind2sub((d2,d2),i);end
d3=ceil(Int,N^(1/3))
@time for i=1:N;s3=ind2sub((d3,d3,d3),i);end
d4=ceil(Int,N^(1/4))
@time for i=1:N;s4=ind2sub((d4,d4,d4,d4),i);end
d5=ceil(Int,N^(1/5))
@time for i=1:N;s5=ind2sub((d5,d5,d5,d5,d5),i);end
end
I obtain on master:
and with this PR:
Clearly, the 1D case is just optimized away (the time does not depend on N). The 2D case is roughly the same or slightly slower. The 3D case seems a factor of 2 slower with this PR; this I will need to fix, either with a specialized version or with a more clever formulation of the general case. For higher dimensions, this PR is significantly faster.
Not sure what was wrong before, but the new timings with the latest commit are:
which beat the timings on master in every case, and even more significantly than before in the higher-dimensional cases.
grepping
The improved timings seem to come from the fact that I now explicitly forced inlining.
We really have to avoid using staged functions for too many things. They make it basically impossible to statically compile programs. Staged functions become totally useless with the JIT off. For simpler functions like this, I think we should try to do compiler improvements and rewrite them in other ways.
For the purpose of statically compiling, wouldn't it be possible to keep track of which functions were generated by staged functions, e.g. during a particular run of the code, and allow the user to precompile those functions? Obviously, this wouldn't work if the final code took a different code path (although even here an appropriate error could be issued), but a 90% solution could still be quite useful.
Exactly :)
I didn't know this. I am happy with closing this PR, or with trying to see whether I can improve the non-staged version. More generally, it would be useful to collect a list of tricks and compiler improvements that avoid having to write staged functions for code like this.
Is this worth discussing in a separate issue? I guess @timholy will have some more interesting requests to add to the list.
Yes, it would be great to develop more tricks for writing code like this without the full sledgehammer of staged functions, e.g. #5560. Generally what works now is "unary" encodings of numbers in tuple lengths, e.g. https://gist.github.com/JeffBezanson/10612461.
Thanks. I think I saw that Gist once before, but had forgotten about it. This is neat. I'd still like some specific advice on how to proceed with this PR. I still find it strange that these functions (
I use
Ok, using
and this latest PR yields:
So this PR is clearly faster, but has allocation overhead as soon as
@JeffBezanson, to clarify: a statically-compiled program would still have type inference -> native code generation, but not parsing/lowering? Otherwise I don't understand the distinction you're making, and would like to know more.
(The point being that
is relevant even for generic functions that have not yet been JITed.)
I was wondering about the same thing but didn't dare to ask, as I really don't know how these things work deep down.
You would have to fall back on running code through an interpreter.
That is correct. Then one problem is that we don't have an interpreter for the full language; e.g. LLVM intrinsics are not supported. The other problem is that it would be extremely slow, especially since staged functions are typically used for performance.
Sorry to be dense, but let me ask for a direct answer to my real question: is JITing still available in a statically-compiled program? If not, then I still don't see any difference between staged functions and generic functions.
Oh, I think I now understand which part of my question @jakebolewski was responding to, and so my understanding is:
If this is right, then the concern about staged functions still seems to be a red herring: if you cache the code returned by the stagedfunction, by design it's going to be faster to run via the interpreter than whatever code the stagedfunction would have replaced; the whole point of a stagedfunction is to generate optimized Julia code.
I believe that part of the confusion stems from there being two versions of "statically compiled": with and without JIT support. With JIT support, staged functions are no problem. Without JIT support, they are a problem. The difference between staged functions and macros in this respect is that you can know that there are no more macro expansions required, whereas it doesn't seem possible to know that you won't encounter more code generated by a staged function. For normal generic functions, you can always generate a slow, generic fallback, but for staged functions, you can't even do that because you don't yet have the code that you need to generate the slow generic fallback for.
Doesn't it just require two slow interpreted passes? One to generate the staged AST and a second to interpret it? There could be an extra optimization to cache the AST instead of the JIT-ed function, of course, but I don't see why it's that different.
Otherwise, might it be an option to require a non-staged fallback for every stagedfunction, which in the worst case just throws an error?
Yes, technically an interpreter can be used (as already discussed above). However, there is another reason to discourage staged functions, which is that their type inference is basically all-or-nothing, to preserve type safety. This is a tough problem. We could, for example, store all results and check monotonicity directly, giving an error if we catch a staged function "lying". But this is pretty complicated, and also pretty hard on staged function authors. It also leaves you open to unanticipated errors far in the future.
For a non-specialist, what do you mean by the monotonicity argument? And a completely different question: even though I am enjoying the general discussion and learning a lot, I'd also like to hear specific opinions on the current PR. I believe even the non-staged version in the latest commit still provides an improvement over the original definitions, though the simple benchmark results are of course less convincing than those of the staged version.
@ehaber99, the topic of this PR was not to improve the vectorized version of these functions. The single-value methods are certainly already performing much better with the definitions in the current commit. Essential to this is that the recursive definitions of ind2sub and sub2ind get inlined.
I finally understand what @mbauman and probably @timholy were talking about. It's not the inlining of
But don't be confused about what you see in a "naked" call from the REPL and how it will work out in practice when compiled into functions. If you copy the example in http://docs.julialang.org/en/latest/manual/metaprogramming/#an-advanced-example, then:
julia> g(dims, i1, i2, i3, i4, i5, i6) = sub2ind_gen(dims, i1, i2, i3, i4, i5, i6)
g (generic function with 1 method)
julia> @code_llvm g((2,2,2,2,2,2), 2, 2, 2, 2, 2, 2)
define i64 @julia_g_44425([6 x i64], i64, i64, i64, i64, i64, i64) {
top:
%7 = extractvalue [6 x i64] %0, 0
%8 = add i64 %2, -1
%9 = extractvalue [6 x i64] %0, 1
%10 = add i64 %3, -1
%11 = extractvalue [6 x i64] %0, 2
%12 = add i64 %4, -1
%13 = extractvalue [6 x i64] %0, 3
%14 = add i64 %5, -1
%15 = extractvalue [6 x i64] %0, 4
%16 = add i64 %6, -1
%17 = mul i64 %16, %15
%18 = add i64 %14, %17
%19 = mul i64 %18, %13
%20 = add i64 %12, %19
%21 = mul i64 %20, %11
%22 = add i64 %10, %21
%23 = mul i64 %22, %9
%24 = add i64 %8, %23
%25 = mul i64 %24, %7
%26 = add i64 %25, %1
ret i64 %26
}
which is just gorgeous. In contrast, even if you put
Yes, that's exactly the point, also for the post of @mbauman above, and it explains why his normal function gave short LLVM code. To make the vectorized definitions of these functions fast, I needed to implement new code along the lines suggested by @ehaber99. I tweaked it some more, and now obtain for this example, with the commit that I will push in a minute:
which looks pretty good to me.
I wonder if I'm misunderstanding you, but my impression is that we're saying opposite things. My point is that the best combination is to implement
Am I misunderstanding your timings? In every case the ones labeled "My" are slower than the ones labeled "Julia."
Well, I just ran the test file of @ehaber99 against my latest commit, so the "My" timings refer to the code of @ehaber99, and the "Julia" timings are the ones corresponding to my new implementation in the latest commit. But apparently there is still an error in CI; I got too tired yesterday evening.
If there's something small we can change to fix the current ones, we should do that. Maybe it would be worth inserting a breakpoint here to see what's happening.
That makes much more sense, thanks for explaining!
Ok, so that's the final, test-succeeding commit, which does not use stagedfunction.
I think this is more or less ready. I guess it could use a final review; it ain't much, just a few lines of code. The only point I am not fully satisfied with is about the special casing of
One more fix after all; with this I guess the output of all
end
return index
return :($ex + 1)
Does it help to add Expr(:meta, :inline) here, to avoid the allocation of the tuple? Or does this always get inlined?
It probably does not get inlined. Would it be necessary for sub2ind to be inlined? Before, this was necessary because of the recursive definition. Probably it is still necessary if this is relied on in indexing? I will restore the explicit inline behavior.
LGTM. I don't know if it's a coincidence, but that Travis segfault happened in arrayops, which is where these functions get exercised. It's interesting that we don't seem to test them directly.
It seems to be a 32-bit issue (which is why I didn't catch it), but it is likely caused by this final fix. I will rebase and change the git history such that the final fix is part of the first commit, and add the inline statement discussed in the line comments. This will trigger a new CI run; let's see what happens.
All looks good now.
Thanks for doing this!
Use a stagedfunction definition of sub2ind and ind2sub to generate specialized code for every N in dims::NTuple{N,Integer}. See #10331 for related discussion and performance check. I can do more performance benchmarking if somebody can point me to a good test suite that depends heavily on the use of those two functions.