
Avoid type instability when constructing MulAddMul #47088

Closed
wants to merge 6 commits

Conversation

amilsted (Contributor) commented Oct 6, 2022

Use a macro to move the branching from the MulAddMul(alpha, beta) constructor, where it causes type instability (unless one has constant alpha and beta and gets lucky with constant propagation), to the calling function, where it does not cause type instability.

This should eliminate runtime dispatch in e.g. mul!(C, A, B, alpha, beta) calls where alpha and beta are not constant (see #46865).
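
Roughly, the shape of the expansion at a call site is something like the sketch below (illustrative only, assuming LinearAlgebra's internal MulAddMul{ais1, bis0, TA, TB} layout; the actual macro in this PR may differ in details):

# Branch on the *values* of alpha/beta in the caller; each branch then
# constructs a MulAddMul whose type parameters are compile-time constants,
# so the callee always receives a concrete, inferable argument type.
if isone(alpha)
    if iszero(beta)
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{true, true, typeof(alpha), typeof(beta)}(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{true, false, typeof(alpha), typeof(beta)}(alpha, beta))
    end
else
    if iszero(beta)
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{false, true, typeof(alpha), typeof(beta)}(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta))
    end
end

The trade-off is that up to four specializations of the callee can be compiled per method instance, but no call ever needs runtime dispatch on the MulAddMul type.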

This is an (I think nicer) alternative to #47026.

PS: Go easy on me, this is my first macro.

N5N3 (Member) commented Oct 7, 2022

I personally prefer #47026, if @constprop could resolve the inlining concern there.
The macro-based solution would always generate 4 gemm_wrapper! calls, with 4 duplicated BLAS calls and four different generic_matmatmul! specializations, if alpha/beta are not constants.
I'm not sure how to balance the code bloat against the speed, though.

amilsted (Contributor Author) commented Oct 7, 2022

@N5N3 to clarify, are you concerned that, for non-const alpha and beta, the binary size of a function that inlines mul!() will be bigger because we now have calls to gemm_wrapper!() in each branch, rather than just construction of different MulAddMuls?

I guess this isn't true in comparison with master, since the previous code did runtime dispatch in this case; now we are merely branching. I think it should be a strict improvement over master. Unlike #47026, it also improves the non-BLAS paths when alpha and beta are genuinely not constant, by eliminating the possibility of MulAddMul-caused runtime dispatch.

It's true that #47026 avoids, in the non-const BLAS case, unnecessarily branching on the values of alpha and beta in Julia, but I suppose gemm_wrapper! already contains several branches over various function calls, so maybe the runtime and total code size don't change that much (can we easily measure the latter?).

The other issue is that the #47026 approach does not improve the non-BLAS cases over master, and likely makes them worse in the const case by making const-prop less likely to succeed. Do you see a way to improve this using the @constprop macro (which I wasn't previously aware of)? :) If so, a hybrid solution, where we push MulAddMul down a level so it doesn't touch the BLAS path, improve the chances that const-prop succeeds, and use this macro, might be best?

amilsted (Contributor Author) commented Oct 7, 2022

@Jutho Interested if you have any opinions on this :)

amilsted (Contributor Author) commented

So, I think both this PR (constructing MulAddMul in a type-stable way) and #47026 (pushing MulAddMul deeper so that it stays out of the BLAS path) should probably be done, but doing #47026 may cause performance regressions, as const-prop may not work as often as it currently does.

As such, perhaps this PR is the one to prioritize.

@N5N3 Should I go ahead and add the macro to the remaining MulAddMul() calls in linalg?

N5N3 (Member) commented Oct 12, 2022

@amilsted I tried this solution locally. At present it adds about 1 s of compile time for every new non-const 5-arg mul! call on my desktop.
You can verify it with, e.g.:

using LinearAlgebra

a = randn(ComplexF32, 60, 60); b = randn(ComplexF32, 60, 60); c = randn(ComplexF32, 60, 60);
@time mul!(a, b, c, true, false)  # each new alpha/beta type combination triggers a fresh compilation
@time mul!(a, b, c, 1, 1)
@time mul!(a, b, c, 1.0, 1.0)

I can't say the increased compile time is negligible, but maybe most workloads don't call mul! with so many different input types.

amilsted (Contributor Author) commented Oct 12, 2022

@N5N3 Hmm, interesting. I suppose it makes sense that compilation takes longer in the non-const case, as now the compiler can in principle see into the gemm_wrapper!() call and make e.g. an inlining decision? Previously, I suppose the runtime dispatch would have prevented that. I wonder if adding @noinline to gemm_wrapper!() brings the compile time back down?

N5N3 (Member) commented Oct 12, 2022

Adding @noinline changes nothing, as gemm_wrapper! is not inlined by default.

In fact, #34384 / #34601 did increase the compile time of gemm_wrapper!, as the most expensive part is generic_matmatmul!.
Before that PR, generic_matmatmul! was runtime-dispatched and thus would only be compiled if a BLAS call was not applicable.
I think we might want some inference-barrier trick to re-enable that runtime dispatch, since generic_matmatmul! is always slow anyway (this would make the BLAS path faster to compile).
Edit: reviving #34384 looks easier.

amilsted (Contributor Author) commented Oct 12, 2022

Let me see if I understand the issue. On master, runtime dispatch occurs when we call gemm_wrapper!(). With my changes, we now end up compiling 4 versions of gemm_wrapper!() rather than just the one we needed (which is what happened previously). Because gemm_wrapper!() calls out to generic_matmatmul!(), which is pretty heavy, this is quite a lot of extra compilation. Is that right?

Since we can't tell at compile time whether we will need generic_matmatmul!(), it does seem like a runtime-dispatch barrier is the only way to get around this, unless we change the interface and provide versions of mul!() that allow for specifying (and hence dispatching on) the method (e.g. mul!(A, B, C, alpha, beta; method=BLAS)).

If we want to restrict runtime dispatch to generic_matmatmul!(), we could indeed do something like #34384. We could pull generic_matmatmul!() up out of gemm_wrapper!() and put it in mul!(), but only call it if gemm_wrapper!() can't handle the job. If we do call generic_matmatmul!(), we do so with a type-unstable MulAddMul(), so runtime dispatch occurs there. Is this the kind of thing you had in mind @N5N3?

N5N3 (Member) commented Oct 13, 2022

We could pull generic_matmatmul!() up out of gemm_wrapper!() and put it in mul!()

There's no need to pull generic_matmatmul!() up out of gemm_wrapper!(). We just need to construct the MulAddMul inside gemm_wrapper! (just like #47026).
We would then compile only one gemm_wrapper! (alpha/beta are type-stable there), and generic_matmatmul! would be compiled only when needed (it's runtime-dispatched).

The combination of this PR with #47026 might be a good idea.
If we apply @stable_muladdmul to the matmatmul2x2/3x3 calls, then there's no need to inline the whole gemm_wrapper!.
It might be hard to check the performance impact on the 2x2/3x3 usage, as micro-benchmarks are not reliable here. But we have StaticArrays anyway.
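
A rough sketch of that shape (illustrative only; use_blas is a hypothetical stand-in for the existing stride/element-type checks, and the real signatures differ):

function gemm_wrapper!(C, tA, tB, A, B, alpha::Number, beta::Number)
    if use_blas(C, A, B)                       # hypothetical eligibility check
        return BLAS.gemm!(tA, tB, alpha, A, B, beta, C)
    end
    # MulAddMul is constructed only on the generic fallback path. Its type
    # depends on the runtime values of alpha/beta, so this call is
    # runtime-dispatched, and generic_matmatmul! is compiled only if this
    # path is actually taken.
    return generic_matmatmul!(C, tA, tB, A, B, MulAddMul(alpha, beta))
end

This way mul! passes alpha/beta straight through, so only one gemm_wrapper! specialization exists per element-type combination, while the expensive generic kernel stays behind a dispatch barrier.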

@dkarrasch do you think it's OK to make generic_matmatmul! always runtime-dispatched?

amilsted (Contributor Author) commented Oct 13, 2022

@N5N3 Thanks for the explanation. I guess if we push MulAddMul down into gemm_wrapper!() like in #47026, it becomes less likely that const-prop will reach down to MulAddMul, so the branches on the alpha/beta values can no longer be eliminated at compile time. This doesn't seem too terrible, but I wonder if there is a way to force const-prop to continue past mul!()?

Perhaps if we call gemm_wrapper!() with @constprop :aggressive? Edit: For the record, adding this didn't seem to help in my tests.
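
(For reference, the annotation goes on the method definition rather than the call site; a toy example, not the actual change:)

Base.@constprop :aggressive function scale_add!(C, A, alpha)
    # With :aggressive, inference tries harder to propagate a constant alpha
    # from the caller into this body, so the iszero branch can fold away.
    iszero(alpha) && return C
    C .+= alpha .* A
    return C
end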

amilsted (Contributor Author) commented Oct 13, 2022

@N5N3 I just added an inferencebarrier to the generic_matmatmul!() call. I don't know if I did it right, but if I did it should bring the compile time back down at the expense of always requiring runtime dispatch in that case, even for const alpha and beta.

Your suggestion is probably better, at least if const-prop continues to work.
Edit: Looks like it might not be possible to keep const-prop working. Also, for dense matrices it seems generic_matmatmul!() already allocates in any case. The runtime dispatch on top doesn't seem to make it any slower than it already is (tested for 4x4 Float64 matrices).
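
For reference, the barrier has roughly this shape (a sketch, not the exact diff; generic_fallback! is a hypothetical name). Base.inferencebarrier returns its argument but hides its type from inference, so the call always goes through runtime dispatch and generic_matmatmul! is compiled only for the MulAddMul type actually encountered:

using LinearAlgebra: MulAddMul, generic_matmatmul!

function generic_fallback!(C, tA, tB, A, B, alpha, beta)
    # The barrier makes the MulAddMul argument opaque to inference, forcing
    # runtime dispatch here regardless of whether alpha/beta are constants.
    generic_matmatmul!(C, tA, tB, A, B, Base.inferencebarrier(MulAddMul(alpha, beta)))
end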

dkarrasch (Member) commented

Superseded by #52439.

dkarrasch closed this May 22, 2024
Labels: linear algebra
3 participants