
Avoid type instability when constructing MulAddMul #47088

Closed
wants to merge 6 commits

Conversation

amilsted (Contributor) commented Oct 6, 2022

Use a macro to move the branching from the MulAddMul(alpha, beta) constructor, where it causes type instability (unless one has constant alpha and beta and gets lucky with constant propagation), to the calling function, where it does not cause type instability.

This should eliminate runtime dispatch in e.g. mul!(C, A, B, alpha, beta) calls where alpha and beta are not constant (see #46865).
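
Roughly, the shape of the expansion at a call site is something like the sketch below (illustrative only, assuming LinearAlgebra's internal MulAddMul{ais1, bis0, TA, TB} layout; the actual macro in this PR may differ in details):

# Branch on the *values* of alpha/beta in the caller; each branch then
# constructs a MulAddMul whose type parameters are compile-time constants,
# so the callee always receives a concrete, inferable argument type.
if isone(alpha)
    if iszero(beta)
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{true, true, typeof(alpha), typeof(beta)}(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{true, false, typeof(alpha), typeof(beta)}(alpha, beta))
    end
else
    if iszero(beta)
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{false, true, typeof(alpha), typeof(beta)}(alpha, beta))
    else
        gemm_wrapper!(C, tA, tB, A, B, MulAddMul{false, false, typeof(alpha), typeof(beta)}(alpha, beta))
    end
end

The trade-off is that up to four specializations of the callee can be compiled per method instance, but no call ever needs runtime dispatch on the MulAddMul type.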

This is an (I think nicer) alternative to #47026.

PS: Go easy on me, this is my first macro.

N5N3 (Member) commented Oct 7, 2022

I personally prefer #47026, if @constprop could resolve the inlining concern there.
The macro-based solution would always generate 4 gemm_wrapper! calls, with 4 duplicated BLAS calls and four different generic_matmatmul! specializations, if alpha/beta are not constants.
I'm not sure how to balance the code bloat against the speed, though.

amilsted (Contributor Author) commented Oct 7, 2022

@N5N3 to clarify, are you concerned that, for non-const alpha and beta, the binary size of a function that inlines mul!() will be bigger because we now have calls to gemm_wrapper!() in each branch, rather than just construction of different MulAddMuls?

I guess this isn't true in comparison with master, since the previous code did runtime dispatch in this case; now we are merely branching. I think it should be a strict improvement over master. Unlike #47026, it also improves the non-BLAS paths when alpha and beta are genuinely not constant, by eliminating the possibility of MulAddMul-caused runtime dispatch.

It's true that #47026 avoids, in the non-const BLAS case, unnecessarily branching on the values of alpha and beta in Julia, but I suppose gemm_wrapper! already contains several branches over various function calls, so maybe the runtime and total code size don't change that much (can we easily measure the latter?).

The other issue is that the #47026 approach does not improve the non-BLAS cases over master, and likely makes them worse in the const case by making const-prop less likely to succeed. Do you see a way to improve this using the @constprop macro (which I wasn't previously aware of)? :) If so, a hybrid solution, where we push MulAddMul down a level so it doesn't touch the BLAS path, improve the chances that const-prop succeeds, and use this macro, might be best?

amilsted (Contributor Author) commented Oct 7, 2022

@Jutho Interested if you have any opinions on this :)

amilsted (Contributor Author) commented

So, I think both this PR (constructing MulAddMul in a type-stable way) and #47026 (pushing MulAddMul deeper so that it stays out of the BLAS path) should probably be done, but doing #47026 may cause performance regressions, as const-prop may not work as often as it currently does.

As such, perhaps this PR is the one to prioritize.

@N5N3 Should I go ahead and add the macro to the remaining MulAddMul() calls in linalg?

N5N3 (Member) commented Oct 12, 2022

@amilsted I tried this solution locally. At present it adds about 1 s of compile time for every new non-const 5-arg mul! call on my desktop.
You can verify it with, e.g.:

using LinearAlgebra

a = randn(ComplexF32, 60, 60); b = randn(ComplexF32, 60, 60); c = randn(ComplexF32, 60, 60);
@time mul!(a, b, c, true, false)  # each new alpha/beta type combination triggers a fresh compilation
@time mul!(a, b, c, 1, 1)
@time mul!(a, b, c, 1.0, 1.0)

I can't say the increased compile time is negligible, but maybe most workloads don't call mul! with so many different input types.

amilsted (Contributor Author) commented Oct 12, 2022

@N5N3 Hmm, interesting. I suppose it makes sense that compilation takes longer in the non-const case, as now the compiler can in principle see into the gemm_wrapper!() call and make e.g. an inlining decision? Previously, I suppose the runtime dispatch would have prevented that. I wonder if adding @noinline to gemm_wrapper!() brings the compile time back down?

N5N3 (Member) commented Oct 12, 2022

Adding @noinline changes nothing, as gemm_wrapper! is not inlined by default.

In fact, #34384 / #34601 did increase the compile time of gemm_wrapper!, as the most expensive part is generic_matmatmul!.
Before that PR, generic_matmatmul! was runtime-dispatched and thus would only be compiled if a BLAS call was not applicable.
I think we might want some inference-barrier trick to re-enable that runtime dispatch, since generic_matmatmul! is always slow anyway (this would make the BLAS path faster to compile).
Edit: reviving #34384 looks easier.

amilsted (Contributor Author) commented Oct 12, 2022

Let me see if I understand the issue. On master, runtime dispatch occurs when we call gemm_wrapper!(). With my changes, we now end up compiling 4 versions of gemm_wrapper!() rather than just the one we needed (which is what happened previously). Because gemm_wrapper!() calls out to generic_matmatmul!(), which is pretty heavy, this is quite a lot of extra compilation. Is that right?

Since we can't tell at compile time whether we will need generic_matmatmul!(), it does seem like a runtime-dispatch barrier is the only way to get around this, unless we change the interface and provide versions of mul!() that allow for specifying (and hence dispatching on) the method (e.g. mul!(A, B, C, alpha, beta; method=BLAS)).

If we want to restrict runtime dispatch to generic_matmatmul!(), we could indeed do something like #34384. We could pull generic_matmatmul!() up out of gemm_wrapper!() and put it in mul!(), but only call it if gemm_wrapper!() can't handle the job. If we do call generic_matmatmul!(), we do so with a type-unstable MulAddMul(), so runtime dispatch occurs there. Is this the kind of thing you had in mind @N5N3?

N5N3 (Member) commented Oct 13, 2022

We could pull generic_matmatmul!() up out of gemm_wrapper!() and put it in mul!()

There's no need to pull generic_matmatmul!() up out of gemm_wrapper!(). We just need to construct the MulAddMul inside gemm_wrapper! (just like #47026).
We would then compile only one gemm_wrapper! (alpha/beta are type-stable there), and generic_matmatmul! would be compiled only when needed (it's runtime-dispatched).

The combination of this PR with #47026 might be a good idea.
If we apply @stable_muladdmul to the matmatmul2x2/3x3 calls, then there's no need to inline the whole gemm_wrapper!.
It might be hard to check the performance impact on the 2x2/3x3 usage, as micro-benchmarks are not reliable here. But we have StaticArrays anyway.
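
A rough sketch of that shape (illustrative only; use_blas is a hypothetical stand-in for the existing stride/element-type checks, and the real signatures differ):

function gemm_wrapper!(C, tA, tB, A, B, alpha::Number, beta::Number)
    if use_blas(C, A, B)                       # hypothetical eligibility check
        return BLAS.gemm!(tA, tB, alpha, A, B, beta, C)
    end
    # MulAddMul is constructed only on the generic fallback path. Its type
    # depends on the runtime values of alpha/beta, so this call is
    # runtime-dispatched, and generic_matmatmul! is compiled only if this
    # path is actually taken.
    return generic_matmatmul!(C, tA, tB, A, B, MulAddMul(alpha, beta))
end

This way mul! passes alpha/beta straight through, so only one gemm_wrapper! specialization exists per element-type combination, while the expensive generic kernel stays behind a dispatch barrier.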

@dkarrasch do you think it's OK to make generic_matmatmul! always runtime-dispatched?

amilsted (Contributor Author) commented Oct 13, 2022

@N5N3 Thanks for the explanation. I guess if we push MulAddMul down into gemm_wrapper!() like in #47026, it becomes less likely that const-prop will reach down to MulAddMul, so the branches on the alpha/beta values can no longer be eliminated at compile time. This doesn't seem too terrible, but I wonder if there is a way to force const-prop to continue past mul!()?

Perhaps if we call gemm_wrapper!() with @constprop :aggressive? Edit: For the record, adding this didn't seem to help in my tests.
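
(For reference, the annotation goes on the method definition rather than the call site; a toy example, not the actual change:)

Base.@constprop :aggressive function scale_add!(C, A, alpha)
    # With :aggressive, inference tries harder to propagate a constant alpha
    # from the caller into this body, so the iszero branch can fold away.
    iszero(alpha) && return C
    C .+= alpha .* A
    return C
end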

amilsted (Contributor Author) commented Oct 13, 2022

@N5N3 I just added an inferencebarrier to the generic_matmatmul!() call. I don't know if I did it right, but if I did it should bring the compile time back down at the expense of always requiring runtime dispatch in that case, even for const alpha and beta.

Your suggestion is probably better, at least if const-prop continues to work.
Edit: Looks like it might not be possible to keep const-prop working. Also, for dense matrices it seems generic_matmatmul!() already allocates in any case. The runtime dispatch on top doesn't seem to make it any slower than it already is (tested for 4x4 Float64 matrices).
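
For reference, the barrier has roughly this shape (a sketch, not the exact diff; generic_fallback! is a hypothetical name). Base.inferencebarrier returns its argument but hides its type from inference, so the call always goes through runtime dispatch and generic_matmatmul! is compiled only for the MulAddMul type actually encountered:

using LinearAlgebra: MulAddMul, generic_matmatmul!

function generic_fallback!(C, tA, tB, A, B, alpha, beta)
    # The barrier makes the MulAddMul argument opaque to inference, forcing
    # runtime dispatch here regardless of whether alpha/beta are constants.
    generic_matmatmul!(C, tA, tB, A, B, Base.inferencebarrier(MulAddMul(alpha, beta)))
end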

dkarrasch (Member) commented

Superseded by #52439.

dkarrasch closed this May 22, 2024
Labels: linear algebra
3 participants