Forward GEMM to GEMV when one argument is actually a vector #4814
Conversation
Thank you - somehow I had failed to realize I still had the bogus incx from the cursed first draft in my fork. (I still get a handful of new LAPACK testsuite failures on x86_64 with this - could be this is the rounding in the gemv microkernel for Haswell again)
That is unfortunate. I've made it opt-in per target so the x86_64 precision errors can be looked at separately. I've also disabled forwarding if … Does that make this merge-able, @martin-frbg? 😸
Didn't mean this would make it unmergeable; I haven't even gotten around to testing with TARGET=GENERIC yet (the inaccuracy had already been flagged in #4324, but it is not clear whether it is really significant in the general case)
@martin-frbg, @Mousius, …
We're also seeing regression in some of our PyTorch models:
Interesting that the OpenBLAS tests seem fine 🤔
@martin-frbg Out of curiosity, why still bother with Sparc? It's a dead architecture, almost in the domain of retro-computing. The last proper Sparc CPU release was circa 2017. Neither Oracle nor Fujitsu is making any new Sparc processors (they've moved on to x86-64 and ARM). Folks still using legacy Sparc hardware can always use older versions of OpenBLAS.
@conradsnicta as far as I know, there may still be distributions providing packages for Sparc hardware, and without the patch they would get a broken build due to utest failures. There are machines in the GCC compile farm I can test on, and the change seemed straightforward enough, but there is something about the ABI that has turned a simple "read another value from the stack into some register and see if it is non-zero" into an energy drain.
@martin-frbg Perhaps it's worth dropping Sparc from the list of supported architectures? Opportunity cost and all that. If somebody really cares about Sparc, they can always contribute patches. At best it's a tiny niche architecture: https://popcon.debian.org/
I think you will only see the inaccuracies in lapack-test (and when you run the reproducer from #4324). The trouble is that currently all ARM64 targets except A64FX use the same NEON kernels for GEMV, which may have been inspired by the equally affected x86_64 microkernels for Haswell and up (or vice versa), and I have failed to come up with an easy fix
It seems … Misread the output, I presume
The last line of the result array for the GEMV, though, is truncated to zero where less highly optimized implementations (including OpenBLAS' "generic" C target) show a -4e-16 ... I tried to explain it away with a good bit of handwaving in the original issue, but it at least appears to correlate with about 20 new failures each for S and D popping up in the LAPACK testsuite.
This helps to reduce values going missing as we accumulate.
@martin-frbg I've increased the number of accumulators in GEMV T, and that seems to have fixed the issue without impacting performance too much. Unsure how else to validate this 🤔 GEMV N is fundamentally different, and there are no reports of it being affected. Oh, and I think the SVE …
Great, thanks. I'll try to steal that for x86_64 too; somehow I only managed to break things further when I tried. I also had not realized that the LAPACK testsuite failures on arm64 were all related to GEMV_T; I did not have the impression that this was the case on x86_64. Anyway, we can probably make that opt-out now - I'll check all the other platforms (riscv appears to be fine already)
Question: if an architecture packs the data a specific way (e.g. we pack the data differently for P10 MMA than the generic code does) and it does NOT have an architecture-specific implementation of GEMV, won't that be a problem for this patch? Right now we do NOT have a P10 version of GEMV for BF16.
I believe you can match on CORE
Forwarding happens at the interface stage, so any special packing you (plan to) do for or in the kernel should be irrelevant, I hope. The only reason this is activated on a per-architecture basis is that there is an accuracy bug in some x86 GEMV kernels that I haven't been able to fix yet
I think the issue is you may have an optimized …
Yes, matching on CORE should be possible (except it would lead to some ugly long lines of conditionals...). As said, we are in interface/gemm.c here, right after the GEMM call has been received by the library and before the level3 driver even gets to think about partitioning the problem for the GEMM kernel
OK, I must admit having been too fixated on SGEMM/DGEMM; will disable the forwarding for SBGEMM in #4852 until …
I expanded this from #4708; the tests passed locally, but it likely needs more checking to ensure it's stable.
Fixes #4580