
Forward GEMM to GEMV when one argument is actually a vector #4814

Merged: 5 commits into OpenMathLib:develop on Jul 31, 2024

Conversation

@Mousius (Contributor) commented Jul 23, 2024

I expanded this from #4708; the tests passed locally, but it likely needs more checking to ensure it is stable.

Fixes #4580
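For readers skimming the diff, the idea can be sketched with plain CBLAS calls (column-major, no transposes, for brevity). This is a simplified illustration, not the actual patch, which operates inside OpenBLAS' GEMM interface; `maybe_forward_dgemm` is a hypothetical wrapper name used only for this sketch.

```c
/* Simplified sketch of the forwarding idea using standard CBLAS calls,
 * assuming column-major storage and no transposes. The real change lives
 * in OpenBLAS' GEMM interface; `maybe_forward_dgemm` is a hypothetical
 * wrapper used only for illustration. */
#include <cblas.h>

void maybe_forward_dgemm(int m, int n, int k, double alpha,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double beta, double *C, int ldc)
{
    if (n == 1) {
        /* C (m x 1) = alpha * A (m x k) * B (k x 1) + beta * C
         * is exactly a non-transposed GEMV; the single column of B and of C
         * is contiguous, so incx = incy = 1. */
        cblas_dgemv(CblasColMajor, CblasNoTrans, m, k, alpha, A, lda,
                    B, 1, beta, C, 1);
    } else if (m == 1) {
        /* C (1 x n) = alpha * A (1 x k) * B (k x n) + beta * C
         * is a transposed GEMV on B; the row of A is read at stride lda
         * and the row of C is written at stride ldc. */
        cblas_dgemv(CblasColMajor, CblasTrans, k, n, alpha, B, ldb,
                    A, lda, beta, C, ldc);
    } else {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                    alpha, A, lda, B, ldb, beta, C, ldc);
    }
}
```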

@Mousius changed the title from "Forward to GEMV when one argument is actually a vector" to "Forward GEMM to GEMV when one argument is actually a vector" on Jul 23, 2024
@martin-frbg (Collaborator) commented:

Thank you - somehow I had failed to realize I still had the bogus incx from the cursed first draft in my fork. (I still get a handful of new LAPACK testsuite failures on x86_64 with this - it could be the rounding in the GEMV microkernel for Haswell again.)

@Mousius (Contributor Author) commented Jul 24, 2024

That is unfortunate. I've made it opt-in per target so the x86_64 precision errors can be looked at separately. I've also disabled forwarding when ONLY_CBLAS=1, since that would require a different code path or some symbol hiding for the gemv parts.
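A rough sketch of how such an opt-in guard can sit at the GEMM entry point; the flag names GEMM_GEMV_FORWARD and ONLY_CBLAS come from this thread, but the surrounding structure (`gemm_args_t`, `forward_to_gemv`, `run_gemm_driver`) is hypothetical and only illustrates the shape of the change, not the actual interface code.

```c
/* Hedged sketch of the per-target opt-in. GEMM_GEMV_FORWARD and ONLY_CBLAS
 * are the build flags discussed above; everything else here is a
 * hypothetical stand-in for the real interface code. */
typedef struct { int m, n, k; } gemm_args_t;

void forward_to_gemv(const gemm_args_t *args);   /* GEMV path (sketch) */
void run_gemm_driver(const gemm_args_t *args);   /* regular GEMM path (sketch) */

void gemm_entry(const gemm_args_t *args)
{
#if defined(GEMM_GEMV_FORWARD) && !defined(ONLY_CBLAS)
    /* Only targets that opt in via GEMM_GEMV_FORWARD forward to GEMV;
     * ONLY_CBLAS=1 builds skip it, since (per the discussion above) that
     * would need a different code path or symbol hiding for the gemv parts. */
    if (args->m == 1 || args->n == 1) {
        forward_to_gemv(args);
        return;
    }
#endif
    run_gemm_driver(args);
}
```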

Does that make this mergeable, @martin-frbg? 😸

@martin-frbg (Collaborator) commented:

I didn't mean this would make it unmergeable - I haven't even gotten around to testing with TARGET=GENERIC yet (the inaccuracy had already been flagged in #4324, but it is not clear whether it is really significant in the general case).

@akote123 commented:

@martin-frbg, @Mousius,
I compared the original GEMM results against the changed GEMM (forwarded to GEMV) with a PyTorch linear layer on aarch64.
For the shapes [512, 2048] and [2048], 9 output values show an accuracy mismatch; their ratios are as follows:
0.9999868
0.99999
0.9999861
0.9998682
1.0000162
1.0002389
0.9999898
0.9998842
0.99992955

@Mousius
Copy link
Contributor Author

Mousius commented Jul 24, 2024

We're also seeing regressions in some of our PyTorch models:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Interestingly, the OpenBLAS tests seem fine 🤔

@martin-frbg (Collaborator) commented:

That's fallout from the SCAL changes of #4729 - see #4794 and #4807 for the fix. (I'm going crazy over making the same fairly trivial change in the Sparc scal kernel; otherwise these should be ready to merge.)

@conradsnicta commented:

@martin-frbg Out of curiosity, why still bother with Sparc? It's a dead architecture, almost in the domain of retro-computing. The last proper Sparc CPU release was circa 2017, and neither Oracle nor Fujitsu is making any new Sparc processors (they've moved on to x86-64 and ARM). Folks still using legacy Sparc hardware can always use older versions of OpenBLAS.

@martin-frbg (Collaborator) commented:

@conradsnicta as far as I know, there may still be distributions providing packages for Sparc hardware, and without the patch they would get a broken build due to utest failures. There are machines in the GCC compile farm I can test on, and the change seemed straightforward enough, but there is something about the ABI that has turned a simple "read another value from the stack into some register and see if it is non-zero" into an energy drain.

@conradsnicta commented:

@martin-frbg Perhaps it's worth dropping Sparc from the list of supported architectures? Opportunity cost and all that.

If somebody really cares about Sparc, they can always contribute patches. At best it's a tiny niche architecture: https://popcon.debian.org/

@martin-frbg (Collaborator) commented:

I think you will only see the inaccuracies in lapack-test (and when you run the reproducer from #4324). The trouble is that currently all ARM64 targets except A64FX use the same NEON kernels for GEMV, which may have been inspired by the equally affected x86_64 microkernels for Haswell and up (or vice versa), and I have failed to come up with an easy fix.

@Mousius (Contributor Author) commented Jul 29, 2024

> I think you will only see the inaccuracies in lapack-test (and when you run the reproducer from #4324). The trouble is that currently all ARM64 targets except A64FX use the same NEON kernels for GEMV, which may have been inspired by the equally affected x86_64 microkernels for Haswell and up (or vice versa), and I have failed to come up with an easy fix.

LD_PRELOAD=./libopenblas_neoversev1p-r0.3.27.dev.so ./blas_test
           BLAS                         MAT
1.175201193643801378e+00	1.175201193643801822e+00
1.103638323514327002e+00	1.103638323514327224e+00
3.578143506473725477e-01	3.578143506473724922e-01
7.045563366848892062e-02	7.045563366848883735e-02
9.965128148869423913e-03	9.965128148869309421e-03
1.099586127207508035e-03	1.099586127207556390e-03
9.945433911359019552e-05	9.945433911360671605e-05
7.620541308928086011e-06	7.620541308896932986e-06
5.064719743597123625e-07	5.064719744437437483e-07
2.971814125340976886e-08	2.971814140421127963e-08
1.560886614404566330e-09	1.560886564391797127e-09
7.419931336016816203e-11	7.419902813631537848e-11
3.221423128252354218e-12	3.221406076314455686e-12
1.287858708565181587e-13	1.289813102624490508e-13
4.440892098500626162e-15	4.440892098500626162e-15
0.000000000000000000e+00	-4.440892098500626162e-16
ddot: -3.885780586188047891e-16
dgemv: 0.000000000000000000e+00
ddot == dgemv? NO
dgemv is 0.0? YES

It seems dot might need some work, but nobody has complained yet 🤔

I misread the output - I presume ddot is close and it's dgemv that is off 🤔
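For anyone wanting to poke at this locally, a comparison in the same spirit as the output above can be assembled from standard CBLAS calls. This is not the reproducer from #4324, just a sketch that computes one reduction with ddot and with the equivalent transposed GEMV; the data is made up so that the summation order matters.

```c
/* Sketch of a ddot-vs-dgemv comparison in the spirit of the output above.
 * Not the reproducer from #4324. Link against OpenBLAS (e.g. -lopenblas). */
#include <stdio.h>
#include <math.h>
#include <cblas.h>

int main(void) {
    enum { N = 2048 };
    static double a[N], x[N];
    double y_dot, y_gemv = 0.0;

    /* Values of widely varying magnitude make rounding differences visible. */
    for (int i = 0; i < N; i++) {
        a[i] = sin(0.001 * i) * pow(10.0, (i % 7) - 3);
        x[i] = cos(0.002 * i);
    }

    y_dot = cblas_ddot(N, a, 1, x, 1);

    /* Treat `a` as an N x 1 column-major matrix; A^T * x is the same reduction. */
    cblas_dgemv(CblasColMajor, CblasTrans, N, 1, 1.0, a, N, x, 1, 0.0, &y_gemv, 1);

    printf("ddot : %.18e\n", y_dot);
    printf("dgemv: %.18e\n", y_gemv);
    printf("ddot == dgemv? %s\n", (y_dot == y_gemv) ? "YES" : "NO");
    return 0;
}
```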

@martin-frbg (Collaborator) commented:

The last line of the result array for the GEMV, though, is truncated to zero where less highly optimized implementations (including OpenBLAS' "generic" C target) show -4e-16 ... I tried to explain it away with a good bit of handwaving in the original issue, but it at least appears to correlate with about 20 new failures each for S and D popping up in the LAPACK testsuite.

@Mousius (Contributor Author) commented Jul 31, 2024

@martin-frbg I've increased the number of accumulators in GEMV T, and that seems to have fixed the issue without impacting performance too much. Unsure how else to validate this 🤔

GEMV N is fundamentally different, and there are no reports of that being affected.

Oh, and I think the SVE dot is likely just rounded differently/better, as it matches with the SVE gemv, which I think is fine?
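As a side note for readers, the effect of extra accumulators can be illustrated with a scalar toy example. This is not the NEON GEMV_T kernel, just a model of the rounding argument: splitting a long reduction across independent partial sums keeps each partial sum smaller relative to the incoming terms, so additions typically discard less low-order information.

```c
/* Toy model of single vs. multiple accumulators in a long reduction.
 * Not the OpenBLAS NEON kernel - just a scalar illustration of why the
 * number of accumulators changes rounding in a GEMV_T-style dot product. */
#include <stdio.h>
#include <math.h>

int main(void) {
    enum { N = 1 << 20 };
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1.0);               /* terms of decreasing magnitude */

    /* One accumulator: every addition rounds against an ever-growing sum. */
    double s1 = 0.0;
    for (int i = 0; i < N; i++)
        s1 += x[i];

    /* Four independent accumulators, combined at the end: each partial sum
     * stays smaller, so individual additions usually lose less information. */
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    double s4 = (a0 + a1) + (a2 + a3);

    /* Higher-precision reference to compare against. */
    long double ref = 0.0L;
    for (int i = 0; i < N; i++)
        ref += (long double)x[i];

    printf("1 accumulator : %.18e  (abs err %.3e)\n", s1, fabs(s1 - (double)ref));
    printf("4 accumulators: %.18e  (abs err %.3e)\n", s4, fabs(s4 - (double)ref));
    return 0;
}
```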

@martin-frbg (Collaborator) commented:

Great, thanks. I'll try to steal that for x86_64 too - somehow I only managed to break things further when I tried. I also had not realized that the LAPACK testsuite failures on arm64 were all related to GEMV_T; I did not have the impression that this was the case on x86_64. Anyway, we can probably make the forwarding opt-out now - I'll check all the other platforms (riscv appears to be fine already).

@martin-frbg added this to the 0.3.28 milestone on Jul 31, 2024
@martin-frbg merged commit 9afd0c8 into OpenMathLib:develop on Jul 31, 2024
69 of 78 checks passed
@ChipKerchner (Contributor) commented:

Question: if an architecture packs the data a specific way (e.g., we pack the data differently for P10 MMA than the generic code does) and it does NOT have an architecture-specific implementation of GEMV, won't that be a problem for this patch? Right now we do NOT have a P10 version of GEMV for BF16.

@Mousius (Contributor Author) commented Aug 8, 2024

> Question: if an architecture packs the data a specific way (e.g., we pack the data differently for P10 MMA than the generic code does) and it does NOT have an architecture-specific implementation of GEMV, won't that be a problem for this patch? Right now we do NOT have a P10 version of GEMV for BF16.

I believe you can match on ${CORE} in the Makefile to further filter GEMM_GEMV_FORWARD - is that right, @martin-frbg?

@martin-frbg (Collaborator) commented:

Forwarding happens at the interface stage, so any special packing you (plan to) do for or in the kernel should be irrelevant, I hope. The only reason this is activated on a per-architecture basis is that there is an accuracy bug in some x86 GEMV kernels that I haven't been able to fix yet.

@Mousius (Contributor Author) commented Aug 8, 2024

> Forwarding happens at the interface stage, so any special packing you (plan to) do for or in the kernel should be irrelevant, I hope. The only reason this is activated on a per-architecture basis is that there is an accuracy bug in some x86 GEMV kernels that I haven't been able to fix yet.

I think the issue is that you may have an optimized dgemm/sgemm but not an optimized sbgemv, so the sbgemm kernel may be faster than the forward in some cases?

@martin-frbg (Collaborator) commented:

Yes, matching on CORE should be possible (except it would lead to some ugly long lines of conditionals...). As said, we are in interface/gemm.c here, right after the call to GEMM has been received by the library and before the level3 driver even gets to think about partitioning the problem for the GEMM kernel.

@martin-frbg (Collaborator) commented:

> Forwarding happens at the interface stage, so any special packing you (plan to) do for or in the kernel should be irrelevant, I hope. The only reason this is activated on a per-architecture basis is that there is an accuracy bug in some x86 GEMV kernels that I haven't been able to fix yet.

> I think the issue is that you may have an optimized dgemm/sgemm but not an optimized sbgemv, so the sbgemm kernel may be faster than the forward in some cases?

OK, I must admit to having been too fixated on SGEMM/DGEMM; I will disable the forwarding for SBGEMM in #4852 until we have more data.

Successfully merging this pull request may close: Openblas sgemm is slower for small size matrices in aarch64 (#4580)