Forward GEMM to GEMV when one argument is actually a vector #4814
Conversation
Thank you - somehow I had failed to realize I still had the bogus incx from the cursed first draft in my fork. (I still get a handful of new LAPACK testsuite failures on x86_64 with this - could be this is the rounding in the gemv microkernel for Haswell again)
That is unfortunate. I've made it opt-in per target so the x86_64 precision errors can be looked at separately. I've also disabled forwarding if … Does that make this merge-able, @martin-frbg? 😸
Didn't mean this would make it unmergeable; I haven't even gotten around to testing with TARGET=GENERIC yet (the inaccuracy had already been flagged in #4324, but it is not clear whether it is really significant in the general case)
@martin-frbg, @Mousius, …
We're also seeing regression in some of our PyTorch models:
Interesting that the OpenBLAS tests seem fine 🤔
@martin-frbg Out of curiosity, why still bother with Sparc? It's a dead architecture, almost in the domain of retro-computing. The last proper Sparc CPU release was circa 2017. Neither Oracle nor Fujitsu is making any new Sparc processors (they've moved on to x86-64 and ARM). Folks still using legacy Sparc hardware can always use older versions of OpenBLAS.
@conradsnicta as far as I know, there may still be distributions providing packages for Sparc hardware, and without the patch they would get a broken build due to utest failures. There are machines in the GCC compile farm I can test on, and the change seemed straightforward enough, but there is something about the ABI that has turned a simple "read another value from the stack into some register and see if it is non-zero" into an energy drain.
@martin-frbg Perhaps it's worth dropping Sparc from the list of supported architectures? Opportunity cost and all that. If somebody really cares about Sparc, they can always contribute patches. At best it's a tiny niche architecture: https://popcon.debian.org/
I think you will only see the inaccuracies in lapack-test (and when you run the reproducer from #4324). The trouble is that currently all ARM64 targets except A64FX use the same NEON kernels for GEMV, which may have been inspired by the equally affected x86_64 microkernels for Haswell and up (or vice versa), and I have failed to come up with an easy fix
It seems … Misread the output, I presume
The last line of the result array for the GEMV, though, is truncated to zero where less highly optimized implementations (including OpenBLAS' "generic" C target) show a -4e-16 ... I tried to explain it away with a good bit of handwaving in the original issue, but it at least appears to correlate with about 20 new failures each for S and D popping up in the LAPACK testsuite.
This helps to reduce values going missing as we accumulate.
@martin-frbg I've increased the number of accumulators in GEMV T, and that seems to have fixed the issue without impacting performance too much. Unsure how else to validate this 🤔 GEMV N is fundamentally different, and there are no reports of it being affected. Oh, and I think the SVE …
Great, thanks. I'll try to steal that for x86_64 too; somehow I only managed to break things further when I tried. I also had not realized that the LAPACK testsuite failures on arm64 were all related to GEMV_T; I did not have the impression that this was the case on x86_64. Anyway, we can probably make that opt-out now - I'll check all the other platforms (riscv appears to be fine already)
Question: if an architecture packs the data a specific way (e.g. we pack the data differently for P10 MMA than the generic code does) and it does NOT have an architecture-specific implementation of GEMV, won't that be a problem for this patch? Right now we do NOT have a P10 version of GEMV for BF16.
I believe you can match on CORE
Forwarding happens at the interface stage, so any special packing you (plan to) do for or in the kernel should be irrelevant, I hope. The only reason this is activated on a per-architecture basis is that there is an accuracy bug in some x86 GEMV kernels that I haven't been able to fix yet
I think the issue is you may have an optimized …
Yes, matching on CORE should be possible (except it would lead to some ugly long lines of conditionals...). As said, we are in interface/gemm.c here, right after the GEMM call has been received by the library and before the level3 driver even gets to think about partitioning the problem for the GEMM kernel
OK, I must admit having been too fixated on SGEMM/DGEMM; will disable the forwarding for SBGEMM in #4852 until …
I expanded this from #4708; the tests passed locally, but it likely needs more checking to ensure it's stable.
Fixes #4580