OQS_DIST_BUILD with strange results on M1 #1201
Comments
Partial fix in https://github.com/open-quantum-safe/liboqs/tree/mb-aarch64-dist. @Martyrshot: I'd be glad for a glance-over before doing a PR, especially wrt ARM32. Remaining question: Is there any reason to ever run "-ref" (non-optimized) code on M1? If so, which combination of build options should activate it?
I pushed a small change to make the naming consistent for ARM32_V7 (here), otherwise it looks good to me! I personally think running the reference implementation on M1 is worth it to see the relative performance improvements "-noport" offers.
Thanks for this.
This performance differential is only visible if we have a platform that needs reference code to run. If there is no such ARM platform (as seems to be the case for M1), I'd suggest doing profiling only for a single setting, i.e., the default (-DOQS_DIST_BUILD=OFF). Or asked another way: What setting of
As decided in our call: Leave semantics as-is: DIST_BUILD basically behaves as
When looking at the performance results at https://openquantumsafe.org/benchmarking/visualization/speed_kem.html and filtering for `aarch64` and `Kyber` (as an algorithm supporting run-time switching), it becomes apparent that setting `OQS_DIST_BUILD` yields the slowest-running code on that architecture. At first blush I attributed that to "weak" CPU features offered by the AWS ARM VMs we use for profiling. However, the same now shows up when trying things on M1.

Isn't this counterintuitive, as this flag should dynamically select the fastest-running code? Especially on M1 silicon, where no optimization is unsupported, shouldn't code with this flag set be expected to perform as well as code built with `OQS_OPT_TARGET=auto` and `OQS_DIST_BUILD=OFF` (the "-noport" option in the benchmarking suite)?

On "x86_64" the performance behaviour is as expected: On machines/VMs with CPU features available, code built with `OQS_DIST_BUILD=ON` runs as fast as code built with `OQS_OPT_TARGET=auto` (or skylake) and `OQS_DIST_BUILD=OFF`. The slowest performance is visible with `OQS_DIST_BUILD=OFF` and `OQS_OPT_TARGET=generic` (i.e., the "-ref" setting).

On "aarch64", by contrast, as long as `OQS_DIST_BUILD=OFF`, no performance difference can be observed, regardless of the choice of `OQS_OPT_TARGET`. This in turn means that the "-ref" and "-noport" benchmark numbers are basically the same, which is also confusing (at least to me), as one was meant to show the performance of the reference implementation and the other that of the best optimized code. This also debunks my initial thought that the AWS aarch64 machines do not have all ARM performance features: They clearly do, as the performance numbers are (much) higher than with `OQS_DIST_BUILD=ON`.

This issue is a continuation of #1146, making me wonder whether #1148 is a correct fix.
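For reference, the three build configurations compared above can be written out as CMake invocations. The following is only a minimal sketch to make the comparison reproducible: the option names and values are the ones quoted in this issue, while the `dist` label, the `cmake_command` helper, the Ninja generator and the `..` source directory are assumptions made for illustration. What `OQS_OPT_TARGET` is set to in the `OQS_DIST_BUILD=ON` benchmark is not stated here, so the sketch leaves it unset.

```python
import shlex

# Benchmark variants named in this issue, mapped to the CMake options quoted above.
# "dist" is a hypothetical label for the OQS_DIST_BUILD=ON build.
VARIANTS = {
    "-ref": {"OQS_DIST_BUILD": "OFF", "OQS_OPT_TARGET": "generic"},
    "-noport": {"OQS_DIST_BUILD": "OFF", "OQS_OPT_TARGET": "auto"},
    "dist": {"OQS_DIST_BUILD": "ON"},  # OQS_OPT_TARGET left at its default (assumption)
}


def cmake_command(variant, source_dir="..", generator="Ninja"):
    """Return the cmake configure command line for one benchmark variant."""
    args = ["cmake", "-G", generator]
    # Each option becomes a -D<name>=<value> cache definition.
    args += [f"-D{name}={value}" for name, value in VARIANTS[variant].items()]
    args.append(source_dir)
    return shlex.join(args)


if __name__ == "__main__":
    for name in VARIANTS:
        print(f"{name:8s} {cmake_command(name)}")
```

Configuring each variant in its own build directory and running the same speed test against all three should show whether "-ref" and "-noport" really coincide on a given aarch64 machine, and how far the `OQS_DIST_BUILD=ON` build falls behind.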