-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression in mbedtls_mpi_exp_mod() (v3.6.0) #9232
Comments
Hi @jforissier, we have switched to a different algorithm to improve the security (constant time properties) of the algorithm. There is some performance cost of this change, but that 16x multiplier sounds more significant than what we expected. The new algorithm uses memory in a different way than the old one and the window size that was optimal for the old one on a platform, might be inefficient for the new one. Does it help if you try to change the window size |
Hi @yanesca, thanks for the advice. Unfortunately not much impact. With |
Hi @jforissier, did you also measure performance on real hardware? I tried it on two boards, based on Cortex-M0 and Cortex-M4 cores, and the performance regression compared to 2.25 (the "old" version that was most convenient for me to test as it's the one in Mbed OS) seemed much less drastic. STM NUCLEO F091RC (M0 48MHz) 1024-bit 2.25 exp_mod, cold r: 2150 ms Freescale FRDM-K64F (M4 120MHz) 1024-bit 2.25 exp_mod, cold r: 248 ms Here the new code is only about 30% slower, which is very different from the factor 16 you're observing. (On my laptop based on an x86-64 CPU, the difference between 3.5.1 (with it s default config) and 3.6.0 (with its default config) is too small to be reliably visible using the simple benchmark below. #include "mbedtls/bignum.h"
#include "bignum_core.h"
#include <stdio.h>
#include <time.h>
#define TIMEIT(TITLE, CODE) \
do { \
for (size_t i = 0; i < 1000; i++) { \
CODE; \
} \
clock_t start = clock(); \
for (size_t i = 0; i < 1000; i++) { \
CODE; \
} \
clock_t end = clock(); \
double duration = (double)(end - start) / CLOCKS_PER_SEC; \
printf("%s: %f ms\n", TITLE, duration); \
} while (0)
// python: hex(random.randrange(2 ** 1024))
const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";
int main(void) {
mbedtls_mpi x, a, e, n, r;
mbedtls_mpi_init(&x); mbedtls_mpi_init(&a); mbedtls_mpi_init(&e);
mbedtls_mpi_init(&n); mbedtls_mpi_init(&r);
mbedtls_mpi_read_string(&a, 16, A);
mbedtls_mpi_read_string(&e, 16, E);
mbedtls_mpi_read_string(&n, 16, N);
TIMEIT("1024-bit exp_mod, cold r", mbedtls_mpi_free(&r); mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
TIMEIT("1024-bit exp_mod, hot r", mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
} There must be some difference between your setup and the ones I tried that explains the large difference. Could it be:
That's all I can think of for now. Please let us know if you can test some of these ideas, and/or have others. |
FWIW, I've also done a few measurements on A-class (v7 and v8) and in both cases it looks like the change of the default window size was enough to compensate for the new, more secure algorithm being a bit slower. v7-A (Cortex-A7) 3.5.1 v8-A (eMag) 3.5.1 |
@mpg I'm seeing a very big performance impact on the STM32F207 devices (120MHZ Cortex-M3 core). The small (or rather, 'bearable') performance difference only appears with certain large exponents. Here are some benchmarks using the test code that you have provided:
V3.6 Bignum: V3.5.2 Bignum: ~23% difference, seems acceptable?
V3.6 Bignum: V3.5.2 Bignum: ~63 times slower?!!! |
@mjelintacharge Does the performance regression persist if you define E as
or alternatively calling |
Using mbedtls_mpi_shrink(..) does indeed help. |
Thanks for the new data. This makes a lot of sense: we made our But of course the public operations don't need to protect the exponent that way, as it's a public value. So, we should avoid wasting time protecting it. There's obviously a trade-off between performance of public operations and code size here: for code size we want the functions used for public and private operations to be as similar as possible, but for optimal performance we'd need them to be pretty different... We're exploring the solution space and will keep you posted. |
I'm labelling this "bug" and adding to the 3.6.1 EPIC because I don't think a 10x regression on RSA public key operations is OK. (And even more if for some reason e has more limbs than necessary.) RSA public operations as expected to be fast (that's one notable difference with ECDSA for example, where signature generation can be faster than verification). Quick results on my laptop (64-bit Intel): 3.5.1
3.6.0
Typical public RSA operations are the last line (17-bit exponent, hot r). |
See first draft of an attempt to reduce the regression: #9281 |
The performance regression has been significantly reduced in Mbed TLS 3.6.1. We're not planning further improvements to performance of RSA public operations as the moment, but I'm not closing this issue yet because the following work remains:
|
We observed that the
mbedtls_mpi_exp_mod()
function takes much more time to complete in v3.6.0 than in v3.5.2 (about 16 times more). The platform is OP-TEE OS running in QEMU (arm32). See OP-TEE/optee_os#6797 and in particular this commit which brings the function back to the v3.5.2 and solves the issue: jforissier/optee_os@f147fc1Any idea what's wrong?
Edit (mpg): status summary.
The text was updated successfully, but these errors were encountered: