
Performance regression in mbedtls_mpi_exp_mod() (v3.6.0) #9232

Closed
jforissier opened this issue Jun 6, 2024 · 11 comments · Fixed by #9281 or #9536

@jforissier

jforissier commented Jun 6, 2024

We observed that the mbedtls_mpi_exp_mod() function takes much more time to complete in v3.6.0 than in v3.5.2 (about 16 times more). The platform is OP-TEE OS running in QEMU (arm32). See OP-TEE/optee_os#6797 and in particular this commit which brings the function back to the v3.5.2 and solves the issue: jforissier/optee_os@f147fc1

Any idea what's wrong?


Edit (mpg): status summary.

@yanesca
Contributor

yanesca commented Jun 6, 2024

Hi @jforissier, we have switched to a different algorithm to improve security (specifically, its constant-time properties). Some performance cost was expected from this change, but a 16x slowdown is much more than we anticipated.

The new algorithm uses memory differently from the old one, and a window size that was optimal for the old algorithm on a given platform might be inefficient for the new one. Does it help if you change the window size (MBEDTLS_MPI_WINDOW_SIZE)?
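For reference, MBEDTLS_MPI_WINDOW_SIZE is a compile-time option, so experimenting with it means rebuilding with an override along these lines (the value 3 here is just an illustrative choice, not a recommendation):

```c
/* In mbedtls_config.h, or in a config override header:
 * larger windows precompute more values (more RAM, fewer multiplications);
 * the optimal value is platform-dependent. */
#define MBEDTLS_MPI_WINDOW_SIZE 3
```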

@jforissier
Author

Hi @yanesca, thanks for the advice. Unfortunately it has little impact. With MBEDTLS_MPI_WINDOW_SIZE set to 1, 2, 3, 6 and 7, my test takes 34, 25.6, 23.2, 22 and 21.4 seconds respectively. With v3.5.7 it takes 1.2 seconds.

@mpg
Contributor

mpg commented Jun 10, 2024

Hi @jforissier, did you also measure performance on real hardware? I tried it on two boards, based on Cortex-M0 and Cortex-M4 cores, and the performance regression compared to 2.25 (the "old" version that was most convenient for me to test as it's the one in Mbed OS) seemed much less drastic.

STM NUCLEO F091RC (M0 48MHz)

1024-bit 2.25 exp_mod, cold r: 2150 ms
1024-bit 2.25 exp_mod, hot r: 2130 ms
1024-bit 3.6 exp_mod, cold r: 2584 ms
1024-bit 3.6 exp_mod, hot r: 2564 ms

Freescale FRDM-K64F (M4 120MHz)

1024-bit 2.25 exp_mod, cold r: 248 ms
1024-bit 2.25 exp_mod, hot r: 244 ms
1024-bit 3.6 exp_mod, cold r: 319 ms
1024-bit 3.6 exp_mod, hot r: 315 ms

Here the new code is only about 30% slower, which is very different from the factor of 16 you're observing. (On my laptop, based on an x86-64 CPU, the difference between 3.5.1 and 3.6.0, each with its default config, is too small to be reliably visible using the simple benchmark below.)

#include "mbedtls/bignum.h"
#include "bignum_core.h"
#include <stdio.h>
#include <time.h>

#define TIMEIT(TITLE, CODE)                                         \
do {                                                                \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t start = clock();                                        \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t end = clock();                                          \
    double duration = (double)(end - start) / CLOCKS_PER_SEC;       \
    /* total seconds for 1000 iterations == ms per iteration */     \
    printf("%s: %f ms\n", TITLE, duration);                         \
} while (0)

// python: hex(random.randrange(2 ** 1024))
const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

int main(void) {
    mbedtls_mpi x, a, e, n, r;

    mbedtls_mpi_init(&x); mbedtls_mpi_init(&a); mbedtls_mpi_init(&e);
    mbedtls_mpi_init(&n); mbedtls_mpi_init(&r);

    mbedtls_mpi_read_string(&a, 16, A);
    mbedtls_mpi_read_string(&e, 16, E);
    mbedtls_mpi_read_string(&n, 16, N);

    TIMEIT("1024-bit exp_mod, cold r", mbedtls_mpi_free(&r); mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
    TIMEIT("1024-bit exp_mod,  hot r", mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));

    mbedtls_mpi_free(&x); mbedtls_mpi_free(&a); mbedtls_mpi_free(&e);
    mbedtls_mpi_free(&n); mbedtls_mpi_free(&r);
    return 0;
}

There must be some difference between your setup and the ones I tried that explains the large difference. Could it be:

  • QEMU vs real hardware?
  • OP-TEE doing something that would make memory accesses an order of magnitude slower?
  • a combination of the above (like QEMU's emulation of some memory protection mechanisms being very slow)?
  • a difference in mbedtls_config.h (I used the default one) or compile options (I used (arm)-gcc -O2)?

That's all I can think of for now. Please let us know if you can test some of these ideas, and/or have others.

@mpg
Contributor

mpg commented Jun 11, 2024

FWIW, I've also done a few measurements on A-class (v7 and v8) and in both cases it looks like the change of the default window size was enough to compensate for the new, more secure algorithm being a bit slower.

v7-A (Cortex-A7)

3.5.1
1024-bit exp_mod, cold r: 40.122913 ms
1024-bit exp_mod, hot r: 39.805613 ms
3.6.0
1024-bit exp_mod, cold r: 38.286153 ms
1024-bit exp_mod, hot r: 38.024482 ms

v8-A (eMag)

3.5.1
1024-bit exp_mod, cold r: 1.776005 ms
1024-bit exp_mod, hot r: 1.763511 ms
3.6.0
1024-bit exp_mod, cold r: 1.587849 ms
1024-bit exp_mod, hot r: 1.576246 ms

@mjelintacharge

mjelintacharge commented Jun 14, 2024

@mpg I'm seeing a very big performance impact on STM32F207 devices (120 MHz Cortex-M3 core).
I have tried everything, even going -O3, but the performance is still horrible.

The small (or rather, 'bearable') performance difference only appears with certain large exponents.
However, the problem reveals itself when doing math with some popular public RSA exponents, such as 0x10001 (65537).

Here are some benchmarks using the test code that you have provided:

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod, hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 488.300000 ms
1024-bit exp_mod, hot r: 485.300000 ms

~23% difference, seems acceptable?

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010001";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod, hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 10.200000 ms
1024-bit exp_mod, hot r: 7.200000 ms

~63 times slower?!!!

@yanesca
Contributor

yanesca commented Jun 17, 2024

@mjelintacharge Does the performance regression persist if you define E as

const char *E = "10001";

or, alternatively, by calling mbedtls_mpi_shrink() on E before using it?

@mjelintacharge

> @mjelintacharge Does the performance regression persist if you define E as
>
> const char *E = "10001";
>
> or, alternatively, by calling mbedtls_mpi_shrink() on E before using it?

Using mbedtls_mpi_shrink() does indeed help.
However, even with that I'm only getting ~100 ms per operation versus ~10 ms on 3.5.2, which is still a very significant slowdown.

@mpg
Contributor

mpg commented Jun 18, 2024

Thanks for the new data. This makes a lot of sense: we made our exp_mod function constant-time in 3.6, which is needed for the security of private operations: it should not leak any information about the private exponent through timing (except for its number of limbs, which is public anyway). And indeed, in this message we can observe that in 3.6 we get the exact same timing whether e is a 17-bit value or a full 1024-bit value. Which is comforting from a security perspective - our function behaves as expected :)

But of course the public operations don't need to protect the exponent that way, as it's a public value. So, we should avoid wasting time protecting it. There's obviously a trade-off between performance of public operations and code size here: for code size we want the functions used for public and private operations to be as similar as possible, but for optimal performance we'd need them to be pretty different... We're exploring the solution space and will keep you posted.

@mpg mpg added the bug label Jun 18, 2024
@mpg
Contributor

mpg commented Jun 18, 2024

I'm labelling this "bug" and adding it to the 3.6.1 EPIC because I don't think a 10x regression on RSA public-key operations is OK (and even more so if for some reason e has more limbs than necessary). RSA public operations are expected to be fast (that's one notable difference from ECDSA, for example, where signature generation can be faster than verification).

Quick results on my laptop (64-bit Intel):

3.5.1

1024/1024-bit exp_mod, cold r: 1.071714 ms
1024/1024-bit exp_mod,  hot r: 1.083624 ms
  17/1024-bit exp_mod, cold r: 0.022744 ms
  17/1024-bit exp_mod,  hot r: 0.016366 ms

3.6.0

1024/1024-bit exp_mod, cold r: 0.961081 ms
1024/1024-bit exp_mod,  hot r: 1.050500 ms
  17/1024-bit exp_mod, cold r: 0.100309 ms
  17/1024-bit exp_mod,  hot r: 0.177438 ms

Typical public RSA operations are the last line (17-bit exponent, hot r).

@mpg
Contributor

mpg commented Jun 18, 2024

See first draft of an attempt to reduce the regression: #9281

@mpg
Contributor

mpg commented Sep 2, 2024

The performance regression has been significantly reduced in Mbed TLS 3.6.1.

We're not planning further improvements to the performance of RSA public operations at the moment, but I'm not closing this issue yet because the following work remains:
