
Performance regression in mbedtls_mpi_exp_mod() (v3.6.0) #9232

Closed
jforissier opened this issue Jun 6, 2024 · 11 comments · Fixed by #9281 or #9536

@jforissier

jforissier commented Jun 6, 2024

We observed that the mbedtls_mpi_exp_mod() function takes much more time to complete in v3.6.0 than in v3.5.2 (about 16 times more). The platform is OP-TEE OS running in QEMU (arm32). See OP-TEE/optee_os#6797 and in particular this commit which brings the function back to the v3.5.2 and solves the issue: jforissier/optee_os@f147fc1

Any idea what's wrong?


Edit (mpg): status summary.

@yanesca
Contributor

yanesca commented Jun 6, 2024

Hi @jforissier, we have switched to a different algorithm to improve security (specifically, its constant-time properties). Some performance cost was expected from this change, but a 16x slowdown is much more than we anticipated.

The new algorithm uses memory differently from the old one, and a window size that was optimal for the old algorithm on a given platform might be inefficient for the new one. Does it help if you change the window size (MBEDTLS_MPI_WINDOW_SIZE)?
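For reference, MBEDTLS_MPI_WINDOW_SIZE is a compile-time option, so experimenting with it means rebuilding with an override along these lines (the value 3 here is just an illustrative choice, not a recommendation):

```c
/* In mbedtls_config.h, or in a config override header:
 * larger windows precompute more values (more RAM, fewer multiplications);
 * the optimal value is platform-dependent. */
#define MBEDTLS_MPI_WINDOW_SIZE 3
```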

@jforissier
Author

Hi @yanesca, thanks for the advice. Unfortunately it has little impact. With MBEDTLS_MPI_WINDOW_SIZE set to 1, 2, 3, 6 and 7, my test takes 34, 25.6, 23.2, 22 and 21.4 seconds respectively. With v3.5.7 it takes 1.2 seconds.

@mpg
Contributor

mpg commented Jun 10, 2024

Hi @jforissier, did you also measure performance on real hardware? I tried it on two boards, based on Cortex-M0 and Cortex-M4 cores, and the performance regression compared to 2.25 (the "old" version that was most convenient for me to test as it's the one in Mbed OS) seemed much less drastic.

STM NUCLEO F091RC (M0 48MHz)

1024-bit 2.25 exp_mod, cold r: 2150 ms
1024-bit 2.25 exp_mod, hot r: 2130 ms
1024-bit 3.6 exp_mod, cold r: 2584 ms
1024-bit 3.6 exp_mod, hot r: 2564 ms

Freescale FRDM-K64F (M4 120MHz)

1024-bit 2.25 exp_mod, cold r: 248 ms
1024-bit 2.25 exp_mod, hot r: 244 ms
1024-bit 3.6 exp_mod, cold r: 319 ms
1024-bit 3.6 exp_mod, hot r: 315 ms

Here the new code is only about 30% slower, which is very different from the factor of 16 you're observing. (On my laptop, based on an x86-64 CPU, the difference between 3.5.1 and 3.6.0, each with its default config, is too small to be reliably visible using the simple benchmark below.)

#include "mbedtls/bignum.h"
#include "bignum_core.h"
#include <stdio.h>
#include <time.h>

#define TIMEIT(TITLE, CODE)                                         \
do {                                                                \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t start = clock();                                        \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t end = clock();                                          \
    double duration = (double)(end - start) / CLOCKS_PER_SEC;       \
    /* total seconds for 1000 iterations == ms per iteration */     \
    printf("%s: %f ms\n", TITLE, duration);                         \
} while (0)

// python: hex(random.randrange(2 ** 1024))
const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

int main(void) {
    mbedtls_mpi x, a, e, n, r;

    mbedtls_mpi_init(&x); mbedtls_mpi_init(&a); mbedtls_mpi_init(&e);
    mbedtls_mpi_init(&n); mbedtls_mpi_init(&r);

    mbedtls_mpi_read_string(&a, 16, A);
    mbedtls_mpi_read_string(&e, 16, E);
    mbedtls_mpi_read_string(&n, 16, N);

    TIMEIT("1024-bit exp_mod, cold r", mbedtls_mpi_free(&r); mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
    TIMEIT("1024-bit exp_mod,  hot r", mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));

    mbedtls_mpi_free(&x); mbedtls_mpi_free(&a); mbedtls_mpi_free(&e);
    mbedtls_mpi_free(&n); mbedtls_mpi_free(&r);
    return 0;
}

There must be some difference between your setup and the ones I tried that explains the large difference. Could it be:

  • QEMU vs real hardware?
  • OP-TEE doing something that would make memory accesses an order of magnitude slower?
  • a combination of the above (like QEMU's emulation of some memory protection mechanisms being very slow)?
  • a difference in mbedtls_config.h (I used the default one) or compile options (I used (arm)-gcc -O2)?

That's all I can think of for now. Please let us know if you can test some of these ideas, and/or have others.

@mpg
Contributor

mpg commented Jun 11, 2024

FWIW, I've also done a few measurements on A-class (v7 and v8) and in both cases it looks like the change of the default window size was enough to compensate for the new, more secure algorithm being a bit slower.

v7-A (Cortex-A7)

3.5.1
1024-bit exp_mod, cold r: 40.122913 ms
1024-bit exp_mod, hot r: 39.805613 ms
3.6.0
1024-bit exp_mod, cold r: 38.286153 ms
1024-bit exp_mod, hot r: 38.024482 ms

v8-A (eMag)

3.5.1
1024-bit exp_mod, cold r: 1.776005 ms
1024-bit exp_mod, hot r: 1.763511 ms
3.6.0
1024-bit exp_mod, cold r: 1.587849 ms
1024-bit exp_mod, hot r: 1.576246 ms

@mjelintacharge

mjelintacharge commented Jun 14, 2024

@mpg I'm seeing a very big performance impact on STM32F207 devices (120 MHz Cortex-M3 core).
I have tried everything, even going -O3, but the performance is still horrible.

The small (or rather, 'bearable') performance difference only appears with certain large exponents.
However, the problem reveals itself when doing math with some popular public RSA exponents, such as 0x10001 (65537).

Here are some benchmarks using the test code that you have provided:

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod, hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 488.300000 ms
1024-bit exp_mod, hot r: 485.300000 ms

~23% difference, seems acceptable?

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010001";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod, hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 10.200000 ms
1024-bit exp_mod, hot r: 7.200000 ms

~63 times slower?!!!

@yanesca
Contributor

yanesca commented Jun 17, 2024

@mjelintacharge Does the performance regression persist if you define E as

const char *E = "10001";

or, alternatively, by calling mbedtls_mpi_shrink() on E before using it?

@mjelintacharge

> @mjelintacharge Does the performance regression persist if you define E as
>
> const char *E = "10001";
>
> or, alternatively, by calling mbedtls_mpi_shrink() on E before using it?

Using mbedtls_mpi_shrink() does indeed help.
However, even with that I'm only getting ~100 ms per operation versus ~10 ms on 3.5.2, which is still a very significant slowdown.

@mpg
Contributor

mpg commented Jun 18, 2024

Thanks for the new data. This makes a lot of sense: we made our exp_mod function constant-time in 3.6, which is needed for the security of private operations: it should not leak any information about the private exponent through timing (except for its number of limbs, which is public anyway). And indeed, in this message we can observe that in 3.6 we get the exact same timing whether e is a 17-bit value or a full 1024-bit value. Which is comforting from a security perspective - our function behaves as expected :)

But of course the public operations don't need to protect the exponent that way, as it's a public value. So, we should avoid wasting time protecting it. There's obviously a trade-off between performance of public operations and code size here: for code size we want the functions used for public and private operations to be as similar as possible, but for optimal performance we'd need them to be pretty different... We're exploring the solution space and will keep you posted.

@mpg mpg added the bug label Jun 18, 2024
@mpg
Contributor

mpg commented Jun 18, 2024

I'm labelling this "bug" and adding it to the 3.6.1 EPIC because I don't think a 10x regression on RSA public-key operations is OK (and even more so if for some reason e has more limbs than necessary). RSA public operations are expected to be fast (that's one notable difference from ECDSA, for example, where signature generation can be faster than verification).

Quick results on my laptop (64-bit Intel):

3.5.1

1024/1024-bit exp_mod, cold r: 1.071714 ms
1024/1024-bit exp_mod,  hot r: 1.083624 ms
  17/1024-bit exp_mod, cold r: 0.022744 ms
  17/1024-bit exp_mod,  hot r: 0.016366 ms

3.6.0

1024/1024-bit exp_mod, cold r: 0.961081 ms
1024/1024-bit exp_mod,  hot r: 1.050500 ms
  17/1024-bit exp_mod, cold r: 0.100309 ms
  17/1024-bit exp_mod,  hot r: 0.177438 ms

Typical public RSA operations are the last line (17-bit exponent, hot r).

@mpg
Contributor

mpg commented Jun 18, 2024

See first draft of an attempt to reduce the regression: #9281

@mpg
Contributor

mpg commented Sep 2, 2024

The performance regression has been significantly reduced in Mbed TLS 3.6.1.

We're not planning further improvements to the performance of RSA public operations at the moment, but I'm not closing this issue yet because the following work remains:
