> Am I missing any? Does that sound like a good categorization? #47664

Yawning · 2021-08-12T12:53:47Z

Am I missing any? Does that sound like a good categorization?

It maybe worth revisiting some of these in the medium to long-term, since the performance penalty for using fiat can be reduced.

As some of the participants in this issue know, I have been looking into fiat's Go performance recently. As of right now, there are a number of changes that can be made to fiat-crypto's 64-bit Go output that will provide performance increases for "real-world" use-cases.

Removing the addcarryxU64/subcarryxU64 wrappers dramatically improves CarryMul/CarrySquare performance (Go code cosmetic fixes mit-plv/fiat-crypto#949).
Selectznz is really really slow, compared to how edwards25519 does it.
Working around the compiler (aka "At most, people must copy and paste Go code.") provides gains for at least the following calls:
- Fusing Add/Sub/Opp with Carry (go: Additional curve25519 routines mit-plv/fiat-crypto#1004). I filed the issue for curve25519 because that was what I was looking at, but skimming the P-521 commit, it applies there as well.
- Providing a routine that does repeated squaring in a for loop is a moderate gain for curve25519, due to making inversions faster.
- Providing routines that merge Add/Sub with CarryMul and CarrySquare is a minor gain, though this depends on how the code is called.

* curve25519 and edwards25519 64-bit
  
  * Using the code we are already landing for edwards25519 (https://golang.org/cl/276272, https://golang.org/cl/315269) is a little faster on arm64, and way faster on amd64 (https://golang.org/cl/c/crypto/+/314889/4#message-f16cf2274fc82aaf4a5df6836517ed52eadd32b5). Fiat is 15% faster on POWER9, but that alone doesn't feel worth carrying it (https://golang.org/cl/c/crypto/+/315269/1#message-22671e8bfc8583e4ae329a88ba25ab5c49cc3016).

With the addcarryxU64 removal, and fused Add/Sub/Opp + Carry calls, the existing code is slightly faster than (fiat runtime is +9~17%) on amd64[0]. I did not benchmark copy-pasting CarrySquare into a for-loop, since the existing inversion code won't leverage it (so there is an optimization opportunity in the base code as well). Likewise I did not benchmark the impact of adding a * (b + c)/a * (b - c)/(a + b)^2 routines.

Under the assumption that upstream pulls in at least the optimizations I did benchmark (and they seemed open to the idea), the tradeoffs from my perspective are:

(Pro) "ed25519 and curve25519 32-bits" for "free". This will almost certainly be faster than the existing code as well.
(Pro) Formal verification of part of the implementation (though the current code is quite nice)
(Pro) Less assembly to maintain (amd64/arm64 multiply/square have assembly impls)
(Con) Slightly slower.

But this is contingent on changes to the generated code happening, so there is no rush. On a positive note, when the changes do happen P-521 will be faster for "free".

[0]: ScalarMult/ScalarBaseMult performance will be significantly worse if fiat's Selectznz is used, so I didn't.

Originally posted by @Yawning in #40171 (comment)

The text was updated successfully, but these errors were encountered:

Yawning · 2021-08-12T12:54:27Z

Whoops. Did not mean to open this here, sorry about the noise.

Yawning closed this as completed Aug 12, 2021

golang locked and limited conversation to collaborators Aug 12, 2022

gopherbot added the FrozenDueToAge label Aug 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

> Am I missing any? Does that sound like a good categorization? #47664

> Am I missing any? Does that sound like a good categorization? #47664

Yawning commented Aug 12, 2021

Yawning commented Aug 12, 2021

> Am I missing any? Does that sound like a good categorization? #47664

> Am I missing any? Does that sound like a good categorization? #47664

Comments

Yawning commented Aug 12, 2021

Yawning commented Aug 12, 2021