You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Am I missing any? Does that sound like a good categorization?
It maybe worth revisiting some of these in the medium to long-term, since the performance penalty for using fiat can be reduced.
As some of the participants in this issue know, I have been looking into fiat's Go performance recently. As of right now, there are a number of changes that can be made to fiat-crypto's 64-bit Go output that will provide performance increases for "real-world" use-cases.
Providing a routine that does repeated squaring in a for loop is a moderate gain for curve25519, due to making inversions faster.
Providing routines that merge Add/Sub with CarryMul and CarrySquare is a minor gain, though this depends on how the code is called.
* curve25519 and edwards25519 64-bit
* Using the code we are already landing for edwards25519 (https://golang.org/cl/276272, https://golang.org/cl/315269) is a little faster on arm64, and way faster on amd64 (https://golang.org/cl/c/crypto/+/314889/4#message-f16cf2274fc82aaf4a5df6836517ed52eadd32b5). Fiat is 15% faster on POWER9, but that alone doesn't feel worth carrying it (https://golang.org/cl/c/crypto/+/315269/1#message-22671e8bfc8583e4ae329a88ba25ab5c49cc3016).
With the addcarryxU64 removal, and fused Add/Sub/Opp + Carry calls, the existing code is slightly faster than (fiat runtime is +9~17%) on amd64[0]. I did not benchmark copy-pasting CarrySquare into a for-loop, since the existing inversion code won't leverage it (so there is an optimization opportunity in the base code as well). Likewise I did not benchmark the impact of adding a * (b + c)/a * (b - c)/(a + b)^2 routines.
Under the assumption that upstream pulls in at least the optimizations I did benchmark (and they seemed open to the idea), the tradeoffs from my perspective are:
(Pro) "ed25519 and curve25519 32-bits" for "free". This will almost certainly be faster than the existing code as well.
(Pro) Formal verification of part of the implementation (though the current code is quite nice)
(Pro) Less assembly to maintain (amd64/arm64 multiply/square have assembly impls)
(Con) Slightly slower.
But this is contingent on changes to the generated code happening, so there is no rush. On a positive note, when the changes do happen P-521 will be faster for "free".
[0]: ScalarMult/ScalarBaseMult performance will be significantly worse if fiat's Selectznz is used, so I didn't.
It maybe worth revisiting some of these in the medium to long-term, since the performance penalty for using fiat can be reduced.
As some of the participants in this issue know, I have been looking into fiat's Go performance recently. As of right now, there are a number of changes that can be made to fiat-crypto's 64-bit Go output that will provide performance increases for "real-world" use-cases.
addcarryxU64
/subcarryxU64
wrappers dramatically improvesCarryMul
/CarrySquare
performance (Go code cosmetic fixes mit-plv/fiat-crypto#949).Selectznz
is really really slow, compared to howedwards25519
does it.Add
/Sub
/Opp
withCarry
(go: Additional curve25519 routines mit-plv/fiat-crypto#1004). I filed the issue for curve25519 because that was what I was looking at, but skimming the P-521 commit, it applies there as well.Add
/Sub
withCarryMul
andCarrySquare
is a minor gain, though this depends on how the code is called.With the
addcarryxU64
removal, and fusedAdd
/Sub
/Opp
+Carry
calls, the existing code is slightly faster than (fiat runtime is +9~17%) on amd64[0]. I did not benchmark copy-pastingCarrySquare
into a for-loop, since the existing inversion code won't leverage it (so there is an optimization opportunity in the base code as well). Likewise I did not benchmark the impact of addinga * (b + c)
/a * (b - c)
/(a + b)^2
routines.Under the assumption that upstream pulls in at least the optimizations I did benchmark (and they seemed open to the idea), the tradeoffs from my perspective are:
But this is contingent on changes to the generated code happening, so there is no rush. On a positive note, when the changes do happen P-521 will be faster for "free".
[0]:
ScalarMult
/ScalarBaseMult
performance will be significantly worse if fiat'sSelectznz
is used, so I didn't.Originally posted by @Yawning in #40171 (comment)
The text was updated successfully, but these errors were encountered: