Scalar 4x64 performance improvements#453
Scalar 4x64 performance improvements#453peterdettman wants to merge 1 commit intobitcoin-core:masterfrom
Conversation
|
@peterdettman other than inlining, why is this faster? |
|
@dcousens The existing code is using macros not functions, so I doubt there is any benefit from writing inline code per se. Pre-loading a->d[0] etc. I don't think makes much difference (presumably the compiler does it anyway). Did you mean something else? My assumption would be that the improvement is coming from using several uint128_t accumulators instead of the existing "160 bit accumulator" - muladd2 is quite awkward as a result, and presumably where the extra time is going. |
|
New commit rewrites secp256k1_reduce_512. Cumulative speed improvement for _scalar_inverse is now measured at >30%, bench_sign measurements improved by ~7-8%. The "trick" used with p4 (see lines 588, 611 and comments) warrants careful review. |
|
Benchmarked
|
|
Is this ready to be merged? |
|
No, it needs thorough review of the carry handling and modular reduction, which can have very subtle bugs that random tests won't catch. I'd also like to get around to rewriting _scalar_mul in the same spirit, although I expect less improvement from that, and probably putting the code back into a cleaner macro-style like the original code. |
5c70a50 to
4372e28
Compare
7ef4a3e to
2bc31d3
Compare
|
@peterdettman Are you still interested in this? It seems like a reasonable improvement, and I'm willing to look into verifying that the double-correction trick always yields the right answer. I do think we'll want it in macro-style though. |
Or instead of macros, actual (inlinable) C functions with type checking and all that modern stuff. ;) |
(Not for immediate merge)
In #452 I noted that sqr and mul take about the same time in my config (OSX, 64-bit, no-asm, -O3 -march=native), so this is a quick attempt to speed up _scalar_sqr. This initial commit rewrites _scalar_sqr_512 for an ~ 8% improvement in _scalar_sqr. Second opinions/measurements would be appreciated.
It seems from the measurements that _scalar_reduce_512 is the real heavyweight here, so I'll be trying to re-implement that next.
I can rewrite in terms of macros (the current local code style) prior to any merge.