Skip to content

Conversation

@mitschabaude
Copy link
Contributor

@mitschabaude mitschabaude commented Oct 1, 2024

WIP, experiments towards #2634

define a "wasm-friendly" Field and test it with Poseidon

TODO: Poseidon results don't agree yet, need unit tests

Performance looks promising though: Close to 4x improvement, to reproduce run

./bench-wasm.sh --bench poseidon_bench

@volhovm volhovm self-assigned this Oct 3, 2024
Base automatically changed from perf/poseidon-wasm to develop October 11, 2024 16:26
@sebastiencs
Copy link

Hi @mitschabaude, this is great work.
I've tried to integrate this implementation in openmina, but mul_assign does not give correct result:
openmina@3f8ed1f
When we multiply a number by 1, the result is zero

@mitschabaude
Copy link
Contributor Author

mitschabaude commented Oct 14, 2024

Hi @mitschabaude, this is great work. I've tried to integrate this implementation in openmina, but mul_assign does not give correct result: openmina@3f8ed1f When we multiply a number by 1, the result is zero

@sebastiencs thanks for testing it, yeah these low-level algorithms are finicky, it will need some debugging to get right. I have an equivalent implementation here though that works perfectly, so I'm sure there's just some small fixable mistake.

I won't be able to finish this PR btw, no longer working at o1Labs -- sorry for leaving this here in an unsatisfying state :)

@volhovm
Copy link
Contributor

volhovm commented Oct 14, 2024

@sebastiencs I'll look into this PR sometime soon, perhaps tomorrow, and will sync on what's the plan for continuing.

@volhovm volhovm force-pushed the perf/wasm-friendly-field branch from d727c5f to a2422ff Compare November 19, 2024 12:05
@volhovm
Copy link
Contributor

volhovm commented Nov 21, 2024

I'm trying to see if we should proceed with it. Ran some benchmarks.

Benchmark results (in native rust) are as follows:

  1. A native multiplication is 15ns, while Fp9 multiplication is 25ns.
    • A native multiplication in WASM is 81ns, which is 5.4x more expensive than native.

Benchmark results (in WASM) are as follows:

  1. A poseidon hash is 122 mcs in Fp vs 36 mcs in Fp9 (x3.38 improvement)
  2. A conversion of a field element from fp to fp9 takes 54ns
    • For a vector batch of 2^16 elements, conversions cost 3.5ms, including the cost of creating a new vector and copying everything there.
      • So for 16 columns, it's 56ms to convert everything.
      • This can be /easily/ parallelised also, and thus sped up several times.
  3. A single multiplication in fp is 81ns vs 36ns in Fp9 (x2.25 improvement)
  4. A single multiplication in fp9 with two conversions is 120ns.
  5. Two multiplications in Fp take 157ns (z = x * y; z = z * x), while two conversions from Fp and two multiplications take 147ns.
    1. With batch of 4 multiplications, we'll have 315ns vs 198ns, so 1.59x
    2. And so on approaching x2.25.

@mitschabaude two questions, maybe you know?

  • Why is poseidon 3.38x faster, while multiplications are only 2.25x faster -- is poseidon using something like doubles instead of just multiplications? or add-in-place?
  • Do you have any numbers for other functions, e.g. how much would it cost to do a 2^16 MSM?
    • My motivation: it seems that if we /at least/ do a multiplication per conversion, we're faster using Fp9. The question would be: if conversion to representation for 2^16 takes 3.5ms and is parallelisable... and if the native MSM takes several hundred milliseconds, how big of a factor can we save? It seems plausible that in the case of MSM conversions are negligibly cheap (under 1%), but I'd like to verify it somehow.

@mitschabaude
Copy link
Contributor Author

Two important comments first:

  • I wouldn't trust any of the Fp9 numbers yet because of Wasm-friendly Field #2638 (comment)
    • and also, there might be a small additional amount of optimization possible
  • There are two different kinds of Fp <-> Fp9 conversions to consider, depending on whether we convert because we're using both kinds of arithmetic on the same field element, OR we only use Fp9-style arithmetic, and the conversion is just done inside multiplication to match the 4 x 64 byte layout
    • in the first case, conversion needs to move between different montgomery representations i.e. $2^N x$ to $2^M x$. right now this probably uses two separate ff multiplications but it could be done with 1 multiplication. this is what you benchmarked @volhovm
    • in the second case, the field element is already in the right montgomery representation, and we only read out its bytes in a slightly shifted way. this is way cheaper and feasible to do within a single multiplication

@mitschabaude
Copy link
Contributor Author

In a concrete scenario: If we're moving to a different representation just for the MSM, then we'll do on the order of 50-100 multiplications per conversion. so at that point the conversion is negligible

@mitschabaude
Copy link
Contributor Author

mitschabaude commented Nov 21, 2024

Do you have any numbers for other functions, e.g. how much would it cost to do a 2^16 MSM?

I have detailed benchmarks for Wasm Pasta MSM of my own implementation that uses the Fp9 algorithm

Just to throw out a number, on my machine a 2^16 Pasta MSM using 16 threads takes about 80ms.
(I can show you how to run those benchmarks on your machine as well)

I know of another, better optimized project that probably does it in 50-60ms. (Using yet another field representation that I prototyped as well, that comes with some additional complications)

What I sadly don't have is detailed benchmarks for the Kimchi arkworks MSM. What I can offer is the table in this README, which compares single-threaded MSM running in arkworks with an older version of my code, on a 384-bit curve. In that comparison, my code showed a 7x speedup: https://github.com/mitschabaude/montgomery/blob/main/doc/zprize22.md

Caveat: Some of that 7x is not due to different field arithmetic but better high-level MSM algorithms, and I don't know exactly what contributed what.

@mitschabaude
Copy link
Contributor Author

@volhovm here's how to run my Wasm Pasta MSM on your machine:

git clone [email protected]:mitschabaude/montgomery.git
cd montgomery
npm i
npm run evaluate

it will run the MSM 10 times, display the timings, average, deviation and more fine-grained details of the timing for one particular run

@mitschabaude
Copy link
Contributor Author

I also have raw multiplication benchmarks, that match up well with your numbers:

npm run benchmark

for the fp9 representation I get

multiply montgomery      37ns

which matches your 36ns very well.

for the more complicated "51x5" representation I even get down to

multiply 51x5            27ns

here, a major caveat is that this is only that fast if you do two multiplications at once (because that's the only known efficient way to leverage SIMD)

@mitschabaude
Copy link
Contributor Author

Why is poseidon 3.38x faster, while multiplications are only 2.25x faster -- is poseidon using something like doubles instead of just multiplications? or add-in-place?

fundamentally, I think the multiplication benchmarks are sound and show the real picture. There's no extra speedup to be expected in Poseidon (it's all field multiplication). So my guess is that the extra speed-up is not real and due to something getting zeroed out bc of a bug which makes it faster

@sebastiencs
Copy link

@mitschabaude I've tested the javascript implementation (Pallas/Fp), but it seems that it gives different results than mul_assign from algebra here.

When I call mul_assign([1, 0, 0, 0], [1, 0, 0, 0]) in algebra, the result is:
[ 14933811601259063721, 12438865661331504215, 8127635680764926565, 2445952672640170197, ]
Or in [u32; 9]:
[ 329738665, 435975226, 503256563, 527750272, 161897161, 348878820, 55952172, 535020623, 2224580, ]

When I run the same operation in js (multiply 1 by 1), I get a different result:
[ 10705750787161406286, 5046208377184785134, 12359664413395797203, 3391085346764690374, ]
Or in [u32; 9]:
[ 312294222, 76791028, 214219685, 259494129, 72168544, 212229055, 253406745, 83828258, 3084174, ]

I am probably missing something

@mitschabaude
Copy link
Contributor Author

@sebastiencs amazing, thanks for finding the bug!!

I've tested the javascript implementation (Pallas/Fp), but it seems that it gives different results than mul_assign from algebra here.

they use different montgomery representations, you probably have to account for that in testing. there could be a from_bigint() at the beginning of each test on every input and a to_bigint() at the end for example, this would do the necessary conversion.
(the tests here do that as well: https://github.com/mitschabaude/montgomery/blob/main/src/field.test.ts)

@mitschabaude
Copy link
Contributor Author

after making @sebastiencs's bug fix, for me Poseidon is down to a 2.6x improvement. still pretty nice! results still don't agree for full poseidon though.

@mitschabaude
Copy link
Contributor Author

mitschabaude commented Nov 25, 2024

also, the Fp9 mul benchmark is now slower (52ns) for me than the JS version (37ns). however, it comes down to about the same as the JS version (34ns) when I refactor the benchmark to look like this:

pub fn bench_basic_ops(c: &mut Criterion) {
    let mut group = c.benchmark_group("Basic ops");
    let x0: Fp = rand::random();
    let x: Fp = x0;
    let mut z: Fp = x0;

    group.bench_function("Native multiplication in Fp (single)", |b| {
        b.iter(|| {
            z = z * x;
        });
    });

    let x_fp9: Fp9 = x0.into();
    let mut z_fp9: Fp9 = x0.into();

    group.bench_function("Multiplication in Fp9 (single)", |b| {
        b.iter(|| {
            z_fp9 = z_fp9 * x_fp9;
        });
    });
  }

@volhovm AFAIK, in JS hosts there is no very accurate timing available, so I would make sure to structure benchmarks such that definitely a lot of individual operations are done within a single timing. I'm not sure if your current benchmark, which seems to create new field elements for every multiplication, enables a completely accurate measurement.

Co-authored-by: Sebastien Chapuis <[email protected]>
@sebastiencs
Copy link

sebastiencs commented Nov 27, 2024

@mitschabaude Thanks, I've tested that implementation in our Webnode, and we get ~40% faster in poseidon hashing:
Hashing the full ledger (testnet) now takes 15 seconds, instead of 25 seconds.

o1-labs/mina-rust#939
The implementation is here: https://github.com/openmina/algebra/blob/6ca5fe8940ebbfe710601615655242ae89e17b0f/ff/src/fields/models/webnode_new.rs#L386

I see that your js implementation also has a square implementation, do you think it's worth porting to Rust ? Is it faster ?
Tests for that square are failing for Pallas/Fq, so not sure if I can use that in our WebNode.

Lastly, is there other parts worth porting from your repository, that would make our WebNode faster ?

Thanks again for your work !

@mitschabaude
Copy link
Contributor Author

@mitschabaude Thanks, I've tested that implementation in our Webnode, and we get ~40% faster in poseidon hashing:
Hashing the full ledger (testnet) now takes 15 seconds, instead of 25 seconds.

@sebastiencs that's awesome!

I see that your js implementation also has a square implementation, do you think it's worth porting to Rust ? Is it faster ?
Tests for that square are failing for Pallas/Fq, so not sure if I can use that in our WebNode.

Yes, absolutely square should be ported as well. dedicated squaring is much faster than multiplication, like 28ns vs 37ns on my machine, and Poseidon uses 2 squarings per state element for the computation of x^7. not sure what you mean by tests are failing

@sebastiencs
Copy link

sebastiencs commented Nov 27, 2024

@mitschabaude Indeed, I just ported it and we can hash the ledger in 14 seconds with this dedicated square, so total improvement is 44%.

not sure what you mean by tests are failing

hmm it's actually the test for sqrt failing, not square: (using npm run test)

test at build/src/field.test.js:17:15
✖ pastaFq w=29 (147.170791ms)
  Error: pastaFq w=29 sqrt

  Failed - error during test execution:
  pastaFq w=29 sqrt: Expect equal results

  Deep equality failed:

  actual:   13019741092779036328468870678170431412962196708002643505427026797581769551489n
  expected: 11692013098647223345629478661730264157247460343809n

@volhovm
Copy link
Contributor

volhovm commented Dec 2, 2024

@sebastiencs Can you comment a bit on how hard would it be to port the changes you introduced in openmina into kimchi/curves/pasta/wasm-friendly? It seems quite similar, but it's hard to judge if your implementation relies on anything that is openmina specific, or mostly arkworks.

Another question or way to put a question: how big is the gap between MinimalField from this PR, and a full Field implementation?

@volhovm
Copy link
Contributor

volhovm commented Dec 5, 2024

Status update: I've been trying to estimate how complicated it would be to bring an alternative field and plug it into kimchi tests. With a lot of stubs I managed to do it: the most intense parts are implementing bignum for 32 bits and Field (PrimeField also perhaps).

On the branch (name):

My hypothesis would be that we could indeed continue porting it, and we'd perhaps need about 3-5k lines of code (primarily Bigints and Field implementation) fairly isolated in proof-systems/curves. I don't see how this would significantly affect anything above that crate, since after the KimchiCurve is defined, we use it pretty much as a black-box.

Any comments? Anything I'm missing in terms of how this can be /harder/ to port than what I just described? @mitschabaude @sebastiencs

@mitschabaude
Copy link
Contributor Author

With a lot of stubs I managed to do it

Nice!

Any comments? Anything I'm missing in terms of how this can be /harder/ to port than what I just described?

Nothing that I can think of, your description matches what I would've thought!

@volhovm
Copy link
Contributor

volhovm commented Aug 27, 2025

@dannywillems @0xMimir So I stopped at optimising the montgomery multiplication -- the optimised version is not fully debugged, but it should be closed if someone wants to restart. Otherwise the code compiles and runs.

Here's the snapshot that I didn't have time to commit: volhovm/wasm-friendly-field-port...volhovm:proof-systems:volhovm/wasm-friendly-field-port

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants