MSM tuning for high core count #227

mratsim · 2023-04-14T08:41:51Z

This PR optimizes multi-scalar multiplication (MSM) for machines with a high core count.

Benchmarks were run on an i9-9980XE (Skylake-X, 18 cores, overclocked and liquid-cooled, all-core turbo at 4.1 GHz).

Features

Reentrancy / nested parallelism: a new precise barrier, syncScope, has been introduced to the threadpool.
Contrary to syncAll, which can only be called from the root thread and so prevents nested parallelism, syncScope can be called from any thread. Hence parallel MSM and parallel sum reductions / batch additions can be called from within other parallel functions, for example a ZK prover that needs to schedule multiple parallel MSMs in parallel.

Bug fix

The parallel speedup bench reported the perf ratio of the last iteration instead of the average of all iterations.
Given that most of Constantine is constant-time and the CPU was primed/hot, there were few variations but still ...
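To illustrate the bench bug, here is a minimal sketch of the two ways of reporting the speedup. The function names and the timing values are made up for illustration and are not taken from Constantine's bench harness.

```python
def speedup_last(serial_ns, parallel_ns):
    """Buggy report: ratio of the last iteration only."""
    return serial_ns[-1] / parallel_ns[-1]


def speedup_mean(serial_ns, parallel_ns):
    """Fixed report: ratio of the average time over all iterations."""
    mean_serial = sum(serial_ns) / len(serial_ns)
    mean_parallel = sum(parallel_ns) / len(parallel_ns)
    return mean_serial / mean_parallel


# Hypothetical per-iteration timings (nanoseconds); the last parallel
# iteration is an outlier, e.g. due to a scheduling hiccup.
serial = [100.0, 100.0, 100.0, 100.0]
parallel = [10.0, 10.0, 10.0, 20.0]

print("last iteration:", speedup_last(serial, parallel))
print("average       :", speedup_mean(serial, parallel))
```

With a single outlier in the last iteration, the two reports diverge noticeably, which is why the average is the safer number to publish.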

Before

After

Observations

From 512 inputs and up, we get a 50% to 150% improvement in core utilization (yes, up to 2.5x). Note that somehow on 8 cores the previous strategy at 512 inputs displayed over 7x speedup while the new strategy only provides 5.5x. We choose to privilege scaling on high core counts.

At the top of our bench range, 262144 inputs (2^18), the multithreading speedup is now over 15x instead of just 11x previously.
We might be reaching the limit of Amdahl's Law: the serial reductions might become a bottleneck for further parallelism, so going further might require algorithmic refactoring.
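As a rough sanity check on the Amdahl's Law remark, the sketch below inverts the law to estimate which serial fraction would explain a given speedup on 18 cores. It assumes the idealized model S = 1 / (s + (1 - s) / N) and ignores memory bandwidth, synchronization overhead and clock effects, so the numbers are indicative only. The function names are illustrative.

```python
def amdahl_speedup(serial_fraction, cores):
    """Ideal speedup on `cores` cores when `serial_fraction` of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def implied_serial_fraction(speedup, cores):
    """Invert Amdahl's Law: the serial fraction that yields `speedup` on `cores` cores."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


CORES = 18  # core count of the benchmark machine
for observed in (11.0, 15.0):  # speedups quoted above for 2^18 inputs
    s = implied_serial_fraction(observed, CORES)
    print(f"{observed:.0f}x on {CORES} cores -> serial fraction ~{s:.2%}, "
          f"ceiling ~{1.0 / s:.0f}x with unlimited cores")
```

Under this model, going from 11x to 15x corresponds to shrinking the serial share from roughly 3.7% to roughly 1.2% of the work. Since the measured speedup is "over 15x", the actual serial share would be somewhat smaller still.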

mratsim · 2023-04-14T08:46:54Z

After bf04281

References BLS12-381

Gnark

Gnark has a performance issue between 8192 and 16384 points and is likely changing strategy to something that is not effective on 18 cores. cc @yelhousni @gbotrel

Constantine 4096   -> 5.10ms
Gnark 4096         -> 5.37ms
Constantine 8192   -> 5.98ms
Gnark 8192         -> 5.90ms
Constantine 16384  -> 12.47ms
Gnark 16384        -> 30.53ms
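To make the jump between 8192 and 16384 points easier to see, the timings above can be turned into ratios. This is a small sketch that only reuses the six numbers listed; the variable and function names are illustrative.

```python
# (Constantine ms, Gnark ms) per number of points, copied from the list above.
timings_ms = {
    4096: (5.10, 5.37),
    8192: (5.98, 5.90),
    16384: (12.47, 30.53),
}


def gnark_over_constantine(timings):
    """Ratio > 1 means Constantine is faster, < 1 means Gnark is faster."""
    return {n: gnark / constantine for n, (constantine, gnark) in timings.items()}


ratios = gnark_over_constantine(timings_ms)
for n, ratio in ratios.items():
    print(f"{n:>6} points: Gnark / Constantine = {ratio:.2f}x")
```

The two libraries are within a few percent of each other up to 8192 points, then Gnark takes roughly 2.4x as long at 16384 points.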

blstrs

BLST multithreading doesn't scale to high core counts: for example, it is 2.39x slower for 65536 (2^16) inputs.

Bellman

Constantine is 5.22x faster than Bellman (Zcash backend)

BN-254 Snarks

Constantine

Gnark

Constantine is 3.18x faster than Gnark at 65536 inputs.
However, it is 1.49x slower at 4194304 (2^22) inputs.

Bellman CE

RUSTFLAGS="-C target-cpu=native -C target_feature=+bmi2,+adx,+sse4.1" cargo +nightly test --release --features "asm" -- --nocapture test_new_multexp_speed_with_bn256

Constantine is 1.35x faster than Bellman CE

barretenberg

Barretenberg 65536 points

Barretenberg is about 25% faster on 18 cores

Barretenberg 4194304 points

Barretenberg is about 23% faster


mratsim · 2023-04-14T11:36:01Z

53ae971 adds reentrancy / nested parallelism. Some parallel sections used syncAll to await a for-loop. As syncAll can only be called from the root thread, it prevents nested parallelism (i.e. calling that parallel function from another parallel function). syncScope, a precise local barrier, has been introduced to allow structured parallelism (i.e. async-finish from Habanero, which guarantees that all parallel tasks spawned within a scope are completed at the end of it).
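The async-finish idea can be sketched outside of Constantine. The example below is not the threadpool's implementation or API: `Scope`, `sync_scope`, `parallel_sum` and `many_sums` are hypothetical names, and it uses one OS thread per task for simplicity where a real threadpool would schedule tasks on a fixed set of workers. It only shows the semantics: the barrier waits for the tasks spawned in its own scope, and it can be used from any thread, so parallel functions can nest.

```python
import threading
from contextlib import contextmanager


class Scope:
    """Tracks the tasks spawned inside one scope (hypothetical helper)."""

    def __init__(self):
        self._threads = []
        self._lock = threading.Lock()

    def spawn(self, fn, *args):
        thread = threading.Thread(target=fn, args=args)
        with self._lock:
            self._threads.append(thread)
        thread.start()

    def wait(self):
        # Join until no task of this scope is left; a running task may
        # still spawn more tasks into the scope before it finishes.
        while True:
            with self._lock:
                if not self._threads:
                    return
                thread = self._threads.pop()
            thread.join()


@contextmanager
def sync_scope():
    """Local barrier: returns only once every task spawned in the scope is done."""
    scope = Scope()
    try:
        yield scope
    finally:
        scope.wait()


def parallel_sum(values, chunks=4):
    partials = [0] * chunks

    def work(i):
        partials[i] = sum(values[i::chunks])

    with sync_scope() as scope:  # callable from any thread, not just the root
        for i in range(chunks):
            scope.spawn(work, i)
    return sum(partials)


def many_sums(inputs):
    results = [None] * len(inputs)

    def work(i):
        results[i] = parallel_sum(inputs[i])  # nested parallel call

    with sync_scope() as scope:
        for i in range(len(inputs)):
            scope.spawn(work, i)
    return results


print(many_sums([list(range(10)), list(range(100))]))
```

A global barrier restricted to the root thread could not be used inside `parallel_sum` here, because `parallel_sum` itself runs on a spawned thread when called from `many_sums`.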

syncAll was only used for input size below 1024.

The new barrier improves performance by 15% to 50%

Perf before

Perf after

mratsim · 2023-04-14T13:09:06Z

Rebench on BN254-Snarks

Now 50% faster / better CPU usage and 1% faster than Gnark ¯\_(ツ)_/¯ and 22.59% faster than Barretenberg. No change though.

mratsim · 2023-04-14T13:31:51Z

10% faster by increasing collision queue depth

tune for high core count

bf04281

reentrancy: allow nesting of parallel functions by introducing precis…

53ae971

…e scoped barriers

gbotrel mentioned this pull request Apr 14, 2023

perf: MSM tuning for small sizes (< 60k points) Consensys/gnark-crypto#381

Open

increase collision queue depth

f309478

mratsim merged commit 93dac25 into master Apr 14, 2023

mratsim deleted the high-core-count branch April 14, 2023 18:03

mratsim mentioned this pull request Jun 11, 2024

Towards state-of-the-art multi-scalar-muls privacy-ethereum/halo2curves#163

Open

6 tasks

unbalancedparentheses mentioned this pull request Aug 27, 2023

Checking Costantine library and copying some ideas and reference it lambdaclass/lambdaworks#532

Closed

MauroToscano mentioned this pull request Dec 22, 2023

MSM Optimizations lambdaclass/lambdaworks#730

Open

2 tasks

sebastiencs mentioned this pull request Jan 6, 2025

As a developer, I want to make proof generation + verification faster o1-labs/mina-rust#1013

Closed
