
Conversation


@mratsim mratsim commented Apr 14, 2023

This PR optimizes multi-scalar-multiplication for machines with a high-core-count.

On an i9-9980XE (Skylake-X, 18 cores, overclocked and liquid-cooled, all-core turbo 4.1 GHz)

Features

Reentrancy / nested parallelism: A new precise barrier syncScope has been introduced to the threadpool.
Contrary to syncAll, which can only be called from the root thread and so prevents nested parallelism, syncScope can be called from any thread. Hence parallel MSM and parallel sum reductions / batch additions can be called from within other parallel functions, for example a ZK prover that needs to schedule multiple parallel MSMs in parallel.
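As a rough illustration of the difference (in Python with concurrent.futures, not Constantine's Nim threadpool; all names here are hypothetical), a function that barriers only on the tasks it spawned itself can safely be called from inside another parallel region:

```python
# Illustrative sketch: each parallel function waits only on the futures it
# spawned -- a local, scoped barrier -- so it can itself run as a task inside
# another parallel region, the way syncScope allows a parallel MSM inside a
# parallel prover. A root-only global barrier (syncAll-style) forbids this.
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=4)

def parallel_sum(xs):
    # Spawn sub-tasks, then barrier only on this scope's own tasks.
    mid = len(xs) // 2
    futs = [pool.submit(sum, xs[:mid]), pool.submit(sum, xs[mid:])]
    wait(futs)  # scoped barrier: local tasks only, callable from any thread
    return sum(f.result() for f in futs)

def prover():
    # Nested parallelism: schedule two "MSM-like" reductions in parallel,
    # each of which spawns and awaits its own sub-tasks.
    futs = [pool.submit(parallel_sum, list(range(100))),
            pool.submit(parallel_sum, list(range(100, 200)))]
    wait(futs)
    return [f.result() for f in futs]

print(prover())  # [4950, 14950]
```

A real work-stealing threadpool improves on this sketch by letting a thread that reaches the barrier execute pending tasks instead of blocking.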

Bug fix

The parallel speedup bench reported the perf ratio of the last iteration instead of the average over all iterations.
Given that most of Constantine is constant-time and the CPU was primed/hot, there was little variation, but still ...
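A minimal sketch of the bug and its fix (illustrative Python, not the actual bench code; function and variable names are made up):

```python
# Bug: the loop overwrites the ratio each iteration, so only the last
# iteration's perf ratio is reported instead of the average of all runs.
def speedup_buggy(serial_times, parallel_times):
    ratio = 0.0
    for s, p in zip(serial_times, parallel_times):
        ratio = s / p            # overwritten every iteration
    return ratio                 # only the last iteration survives

# Fix: accumulate all ratios, then average.
def speedup_fixed(serial_times, parallel_times):
    ratios = [s / p for s, p in zip(serial_times, parallel_times)]
    return sum(ratios) / len(ratios)

serial   = [10.0, 10.0, 10.0]
parallel = [2.0, 2.5, 1.0]
print(speedup_buggy(serial, parallel))  # 10.0 -- last iteration only
print(speedup_fixed(serial, parallel))  # 6.33... -- average of 5x, 4x, 10x
```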

Before

ksnip_20230414-092438
ksnip_20230414-092548
ksnip_20230414-092616

After

ksnip_20230414-092723
ksnip_20230414-092755
ksnip_20230414-092818

Observations

Starting from 512 points and up, we get a 50% to 150% improvement in perf and core utilization (yes, up to 2.5x). Note that on 8 cores the previous strategy somehow displayed over 7x speedup at 512 points while the new strategy only provides 5.5x. We make the choice of privileging scaling on high core counts.

At the top of our bench range, 262144 inputs (2^18), the multithreading speedup is over 15x instead of just 11x previously.
We might be reaching the limit of Amdahl's Law and may need algorithmic refactoring to go further, as the serial reductions might be a bottleneck for additional parallelism.
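A back-of-envelope Amdahl's-law estimate (an idealization, assuming the parallel fraction scales perfectly): a 15x speedup on 18 cores implies a serial fraction of roughly 1.2%, which caps the achievable speedup at 85x no matter the core count:

```python
# Amdahl's law: S = 1 / ((1-p) + p/n), where p is the parallel fraction
# and n the core count. Solve for p given an observed speedup, then
# return the serial fraction 1 - p.
def serial_fraction(speedup, cores):
    p = (1 - 1 / speedup) * cores / (cores - 1)
    return 1 - p

f = serial_fraction(15, 18)
print(f"serial fraction = {f:.4f}")   # 0.0118
print(f"max speedup     = {1/f:.1f}x")  # 85.0x
```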


mratsim commented Apr 14, 2023

After bf04281

References BLS12-381

Gnark

ksnip_20230414-094003

Gnark has a performance issue between 8192 and 16384 points and is likely switching to a strategy that is not effective on 18 cores. cc @yelhousni @gbotrel

Constantine 4096   -> 5.10ms
Gnark 4096         -> 5.37ms
Constantine 8192   -> 5.98ms
Gnark 8192         -> 5.90ms
Constantine 16384  -> 12.47ms
Gnark 16384        -> 30.53ms

blstrs

ksnip_20230414-094742

BLST's multithreading doesn't scale to high core counts; for example, it is 2.39x slower at 65536 (2^16) inputs.

Bellman

ksnip_20230414-100607

Constantine is 5.22x faster than Bellman (Zcash backend)

BN-254 Snarks

Constantine

ksnip_20230414-102421

Gnark

ksnip_20230414-101545

Constantine is 3.18x faster than Gnark at 65536 inputs.
However, it is 1.49x slower at 4194304 inputs.

Bellman CE

RUSTFLAGS="-C target-cpu=native -C target_feature=+bmi2,+adx,+sse4.1" cargo +nightly test --release --features "asm" -- --nocapture test_new_multexp_speed_with_bn256
ksnip_20230414-103555

Constantine is 1.35x faster than Bellman CE

barretenberg

Barretenberg 65536 points
ksnip_20230414-104333

Barretenberg is about 25% faster on 18 cores

Barretenberg 4194304 points

ksnip_20230414-104028

Barretenberg is about 23% faster


mratsim commented Apr 14, 2023

53ae971 adds reentrancy / nested parallelism. Some parallel sections used syncAll to await a for-loop. As syncAll can only be called from the root thread, it prevented nested parallelism (i.e. calling that parallel function from another parallel function). syncScope, a precise local barrier, has been introduced to allow structured parallelism (i.e. async-finish from Habanero, guaranteeing that all parallel tasks spawned within a scope are completed at the end of it).

syncAll was only used for input sizes below 1024.
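The async-finish discipline can be sketched as a scope object whose exit acts as the barrier; everything below (the Scope class, Python, ThreadPoolExecutor) is illustrative and not Constantine's Nim threadpool API:

```python
# Async-finish sketch: a scope collects every task spawned inside it, and
# leaving the scope is the barrier -- guaranteeing all tasks spawned within
# the scope completed, like syncScope. The barrier is local to the scope,
# so scopes can nest and run from any thread.
from concurrent.futures import ThreadPoolExecutor, wait

class Scope:
    def __init__(self, pool):
        self.pool, self.futs = pool, []
    def spawn(self, fn, *args):
        self.futs.append(self.pool.submit(fn, *args))
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        wait(self.futs)  # finish: all tasks spawned in this scope are done

pool = ThreadPoolExecutor(max_workers=4)
results = []
with Scope(pool) as s:
    for i in range(4):
        s.spawn(results.append, i * i)
# Past this point, every task spawned in the scope has completed.
print(sorted(results))  # [0, 1, 4, 9]
```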

The new barrier improves performance by 15% to 50%.

Perf before

ksnip_20230414-132834

Perf after
ksnip_20230414-133316


mratsim commented Apr 14, 2023

Rebench on BN254-Snarks

ksnip_20230414-150607

Now 50% faster with better CPU usage: 1% faster than Gnark ¯\_(ツ)_/¯ and 22.59% faster than Barretenberg. No change though.


mratsim commented Apr 14, 2023

10% faster by increasing the collision queue depth

ksnip_20230414-152904
