stylus_benchmark #2827
base: master
Conversation
This reverts commit 4fd9d10.
general approach seems great.
…R_PATH> --scenario <SCENARIO>
use std::path::PathBuf;

fn generate_add_i32_wat(mut out_path: PathBuf) -> eyre::Result<()> {
    let number_of_ops = 20_000_000;
I'm skeptical of the faithfulness of this benchmark when compared to actual execution. Stylus programs are usually tiny, given the contract size limitation (24kb), so real programs will live in the CPU cache. Generating a big program that doesn't fit the CPU cache might shadow the processing time with the memory transfer overhead between RAM and the cache. My theory is that the memory transfer will become the bottleneck, and the measurement of instructions won't be precise. For instance, the addition instruction might appear much more expensive than in real usage because of the CPU caching problem.
Instead of creating a program with millions of instructions, I suggest you create a program with a few thousand instructions inside a loop. Since you have a few thousand instructions, the overhead of the loop increment and branch instructions should be minimal.
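The suggested structure can be sketched as a small WAT generator. This is a hypothetical illustration, not the PR's actual API: the function name, parameters, and emitted locals are all invented here. The point is that only a few thousand add instructions are emitted once, inside a loop, so the generated code stays cache-resident while the loop bookkeeping cost is amortized.

```rust
/// Hypothetical sketch: emit a .wat module that repeats a small block of
/// i32.add instructions inside a loop, instead of millions of straight-line
/// adds. Names and parameters are illustrative only.
fn generate_add_i32_loop_wat(ops_per_iteration: usize, iterations: u32) -> String {
    let mut wat = String::new();
    wat.push_str("(module\n  (func (export \"run\") (local $i i32) (local $acc i32)\n");
    wat.push_str("    (loop $l\n");
    // Hot block: small enough to fit in the CPU instruction cache.
    for _ in 0..ops_per_iteration {
        wat.push_str("      (local.set $acc (i32.add (local.get $acc) (i32.const 1)))\n");
    }
    // Loop bookkeeping: one increment and one branch per pass, amortized
    // over ops_per_iteration adds.
    wat.push_str("      (local.set $i (i32.add (local.get $i) (i32.const 1)))\n");
    wat.push_str(&format!(
        "      (br_if $l (i32.lt_u (local.get $i) (i32.const {iterations})))\n"
    ));
    wat.push_str("    )))\n");
    wat
}
```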
You are right :), the size of the program was significantly impacting benchmark performance.
The strategy you mentioned is around 5 times faster than not using a loop at all.
Instead of using toggle_benchmark to mark a single code block to be benchmarked, I added start_benchmark and end_benchmark instructions, which make it possible to benchmark multiple execution blocks.
I tried to benchmark only the execution block inside the loop, but calling start_benchmark/end_benchmark multiple times introduces a performance overhead that is not worthwhile compared to having a single start_benchmark/end_benchmark block that wraps the loop.
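The marker placement described above can be sketched as follows. This is a hypothetical helper, not code from the PR: start_benchmark and end_benchmark are the PR's custom jit instructions, emitted here as plain text around a caller-supplied loop body to show that one pair wraps the whole loop rather than each iteration.

```rust
/// Hypothetical sketch: wrap an entire loop in a single
/// start_benchmark/end_benchmark pair, so the marker overhead is paid
/// once rather than once per iteration.
fn wrap_loop_in_benchmark(loop_wat: &str) -> String {
    format!(
        "(func (export \"run\")\n    start_benchmark\n{loop_wat}\n    end_benchmark)\n"
    )
}
```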
Thanks!!!
let exec = &mut WasmEnv::default();

let module = exec_program(
Is this code benchmarking only the JIT execution? I'm worried about the differences between the JIT and the interpreter, so it would be nice to benchmark non-JIT execution as well.
That is a good point.
By interpreter, do you mean execution through the prover's binary?
@tsahee, any takes on that?
I intend to add more scenarios to be benchmarked in a future PR, together with a more detailed look at the results. But here is the output of the benchmarks implemented in this PR, which was run on an Apple M3 Pro:
Resolves NIT-2757
This PR adds the stylus_benchmark binary.
It will mainly be used to fine-tune the ink prices that are charged today.
It deterministically creates .wat programs with lots of instructions to be benchmarked.
If this PR is approved, more .wat program generators, benchmarking other instructions, will be added in a future PR.
This PR introduces two new WASM instructions to jit, start_benchmark and end_benchmark.
Code blocks between start_benchmark and end_benchmark instructions will be benchmarked.
stylus_benchmark uses jit as a library.