Evaluate using Profile-Guided Optimization (PGO) #1849

zamazan4ik · 2024-10-10T13:53:30Z

Hi!

I decided to test Profile-Guided Optimization (PGO) on the Cairo VM to see how it affects performance. For reference, results for other projects are available at https://github.com/zamazan4ik/awesome-pgo . PGO has helped many projects (including compilers and interpreters such as GCC, LLVM-based compilers, and CPython), so I applied it to this VM to see whether a performance win (or loss) can be measured. Here are my benchmark results.

Test environment

Fedora 40
Linux kernel 6.10.12
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.81.0
cairo-vm version: main branch on commit 3fb0344ce038b3a68cae897c403d1f561cfe8da7
Disabled Turbo boost

Benchmark

For benchmarking, I use the following scenario: cairo-vm-cli $tests_path/$file.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak. Benchmark programs are taken from here and compiled into .json files with the cairo-compile command (set up according to the README file): cairo-compile cairo_programs/benchmarks/program_name.cairo --output cairo_programs/benchmarks/program_name.json --proof_mode. For PGO, I use the cargo-pgo tool.

Commands for building binaries:

Release: cargo build --release --bin cairo-vm-cli
PGO instrumented: cargo pgo build -- --release --bin cairo-vm-cli
PGO optimized: cargo pgo optimize build -- --release --bin cairo-vm-cli

For PGO training I used two programs: big_factorial.cairo and big_fibonacci.cairo.
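
The build commands and the training step above can be chained into one cycle. The following is a minimal sketch, not the exact script I used: the `step` and `pgo_cycle` helpers and the `TRIPLE`, `BENCH_DIR` and `RUN` variables are names introduced here for illustration, and the instrumented binary path assumes an x86_64 Linux host where cargo-pgo places artifacts under target/<target-triple>/release. By default the script only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch of one PGO cycle: instrumented build, training runs, optimized build.
# Dry run by default: commands are printed, and executed only when RUN=1.
set -eu

# Assumption: x86_64 Linux host; adjust TRIPLE for other targets.
TRIPLE="${TRIPLE:-x86_64-unknown-linux-gnu}"
BENCH_DIR="${BENCH_DIR:-cairo_programs/benchmarks}"
VM_FLAGS="--proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak"

# Print a command, and run it only when RUN=1.
step() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

pgo_cycle() {
  # 1. Build the instrumented binary.
  step cargo pgo build -- --release --bin cairo-vm-cli
  # 2. Run the training programs to collect profiles.
  for prog in big_factorial big_fibonacci; do
    # VM_FLAGS is intentionally unquoted so that it splits into separate arguments.
    step "target/$TRIPLE/release/cairo-vm-cli" "$BENCH_DIR/$prog.json" $VM_FLAGS
  done
  # 3. Rebuild using the collected profiles.
  step cargo pgo optimize build -- --release --bin cairo-vm-cli
}

pgo_cycle
```

Run it from the repository root after compiling the two training programs to .json. Printing first makes it easy to check the paths before spending time on the builds.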

taskset -c 0 is used to reduce the OS scheduler's influence on the results. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee). Measurements are done with hyperfine.
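
To repeat the same measurement for several programs, the comparison can be wrapped in a small helper. This is a sketch under assumptions: `vm_cmd`, `compare`, `BENCH_DIR` and `RUN` are illustrative names, and the binaries are assumed to be copied to the current directory as cairo-vm-cli-release and cairo-vm-cli-optimized. It prints the hyperfine command by default and runs it only when RUN=1.

```shell
#!/bin/sh
# Sketch: build (and optionally run) the hyperfine comparison for a benchmark program.
set -eu

BENCH_DIR="${BENCH_DIR:-../cairo_programs/benchmarks}"
VM_FLAGS="--proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak"

# Command line for one binary ($1) and one program ($2), pinned to CPU core 0.
vm_cmd() {
  echo "taskset -c 0 ./$1 $BENCH_DIR/$2.json $VM_FLAGS"
}

# Compare the Release build against the PGO-optimized build on one program.
compare() {
  prog="$1"
  release_cmd="$(vm_cmd cairo-vm-cli-release "$prog")"
  pgo_cmd="$(vm_cmd cairo-vm-cli-optimized "$prog")"
  echo "+ hyperfine --warmup 3 '$release_cmd' '$pgo_cmd'"
  if [ "${RUN:-0}" = "1" ]; then
    hyperfine --warmup 3 "$release_cmd" "$pgo_cmd"
  fi
}

for prog in big_factorial linear_search; do
  compare "$prog"
done
```

Keeping the command construction in one place ensures that both binaries are always run with identical flags and CPU pinning.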

Results

I got the following results:

hyperfine --warmup 3 'taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak' 'taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak'
Benchmark 1: taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak
  Time (mean ± σ):      3.663 s ±  0.040 s    [User: 1.991 s, System: 1.658 s]
  Range (min … max):    3.629 s …  3.752 s    10 runs

Benchmark 2: taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak
  Time (mean ± σ):      3.209 s ±  0.018 s    [User: 1.548 s, System: 1.648 s]
  Range (min … max):    3.186 s …  3.234 s    10 runs

Summary
  'taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak' ran
    1.14 ± 0.01 times faster than 'taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/big_factorial.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak'

Here, cairo-vm-cli-release is the Release build and cairo-vm-cli-optimized is the PGO-optimized build.

The results show a ~14% speedup (1.14x) for the PGO-optimized build. Note that this test program was part of the PGO training dataset. What about programs that were not in the training dataset? Here we go:

hyperfine --warmup 3 'taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak' 'taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak'
Benchmark 1: taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak
  Time (mean ± σ):      3.839 s ±  0.050 s    [User: 2.068 s, System: 1.757 s]
  Range (min … max):    3.799 s …  3.949 s    10 runs

Benchmark 2: taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak
  Time (mean ± σ):      3.344 s ±  0.023 s    [User: 1.580 s, System: 1.751 s]
  Range (min … max):    3.309 s …  3.382 s    10 runs

Summary
  'taskset -c 0 ./cairo-vm-cli-optimized ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak' ran
    1.15 ± 0.02 times faster than 'taskset -c 0 ./cairo-vm-cli-release ../cairo_programs/benchmarks/linear_search.json --proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak'

The results, at least in this test, are the same: a ~15% performance improvement for cairo-vm. Quite a good result.
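
The speedup ratios can be recomputed from the means in the hyperfine output above, and splitting them into user and system time shows where the gain comes from. This is a small sketch; the `ratio` helper is a name introduced here, and all timings are copied from the two benchmark runs.

```shell
#!/bin/sh
# Ratio of Release time to PGO time, rounded to two decimals.
# Values above 1 mean the PGO-optimized build is faster.
set -eu

ratio() {
  awk -v base="$1" -v pgo="$2" 'BEGIN { printf "%.2f\n", base / pgo }'
}

# Timings in seconds: Release build first, PGO-optimized build second.
echo "big_factorial wall:   $(ratio 3.663 3.209)x"
echo "big_factorial user:   $(ratio 1.991 1.548)x"
echo "big_factorial system: $(ratio 1.658 1.648)x"
echo "linear_search wall:   $(ratio 3.839 3.344)x"
echo "linear_search user:   $(ratio 2.068 1.580)x"
echo "linear_search system: $(ratio 1.757 1.751)x"
```

With these numbers, user time improves by roughly 1.29x and 1.31x while system time is essentially unchanged. This suggests that the PGO gain is concentrated in user-space code and that the overall speedup is diluted by the time spent in the kernel, which is close to half of each run.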

Further steps

I can suggest the following action points:

Mention in a user-visible place that PGO brings measurable performance improvements for the project
Integrate PGO into the build pipeline (like it's done in CPython or other projects)
Optimize prebuilt binaries (if any) with PGO

Also, Post-Link Optimization (PLO) can be tested after PGO. It can be done by applying tools like LLVM BOLT (also supported in the cargo-pgo tool). However, it's a much less mature optimization technique compared to PGO.
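
For anyone who wants to try it, a combined PGO + BOLT cycle with cargo-pgo could look roughly like this. Treat it as an untested sketch: I did not run BOLT for this benchmark, the `bolt build --with-pgo` / `bolt optimize --with-pgo` subcommands and the `-bolt-instrumented` binary suffix are taken from my reading of the cargo-pgo README and should be checked against `cargo pgo --help`, and `step`, `train`, `bolt_cycle`, `TRIPLE`, `BENCH_DIR` and `RUN` are illustrative names. BOLT also needs the llvm-bolt and merge-fdata tools installed. The script prints the commands by default and executes them only when RUN=1.

```shell
#!/bin/sh
# Sketch of a PGO + BOLT cycle with cargo-pgo (dry run unless RUN=1).
set -eu

# Assumption: x86_64 Linux host; adjust TRIPLE for other targets.
TRIPLE="${TRIPLE:-x86_64-unknown-linux-gnu}"
BENCH_DIR="${BENCH_DIR:-cairo_programs/benchmarks}"
VM_FLAGS="--proof_mode --memory_file /dev/null --trace_file /dev/null --layout starknet_with_keccak"

# Print a command, and run it only when RUN=1.
step() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Run the training programs with the given binary ($1).
train() {
  for prog in big_factorial big_fibonacci; do
    # VM_FLAGS is intentionally unquoted so that it splits into separate arguments.
    step "$1" "$BENCH_DIR/$prog.json" $VM_FLAGS
  done
}

bolt_cycle() {
  # 1. PGO: instrumented build and profile collection.
  step cargo pgo build -- --release --bin cairo-vm-cli
  train "target/$TRIPLE/release/cairo-vm-cli"
  # 2. BOLT: instrumented build on top of the PGO profiles, then profile collection.
  step cargo pgo bolt build --with-pgo -- --release --bin cairo-vm-cli
  train "target/$TRIPLE/release/cairo-vm-cli-bolt-instrumented"
  # 3. Final binary optimized with both PGO and BOLT.
  step cargo pgo bolt optimize --with-pgo -- --release --bin cairo-vm-cli
}

bolt_cycle
```

The resulting binary would then be benchmarked against the PGO-only build in the same way as above.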

Thank you.
