[RFC/WIP] Tools for measuring cycles and cpu_times and tricking out LLVM #92

vchuravy · 2017-12-09T23:22:09Z

I recently started exploring options for more precise, lower-level benchmarking tools.
As it stands, this PR is not ready to be included in BenchmarkTools, but it should provide a starting point for discussion.

clobber() and escape()
Two methods to prevent certain compiler optimisations at the LLVM level (see https://youtu.be/nXaxk27zwlk?t=2441).
clobber() is a memory barrier that forces the compiler to flush all pending writes to memory, and escape() is a method that prevents
LLVM from optimising a value away by faking a store of it. escape() is not quite done, since it can't handle boxed values,
and it would be easier to write if we could depend on LLVM.jl.
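As a rough illustration of the idea (not the code in this PR), a compiler-level barrier can be sketched with an empty inline-asm statement that declares a memory clobber. The sketch also uses Base.donotdelete, which only exists on Julia 1.8 and later, as a stand-in for escape(); sum_kept is a hypothetical helper name.

```julia
# Sketch only: an empty inline-asm statement with a "memory" clobber.
# LLVM must assume it may read or write any memory, so pending stores
# cannot be dropped or moved across it.
@inline clobber() = Base.llvmcall("""
    call void asm sideeffect "", "~{memory}"()
    ret void
    """, Cvoid, Tuple{})

# Hypothetical helper: keep a computed value alive even though nothing
# else observes it. Base.donotdelete is assumed available (Julia >= 1.8).
function sum_kept(xs)
    s = sum(xs)
    Base.donotdelete(s)  # stand-in for escape(): s may not be optimised away
    clobber()            # treat all memory as potentially modified here
    return s
end
```

Neither call emits a hardware fence; both only constrain what the optimiser is allowed to do.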
bench_start() and bench_end()
Inspired by https://github.com/dterei/gotsc and https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
Since CPUs do speculative execution, instruction reordering, and a bunch of other shenanigans, this is a very careful series of instructions that tries to prevent as much of that
as possible and thus should give as precise an estimate as possible of the number of cycles it takes for a block of code to run. These instructions are not completely noise-free,
since we are still running in user space, and the current implementation is x86_64 only (and requires a series of processor features). It is also tricky to convert cycles
to time spent. If we use this method it should be opt-in, and we need to measure its variance and overhead.
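For orientation, here is a much simpler sketch than the serialised sequence described above. It reads the cycle counter through the llvm.readcyclecounter intrinsic, which lowers to rdtsc on x86_64, and does nothing to stop the CPU from reordering work around the reads, so it will be noisier. readcycles and cycles_for are hypothetical names, and x86_64 is assumed.

```julia
# Sketch only, x86_64 assumed: read the cycle counter via the LLVM intrinsic.
# Unlike a cpuid/rdtsc ... rdtscp/cpuid sequence, this is not serialising.
@inline readcycles() = ccall("llvm.readcyclecounter", llvmcall, UInt64, ())

# Hypothetical helper: raw cycle delta around a single call of f.
# The result of f is returned as well so the call cannot be discarded.
function cycles_for(f)
    t0 = readcycles()
    r = f()
    t1 = readcycles()
    return (cycles = t1 - t0, result = r)
end
```

A call such as cycles_for(() -> sum(rand(1000))) gives one noisy sample; the overhead of the two reads themselves would still have to be measured and subtracted.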
getProcessTime() and getThreadTime()
I got curious and looked into what google/benchmark uses for time measurement, and it turns out they actually measure two things:
run time and CPU time, where the latter is the time the process actually spends running on a CPU. The current implementation is Linux only but can be extended to all platforms we
care about. For run time measurement they use http://en.cppreference.com/w/cpp/chrono/high_resolution_clock. Currently we are using uv_hrtime from libuv.
On Unix, both uv_hrtime and the C++ timer fall back to clock_gettime(CLOCK_MONOTONIC, ...), similar to my implementation of getProcessTime.
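A minimal sketch of what such CPU-time queries can look like on Linux, calling clock_gettime directly. This is an illustration, not the code in this PR: the clock ids are the Linux values, a 64-bit time_t is assumed, and TimeSpec, clock_seconds, processtime and threadtime are illustrative names.

```julia
# Sketch only, Linux assumed. Mirrors C's `struct timespec`.
struct TimeSpec
    tv_sec::Int64   # time_t, assumed to be 64-bit
    tv_nsec::Clong
end

# Clock ids as defined in the Linux headers.
const CLOCK_PROCESS_CPUTIME_ID = Cint(2)
const CLOCK_THREAD_CPUTIME_ID  = Cint(3)

# Query one clock and convert the result to seconds.
function clock_seconds(clockid::Cint)
    ts = Ref(TimeSpec(0, 0))
    rc = ccall(:clock_gettime, Cint, (Cint, Ref{TimeSpec}), clockid, ts)
    rc == 0 || error("clock_gettime failed")
    return ts[].tv_sec + ts[].tv_nsec * 1e-9
end

processtime() = clock_seconds(CLOCK_PROCESS_CPUTIME_ID)  # CPU time of the whole process
threadtime()  = clock_seconds(CLOCK_THREAD_CPUTIME_ID)   # CPU time of the calling thread
```

Taking the difference of processtime() around a sleep should show almost no CPU time consumed even though wall-clock time advances, which is the distinction google/benchmark reports.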

What should we do?
I think taking a lead from google/benchmark and also measuring CPU time, not just run time, would be a good first actionable item. I am much
less sure about what to do with 1. and 2. (clobber()/escape() and bench_start()/bench_end()) and whether they are useful for BenchmarkTools.jl; that needs further evaluation, and I currently don't have time for it.

ararslan · 2017-12-09T23:27:19Z

src/lowlevel.jl

+end
+
+"""
+    getProcessTime()


This isn't a very Julian name, both in that it starts with "get" and that it's in camel case. I'd just call it processtime(). Likewise for getThreadTime, I'd call that threadtime().

chethega · 2018-11-12T18:24:30Z

It is also tricky to convert cycles to time spend. If we use this method it should be opt-in and we need to method variance and overhead.

Cycles spent is an extremely relevant metric in itself, often far more relevant than times. So I'd say, measure and report both, as well as the implied measured frequency.
This can serve as a reality check for users (if the reported frequency differs a lot from the official frequency, then we probably have a lot of measurement error). Also, when interpreting results, every relevant resource is normally counted in clock cycles anyway (instruction costs, cache-miss penalties, memory fetches, branch mispredicts, etc). Say you do some computations with N logical steps; then you always want to count how many OP/cycle, and this tells you roughly how good your code is (large number: few bookkeeping instructions, good use of memory and ILP; small number: figure out the problem).

Converting cycles to nanoseconds is bad; if any conversion makes sense, then it is nanoseconds -> cycles. By reporting measured frequency, the user is also empowered to spot problems like frequency drop due to AVX2, etc (some CPUs scale down frequency when some vector instructions are used).
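The two derived numbers mentioned here are simple ratios. A small sketch, with hypothetical function names and made-up inputs:

```julia
# Cycles per nanosecond is numerically the frequency in GHz.
implied_ghz(cycles, ns) = cycles / ns

# Logical operations completed per clock cycle.
ops_per_cycle(ops, cycles) = ops / cycles

implied_ghz(3_000_000, 1_000_000)    # 3.0, i.e. 3 GHz
ops_per_cycle(6_000_000, 3_000_000)  # 2.0
```

If the implied frequency is far from the CPU's nominal frequency, either the measurement is off or the core has changed its clock speed.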

vchuravy · 2018-11-14T15:45:56Z

Do you know of any way to measure cycles in a platform-portable way, e.g. something that works on ARM and PPC?

Originally I went forward with #94 since cputime is an important measure as well (how much time did we actually spend in the program, as opposed to sleeping or in the kernel).
I agree that cycle benchmarking has its place and is an important tool, but I am not convinced that a general framework such as BenchmarkTools is the right place for it (maybe we need a LowlevelBenchmarkTools package).
When measuring cycles you want to tightly control the code executed before and after the region of interest, and anything that introduces overhead will throw off any other timing measurements.

Anyway, I won't have time to work on either, so I would be happy if someone could pick this up and bring it to a conclusion.

vchuravy · 2022-01-20T17:12:45Z

So one of the things that has brought me back to this PR is that https://perf.rust-lang.org/ defaults to instructions and cycles,
as does http://llvm-compile-time-tracker.com/

But maybe the better pathway is to use LinuxPerf.jl to build that infrastructure.

vchuravy added 5 commits December 9, 2017 15:56

add clobber and escape to stop LLVM to over-optimize benchmarking loop

281e114

make escape a bit more robust

0802ed9

add cycle counting infrastructure

1be6637

add process and thread cpu time

f9babf7

convert process and thread time into seconds

55a9382

ararslan reviewed Dec 9, 2017

vchuravy mentioned this pull request Oct 26, 2018

@noopt macro for avoiding optimizations JuliaLang/julia#29817

Closed

tkf mentioned this pull request Apr 25, 2020

Recommend to use tuples to scalarize items in broadcast expressions JuliaLang/julia#35591

Merged

tkf mentioned this pull request Feb 4, 2022

Add a donotdelete builtin JuliaLang/julia#44036

Merged

gdalle marked this pull request as draft June 13, 2023 05:19

gdalle mentioned this pull request Jun 13, 2023

Black box function #145

Open

LilithHafner mentioned this pull request Mar 3, 2024

Count CPU cycles LilithHafner/Chairmarks.jl#54

Open

willow-ahrens mentioned this pull request Nov 1, 2024

LinuxPerf, @profile, and other experiments #377

Open
