perf
provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:
- provides a monad transformer,
PerfT
, as a light-weight wrapper for use on existing code.PerfT
modifications can be included in code bases, as opposed to performance being separated code and process, with any effects able to be erased at compile time withevalPerfT
. - focuses on using the rdtsc instruction set that comes with the x86 chip to measure accurate time performance as number of clock cycles.
- abstracts the concept of what is being measured, so that concepts such as counters, debug checks, time and space performance can be treated equivalently.
Note that running this readme.org is very slow compared with an external process which accesses the compiled version of the library.
:r
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
putStrLn "ok"
The fundamental operation of Perf.Time is tick, which sandwiches a (strict) function application between two readings of the rdstc chip.
:t tick
tick :: (a -> b) -> a -> IO (Cycles, b)
tick returns in the IO monad, because reading a cycle counter is an IO effect. The trivial but fundamental point is that performance measurement effects the computation being measured.
Unlike benchmarking-style libraries, such as criterion, the result of the computation is passed through along with the cycle count. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in-situ as opposed to sitting off in a separate and perhaps unrepresentative process.
tick_ measures the number of cycles between two clock reads.
:t tick_
tick_ :: IO Cycles
fmap word <$> replicateM 10 tick_
[3063,1201,990,978,945,948,951,948,951,948]
fmap (fmap (word . fst)) . replicateM 10 $ tick (const ()) ()
[12104,5040,4868,4508,4188,4172,4800,4748,5760,4216]
Here, const () ()
was evaluated and took 6.1k cycles for the first effect, reducing down to 2.6k after 10 effects. What it did in that time we will never know, really, at least from usage of perf
. 2.2k cycles is, on my 2.5GHz machine equal to 2200 / 2.5e9 = 8.8e-7 of a second or 880 nanoseconds.
tickIO
measures the evaluation of an IO value.
:t tickIO
tickIO :: IO a -> IO (Cycles, a)
fmap (fmap (word . fst)) . replicateM 10 $ tickIO (pure ())
[10084,3588,4104,2844,2916,3140,2800,2848,2848,2804]
fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])
["1.7e6","6.1e5","6.2e5","5.4e5","4.3e5","4.1e5","4.1e5","4.3e5","4.1e5","4.1e5"]
ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)
30411.6957
PerfT
allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.
Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:
first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]
(50914.804,500500)
… with PerfT usage
second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])
(500500,fromList [("",44362.311)])
Comparing performance of sum versus a list fusion approach:
fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))
fromList [("fusion",97804.085),("sum",45053.771)]
An IO example
exampleIO' :: IO ()
exampleIO' = do
txt <- Text.readFile "src/Perf.hs"
let n = Text.length txt
Text.putStrLn $ "length of file is: " <> Text.pack (show n)
exampleIO = execPerfT time (do
txt <- fam "file_read" (Text.readFile "src/Perf.hs")
n <- fap "length" Text.length txt
fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))
perf-explore --exampleIO
length of file is: 1794 length of file is: 1794 label1 label2 label3 old result new result change normal file-read time 5.96e5 6.58e5 slightly-degraded normal length time 1.10e4 7.12e3 improvement normal print-result time 1.44e5 5.70e4 improvement outer file-read time 1.36e5 1.39e5 outer length time 1.45e3 1.33e3 improvement outer outer-total time 1.67e5 1.50e5 improvement outer print-result time 2.70e4 6.26e3 improvement
perf-explore
contains some exploratory routines used to develop perf
perf-explore --help
examples of perf usage Usage: perf-explore [-n|--runs ARG] [--best | --median | --average | --averagesecs] [--time | --space | --spacetime | --allocation | --count] [-g|--golden ARG] [--nocheck] [-r|--record] [--header | --noheader] [--error ARG] [--warning ARG] [--improved ARG] [--sums | --lengths | --nub | --examples | --example | --exampleIO | --noops | --ticks | --gauge] [-l|--length ARG] [--sumFuse | --sum | --lengthF | --constFuse | --mapInc | --noOp] perf exploration Available options: -n,--runs ARG number of runs to perform --best report upper decile --median report median --average report average --averagesecs report average in seconds --time measure time performance --space measure space performance --spacetime measure both space and time performance --allocation measure bytes allocated --count measure count -g,--golden ARG golden file name --nocheck do not check versus the golden file -r,--record record the result to the golden file --header include headers --noheader dont include headers --error ARG error level --warning ARG warning level --improved ARG improved level --sums run on sum algorithms --lengths run on length algorithms --nub nub test --examples run on example algorithms --example run on the example algorithm --exampleIO exampleIO test --noops noops test --ticks tick test --gauge gauge runs on exmaple for comparison -l,--length ARG length of list --sumFuse fused sum pipeline --sum sum --lengthF foldr id length --constFuse fused const pipeline --mapInc fmap (+1) --noOp const () -h,--help Show this help text
fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])
fromList [("",35729.9518)]
The equivalent to the above code is:
perf-explore -n 10000 -l 1000 --sum --nocheck
label1 label2 results
:
sum time 1.47e4
This no-op experiment is useful to understand the pure time performance of the machinery around measurement. It can be (re)run with:
perf-explore --noops
label1 label2 label3 results const 1st time 6.05e3 const 2nd time 2.80e1 const 3rd time 3.60e1 const 4th time 3.20e1 const average time 4.11e1 const best time 2.74e1 const median time 3.54e1 pure 1st time 6.46e3 pure 2nd time 7.60e1 pure 3rd time 4.00e1 pure 4th time 4.00e1 pure average time 4.28e1 pure best time 2.82e1 pure median time 3.64e1 No golden file found. To create one, run with '-r'
Exploration of how the code surrounding measurement effects performance.
perf-explore -n 1000 -l 1000 --ticks --nocheck
stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | |
sumAux | 1.46e4 | 1.51e4 | 1.65e4 | 1.45e4 | 1.96e1 | 1.45e4 | 1.46e4 |
sumCata | 1.45e4 | 1.82e4 | 1.45e4 | 1.45e4 | 1.99e1 | 1.46e4 | 1.48e4 |
sumCo | 1.45e4 | 1.46e4 | 1.46e4 | 1.63e4 | 2.02e1 | 1.45e4 | 1.44e4 |
sumCoCase | 1.44e4 | 1.51e4 | 1.61e4 | 1.46e4 | 2.78e1 | 2.10e4 | 1.49e4 |
sumCoGo | 2.61e4 | 1.46e4 | 3.82e4 | 2.55e4 | 2.04e1 | 1.45e4 | 2.59e4 |
sumF | 1.34e4 | 1.34e4 | 2.01e4 | 1.64e4 | 2.01e1 | 1.33e4 | 1.33e4 |
sumFlip | 1.37e4 | 1.34e4 | 1.34e4 | 1.33e4 | 2.06e1 | 1.41e4 | 1.33e4 |
sumFlipLazy | 1.33e4 | 1.34e4 | 1.33e4 | 1.34e4 | 2.01e1 | 1.37e4 | 1.34e4 |
sumFoldr | 1.44e4 | 1.45e4 | 2.18e4 | 1.68e4 | 3.22e1 | 1.54e4 | 1.45e4 |
sumFuse | 3.19e3 | 1.98e3 | 1.93e3 | 2.19e3 | 1.98e1 | 1.83e3 | 3.15e3 |
sumFuseFoldl’ | 1.88e3 | 2.46e3 | 2.02e3 | 2.27e3 | 2.39e1 | 2.38e3 | 2.15e3 |
sumFuseFoldr | 5.79e3 | 6.01e3 | 3.70e3 | 3.70e3 | 2.00e1 | 3.70e3 | 6.11e3 |
sumFusePoly | 2.67e3 | 2.68e3 | 2.65e3 | 2.29e3 | 2.73e1 | 2.44e3 | 2.60e3 |
sumLambda | 1.59e4 | 1.35e4 | 1.47e4 | 1.99e4 | 1.99e1 | 1.36e4 | 1.34e4 |
sumMono | 1.38e4 | 1.69e4 | 1.35e4 | 1.36e4 | 2.04e1 | 1.37e4 | 1.34e4 |
sumPoly | 1.99e4 | 1.42e4 | 1.35e4 | 1.73e4 | 1.97e1 | 1.35e4 | 1.51e4 |
sumSum | 1.34e4 | 1.36e4 | 1.34e4 | 1.35e4 | 1.99e1 | 1.34e4 | 2.01e4 |
sumTail | 2.03e4 | 1.37e4 | 1.34e4 | 1.34e4 | 1.99e1 | 1.34e4 | 2.01e4 |
sumTailLazy | 1.34e4 | 1.66e4 | 1.33e4 | 1.41e4 | 1.99e1 | 1.35e4 | 1.33e4 |
perf-explore -n 10000 -l 10 --best --ticks
stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | |
sumAux | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.32e1 | 1.83e2 | 1.82e2 |
sumCata | 1.81e2 | 2.33e2 | 1.81e2 | 1.82e2 | 1.33e1 | 2.33e2 | 1.81e2 |
sumCo | 2.10e2 | 2.33e2 | 2.13e2 | 2.11e2 | 2.49e1 | 2.33e2 | 2.10e2 |
sumCoCase | 1.81e2 | 2.47e2 | 1.82e2 | 1.82e2 | 1.83e1 | 2.47e2 | 1.81e2 |
sumCoGo | 1.81e2 | 2.34e2 | 2.34e2 | 2.34e2 | 1.33e1 | 2.33e2 | 1.82e2 |
sumF | 1.03e2 | 2.22e2 | 1.05e2 | 1.04e2 | 1.88e1 | 1.83e2 | 1.04e2 |
sumFlip | 1.13e2 | 1.28e2 | 1.20e2 | 1.20e2 | 1.88e1 | 1.17e2 | 1.63e2 |
sumFlipLazy | 1.13e2 | 2.07e2 | 1.76e2 | 1.20e2 | 2.49e1 | 2.04e2 | 1.01e2 |
sumFoldr | 1.81e2 | 2.66e2 | 2.65e2 | 1.81e2 | 1.88e1 | 2.65e2 | 1.81e2 |
sumFuse | 2.58e1 | 2.71e1 | 2.85e1 | 3.36e1 | 1.36e1 | 3.36e1 | 3.42e1 |
sumFuseFoldl’ | 5.04e1 | 2.71e1 | 5.11e1 | 5.08e1 | 1.41e1 | 2.65e1 | 3.96e1 |
sumFuseFoldr | 8.16e1 | 8.61e1 | 6.26e1 | 7.34e1 | 1.25e1 | 8.47e1 | 8.54e1 |
sumFusePoly | 4.19e1 | 5.82e1 | 4.26e1 | 5.81e1 | 1.30e1 | 5.91e1 | 4.30e1 |
sumLambda | 1.39e2 | 1.30e2 | 1.60e2 | 1.63e2 | 1.33e1 | 1.29e2 | 1.43e2 |
sumMono | 8.04e1 | 1.43e2 | 8.64e1 | 8.41e1 | 1.83e1 | 1.15e2 | 8.16e1 |
sumPoly | 1.16e2 | 1.29e2 | 1.44e2 | 1.21e2 | 2.49e1 | 1.30e2 | 1.18e2 |
sumSum | 1.02e2 | 1.03e2 | 1.04e2 | 1.05e2 | 1.33e1 | 1.02e2 | 1.04e2 |
sumTail | 1.40e2 | 1.37e2 | 1.31e2 | 1.31e2 | 1.36e1 | 1.45e2 | 1.42e2 |
sumTailLazy | 1.06e2 | 1.78e2 | 1.12e2 | 1.12e2 | 1.84e1 | 2.25e2 | 1.09e2 |
perf-explore -n 100 -l 100000 --best --ticks
stepTime | tick | tickForce | tickForceArgs | tickLazy | tickWHNF | times | |
sumAux | 2.82e6 | 2.70e6 | 2.66e6 | 2.75e6 | 1.83e1 | 2.70e6 | 2.79e6 |
sumCata | 2.71e6 | 2.69e6 | 2.63e6 | 2.64e6 | 1.28e1 | 2.68e6 | 2.75e6 |
sumCo | 2.71e6 | 2.73e6 | 2.63e6 | 2.64e6 | 2.52e1 | 2.71e6 | 2.73e6 |
sumCoCase | 2.69e6 | 2.70e6 | 2.70e6 | 2.65e6 | 1.85e1 | 2.76e6 | 2.67e6 |
sumCoGo | 2.68e6 | 2.77e6 | 2.65e6 | 2.63e6 | 1.82e1 | 2.75e6 | 2.69e6 |
sumF | 1.24e6 | 1.24e6 | 1.24e6 | 1.24e6 | 1.26e1 | 1.24e6 | 1.27e6 |
sumFlip | 1.16e6 | 1.14e6 | 1.14e6 | 1.16e6 | 1.35e1 | 1.13e6 | 1.16e6 |
sumFlipLazy | 1.03e6 | 1.03e6 | 1.03e6 | 1.02e6 | 1.37e1 | 1.03e6 | 1.03e6 |
sumFoldr | 2.76e6 | 2.83e6 | 2.66e6 | 2.67e6 | 1.82e1 | 2.81e6 | 2.74e6 |
sumFuse | 1.45e5 | 1.45e5 | 1.45e5 | 1.44e5 | 1.39e1 | 1.45e5 | 1.45e5 |
sumFuseFoldl’ | 1.45e5 | 1.45e5 | 1.45e5 | 1.46e5 | 1.35e1 | 1.45e5 | 1.45e5 |
sumFuseFoldr | 1.76e6 | 1.77e6 | 1.75e6 | 1.75e6 | 1.90e1 | 1.76e6 | 1.75e6 |
sumFusePoly | 1.45e5 | 2.00e5 | 1.45e5 | 1.45e5 | 1.32e1 | 1.68e5 | 1.45e5 |
sumLambda | 9.35e5 | 9.52e5 | 9.35e5 | 9.38e5 | 2.48e1 | 9.46e5 | 9.44e5 |
sumMono | 1.13e6 | 1.12e6 | 1.11e6 | 1.11e6 | 1.23e1 | 1.11e6 | 1.11e6 |
sumPoly | 1.03e6 | 1.04e6 | 1.02e6 | 1.03e6 | 1.32e1 | 1.03e6 | 1.03e6 |
sumSum | 1.20e6 | 1.22e6 | 1.20e6 | 1.19e6 | 1.32e1 | 1.22e6 | 1.20e6 |
sumTail | 1.22e6 | 1.22e6 | 1.22e6 | 1.22e6 | 1.28e1 | 1.22e6 | 1.22e6 |
sumTailLazy | 1.16e6 | 1.19e6 | 1.18e6 | 1.16e6 | 1.29e1 | 1.18e6 | 1.16e6 |
perf-explore -n 1000 -l 1000 --sums
label1 label2 old result new result change sumAux time 1.51e4 1.46e4 sumCata time 1.55e4 1.46e4 improvement sumCo time 1.55e4 2.01e4 degraded sumCoCase time 1.57e4 1.47e4 improvement sumCoGo time 1.54e4 1.69e4 slightly-degraded sumF time 1.33e4 2.02e4 degraded sumFlip time 1.41e4 1.36e4 sumFlipLazy time 1.47e4 1.93e4 degraded sumFoldr time 1.56e4 1.75e4 slightly-degraded sumFuse time 1.66e3 2.61e3 degraded sumFuseFoldl' time 1.71e3 2.70e3 degraded sumFuseFoldr time 5.83e3 6.72e3 slightly-degraded sumFusePoly time 2.12e3 2.64e3 degraded sumLambda time 1.47e4 1.39e4 improvement sumMono time 1.73e4 1.40e4 improvement sumPoly time 1.87e4 1.36e4 improvement sumSum time 1.33e4 1.34e4 sumTail time 1.47e4 1.40e4 improvement sumTailLazy time 1.44e4 1.34e4 improvement
perf-explore -n 1000 -l 1000 --lengths
label1 label2 old result new result change lengthAux time 1.45e4 1.60e4 slightly-degraded lengthCo time 1.64e4 1.39e4 improvement lengthCoCase time 1.78e4 1.31e4 improvement lengthF time 1.06e4 1.10e4 lengthFMono time 1.05e4 1.05e4 lengthFlip time 1.11e4 1.07e4 lengthFlipLazy time 1.13e4 1.07e4 improvement lengthFoldr time 1.27e4 1.33e4 slightly-degraded lengthFoldrConsttime 1.27e4 1.36e4 slightly-degraded lengthTail time 1.08e4 1.10e4 lengthTailLazy time 1.15e4 1.03e4 improvement
perf-explore -n 1000 -l 1000 --average --gauge
sumFuse benchmarking function ... function time 849.4 ns benchmarking function ... function time 848.6 ns sum benchmarking function ... function time 3.786 μs benchmarking function ... function time 3.791 μs lengthF benchmarking function ... function time 1.863 μs benchmarking function ... function time 1.872 μs constFuse benchmarking function ... function time 571.0 ns benchmarking function ... function time 1.080 μs mapInc benchmarking function ... function time 9.203 ns benchmarking function ... function time 15.99 μs noop benchmarking function ... function time 5.770 ns benchmarking function ... function time 4.740 ns
Data is collected from GHCStats
- allocated_bytes
- gcs
- gcdetails_live_bytes
- max_live_bytes
- max_mem_in_use_bytes
perf-explore -n 10 -l 100000 --space +RTS -T -RTS
label1 label2 results
:
sum MaxMem 4.61e6 sum allocated 4.19e5 sum gcLiveBytes 2.19e5 sum gcollects 1.00e-1 sum maxLiveBytes 0.00e0 No golden file found. To create one, run with '-r'
Perf.BigO represents functionality to determine the complexity order for a computation.
We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic.
Instead, we:
- estimate the order factor for each possible Order, from N3 to N0, setting the highest n run constant factor to zero,
- pick the order based on lowest absolute error result summed across all the runs,
import qualified Prelude as P
import Data.List (nub)
estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]
BigOrder {bigOrder = N2, bigFactor = 13.375059, bigConstant = 0.0}
The Haskell performance checklist
ndmitchell/spaceleak: Notes on space leaks
ghc app/speed.hs -ddump-simpl -ddump-to-file -fforce-recomp -dlint -O
haskell wiki: Looking at the Core
ghc issue 15185: Enum instance for IntX / WordX are inefficient
fixpt - All About Strictness Analysis (part 1)
A typical configuration step for profiling:
cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof -write-ghc-environment-files=always
A cabal.project.local with profiling enabled:
write-ghc-environment-files: always ignore-project: False flags: +prof +prof-auto library-profiling: True executable-profiling: True
Examples from markup-parse R&D:
Executable compilation:
ghc -prof -fprof-auto -rtsopts app/speed0.hs -threaded -fforce-recomp
Executable run:
app/speed0 +RTS -s -p -hc -l -RTS
885,263,472 bytes allocated in the heap 8,507,448 bytes copied during GC 163,200 bytes maximum residency (4 sample(s)) 27,752 bytes maximum slop 6 MiB total memory in use (0 MiB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 207 colls, 0 par 0.009s 0.010s 0.0001s 0.0002s Gen 1 4 colls, 0 par 0.001s 0.001s 0.0004s 0.0005s TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.006s ( 0.006s elapsed) MUT time 0.367s ( 0.360s elapsed) GC time 0.010s ( 0.011s elapsed) RP time 0.000s ( 0.000s elapsed) PROF time 0.000s ( 0.000s elapsed) EXIT time 0.001s ( 0.001s elapsed) Total time 0.384s ( 0.380s elapsed)
Dumped to speed0.prof
COST CENTRE MODULE SRC %time %alloc token MarkupParse src/MarkupParse.hs:(259,1)-(260,20) 50.2 50.4 wrappedQ' MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(215,1)-(217,78) 20.8 23.1 ws_ MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:(135,1)-(146,4) 14.3 5.5 eq MarkupParse.FlatParse src/MarkupParse/FlatParse.hs:243:1-30 10.6 11.1 gather MarkupParse src/MarkupParse.hs:(420,1)-(428,100) 2.4 3.7 runParser FlatParse.Basic src/FlatParse/Basic.hs:(217,1)-(225,24) 1.0 6.0
eventlog2html speed0.eventlog
Produces speed0.eventlog.html which contains heap charts.
The average cycles per + operation can get down to about 0.7 cycles, and there are about 4 cache registers per cycle, so a sum pipeline uses 2.8 register instructions per +.
Cache | Cycles |
---|---|
register | 4 per cycle |
L1 Cache access | 3-4 cycles |
L2 Cache access | 11-12 cycles |
L3 unified access | 30 - 40 |
DRAM hit | 195 cycles |
L1 miss | 40 cycles |
L2 miss | >600 cycles |