Unexpected memory usage #810

Closed
ramsdell opened this issue Jul 9, 2020 · 22 comments
Labels
performance General performance related issues.

Comments

@ramsdell

ramsdell commented Jul 9, 2020

We are experiencing memory usage patterns that are not what we would expect. I wrote three simple Cryptol functions that iterate a hash function provided by the cryptol_spec repository. Here is the source for the three functions.

module iter::iter where

// Three ways to iterate a function.

// As a recurrence relation

recurrence : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
recurrence f init =
  sts ! 0
  where
    sts = [ init ] # [ f acc idx | idx <- [1 .. n]:[n][width n] | acc <- sts ]

// Using tail calls

tailcall : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
tailcall f init =
  loop init 0
  where
    loop acc idx =
      if idx > `n then
        acc
      else
        loop (f acc idx) (idx + 1)

// Using deep recursion

recursion : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
recursion f init =
  recur `n
  where
    recur idx =
      if idx == 0 then
        init
      else
        f (recur (idx - 1)) idx

I expected the first and second functions to execute in constant space, and the third function to exhibit linear growth. What we observe is that the first and third functions use about the same amount of space, and the second function often simply fails.

Enclosed is a zip file, iter.zip, that contains the sources that exhibit the problem. It also contains performance outputs. The Excel spreadsheet contains the results of running the tests at different sizes on a high-performance cluster. To run the examples, edit the pqcryptol script to reflect the location of your cryptol_spec repository.

@robdockins added the performance label on Jul 9, 2020
@robdockins
Contributor

The first function retains the intermediate list right until it extracts the final element, so I'm not surprised it's using linear memory. I'd love to find a way to recognize this particular pattern in the interpreter and do something more clever, but it's not obvious how to do it. I've considered making foldl a primitive so we can at least express computations like this in a way that could be constant-space.
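
For illustration, here is a sketch (hypothetical name) of the same iteration written as a fold rather than a recurrence. With the prelude foldl this still builds a tower of thunks, so it isn't constant-space yet; the point is what the computation would look like once a suitably strict fold primitive exists (see the later comments and #868):

// Same iteration as `recurrence`, expressed with a left fold over the
// index sequence instead of an explicit recurrence.
recurrenceFold : {a, n} (fin a, fin n, n > 0) =>
                 ([a] -> [width n] -> [a]) -> [a] -> [a]
recurrenceFold f init = foldl f init ([1 .. n] : [n][width n])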

As to version 2, I think this might be due to Cryptol's lazy semantics. Tail-recursion doesn't really help very much in a lazy language unless you can force the thunks during the computations somehow. In fact, the more direct recursion of number 3 is usually better in a lazy language. With an accumulator, you usually just end up building up a big tower of thunks that all get forced at the end. In Haskell this is a relatively common performance problem known as a "space leak". I don't know what error you're seeing, but I suspect it's GHC's stack overflow error, which tends to happen when you force a deep stack of thunks like this. We've avoided adding the kinds of primitives Haskell programmers use to fix these problems, as they are quite fiddly and feel out of place in a specification DSL... but it's an option.

Reading between the lines, I gather your goal is to make this run in constant space, right? I agree this is an important property for scaling up computations. We should definitely find some nice way to make this work.

@weaversa
Collaborator

weaversa commented Jul 9, 2020

But why does Cryptol use 1 meg for thousands of iterations and then suddenly jump to gigs?
[attached screenshot: plot of memory usage vs. iteration count]

These HUGE jumps make me think something is going wrong.

@weaversa
Collaborator

weaversa commented Jul 9, 2020

Also, I'm pretty sure a list of a million 256-bit hashes doesn't require anywhere near 174 gigs of memory to represent. (The raw data is only 10^6 × 32 bytes, about 32 MB.)

@brianhuffman
Contributor

There's a very important bug in the definition of tailcall: If the type parameter n is one less than a power of 2, then the test if idx > `n will never be true, because `n is the largest representable bitvector of type [width n]. So the function tailcall is literally an infinite loop for most of your test sizes.
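
To make the wrap-around concrete, here is a small, hypothetical illustration for n = 7 (so width n = 3):

// With n = 7, `n : [3] is 7, the largest 3-bit value, so the loop guard
// idx > `n can never hold, and idx + 1 silently wraps from 7 back to 0.
guardNeverFires : [3] -> Bit
guardNeverFires idx = ~(idx > 7)

property wrapIllustration idx = guardNeverFires idx /\ ((7 : [3]) + 1 == 0)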

@brianhuffman
Contributor

Here's a corrected version of tailcall:

tailcall : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
tailcall f init =
  loop init 0
  where
    loop acc idx =
      if idx == `n then
        acc
      else
        loop (f acc (idx + 1)) (idx + 1)

(Note that I also had to replace f acc idx with f acc (idx + 1) to make its output match that of recurrence exactly.)

@ramsdell
Author

ramsdell commented Jul 9, 2020

Nice catch about tailcall. That explains the failures.

@ramsdell
Author

Enclosed is an update to the tests that shows the memory issue. Besides fixing the tailcall example, I added a test that makes use of foldl from the Cryptol prelude. Notice that tailcall now has the best performance.

[attached chart: memory usage of the updated tests]

iter.zip

@robdockins
Contributor

robdockins commented Aug 21, 2020

I've done a little experimentation; this appears to be a classic space leak. I've implemented a pair of primitive fold operations, foldl and foldl'. Both of them perform a fold over a sequence without generating the intermediate result sequence. The difference (the same as for the Haskell functions of the same name) is that foldl' eagerly forces the intermediate values of the accumulator at each step. The version using foldl seems to have about the same (bad) memory performance as the above. However, foldl' can run the SHA256 example for 2^16 - 1 iterations in what appears to be a flat 100MB.

I'm going to try some longer runs and see what happens.

Edit: I ran the SHA256 example for 2^20 - 1 iterations using the foldl' primitive. It ran for about 40 minutes and stayed rock steady at about 100MB memory usage for the whole run.

@robdockins
Contributor

This PR implements the new fold primitives: #868

@brianhuffman
Contributor

Maybe we don't actually need any new primitives to fix this problem. Here's a stricter variant of tailcall that runs in constant space:

tailcall2 : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
tailcall2 f init =
  loop init 0
  where
    loop acc idx =
      if idx == `n then
        acc
      else
        if acc' == acc' then loop acc' (idx + 1) else undefined
      where acc' = f acc (idx + 1)

I use the expression if x == x then y else undefined to implement what we'd write as seq x y in Haskell (actually it's more like deepseq, as == forces evaluation of the entire value). So this is actually a bit more strict than foldl', and would also run in constant space if you wanted to use a tuple or record type for the accumulating parameter.
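
For convenience, the trick can be packaged as a helper function (an equivalent deepseq, defined with assert instead of if-then-else, appears in code later in this thread):

// Force full evaluation of x before returning y. The Eq a constraint is
// needed because (==) is what drives the evaluation of x.
deepseq : {a, b} Eq a => a -> b -> b
deepseq x y = if x == x then y else undefined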

@brianhuffman
Contributor

After further testing it looks like there is a linear component to the tailcall2 memory usage: up to 2^13-1 iterations of SHA256, cryptol +RTS -s shows 16MB of total memory in use, but at larger sizes you get an additional 1kB per iteration or so. Running 2^20-1 rounds of SHA256 with tailcall2 uses about 1GB of memory and takes about 50 minutes on my machine. I'd guess that the memory overhead is just to represent the Cryptol call stack in the interpreter, as we don't implement any actual tail call optimization in Cryptol. In any case, the memory consumption is an awful lot smaller than the numbers shown in the spreadsheet above.

@robdockins
Contributor

Yeah, I think some primitive like foldl' is still necessary to get truly constant-space operation, and it's probably a good idea to codify tricks like if x == x then y else undefined into a primitive like deepseq : Eq a => a -> b -> b.

@ramsdell
Author

It is great that you are finding a way to improve the performance of two out of the four example implementations submitted. This certainly is forward progress.

The examples don't compute anything of interest. They were distilled from problems that occurred when we looked at an interesting application. This line of work arose when we tried to instantiate the Sphincs+ module that resides in the cryptol-specs repository with parameters from the round 2 spec. We immediately ran into memory issues that we believe are represented by the examples associated with this issue. So what we are really interested in is solutions that can be reflected back into Cryptol sources as complex as the Sphincs+ one, so that they can be instantiated into something that can execute on existing hardware.

@robdockins
Contributor

In all the cases here, increasing the strictness using deepseq helps significantly. As @brianhuffman says, the tailcall version below runs in pretty good, but not quite constant, space. The implementation using a primitive foldl' is the only one that seems to run in constant space. The recurrence and direct recursion versions are still not great, but are much improved. The myfoldl' version below performs about the same as recurrence, which is no surprise since it's basically identical.

module iter::iter where

// Three ways to iterate a function.

// As a recurrence relation

recurrence : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
recurrence f init =
  sts ! 0
  where
    sts = [ init ] # [ deepseq acc (f acc idx) | idx <- [1 .. n]:[n][width n] | acc <- sts ]

// Using tail calls

tailcall : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
tailcall f init =
  loop init 0
  where
    loop acc idx =
      if idx == `n then
        acc
      else
        deepseq acc' (loop acc' (idx + 1))
          where acc' = f acc (idx+1)

// Using deep recursion

recursion : {a, n} (fin a, fin n, n > 0) =>
             ([a] -> [width n] -> [a]) -> [a] -> [a]
recursion f init =
  recur `n
  where
    recur idx =
      if idx == 0 then
        init
      else
        deepseq z (f z idx)
          where z = recur (idx-1)

foldit : {a, n} (fin a, fin n, n > 0) =>
         ([a] -> [width n] -> [a]) -> [a] -> [a]
foldit f init = foldl' f init [1 .. n]

myfoldl' : {n, a, b} (Eq a, fin n) => (a -> b -> a) -> a -> [n]b -> a
myfoldl' f acc xs = ys ! 0
  where ys = [acc] # [ deepseq a (f a x) | x <- xs | a <- ys ]

deepseq : {a, b} Eq a => a -> b -> b
deepseq x y = assert (x == x) "impossible" y

@msaaltink
Contributor

I've been experimenting with these methods, and with the experimental foldl', and am still seeing some perplexing results.

I slightly modified the iteration code to allow the accumulator to have any type, not just a numeric one (basically changing [a] to a and fin a to Eq a in the signature), and I added a baseline definition using a recurrence with no attempt to force any evaluation. That looks like this:

module iter::iter where

// baseline: no attempt to make strict
baseline : {a, n} (Eq a, fin n, n > 0) =>
             (a -> [width n] -> a) -> a -> a
baseline f init =
  sts ! 0
  where
    sts = [ init ] # [ (f acc idx) | idx <- [1 .. n]:[n][width n] | acc <- sts ]

// Three ways to iterate a function.

// As a recurrence relation

recurrence : {a, n} (Eq a, fin n, n > 0) =>
             (a -> [width n] -> a) -> a -> a
recurrence f init =
  sts ! 0
  where
    sts = [ init ] # [ deepseq acc (f acc idx) | idx <- [1 .. n]:[n][width n] | acc <- sts ]

// Using tail calls

tailcall : {a, n} (Eq a, fin n, n > 0) =>
             (a -> [width n] -> a) -> a -> a
tailcall f init =
  loop init 0
  where
    loop acc idx =
      if idx == `n then
        acc
      else
        deepseq acc' (loop acc' (idx + 1))
          where acc' = f acc (idx+1)

// Using deep recursion

recursion : {a, n} (Eq a, fin n, n > 0) =>
             (a -> [width n] -> a) -> a -> a
recursion f init =
  recur `n
  where
    recur idx =
      if idx == 0 then
        init
      else
        deepseq z (f z idx)
          where z = recur (idx-1)

foldit : {a, n} (Eq a, fin n, n > 0) =>
         (a -> [width n] -> a) -> a -> a
foldit f init = foldl' f init [1 .. n]

myfoldl' : {n, a, b} (Eq a, fin n) => (a -> b -> a) -> a -> [n]b -> a
myfoldl' f acc xs = ys ! 0
  where ys = [acc] # [ deepseq a (f a x) | x <- xs | a <- ys ]

deepseq : {a, b} Eq a => a -> b -> b
deepseq x y = assert (x == x) "impossible" y

Now I have two computations to try:

import iter::iter

// with the underlying type [64][64]
type T1 = [64][64]
step1: {t} T1 -> t -> T1
step1 x _ = x*x

init1: T1
init1 = [1..64]

test1_0: {n} (fin n, n > 0) => T1
test1_0 = baseline`{n=n} step1 init1

test1_1: {n} (fin n, n > 0) => T1
test1_1 = recurrence`{n=n} step1 init1

test1_2: {n} (fin n, n > 0) => T1
test1_2 = tailcall`{n=n} step1 init1

test1_3: {n} (fin n, n > 0) => T1
test1_3 = foldit`{n=n} step1 init1

// with the underlying type [64*64]

type T2 = [64*64]

step2: {t} T2 -> t -> T2
step2 x y = join (step1 (split x) y)

init2: T2
init2 = join init1

test2_0: {n} (fin n, n > 0) => T2
test2_0 = baseline`{n=n} step2 init2

test2_1: {n} (fin n, n > 0) => T2
test2_1 = recurrence`{n=n} step2 init2

test2_2: {n} (fin n, n > 0) => T2
test2_2 = tailcall`{n=n} step2 init2

test2_3: {n} (fin n, n > 0) => T2
test2_3 = foldit`{n=n} step2 init2

These are essentially the same computation, with a different shape of accumulator. Now here's the fun part. With 10^5 iterations (mem is the "maxresident" size reported by /bin/time; just loading the example Cryptol files gives a maxresident of nearly 89000k; mem (%) uses the baseline as 100%):

T1 (nested)
| method     | time (s) | mem (k) | mem (%) |
|------------+----------+---------+---------|
| baseline   |       24 | 2344848 |     100 |
| recurrence |       18 | 2280376 |      97 |
| tailcall   |       19 | 1378280 |      59 |
| foldl'     |       24 | 2287368 |      98 |

T2 (flat)
| method     | time (s) | mem (k) | mem (%) |
|------------+----------+---------+---------|
| baseline   |       12 |  399128 |     100 |
| recurrence |       15 |  833348 |     209 |
| tailcall   |       14 |  529160 |     133 |
| foldl'     |       11 |  110952 |      28 |

and with 4*10^5 iterations

T1 (nested)
| method     | time (s) | mem (k) | mem (%) |
|------------+----------+---------+---------|
| baseline   |      112 | 9587532 |     100 |
| recurrence |       72 | 8923968 |      93 |
| tailcall   |       86 | 9329420 |      97 |
| foldl'     |      109 | 9065376 |      95 |

T2 (flat)
| method     | time (s) | mem (k) | mem (%) |
|------------+----------+---------+---------|
| baseline   |       48 | 1378208 |     100 |
| recurrence |       59 | 3123024 |     227 |
| tailcall   |       58 | 1824688 |     132 |
| foldl'     |       44 |  110264 |       8 |

I should probably subtract off the fixed 89000 before computing the percentages, but it does not change the picture very much.
Here are some observations:

  1. For T1, the nested type, the deepseq method and foldl' are not very effective, while for T2 they are. Yet the computation is pretty much the same.
  2. In the T2 case, foldl' looks to be using constant memory (around 110MB in both cases), while in the T1 case the memory grows with increasing iterations.

I am thinking I should flatten my accumulator types for better memory performance, but it seems a strange thing to have to do.

One final peculiarity: if in the REPL I compute test2_3`{10^^6}, then after the answer is printed, executing (split it) : T1 takes over a minute, even though it looks like a trivial calculation. I am mystified.

@robdockins
Contributor

There are a couple of things going on here, I think. First, the foldl' definition in #868 is currently defined to evaluate only to WHNF, which is weaker than the NF evaluation you get from deepseq. The following variant makes an even stricter foldl, which is a closer comparison to the tailcall version above, and performs much better on the T1 variant. I'm considering making foldl' as strict as this (cf. #868 (comment)). It's also worth noting that there is a primitive deepseq in that branch that performs slightly better than the one defined here using (==).

foldl'' : {n, a, b} (fin n, Eq a) => (a -> b -> a) -> a -> [n]b -> a
foldl'' f = foldl' g
 where g x y = deepseq x (f x y)

Second, bitvector types get very special handling inside the interpreter, to the point where we sometimes cheat and treat bitvectors more strictly than we should, according to the language semantics (e.g., #640). I've been trying to find ways to fix the strictness bugs without sacrificing too much of the performance benefit, but without much luck so far. At any rate, I think that's why even the baseline version with the flattened accumulator performs better.

As to the REPL question, I'm seeing the same behavior if I just ask to evaluate it, which seems very odd. I'll see if I can figure out why that's happening. I wonder if it is just rerunning the previous computation.

@msaaltink
Contributor

Here's another strange example of unexpected memory usage: I have two computations that take a lot of memory. In a single run of Cryptol I can successfully execute

b == f a
c == g b

or even

b == f a /\ c == g b

(here a, b, and c are constants defined in the Cryptol source file), using about 80% of my available memory.
But in a fresh run,

c == deepseq x (g x) where x = f a

exhausts memory.

@ramsdell
Author

ramsdell commented Dec 8, 2020

We noticed performance improvements in version 2.10.0.

@robdockins
Contributor

It seems like we've made significant enough progress on this to close this bug. @msaaltink, if you're still having problems with the example you mention above, could you open a new ticket with more details?

@msaaltink
Contributor

@robdockins, I've tried that example again and performance is greatly improved. c == g (f a) uses about the same amount of memory as c == g b does, and is only 60% or so of my available memory, rather than the 80% reported above.

It does not make much difference whether or not I use deepseq.

The only remaining oddity is that while c == g (f a) uses about 60% and returns the memory when done, computing g (f a) uses about 70% and does not give it back.

@robdockins
Contributor

Glad to hear things are improved!

Maybe that's due to the value of g (f a) being bound to the it variable. If you compute something else trivial, does it eventually release that memory?

@msaaltink
Contributor

No, not after 10 trivial calculations.
